Tricky walkthrough
Crawler Traps
The constraint: some websites generate infinite URL spaces that can occupy a crawler indefinitely, and we have no way of knowing in advance which sites contain them. A calendar page generates links for every future date: /calendar/2025/03/15, /calendar/2025/03/16, continuing forever.
Symbolic links on file servers create directory loops where /a/b/c/a/b/c repeats endlessly. These spider traps waste bandwidth, storage, and crawler time.
Session IDs appended to URLs create unique addresses for identical content: /page?sid=abc123 and /page?sid=def456 serve the same HTML but look like different pages to our Bloom filter.
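One common mitigation is URL canonicalization before the Bloom-filter check: strip known session-style parameters and sort the rest, so both variants collapse to the same key. A minimal sketch, assuming a hand-picked list of session parameter names (real crawlers curate or learn these per site):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed list of session-style query parameters to strip.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def canonicalize(url: str) -> str:
    """Drop session-ID params and sort the rest, so /page?sid=abc123
    and /page?sid=def456 map to the same canonical URL."""
    parts = urlsplit(url)
    kept = sorted(
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in SESSION_PARAMS
    )
    # Rebuild without the fragment; fragments never reach the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))

print(canonicalize("http://example.com/page?sid=abc123"))
# → http://example.com/page
```

The canonical form, not the raw URL, is what gets inserted into the Bloom filter.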
Without detection, a single trap domain can consume an entire crawler node's capacity for days. We defend against traps using multiple heuristics working together because no single rule catches all trap types.
A URL depth limit of 15 path segments catches deep loops. Pattern detection identifies repeated path segments like /a/b/a/b. A per-domain cap of 10,000 pages prevents any single site from consuming excessive resources.
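The three heuristics above compose into a single gate applied before a URL enters the frontier. A sketch under the stated limits (15 segments, 10,000 pages per domain); the function name and in-memory counter are illustrative, and the whitelist of known large domains is omitted:

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_DEPTH = 15                 # path-segment limit from the text
MAX_PAGES_PER_DOMAIN = 10_000  # per-domain cap from the text
domain_counts = Counter()      # pages fetched per domain so far

def looks_like_trap(url: str) -> bool:
    """Apply the three heuristics in order; any one rejects the URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # 1. Depth limit: more than 15 path segments suggests a deep loop.
    if len(segments) > MAX_DEPTH:
        return True
    # 2. Cycle detection: the path ends with the same segment sequence
    #    twice in a row, e.g. /a/b/a/b or /a/b/c/a/b/c.
    for size in range(1, len(segments) // 2 + 1):
        if segments[-size:] == segments[-2 * size:-size]:
            return True
    # 3. Per-domain cap (whitelist of large domains omitted here).
    return domain_counts[parts.netloc] >= MAX_PAGES_PER_DOMAIN

print(looks_like_trap("http://files.example.com/a/b/c/a/b/c"))   # True
print(looks_like_trap("http://example.com/calendar/2025/03/15"))  # False
```

Note that the calendar trap passes these static checks; it is caught by the per-domain cap only after the domain burns through its budget, which is why the cap is the backstop for trap types the other rules miss.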
We chose a 10,000-page cap (not 100,000) because even large legitimate sites like wikipedia.org have most of their link value concentrated in the top 10,000 most-linked pages. Trade-off: we gave up completeness on very large sites in exchange for protection against traps, and we whitelist known large domains that legitimately exceed the cap.
Google maintains a manually curated blacklist of known trap domains alongside automated detection. Cloudflare's Bot Management documentation reveals that some sites intentionally deploy traps (honeypots) to identify and block crawlers.
What if the interviewer asks: 'How do we distinguish a legitimate deep site from a trap?' We compare the SimHash fingerprints of pages at increasing depths. In a trap, pages at depth 15 and depth 50 have near-identical content (same template, different parameters).
In a legitimate deep site, content diversity increases with depth.
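The depth-comparison idea can be sketched with a toy SimHash: each token votes on every fingerprint bit, and near-duplicate pages end up within a few bits of each other in Hamming distance. This uses MD5 as a stand-in token hash, and the 3-bit near-duplicate threshold and function names are assumptions for illustration:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Toy SimHash: each token votes +1/-1 per bit of its hash; the
    sign of each bit's running total becomes that fingerprint bit."""
    v = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing fingerprint bits."""
    return bin(a ^ b).count("1")

NEAR_DUP_BITS = 3  # assumed threshold: within 3 bits = same template

def probably_trap(page_shallow: str, page_deep: str) -> bool:
    """Near-identical content at very different depths signals a trap."""
    return hamming(simhash(page_shallow), simhash(page_deep)) <= NEAR_DUP_BITS
```

In practice the crawler would sample rendered page text at, say, depth 15 and depth 50 of the same domain and feed both samples to this comparison; rising Hamming distance with depth suggests a legitimate deep site.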
Related concepts