No Politeness Controls (IP Banned)
Very Common
Blasting hundreds of requests per second at a single domain without rate limiting results in IP bans, legal threats, and blocked crawls.
Why: Candidates focus on throughput and forget that each domain is someone else's server. They design the fetcher to maximize pages/sec globally without per-domain throttling. In testing against their own servers, it works fine. In production, hammering a small blog at 100 req/sec brings it down and gets the crawler's entire IP range blacklisted by Cloudflare.
WRONG: Fetch URLs in pure priority order with no per-domain throttling. A popular domain has 50K URLs in the frontier. The fetcher drains 200 pages/sec from that domain. The site's Nginx returns 429 Too Many Requests, then 503, then the domain's firewall blocks the crawler's IP range. All future crawls to that domain fail.
RIGHT: We enforce 1 request/sec/domain via back queues in the URL frontier. Each domain gets its own queue with a rate limiter. A fetcher thread picks a domain queue, fetches one page, waits the politeness delay, then moves to another domain. We cache and respect robots.txt Crawl-delay directives. We chose per-domain back queues (not a global rate limiter) because a global limiter cannot distinguish between 100 requests to 100 different domains (fine) and 100 requests to one domain (problematic). Trade-off: per-domain queues consume more memory, but the alternative is IP bans that halt the entire crawl.
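A minimal sketch of the back-queue idea, assuming a single-threaded scheduler and an illustrative 1-second delay (class and method names are hypothetical, not a real frontier implementation):

```python
import time
from collections import defaultdict, deque

class PoliteFrontier:
    """Sketch: per-domain back queues, each gated by a politeness delay."""
    def __init__(self, delay_sec=1.0):
        self.delay = delay_sec
        self.queues = defaultdict(deque)   # domain -> pending URLs
        self.next_ok = defaultdict(float)  # domain -> earliest allowed fetch time

    def add(self, domain, url):
        self.queues[domain].append(url)

    def pop_ready(self, now=None):
        """Return (domain, url) for a domain whose delay has elapsed, else None."""
        now = time.monotonic() if now is None else now
        for domain, q in self.queues.items():
            if q and now >= self.next_ok[domain]:
                self.next_ok[domain] = now + self.delay
                return domain, q.popleft()
        return None  # every non-empty domain is still in its cool-down window

f = PoliteFrontier(delay_sec=1.0)
f.add("blog.example", "http://blog.example/a")
f.add("blog.example", "http://blog.example/b")
f.add("news.example", "http://news.example/x")

print(f.pop_ready(now=0.0))  # ('blog.example', 'http://blog.example/a')
print(f.pop_ready(now=0.0))  # blog.example is cooling down -> news.example
print(f.pop_ready(now=0.5))  # None: both domains inside their 1s delay
print(f.pop_ready(now=1.0))  # ('blog.example', 'http://blog.example/b')
```

The key property: a domain with 50K queued URLs still gets at most one fetch per delay window, while other domains proceed in parallel.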
DFS-Only Crawling (Stuck in One Domain)
Very Common
Depth-first crawling follows one link chain deep into a single domain, starving all other domains of crawl budget and missing important pages.
Why: DFS is simpler to implement: a stack. Push new URLs, pop the latest. The crawler goes 50 levels deep into one domain before touching any other. With 15B pages to crawl in 4 weeks, spending hours on a single domain's deep pages (privacy policies, terms of service, paginated archives) wastes precious crawl budget on low-value content.
WRONG: Use a stack-based frontier. The crawler follows links from cnn.com 200 levels deep, crawling 50K pages from one domain before visiting any other seed URL. After 4 weeks, only 10% of domains have been touched. High-value pages on smaller domains are never crawled.
RIGHT: We chose BFS with priority queues (not DFS) because BFS gives broad coverage across domains while priority ordering surfaces high-value pages first. Front queues rank URLs by domain authority, freshness, and link depth. Back queues enforce per-domain fairness. The crawler alternates across thousands of domains per second. Trade-off: BFS with priority requires a more complex data structure than a stack, but the coverage gain is the entire point of a web crawler.
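The priority ordering can be sketched with a heap. The scoring formula below is purely illustrative (the real front queues would weigh domain authority, freshness, and link depth with tuned coefficients):

```python
import heapq
import itertools

def priority(domain_authority, depth):
    """Illustrative score: lower pops first; each extra hop costs 0.1."""
    return depth * 0.1 - domain_authority

counter = itertools.count()  # tie-breaker keeps heap ordering deterministic
frontier = []

def push(url, domain_authority, depth):
    heapq.heappush(frontier, (priority(domain_authority, depth), next(counter), url))

push("http://big-news.example/", domain_authority=0.9, depth=0)
push("http://big-news.example/archive/p99", domain_authority=0.9, depth=9)
push("http://small-blog.example/", domain_authority=0.4, depth=0)

order = [heapq.heappop(frontier)[2] for _ in range(len(frontier))]
print(order)  # big-news root, then small-blog root, then the deep archive page
```

Note how the shallow page on the low-authority domain outranks the deep page on the high-authority one: that is the breadth property a stack-based frontier loses.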
No URL Deduplication (Re-crawling Pages)
Very Common
Without tracking visited URLs, the crawler fetches the same page repeatedly, wasting bandwidth and crawl budget on content it already has.
Why: The web is a graph with cycles. Page A links to page B, which links to page C, which links back to A. Without dedup, the crawler loops forever. Candidates sometimes mention dedup but do not size it. With 15B URLs, a naive HashSet of full URLs costs 1.5 TB of RAM. They realize it does not fit in memory and skip the problem instead of using a Bloom filter.
WRONG: No visited-URL tracking. The crawler re-fetches the same pages every time it encounters their links. A site with 1,000 pages and 10 internal links per page generates 10K frontier entries for 1,000 unique pages. Multiply across 15B pages: the crawler does 10x the work and never finishes the crawl cycle.
RIGHT: We chose a Bloom filter (not a full-URL hash set) for the hot in-memory check: 18 GB for 15B URLs at 1% false positive rate. Before adding any URL to the frontier, we query the Bloom filter. The 1% false positives (150M URLs skipped) are acceptable because the crawl repeats every 4 weeks. We back it with a disk-based checksum store (120 GB) for exact verification when needed. Trade-off: the Bloom filter cannot delete entries, so if we need to re-crawl a specific URL, we must bypass it manually.
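A toy Bloom filter showing the mechanics, assuming salted blake2b digests stand in for k independent hash functions (the production filter would be the 18 GB structure described above, not this class):

```python
import hashlib

class BloomFilter:
    """Toy sketch: k bit positions derived from salted blake2b digests."""
    def __init__(self, size_bits, num_hashes):
        self.m, self.k = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, url):
        for i in range(self.k):
            digest = hashlib.blake2b(url.encode(), salt=str(i).encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, url):
        # False means definitely never added; True means probably added.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

# 10 bits/element with k around 7 gives roughly a 1% false-positive rate at capacity.
bf = BloomFilter(size_bits=80_000, num_hashes=7)
bf.add("http://a.example/page1")
bf.add("http://b.example/page2")
print(bf.might_contain("http://a.example/page1"))  # True: never a false negative
```

The asymmetry is the point: a negative answer is always safe to trust, so only positives ever need the exact disk-based checksum check.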
Full URLs in Visited Set (TBs of Memory)
Very Common
Storing complete URL strings in the visited set consumes 1.5 TB for 15 billion URLs, far beyond what fits in memory on any reasonable machine.
Why: The straightforward approach is a HashSet<String>. It works in development with 10K test URLs. But average URL length is 100 bytes. At 15B URLs: 15B×100B=1.5 TB. Plus Java/Go object overhead (24-40 bytes per entry), the real cost is closer to 2 TB. Even sharded across 10 machines, that is 200 GB per machine for the visited set alone.
WRONG: Store full URL strings in an in-memory hash set. Memory grows to 1.5 TB+. The set cannot fit on any single machine. Sharding adds network round-trips for every dedup check, adding 5-10ms latency per URL and slowing the entire crawl pipeline.
RIGHT: We chose a Bloom filter at 10 bits per element (not full URL strings, not checksums alone). The Bloom filter: 15B × 10 bits / 8 ≈ 18 GB, which fits on a single machine with RAM to spare. For exact verification, we back it with a 64-bit checksum store: 15B × 8 B = 120 GB on disk. The Bloom filter serves the hot path; the checksum store serves the cold path. Trade-off: a 1% false positive rate means 150M URLs are unnecessarily skipped, but re-crawling every 4 weeks makes that negligible.
DNS Lookup Every Fetch (50-200ms per Page)
Common
Resolving DNS for every single page fetch adds 50-200ms of latency per page, effectively halving the crawler's throughput.
Why: DNS resolution is invisible in normal web development because browsers and OS-level resolvers cache aggressively. But a web crawler hits millions of unique domains. The OS DNS cache has limited size (typically 500-1,000 entries). Once the cache is full, every new domain triggers a cold DNS lookup. At 6,200 pages/sec, even 100ms average DNS latency means 620 fetcher threads are blocked on DNS at any given moment.
WRONG: Rely on the OS DNS cache with its default 500-entry limit. The crawler hits 100K+ unique domains per hour. Cache misses trigger cold DNS lookups at 100-200ms each. Effective crawl rate drops from 6,200 to 3,100 pages/sec because fetcher threads spend half their time waiting for DNS.
RIGHT: We run a local DNS cache (Unbound or a custom in-process cache, not the OS resolver) holding the top 1M domains (58 MB). Cache hit rate exceeds 90%. For cache misses, we batch DNS pre-resolution: look up domains 30 seconds before fetching their pages. We chose local caching over relying on external resolvers because external resolvers rate-limit us at 1,000 queries/sec. Trade-off: we must manage cache invalidation and respect DNS TTL, but the latency savings (50-200ms per fetch) outweigh the operational cost.
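The in-process cache can be sketched as an LRU map with TTL expiry. This is a simplification under assumed parameters (a fake resolver callback stands in for the real cold lookup; Unbound would handle this internally):

```python
import time
from collections import OrderedDict

class DnsCache:
    """Sketch: TTL-aware, LRU-evicted in-process DNS cache."""
    def __init__(self, max_entries, resolver):
        self.max_entries = max_entries
        self.resolver = resolver     # domain -> (ip, ttl_seconds); the cold path
        self.cache = OrderedDict()   # domain -> (ip, expires_at)
        self.hits = self.misses = 0

    def resolve(self, domain, now=None):
        now = time.monotonic() if now is None else now
        entry = self.cache.get(domain)
        if entry and entry[1] > now:        # cached and TTL not yet expired
            self.cache.move_to_end(domain)  # refresh LRU position
            self.hits += 1
            return entry[0]
        self.misses += 1
        ip, ttl = self.resolver(domain)     # cold lookup: 50-200ms in practice
        self.cache[domain] = (ip, now + ttl)
        self.cache.move_to_end(domain)
        if len(self.cache) > self.max_entries:
            self.cache.popitem(last=False)  # evict least recently used domain
        return ip

fake_resolver = lambda domain: ("10.0.0.1", 300)
dns = DnsCache(max_entries=2, resolver=fake_resolver)
dns.resolve("a.example", now=0)  # miss
dns.resolve("a.example", now=1)  # hit
dns.resolve("b.example", now=2)  # miss
dns.resolve("c.example", now=3)  # miss -> evicts a.example (LRU)
dns.resolve("a.example", now=4)  # miss again after eviction
print(dns.hits, dns.misses)      # 1 4
```

The same structure scales to 1M entries; the eviction and TTL logic is what the OS resolver's small fixed cache fails to provide at crawler scale.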
No Content Deduplication (Duplicate Storage)
Common
Without content fingerprinting, the crawler stores duplicate and near-duplicate pages, wasting storage and polluting downstream indexes.
Why: URL dedup catches exact URL matches, but the web has massive content duplication: mirror sites, syndicated articles, printer-friendly versions, and pages that differ only in headers or ads. Studies show 30-40% of web pages are near-duplicates of another page. Without content dedup, 30% of the 1.5 PB crawl store is wasted on redundant content.
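Near-duplicate detection via locality-sensitive fingerprints can be illustrated with a SimHash-style sketch over word tokens. This is a simplification: a real pipeline would tokenize into shingles and weight terms, and the threshold would be tuned:

```python
import hashlib

def simhash(text, bits=64):
    """SimHash-style 64-bit fingerprint over whitespace-separated tokens."""
    v = [0] * bits
    for token in text.split():
        h = int.from_bytes(
            hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1  # vote per bit position
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

page = "breaking news the market rallied today on strong earnings"
near_dup = page + " reports"  # same article with one extra token
unrelated = "recipe for sourdough bread with a long cold fermentation"

# Near-duplicates tend to differ in few bits; unrelated pages tend toward ~32.
print(hamming(simhash(page), simhash(near_dup)))
print(hamming(simhash(page), simhash(unrelated)))
```

Unlike MD5 or SHA-256, where a one-token change flips roughly half the output bits, the per-bit voting keeps similar inputs close in Hamming distance, which is what makes a "fewer than 3 bits differ" rule possible.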
WRONG: Only deduplicate by URL. Two different URLs (cnn.com/article and cnn.com/amp/article) serve identical content. The crawler stores both. With 30% duplication: 0.3 × 1.5 PB = 450 TB of wasted storage. Downstream search indexing processes all duplicates, wasting compute.
RIGHT: We compute a SimHash (64-bit fingerprint) for each page's content. We chose SimHash (not MD5 or SHA-256) because SimHash is locality-sensitive: similar pages produce similar fingerprints. If fewer than 3 bits differ, we mark the page as a near-duplicate and skip storage. SimHash storage: 15B × 8 B = 120 GB. Trade-off: SimHash can produce false positives on template-heavy sites, so we run a two-phase check for borderline cases.
No Checkpointing (Crash = Full Restart)
Common
Without periodic checkpoints, a crash after 3 weeks of crawling means restarting the entire 4-week crawl from scratch.
Why: Checkpointing adds complexity: you need to serialize the URL frontier (100M+ URLs), the Bloom filter (18 GB), and the crawl configuration to durable storage. Candidates skip it because the crawl 'should not crash.' But over a 4-week run across 50 machines, hardware failures, network partitions, and OOM kills are guaranteed. Without checkpointing, the only option after a crash is to re-crawl everything from the seed URLs.
WRONG: Keep all crawler state in memory only. After 3 weeks (75% complete), a node failure corrupts the in-memory frontier. The crawler restarts from seed URLs, re-crawling 11.25B pages already fetched. The 4-week deadline becomes 7 weeks.
RIGHT: We snapshot crawler state every 10 minutes to HDFS or S3 (not local disk, which dies with the node): the URL frontier, Bloom filter, and per-domain crawl position. We chose 10-minute intervals (not 1 minute) because more frequent snapshots add I/O overhead that slows the crawl. On crash, we restore from the latest checkpoint. Maximum lost work: 10 minutes of crawling, about 3.72M pages at 6,200 pages/sec, which takes roughly 10 minutes to re-fetch. We keep the last 3 snapshots for rollback if the latest is corrupted. Trade-off: 10 minutes of lost work is acceptable for a 4-week crawl.
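The snapshot-and-rotate logic can be sketched as follows, writing JSON to a local temp directory for illustration only; the real target would be HDFS or S3, where the atomic-rename trick is replaced by the store's own write atomicity:

```python
import json
import os
import tempfile

class Checkpointer:
    """Sketch: periodic state snapshots, keeping only the newest few."""
    def __init__(self, directory, keep=3):
        self.directory = directory
        self.keep = keep
        self.seq = 0

    def _snapshots(self):
        return sorted(p for p in os.listdir(self.directory)
                      if p.startswith("checkpoint-") and p.endswith(".json"))

    def snapshot(self, state):
        self.seq += 1
        path = os.path.join(self.directory, f"checkpoint-{self.seq:06d}.json")
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)   # frontier, Bloom filter location, domain cursors
        os.replace(tmp, path)     # atomic rename: never a half-written checkpoint
        for old in self._snapshots()[:-self.keep]:
            os.remove(os.path.join(self.directory, old))
        return path

    def latest(self):
        snaps = self._snapshots()
        return os.path.join(self.directory, snaps[-1]) if snaps else None

with tempfile.TemporaryDirectory() as d:
    cp = Checkpointer(d)
    for n in range(5):
        cp.snapshot({"frontier_size": 1000 + n, "crawled": n * 3_720_000})
    with open(cp.latest()) as f:
        print(json.load(f)["frontier_size"])  # 1004: newest of the 3 kept
```

The temp-file-plus-rename step is what guarantees a crash mid-snapshot leaves the previous checkpoint intact rather than a corrupted one.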
No Trap Detection (Infinite Crawl)
Common
Without crawler trap detection, a single malicious or misconfigured site generates infinite URLs that consume the entire crawl budget.
Why: Traps are not obvious. A calendar page at /events?date=2024-01-01 links to /events?date=2024-01-02, which links to the next day, forever. Session IDs in URLs (/page?sid=abc123) create a new 'unique' URL for every visit. Intentional spider traps use JavaScript or server-side tricks to generate endless link trees. The crawler's Bloom filter sees each URL as unique and adds all of them to the frontier.
WRONG: No depth limit, no per-domain page cap. A calendar trap generates 365 URLs/page. After 3 levels: 365³ = 48.6M URLs from one domain. The frontier fills with trap URLs, starving legitimate domains. The crawler spends days on a single site.
RIGHT: We built a three-layer defense (not a single depth limit): (1) a depth limit of 15 hops from any seed URL, (2) a cap of 10K pages per domain per crawl cycle, (3) URL pattern detection that blacklists patterns generating 100+ similar URLs (e.g., /calendar?date=*). We chose three layers because each catches a different trap type: depth limits catch deep chains, domain caps catch wide traps, and pattern detection catches parameterized infinite URLs. Trade-off: we may miss legitimate deep content, but we log blacklisted patterns for human review.
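The three layers can be sketched as a single admission check over the URL stream. The thresholds come from the text; the pattern normalization (collapsing digits and query values) is an illustrative heuristic, not the actual detector:

```python
import re
from collections import Counter
from urllib.parse import urlsplit

MAX_DEPTH = 15
MAX_PAGES_PER_DOMAIN = 10_000
PATTERN_BLACKLIST_THRESHOLD = 100

pages_per_domain = Counter()
pattern_counts = Counter()

def url_pattern(url):
    """Collapse digits and query values so /cal?date=2024-01-02 and -03 match."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "N", parts.path)
    query = re.sub(r"=[^&]*", "=*", parts.query)
    return f"{parts.netloc}{path}?{query}" if query else f"{parts.netloc}{path}"

def should_crawl(url, depth):
    domain = urlsplit(url).netloc
    if depth > MAX_DEPTH:                                  # layer 1: deep chains
        return False
    if pages_per_domain[domain] >= MAX_PAGES_PER_DOMAIN:   # layer 2: wide traps
        return False
    pattern = url_pattern(url)
    pattern_counts[pattern] += 1
    if pattern_counts[pattern] > PATTERN_BLACKLIST_THRESHOLD:  # layer 3: traps
        return False                                           # via parameters
    pages_per_domain[domain] += 1
    return True

# A calendar trap: every URL normalizes to the same pattern, so the stream
# is cut off once the pattern exceeds the blacklist threshold.
allowed = sum(should_crawl(f"http://trap.example/events?day={d}", depth=2)
              for d in range(150))
print(allowed)  # 100: the remaining 50 trap URLs were rejected
```

Each layer fires on a different trap shape, which is why removing any one of them reopens an attack surface the other two cannot cover.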