Web Crawler Cheat Sheet

Key concepts, trade-offs, and quick-reference notes for your interview prep.

BFS vs DFS Trade-off

#1
We chose BFS with priority (not DFS) because DFS follows one link chain deep into a single domain, starving all other domains of crawl budget. BFS explores pages level by level: start from seed URLs, fetch all their outlinks, then fetch all those outlinks. This gives even coverage across domains and finds high-PageRank pages early (closer to root = higher authority). DFS is simpler (a stack), but it gets trapped in deep chains on one domain. We add a priority queue where each URL is scored by domain authority, freshness, and depth. Trade-off: BFS with priority requires a more complex frontier data structure (front queues + back queues), but the coverage gain is worth it.

💡 BFS with priority is the standard. DFS wastes crawl budget by going deep before going wide. Always ask: which page should we fetch next to maximize coverage?
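A minimal sketch of the priority-first pop, assuming a hypothetical `priority_score` that blends depth, domain authority, and freshness (the weights here are illustrative, not from the design):

```python
import heapq

def priority_score(depth, domain_authority, freshness):
    # Hypothetical blend: lower score = fetched sooner.
    # Shallow, authoritative, fresh URLs beat deep link chains.
    return depth - 2.0 * domain_authority - freshness

frontier = []  # min-heap of (score, url)

def push(url, depth, domain_authority=0.0, freshness=0.0):
    heapq.heappush(frontier, (priority_score(depth, domain_authority, freshness), url))

def pop():
    return heapq.heappop(frontier)[1]

# A page 6 hops deep on one domain vs. a high-authority seed root:
push("https://example.com/a/b/c/d/e/f", depth=6)
push("https://example.org/", depth=0, domain_authority=0.9)
first = pop()  # the shallow, high-authority root comes out first
```

This is the "which page next?" question made concrete: the heap always surfaces the URL that maximizes coverage, instead of whatever DFS pushed last.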

Politeness: 1 req/sec/domain, Cache robots.txt 24h

#2
We enforce 1 request per second per domain (not unlimited throughput) because hammering a domain at 100 pages/sec looks like a DDoS attack and gets our IP banned. robots.txt tells us which paths are off-limits and specifies a Crawl-delay (often 1-10 seconds). We cache robots.txt for 24 hours (not per-request) to avoid re-fetching it before every page. At 6,200 pages/sec across millions of domains, politeness is rarely the bottleneck because most seconds we hit different domains. Trade-off: respecting Crawl-delay of 10 seconds on some sites means we crawl them 10x slower, but violating it risks IP bans and legal issues.
⚠️ Pitfall: ignoring Crawl-delay directives. Some sites set Crawl-delay: 10, meaning one request every 10 seconds. Ignoring this risks IP bans and legal issues.
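The per-domain delay and the robots.txt TTL reduce to two timestamp maps; a sketch (function and variable names are ours, not from the design):

```python
import time

CRAWL_DELAY = 1.0        # seconds between requests to one domain (raise per robots.txt Crawl-delay)
ROBOTS_TTL = 24 * 3600   # re-fetch robots.txt after 24 hours

last_fetch = {}          # domain -> time of last request
robots_fetched_at = {}   # domain -> time robots.txt was last fetched

def may_fetch(domain, now=None):
    # True once the per-domain politeness delay has elapsed.
    now = time.monotonic() if now is None else now
    return now - last_fetch.get(domain, float("-inf")) >= CRAWL_DELAY

def record_fetch(domain, now=None):
    last_fetch[domain] = time.monotonic() if now is None else now

def robots_stale(domain, now=None):
    # True if the cached robots.txt is older than 24 hours (or missing).
    now = time.monotonic() if now is None else now
    return now - robots_fetched_at.get(domain, float("-inf")) >= ROBOTS_TTL
```

A fetcher would check `robots_stale` before `may_fetch`; for a domain with `Crawl-delay: 10`, the delay constant becomes 10 seconds for that domain.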

URL Dedup: Bloom Filter 18 GB vs Checksum 120 GB

#3
We chose a Bloom filter (not a full-URL hash set) for in-memory URL dedup because it fits the entire 15B URL space in about 18 GB. The alternatives: storing the full URL (avg 100 bytes) costs 15B × 100 B = 1.5 TB. Using a checksum (8-byte hash per URL): 15B × 8 B = 120 GB. The Bloom filter at 10 bits per element and 1% false positive rate: 15B × 10 bits / 8 = 18.75 GB. It fits in RAM on a single machine. Trade-off: the 1% false positive rate means about 150M URLs are skipped unnecessarily per cycle, but that is acceptable when we re-crawl every 4 weeks anyway.

💡 Bloom filter for the hot path (in-memory, 18 GB). If we need zero false positives for the final visited set, we back it with a disk-based checksum store (120 GB).
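A toy Bloom filter at the stated 10 bits per element, slicing one SHA-256 digest into k bit positions — a sketch for the interview, not production code (real crawlers use tuned bit arrays):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, k=7):
        # At 10 bits/element, k≈7 hash positions gives roughly a 1% false positive rate.
        self.num_bits = num_bits
        self.k = k
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url):
        # Slice one SHA-256 digest into k 4-byte chunks -> k bit positions.
        digest = hashlib.sha256(url.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.num_bits

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

seen = BloomFilter(num_bits=10_000)  # 15B URLs x 10 bits = 150 Gbits ≈ 18.75 GB
seen.add("https://example.com/")
```

The `in` check can return a false positive (a never-crawled URL skipped) but never a false negative, which is the safe direction for dedup.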

Content Dedup: SimHash 64-bit, < 3 Bit Diff

#4
We chose SimHash (not MD5 or SHA-256) for content deduplication because SimHash is a locality-sensitive hash: similar pages produce similar fingerprints. MD5 is a cryptographic hash where changing one character produces a completely different hash, so it misses near-duplicates like mirror sites and printer-friendly versions. SimHash generates a 64-bit fingerprint from page content. Two pages with fewer than 3 bits different in their SimHash are considered near-duplicates. Storage: 15B × 8 B = 120 GB for all fingerprints. Trade-off: SimHash can have false positives on template-heavy sites, but a two-phase check (SimHash then exact comparison) catches those.
⚠️ Pitfall: using MD5 or SHA-256 for content dedup. These are cryptographic hashes: a single character difference produces a completely different hash. We miss near-duplicates like printer-friendly versions or pages with different ads.
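A compact SimHash sketch over whitespace tokens. MD5 appears here only as a cheap, stable per-token hash for illustration — not as the content hash the pitfall above warns against:

```python
import hashlib

def simhash(text, bits=64):
    # Each token votes +1/-1 per bit position; the sign of each
    # column becomes one bit of the 64-bit fingerprint.
    v = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

def near_duplicate(a, b, threshold=3):
    # Fewer than 3 differing bits -> treat as near-duplicates.
    return hamming(simhash(a), simhash(b)) < threshold
```

In the two-phase check, pairs flagged by `near_duplicate` would then go through an exact comparison to weed out template-heavy false positives.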

DNS Cache: Top 1M Domains, Saves 50-200ms

#5
We run a local DNS cache (not relying on the OS resolver) because the OS cache holds only 500-1,000 entries, but our crawler hits millions of unique domains. A cold DNS lookup takes 50-200ms. At 6,200 pages/sec, spending 100ms on DNS per fetch wastes 6,200 × 0.1 s = 620 seconds of fetcher time per second. That is 620 fetcher threads blocked on DNS alone. We cache the top 1 million domains (covers 90%+ of crawled pages). Cache entry size: domain (50B) + IP (4B) + TTL (4B) = 58B. Total: 1M × 58 B = 58 MB. Trade-off: we must respect DNS TTL values because some domains rotate IPs for load balancing, so we set a max cache TTL of 24 hours.

💡 We respect DNS TTL values. Some domains rotate IPs for load balancing. We set a max cache TTL of 24 hours to catch IP changes.
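A TTL-capped cache entry, sketched (the class and method names are ours; eviction down to the top 1M domains is omitted):

```python
MAX_TTL = 24 * 3600  # cap: catch IP rotations even when the DNS record's TTL is longer

class DnsCache:
    def __init__(self):
        self.entries = {}  # domain -> (ip, expires_at)

    def put(self, domain, ip, ttl, now):
        # Respect the record's TTL, but never cache past 24 hours.
        self.entries[domain] = (ip, now + min(ttl, MAX_TTL))

    def get(self, domain, now):
        hit = self.entries.get(domain)
        if hit and now < hit[1]:
            return hit[0]
        return None  # miss or expired: caller pays the 50-200 ms real lookup

cache = DnsCache()
cache.put("example.com", "93.184.216.34", ttl=7 * 24 * 3600, now=0)
```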

URL Frontier: Front Queues + Back Queues

#6
We designed the URL frontier as a two-layer structure (not a single priority queue) because a single queue cannot enforce both priority and politeness simultaneously. Front queues (priority): multiple queues sorted by priority (PageRank, freshness, domain authority). A prioritizer assigns each URL to the right queue. Higher-priority queues are polled more often. Back queues (politeness): one queue per domain. A fetcher thread is assigned to one back queue at a time and respects the per-domain rate limit. The front-to-back mapping routes URLs from priority queues to the correct domain queue. Trade-off: the two-layer structure is more complex to implement than a single queue, but it prevents priority-based crawling from violating politeness constraints.

💡 Front queues answer 'what is most important?' Back queues answer 'when can we fetch from this domain without being rude?' Both are needed.
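The two layers reduce to a small routing structure; a sketch (class and method names are ours, and the politeness delay itself lives with the fetcher threads):

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

class Frontier:
    """Two-layer frontier: front queues by priority, back queues by domain."""

    def __init__(self, levels=3):
        self.front = [deque() for _ in range(levels)]  # index 0 = highest priority
        self.back = defaultdict(deque)                 # domain -> FIFO of its URLs

    def add(self, url, priority):
        # The prioritizer has already scored the URL into a front-queue level.
        self.front[priority].append(url)

    def route_one(self):
        # Front-to-back mapping: move one URL from the highest non-empty
        # priority queue into its domain's politeness queue.
        for q in self.front:
            if q:
                url = q.popleft()
                self.back[urlparse(url).netloc].append(url)
                return url
        return None
```

Each fetcher thread then owns one back queue at a time and waits out the per-domain delay before popping, so priority never overrides politeness.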

Storage: 100KB Avg Page, 15B Pages = 1.5 PB

#7
We chose distributed blob storage (HDFS or S3, not a traditional database) because 1.5 PB of page data is far beyond what any relational database can hold. An average web page (HTML + text content) is about 100 KB after stripping images and scripts. With 15 billion pages: 15B × 100 KB = 1.5 PB. We compress with gzip (typical 5:1 ratio): 1.5 PB / 5 = 300 TB compressed. We store metadata (URL, timestamp, SimHash, status code) separately in a database and page content in blob storage keyed by URL hash. Trade-off: blob storage has higher read latency than a database, but downstream pipelines batch-process pages anyway.

💡 We store raw HTML, not rendered pages. Rendering requires a browser engine and 10-100x more compute. We index the raw HTML and render only when needed for specific analysis.
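The sizing arithmetic and the URL-hash key scheme side by side (SI units assumed, 1 PB = 10^15 B; `blob_key` is an illustrative name):

```python
import hashlib

PAGES = 15 * 10**9
AVG_PAGE_BYTES = 100 * 10**3            # 100 KB after stripping images/scripts
GZIP_RATIO = 5                          # typical 5:1 compression

raw_bytes = PAGES * AVG_PAGE_BYTES      # 1.5e15 B = 1.5 PB raw
stored_bytes = raw_bytes // GZIP_RATIO  # 3.0e14 B = 300 TB after gzip

def blob_key(url):
    # Page content goes to blob storage under the URL hash;
    # metadata (URL, timestamp, SimHash, status) lives in a DB row.
    return hashlib.sha256(url.encode()).hexdigest()
```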

Crawl Rate: 15B / 4 Weeks = 6,200 Pages/sec

#8
We target 15 billion pages in 4 weeks. The math: 15B / (4 × 7 × 24 × 3,600) = 15B / 2,419,200 s ≈ 6,200 pages/sec. Each fetch takes about 200-500ms (DNS + TCP + HTTP + download). A single fetcher thread handles 1,000 ms / 300 ms avg ≈ 3.3 pages/sec. Required threads: 6,200 / 3.3 ≈ 1,880 threads. We spread this across 50 machines: 38 fetcher threads per machine. Network bandwidth: 6,200 × 100 KB = 620 MB/sec incoming. That is about 5 Gbps sustained, well within a data center's capacity. Trade-off: we size for 50 machines (not fewer, bigger machines) to distribute DNS and politeness state across more endpoints.
⚠️ Pitfall: not accounting for failed fetches, retries, and timeouts. We assume 20% overhead: target 7,440 fetches/sec to net 6,200 successful pages/sec.
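The capacity numbers above, reproduced as arithmetic (the inputs are the section's; the rounding is ours — exact division gives ~1,860 threads, which the section rounds via 3.3 pages/sec to ~1,880):

```python
PAGES = 15 * 10**9
WINDOW_S = 4 * 7 * 24 * 3600           # 2,419,200 seconds in 4 weeks

target_rate = PAGES / WINDOW_S          # ≈ 6,200 successful pages/sec
per_thread = 1.0 / 0.3                  # ≈ 3.3 pages/sec at 300 ms per fetch
threads = target_rate / per_thread      # ≈ 1,860-1,880 fetcher threads
per_machine = threads / 50              # ≈ 37-38 threads on each of 50 machines

bandwidth = target_rate * 100 * 10**3   # ≈ 620 MB/s ≈ 5 Gbps incoming
gross_rate = target_rate * 1.2          # ≈ 7,440 fetches/sec with 20% retry overhead
```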

Checkpointing: Snapshot Every 10 Minutes

#9
We chose to checkpoint every 10 minutes to HDFS/S3 (not local disk, not every 1 minute) because a 4-week crawl cannot restart from scratch after a crash. We snapshot the URL frontier (pending URLs), the visited set (Bloom filter or checksum store), and the current crawl position. At 6,200 pages/sec, 10 minutes of lost work = 3.72 million pages to re-crawl. That is 3.72M / 6,200 = 600 seconds (10 minutes) to recover. Acceptable. Trade-off: checkpointing every 1 minute would reduce loss to 372K pages but adds I/O overhead that slows the crawl pipeline. We keep the last 3 checkpoints for rollback safety in case the latest is corrupted.

💡 We write checkpoints to distributed storage, not local disk. A node failure should not lose the checkpoint. We keep the last 3 checkpoints for rollback safety.
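The keep-last-3 rotation is a few lines; a sketch (names are ours — the actual upload to and deletion from HDFS/S3 is left to the caller):

```python
from collections import deque

class CheckpointRotation:
    """Track the newest N checkpoint names; the caller uploads each snapshot
    to HDFS/S3 and deletes whatever this returns as evicted."""

    def __init__(self, keep=3):
        self.keep = keep
        self.names = deque()

    def add(self, name):
        self.names.append(name)
        evicted = []
        while len(self.names) > self.keep:
            evicted.append(self.names.popleft())
        return evicted  # old checkpoints safe to delete from remote storage

rotation = CheckpointRotation(keep=3)
```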

Crawler Traps: Depth Limit 15, Max 10K Pages/Domain

#10
We defend against crawler traps with a three-layer approach (not a single depth limit alone) because traps come in multiple forms: calendar pages with endless date links, session IDs in URLs, and intentional spider traps. Without protection, a single trap domain consumes the entire crawl budget. Our defenses: (1) depth limit of 15: we stop following links more than 15 hops from any seed URL, since most useful pages are within 5-6 hops. (2) Max 10K pages per domain: we cap the number of pages crawled from any single domain per cycle. (3) URL pattern detection: if more than 100 URLs match the same regex pattern (e.g., /calendar?date=YYYY-MM-DD), we blacklist the pattern. Trade-off: we may miss legitimate deep content (wikis with 20+ hops), but protecting the crawl budget outweighs that risk.
⚠️ Pitfall: no depth limit at all. A calendar page with date links generates 365 new URLs per page, each linking to 365 more. After 3 levels: 365³ ≈ 48.6M URLs from one domain.
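The three defenses compose into one admission check; a sketch (collapsing digit runs with a regex is our simplification of "URL pattern detection"):

```python
import re
from collections import Counter
from urllib.parse import urlparse

MAX_DEPTH = 15
MAX_PAGES_PER_DOMAIN = 10_000
MAX_PER_PATTERN = 100

pages_per_domain = Counter()
urls_per_pattern = Counter()

def url_pattern(url):
    # Collapse digit runs so /calendar?date=2024-01-02 and -03 share a pattern.
    return re.sub(r"\d+", "N", url)

def should_crawl(url, depth):
    domain = urlparse(url).netloc
    if depth > MAX_DEPTH:
        return False                                      # defense 1: depth limit
    if pages_per_domain[domain] >= MAX_PAGES_PER_DOMAIN:
        return False                                      # defense 2: per-domain cap
    pattern = url_pattern(url)
    if urls_per_pattern[pattern] >= MAX_PER_PATTERN:
        return False                                      # defense 3: trap pattern blacklist
    pages_per_domain[domain] += 1
    urls_per_pattern[pattern] += 1
    return True
```

After 100 calendar URLs collapse to the same pattern, the 101st is rejected outright, so the trap never consumes more than a sliver of crawl budget.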