Web Crawler Cheat Sheet
Key concepts, trade-offs, and quick-reference notes for your interview prep.
BFS vs DFS Trade-off
💡 BFS with priority is the standard. DFS wastes crawl budget by going deep before going wide. Always ask: which page should we fetch next to maximize coverage?
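A best-first frontier can be sketched with a min-heap. The priority score here (a float where lower means more important, e.g. derived from link depth or a rank signal) and the `PriorityFrontier` name are illustrative assumptions, not a fixed design:

```python
import heapq

class PriorityFrontier:
    """Best-first frontier: always pop the most important URL next.

    Sketch only; assumes priority is a float where lower = more important.
    """

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so heapq never compares URL strings

    def push(self, url, priority):
        heapq.heappush(self._heap, (priority, self._seq, url))
        self._seq += 1

    def pop(self):
        _priority, _, url = heapq.heappop(self._heap)
        return url

frontier = PriorityFrontier()
frontier.push("https://example.com/deep/page", priority=5.0)
frontier.push("https://example.com/", priority=1.0)
print(frontier.pop())  # → https://example.com/  (lowest score first)
```

Equal priorities come out in insertion order, so the frontier degrades gracefully to plain BFS when every URL scores the same.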
Politeness: 1 req/sec/domain, cache robots.txt for 24 h
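One way to enforce the per-domain rate is to track the earliest allowed fetch time for each domain. `PolitenessGate` and its `wait_time` API are hypothetical names; a real crawler would also honor any robots.txt Crawl-delay:

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Enforce a minimum delay between requests to the same domain.

    Minimal sketch: 1 req/sec/domain means min_delay=1.0.
    """

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self._next_ok = {}  # domain -> earliest allowed fetch time

    def wait_time(self, url, now=None):
        """Seconds the caller must wait before fetching this URL (0 if ready)."""
        domain = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        wait = max(0.0, self._next_ok.get(domain, 0.0) - now)
        # Reserve the slot: the fetch happens at now + wait, the next one
        # is allowed min_delay after that.
        self._next_ok[domain] = now + wait + self.min_delay
        return wait
```

The `now` parameter is injectable for testing; in production callers would simply sleep for the returned duration before fetching.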
URL Dedup: Bloom Filter 18 GB vs Checksum 120 GB
💡 Bloom filter for the hot path (in-memory, 18 GB). If we need zero false positives for the final visited set, we back it with a disk-based checksum store (120 GB).
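Both figures are consistent with 15B URLs: a Bloom filter at roughly a 1% false-positive rate needs about 9.6 bits per element, and an 8-byte checksum per URL gives 120 GB. The 1% rate and 8-byte hash are assumptions that happen to reproduce the numbers:

```python
import math

def bloom_bits(n_items, fp_rate):
    """Optimal Bloom filter size in bits: m = -n * ln(p) / (ln 2)^2."""
    return -n_items * math.log(fp_rate) / (math.log(2) ** 2)

N = 15_000_000_000  # 15B URLs, matching the crawl size elsewhere in this sheet

bloom_gb = bloom_bits(N, fp_rate=0.01) / 8 / 1e9  # assumed ~1% false positives
checksum_gb = N * 8 / 1e9                         # assumed 8-byte hash per URL
print(f"bloom ≈ {bloom_gb:.0f} GB, checksums = {checksum_gb:.0f} GB")
# → bloom ≈ 18 GB, checksums = 120 GB
```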
Content Dedup: SimHash 64-bit, Near-Duplicate if < 3 Bits Differ
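A sketch of the fingerprint-and-compare flow, using MD5 as a stand-in token hash (production systems pick a faster hash and often weight tokens by frequency):

```python
import hashlib

def simhash64(tokens):
    """64-bit SimHash: similar token sets produce nearby fingerprints."""
    counts = [0] * 64
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)

def is_near_duplicate(a, b, max_diff=3):
    """Near-duplicate if the fingerprints differ in fewer than max_diff bits."""
    return bin(a ^ b).count("1") < max_diff

doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brown fox jumped over the lazy dog".split()
fp1, fp2 = simhash64(doc1), simhash64(doc2)
print(bin(fp1 ^ fp2).count("1"), is_near_duplicate(fp1, fp2))
```

Unlike a cryptographic checksum, changing one token only perturbs a few fingerprint bits, which is what makes the small-Hamming-distance threshold workable.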
DNS Cache: Top 1M Domains, Saves 50-200 ms
💡 We respect DNS TTL values. Some domains rotate IPs for load balancing. We set a max cache TTL of 24 hours to catch IP changes.
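A TTL-capped cache can be sketched as follows; `resolve_fn` is a stand-in for a real resolver and is assumed to return an `(ip, ttl_seconds)` pair:

```python
import time

class DnsCache:
    """DNS cache that honors the record's TTL, capped at 24 h.

    Sketch only. Caching skips a 50-200 ms lookup on hot domains.
    """

    MAX_TTL = 24 * 3600  # catch IP rotations even on very long TTLs

    def __init__(self, resolve_fn):
        self._resolve = resolve_fn
        self._cache = {}  # domain -> (ip, expires_at)

    def lookup(self, domain, now=None):
        now = time.monotonic() if now is None else now
        hit = self._cache.get(domain)
        if hit and hit[1] > now:
            return hit[0]
        ip, ttl = self._resolve(domain)
        self._cache[domain] = (ip, now + min(ttl, self.MAX_TTL))
        return ip
```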
URL Frontier: Front Queues + Back Queues
💡 Front queues answer 'what is most important?' Back queues answer 'when can we fetch from this domain without being rude?' Both are needed.
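A minimal sketch of the back-queue half: one queue per domain, released on a politeness timer. Front-queue prioritization is omitted here for brevity (the `PriorityFrontier` idea above would feed `push`), and all names are illustrative:

```python
import heapq
from collections import deque
from urllib.parse import urlparse

class BackQueueFrontier:
    """Per-domain queues plus a heap of 'when is this domain next fetchable'."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self._back = {}   # domain -> deque of URLs awaiting fetch
        self._ready = []  # heap of (next_ok_time, domain)

    def push(self, url, now):
        domain = urlparse(url).netloc
        if domain not in self._back:
            self._back[domain] = deque()
            heapq.heappush(self._ready, (now, domain))
        self._back[domain].append(url)

    def pop(self, now):
        """Next fetchable URL, or None if every domain is cooling down."""
        if not self._ready or self._ready[0][0] > now:
            return None
        _, domain = heapq.heappop(self._ready)
        url = self._back[domain].popleft()
        if self._back[domain]:
            # Domain re-enters the heap only after its politeness delay.
            heapq.heappush(self._ready, (now + self.min_delay, domain))
        else:
            del self._back[domain]
        return url
```

Returning `None` instead of blocking lets a fetcher thread pull work from another shard rather than idle on a single hot domain.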
Storage: 100 KB Avg Page × 15B Pages = 1.5 PB
💡 We store raw HTML, not rendered pages. Rendering requires a browser engine and 10-100x more compute. We index the raw HTML and render only when needed for specific analysis.
Crawl Rate: 15B Pages / 4 Weeks ≈ 6,200 Pages/sec
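Both headline numbers check out with quick arithmetic on the figures already in this sheet:

```python
pages = 15_000_000_000          # 15B pages
avg_page_bytes = 100 * 1000     # 100 KB average page
seconds = 4 * 7 * 24 * 3600     # 4 weeks

storage_pb = pages * avg_page_bytes / 1e15
rate = pages / seconds
print(f"{storage_pb:.1f} PB raw HTML, {rate:,.0f} pages/sec")
# → 1.5 PB raw HTML, 6,200 pages/sec
```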
Checkpointing: Snapshot Every 10 Minutes
💡 We write checkpoints to distributed storage, not local disk. A node failure should not lose the checkpoint. We keep the last 3 checkpoints for rollback safety.
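A sketch of the snapshot-and-rotate step, written against a local directory as a stand-in for distributed storage; the state dict, file naming, and `write_checkpoint` helper are all assumptions for illustration:

```python
import json
import os
import time

def write_checkpoint(state, directory, keep=3):
    """Write a timestamped checkpoint, then prune all but the newest `keep`.

    Sketch only: a real crawler would target an object store and snapshot
    the frontier plus the visited set, not an arbitrary dict.
    """
    os.makedirs(directory, exist_ok=True)
    name = f"checkpoint-{time.time_ns()}.json"
    tmp = os.path.join(directory, name + ".tmp")
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, os.path.join(directory, name))  # atomic rename: no torn files

    snapshots = sorted(p for p in os.listdir(directory) if p.endswith(".json"))
    for old in snapshots[:-keep]:  # keep the `keep` newest, delete the rest
        os.remove(os.path.join(directory, old))
```

Writing to a temp file and renaming means a crash mid-write never corrupts an existing checkpoint, which is what makes the 3-deep rollback window trustworthy.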