Web Crawler Cheat Sheet

Key concepts, trade-offs, and quick-reference notes for your interview prep.

BFS vs DFS Trade-off

#1
We chose BFS with priority (not DFS) because DFS follows one link chain deep into a single domain, starving all other domains of crawl budget. BFS explores pages level by level: start from seed URLs, fetch all their outlinks, then fetch all those outlinks. This gives even coverage across domains and finds high-PageRank pages early (closer to root = higher authority). DFS is simpler (a stack), but it gets trapped in deep chains on one domain. We add a priority queue where each URL is scored by domain authority, freshness, and depth. Trade-off: BFS with priority requires a more complex frontier data structure (front queues + back queues), but the coverage gain is worth it.

💡 BFS with priority is the standard. DFS wastes crawl budget by going deep before going wide. Always ask: which page should we fetch next to maximize coverage?
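A minimal sketch of the priority-first pop, assuming a hypothetical `priority_score` that blends depth, domain authority, and freshness (the weights here are illustrative, not from the design):

```python
import heapq

def priority_score(depth, domain_authority, freshness):
    # Hypothetical blend: lower score = fetched sooner.
    # Shallow, authoritative, fresh URLs beat deep link chains.
    return depth - 2.0 * domain_authority - freshness

frontier = []  # min-heap of (score, url)

def push(url, depth, domain_authority=0.0, freshness=0.0):
    heapq.heappush(frontier, (priority_score(depth, domain_authority, freshness), url))

def pop():
    return heapq.heappop(frontier)[1]

# A page 6 hops deep on one domain vs. a high-authority seed root:
push("https://example.com/a/b/c/d/e/f", depth=6)
push("https://example.org/", depth=0, domain_authority=0.9)
first = pop()  # the shallow, high-authority root comes out first
```

This is the "which page next?" question made concrete: the heap always surfaces the URL that maximizes coverage, instead of whatever DFS pushed last.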

Politeness: 1 req/sec/domain, Cache robots.txt 24h

#2
We enforce 1 request per second per domain (not unlimited throughput) because hammering a domain at 100 pages/sec looks like a DDoS attack and gets our IP banned. robots.txt tells us which paths are off-limits and specifies a Crawl-delay (often 1-10 seconds). We cache robots.txt for 24 hours (not per-request) to avoid re-fetching it before every page. At 6,200 pages/sec across millions of domains, politeness is rarely the bottleneck because most seconds we hit different domains. Trade-off: respecting Crawl-delay of 10 seconds on some sites means we crawl them 10x slower, but violating it risks IP bans and legal issues.
⚠️ Pitfall: ignoring Crawl-delay directives. Some sites set Crawl-delay: 10, meaning one request every 10 seconds. Ignoring this risks IP bans and legal issues.
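The per-domain delay and the robots.txt TTL reduce to two timestamp maps; a sketch (function and variable names are ours, not from the design):

```python
import time

CRAWL_DELAY = 1.0        # seconds between requests to one domain (raise per robots.txt Crawl-delay)
ROBOTS_TTL = 24 * 3600   # re-fetch robots.txt after 24 hours

last_fetch = {}          # domain -> time of last request
robots_fetched_at = {}   # domain -> time robots.txt was last fetched

def may_fetch(domain, now=None):
    # True once the per-domain politeness delay has elapsed.
    now = time.monotonic() if now is None else now
    return now - last_fetch.get(domain, float("-inf")) >= CRAWL_DELAY

def record_fetch(domain, now=None):
    last_fetch[domain] = time.monotonic() if now is None else now

def robots_stale(domain, now=None):
    # True if the cached robots.txt is older than 24 hours (or missing).
    now = time.monotonic() if now is None else now
    return now - robots_fetched_at.get(domain, float("-inf")) >= ROBOTS_TTL
```

A fetcher would check `robots_stale` before `may_fetch`; for a domain with `Crawl-delay: 10`, the delay constant becomes 10 seconds for that domain.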

URL Dedup: Bloom Filter 18 GB vs Checksum 120 GB

#3
We chose a Bloom filter (not a full-URL hash set) for in-memory URL dedup because it fits the entire 15B URL space in about 18 GB. The alternatives: storing the full URL (avg 100 bytes) costs 15B × 100 B = 1.5 TB. Using a checksum (8-byte hash per URL): 15B × 8 B = 120 GB. The Bloom filter at 10 bits per element and 1% false positive rate: 15B × 10 bits / 8 = 18.75 GB. It fits in RAM on a single machine. Trade-off: the 1% false positive rate means about 150M URLs are skipped unnecessarily per cycle, but that is acceptable when we re-crawl every 4 weeks anyway.

💡 Bloom filter for the hot path (in-memory, 18 GB). If we need zero false positives for the final visited set, we back it with a disk-based checksum store (120 GB).
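A toy Bloom filter at the stated 10 bits per element, slicing one SHA-256 digest into k bit positions — a sketch for the interview, not production code (real crawlers use tuned bit arrays):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, k=7):
        # At 10 bits/element, k≈7 hash positions gives roughly a 1% false positive rate.
        self.num_bits = num_bits
        self.k = k
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url):
        # Slice one SHA-256 digest into k 4-byte chunks -> k bit positions.
        digest = hashlib.sha256(url.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.num_bits

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

seen = BloomFilter(num_bits=10_000)  # 15B URLs x 10 bits = 150 Gbits ≈ 18.75 GB
seen.add("https://example.com/")
```

The `in` check can return a false positive (a never-crawled URL skipped) but never a false negative, which is the safe direction for dedup.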

Content Dedup: SimHash 64-bit, < 3 Bit Diff

#4
We chose SimHash (not MD5 or SHA-256) for content deduplication because SimHash is a locality-sensitive hash: similar pages produce similar fingerprints. MD5 is a cryptographic hash where changing one character produces a completely different hash, so it misses near-duplicates like mirror sites and printer-friendly versions. SimHash generates a 64-bit fingerprint from page content. Two pages with fewer than 3 bits different in their SimHash are considered near-duplicates. Storage: 15B × 8 B = 120 GB for all fingerprints. Trade-off: SimHash can have false positives on template-heavy sites, but a two-phase check (SimHash then exact comparison) catches those.
⚠️ Pitfall: using MD5 or SHA-256 for content dedup. These are cryptographic hashes: a single character difference produces a completely different hash. We miss near-duplicates like printer-friendly versions or pages with different ads.
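A compact SimHash sketch over whitespace tokens. MD5 appears here only as a cheap, stable per-token hash for illustration — not as the content hash the pitfall above warns against:

```python
import hashlib

def simhash(text, bits=64):
    # Each token votes +1/-1 per bit position; the sign of each
    # column becomes one bit of the 64-bit fingerprint.
    v = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

def near_duplicate(a, b, threshold=3):
    # Fewer than 3 differing bits -> treat as near-duplicates.
    return hamming(simhash(a), simhash(b)) < threshold
```

In the two-phase check, pairs flagged by `near_duplicate` would then go through an exact comparison to weed out template-heavy false positives.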

DNS Cache: Top 1M Domains, Saves 50-200ms

#5
We run a local DNS cache (not relying on the OS resolver) because the OS cache holds only 500-1,000 entries, but our crawler hits millions of unique domains. A cold DNS lookup takes 50-200ms. At 6,200 pages/sec, spending 100ms on DNS per fetch wastes 6,200 × 0.1 s = 620 seconds of fetcher time per second. That is 620 fetcher threads blocked on DNS alone. We cache the top 1 million domains (covers 90%+ of crawled pages). Cache entry size: domain (50B) + IP (4B) + TTL (4B) = 58B. Total: 1M × 58 B = 58 MB. Trade-off: we must respect DNS TTL values because some domains rotate IPs for load balancing, so we set a max cache TTL of 24 hours.

💡 We respect DNS TTL values. Some domains rotate IPs for load balancing. We set a max cache TTL of 24 hours to catch IP changes.
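A TTL-capped cache entry, sketched (the class and method names are ours; eviction down to the top 1M domains is omitted):

```python
MAX_TTL = 24 * 3600  # cap: catch IP rotations even when the DNS record's TTL is longer

class DnsCache:
    def __init__(self):
        self.entries = {}  # domain -> (ip, expires_at)

    def put(self, domain, ip, ttl, now):
        # Respect the record's TTL, but never cache past 24 hours.
        self.entries[domain] = (ip, now + min(ttl, MAX_TTL))

    def get(self, domain, now):
        hit = self.entries.get(domain)
        if hit and now < hit[1]:
            return hit[0]
        return None  # miss or expired: caller pays the 50-200 ms real lookup

cache = DnsCache()
cache.put("example.com", "93.184.216.34", ttl=7 * 24 * 3600, now=0)
```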

URL Frontier: Front Queues + Back Queues

#6
We designed the URL frontier as a two-layer structure (not a single priority queue) because a single queue cannot enforce both priority and politeness simultaneously. Front queues (priority): multiple queues sorted by priority (PageRank, freshness, domain authority). A prioritizer assigns each URL to the right queue. Higher-priority queues are polled more often. Back queues (politeness): one queue per domain. A fetcher thread is assigned to one back queue at a time and respects the per-domain rate limit. The front-to-back mapping routes URLs from priority queues to the correct domain queue. Trade-off: the two-layer structure is more complex to implement than a single queue, but it prevents priority-based crawling from violating politeness constraints.

💡 Front queues answer 'what is most important?' Back queues answer 'when can we fetch from this domain without being rude?' Both are needed.
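The two layers reduce to a small routing structure; a sketch (class and method names are ours, and the politeness delay itself lives with the fetcher threads):

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

class Frontier:
    """Two-layer frontier: front queues by priority, back queues by domain."""

    def __init__(self, levels=3):
        self.front = [deque() for _ in range(levels)]  # index 0 = highest priority
        self.back = defaultdict(deque)                 # domain -> FIFO of its URLs

    def add(self, url, priority):
        # The prioritizer has already scored the URL into a front-queue level.
        self.front[priority].append(url)

    def route_one(self):
        # Front-to-back mapping: move one URL from the highest non-empty
        # priority queue into its domain's politeness queue.
        for q in self.front:
            if q:
                url = q.popleft()
                self.back[urlparse(url).netloc].append(url)
                return url
        return None
```

Each fetcher thread then owns one back queue at a time and waits out the per-domain delay before popping, so priority never overrides politeness.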

Storage: 100KB Avg Page, 15B Pages = 1.5 PB

#7
We chose distributed blob storage (HDFS or S3, not a traditional database) because 1.5 PB of page data is far beyond what any relational database can hold. An average web page (HTML + text content) is about 100 KB after stripping images and scripts. With 15 billion pages: 15B × 100 KB = 1.5 PB. We compress with gzip (typical 5:1 ratio): 1.5 PB / 5 = 300 TB compressed. We store metadata (URL, timestamp, SimHash, status code) separately in a database and page content in blob storage keyed by URL hash. Trade-off: blob storage has higher read latency than a database, but downstream pipelines batch-process pages anyway.

💡 We store raw HTML, not rendered pages. Rendering requires a browser engine and 10-100x more compute. We index the raw HTML and render only when needed for specific analysis.
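The sizing arithmetic and the URL-hash key scheme side by side (SI units assumed, 1 PB = 10^15 B; `blob_key` is an illustrative name):

```python
import hashlib

PAGES = 15 * 10**9
AVG_PAGE_BYTES = 100 * 10**3            # 100 KB after stripping images/scripts
GZIP_RATIO = 5                          # typical 5:1 compression

raw_bytes = PAGES * AVG_PAGE_BYTES      # 1.5e15 B = 1.5 PB raw
stored_bytes = raw_bytes // GZIP_RATIO  # 3.0e14 B = 300 TB after gzip

def blob_key(url):
    # Page content goes to blob storage under the URL hash;
    # metadata (URL, timestamp, SimHash, status) lives in a DB row.
    return hashlib.sha256(url.encode()).hexdigest()
```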

Crawl Rate: 15B / 4 Weeks = 6,200 Pages/sec

#8
We target 15 billion pages in 4 weeks. The math: 15B / (4 × 7 × 24 × 3,600) = 15B / 2,419,200 s ≈ 6,200 pages/sec. Each fetch takes about 200-500ms (DNS + TCP + HTTP + download). A single fetcher thread handles 1,000 ms / 300 ms avg ≈ 3.3 pages/sec. Required threads: 6,200 / 3.3 ≈ 1,880 threads. We spread this across 50 machines: 38 fetcher threads per machine. Network bandwidth: 6,200 × 100 KB = 620 MB/sec incoming. That is about 5 Gbps sustained, well within a data center's capacity. Trade-off: we size for 50 machines (not fewer, bigger machines) to distribute DNS and politeness state across more endpoints.
⚠️ Pitfall: not accounting for failed fetches, retries, and timeouts. We assume 20% overhead: target 7,440 fetches/sec to net 6,200 successful pages/sec.
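The capacity numbers above, reproduced as arithmetic (the inputs are the section's; the rounding is ours — exact division gives ~1,860 threads, which the section rounds via 3.3 pages/sec to ~1,880):

```python
PAGES = 15 * 10**9
WINDOW_S = 4 * 7 * 24 * 3600           # 2,419,200 seconds in 4 weeks

target_rate = PAGES / WINDOW_S          # ≈ 6,200 successful pages/sec
per_thread = 1.0 / 0.3                  # ≈ 3.3 pages/sec at 300 ms per fetch
threads = target_rate / per_thread      # ≈ 1,860-1,880 fetcher threads
per_machine = threads / 50              # ≈ 37-38 threads on each of 50 machines

bandwidth = target_rate * 100 * 10**3   # ≈ 620 MB/s ≈ 5 Gbps incoming
gross_rate = target_rate * 1.2          # ≈ 7,440 fetches/sec with 20% retry overhead
```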

Checkpointing: Snapshot Every 10 Minutes

#9
We chose to checkpoint every 10 minutes to HDFS/S3 (not local disk, not every 1 minute) because a 4-week crawl cannot restart from scratch after a crash. We snapshot the URL frontier (pending URLs), the visited set (Bloom filter or checksum store), and the current crawl position. At 6,200 pages/sec, 10 minutes of lost work = 3.72 million pages to re-crawl. That is 3.72M / 6,200 = 600 seconds (10 minutes) to recover. Acceptable. Trade-off: checkpointing every 1 minute would reduce loss to 372K pages but adds I/O overhead that slows the crawl pipeline. We keep the last 3 checkpoints for rollback safety in case the latest is corrupted.

💡 We write checkpoints to distributed storage, not local disk. A node failure should not lose the checkpoint. We keep the last 3 checkpoints for rollback safety.
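The keep-last-3 rotation is a few lines; a sketch (names are ours — the actual upload to and deletion from HDFS/S3 is left to the caller):

```python
from collections import deque

class CheckpointRotation:
    """Track the newest N checkpoint names; the caller uploads each snapshot
    to HDFS/S3 and deletes whatever this returns as evicted."""

    def __init__(self, keep=3):
        self.keep = keep
        self.names = deque()

    def add(self, name):
        self.names.append(name)
        evicted = []
        while len(self.names) > self.keep:
            evicted.append(self.names.popleft())
        return evicted  # old checkpoints safe to delete from remote storage

rotation = CheckpointRotation(keep=3)
```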

Crawler Traps: Depth Limit 15, Max 10K Pages/Domain

#10
We defend against crawler traps with a three-layer approach (not a single depth limit alone) because traps come in multiple forms: calendar pages with endless date links, session IDs in URLs, and intentional spider traps. Without protection, a single trap domain consumes the entire crawl budget. Our defenses: (1) depth limit of 15: we stop following links more than 15 hops from any seed URL, since most useful pages are within 5-6 hops. (2) Max 10K pages per domain: we cap the number of pages crawled from any single domain per cycle. (3) URL pattern detection: if more than 100 URLs match the same regex pattern (e.g., /calendar?date=YYYY-MM-DD), we blacklist the pattern. Trade-off: we may miss legitimate deep content (wikis with 20+ hops), but protecting the crawl budget outweighs that risk.
⚠️ Pitfall: no depth limit at all. A calendar page with date links generates 365 new URLs per page, each linking to 365 more. After 3 levels: 365³ ≈ 48.6M URLs from one domain.
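The three defenses compose into one admission check; a sketch (collapsing digit runs with a regex is our simplification of "URL pattern detection"):

```python
import re
from collections import Counter
from urllib.parse import urlparse

MAX_DEPTH = 15
MAX_PAGES_PER_DOMAIN = 10_000
MAX_PER_PATTERN = 100

pages_per_domain = Counter()
urls_per_pattern = Counter()

def url_pattern(url):
    # Collapse digit runs so /calendar?date=2024-01-02 and -03 share a pattern.
    return re.sub(r"\d+", "N", url)

def should_crawl(url, depth):
    domain = urlparse(url).netloc
    if depth > MAX_DEPTH:
        return False                                      # defense 1: depth limit
    if pages_per_domain[domain] >= MAX_PAGES_PER_DOMAIN:
        return False                                      # defense 2: per-domain cap
    pattern = url_pattern(url)
    if urls_per_pattern[pattern] >= MAX_PER_PATTERN:
        return False                                      # defense 3: trap pattern blacklist
    pages_per_domain[domain] += 1
    urls_per_pattern[pattern] += 1
    return True
```

After 100 calendar URLs collapse to the same pattern, the 101st is rejected outright, so the trap never consumes more than a sliver of crawl budget.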