
Web Crawler

VERY COMMON

A web crawler is a hard-difficulty topic asked at Google, Amazon, and Microsoft. It is how Googlebot discovers and indexes 100 billion web pages. You will design a two-level URL frontier that balances priority with politeness, deduplicate 15 billion URLs using Bloom filters, and handle crawler traps that generate infinite URL spaces.

  • Design a two-level URL frontier for priority and politeness
  • Deduplicate 15 billion URLs in 18 GB using Bloom filters
  • Detect and escape crawler traps that generate infinite URLs
Elevator Pitch (3-minute interview summary)

Web crawler for 15 billion pages in a 4-week cycle at 6,200 pages/sec. Two-level URL frontier: front queues prioritize by PageRank and freshness, back queues enforce politeness at 1 req/sec/domain. Bloom filter deduplicates URLs at 18 GB for 15B entries with 1% false positive rate. SimHash fingerprints detect near-duplicates (pages differing by less than 3 of 64 bits). Consistent hashing distributes domains across nodes. Checkpointing snapshots the frontier every 10 minutes. Storage: 1.5 PB for content, 120 GB each for URL and document checksums.
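The headline numbers in the pitch follow from simple arithmetic. A minimal sketch, assuming the ~100 KB average page size implied by 1.5 PB for 15B pages:

```python
# Back-of-envelope numbers behind the pitch.
PAGES = 15_000_000_000             # pages per crawl cycle
CYCLE_SECONDS = 4 * 7 * 24 * 3600  # 4-week cycle

throughput = PAGES / CYCLE_SECONDS  # required crawl rate, pages/sec

AVG_PAGE_BYTES = 100 * 1024        # assumed ~100 KB average page
content_pb = PAGES * AVG_PAGE_BYTES / 1e15  # content storage in PB

print(round(throughput), round(content_pb, 1))  # 6200 pages/sec, ~1.5 PB
```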

Concepts Unlocked (8 concepts in this topic)

BFS Crawl Strategy

EASY

We chose BFS (not DFS) to explore the web level by level from seed URLs. BFS finds high-value pages early because pages closer to the root have higher PageRank, and it gives even coverage across domains instead of diving deep into one site.

Core Feature Design
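The BFS-vs-DFS distinction comes down to the queue discipline. A minimal sketch, where `get_links` is a stand-in for fetch-and-parse and the link graph is a toy example:

```python
from collections import deque

def bfs_crawl(seeds, get_links, max_pages=100):
    """Level-by-level crawl from seed URLs. A FIFO queue gives BFS order;
    swapping in a stack (pop from the same end) would give DFS."""
    queue = deque(seeds)
    seen = set(seeds)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()          # FIFO -> breadth-first
        order.append(url)
        for link in get_links(url):
            if link not in seen:       # URL-level dedup
                seen.add(link)
                queue.append(link)
    return order

# Toy link graph: root links to /a and /b, which link one level deeper.
graph = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/b/1"]}
print(bfs_crawl(["/"], lambda u: graph.get(u, [])))
# -> ['/', '/a', '/b', '/a/1', '/b/1']  (all siblings before grandchildren)
```

Note how both children of the root are visited before any grandchild, which is what gives even coverage across domains.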

URL Frontier (Front + Back Queues)

EASY

We designed the frontier as two layers (not a single queue) because priority and politeness conflict. Front queues rank URLs by priority (PageRank, freshness). Back queues enforce politeness (one queue per domain, 1 req/sec).

Core Feature Design
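The two layers can be sketched as a priority heap draining into per-domain FIFOs. This is an assumed simplification (real frontiers shard across nodes and persist to disk); `delay` is the 1 req/sec politeness interval:

```python
import heapq
import time
from collections import defaultdict, deque

class Frontier:
    """Front: priority heap ranked by score (e.g. PageRank/freshness).
    Back: one FIFO per domain plus a next-allowed-fetch timestamp."""
    def __init__(self, delay=1.0):
        self.front = []                 # (-priority, seq, url, domain)
        self.back = defaultdict(deque)  # domain -> FIFO of URLs
        self.ready_at = {}              # domain -> earliest next fetch time
        self.delay = delay
        self.seq = 0                    # tie-breaker for equal priorities

    def add(self, url, domain, priority):
        heapq.heappush(self.front, (-priority, self.seq, url, domain))
        self.seq += 1

    def next_url(self, now=None):
        now = time.monotonic() if now is None else now
        # Drain front queue (highest priority first) into back queues.
        while self.front:
            _, _, url, domain = heapq.heappop(self.front)
            self.back[domain].append(url)
        # Serve any domain whose politeness delay has elapsed.
        for domain, q in self.back.items():
            if q and self.ready_at.get(domain, 0) <= now:
                self.ready_at[domain] = now + self.delay
                return q.popleft()
        return None  # all domains currently throttled
```

A fetcher asking for two `a.com` URLs back to back gets a `b.com` URL in between (or `None`), which is exactly the priority/politeness trade the two levels buy.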

Bloom Filter for URL Dedup

STANDARD

We chose a Bloom filter (not a full-URL hash set at 1.5 TB) because 18 GB fits in RAM on one machine for 15B URLs with 1% false positive rate. The 150M false positives per cycle are acceptable because we re-crawl every 4 weeks.

High Level Design
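The 18 GB figure falls out of the standard Bloom filter sizing formulas, which are worth being able to reproduce on the whiteboard:

```python
import math

def bloom_params(n, p):
    """Optimal Bloom filter sizing for n items at false-positive rate p:
    bits m = -n * ln(p) / (ln 2)^2, hash functions k = (m/n) * ln 2."""
    m = -n * math.log(p) / math.log(2) ** 2
    k = (m / n) * math.log(2)
    return m, k

m_bits, k = bloom_params(15_000_000_000, 0.01)
print(m_bits / 8 / 1e9, round(k))  # ~18 GB of RAM, 7 hash functions
```

At 1% false positives, 1% of 15B URLs is the 150M spurious "already seen" hits per cycle cited above; a full hash set of raw URLs (~100 bytes each) would be ~1.5 TB.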

SimHash Content Fingerprinting

STANDARD

We chose SimHash (not MD5) because it is a locality-sensitive hash: similar pages produce similar 64-bit fingerprints. Pages with fewer than 3 bit differences are near-duplicates. This catches mirror sites and syndicated content that URL dedup misses.

Core Feature Design
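A minimal SimHash sketch, using unweighted word tokens and MD5 as the per-token hash (real systems typically weight features by TF-IDF, which is what makes the tight 3-bit threshold reliable):

```python
import hashlib

def simhash(tokens, bits=64):
    """Each token votes +1/-1 on every bit position according to its hash;
    the sign of each column sum becomes one fingerprint bit."""
    v = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

doc = "breaking news the city council approved the transit plan today".split()
near_dup = [w if w != "today" else "yesterday" for w in doc]
# Fingerprints are 64-bit ints; a Hamming distance under 3 flags a near-dup.
print(hamming(simhash(doc), simhash(near_dup)))
```

Unlike MD5 over the whole page, where one changed byte scrambles the digest, similar inputs here produce fingerprints that differ in only a few bit positions.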

robots.txt and Politeness

EASY

We fetch robots.txt before crawling any domain and cache rules for 24 hours (not per-request). We respect Crawl-delay directives because violating politeness risks IP bans across CDN networks that protect 20%+ of the web.

Fault Tolerance
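Python's standard library already parses robots.txt; a sketch of the 24-hour rule cache around it, where `fetch_robots` stands in for an HTTP GET of the domain's robots.txt:

```python
import time
from urllib import robotparser

RULE_TTL = 24 * 3600   # cache parsed rules for 24 hours, not per-request
_cache = {}            # domain -> (fetched_at, parser)

def rules_for(domain, fetch_robots, now=None):
    """Return cached robots.txt rules for a domain, refetching after 24h."""
    now = time.time() if now is None else now
    cached = _cache.get(domain)
    if cached and now - cached[0] < RULE_TTL:
        return cached[1]                       # warm hit: no HTTP round trip
    rp = robotparser.RobotFileParser()
    rp.parse(fetch_robots(domain).splitlines())
    _cache[domain] = (now, rp)
    return rp

# Canned robots.txt in place of a live fetch.
ROBOTS = "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n"
rp = rules_for("example.com", lambda d: ROBOTS, now=0)
print(rp.can_fetch("mybot", "https://example.com/private/x"))  # False
print(rp.crawl_delay("mybot"))                                 # 2
```

The fetcher checks `can_fetch` before every request and feeds `crawl_delay` into the back-queue politeness timer.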

DNS Caching

STANDARD

We run a local DNS cache (not the OS resolver) for the top 1M domains in 58 MB of RAM. We chose this because the OS cache holds only 500 entries, and cold lookups at 50-200ms would block 620 fetcher threads at any moment.

High Level Design
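A sketch of the in-process cache, with the resolver injectable so the real `socket.gethostbyname` can be swapped out (the TTL value here is an assumption; ~58 bytes per entry is what puts 1M domains at roughly 58 MB):

```python
import socket
import time

class DnsCache:
    """In-process DNS cache so fetcher threads don't block on cold lookups."""
    def __init__(self, ttl=300, resolve=socket.gethostbyname):
        self.ttl = ttl
        self.resolve = resolve   # injectable for tests or a custom resolver
        self.entries = {}        # hostname -> (expires_at, ip)

    def lookup(self, host, now=None):
        now = time.monotonic() if now is None else now
        hit = self.entries.get(host)
        if hit and hit[0] > now:
            return hit[1]            # warm hit: no network round trip
        ip = self.resolve(host)      # cold miss: 50-200 ms in the real world
        self.entries[host] = (now + self.ttl, ip)
        return ip
```

With 620 fetcher threads in flight, a 100 ms cold lookup per fetch would dominate; a warm hit is a dictionary read.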

Crawl Checkpointing

STANDARD

We checkpoint every 10 minutes to HDFS/S3 (not local disk, not every minute). Maximum lost work: 3.72M pages (10 minutes to re-crawl). We chose 10-minute intervals to balance I/O overhead against the cost of a 4-week crawl restarting from scratch.

Fault Tolerance
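The 3.72M-page figure is the crawl rate times the checkpoint interval:

```python
RATE = 6_200        # pages/sec from the capacity estimate above
INTERVAL = 10 * 60  # checkpoint every 10 minutes

lost_on_crash = RATE * INTERVAL  # worst-case work to redo after a failure
print(f"{lost_on_crash / 1e6:.2f}M pages")  # 3.72M pages, ~10 min to re-crawl
```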

Crawler Trap Detection

TRICKY

We defend against traps with three layers (not a single depth limit): depth limit 15, max 10K pages/domain, and URL pattern detection. We chose three layers because each catches a different trap type: deep chains, wide traps, and parameterized infinite URLs.

Fault Tolerance
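The three layers compose into one admission check per discovered URL. A sketch with assumed pattern heuristics (repeated path segments and query-parameter explosions are two common signatures of parameterized infinite spaces):

```python
from collections import Counter
from urllib.parse import urlparse

MAX_DEPTH = 15
MAX_PAGES_PER_DOMAIN = 10_000
pages_per_domain = Counter()

def looks_like_trap(url):
    """Layer 3 (assumed heuristics): repeated path segments, e.g. an
    infinite calendar /2024/01/2024/01/..., or >10 query parameters."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    repeated = len(segments) - len(set(segments)) >= 3
    too_many_params = parts.query.count("&") + 1 > 10 if parts.query else False
    return repeated or too_many_params

def should_crawl(url, depth):
    domain = urlparse(url).netloc
    if depth > MAX_DEPTH:                                  # layer 1: deep chains
        return False
    if pages_per_domain[domain] >= MAX_PAGES_PER_DOMAIN:   # layer 2: wide traps
        return False
    if looks_like_trap(url):                               # layer 3: patterns
        return False
    pages_per_domain[domain] += 1
    return True
```

Each layer alone is escapable: a trap can stay shallow (beating the depth limit) or spread across paths (beating pattern matching), which is why the per-domain page cap backstops both.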