Web Crawler
VERY COMMON · Web crawler is a hard-difficulty topic asked at Google, Amazon, and Microsoft. It is how Googlebot discovers and indexes 100 billion web pages. You will design a two-level URL frontier that balances priority with politeness, deduplicate 15 billion URLs using Bloom filters, and handle crawler traps that generate infinite URL spaces.
- Design a two-level URL frontier for priority and politeness
- Deduplicate 15 billion URLs in 18 GB using Bloom filters
- Detect and escape crawler traps that generate infinite URLs
Visual Solutions
Step-by-step animated walkthroughs with capacity estimation, API design, database schema, and failure modes built in.
Cheat Sheet
Key concepts, trade-offs, and quick-reference notes for Web Crawler. Everything you need at a glance.
Anti-Patterns
Common design mistakes candidates make. Wrong approaches vs correct approaches for each trap.
Failure Modes
What breaks in production, how to detect it, and how to fix it. Detection metrics, mitigations, and severity ratings.
Start simple. Build to staff-level.
“Web crawler for 15 billion pages in a 4-week cycle at 6,200 pages/sec. Two-level URL frontier: front queues prioritize by PageRank and freshness, back queues enforce politeness at 1 req/sec/domain. Bloom filter deduplicates URLs at 18 GB for 15B entries with 1% false positive rate. SimHash fingerprints detect near-duplicates (pages differing by less than 3 of 64 bits). Consistent hashing distributes domains across nodes. Checkpointing snapshots the frontier every 10 minutes. Storage: 1.5 PB for content, 120 GB each for URL and document checksums.”
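The figures in the summary hang together arithmetically. A quick sanity check, using the standard Bloom sizing formula m = -n ln p / (ln 2)², and assuming an ~100 KB average page size and 8-byte checksums (both inferred from the stated totals, not given explicitly):

```python
import math

PAGES = 15e9
CYCLE_SECONDS = 4 * 7 * 24 * 3600  # 4-week crawl cycle

# Throughput needed to finish one full cycle
pages_per_sec = PAGES / CYCLE_SECONDS          # ~6,200 pages/sec

# Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits
fp_rate = 0.01
bits = -PAGES * math.log(fp_rate) / math.log(2) ** 2
bloom_gb = bits / 8 / 1e9                      # ~18 GB

# Checksum stores: assumed 8 bytes per entry
checksum_gb = PAGES * 8 / 1e9                  # 120 GB each

# Content store at an assumed ~100 KB average page size
content_pb = PAGES * 100e3 / 1e15              # 1.5 PB

print(round(pages_per_sec), round(bloom_gb), round(checksum_gb), content_pb)
```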
BFS Crawl Strategy
EASY · We chose BFS (not DFS) to explore the web level by level from seed URLs. BFS finds high-value pages early, because pages closer to the seeds tend to have higher PageRank, and it gives even coverage across domains instead of diving deep into one site.
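The strategy reduces to a FIFO frontier plus a seen-set. A minimal sketch, where `fetch_links` is a stand-in for the real fetch-and-parse step:

```python
from collections import deque

def bfs_crawl(seed_urls, fetch_links, max_pages=100):
    """Level-by-level crawl from seed URLs.
    fetch_links(url) -> list of outlinks (stand-in for fetch + parse)."""
    frontier = deque(seed_urls)          # FIFO queue gives breadth-first order
    seen = set(seed_urls)
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:         # dedup before enqueueing
                seen.add(link)
                frontier.append(link)
    return crawled

# Toy link graph: pages nearer the seed are visited before deeper pages
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": []}
order = bfs_crawl(["a"], lambda u: graph.get(u, []))
print(order)  # → ['a', 'b', 'c', 'd', 'e']
```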
Core Feature Design · URL Frontier (Front + Back Queues)
EASY · We designed the frontier as two layers (not a single queue) because priority and politeness conflict: the highest-priority URL is often on a domain we just hit. Front queues rank URLs by priority (PageRank, freshness); back queues enforce politeness (one queue per domain, 1 req/sec).
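A minimal sketch of the two levels (class and method names are illustrative; a production frontier also persists its queues and remembers each domain's last fetch time across empty periods):

```python
import heapq
import time
from collections import defaultdict, deque

class Frontier:
    """Two-level frontier sketch. Front: a priority heap over incoming URLs.
    Back: one FIFO per domain plus a min-heap of (next_allowed_time, domain)
    that enforces the per-domain delay (1 req/sec/domain by default)."""
    def __init__(self, delay=1.0):
        self.delay = delay
        self.front = []                      # (-priority, seq, url) max-heap
        self.back = defaultdict(deque)       # domain -> URLs awaiting fetch
        self.ready = []                      # (next_allowed_time, domain)
        self.seq = 0

    def add(self, url, priority):
        heapq.heappush(self.front, (-priority, self.seq, url))
        self.seq += 1

    def _domain(self, url):
        return url.split("/")[2]

    def next_url(self, now=None):
        now = time.time() if now is None else now
        # Drain front queue into per-domain back queues, best-priority first
        while self.front:
            _, _, url = heapq.heappop(self.front)
            d = self._domain(url)
            if not self.back[d]:             # domain not yet scheduled
                heapq.heappush(self.ready, (now, d))
            self.back[d].append(url)
        if self.ready and self.ready[0][0] <= now:
            t, d = heapq.heappop(self.ready)
            url = self.back[d].popleft()
            if self.back[d]:                 # re-schedule domain after delay
                heapq.heappush(self.ready, (t + self.delay, d))
            return url
        return None                          # no domain is polite to hit yet
```

High-priority URLs come out first, but two URLs on the same domain are never served within the delay window, so the second-best domain gets served instead.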
Core Feature Design · Bloom Filter for URL Dedup
STANDARD · We chose a Bloom filter (not a full-URL hash set at ~1.5 TB) because 18 GB fits in RAM on one machine for 15B URLs at a 1% false-positive rate. The ~150M false positives per cycle, URLs wrongly skipped as already seen, are acceptable because we re-crawl everything every 4 weeks.
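A toy-scale sketch of the structure (the production filter would be the 18 GB bit array; the class name and double-hashing scheme here are illustrative):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch. For a 1% false-positive rate the standard
    formulas give ~9.6 bits per entry and k = 7 hash functions."""
    def __init__(self, m_bits, k):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, url):
        # Double hashing: h1 + i*h2 yields k independent-enough bit positions
        digest = hashlib.sha256(url.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, url):
        # May return a false positive; never a false negative
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

bf = BloomFilter(m_bits=10_000, k=7)
bf.add("http://example.com/page1")
print(bf.might_contain("http://example.com/page1"))  # True
```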
High Level Design · SimHash Content Fingerprinting
STANDARD · We chose SimHash (not MD5) because it is a locality-sensitive hash: similar pages produce similar 64-bit fingerprints, whereas a cryptographic hash changes completely on a one-character edit. Pages whose fingerprints differ in fewer than 3 of 64 bits are treated as near-duplicates. This catches mirror sites and syndicated content that URL dedup misses.
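A compact sketch of the algorithm. Note the token-level hash can be anything (MD5 here); the locality sensitivity comes from the per-bit voting, not the hash itself:

```python
import hashlib

def simhash(text, bits=64):
    """SimHash sketch: hash each token to 64 bits, let tokens vote +1/-1 per
    bit position, and keep the sign of each total as the fingerprint bit."""
    votes = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

page_a = "breaking news the market rallied sharply today on strong earnings"
page_b = page_a + " reports"   # near-duplicate: one extra token
page_c = "recipe for sourdough bread with a long cold overnight proof"

d_near = hamming(simhash(page_a), simhash(page_b))
d_far = hamming(simhash(page_a), simhash(page_c))
# d_near lands near 0; unrelated pages differ in roughly half of the 64 bits
```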
Core Feature Design · robots.txt and Politeness
EASY · We fetch robots.txt before crawling any domain and cache the rules for 24 hours (not per-request). We respect Crawl-delay directives because violating politeness risks IP bans from the CDNs that front 20%+ of the web.
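A sketch of the 24-hour cache using Python's stdlib robots.txt parser. `RobotsCache` and `fetch_robots` are illustrative names; `fetch_robots` stands in for the HTTP fetch:

```python
import time
from urllib import robotparser

class RobotsCache:
    """Cache parsed robots.txt rules per domain with a 24h TTL (per the
    design). fetch_robots(domain) -> robots.txt text."""
    def __init__(self, fetch_robots, ttl=24 * 3600):
        self.fetch_robots, self.ttl = fetch_robots, ttl
        self.cache = {}                      # domain -> (parser, fetched_at)

    def allowed(self, domain, path, agent="mybot", now=None):
        now = time.time() if now is None else now
        entry = self.cache.get(domain)
        if entry is None or now - entry[1] > self.ttl:   # miss or expired
            rp = robotparser.RobotFileParser()
            rp.parse(self.fetch_robots(domain).splitlines())
            entry = (rp, now)
            self.cache[domain] = entry
        return entry[0].can_fetch(agent, path)

rules = "User-agent: *\nDisallow: /private/\nCrawl-delay: 1\n"
cache = RobotsCache(lambda domain: rules)
print(cache.allowed("example.com", "/index.html"))   # True
print(cache.allowed("example.com", "/private/x"))    # False
```

The parsed object also exposes the Crawl-delay directive via `RobotFileParser.crawl_delay(agent)`, which would feed the per-domain delay in the back queues.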
Fault Tolerance · DNS Caching
STANDARD · We run a local DNS cache (not the OS resolver) holding the top 1M domains in 58 MB of RAM. We chose this because the OS cache holds only ~500 entries, and cold lookups at 50-200 ms each would block the ~620 fetcher threads in flight at any moment.
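The cache itself is a small TTL map in front of a resolver. A sketch with an injectable resolver (the class name and 1-hour TTL are illustrative; top domains rarely change IPs):

```python
import socket
import time

class DNSCache:
    """In-process DNS cache sketch: a dict of host -> (ip, resolved_at)."""
    def __init__(self, ttl=3600, resolver=socket.gethostbyname):
        self.ttl, self.resolver = ttl, resolver
        self.cache = {}

    def resolve(self, host, now=None):
        now = time.time() if now is None else now
        hit = self.cache.get(host)
        if hit and now - hit[1] < self.ttl:
            return hit[0]                  # served from RAM, no network hop
        ip = self.resolver(host)           # cold lookup: 50-200 ms on the wire
        self.cache[host] = (ip, now)
        return ip

# Count resolver calls with a fake resolver (IP here is a placeholder)
calls = []
dns = DNSCache(resolver=lambda h: calls.append(h) or "93.184.216.34")
dns.resolve("example.com")
dns.resolve("example.com")
print(len(calls))  # → 1: the second lookup hit the cache
```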
High Level Design · Crawl Checkpointing
STANDARD · We checkpoint the frontier every 10 minutes to HDFS/S3 (not local disk, and not every minute). Maximum lost work on a crash: 3.72M pages, about 10 minutes of re-crawling. We chose 10-minute intervals to balance checkpoint I/O overhead against the cost of a 4-week crawl restarting from scratch.
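The worst-case loss follows directly from the crawl rate and the interval:

```python
CRAWL_RATE = 6_200          # pages/sec from the capacity plan
CHECKPOINT_INTERVAL = 600   # 10 minutes, in seconds

# Worst case: crash just before a checkpoint loses one full interval of work
max_lost_pages = CRAWL_RATE * CHECKPOINT_INTERVAL
print(f"{max_lost_pages / 1e6:.2f}M pages")  # → 3.72M pages
```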
Fault Tolerance · Crawler Trap Detection
TRICKY · We defend against traps with three layers (not a single depth limit): a depth limit of 15, a cap of 10K pages per domain, and URL pattern detection. We chose three layers because each catches a different trap type: deep link chains, wide link farms, and parameterized infinite URL spaces.
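The three layers compose into a single admission check. The depth and per-domain limits come from the design above; the URL patterns are illustrative examples (real deployments tune these, since naive patterns can block legitimate pagination):

```python
import re
from collections import Counter
from urllib.parse import urlparse

MAX_DEPTH = 15
MAX_PAGES_PER_DOMAIN = 10_000
# Query-string churn (session ids, calendar dates) is a classic infinite-URL trap
TRAP_PATTERNS = [re.compile(r"[?&](sessionid|sid|date)=", re.I)]

domain_counts = Counter()

def should_crawl(url, depth):
    """Apply the three trap defenses in order; any one of them rejects the URL."""
    if depth > MAX_DEPTH:                                  # 1. deep link chains
        return False
    domain = urlparse(url).netloc
    if domain_counts[domain] >= MAX_PAGES_PER_DOMAIN:      # 2. wide link farms
        return False
    if any(p.search(url) for p in TRAP_PATTERNS):          # 3. parameterized traps
        return False
    domain_counts[domain] += 1
    return True

print(should_crawl("http://example.com/a", depth=3))               # True
print(should_crawl("http://example.com/cal?date=2026-01-01", 3))   # False
```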