Web Crawler System Design Walkthrough
Complete design walkthrough with animated diagrams, capacity math, API design, schema, and failure modes.
Solution Path (target: 30 min)
We designed a web crawler processing 15 billion pages over a 4-week cycle at 6,200 pages/sec. A two-level URL frontier (front queues for priority, back queues for 1 req/sec/domain politeness) solves the priority-vs-politeness tension. An 18 GB Bloom filter handles URL dedup (versus a 120 GB hash set: 85% savings at a 1% false-positive rate). 1.5 PB of page storage on S3/HDFS compresses to 300 TB with gzip. Checkpointing every 10 minutes prevents losing weeks of crawl progress on node failure.
1. What Is a Web Crawler?
A web crawler starts from seed URLs, fetches each page, extracts links, and adds them to a queue. It is BFS at internet scale. Google indexes over 100 billion web pages, all discovered by crawlers.
Search engines use them for indexing, the Internet Archive for preservation, price comparison sites for product data, and SEO tools for link auditing. The real challenge is crawling billions of pages while being polite to servers, avoiding duplicate work, and finishing before content goes stale.
How do we fetch 6,200 pages per second without hammering any single domain? How do we track 15 billion visited URLs without a database lookup on every check?
What happens when a node crashes two weeks into a four-week crawl? What about a website generating infinite URLs from a calendar page?
Every design decision flows from these tensions.
Google's Googlebot indexes over 100 billion pages across the web. A web crawler systematically browses the internet by fetching pages, extracting links, and adding them to a URL frontier for future fetching. The cycle repeats: fetch, parse, extract, enqueue. This is how Google, Bing, Common Crawl, and the Internet Archive discover and catalog the web. The system sounds simple, but the real challenge is crawling 15 billion pages in 4 weeks while being polite to servers, avoiding duplicate content, and recovering from crashes without starting over.
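The fetch-parse-extract-enqueue cycle is plain BFS, which a short sketch makes concrete. This is a minimal single-threaded Python model: `fetch` and `extract_links` are hypothetical callbacks standing in for an HTTP client and an HTML parser, and the in-memory `visited` set is exactly what the Bloom filter later replaces at scale.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, max_pages=1000):
    """BFS at internet scale, in miniature: fetch, parse, extract, enqueue.

    fetch and extract_links are stand-ins for an HTTP client and an
    HTML link parser; the visited set is what a production crawler
    replaces with an 18 GB Bloom filter.
    """
    frontier = deque(seed_urls)        # FIFO queue gives breadth-first order
    visited = set(seed_urls)
    pages = []
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:               # fetch failed; a real crawler would retry
            continue
        pages.append((url, html))
        for link in extract_links(url, html):
            if link not in visited:    # URL-level dedup before enqueueing
                visited.add(link)
                frontier.append(link)
    return pages
```

Everything that follows (the frontier, dedup, politeness) is refinement of this loop so it survives 15 billion iterations.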
- We use BFS with priority queues (not DFS) to decide what to fetch next. The URL frontier has two layers: front queues rank URLs by importance (PageRank, freshness), and back queues enforce politeness (one request per second per domain). We chose this dual structure because a single queue cannot enforce both priority and politeness simultaneously.
- We deduplicate URLs with a Bloom filter: 18 GB for 15B URLs at a 1% false-positive rate. We chose the Bloom filter over a full-URL hash set (1.5 TB at ~100 bytes per URL) because it fits in RAM on one machine. For content dedup, we use SimHash: a 64-bit locality-sensitive hash where a Hamming distance of 3 or fewer bits means near-duplicate. SimHash catches mirror sites and syndicated content that URL dedup misses.
- At 6,200 pages/sec, the crawler generates 620 MB/sec of incoming data. Total storage: 1.5 PB for 15B pages at 100 KB average. Compressed with gzip at 5:1 ratio: 300 TB. We cache DNS for the top 1M domains (58 MB) to eliminate 50-200ms of latency per fetch.
- This topic tests your ability to reason about distributed state management, probabilistic data structures, and fault tolerance over long-running jobs. The patterns apply to any large-scale data ingestion pipeline: ETL systems, log collectors, and content aggregation platforms.
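The dual-structure frontier from the first bullet can be sketched as follows. This is a simplified in-memory model, assuming a heap-based front queue ordered by priority score and per-domain back queues gated by a next-allowed-fetch timestamp; the class name `Frontier` and its internals are illustrative, not a production design (real frontiers persist their queues to disk).

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class Frontier:
    """Two-level URL frontier: priority in front, politeness in back.

    Front: a min-heap ordered by priority score (lower = fetch sooner).
    Back: one FIFO queue per domain, plus the earliest time that domain
    may be hit again (here: 1 request per second per domain).
    """
    POLITENESS_DELAY = 1.0  # seconds between requests to the same domain

    def __init__(self):
        self.front = []                 # heap of (priority, url)
        self.back = defaultdict(deque)  # domain -> queued URLs
        self.next_ok = {}               # domain -> earliest allowed fetch time

    def add(self, url, priority):
        heapq.heappush(self.front, (priority, url))

    def _drain_front(self):
        # Move prioritized URLs into their domain's back queue, best first.
        while self.front:
            _, url = heapq.heappop(self.front)
            self.back[urlparse(url).netloc].append(url)

    def next_url(self, now=None):
        """Return a URL whose domain may be fetched now, or None if all
        domains with pending URLs are still rate-limited."""
        now = time.monotonic() if now is None else now
        self._drain_front()
        for domain, queue in self.back.items():
            if queue and self.next_ok.get(domain, 0.0) <= now:
                self.next_ok[domain] = now + self.POLITENESS_DELAY
                return queue.popleft()
        return None
```

The split is the point: the heap alone would hammer whichever domain holds the highest-priority URLs, and per-domain queues alone would ignore importance. Each layer enforces the constraint the other cannot.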
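The dedup numbers in the second bullet follow from the standard Bloom filter sizing formula, m = -n ln(p) / (ln 2)^2, and SimHash's Hamming-distance test. A small sketch, with illustrative function names:

```python
import math

def bloom_size_bits(n, p):
    """Bits for an optimal Bloom filter holding n items at false-positive rate p."""
    return n * -math.log(p) / math.log(2) ** 2

def bloom_num_hashes(n, m_bits):
    """Optimal number of hash functions for n items in m_bits."""
    return (m_bits / n) * math.log(2)

def hamming64(a, b):
    """Number of differing bits between two 64-bit SimHash fingerprints."""
    return bin((a ^ b) & ((1 << 64) - 1)).count("1")

def near_duplicate(fp1, fp2, threshold=3):
    """SimHash rule: fingerprints within 3 bits are near-duplicate pages."""
    return hamming64(fp1, fp2) <= threshold

m = bloom_size_bits(15e9, 0.01)
size_gb = m / 8 / 1e9                       # ~18 GB for 15B URLs at 1% FP
num_hashes = round(bloom_num_hashes(15e9, m))  # ~7 hash functions
```

The 1% false-positive rate is the price of the 85% memory savings: roughly 1 in 100 never-seen URLs is wrongly skipped, which is acceptable for a crawler but would not be for, say, a billing system.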
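The throughput and storage figures in the third bullet reduce to back-of-envelope arithmetic, reproduced below. The 58-byte DNS entry size is an assumption inferred from the quoted 58 MB cache for 1M domains, not a figure from the design itself.

```python
# Back-of-envelope capacity math for the figures quoted above.
SECONDS_PER_CYCLE = 4 * 7 * 24 * 3600       # 4-week crawl cycle
TOTAL_PAGES = 15e9
AVG_PAGE_BYTES = 100 * 1000                 # 100 KB average page
GZIP_RATIO = 5                              # 5:1 compression
DNS_ENTRY_BYTES = 58                        # assumed: domain + IP + TTL per entry

pages_per_sec = TOTAL_PAGES / SECONDS_PER_CYCLE           # ~6,200 pages/sec
ingest_mb_per_sec = pages_per_sec * AVG_PAGE_BYTES / 1e6  # ~620 MB/sec
raw_storage_pb = TOTAL_PAGES * AVG_PAGE_BYTES / 1e15      # 1.5 PB raw
compressed_tb = raw_storage_pb * 1000 / GZIP_RATIO        # 300 TB compressed
dns_cache_mb = 1e6 * DNS_ENTRY_BYTES / 1e6                # 58 MB, top 1M domains
```

Working these out from the 15B/4-week requirement, rather than memorizing them, is the skill the interview is probing.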