STANDARDwalkthrough

URL Dedup / Bloom Filter

3 of 8

3 related

The constraint: we have already seen 15 billion URLs, and before adding any discovered URL to the frontier, we must check whether we have visited it before. The naive approach stores every visited URL in a hash set.

Implication: that exceeds the RAM of any single commodity machine, forcing us to either shard the hash set or find a more memory-efficient structure. We chose a Bloom filter (not a distributed hash set) because it achieves the same deduplication goal with 18 GB at a 1% false positive rate, an 85% memory reduction.

“At 15 billion URLs with an 8-byte checksum per entry, that requires 120 GB of memory for deduplication alone.”

Trade-off: we gave up zero false positives in exchange for fitting the entire dedup structure on a single machine. A Bloom filter allows small false positives, meaning some valid URLs will never be crawled because the filter incorrectly reports them as already visited.

But zero false negatives guarantees no URL is crawled twice. The filter sits at the frontier's entrance: before any discovered URL enters the queue, we check it against the Bloom filter.

Google and Common Crawl both use Bloom filter variants for URL deduplication. The alternative we rejected is a disk-backed hash table, which eliminates false positives but adds disk I/O latency on every URL check.

At 6,200 pages per second, each page averaging 50 outbound links, we perform roughly 310,000 dedup lookups per second. Disk seeks at that volume become a serious bottleneck.

What if the interviewer asks: 'What about the 1% of valid URLs we miss due to false positives?' At 15 billion URLs, 1% means 150 million URLs incorrectly marked as visited. We accept this because these URLs will surface again in subsequent crawl cycles.

The web is re-crawled continuously, so a missed URL in cycle N gets picked up in cycle N+1.

Why it matters in interviews

This concept tests our ability to reason about memory trade-offs at scale. Calculating 120 GB vs 18 GB and explaining why a 1% false positive rate is acceptable shows the quantitative thinking interviewers value.

Related concepts

← PreviousPoliteness / robots.txt Next →Content Fingerprinting / SimHash