STANDARDwalkthrough
Consistent Hashing for Distribution
The constraint: a single machine cannot crawl the entire web. We need a cluster of crawler nodes, but naive modulo hashing (hash(domain) % N) reshuffles almost every domain assignment when a node joins or leaves.
We chose consistent hashing (not modulo hashing) because it assigns each hostname to a crawler node on a ring, and when a node dies, only 1/N of domains migrate to neighboring nodes. The remaining domains stay on their current nodes with their dedup state and rate limiters intact.
“With 20 nodes and one failure, roughly 95% of domains get reassigned, breaking our Bloom filter locality and politeness state.”
Trade-off: we gave up uniform load distribution (which modulo hashing guarantees) in exchange for minimal disruption on node changes. We fix the uniformity gap with virtual nodes: typically 150 to 200 per physical node, spreading each machine's responsibility across multiple points on the ring.
Why does consistent hashing exist in our crawler design specifically? Because hashing by domain name means all URLs for one domain land on the same node.
This gives us two properties for free: politeness enforcement without cross-node coordination (the node's local rate limiter handles it), and local deduplication (the node's Bloom filter sees every URL for its assigned domains). We chose hostname-based hashing (not URL-based) because URL-based hashing would scatter URLs from the same domain across all nodes, requiring a distributed rate limiter and a shared Bloom filter.
This is the same consistent hashing pattern used in DynamoDB and Cassandra for partition assignment. Common Crawl distributes its crawl across hundreds of nodes using hostname-based partitioning.
What if the interviewer asks: 'What about hot domains like wikipedia.org with millions of pages?' We split hot domains by path prefix: hash('wikipedia.org/wiki/A') and hash('wikipedia.org/wiki/B') land on different nodes. Each sub-partition enforces its own rate limit, and we set the aggregate limit across partitions to match the domain's politeness constraint.