Content Fingerprinting / SimHash
Two different URLs can serve identical or near-identical content. News syndication, mirror sites, and content farms create millions of duplicates across the web.
We need a fingerprint where similar documents produce similar hashes. The constraint: an MD5 checksum catches exact duplicates but misses pages that differ by a single ad banner or timestamp. We chose SimHash (not MD5 or MinHash) because it produces a 64-bit fingerprint in which documents whose fingerprints differ by fewer than 3 bits can be classified as near-duplicates.
The algorithm converts each page into a set of weighted token hashes, then collapses them into a single 64-bit value through a column-wise majority vote. The key property: SimHash preserves locality, so similar documents produce similar hashes.
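The column-wise majority vote above can be sketched in a few lines. This is a minimal illustration, not a production implementation: it assumes whitespace tokenization and term-frequency weights (real crawlers typically use word shingles and more deliberate weighting), and it uses truncated MD5 as a stand-in for any stable 64-bit token hash.

```python
import hashlib
from collections import Counter

def token_hash(token: str) -> int:
    """64-bit hash of a token (first 8 bytes of MD5; any stable 64-bit hash works)."""
    return int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")

def simhash(text: str) -> int:
    """Collapse weighted token hashes into a single 64-bit fingerprint.

    For each of the 64 bit positions, add +weight when a token's hash has
    that bit set and -weight when it does not; the sign of each column's
    total decides that fingerprint bit (the majority vote).
    """
    weights = Counter(text.lower().split())  # weight = term frequency (simplifying assumption)
    vector = [0] * 64
    for token, weight in weights.items():
        h = token_hash(token)
        for bit in range(64):
            vector[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit in range(64):
        if vector[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint
```

Because each token only nudges the 64 column totals, changing a small fraction of tokens (an ad banner, a timestamp) flips few majority votes, which is exactly the locality-preserving property described above.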
Cryptographic hashes like MD5 do the opposite: a single-bit input change produces a completely different output, which is exactly what we do not want for near-duplicate detection. At 15 billion pages, the fingerprint store requires 15B x 8 bytes = 120 GB.
Implication: this fits on a single high-memory machine or we can partition it across nodes using consistent hashing. We chose a 3-bit Hamming distance threshold (not 5 or 7 bits) because Google's research showed that 3 bits captures 95%+ of near-duplicates while keeping false positive rates below 0.1%.
Trade-off: we gave up detection of loosely similar pages (which would need a higher threshold) in exchange for precision. SimHash was invented by Moses Charikar, and Google adopted it for exactly this purpose: near-duplicate detection in web crawling. What if the interviewer asks: 'Why not use MinHash with Jaccard similarity instead?' MinHash works well for set similarity (like detecting plagiarism) but requires storing multiple hash values per document (typically 128-256); even compressed to one byte per value (b-bit minwise hashing), that multiplies our storage by 16x-32x compared to SimHash's single 64-bit value.
At 15 billion pages, that difference is the gap between 120 GB and 2-4 TB.
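The back-of-envelope arithmetic behind both figures, written out. The MinHash side assumes one byte per hash value (b-bit minwise hashing); with full 64-bit values the gap would be 16x larger still.

```python
PAGES = 15_000_000_000

# SimHash: one 64-bit (8-byte) fingerprint per page.
simhash_bytes = PAGES * 8          # 120 GB

# MinHash: 128-256 hash values per page, assumed 1 byte each
# (b-bit minwise hashing) -- a 16x-32x multiplier over SimHash.
minhash_low_bytes = PAGES * 128    # ~1.9 TB
minhash_high_bytes = PAGES * 256   # ~3.8 TB
```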