Content Fingerprinting / SimHash
Two different URLs can serve identical or near-identical content. News syndication, mirror sites, and content farms create millions of duplicates across the web.
We need a fingerprint where similar documents produce similar hashes. The constraint: an MD5 checksum catches exact duplicates but misses pages that differ by a single ad banner or timestamp. We chose SimHash (not MD5 or MinHash) because it produces a 64-bit fingerprint in which documents whose fingerprints differ by fewer than 3 bits can be classified as near-duplicates.
The algorithm converts each page into a set of weighted token hashes, then collapses them into a single 64-bit value through a column-wise majority vote. The key property: SimHash preserves locality, so similar documents produce similar hashes.
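The column-wise majority vote above can be sketched in a few lines. This is a minimal illustration, not a production implementation: it assumes whitespace tokenization and term-frequency weights (real crawlers typically use word shingles and more deliberate weighting), and it uses truncated MD5 as a stand-in for any stable 64-bit token hash.

```python
import hashlib
from collections import Counter

def token_hash(token: str) -> int:
    """64-bit hash of a token (first 8 bytes of MD5; any stable 64-bit hash works)."""
    return int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")

def simhash(text: str) -> int:
    """Collapse weighted token hashes into a single 64-bit fingerprint.

    For each of the 64 bit positions, add +weight when a token's hash has
    that bit set and -weight when it does not; the sign of each column's
    total decides that fingerprint bit (the majority vote).
    """
    weights = Counter(text.lower().split())  # weight = term frequency (simplifying assumption)
    vector = [0] * 64
    for token, weight in weights.items():
        h = token_hash(token)
        for bit in range(64):
            vector[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit in range(64):
        if vector[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint
```

Because each token only nudges the 64 column totals, changing a small fraction of tokens (an ad banner, a timestamp) flips few majority votes, which is exactly the locality-preserving property described above.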
Cryptographic hashes like MD5 do the opposite: a single-bit input change produces a completely different output, which is exactly what we do not want for near-duplicate detection. At 15 billion pages, the fingerprint store requires 15B x 8 bytes = 120 GB.
Implication: this fits on a single high-memory machine or we can partition it across nodes using consistent hashing. We chose a 3-bit Hamming distance threshold (not 5 or 7 bits) because Google's research showed that 3 bits captures 95%+ of near-duplicates while keeping false positive rates below 0.1%.
Trade-off: we gave up detection of loosely similar pages (which would need a higher threshold) in exchange for precision. SimHash was invented by Moses Charikar, and Google adopted it for exactly this purpose: near-duplicate detection in web crawling. What if the interviewer asks: 'Why not use MinHash with Jaccard similarity instead?' MinHash works well for set similarity (like detecting plagiarism) but requires storing multiple hash values per document (typically 128-256); even compressed to one byte per value (b-bit minwise hashing), that multiplies our storage by 16x-32x compared to SimHash's single 64-bit value.
At 15 billion pages, that difference is the gap between 120 GB and 2-4 TB.
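The back-of-envelope arithmetic behind both figures, written out. The MinHash side assumes one byte per hash value (b-bit minwise hashing); with full 64-bit values the gap would be 16x larger still.

```python
PAGES = 15_000_000_000

# SimHash: one 64-bit (8-byte) fingerprint per page.
simhash_bytes = PAGES * 8          # 120 GB

# MinHash: 128-256 hash values per page, assumed 1 byte each
# (b-bit minwise hashing) -- a 16x-32x multiplier over SimHash.
minhash_low_bytes = PAGES * 128    # ~1.9 TB
minhash_high_bytes = PAGES * 256   # ~3.8 TB
```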