Video Deduplication

The same movie trailer gets uploaded 10,000 times by different users. Without deduplication, we transcode all 10,000 copies at 6 resolutions each, wasting compute equivalent to 60,000 transcoding jobs.
The constraint: exact byte-for-byte matching fails because re-encodings, watermarks, and resolution changes produce different binaries from visually identical content. We need content-aware matching that recognizes visual similarity despite binary differences.
We chose perceptual hashing (pHash) as our primary detection method, not cryptographic hashing like MD5 or SHA-256, because pHash generates a compact 64-bit fingerprint from the low-frequency DCT coefficients of each frame. Visually similar frames produce hashes with a Hamming distance under 5, even if the underlying bytes are completely different. MD5 or SHA-256 would produce entirely different hashes for the same video re-encoded at a different bitrate, making them useless for visual deduplication.
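A minimal sketch of both halves of that claim: building a 64-bit fingerprint from the low-frequency DCT coefficients of a frame, and comparing two fingerprints by Hamming distance. This assumes the frame has already been downsampled to 32x32 grayscale upstream (real pipelines resize with a proper image library); the function names are illustrative, not a specific library's API.

```python
import numpy as np
from scipy.fftpack import dct

def phash(gray32: np.ndarray) -> int:
    """64-bit pHash from a 32x32 grayscale frame (downsampling assumed upstream)."""
    # 2-D DCT-II: transform rows, then columns.
    coeffs = dct(dct(gray32, axis=0, norm='ortho'), axis=1, norm='ortho')
    # Keep the 8x8 block of lowest-frequency coefficients (top-left corner).
    low = coeffs[:8, :8]
    # One bit per coefficient: 1 if above the block median, else 0.
    # (Some implementations exclude the DC term; kept simple here.)
    bits = (low > np.median(low)).flatten()
    h = 0
    for b in bits:
        h = (h << 1) | int(b)
    return h

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit fingerprints."""
    return bin(a ^ b).count('1')
```

Two uploads are treated as duplicates when `hamming(h1, h2) < 5`; the threshold is the tunable knob that trades false positives against missed re-encodes.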
Trade-off: we accept a small false positive rate (visually different frames that happen to hash similarly) in exchange for detecting re-encoded duplicates that cryptographic hashes would miss. We complement pHash with block matching, which divides frames into 16x16 pixel blocks and compares motion vectors to find visually identical sequences across longer clips.
YouTube's Content ID system scans every upload against a reference database of over 100 million files, and roughly 30% of uploads match existing content. Implication: a 30% dedup rate means we save roughly 30% of our transcoding compute budget.
Skipping one redundant 1080p transcode saves approximately 20 CPU-minutes. The dedup check runs before transcoding, gating the most expensive pipeline step.
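A back-of-envelope calculation ties those numbers together. The 1M uploads/day figure is a hypothetical assumption for scale, and 20 CPU-minutes is used as a rough per-rendition average (lower resolutions cost less than 1080p):

```python
# Assumptions: 1M uploads/day (hypothetical), 30% dedup rate (Content ID's
# reported match rate), 6 renditions per video, ~20 CPU-minutes per
# rendition as a rough average.
uploads_per_day = 1_000_000
dedup_rate = 0.30
renditions = 6
cpu_minutes_per_rendition = 20

saved_jobs = uploads_per_day * dedup_rate * renditions
saved_cpu_minutes = saved_jobs * cpu_minutes_per_rendition
print(f"{saved_jobs:,.0f} transcoding jobs skipped/day")
print(f"{saved_cpu_minutes / 60:,.0f} CPU-hours saved/day")
```

Even at a fraction of this assumed volume, the savings dwarf the cost of the hash lookup itself, which is why the check gates the pipeline.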
Trade-off on false positives: overly aggressive matching can flag legitimate fair-use content, so we combine automated hashing with manual review queues, accepting slower processing for flagged content in exchange for accuracy.

What if the interviewer asks: why not deduplicate after transcoding instead of before? Because transcoding is the most expensive step. Deduplicating before transcoding saves the compute entirely; deduplicating after wastes the compute and only saves storage.
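The ordering argument can be sketched as a gate in front of the transcode step. All names here are illustrative; the linear scan over known hashes stands in for what would be a BK-tree or LSH index at production scale:

```python
from typing import Callable, Dict

def process_upload(video_hash: int,
                   index: Dict[int, str],          # pHash -> canonical video id
                   transcode: Callable[[int], str],
                   max_distance: int = 5) -> str:
    """Return the canonical video id, transcoding only on a dedup miss."""
    for known_hash, video_id in index.items():
        if bin(video_hash ^ known_hash).count('1') <= max_distance:
            return video_id            # duplicate: reuse existing renditions
    video_id = transcode(video_hash)   # miss: pay the transcoding cost
    index[video_hash] = video_id
    return video_id
```

Because the gate sits before `transcode`, a hit skips all six renditions; flipping the order would run the expensive step unconditionally and recover only the storage.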
Why it matters in interviews
Interviewers test whether we think about compute savings before the expensive transcoding step. Explaining why we chose pHash over cryptographic hashing and showing the Hamming distance threshold demonstrates we understand content-aware deduplication, not naive byte comparison.