
Data Deduplication

When 50 people on the same team upload the same onboarding PDF, 49 of those copies are pure waste, and at the scale of a large storage service that redundancy adds up to petabytes. Data deduplication stores each unique chunk exactly once, tracked by its SHA-256 hash as the primary key in a chunk index table. A reference count (ref_count) tracks how many file versions point to each chunk.
Dropbox reports 40-60% storage savings across its entire user base from dedup alone. At 10 PB gross storage, that saves 4-6 PB, worth millions in S3 costs annually.
When a new upload produces a chunk hash that already exists, we skip the upload and increment the ref_count.
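The upload path can be sketched as follows. This is a minimal in-memory model, not the production design: the names (`chunk_index`, `chunk_store`, `upload_chunk`) are ours, and the two dicts stand in for the chunk index table and the blob store.

```python
import hashlib

# Illustrative in-memory state. In the real system the index is a table
# keyed by the SHA-256 hash, and the store is blob storage.
chunk_index: dict[str, int] = {}    # digest -> ref_count
chunk_store: dict[str, bytes] = {}  # digest -> chunk bytes

def upload_chunk(data: bytes) -> str:
    """Dedup on upload: store new chunks, bump ref_count on duplicates."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in chunk_index:
        chunk_index[digest] += 1    # duplicate: skip the upload entirely
    else:
        chunk_store[digest] = data  # first copy: actually store the bytes
        chunk_index[digest] = 1
    return digest

# 50 teammates upload the same PDF; only one copy lands in storage.
for _ in range(50):
    h = upload_chunk(b"%PDF-1.7 onboarding handbook ...")
```

After the loop, `chunk_store` holds one entry and `chunk_index[h]` is 50, which is exactly the 49-copies-saved arithmetic from the opening example.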
The deletion path is where dedup gets tricky: we cannot delete a chunk when one user removes their file because other users may still reference it. Instead, a garbage collector periodically scans for chunks where ref_count drops to zero and deletes them after a grace period.
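The decrement-then-collect lifecycle might look like the sketch below. The grace period value and the `tombstones` bookkeeping are assumptions for illustration; the point is that deletion is deferred and gated on ref_count still being zero when the collector runs.

```python
import time

chunk_index = {"abc": 1, "def": 2}   # digest -> ref_count (sample state)
tombstones: dict[str, float] = {}    # digest -> time ref_count hit zero
GRACE_PERIOD = 7 * 24 * 3600         # one week; an assumed value

def release_chunk(digest: str) -> None:
    """Called when a file version stops referencing a chunk."""
    chunk_index[digest] -= 1
    if chunk_index[digest] == 0:
        tombstones[digest] = time.time()  # eligible for GC later, not now

def collect_garbage(now: float) -> list[str]:
    """Delete chunks that stayed unreferenced for the full grace period."""
    deleted = []
    for digest, zero_since in list(tombstones.items()):
        # Re-check ref_count: a re-upload may have resurrected the chunk.
        if chunk_index[digest] == 0 and now - zero_since >= GRACE_PERIOD:
            del chunk_index[digest]
            del tombstones[digest]
            deleted.append(digest)
    return deleted
```

The re-check of `ref_count == 0` inside the collector is what makes the grace period safe: a chunk re-referenced between tombstoning and collection simply survives.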
We chose reference counting (not mark-and-sweep GC) because ref_count updates are O(1) per chunk operation, while mark-and-sweep requires a full graph traversal of all file versions. Trade-off: ref_count can drift if a crash occurs between decrementing and confirming the delete, so we run a weekly reconciliation job to correct drift.
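The reconciliation job can be sketched as a recount: file-version metadata is the source of truth, so drifted counters are recomputed from it. The function and data names here are hypothetical.

```python
from collections import Counter

def reconcile(chunk_index: dict[str, int],
              file_versions: dict[str, list[str]]) -> dict[str, int]:
    """Recompute true ref_counts from file metadata (the source of truth),
    fix drifted entries in place, and return the corrections made."""
    true_counts = Counter(d for chunks in file_versions.values() for d in chunks)
    corrections = {}
    for digest, counted in chunk_index.items():
        actual = true_counts.get(digest, 0)
        if counted != actual:
            corrections[digest] = actual
            chunk_index[digest] = actual
    return corrections

# A crash between decrement and confirm left "abc" over-counted.
index = {"abc": 3, "def": 1}
files = {"v1": ["abc", "def"], "v2": ["abc"]}
```

Running `reconcile(index, files)` corrects "abc" down to its true count of 2. This full recount is exactly the O(all file versions) traversal that mark-and-sweep would pay on every cycle, which is why it runs weekly rather than per operation.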
The catastrophic failure is a hash collision: two different chunks producing the same SHA-256 hash. The probability for any given pair is 1/2^256, effectively zero, but as defense-in-depth we store the chunk size alongside the hash and treat a size mismatch on a supposed duplicate as a collision (or corruption) rather than silently sharing the chunk.
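The size check can be sketched like this; the entry layout and `dedup_key` name are illustrative.

```python
import hashlib

# Illustrative index: each entry stores the chunk size next to its hash.
chunk_index: dict[str, dict] = {}

def dedup_key(data: bytes) -> str:
    """Return the dedup key for a chunk, with a size cross-check."""
    digest = hashlib.sha256(data).hexdigest()
    entry = chunk_index.get(digest)
    if entry is None:
        chunk_index[digest] = {"size": len(data)}
        return digest
    # Defense-in-depth: same hash but different size means a collision
    # (or corrupted metadata); fail loudly instead of sharing the chunk.
    if entry["size"] != len(data):
        raise RuntimeError(f"SHA-256 collision or corruption for {digest}")
    return digest
```

The check is cheap (one integer comparison per dedup hit) and catches the only observable symptom a collision could have: two chunks claiming the same hash while disagreeing about their own contents.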
Why it matters in interviews
This is the biggest cost lever in the system. Interviewers expect us to explain reference counting for safe deletion, the garbage collection lifecycle, and why the hash collision risk is negligible but still worth defending against.