
File Chunking

Why transfer a 1 GB file when only one paragraph changed? We split every file into fixed-size 4 MB blocks, each identified by its SHA-256 hash.
Only blocks with changed hashes get uploaded. Dropbox pioneered this approach, chunking 600 billion content blocks across 700 million users.
When a user edits a document, the client recomputes hashes for each block and compares them against the server's chunk manifest.
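The recompute-and-compare step can be sketched as follows. This is a minimal illustration, not Dropbox's implementation; the function names, the list-of-digests manifest format, and the index-based comparison are assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # fixed 4 MB blocks, as described above

def chunk_hashes(path: str) -> list[str]:
    """Split a file into fixed-size blocks and return each block's SHA-256 digest."""
    hashes = []
    with open(path, "rb") as f:
        while block := f.read(CHUNK_SIZE):
            hashes.append(hashlib.sha256(block).hexdigest())
    return hashes

def changed_blocks(local_hashes: list[str], server_manifest: list[str]) -> list[int]:
    """Return indices of blocks that differ from the server's chunk manifest
    (new blocks past the end of the manifest count as changed)."""
    return [
        i for i, h in enumerate(local_hashes)
        if i >= len(server_manifest) or server_manifest[i] != h
    ]
```

Only the block indices returned by `changed_blocks` need to be uploaded; everything else is already on the server.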
We chose 4 MB (not 64 KB or 64 MB) because smaller chunks detect finer-grained changes but explode the chunk index: at 64 KB, a 1 GB file needs over 16,000 manifest entries. Larger chunks shrink the index but amplify small edits, re-uploading tens of megabytes for a one-byte change. Trade-off: we accepted the CPU cost of hashing 256 chunks per 1 GB file (roughly 200 ms on a modern laptop).
To offset this, we use a rolling hash (Rabin fingerprint) as a cheap first pass to flag likely-changed blocks, falling back to SHA-256 only to verify them. The result: roughly 99% bandwidth reduction on incremental edits.
What if the interviewer asks: 'Why not use content-defined chunking?' Content-defined chunking (CDC) with Rabin fingerprinting produces variable-size chunks whose boundaries shift with content, improving dedup for inserted data. We chose fixed-size for simpler offset math in the chunk manifest, accepting slightly lower dedup on insertions.
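To make the CDC alternative concrete, here is a minimal sketch: a chunk boundary is cut wherever the fingerprint of a trailing byte window matches a fixed bit pattern, so boundaries follow content rather than offsets. The window size, mask, and use of truncated SHA-256 as the fingerprint are toy assumptions for illustration; a production CDC implementation would roll the fingerprint in O(1) per byte and enforce minimum/maximum chunk sizes.

```python
import hashlib

WINDOW = 16          # bytes of trailing context examined at each position (toy value)
MASK = (1 << 6) - 1  # boundary when low 6 bits are zero => ~64-byte average chunks

def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data into variable-size chunks with content-defined boundaries."""
    chunks, start = [], 0
    for i in range(WINDOW, len(data)):
        # A real implementation rolls this fingerprint incrementally;
        # recomputing SHA-256 per position keeps the sketch short.
        fp = int.from_bytes(hashlib.sha256(data[i - WINDOW:i]).digest()[:4], "big")
        if fp & MASK == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])  # final partial chunk
    return chunks
```

Because boundaries depend only on local content, inserting bytes near the start of a file shifts the first boundary but leaves later chunks identical, which is exactly the dedup-on-insertion win fixed-size chunking gives up.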
Why it matters in interviews
Interviewers want to hear us reason about the chunk size trade-off and explain why block-level sync beats file-level sync. Describing the two-hash approach (rolling hash for speed, SHA-256 for correctness) shows we understand the production implementation.