
Google Drive / Dropbox

COMMON

Cloud file storage design comes up at Google, Dropbox, and Microsoft interviews because it tests file chunking, sync protocols, and conflict resolution in one problem. It is how Dropbox syncs 600 billion files across 700 million users with sub-second change propagation. You will design a block-level chunking system that transfers only modified 4 MB chunks, a sync service that detects conflicts before they corrupt data, and a metadata database sharded to handle 100K file operations per second.

  • Design block-level file chunking that transfers only modified 4 MB chunks
  • Build a sync service with conflict detection across 3+ devices
  • Avoid the split-brain sync bug that corrupts shared folders
Asked at: Google · Dropbox · Microsoft · Amazon · Meta · Apple
Elevator Pitch (3-minute interview summary)

Cloud file storage for 100M DAU syncing 600 billion files across 700M users. Files split into 4 MB chunks identified by SHA-256. On edit, the client uploads only changed chunks, cutting bandwidth by 75%. Dedup via hash matching saves 40-60% of storage, bringing 10 PB gross to 5 PB net. The sync service uses long polling with cursor-based deltas for sub-second propagation. Metadata lives in MySQL sharded by user_id across 16 shards, handling 35K sync ops/sec at peak. Chunks live in S3 with 11 nines durability. Conflicts are detected via version vectors and surfaced as conflicted copies.
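The pitch's arithmetic can be sanity-checked with a quick back-of-envelope calculation (midpoint assumptions, not measured figures):

```python
# Back-of-envelope check of the pitch numbers (midpoint assumptions).
gross_pb = 10
dedup_savings = 0.5                      # midpoint of the 40-60% range
net_pb = gross_pb * (1 - dedup_savings)  # 5 PB net after dedup

peak_sync_ops = 35_000
shards = 16
ops_per_shard = peak_sync_ops / shards   # ~2.2K sync ops/sec per MySQL shard

chunk_mb = 4
changed_mb = 2 * chunk_mb                # 2 dirty chunks in an edited 1 GB file
                                         # -> 8 MB on the wire, not 1024 MB
```

Per-shard load in the low thousands of ops/sec is comfortably within a single MySQL primary's range, which is the justification for stopping at 16 shards.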

Concepts Unlocked (8 concepts in this topic)

File Chunking

EASY

We split files into 4 MB blocks and compute SHA-256 per chunk. We chose 4 MB (not 64 KB or 64 MB) because smaller chunks explode the index while larger chunks miss partial edits. A 1 GB file with 2 changed chunks transfers 8 MB, not 1 GB. Trade-off: CPU cost of hashing 256 chunks per GB.

Core Feature Design
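The chunk-and-diff step can be sketched as follows. Chunk size is parameterized so the example runs on small buffers; the function names are illustrative, not from any real client:

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, per the design above

def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split a buffer into fixed-size chunks and SHA-256 each one."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]

def changed_chunks(old: list[str], new: list[str]) -> list[int]:
    """Indices of chunks to upload: hash changed, or chunk is new."""
    return [i for i, h in enumerate(new) if i >= len(old) or old[i] != h]
```

Only the chunks at the returned indices go over the wire, which is how a 1 GB file with 2 dirty chunks transfers 8 MB.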

Data Deduplication

STANDARD

We store each unique chunk once, keyed by SHA-256 hash. We chose reference counting (not mark-and-sweep GC) for deletion tracking because ref_count updates are O(1). Saves 40-60% of total storage. Trade-off: ref_count can drift on crashes, requiring weekly reconciliation.

Core Feature Design
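A minimal in-memory sketch of the content-addressed store with reference counting (class and method names are illustrative):

```python
import hashlib

class ChunkStore:
    """Content-addressed store: one copy per unique chunk, refcounted."""
    def __init__(self):
        self._data = {}    # sha256 hex -> chunk bytes
        self._refs = {}    # sha256 hex -> reference count

    def put(self, chunk: bytes) -> str:
        h = hashlib.sha256(chunk).hexdigest()
        if h in self._data:
            self._refs[h] += 1      # dedup hit: bump refcount, store nothing
        else:
            self._data[h] = chunk   # first copy: store the bytes
            self._refs[h] = 1
        return h

    def release(self, h: str) -> None:
        self._refs[h] -= 1
        if self._refs[h] == 0:      # O(1) delete when the last ref drops
            del self._data[h]
            del self._refs[h]
```

The O(1) `release` is the argument for refcounting over mark-and-sweep; the drift-on-crash trade-off comes from `put`'s bytes-write and count-bump not being atomic in a real distributed deployment.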

Sync Protocol

STANDARD

We chose long polling (not WebSockets, not SSE) with cursor-based delta sync because long polling is stateless and firewall-friendly. Each device holds an open HTTP request for up to 60 seconds. Trade-off: up to 1 second latency versus WebSocket's near-instant push, but we eliminate connection affinity overhead.

High Level Design
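The server side of cursor-based delta sync can be sketched as an append-only change log that parks the request until there is something past the client's cursor (in-memory stand-in for the real HTTP endpoint; names are illustrative):

```python
import threading

class ChangeLog:
    """Append-only per-user change log; clients pull deltas by cursor."""
    def __init__(self):
        self._entries = []
        self._cond = threading.Condition()

    def append(self, change: dict) -> None:
        with self._cond:
            self._entries.append(change)
            self._cond.notify_all()   # wake any parked long-poll requests

    def long_poll(self, cursor: int, timeout: float = 60.0):
        """Block up to `timeout` s for changes past `cursor`; return
        (delta, new_cursor). Empty delta means the poll timed out."""
        with self._cond:
            self._cond.wait_for(lambda: len(self._entries) > cursor, timeout)
            return self._entries[cursor:], len(self._entries)
```

Because the cursor lives on the client and the log is replayable, any server can answer the next poll — the statelessness that motivated long polling over WebSockets.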

Conflict Resolution

TRICKY

We chose fork-and-surface (not LWW, not OT) because LWW silently discards edits and OT requires an always-online server that fails for binary files. We detect concurrent edits via version vectors and create a conflicted copy. Trade-off: users see occasional conflict files, but zero data loss.

Fault Tolerance
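The version-vector comparison behind fork-and-surface can be sketched with dict-based vectors mapping device/replica id to an edit counter (illustrative, not a specific library's API):

```python
def compare(a: dict, b: dict) -> str:
    """Compare two version vectors (replica_id -> counter)."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # b has seen everything a has, and more
    if b_le_a:
        return "after"
    return "concurrent"       # neither dominates: fork a conflicted copy
```

Only the "concurrent" outcome produces a conflicted copy; "before"/"after" means one side can safely fast-forward, which is why normal sequential edits never surface conflict files.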

Resumable Upload

EASY

We chose the tus protocol (not a custom one) because it is an open standard with battle-tested client libraries. The client PATCHes 4 MB chunks sequentially while the server tracks the committed byte offset in Redis; on retry, a HEAD request returns the offset to resume from. Trade-off: one extra HEAD request per retry, but the maximum wasted transfer per dropped connection is 4 MB instead of the full file.

Core Feature Design
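The offset bookkeeping that makes uploads resumable can be modeled in a few lines. This is a tus-style sketch, not the real protocol: the offset lives in an attribute instead of Redis, and `head`/`patch` stand in for the HTTP verbs:

```python
class ResumableUpload:
    """Tracks the committed byte offset so a retry resumes, not restarts."""
    def __init__(self, total_size: int):
        self.total = total_size
        self.offset = 0            # in production: stored in Redis
        self.buf = bytearray()     # in production: appended to blob storage

    def head(self) -> int:
        """Like tus HEAD: tell the client where to resume from."""
        return self.offset

    def patch(self, offset: int, data: bytes) -> int:
        """Like tus PATCH: append data at the expected offset."""
        if offset != self.offset:  # reject stale or out-of-order retries
            raise ValueError(f"expected offset {self.offset}, got {offset}")
        self.buf += data
        self.offset += len(data)
        return self.offset
```

A client that loses its connection mid-upload calls `head()` to learn the committed offset, then resumes the PATCH stream from there — at most one 4 MB chunk is ever re-sent.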

Metadata Sharding

STANDARD

We shard 100B file records across 16 MySQL shards by user_id hash (not file_id) because the dominant access pattern is user-scoped. Sharding by file_id would scatter every 'list my files' query across all 16 shards. Trade-off: cross-user queries (shared files) require scatter-gather, but these are 100x less frequent.

Database Design
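The routing function is the whole trick: hash the user_id, not the file_id, so every row for a user lands on one shard. A minimal sketch (MD5 chosen here only for a stable, language-independent hash; any stable hash works):

```python
import hashlib

NUM_SHARDS = 16

def shard_for(user_id: str) -> int:
    """Stable hash of user_id -> shard index 0..15. All of a user's
    file rows co-locate, so 'list my files' is a single-shard query."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

Note that Python's built-in `hash()` would not work here: it is salted per process, so two app servers would disagree on the shard. The scatter-gather cost for shared files falls out of this choice — the other users' rows live on other shards.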

Block Storage

EASY

We separate metadata in MySQL from chunks in S3/GCS because metadata is 200B/row and read-heavy while chunks are 4 MB and write-heavy. We chose S3 (not HDFS, not local disk) for 11 nines durability with zero disk management. Trade-off: crash between S3 write and MySQL update creates orphaned chunks, cleaned by reconciliation.

High Level Design
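The write order (chunk bytes first, metadata commit second) and the reconciliation it requires can be shown with in-memory stand-ins for S3 and MySQL (class and method names are illustrative):

```python
import hashlib

class FileService:
    """Metadata and chunk bytes live in separate stores; this sketch
    shows the write order and the orphan-chunk cleanup it implies."""
    def __init__(self):
        self.blobs = {}   # stands in for S3: chunk hash -> bytes
        self.meta = {}    # stands in for MySQL: filename -> chunk hashes

    def upload(self, name, chunks, crash_before_commit=False):
        hashes = []
        for c in chunks:
            h = hashlib.sha256(c).hexdigest()
            self.blobs[h] = c            # 1) durable chunk write
            hashes.append(h)
        if crash_before_commit:
            return                       # simulated crash: orphans remain
        self.meta[name] = hashes         # 2) metadata commit

    def orphans(self):
        """Reconciliation pass: blobs no metadata row references."""
        referenced = {h for hs in self.meta.values() for h in hs}
        return set(self.blobs) - referenced
```

Writing chunks before metadata means a crash leaves harmless orphans (safe to garbage-collect); the opposite order would leave metadata pointing at chunks that do not exist, which is data loss.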

Notification Service

STANDARD

We chose long polling (not WebSockets) for sync notifications because it is stateless, firewall-friendly, and needs no connection affinity. We fan out through Kafka partitioned by user_id. We rate-limit per-folder at 1/sec to prevent notification storms from viral shared folders. Trade-off: up to 1 second latency vs. WebSocket push.

High Level Design
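The per-folder rate limit is the piece worth sketching: one notification per folder per second, with suppressed events recovered later via delta sync (the Kafka fan-out is omitted; names and the `now` parameter are illustrative):

```python
import time

class FolderRateLimiter:
    """At most one notification per folder per interval: a viral shared
    folder produces one coalesced wakeup, not a notification storm."""
    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = {}   # folder_id -> time of last allowed notification

    def allow(self, folder_id: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        last = self._last.get(folder_id)
        if last is not None and now - last < self.min_interval:
            return False  # suppressed: clients catch up via delta sync anyway
        self._last[folder_id] = now
        return True
```

Dropping a notification is safe precisely because the sync protocol is cursor-based: the notification is only a hint to poll sooner, never the data itself.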