Uploading Entire File on Every Edit
Very Common
The constraint: at 100M DAU, naive full-file uploads would consume 1 exabyte/day of bandwidth. Without block-level chunking, every file save triggers a full re-upload. A 1 GB file edited once re-uploads all 1 GB instead of the 8 MB that actually changed.
Why: Chunking adds client-side complexity: hash computation, chunk manifest diffing, and partial upload logic. The naive approach is simpler to implement. Early sync tools, including the first versions of Google Drive, used file-level sync before switching to block-level.
WRONG: Upload the entire file on every save. A user editing a 1 GB presentation 10 times per day transfers 10 GB/day instead of 80 MB/day. Multiply by 100M DAU: 1 exabyte/day versus 8 PB/day. The bandwidth cost alone makes this infeasible at scale.
RIGHT: We split files into 4 MB chunks with SHA-256 hashes. On each edit, we recompute hashes and diff against the server manifest. We upload only the 2-3 changed chunks. Bandwidth drops by 99% for typical edits. We chose 4 MB (not smaller) to keep the chunk index manageable: 40 bytes per chunk entry.
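The chunk-and-diff flow above can be sketched in a few lines. This is an illustrative sketch, not a production client: `chunk_hashes` and `changed_chunks` are hypothetical helper names, and a real client would stream from disk rather than hold the file in memory.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, as chosen in the design above


def chunk_hashes(data: bytes) -> list[str]:
    """Split a file's bytes into 4 MB chunks and SHA-256 each one."""
    return [
        hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    ]


def changed_chunks(local: list[str], server: list[str]) -> list[int]:
    """Indices of chunks to upload: hash mismatches, plus any chunk
    that exists on only one side (file grew or shrank)."""
    longest = max(len(local), len(server))
    return [
        i for i in range(longest)
        if i >= len(local) or i >= len(server) or local[i] != server[i]
    ]
```

Editing one byte of a 10 MB file then flags exactly one of its three chunks for upload, which is where the 99% bandwidth reduction for typical edits comes from.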
Polling Server Every Second for Changes
Very Common
The constraint: 300M connected devices (100M DAU x 3 devices) need near-instant change notifications. Short-interval polling wastes server resources and bandwidth while still adding latency. Most polls return empty responses because files change infrequently.
Why: Polling is the simplest sync mechanism. It requires no server-side connection management and works through any proxy or firewall. Candidates default to it because it is familiar from web development.
WRONG: Poll GET /sync/changes every 1 second from each device. With 300M connected devices, that is 300M requests/sec. Over 99% return empty. Each empty response costs 200 bytes of headers. Wasted bandwidth: 60 GB/sec of empty poll responses.
RIGHT: We chose long polling (not WebSockets, not SSE) because it is stateless and firewall-friendly. Each device opens an HTTP request that the server holds open for up to 60 seconds. The server responds immediately when a change occurs. Empty polls drop from 300M/sec to 5M/sec (one reconnection per minute per device). Trade-off: up to 1 second latency versus WebSocket's near-instant push, but we eliminate persistent connection management.
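The server-side "hold the request until something changes" behavior can be modeled with a condition primitive. This is a simplified single-subscriber sketch, assuming a made-up `ChangeChannel` class; a real server would track per-user change cursors and many concurrent held connections.

```python
import threading


class ChangeChannel:
    """Long-poll primitive: a request blocks until a change arrives
    or the hold timeout (up to 60 s in the design above) expires."""

    def __init__(self) -> None:
        self._event = threading.Event()
        self._lock = threading.Lock()
        self._changes: list[str] = []

    def publish(self, change: str) -> None:
        """Record a change and wake any held long-poll request."""
        with self._lock:
            self._changes.append(change)
        self._event.set()

    def wait_for_changes(self, timeout: float = 60.0) -> list[str]:
        """Block up to `timeout` seconds; return pending changes.
        An empty list means the hold expired (client reconnects)."""
        self._event.wait(timeout)
        with self._lock:
            changes, self._changes = self._changes, []
            self._event.clear()
            return changes
```

An empty return after the timeout is the once-per-minute reconnection in the 5M/sec figure; a publish during the hold is answered immediately.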
Using File-Level Hashing for Dedup
Common
The constraint: 50 team members have slightly different versions of the same 100 MB file. File-level hashing treats each as entirely unique. Hashing the entire file instead of individual chunks misses partial duplicates and provides no benefit for incremental sync.
Why: File-level hashing is conceptually simpler: one hash per file, one lookup in the dedup index. Candidates think of dedup as a binary question (is this file a duplicate?) rather than a chunk-level optimization.
WRONG: Compute one SHA-256 hash per file. Two files that share 99% of their content but differ by 1 byte produce completely different hashes. Zero dedup benefit for similar files. A team of 50 people with slightly different versions of a 100 MB file stores 5 GB instead of 100 MB.
RIGHT: We compute SHA-256 per 4 MB chunk. Files sharing 99% of content share 99% of chunks in the dedup index. The 50 team members store 100 MB of shared chunks + 50 x 4 MB of unique chunks = 300 MB total. We chose block-level dedup (not file-level) because it captures partial duplicates that file-level completely misses. Trade-off: the chunk index is 100 TB instead of 4 TB (per-file index), but the 40-60% storage savings far exceed the index cost.
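A content-addressed chunk store makes the dedup mechanical: the chunk's SHA-256 is its key, so an already-present chunk is never stored twice. A minimal sketch, using an in-memory dict to stand in for the real chunk index (`store_file` is a hypothetical name):

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB chunks, per the design above


def store_file(data: bytes, chunk_store: dict[str, bytes]) -> list[str]:
    """Store a file as content-addressed chunks; return its manifest
    (ordered list of chunk hashes). Existing chunks are skipped."""
    manifest = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)  # no-op if already stored
        manifest.append(digest)
    return manifest
```

Two files that differ in one byte share every chunk except the one containing the edit, which is exactly the partial-duplicate case that file-level hashing misses.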
No Conflict Detection
Common
The constraint: two users can edit the same file offline for hours. Without conflict detection, one user's work is silently overwritten. Silently applying Last Writer Wins without surfacing conflicts means offline edits can be permanently lost without the user knowing.
Why: Conflict detection adds complexity: version vectors, conflict file creation, and user-facing resolution UI. LWW is simpler and works fine for metadata updates. Candidates apply the same strategy to file content without realizing the data loss risk.
WRONG: Apply Last Writer Wins to all file edits. User A edits offline for 3 hours, User B edits offline for 2 hours. B syncs first (version 6). A syncs and overwrites B's work (version 7). B's 2 hours of work are silently lost. No notification, no recovery.
RIGHT: We chose fork-and-surface (not LWW, not OT) because it guarantees zero data loss for any file type. We use version vectors to detect concurrent edits. When two edits share the same parent version, we store both and create a conflicted copy file. Trade-off: users see occasional conflict files, but data loss is the alternative. Data loss: zero. User friction: low (conflict files are rare in practice).
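The version-vector comparison behind fork-and-surface is short enough to show directly. A sketch, assuming vectors map device IDs to edit counters (`dominates` and `resolve` are illustrative names):

```python
def dominates(a: dict[str, int], b: dict[str, int]) -> bool:
    """True if vector `a` has seen every update that `b` has."""
    return all(a.get(device, 0) >= count for device, count in b.items())


def resolve(a: dict[str, int], b: dict[str, int]) -> str:
    """Fork-and-surface: keep the descendant if one version descends
    from the other; otherwise the edits were concurrent, so keep both
    and surface a conflicted copy to the user."""
    if dominates(a, b):
        return "keep_a"
    if dominates(b, a):
        return "keep_b"
    return "conflict"  # concurrent edits from a shared parent version
```

The "conflict" branch is the zero-data-loss guarantee: neither concurrent edit is ever discarded, unlike LWW where the loser vanishes silently.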
Single Metadata Database
Common
The constraint: 100 billion file records at 35K sync operations per second. A single MySQL instance maxes out at roughly 10K writes/sec and 1 billion rows before performance degrades.
Why: Starting with a single database is fine for prototyping. Candidates forget to shard when they scale the design to production numbers. A single MySQL instance handles roughly 10K writes/sec and 1 billion rows before performance degrades.
WRONG: Store all 100B file records in a single MySQL instance. At 200B per record, the table is 20 TB, far exceeding what a single instance can handle. Write throughput of 35K ops/sec at peak is 3.5x above a single instance's capacity. Every query scans a massive B-tree. p99 latency exceeds 500ms.
RIGHT: We shard by user_id hash (not file_id) across 16 MySQL instances. We chose user_id because the dominant access pattern ('my files', 'my sync') is user-scoped, so queries hit exactly one shard. Each shard holds 6.25B rows = 1.25 TB. Write throughput per shard: 35K / 16 ≈ 2,200 ops/sec, well within capacity. Trade-off: cross-user queries (shared files) require scatter-gather across all 16 shards, but these are 100x less frequent.
Storing Chunks Inline with Metadata
Common
The constraint: metadata is 200 bytes/row and read-heavy, while chunks are 4 MB and write-heavy. Putting file content in the same database as metadata forces one system to handle two conflicting access patterns.
Why: Storing everything in one place simplifies the architecture. Candidates use a BLOB column in MySQL for file content. This works for small files but fails catastrophically at scale because B-tree pages bloat with multi-megabyte values.
WRONG: Store 4 MB chunks as BLOB columns in MySQL alongside metadata. Each row becomes 4 MB+ instead of 200B. InnoDB buffer pool fills with blob data, evicting hot metadata pages. Index scans become 1000x slower. Backups take days. Replication lag spikes.
RIGHT: We separate metadata in MySQL from chunks in S3/GCS. We chose this split (not a single store) because the two access patterns conflict. Metadata rows stay at 200 bytes, keeping the B-tree compact and index scans fast. S3 handles multi-petabyte blob storage with 11 nines durability. Trade-off: a crash between S3 write and MySQL update creates orphaned chunks, which we clean with a periodic reconciliation job.
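The reconciliation job mentioned above reduces to a set difference between chunks present in blob storage and chunks referenced by any manifest. A sketch under that assumption (`orphaned_chunks` is a hypothetical helper; a real job would add a grace period so chunks from in-flight uploads are not collected):

```python
def orphaned_chunks(stored: set[str],
                    manifests: list[list[str]]) -> set[str]:
    """Chunks sitting in blob storage (e.g. S3) that no file manifest
    in the metadata DB references. A crash between the S3 write and
    the MySQL update leaves exactly these behind; a periodic job
    deletes them."""
    referenced = {h for manifest in manifests for h in manifest}
    return stored - referenced
```

Running this periodically (rather than transactionally coupling S3 and MySQL) is what makes the two-store split safe despite the lack of cross-system transactions.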
No Resumable Upload
Common
The constraint: a 5 GB upload on a 50 Mbps connection takes 13 minutes. Mobile networks drop frequently. Without resumable uploads, any network interruption forces a full restart from byte zero.
Why: Standard HTTP multipart upload works for small files. Candidates use the same approach for 5 GB files without considering network reliability. On mobile with a 30% drop rate for large files, the effective success rate is only 34% (0.7³ over 3 attempts).
WRONG: Accept the entire file in a single HTTP POST. A 5 GB upload at 50 Mbps takes 13 minutes. Connection drops at 90% (4.5 GB transferred). Entire upload restarts. After 3 failures, the user gives up. On mobile with 30% drop rate, effective success rate is 34%.
RIGHT: We chose the tus protocol (not a custom implementation) because it is an open standard with battle-tested client libraries. POST to create session, PUT 4 MB chunks sequentially, server tracks byte offset in Redis. On interruption, HEAD to get last offset, resume from there. Maximum wasted transfer per drop: 4 MB. Trade-off: one extra HEAD request per retry, plus 100 bytes of session state in Redis. Success rate on mobile: 99%+ because each chunk retry is fast.
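The server-side state for a resumable upload is just a committed byte offset the client can query after a drop. An illustrative in-memory sketch of that offset bookkeeping (not the tus wire protocol itself, which carries the offset in HTTP headers such as Upload-Offset over POST/PATCH/HEAD; `ResumableUpload` is a made-up class name):

```python
class ResumableUpload:
    """Tracks how many bytes of an upload the server has committed,
    so an interrupted client can resume instead of restarting."""

    def __init__(self, total_size: int) -> None:
        self.total_size = total_size
        self.offset = 0          # bytes durably received so far
        self.data = bytearray()

    def head(self) -> int:
        """Tell a reconnecting client where to resume from."""
        return self.offset

    def patch(self, offset: int, chunk: bytes) -> int:
        """Append a chunk; reject writes that do not start at the
        committed offset (the client must query head() first)."""
        if offset != self.offset:
            raise ValueError(f"expected offset {self.offset}, got {offset}")
        self.data.extend(chunk)
        self.offset += len(chunk)
        return self.offset
```

After a mid-transfer drop, the client's only penalty is re-sending the one in-flight chunk, which is the "maximum wasted transfer per drop: 4 MB" bound above.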
Synchronous Notification Fanout
Occasional
The constraint: a shared folder with 10,000 collaborators means one edit triggers 10,000 notifications (times 3 devices = 30,000 connections). Blocking the sync response while notifying all of them creates unbounded latency and can cascade into timeouts.
Why: The simplest implementation notifies all affected users in the same request handler that processes the sync. For unshared files (1 user, 3 devices), this is fast. But a shared folder with 10K collaborators turns a 50ms sync into a 10-second blocking fanout.
WRONG: Synchronously send 10,000 long-poll responses in the sync request handler. At 1ms per notification: 10 seconds of blocking. The user who saved the file waits 10 seconds for confirmation. If any notification target is slow, the entire sync stalls. Timeout cascades affect all users.
RIGHT: We decouple the save confirmation from the fanout. The sync handler writes a single event to Kafka and returns immediately (50ms). A separate notification consumer reads from Kafka, partitioned by user_id, and fans out to long-poll connections asynchronously. We rate-limit per-folder notifications at 1/sec to prevent storms. Trade-off: passive viewers see changes delayed by up to 30 seconds via tiered delivery, but the save path stays fast.
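The per-folder rate limit in the consumer can be a tiny coalescing check: if a notification for a folder went out within the last second, skip this one. A sketch with an injected clock for testability (`FolderNotifyLimiter` and `should_notify` are hypothetical names, not part of any Kafka API):

```python
class FolderNotifyLimiter:
    """Allow at most one notification per folder per interval,
    coalescing bursts to prevent notification storms."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval
        self._last_sent: dict[str, float] = {}

    def should_notify(self, folder_id: str, now: float) -> bool:
        """True if a notification may go out for this folder at time
        `now`; False if one was sent less than min_interval ago."""
        last = self._last_sent.get(folder_id)
        if last is not None and now - last < self.min_interval:
            return False  # coalesced into the recent notification
        self._last_sent[folder_id] = now
        return True
```

Because listeners re-fetch the folder state when notified, dropping intermediate notifications loses nothing: the next one delivered reflects all edits since the last.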