Cloud Storage Failure Modes
What breaks, how to detect it, and how to fix it. Each failure mode includes the breaking scenario, a severity assessment, and concrete mitigations.
Split-brain sync conflict corrupts shared folder
Data loss from overwritten edits is the most damaging failure in a file storage system. Users trust the system to preserve every edit.
Constraint: two engineers on the same team can edit the same file offline for hours. What breaks: Engineer A restructures sections 1-3, Engineer B rewrites the conclusion. Both land from a flight, reconnect, and sync within seconds of each other. Engineer A's sync completes first (version 6). Engineer B's sync arrives with parent version 5, conflicting with version 6. Without conflict detection, the server applies B's edit as version 7, overwriting A's 3 hours of work.
- We store both versions: A's edit as version 6, B's edit as a 'conflicted copy' file alongside the original. No data is lost.
- We surface the conflict to both users: notification says 'Conflicting edits detected for design-doc.pdf. Both versions are saved.'
- For shared folders: we lock the conflict to the specific file, not the entire folder. Other files sync normally.
- For text files: we implement three-way merge, diffing both edits against the common ancestor (version 5) and auto-merging non-overlapping changes. We flag overlapping regions for manual resolution.
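The conflict-detection step above can be sketched as a server-side version check. This is a minimal illustration, not the actual sync protocol: `FileRecord`, `apply_sync`, and the field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class FileRecord:
    """Server-side metadata for one file (schema is illustrative)."""
    current_version: int = 0
    conflicted_copies: list = field(default_factory=list)

def apply_sync(record: FileRecord, parent_version: int, content_id: str) -> str:
    """Commit an edit only if its parent matches the head version;
    otherwise preserve it as a conflicted copy so no data is lost."""
    if parent_version == record.current_version:
        record.current_version += 1
        return f"committed as v{record.current_version}"
    # Stale parent: another device synced first. Keep both versions.
    record.conflicted_copies.append(content_id)
    return "saved as conflicted copy"
```

Replaying the scenario: Engineer A syncs against head version 5 with parent 5 and commits version 6; Engineer B's edit arrives with parent 5 against head 6 and lands as a conflicted copy.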
Chunk upload partial failure leaves orphaned data
Orphaned chunks waste storage but do not corrupt data. The periodic GC keeps the waste bounded.
Constraint: large file uploads span many chunks over extended time. What breaks: a user uploads a 500 MB file (125 chunks). After 120 chunks succeed, the network drops. The user goes on a week-long trip. The upload session expires (24h TTL). The 120 chunks exist in S3 but are not linked to any file version. They are orphans consuming storage but unreachable.
- We extend session TTL for large uploads: if upload progress exceeds 50%, we auto-extend the session by 7 days.
- Resumable upload: on reconnect, the client sends a HEAD request to check session status. If the session has expired, we create a new one and re-link the existing 120 chunks (they are already in S3 and the chunk index).
- Orphan GC: we run weekly, deleting chunks in S3 that have ref_count = 0 in the chunk index AND are older than a 7-day grace period.
- Dedup prevents waste: if the 120 orphaned chunks match content from other users' files, they already have ref_count > 0 and are not orphans.
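The orphan GC rule above (ref_count = 0 AND older than the grace period) reduces to a simple sweep over the chunk index. A minimal sketch, assuming a dict-shaped index; the schema and function name are illustrative:

```python
GRACE_PERIOD_S = 7 * 24 * 3600  # 7-day grace period from the GC policy

def find_orphans(chunk_index: dict, now: float) -> list:
    """Return chunk hashes safe to delete from S3: unreferenced
    (ref_count == 0) AND created more than GRACE_PERIOD_S ago.
    chunk_index maps hash -> {"ref_count": int, "created_at": epoch_s}."""
    return [
        h for h, meta in chunk_index.items()
        if meta["ref_count"] == 0
        and now - meta["created_at"] > GRACE_PERIOD_S
    ]
```

The grace period matters: a chunk uploaded an hour ago with ref_count = 0 may belong to an in-flight session, so it is skipped even though it is currently unreferenced.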
Metadata DB failover returns stale version numbers
Stale versions cause sync to miss edits, which users perceive as data loss even though the data exists on the old primary's disk.
Constraint: MySQL replication has inherent lag. What breaks: the primary MySQL instance for shard 7 fails (disk error). A replica promotes to primary. The replica has 2 seconds of replication lag: it is missing the last 50 file version updates. A device syncing against the new primary receives version 5 for a file that is actually at version 7. It computes a delta against version 5, missing 2 edits.
- We use semi-synchronous replication: the primary waits for at least one replica to acknowledge each write before confirming to the client. This ensures the promoted replica has all committed writes.
- The sync service retries with exponential backoff when it detects a version number regression (device reports a higher version than the server shows).
- Read-after-write consistency: after a write, the sync service reads from the primary (not replica) for that user's next sync request. Sticky routing lasts 5 seconds.
- Failure detection: MySQL Group Replication with automatic failover. Promotion takes under 10 seconds. Client retry with backoff bridges the gap.
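The sticky read-after-write routing can be sketched as a per-user timestamp map: after a write, route that user's reads to the primary until the window expires. Class and method names are hypothetical.

```python
STICKY_WINDOW_S = 5  # route reads to the primary for 5 s after a write

class StickyRouter:
    """Per-user read routing: after a write, the user's next sync
    requests read from the primary until replicas have caught up."""

    def __init__(self):
        self._last_write = {}  # user_id -> epoch seconds of last write

    def record_write(self, user_id: str, now: float) -> None:
        self._last_write[user_id] = now

    def read_target(self, user_id: str, now: float) -> str:
        if now - self._last_write.get(user_id, float("-inf")) < STICKY_WINDOW_S:
            return "primary"
        return "replica"
```

Users who have not written recently keep reading from replicas, so the primary only absorbs the small read-after-write slice of traffic.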
Notification service overload from viral shared folder
Notification overload degrades sync latency for all users, not only the viral folder. The blast radius makes this a system-wide issue.
Constraint: a single edit in a 10,000-person shared folder must notify all collaborators. What breaks: the notification service must fan out to 10,000 long-poll connections (times 3 devices = 30,000 connections). A second edit arrives 500ms later, triggering another 30,000 notifications. During a busy meeting, 20 edits in 10 seconds generate 600,000 notification deliveries. Queue depth spikes, latency degrades for all users (not this folder alone), and long-poll connections start timing out.
- We batch notifications per folder: aggregate all changes within a 1-second window into a single notification event. 20 edits in 10 seconds become 10 batched notifications, not 20.
- We fan out through Kafka partitioned by user_id: each partition handles notifications for a subset of users, preventing a single viral folder from monopolizing the notification pipeline.
- We rate-limit per folder: cap notifications at 1 per second per folder. Changes within the same second are merged. Users receive 'folder updated' rather than individual file notifications.
- We use tiered delivery: immediate notification for the file owner and active editors. Delayed notification (up to 30 seconds) for passive viewers. This reduces peak fan-out by 80%.
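The 1-second batching window can be sketched by bucketing change events per folder per window, so each bucket produces one delivery regardless of how many edits fell inside it. The function name and event shape are illustrative.

```python
from collections import defaultdict

WINDOW_S = 1.0  # aggregate changes per folder within a 1-second window

def batch_notifications(events):
    """Collapse (folder_id, timestamp) change events into one
    notification per folder per window. Returns sorted
    (folder_id, window_start) pairs, one delivery each."""
    windows = defaultdict(set)
    for folder_id, ts in events:
        windows[folder_id].add(int(ts // WINDOW_S))
    return sorted(
        (folder, w * WINDOW_S) for folder, ws in windows.items() for w in ws
    )
```

Feeding in 20 edits spread over 10 seconds yields 10 batched notifications, matching the reduction described above; the 600,000-delivery spike halves before fan-out even begins.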
Dedup hash collision stores wrong content
A hash collision that corrupts user data destroys trust in the platform. Defense-in-depth with multiple verification layers is essential.
Constraint: we rely on SHA-256 uniqueness for content addressing. What breaks: a bug in the chunking library truncates the hash to 128 bits for performance. By the birthday bound, the probability of any collision among n chunks in a b-bit space is roughly n^2 / 2^(b+1); with 2.5 trillion chunks, truncating from 256 to 128 bits raises that probability from about 10^-53 to about 10^-14, a jump of nearly 40 orders of magnitude. A new chunk matches an existing hash, the upload is skipped (dedup), and the file now points to the wrong chunk content. The user downloads their file and gets corrupted data.
- We use the full 256-bit SHA-256 hash: with 2^256 possible values, even 2.5 trillion chunks have a birthday-bound collision probability of roughly 10^-53, effectively zero.
- We store chunk size alongside hash: if two chunks produce the same hash but differ in size, we flag for byte-level comparison before dedup. This catches truncation bugs.
- End-to-end integrity: we store a file-level SHA-256 checksum (computed from all chunks in order). We verify on every download. This detects any chunk substitution regardless of cause.
- Immutable chunks: once written to S3, a chunk is never overwritten. Even if a collision is detected, the original chunk is preserved and the new chunk is stored separately.
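The hash-plus-size guard can be sketched as a two-field content address: dedup fires only when both the full SHA-256 digest and the chunk size match an indexed entry. Function names and the index shape are illustrative.

```python
import hashlib

def dedup_key(chunk: bytes) -> tuple:
    """Content address: full 256-bit SHA-256 digest plus chunk size.
    Storing the size alongside the hash catches truncation bugs: equal
    (possibly buggy) hashes with different sizes are never deduped."""
    return (hashlib.sha256(chunk).hexdigest(), len(chunk))

def should_dedup(chunk: bytes, index: dict) -> bool:
    """True only if an indexed chunk matches on both digest and size;
    a hash match with a size mismatch must fall through to a
    byte-level comparison instead."""
    digest, size = dedup_key(chunk)
    existing = index.get(digest)
    return existing is not None and existing["size"] == size
```

The size check is cheap insurance: it adds a few bytes per index entry but turns a silent wrong-content substitution into a detectable mismatch.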
Storage quota race condition allows overage
Quota overage costs the platform money but does not corrupt data. The atomic Redis check prevents most cases; periodic reconciliation catches the rest.
Constraint: multiple devices can upload simultaneously for the same user. What breaks: a user with 14.5 GB used on a 15 GB plan uploads from 3 devices simultaneously. Each device checks the quota independently: 14.5 GB < 15 GB, proceed. Each uploads a 300 MB file. All three pass the quota check. Total usage: 14.5 + 0.9 = 15.4 GB, exceeding the quota by 400 MB.
- Atomic quota check-and-reserve in Redis: DECRBY the user's remaining-quota key by the upload size; if the result is negative, INCRBY it back and reject the upload. Each DECRBY executes atomically, so concurrent devices serialize on the counter and cannot all pass the check.
- Periodic reconciliation: cron job every hour computes actual storage per user by summing file sizes. Corrects any drift between the Redis counter and actual MySQL data.
- Soft quota: we allow overage up to 5% (750 MB on a 15 GB plan) to avoid user friction from race conditions. We block further uploads once the hard cap is reached.
- Upload reservation: when an upload session starts, we reserve the declared file_size from the quota. We release the reservation if the upload is abandoned after 24h.
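The check-and-reserve step can be illustrated with an in-process sketch. This stands in for the Redis counter purely for illustration: in production the same logic runs as one atomic Redis operation so the check and the decrement cannot interleave across devices. The class and method names are hypothetical.

```python
import threading

class QuotaReservation:
    """In-process sketch of atomic quota check-and-reserve. The lock
    plays the role Redis atomicity plays in production: the check and
    the decrement happen as one indivisible step."""

    def __init__(self, limit: int, used: int = 0):
        self._lock = threading.Lock()
        self.remaining = limit - used  # any consistent unit (bytes, MB, ...)

    def try_reserve(self, size: int) -> bool:
        """Reserve `size` against the quota; reject if it would go negative."""
        with self._lock:
            if self.remaining < size:
                return False
            self.remaining -= size
            return True

    def release(self, size: int) -> None:
        """Return reserved capacity, e.g. when an upload is abandoned."""
        with self._lock:
            self.remaining += size
```

Replaying the scenario in MB: a user with 14,500 MB used on a 15,000 MB plan has 500 MB remaining, so of three simultaneous 300 MB reservations only the first succeeds and the other two are rejected instead of overshooting the quota.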