
Cloud Storage Failure Modes

What breaks, how to detect it, and how to fix it. Every failure mode includes detection signals, mitigations, and a severity rating.

HIGH

Split-brain sync conflict corrupts shared folder

Data loss from overwritten edits is the most damaging failure in a file storage system. Users trust the system to preserve every edit.

Constraint: two engineers on the same team can edit the same file offline for hours. What breaks: Engineer A restructures sections 1-3, Engineer B rewrites the conclusion. Both land from a flight, reconnect, and sync within seconds of each other. Engineer A's sync completes first (version 6). Engineer B's sync arrives with parent version 5, conflicting with version 6. Without conflict detection, the server applies B's edit as version 7, overwriting A's 3 hours of work.

Detection
Parent-version comparison: B's edit claims parent version 5, but the server's latest is version 6. The parent version mismatch flags this as a concurrent edit, not a sequential one.
Mitigation
  1. We store both versions: A's edit as version 6, B's edit as a 'conflicted copy' file alongside the original. No data is lost.
  2. We surface the conflict to both users: notification says 'Conflicting edits detected for design-doc.pdf. Both versions are saved.'
  3. For shared folders: we lock the conflict to the specific file, not the entire folder. Other files sync normally.
  4. For text files: we implement three-way merge, diffing both edits against the common ancestor (version 5) and auto-merging non-overlapping changes. We flag overlapping regions for manual resolution.
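The parent-version check behind this detection and mitigation can be sketched in a few lines. This is a minimal model with illustrative names; the real sync service would also persist the conflicted copy and notify both users:

```python
def apply_edit(latest_version: int, parent_version: int) -> tuple[str, int]:
    """Decide the outcome of an incoming edit from the version it was built on."""
    if parent_version == latest_version:
        # Sequential edit: fast-forward the file to the next version.
        return ("applied", latest_version + 1)
    # Parent mismatch: a concurrent edit already landed. The incoming edit
    # is saved as a 'conflicted copy'; the latest version is unchanged.
    return ("conflicted_copy", latest_version)

# Engineer A syncs first: parent 5 matches latest 5, file becomes version 6.
outcome_a, latest = apply_edit(latest_version=5, parent_version=5)
# Engineer B syncs seconds later, still claiming parent 5: conflict detected.
outcome_b, latest = apply_edit(latest_version=latest, parent_version=5)
```

Because the conflicted edit never replaces the latest version, neither user's work is overwritten.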
MEDIUM

Chunk upload partial failure leaves orphaned data

Orphaned chunks waste storage but do not corrupt data. The periodic GC keeps the waste bounded.

Constraint: large file uploads span many chunks over extended time. What breaks: a user uploads a 500 MB file (125 chunks). After 120 chunks succeed, the network drops. The user goes on a week-long trip. The upload session expires (24h TTL). The 120 chunks exist in S3 but are not linked to any file version. They are orphans consuming storage but unreachable.

Detection
Periodic reconciliation job compares S3 inventory against the chunk index. Chunks in S3 with no matching entry in the file_chunks table are flagged as orphans. Upload sessions that expire with partial progress are logged.
Mitigation
  1. We extend session TTL for large uploads: if upload progress exceeds 50%, we auto-extend the session by 7 days.
  2. Resumable upload: on reconnect, the client sends HEAD to check session status. If expired, we create a new session and re-link the existing 120 chunks (they are already in S3 and the chunk index).
  3. Orphan GC: we run weekly, deleting chunks in S3 that have ref_count = 0 in the chunk index AND are older than a 7-day grace period.
  4. Dedup prevents waste: if the 120 orphaned chunks match content from other users' files, they already have ref_count > 0 and are not orphans.
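The weekly orphan GC amounts to a set difference with a grace-period filter. A minimal sketch, assuming a chunk index that records a `ref_count` and a creation timestamp (field names are illustrative):

```python
GRACE_SECONDS = 7 * 24 * 3600  # 7-day grace period before deletion

def find_orphans(s3_keys, chunk_index, now):
    """chunk_index maps chunk key -> {'ref_count': int, 'created_at': epoch secs}.
    Flags chunks that S3 holds but no file version references."""
    orphans = []
    for key in s3_keys:
        entry = chunk_index.get(key)
        if entry is None:
            # In S3 but unknown to the chunk index: flagged as an orphan.
            orphans.append(key)
        elif entry["ref_count"] == 0 and now - entry["created_at"] > GRACE_SECONDS:
            # Unreferenced and past the grace period: safe to delete.
            orphans.append(key)
    return orphans
```

The grace period protects chunks from in-flight uploads that have not yet been linked to a file version.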
HIGH

Metadata DB failover returns stale version numbers

Stale versions cause sync to miss edits, which users perceive as data loss even though the data exists on the old primary's disk.

Constraint: MySQL replication has inherent lag. What breaks: the primary MySQL instance for shard 7 fails (disk error). A replica promotes to primary. The replica has 2 seconds of replication lag: it is missing the last 50 file version updates. A device syncing against the new primary receives version 5 for a file that is actually at version 7. It computes a delta against version 5, missing 2 edits.

Detection
The sync service detects the regression when a device's next sync reports a version number higher than the server's latest. Replication lag monitoring alerts when lag exceeds 1 second.
Mitigation
  1. We use semi-synchronous replication: the primary waits for at least one replica to acknowledge each write before confirming to the client. This ensures the promoted replica has all committed writes.
  2. The sync service retries with exponential backoff when it detects a version number regression (device reports a higher version than the server shows).
  3. Read-after-write consistency: after a write, the sync service reads from the primary (not replica) for that user's next sync request. Sticky routing lasts 5 seconds.
  4. Failure detection: MySQL Group Replication with automatic failover. Promotion takes under 10 seconds. Client retry with backoff bridges the gap.
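The read-after-write sticky routing in mitigation 3 reduces to a per-user timer: writes stamp the user, and reads within the 5-second window go to the primary. A minimal sketch with an injectable clock (names are illustrative):

```python
STICKY_SECONDS = 5.0  # how long reads stay pinned to the primary after a write

def make_router(now_fn):
    last_write = {}  # user_id -> timestamp of that user's last write

    def record_write(user_id):
        last_write[user_id] = now_fn()

    def route_read(user_id):
        t = last_write.get(user_id)
        if t is not None and now_fn() - t < STICKY_SECONDS:
            return "primary"   # within the window: avoid replica lag
        return "replica"       # safe to read from a replica

    return record_write, route_read
```

Injecting the clock keeps the routing decision deterministic and testable.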
HIGH

Notification service overload from viral shared folder

Notification overload degrades sync latency for all users, not only the viral folder. The blast radius makes this a system-wide issue.

Constraint: a single edit in a 10,000-person shared folder must notify all collaborators. What breaks: the notification service must fan out to 10,000 long-poll connections (times 3 devices = 30,000 connections). A second edit arrives 500ms later, triggering another 30,000 notifications. During a busy meeting, 20 edits in 10 seconds generate 600,000 notification deliveries. Queue depth spikes, latency degrades for all users (not this folder alone), and long-poll connections start timing out.

Detection
Notification queue depth exceeding 100K messages. Per-folder notification rate exceeding 10/sec (normal: under 1/sec). Long-poll timeout rate spiking above 5%.
Mitigation
  1. We batch notifications per folder: aggregate all changes within a 1-second window into a single notification event. 20 edits in 10 seconds become 10 batched notifications, not 20.
  2. We fan out through Kafka partitioned by user_id: each partition handles notifications for a subset of users, preventing a single viral folder from monopolizing the notification pipeline.
  3. We rate-limit per folder: cap notifications at 1 per second per folder. Changes within the same second are merged. Users receive 'folder updated' rather than individual file notifications.
  4. We use tiered delivery: immediate notification for the file owner and active editors. Delayed notification (up to 30 seconds) for passive viewers. This reduces peak fan-out by 80%.
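The 1-second batching window in mitigation 1 is a bucketing step: changes are grouped by (folder, window) and each group becomes a single notification. A minimal offline sketch (the real service would flush each bucket as its window closes):

```python
def batch_events(events, window=1.0):
    """events: iterable of (timestamp_secs, folder_id, change).
    Returns one aggregated notification per folder per window."""
    buckets = {}
    for ts, folder, change in events:
        buckets.setdefault((folder, int(ts // window)), []).append(change)
    return [
        {"folder": folder, "window": w, "changes": changes}
        for (folder, w), changes in sorted(buckets.items())
    ]

# 20 edits over 10 seconds in one folder collapse to at most 10 notifications.
edits = [(i * 0.5, "design-folder", f"edit-{i}") for i in range(20)]
notifications = batch_events(edits)
```

Each batched notification still fans out to the folder's collaborators, but the fan-out count is bounded by the window, not the edit rate.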
LOW

Dedup hash collision stores wrong content

A hash collision that corrupts user data destroys trust in the platform. Defense-in-depth with multiple verification layers is essential.

Constraint: we rely on SHA-256 uniqueness for content addressing. What breaks: a bug in the chunking library truncates the hash to 128 bits for performance, raising the per-pair collision probability to 1/2^128. With 2.5 trillion chunks, the birthday paradox gives a non-negligible collision chance at 128 bits. A new chunk matches an existing hash, the upload is skipped (dedup), and the file now points to the wrong chunk content. The user downloads their file and gets corrupted data.

Detection
Checksum verification on download: we compute SHA-256 of the assembled file and compare against the stored file-level checksum. Mismatch indicates a chunk substitution. User report of corrupted file content.
Mitigation
  1. We use the full 256-bit SHA-256 hash: at 2^256 possible values, even 2.5 trillion chunks have a collision probability of about 10^-53, effectively zero.
  2. We store chunk size alongside hash: if two chunks produce the same hash but differ in size, we flag for byte-level comparison before dedup. This catches truncation bugs.
  3. End-to-end integrity: we store a file-level SHA-256 checksum (computed from all chunks in order). We verify on every download. This detects any chunk substitution regardless of cause.
  4. Immutable chunks: once written to S3, a chunk is never overwritten. Even if a collision is detected, the original chunk is preserved and the new chunk is stored separately.
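The full-hash-plus-size dedup key from mitigations 1 and 2 can be sketched directly. An in-memory dict stands in for the chunk index; names are illustrative:

```python
import hashlib

def dedup_key(chunk: bytes) -> tuple[str, int]:
    # Full 256-bit hash plus chunk size. Equal hashes with differing sizes
    # cannot collide on this compound key, which catches truncation-style bugs.
    return hashlib.sha256(chunk).hexdigest(), len(chunk)

def store_chunk(index: dict, chunk: bytes) -> bool:
    """Returns True if newly stored, False if deduplicated against existing content."""
    key = dedup_key(chunk)
    if key in index:
        index[key]["ref_count"] += 1   # existing content: bump the refcount
        return False
    index[key] = {"data": chunk, "ref_count": 1}  # new, immutable chunk
    return True
```

The ref_count maintained here is the same counter the orphan GC consults before deleting a chunk.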
MEDIUM

Storage quota race condition allows overage

Quota overage costs the platform money but does not corrupt data. The atomic Redis check prevents most cases; periodic reconciliation catches the rest.

Constraint: multiple devices can upload simultaneously for the same user. What breaks: a user with 14.5 GB used on a 15 GB plan uploads from 3 devices simultaneously. Each device checks the quota independently: 14.5 GB < 15 GB, proceed. Each uploads a 300 MB file. All three pass the quota check. Total usage: 14.5 + 0.9 = 15.4 GB, exceeding the quota by 400 MB.

Detection
Post-upload reconciliation: we compare total storage used against quota. We flag accounts where usage exceeds quota by more than one upload's worth. Concurrent upload counter per user exceeding expected device count.
Mitigation
  1. Atomic quota check-and-decrement in Redis: WATCH the user's quota key, read the remaining bytes, then run DECRBY inside a MULTI/EXEC transaction; if another device modified the key first, the transaction aborts and the client retries. If the decrement would drive remaining below zero, we reject the upload. This serializes the check across devices.
  2. Periodic reconciliation: cron job every hour computes actual storage per user by summing file sizes. Corrects any drift between the Redis counter and actual MySQL data.
  3. Soft quota: we allow overage up to 5% (750 MB on a 15 GB plan) to avoid user friction from race conditions. We block further uploads once the hard cap is reached.
  4. Upload reservation: when an upload session starts, we reserve the declared file_size from the quota. We release the reservation if the upload is abandoned after 24h.
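The atomic check-and-reserve from mitigations 1 and 4 can be modeled in memory: a lock stands in for Redis executing a Lua script or MULTI/EXEC transaction without interleaving. This is a sketch of the logic, not a Redis client:

```python
import threading

class QuotaReserver:
    """In-memory model of the atomic check-and-reserve step. The lock plays
    the role of Redis's single-threaded command execution."""

    def __init__(self, quota_bytes: int, used_bytes: int = 0):
        self._lock = threading.Lock()
        self.quota = quota_bytes
        self.used = used_bytes

    def try_reserve(self, size: int) -> bool:
        with self._lock:
            if self.used + size > self.quota:
                return False          # would exceed the hard cap: reject
            self.used += size         # reserve the declared file size
            return True

    def release(self, size: int) -> None:
        with self._lock:
            self.used -= size         # abandoned upload: return the reservation

# Three devices each try to upload 300 MB with 14.5 GB of a 15 GB plan used:
# only the first reservation succeeds. Units here are MB for readability.
q = QuotaReserver(quota_bytes=15_000, used_bytes=14_500)
results = [q.try_reserve(300) for _ in range(3)]
```

Because check and increment happen under one critical section, the three-device race from the scenario above collapses to a single winner.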