Video Streaming Failure Modes
What breaks, how to detect it, and how to fix it. Each failure mode below covers the triggering constraint, what breaks, mitigations, and user impact.
Transcoding worker crash mid-encode loses chunk progress
Without chunk-level independence, a single crash can block an entire video from processing. With it, only the failed chunk retries. User impact: delayed processing, not data loss.
Constraint: each 4K encoding task peaks at 8 GB RAM. An FFmpeg worker encoding chunk #42 runs out of memory and the OOM killer terminates the process. The chunk was 90% encoded. Without chunk-level checkpointing, the partial output is unusable. The queue's visibility timeout (5 minutes) expires and the chunk returns to the queue. Another worker re-encodes from scratch. If a bad frame caused the crash, the chunk fails repeatedly until it hits the Dead Letter Queue (DLQ). What breaks: one corrupted frame can block an entire video from reaching 'ready' status.
- Checkpoint at the chunk level: each 10-second chunk is an independent unit. A crash loses at most 10 seconds of encoding work, not the entire video. We recover by re-enqueuing only the failed chunk.
- Retry failed chunks up to 3 times with exponential backoff. After 3 failures, move to DLQ for manual inspection. User impact: the video shows 'processing' for longer but no data is lost.
- Set FFmpeg memory limits per resolution: 2 GB for 720p, 4 GB for 1080p, 8 GB for 4K. Over-provision worker instances by 20% to handle memory spikes.
- Auto-scale the worker fleet based on queue depth. If pending chunks exceed 10,000, launch additional spot instances within 60 seconds.
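The retry-then-DLQ policy above can be sketched as a small decision function. This is a minimal illustration, not production code; the 30-second base delay and the +/-10% jitter are assumed values, not stated in the text.

```python
import random

MAX_ATTEMPTS = 3     # after 3 failures the chunk goes to the DLQ (from the policy above)
BASE_DELAY_S = 30    # first retry delay -- an assumed value for illustration

def next_action(attempt: int) -> tuple[str, float]:
    """Decide what to do with a chunk after its `attempt`-th failure.

    Returns ("retry", delay_seconds) or ("dlq", 0.0).
    """
    if attempt >= MAX_ATTEMPTS:
        # Repeated failures (e.g. a bad frame) stop blocking the video:
        # the chunk moves to the Dead Letter Queue for manual inspection.
        return ("dlq", 0.0)
    # Exponential backoff: 30s, 60s, ... with jitter so retries don't synchronize.
    delay = BASE_DELAY_S * (2 ** (attempt - 1))
    jitter = delay * random.uniform(-0.1, 0.1)
    return ("retry", delay + jitter)
```

The jitter matters when many chunks fail at once (e.g. a worker host dies): without it, all retries land back on the queue at the same instant.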
CDN cache miss storm on viral video
A viral video can take down the origin and degrade streaming for all users, not just viewers of that video. The CDN is the primary defense.
Constraint: a brand-new viral video has zero cached copies across 200+ edge POPs. A celebrity uploads a video that gets 5 million views in the first 10 minutes. All 200+ POPs simultaneously request the first segment from the origin. Each POP sends 10-50 concurrent requests for different segments before the cache fills, so the origin receives on the order of 2,000-10,000 simultaneous requests. What breaks: the origin's bandwidth saturates at 50 Gbps. Segments start timing out. Viewers see buffering. The CDN keeps retrying, amplifying the load.
- Pre-warm: detect trending videos (rapid view count increase) and push segments to all edge POPs before the surge hits. Trigger at 10K views in 5 minutes. User impact during recovery: first viewers experience higher latency until edge caches populate.
- Request coalescing at the edge: when multiple viewers request the same uncached segment simultaneously, the POP sends only one request to origin and serves all waiting viewers from that single response.
- Staggered TTLs: set TTL = base + random(0, 60s) so segments across POPs do not all expire at the same moment, preventing synchronized cache miss storms.
- Origin shield: an intermediate cache layer between edge POPs and the origin. All POPs fetch from the shield, which fetches from origin. Reduces origin fan-in from 200 POPs to 3-5 shield nodes.
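Request coalescing (often called "single-flight") is the mitigation that is easiest to get subtly wrong, so here is a minimal sketch of the idea: the first thread to miss on a key becomes the leader and fetches from origin; concurrent callers for the same key wait on that one in-flight request instead of issuing their own. The class and method names are illustrative, not from any specific CDN.

```python
import threading

class SingleFlight:
    """Coalesce concurrent fetches of the same uncached key into one origin request."""

    def __init__(self, fetch):
        self._fetch = fetch          # key -> bytes; stands in for the origin fetch
        self._lock = threading.Lock()
        self._inflight = {}          # key -> Event, set when the leader's response lands
        self._results = {}           # key -> bytes; doubles as a simple cache

    def get(self, key):
        with self._lock:
            event = self._inflight.get(key)
            if event is None:
                # First caller for this key: become the leader.
                event = threading.Event()
                self._inflight[key] = event
                leader = True
            else:
                leader = False
        if leader:
            self._results[key] = self._fetch(key)   # the single origin request
            event.set()                             # wake all waiting followers
            with self._lock:
                del self._inflight[key]
        else:
            event.wait()             # followers block until the leader's response arrives
        return self._results[key]
```

With this in place, a thundering herd of viewers for one uncached segment produces exactly one origin request per POP rather than one per viewer.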
Upload interrupted at 90% completion
Resumable uploads make this a minor inconvenience (re-upload 10 MB) instead of a disaster (re-upload 5 GB). The key is correct byte offset tracking.
Constraint: a 5 GB upload on a 50 Mbps connection takes 13 minutes over a single TCP connection. A creator on a mobile network has uploaded 4.5 GB (90%) when the train enters a tunnel and the connection drops. Without resumable uploads, the entire 4.5 GB is lost. Even with resumable uploads, a bug in the server's byte offset tracking (an error in the Content-Range parser) causes the server to report offset 4,500,000,000 when only 4,499,995,000 bytes were persisted. What breaks: the client sends the next chunk starting at the wrong offset, corrupting the assembled file.
- Resumable upload protocol: we persist each chunk to object storage immediately and track the last confirmed byte offset in a database. On reconnect, client queries GET /upload/status for the exact offset. User impact: re-upload at most 10 MB instead of the full file.
- Per-chunk checksums: client sends MD5 of each chunk in the header. We verify before acknowledging. This detects corruption from network errors or offset bugs.
- Upload session TTL of 24 hours: gives the user a full day to resume after an interruption, then cleans up abandoned uploads to reclaim storage.
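The two server-side checks above (confirmed offset, per-chunk MD5) can be sketched together. This is an illustrative model, not a real API: the `UploadSession` class and its in-memory buffer stand in for object storage plus the offset database, and the dict responses stand in for HTTP responses.

```python
import hashlib

class UploadSession:
    """Track one resumable upload: persist chunks, confirm offsets, verify checksums."""

    def __init__(self):
        self.data = bytearray()      # stands in for object storage
        self.confirmed_offset = 0    # last byte confirmed durable (the DB record)

    def status(self):
        """What the client queries on reconnect (GET /upload/status)."""
        return {"offset": self.confirmed_offset}

    def put_chunk(self, offset, chunk, md5_hex):
        # Reject chunks that do not start exactly at the confirmed offset --
        # this is the check that prevents offset-bug corruption.
        if offset != self.confirmed_offset:
            return {"ok": False, "offset": self.confirmed_offset}
        # Verify the per-chunk checksum before acknowledging anything.
        if hashlib.md5(chunk).hexdigest() != md5_hex:
            return {"ok": False, "offset": self.confirmed_offset}
        self.data.extend(chunk)
        self.confirmed_offset += len(chunk)
        return {"ok": True, "offset": self.confirmed_offset}
```

Note that the server advances `confirmed_offset` only after the bytes are persisted and verified, so the offset it reports can never run ahead of durable data.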
Storage corruption in encoded video segments
Silent corruption is the worst kind: no alarms fire, but viewers see artifacts. Checksums and automated validation are the only defense. User impact: visible glitches until detection and purge.
Constraint: we write billions of segments to S3, and bit flips during transit are statistically inevitable at this volume. A transcoded 1080p segment is written to S3 but a bit flip during transfer corrupts 4 bytes in the middle. The segment is served to viewers, who see a green flash or frozen frame. What breaks: the corruption is silent. No error from S3 because the object was stored successfully (the corruption happened in transit). The CDN caches the corrupted segment, serving it to thousands of viewers for the duration of the TTL.
- End-to-end checksums: compute MD5 or CRC32 of each segment after encoding, store the checksum alongside the segment. Verify on every read from storage. S3 supports Content-MD5 header for upload integrity. User impact: if checksum fails, we serve from a replica instead of the corrupted copy.
- 3x replication across availability zones: if one copy is corrupted, we serve from a healthy replica. S3 Standard provides 99.999999999% (11 nines) durability.
- Automated segment validation: after transcoding, run FFprobe on every segment to verify decodability before marking the video as 'ready.' Reject segments that fail decode.
- CDN cache purge API: when corruption is detected, immediately invalidate the corrupted segment across all edge POPs and re-serve from a healthy origin copy.
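The verify-on-read-with-replica-fallback flow can be sketched as follows. The dicts stand in for S3 buckets in different availability zones; CRC32 is used here for brevity (the text allows MD5 or CRC32), and the function names are illustrative.

```python
import zlib

def store_segment(store, key, payload):
    """Write a segment plus a checksum computed right after encoding."""
    store[key] = {"bytes": payload, "crc32": zlib.crc32(payload)}

def read_segment(primary, replicas, key):
    """Verify the checksum on every read; fall back to a healthy replica on mismatch."""
    for store in (primary, *replicas):
        obj = store.get(key)
        if obj is not None and zlib.crc32(obj["bytes"]) == obj["crc32"]:
            return obj["bytes"]
    # All 3 copies failing verification is the case that pages a human.
    raise IOError(f"all copies of {key} failed checksum verification")
```

The key property is that a bit flip in transit produces a checksum mismatch on read, so the corrupted copy is silently skipped instead of silently served.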
Transcoding backlog exceeds queue capacity
A transcoding backlog does not lose data but degrades creator experience. Videos stuck in 'processing' for hours cause support tickets and churn. User impact: delayed video availability, not data loss.
Constraint: our transcoding fleet processes 2,500 jobs/sec at steady state. A popular event (New Year's Eve, Super Bowl) causes upload volume to spike 5x from 100 to 500 uploads/sec. Each upload generates 8 transcoding jobs (one per resolution). The queue receives 4,000 jobs/sec but workers process 2,500 jobs/sec. The backlog grows by 1,500 jobs/sec. After 1 hour: 5.4 million pending jobs. What breaks: new uploads wait hours before transcoding starts. Creators see 'processing' for hours instead of minutes.
- Auto-scale transcoding workers based on queue depth: if pending > 10K, launch additional instances. We use spot/preemptible instances for 70% cost savings. Target: new workers ready within 90 seconds. User impact during scaling: queue drains within 10 minutes of additional capacity arriving.
- Priority queuing: premium users and short videos (< 5 min) get higher priority. A 30-second clip should not wait behind a 3-hour documentary.
- Graceful degradation: during extreme spikes, we transcode only the 3 most common resolutions (480p, 720p, 1080p) first, queue 4K and 240p for later. Most viewers will not notice the missing extremes. User impact: video is playable faster but missing edge-case resolutions temporarily.
- Queue depth monitoring with alerts at 10K (warning), 50K (critical), and 100K (page on-call). Dashboard showing queue depth, drain rate, and ETA to empty.
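The scale-out decision above can be reduced to a small function: given the backlog and a drain-time target, compute how many workers the fleet needs. This is a sketch of one possible policy; the 10-minute drain target and the one-job-per-second-per-worker rate are assumptions for illustration, and scale-in is deliberately left out.

```python
import math

def desired_workers(pending_jobs, current_workers, jobs_per_worker_per_s,
                    target_drain_s=600, scale_threshold=10_000):
    """Workers needed to drain the backlog within target_drain_s.

    Assumed policy: scale only above the 10K-pending alert threshold,
    and never below the current fleet (scale-in is handled elsewhere).
    """
    if pending_jobs <= scale_threshold:
        return current_workers
    needed_rate = pending_jobs / target_drain_s          # jobs/sec required to drain in time
    needed = math.ceil(needed_rate / jobs_per_worker_per_s)
    return max(current_workers, needed)
```

Plugging in the worked example: a 5.4-million-job backlog with a 10-minute drain target requires 9,000 jobs/sec of capacity, which an autoscaler would then request as additional spot instances.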
ABR manifest becomes stale after re-encoding
Affects only re-encoded videos, not new uploads. But when it happens, the video is completely unwatchable until the stale cache expires. User impact: playback failure for the specific re-encoded video.
Constraint: CDN caches the manifest (.m3u8) with a 1-hour TTL. A video is re-encoded (creator changes title triggering thumbnail regeneration, or the platform upgrades codecs from H.264 to AV1). The new segments are written to S3 with updated keys. The manifest is updated to point to new segment URLs. What breaks: the CDN has the old manifest cached. Viewers fetching the manifest get old segment URLs. The old segments were deleted after re-encoding. Result: 404 errors for every segment, video appears completely broken. This persists until the CDN TTL expires.
- Version the manifest URL: /v1/videos/{id}/master.m3u8?v={version_hash}. Each re-encode produces a new version hash, busting the CDN cache automatically. User impact: none, because the new URL bypasses the stale cache.
- Keep old segments alive for 2 hours after re-encoding (matching the maximum CDN TTL). We delete old segments only after all cached manifests have expired.
- CDN cache purge on re-encode: we use the CDN's purge API to invalidate the old manifest across all POPs immediately after the new manifest is written.
- Client-side retry: if segment fetch returns 404, the player re-fetches the manifest (bypassing cache with a cache-buster query parameter) and retries with the updated segment URLs.
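The manifest-versioning scheme can be sketched in a few lines: derive the version hash from the current segment set, so any re-encode (new segment keys) automatically yields a new manifest URL. Deriving the hash from segment keys is one plausible choice, not a statement about how any particular platform does it.

```python
import hashlib

def manifest_url(video_id: str, segment_keys: list[str]) -> str:
    """Versioned manifest URL derived from the current segment set.

    A re-encode changes the segment keys, which changes the hash, which
    changes the URL -- so stale cached manifests are simply never fetched.
    """
    version = hashlib.sha256("\n".join(segment_keys).encode()).hexdigest()[:12]
    return f"/v1/videos/{video_id}/master.m3u8?v={version}"
```

Because the old URL keeps serving the old manifest until its TTL expires, this pairs naturally with the keep-old-segments-alive mitigation: viewers mid-stream on the old manifest still find their segments.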