Music Streaming Failure Modes

What breaks, how to detect it, and how to fix it. Every failure mode includes detection metrics, mitigations, and a severity rating.

Failure Modes

HIGH:

CDN cache miss storm on new album release

A major artist announces an album drop at midnight. At 12:00:00, 50 million listeners across all POPs press play simultaneously. Every POP has a cold cache for these tracks. Without mitigation, each POP sends origin fetch requests for all tracks in the album, multiplied across 100+ POPs. Origin receives millions of concurrent requests for the same files, exceeding its 78 GB/sec capacity.

CDN cache hit ratio drops below 90% at affected POPs. Origin bandwidth spikes above 200 GB/sec. Alert triggers at 150% of normal origin load.

Mitigation

Request coalescing at each POP: collapse 50,000 concurrent requests for the same track into 1 origin fetch
Pre-warm announced releases: push album to all POPs 1 hour before drop time
Stagger regional releases: midnight release rolls across time zones, spreading load over 24 hours
Overflow to secondary CDN provider if primary POP capacity is exhausted
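
The request coalescing step can be sketched as follows. This is a minimal, single-process illustration rather than how any particular CDN implements it: the class name CoalescingCache and the fetch_from_origin callable are hypothetical, and a real POP would also need eviction, timeouts, and negative caching.

```python
import threading
from concurrent.futures import Future


class CoalescingCache:
    """Collapses concurrent cache misses for one track into one origin fetch."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin  # hypothetical origin client: track_id -> bytes
        self._cache = {}                 # track_id -> cached bytes
        self._inflight = {}              # track_id -> Future shared by all waiters
        self._lock = threading.Lock()

    def get(self, track_id):
        with self._lock:
            if track_id in self._cache:
                return self._cache[track_id]
            future = self._inflight.get(track_id)
            leader = future is None
            if leader:
                # First miss for this track: this caller performs the fetch.
                future = Future()
                self._inflight[track_id] = future

        if not leader:
            # Another caller is already fetching; wait for its result.
            return future.result()

        try:
            data = self._fetch(track_id)
        except Exception as exc:
            with self._lock:
                del self._inflight[track_id]
            future.set_exception(exc)  # waiters see the same failure
            raise
        with self._lock:
            self._cache[track_id] = data
            del self._inflight[track_id]
        future.set_result(data)
        return data
```

With this shape, the number of origin fetches per track per POP is bounded by one per cold-cache event regardless of how many listeners press play at once, which is the property the mitigation relies on.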

HIGH:

Play count pipeline lag causes royalty misattribution

Rights holders audit play counts and will dispute systematic undercounting

Kafka consumer group falls behind due to a slow Cassandra write node. Consumer lag reaches 30 minutes, meaning 250 million events are unprocessed. The daily royalty aggregation job runs at 3 AM and reads from Cassandra, which is now missing 250M events. The royalty report underpays artists whose plays occurred during the lag window.

Kafka consumer lag monitoring: alert at 15 minutes of lag (about 125M events of backlog, at the scenario's rate of 250M events per 30 minutes). Cassandra write latency: alert when p99 exceeds 100 ms. Daily reconciliation job: detects divergence between Kafka committed offsets and Cassandra row counts.

Mitigation

Auto-scale consumer group: add partitions and consumers when lag exceeds 10 minutes
Delay royalty aggregation job until consumer lag is under 5 minutes
Run catch-up reconciliation: replay Kafka from last committed offset to fill gaps
Page the pipeline on-call at 30 minutes lag (potential financial impact)
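
The lag thresholds above can be collected into one decision function. This is a sketch under the assumption that lag is already available as a number of minutes (for example, derived from consumer group offsets); the names LagDecision and decide are illustrative and not part of any existing pipeline.

```python
from dataclasses import dataclass

# Thresholds in minutes, taken from the detection and mitigation text above.
AUTOSCALE_ABOVE_MIN = 10
ALERT_AT_MIN = 15
PAGE_AT_MIN = 30
AGGREGATION_BELOW_MIN = 5


@dataclass(frozen=True)
class LagDecision:
    autoscale: bool                # add partitions and consumers
    alert: bool                    # raise the consumer lag alert
    page_oncall: bool              # page the pipeline on-call
    run_royalty_aggregation: bool  # safe to start the daily royalty job


def decide(lag_minutes: float) -> LagDecision:
    """Map the current consumer lag to the actions described above."""
    return LagDecision(
        autoscale=lag_minutes > AUTOSCALE_ABOVE_MIN,
        alert=lag_minutes >= ALERT_AT_MIN,
        page_oncall=lag_minutes >= PAGE_AT_MIN,
        run_royalty_aggregation=lag_minutes < AGGREGATION_BELOW_MIN,
    )
```

For example, decide(12) requests auto-scaling and holds the royalty job, but does not yet alert or page. A scheduler would call decide before the 3 AM job and retry later while run_royalty_aggregation is false.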

MEDIUM:

Gapless buffer underrun on network degradation

Does not cause data loss but degrades the core listening experience

A listener enters a dead zone (subway tunnel, elevator) 8 seconds before the current track ends. Buffer B has received only 128 KB of the next track (8 seconds of audio at 128 kbps), but needs at least 256 KB for a clean transition. The decoder cannot produce enough PCM samples for a gapless crossfade, resulting in silence at the track boundary.

Client-side telemetry: buffer fill level drops below 2 seconds. Gapless success rate falls below the 99.5% threshold. A per-device-type breakdown identifies affected hardware.

Mitigation

Fade-to-silence fallback: if Buffer B underruns, fade Buffer A over 500ms instead of hard silence
Increase prefetch lead time from 30 to 45 seconds in regions with poor connectivity
Downgrade next track to lowest quality tier (24 kbps, about 0.63 MB for a 3.5-minute track) when bandwidth is constrained
Cache most-played tracks locally on device for offline-like playback quality
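
The buffer arithmetic in this scenario can be checked with a few lines. This sketch assumes constant-bitrate audio, 1 KB = 1000 bytes, and a 210-second track (the length implied by the 0.63 MB figure at 24 kbps); the function names are illustrative.

```python
def bytes_for(seconds: int, bitrate_kbps: int) -> int:
    """Bytes of audio needed for `seconds` of playback at a constant bitrate."""
    return seconds * bitrate_kbps * 1000 // 8


def transition_plan(buffered_bytes: int, required_bytes: int) -> str:
    """Choose the track-boundary behaviour from how much of Buffer B has arrived."""
    if buffered_bytes >= required_bytes:
        return "gapless"
    # Underrun: fade Buffer A out over 500 ms instead of cutting to hard silence.
    return "fade_out_500ms"


buffered = bytes_for(8, 128)      # 128,000 bytes: the 128 KB received before the dead zone
required = 256_000                # the 256 KB needed for a clean transition
low_tier = bytes_for(210, 24)     # 630,000 bytes: a full 3.5-minute track at 24 kbps

print(transition_plan(buffered, required))  # fade_out_500ms
```

The low-tier figure shows why the downgrade helps: a whole track at 24 kbps is only about 2.5 times the 256 KB needed for one clean transition at 128 kbps.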

MEDIUM:

Offline DRM license expiry while traveling

No data loss (tracks are still cached), but requires connectivity to restore access

A listener downloads 500 songs for a 3-week cruise with no internet access. The DRM license expires 30 days after the last license sync, so if that sync happened ten or more days before departure, the license lapses mid-cruise. On day 31, the player checks the license expiration timestamp against the device clock and refuses to decrypt any downloaded tracks. The listener loses access to their entire offline library with no way to re-authenticate until they reach port.

Client-side: license expiry countdown warning at 7 days, 3 days, and 1 day before expiry. Server-side: monitor the distribution of last-sync timestamps across the user base.

Mitigation

Grace period: allow 48 hours of playback after license expiry before full lockout
Pre-departure sync: prompt users to refresh licenses when detecting travel patterns (airplane mode, time zone changes)
Partial refresh: allow license renewal over low-bandwidth connections (satellite, roaming) since the license itself is only ~1 KB
Rights holder negotiation: request 45-day license windows for premium subscribers (trade-off: requires label agreement)
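
The expiry warnings and grace period can be expressed as a small client-side state check. This is a sketch only: it assumes the 30-day window is measured from the last successful license sync, and the state names and the license_state function are hypothetical rather than part of any real DRM SDK.

```python
from datetime import datetime, timedelta, timezone

LICENSE_WINDOW = timedelta(days=30)   # license lifetime from the text above
GRACE_PERIOD = timedelta(hours=48)    # playback still allowed after expiry
WARNING_DAYS = (1, 3, 7)              # countdown warnings, smallest first


def license_state(last_sync: datetime, now: datetime) -> str:
    """Classify an offline license as valid, warning, in grace, or locked."""
    expiry = last_sync + LICENSE_WINDOW
    if now >= expiry + GRACE_PERIOD:
        return "locked"
    if now >= expiry:
        return "grace"
    remaining = expiry - now
    for days in WARNING_DAYS:
        if remaining <= timedelta(days=days):
            return f"warn_{days}d"
    return "valid"


synced = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(license_state(synced, synced + timedelta(days=24)))           # warn_7d
print(license_state(synced, synced + timedelta(days=30, hours=1)))  # grace
```

Because the check compares against the device clock, a production client would likely also need some protection against clock rollback, which this sketch does not attempt.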

MEDIUM:

Recommendation cold-start for 40K daily new tracks

No system outage, but failure to surface new music degrades platform value for both artists and listeners

40,000 new tracks are uploaded daily. Collaborative filtering (ALS) cannot recommend these tracks because they have zero play history. If we wait for play data to accumulate, new tracks receive zero algorithmic exposure for days, creating a bootstrapping problem where tracks cannot get plays without recommendations and cannot get recommendations without plays.

New track exposure rate: monitor the percentage of daily plays attributed to tracks uploaded in the last 7 days. If below 5%, cold-start tracks are not reaching listeners. Audio CNN embedding pipeline latency: alert if embedding generation falls behind the upload rate.

Mitigation

Audio CNN embedding: extract acoustic features (tempo, key, energy, timbre) from the raw waveform in under 2 seconds per track, producing a 128-dimensional embedding
Inject cold-start tracks into recommendation feeds using audio similarity to tracks the user already likes
Editorial playlists: human curators surface standout new releases to seed initial play data
Explore-exploit: allocate 10% of recommendation slots to cold-start tracks for exploration, measure engagement, and feed results back into the collaborative filtering model
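
The audio-similarity injection and the 10% exploration budget can be combined in a short sketch. It assumes a per-user taste vector in the same embedding space as the tracks (for instance, an average of the embeddings of tracks the user likes), which the text does not specify; build_feed and its parameters are illustrative names, and the example uses 2-dimensional vectors in place of the 128-dimensional embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_feed(cf_ranked, cold_tracks, user_vector, slots, explore_fraction=0.10):
    """Fill most slots from collaborative filtering and reserve a share for new tracks.

    cf_ranked:   track ids already ranked by the collaborative filtering model
    cold_tracks: (track_id, embedding) pairs for tracks with no play history
    """
    n_explore = min(round(slots * explore_fraction), len(cold_tracks))
    by_similarity = sorted(
        cold_tracks, key=lambda t: cosine(t[1], user_vector), reverse=True
    )
    explore = [track_id for track_id, _ in by_similarity[:n_explore]]
    exploit = cf_ranked[: slots - len(explore)]
    return exploit + explore


feed = build_feed(
    cf_ranked=[f"cf{i}" for i in range(30)],
    cold_tracks=[("new_a", [1.0, 0.0]), ("new_b", [0.0, 1.0]), ("new_c", [0.9, 0.1])],
    user_vector=[1.0, 0.0],
    slots=20,
)
print(feed[-2:])  # ['new_a', 'new_c']
```

Appending the exploration tracks at the end is a simplification; a real feed would probably interleave them, and the engagement measured on those slots is what feeds back into the collaborative filtering model.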