Music Streaming Failure Modes
What breaks, how to detect it, and how to fix it. Every failure includes detection metrics, mitigations, and a severity rating.
CDN cache miss storm on new album release
A major artist announces an album drop at midnight. At 12:00:00, 50 million listeners across all POPs press play simultaneously, and every POP has a cold cache for these tracks. Without mitigation, each cache miss becomes an origin fetch, for every track on the album, across 100+ POPs. The origin receives millions of concurrent requests for the same files, exceeding its 78 GB/sec capacity.
- Request coalescing at each POP: collapse 50,000 concurrent requests for the same track into 1 origin fetch
- Pre-warm announced releases: push album to all POPs 1 hour before drop time
- Stagger regional releases: midnight release rolls across time zones, spreading load over 24 hours
- Overflow to secondary CDN provider if primary POP capacity is exhausted
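Request coalescing (sometimes called single-flight) is the mitigation that most directly protects the origin, so a minimal sketch may help. It assumes a thread-per-request POP process; the class name CoalescingCache and the fetch_from_origin callable are hypothetical, and the cache has no eviction or TTL. The first request for a key becomes the leader and fetches from origin, while concurrent requests for the same key wait on the leader's result.

```python
import threading
from concurrent.futures import Future


class CoalescingCache:
    """Per-POP cache that collapses concurrent misses for a key into one origin fetch."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin   # hypothetical origin client: key -> bytes
        self._lock = threading.Lock()
        self._cache = {}                  # key -> bytes (no eviction in this sketch)
        self._in_flight = {}              # key -> Future shared by all waiters

    def get(self, key):
        with self._lock:
            if key in self._cache:
                return self._cache[key]   # warm hit, no origin traffic
            future = self._in_flight.get(key)
            is_leader = future is None
            if is_leader:
                future = Future()
                self._in_flight[key] = future

        if not is_leader:
            # Followers block here until the leader publishes a result or error.
            return future.result()

        try:
            data = self._fetch(key)       # the only origin fetch for this key
        except Exception as exc:
            with self._lock:
                del self._in_flight[key]  # let a later request retry
            future.set_exception(exc)     # waiters see the same failure
            raise
        with self._lock:
            self._cache[key] = data
            del self._in_flight[key]
        future.set_result(data)
        return data
```

With this shape, the number of origin fetches per track is bounded by the number of POPs rather than the number of listeners, which is the property the first mitigation relies on.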
Play count pipeline lag causes royalty misattribution
Rights holders audit play counts and will dispute systematic undercounting
A Kafka consumer group falls behind because of a slow Cassandra write node. Consumer lag reaches 30 minutes, leaving 250 million events unprocessed. The daily royalty aggregation job runs at 3 AM and reads from Cassandra, which is still missing those 250M events, so the royalty report underpays artists whose plays occurred during the lag window.
- Auto-scale consumer group: add partitions and consumers when lag exceeds 10 minutes
- Delay royalty aggregation job until consumer lag is under 5 minutes
- Run catch-up reconciliation: replay Kafka from last committed offset to fill gaps
- Page the pipeline on-call at 30 minutes lag (potential financial impact)
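The three lag thresholds above (10, 5, and 30 minutes) can be read as one small policy. The sketch below encodes them as a pure function and also works out the ingest rate implied by the scenario. The names LagDecision, decide, and implied_event_rate are illustrative. How lag in minutes is measured (for example, from the timestamp of the oldest unprocessed event) is an assumption left outside the sketch.

```python
from dataclasses import dataclass

SCALE_OUT_LAG_MIN = 10        # add partitions/consumers above this lag
AGGREGATION_MAX_LAG_MIN = 5   # royalty job runs only below this lag
PAGE_LAG_MIN = 30             # page the pipeline on-call at this lag


@dataclass(frozen=True)
class LagDecision:
    scale_out: bool
    run_royalty_job: bool
    page_on_call: bool


def decide(lag_minutes: float) -> LagDecision:
    """Map the current consumer lag to the actions listed above."""
    return LagDecision(
        scale_out=lag_minutes > SCALE_OUT_LAG_MIN,
        run_royalty_job=lag_minutes < AGGREGATION_MAX_LAG_MIN,
        page_on_call=lag_minutes >= PAGE_LAG_MIN,
    )


def implied_event_rate(backlog_events: int, lag_minutes: float) -> float:
    """Average ingest rate (events/sec) implied by a backlog and its lag."""
    return backlog_events / (lag_minutes * 60)


# 250M events in a 30-minute window is roughly 139,000 events/sec.
rate = implied_event_rate(250_000_000, 30)
```

Between 5 and 10 minutes of lag the policy neither scales out nor runs the royalty job, so the job simply waits. If catch-up is done by replaying from the last committed offset, the writes need to be idempotent or deduplicated so that replayed plays are not counted twice.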
Gapless buffer underrun on network degradation
Does not cause data loss but degrades the core listening experience
A listener enters a dead zone (subway tunnel, elevator) 8 seconds before the current track ends. Buffer B has received only 128 KB of the next track (8 seconds of audio at 128 kbps), but needs at least 256 KB (16 seconds at the same bitrate) for a clean transition. The decoder cannot produce enough PCM samples for a gapless crossfade, resulting in silence at the track boundary.
- Fade-to-silence fallback: if Buffer B underruns, fade Buffer A over 500ms instead of hard silence
- Increase prefetch lead time from 30 to 45 seconds in regions with poor connectivity
- Downgrade next track to lowest quality tier (24 kbps, about 0.63 MB for a 3.5-minute track) when bandwidth is constrained
- Cache most-played tracks locally on device for offline-like playback quality
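The buffer figures in this scenario follow from simple bitrate arithmetic, sketched below with decimal units (1 KB = 1000 bytes, 1 kbps = 1000 bit/s), which is what the numbers above imply. The function names are illustrative, and the 210-second track length is an assumption chosen because it reproduces the 0.63 MB figure. The choose_transition helper only shows the fade fallback decision. It does not model decoding or crossfading.

```python
MIN_GAPLESS_BYTES = 256_000   # 256 KB needed in Buffer B for a clean transition
FADE_OUT_MS = 500             # fallback fade length for Buffer A


def audio_bytes(bitrate_kbps: float, seconds: float) -> float:
    """Bytes of encoded audio for a duration at a constant bitrate."""
    return bitrate_kbps * 1000 / 8 * seconds


def buffered_seconds(buffered_bytes: float, bitrate_kbps: float) -> float:
    """Playback time covered by a number of buffered bytes."""
    return buffered_bytes * 8 / (bitrate_kbps * 1000)


def choose_transition(buffer_b_bytes: int) -> str:
    """Pick 'gapless' when Buffer B is full enough, otherwise the fade fallback."""
    return "gapless" if buffer_b_bytes >= MIN_GAPLESS_BYTES else "fade"


received = audio_bytes(128, 8)          # 128,000 bytes: the dead-zone case
needed_s = buffered_seconds(MIN_GAPLESS_BYTES, 128)   # 16 seconds
low_tier = audio_bytes(24, 210)         # 630,000 bytes for an assumed 3.5-minute track
```

At 24 kbps the same 256 KB would cover far more playback time, which is why dropping to the lowest tier helps when bandwidth is constrained.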
Offline DRM license expiry while traveling
No data loss (tracks are still cached), but requires connectivity to restore access
A listener downloads 500 songs for a 3-week cruise with no internet access. The DRM license expires 30 days after it was last refreshed, which may have been well before departure, so it can lapse mid-voyage. On day 31, the player checks the license expiration timestamp against the device clock and refuses to decrypt any downloaded tracks. The listener loses access to their entire offline library with no way to re-authenticate until they reach port.
- Grace period: allow 48 hours of playback after license expiry before full lockout
- Pre-departure sync: prompt users to refresh licenses when detecting travel patterns (airplane mode, time zone changes)
- Partial refresh: allow license renewal over low-bandwidth connections (satellite, roaming) since the license itself is only ~1 KB
- Rights holder negotiation: request 45-day license windows for premium subscribers (trade-off: requires label agreement)
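The grace-period mitigation amounts to a three-state check instead of a two-state one. The sketch below assumes the license records when it was issued or last refreshed. LicenseState and license_state are hypothetical names, and the check trusts the timestamp it is given, whereas a real player would need a tamper-resistant time source rather than the raw device clock.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum

LICENSE_WINDOW = timedelta(days=30)   # license lifetime from the scenario
GRACE_PERIOD = timedelta(hours=48)    # extra playback allowed after expiry


class LicenseState(Enum):
    VALID = "valid"       # play normally
    GRACE = "grace"       # play, but prompt the user to reconnect
    EXPIRED = "expired"   # refuse to decrypt until the license is renewed


def license_state(issued_at: datetime, now: datetime) -> LicenseState:
    """Classify a license by how far 'now' is past its expiry."""
    expires_at = issued_at + LICENSE_WINDOW
    if now < expires_at:
        return LicenseState.VALID
    if now < expires_at + GRACE_PERIOD:
        return LicenseState.GRACE
    return LicenseState.EXPIRED


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
state = license_state(issued, issued + timedelta(days=30, hours=12))  # GRACE
```

Moving to a 45-day window would only change LICENSE_WINDOW. The grace logic stays the same.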
Recommendation cold-start for 40K daily new tracks
No system outage, but failure to surface new music degrades platform value for both artists and listeners
40,000 new tracks are uploaded daily. Collaborative filtering (ALS) cannot recommend these tracks because they have zero play history. If we wait for play data to accumulate, new tracks receive zero algorithmic exposure for days, creating a bootstrapping problem where tracks cannot get plays without recommendations and cannot get recommendations without plays.
- Audio CNN embedding: extract acoustic features (tempo, key, energy, timbre) from the raw waveform in under 2 seconds per track, producing a 128-dimensional embedding
- Inject cold-start tracks into recommendation feeds using audio similarity to tracks the user already likes
- Editorial playlists: human curators surface standout new releases to seed initial play data
- Explore-exploit: allocate 10% of recommendation slots to cold-start tracks for exploration, measure engagement, and feed results back into the collaborative filtering model
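The audio-similarity and explore-exploit mitigations can be combined in a few lines. The sketch below is illustrative only: it uses cosine similarity as the similarity measure, assumes the user is represented by a single taste vector (for example, the mean embedding of tracks they like), and uses 2-dimensional toy vectors in place of the 128-dimensional embeddings. The function names and the choice to append cold-start tracks at the end of the feed are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_cold_start(user_vector, cold_tracks):
    """Order cold-start track ids by audio similarity to the user's taste vector.

    cold_tracks maps track id -> audio embedding.
    """
    return sorted(
        cold_tracks,
        key=lambda track_id: cosine(user_vector, cold_tracks[track_id]),
        reverse=True,
    )


def build_feed(warm_ranked, cold_ranked, slots, explore_fraction=0.10):
    """Fill a feed, reserving a fraction of slots for cold-start exploration."""
    n_cold = min(len(cold_ranked), round(slots * explore_fraction))
    n_warm = slots - n_cold
    return warm_ranked[:n_warm] + cold_ranked[:n_cold]


cold = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
cold_ranked = rank_cold_start([1.0, 0.0], cold)
feed = build_feed([f"w{i}" for i in range(30)], cold_ranked, slots=20)
```

With 20 slots and a 10% exploration budget, 18 slots come from collaborative filtering and 2 from cold-start tracks. Engagement on those 2 slots is what would be fed back into the collaborative filtering model.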