Music Streaming Cheat Sheet
Key concepts, trade-offs, and quick-reference notes for your interview prep.
Byte-Range Requests, Not HLS Segments
We chose byte-range requests (not HLS/DASH segments) because a 3.5-minute song at 128 kbps is only ~3.5 MB, roughly 1,000 times smaller than a typical video file. Segmenting a 3.5 MB file into 2-second chunks creates 105 tiny 33 KB files where per-request HTTP overhead rivals the content itself. Instead, we serve the single file with an HTTP Range header and a seek table that maps timestamps to byte offsets. The CDN caches one object per track per quality tier (not thousands of segment files). Trade-off: we give up mid-track quality switching but avoid segment management entirely.
💡 Music uses single-file byte-range delivery. Video uses HLS segments. The reason is file size: 3.5 MB vs 6 GB.
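The seek-table idea above can be sketched in a few lines. The table format and per-10-second granularity here are assumptions for illustration, not a real client format:

```python
# Sketch (assumed seek-table layout): map a seek timestamp to the HTTP Range
# header that starts playback there. One entry per 10 s of audio.
import bisect

# Hypothetical table for a 3.5-minute track at 128 kbps (~16 KB of audio/sec):
# entries are (timestamp_sec, byte_offset).
SEEK_TABLE = [(t, t * 16_000) for t in range(0, 211, 10)]

def range_header_for_seek(seek_sec: float) -> str:
    """Return a Range header starting at (or just before) seek_sec."""
    times = [t for t, _ in SEEK_TABLE]
    idx = bisect.bisect_right(times, seek_sec) - 1  # nearest entry at or before
    _, byte_offset = SEEK_TABLE[max(idx, 0)]
    return f"bytes={byte_offset}-"                  # open-ended: stream to EOF

print(range_header_for_seek(65))  # nearest entry is 60 s -> bytes=960000-
```

One key-value table per track per tier is all the index the CDN path needs; there is no playlist manifest to fetch or parse.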
99% CDN Cache Hit Ratio (vs Video's 95%)
Music achieves a 99%+ cache hit ratio (versus video's 95%) because songs are replayed thousands of times while movies are typically watched once. The top 1% of tracks (1M songs) serve 80% of plays. Edge cache sizing: 1M hot tracks at ~3.5 MB each is roughly 3.5 TB per quality tier. At 139K plays/sec, a 99% cache hit means only 1,390 requests/sec reach the origin. A 95% hit rate (video-level) would send 6,950/sec to origin, a 5x increase. Trade-off: we accept that long-tail tracks may have cold-start latency on first play at a POP, but the economics are overwhelmingly in favor of caching the hot catalog.
⚠ Applying video CDN assumptions (95% hit rate) to music. Music's replay frequency makes the cache economics fundamentally different.
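The origin-load arithmetic above is worth having at your fingertips; a one-line function reproduces the section's numbers:

```python
# Sketch: requests/sec that miss the edge cache and reach the origin,
# using the section's 139K plays/sec figure.
def origin_rps(plays_per_sec: int, hit_ratio: float) -> int:
    """Origin requests/sec = total plays/sec x miss ratio."""
    return round(plays_per_sec * (1 - hit_ratio))

print(origin_rps(139_000, 0.99))  # music-level hit rate -> 1390
print(origin_rps(139_000, 0.95))  # video-level hit rate -> 6950
```

Note the nonlinearity: going from 99% to 95% hit rate multiplies origin load by 5x, because what matters is the miss ratio (1% vs 5%), not the hit ratio.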
Dual-Buffer Gapless: Buffer A Plays, Buffer B Decodes
We chose a dual-buffer architecture (not a single buffer) because initializing a new decoder, fetching the next track's header, and beginning decode takes 200 to 500 ms, producing an audible gap. With dual-buffer: Buffer A plays the current track. Ten seconds before track end, the client begins decoding the next track into Buffer B. At the boundary, the client crossfades in under 50 ms. The CDN prefetch started 30 seconds earlier, so Buffer B already has data. Peak memory: ~20 MB (two decoded PCM buffers). Trade-off: double the memory per active stream, but 20 MB is negligible on modern devices. For shuffle mode, we start prefetch the moment the shuffle algorithm selects the next track.
💡 Gapless requires overlapping decode of two tracks. Single-buffer means audible 200-500ms gaps at track boundaries.
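A minimal state machine captures the hand-off described above. Class and buffer names are illustrative, and the decode step is stubbed out:

```python
# Sketch of the dual-buffer hand-off: the active buffer plays while the
# standby buffer is filled with the decoded next track.
DECODE_LEAD_SEC = 10  # start decoding the next track this early

class GaplessPlayer:
    def __init__(self):
        self.active = "A"           # buffer currently feeding the audio device
        self.standby_ready = False  # next track decoded into the standby buffer?

    def tick(self, seconds_remaining: float) -> None:
        # Near track end, begin decoding the next track into standby.
        if seconds_remaining <= DECODE_LEAD_SEC and not self.standby_ready:
            self.standby_ready = True  # stands in for the real decode work

    def on_track_end(self) -> str:
        # Crossfade to the standby buffer and swap roles for the track after.
        assert self.standby_ready, "would produce an audible 200-500 ms gap"
        self.active = "B" if self.active == "A" else "A"
        self.standby_ready = False
        return self.active

p = GaplessPlayer()
p.tick(seconds_remaining=9)   # inside the 10 s decode window
print(p.on_track_end())       # -> B
```

The assert marks the failure mode of a single-buffer design: if the standby decode has not finished by the boundary, the gap is audible.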
5-Tier ABR: 24 / 96 / 128 / 160 / 320 kbps OGG Vorbis
We chose OGG Vorbis (not AAC or MP3) because OGG is royalty-free and delivers perceptual quality equivalent to AAC at the same bitrate. At 100M+ tracks, licensing fees for AAC add up. Five tiers serve different contexts: 24 kbps for extreme low bandwidth (satellite, 2G), 96 kbps for mobile data saving, 128 kbps for the standard free tier, 160 kbps for premium mobile, 320 kbps for premium desktop/Hi-Fi. Storage per track across all tiers: (24 + 96 + 128 + 160 + 320) kbps x 210 s ≈ 19 MB. Total for 100M tracks: ~1.9 PB, reduced to 1.75 PB with practical tier selection. The client switches tiers between tracks based on a 20-second lookahead buffer, not mid-track.
⚠ Switching quality mid-track like video ABR. Audio listeners notice quality transitions far more than video viewers, and mid-track switches cause audible artifacts at the transition point.
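Between-track tier selection can be sketched as picking the highest sustainable bitrate with some headroom. The 1.5x headroom factor is an assumption, not a stated design parameter:

```python
# Sketch: choose the next track's quality tier from estimated throughput
# (derived from the 20-second lookahead buffer). Never switches mid-track.
TIERS_KBPS = [24, 96, 128, 160, 320]

def pick_tier(est_throughput_kbps: float, headroom: float = 1.5) -> int:
    """Highest tier the link sustains with headroom; floor at 24 kbps."""
    sustainable = [t for t in TIERS_KBPS if t * headroom <= est_throughput_kbps]
    return sustainable[-1] if sustainable else TIERS_KBPS[0]

print(pick_tier(2_000))  # fast Wi-Fi  -> 320
print(pick_tier(150))    # weak mobile -> 96
```

Because the decision is made once per track boundary, a wrong guess costs one track at a lower tier, never an audible mid-track artifact.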
30-Second Threshold: Play Count = Royalty Payment
Every 30-second play triggers a royalty payment. Below 30 seconds, the play is not counted. This is the industry standard across Spotify, Apple Music, and YouTube Music. At 139K plays/sec, that is roughly 12B play events per day. We chose Kafka with exactly-once semantics (not at-least-once with dedup) because at-least-once would require a dedup layer handling 139K events/sec. A single double-counted play costs a fraction of a cent, but across 12B daily events, systematic double-counting means millions in incorrect royalty payments. Trade-off: exactly-once adds 10-15% latency overhead per event, but for financial data, accuracy is non-negotiable.
💡 Play counting is a financial transaction, not a vanity metric. Every counted play triggers a real payment to rights holders.
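Two small pieces capture the rule: the 30-second threshold, and an idempotency key per play session. The `session_id` field and key layout are assumptions for illustration:

```python
# Sketch: royalty threshold check plus an idempotency key per play session.
ROYALTY_THRESHOLD_SEC = 30

def billable(play_seconds: float) -> bool:
    """Industry-standard rule: plays of 30 s or more trigger a royalty."""
    return play_seconds >= ROYALTY_THRESHOLD_SEC

def event_key(user_id: str, track_id: str, session_id: str) -> str:
    # Under exactly-once Kafka semantics this key is counted at most once;
    # under at-least-once it would instead feed a 139K events/sec dedup layer.
    return f"{user_id}:{track_id}:{session_id}"

print(billable(29.9), billable(30.0))  # -> False True
```

The boundary matters: 29.9 seconds pays nothing, 30.0 seconds pays, so client-side playback timing feeds directly into a financial event.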
Track Prefetch: Next 2-3 Tracks, 30 Seconds Early
When the current track has 30 seconds remaining, the client requests the first 256 KB of the next 2 to 3 tracks from the CDN. At 99% cache hit, these prefetches are served from edge with sub-20ms latency. This hides all network latency for sequential listening. Why 2-3 tracks and not just 1? Because the user might skip the next track, and having the one after it already partially buffered means even skips feel instant. Why 256 KB? At 128 kbps, 256 KB is 16 seconds of audio, enough to fill Buffer B before the current track ends. Trade-off: we prefetch tracks the user might never listen to, wasting ~512 KB of bandwidth per skip, but bandwidth is cheap and perceived latency is expensive.
⚠ Prefetching only 1 track ahead. Users skip tracks frequently (30%+ skip rate), so prefetching only the next track means the track after a skip has zero head start.
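The "why 256 KB" arithmetic generalizes to any tier with one conversion:

```python
# Sketch: seconds of audio covered by a prefetch of a given size at a given
# bitrate. Reproduces the 256 KB ≈ 16 s at 128 kbps figure above.
def prefetch_seconds(prefetch_bytes: int, bitrate_kbps: int) -> float:
    """bytes -> bits, divided by the stream's bits/sec."""
    return prefetch_bytes * 8 / (bitrate_kbps * 1000)

print(round(prefetch_seconds(256 * 1024, 128), 1))  # -> 16.4
```

Note the same 256 KB buys only ~6.5 seconds at 320 kbps, so a fixed-size prefetch gives premium Hi-Fi streams a smaller head start than free-tier streams.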
Territorial Rights: Denormalized by (track_id, country_code)
We chose to denormalize rights metadata by (track_id, country_code) because the hot-path query is: "Can a user in country X play track Y right now?" A denormalized lookup is a single key-value fetch at 139K/sec. A normalized schema with separate territory, rights-holder, and license-period tables requires 3 joins per play request. Total storage: 100M tracks x ~200 markets ≈ 20B rows, a few TB in PostgreSQL. We cache the top 10M tracks in Redis: with a compact per-track country bitmap, this is well under 1 GB. Rights changes (takedowns, new licenses) propagate via Kafka within 30 seconds. Trade-off: fan-out writes on rights changes (updating all rows for a rights holder), but changes are rare (hundreds/day) versus 139K reads/sec.
💡 Rights checking is on the hot path of every play request. Optimize for reads (denormalized single-key lookup), not writes.
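The hot-path lookup can be sketched with a plain dict standing in for Redis. The key layout and default-deny behavior are assumptions, not the real schema:

```python
# Sketch of the denormalized rights check: one key-value fetch per play.
rights_cache = {
    ("trk_42", "US"): True,   # licensed in this market
    ("trk_42", "DE"): False,  # territorial takedown
}

def can_play(track_id: str, country_code: str) -> bool:
    """Single-key lookup on the play path; a cache miss denies by default
    (the authoritative PostgreSQL row is consulted asynchronously)."""
    return rights_cache.get((track_id, country_code), False)

print(can_play("trk_42", "US"), can_play("trk_42", "DE"))  # -> True False
```

Contrast with the normalized alternative: three joins (territory, rights holder, license period) executed 139K times per second versus one O(1) fetch.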
Offline DRM: AES-128-CTR with 30-Day License Refresh
Downloaded tracks are encrypted with AES-128-CTR (not AES-CBC) because CTR mode supports random seek without decrypting preceding blocks, essential for byte-range seeking within encrypted files. Each download includes a DRM license with a 30-day expiration. The client must connect to the server at least once every 30 days to refresh licenses. Storage per downloaded track at 320 kbps: 320 kbps x 210 s ≈ 8.4 MB. A 500-song library: ~4.2 GB. Spotify allows 10,000 downloads across 5 devices. Trade-off: 30-day windows balance user convenience against rights-holder protection. Longer windows (90 days) would let cancelled subscribers listen for 3 months on cached content.
⚠ Using AES-CBC for encrypted offline playback. CBC requires decrypting all preceding blocks to access a mid-file position, making seek operations impossibly slow for a 10 MB audio file.
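The reason CTR seeks in O(1) is that the keystream block for any byte offset is computable directly from the counter; no ciphertext before that point needs decrypting. The address arithmetic is just:

```python
# Sketch: CTR-mode seek addressing. The keystream for any position is
# E(key, nonce || counter), so a seek only needs the counter value and the
# offset within that 16-byte block.
AES_BLOCK = 16  # bytes

def ctr_position(byte_offset: int) -> tuple[int, int]:
    """(counter increment, intra-block offset) to decrypt at byte_offset."""
    return byte_offset // AES_BLOCK, byte_offset % AES_BLOCK

# Seeking to the 5 MB mark of an encrypted download:
print(ctr_position(5_000_000))  # -> (312500, 0)
```

Under CBC the same seek would require chaining through all 312,500 preceding blocks, which is what makes mid-file seeks in a ~10 MB encrypted download impractical.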
ALS + Audio CNN: Hybrid Recommendations for Cold-Start
We chose a hybrid model (not pure collaborative filtering) because collaborative filtering fails for new users (no history) and new tracks (no play data). ALS (Alternating Least Squares) factorizes the 300M x 100M user-track matrix into 128-dimensional embeddings. For cold-start tracks (40K new/day), a CNN extracts acoustic features (tempo, key, energy, timbre) from the raw waveform, producing a 128-dimensional embedding that can be compared to existing track embeddings without any play data. Cache: at 128 floats x 4 bytes ≈ 512 B per embedding, 300M user plus 100M track embeddings is roughly 200 GB in Redis, refreshed every 24 hours. Spotify reports 30% of all plays come from algorithmic recommendations. Trade-off: the 24-hour refresh cycle means recommendations lag behind listening shifts by up to a day.
💡 Collaborative filtering alone cannot recommend 40K new daily tracks with zero play history. Audio CNN embeddings solve the cold-start problem.
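Cold-start matching reduces to nearest-neighbor search in embedding space. The sketch below uses toy 4-dimensional vectors in place of the 128-dimensional embeddings, and the catalog names are made up:

```python
# Sketch: a new track with zero plays is matched to the catalog by cosine
# similarity between its CNN audio embedding and existing track embeddings.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

catalog = {  # existing ALS track embeddings (toy 4-dim stand-ins)
    "upbeat_pop": [0.9, 0.1, 0.2, 0.0],
    "slow_jazz":  [0.1, 0.8, 0.0, 0.4],
}
new_track_cnn = [0.8, 0.2, 0.1, 0.1]  # audio-derived; no play history yet

best = max(catalog, key=lambda t: cosine(catalog[t], new_track_cnn))
print(best)  # -> upbeat_pop
```

Because both embedding types live in the same 128-dimensional space, the new track can be slotted into radio and discovery playlists on day one, before a single play event exists.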
Request Coalescing: One Origin Fetch for 50K Concurrent Plays
When a new album drops, 50,000 listeners at the same POP request the same track within seconds. Without coalescing, the POP sends 50,000 origin fetches for the same file. With request coalescing, the first request triggers an origin fetch; the remaining 49,999 wait for that single fetch to complete and are served from the now-cached result. This reduces origin load during album launches from 50,000 to 1 request per POP per track. We complement this with pre-warming: for announced releases (album drops with known timestamps), we push tracks to all POPs 1 hour before the drop. Trade-off: pre-warming consumes edge storage proactively, but for a major album launch, the alternative is origin collapse under millions of simultaneous cache misses.
⚠ No request coalescing at the CDN layer. A major album release without coalescing generates millions of simultaneous origin fetches for the same files, overwhelming origin bandwidth.
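A single-threaded sketch shows the core of coalescing (a real POP would use per-key locks or futures so waiters block until the in-flight fetch resolves):

```python
# Sketch: request coalescing at one POP. The first miss records the fetch;
# every later request for the same key reuses its result.
in_flight: dict[str, str] = {}  # key -> result of the single origin fetch
origin_fetches = 0

def fetch_with_coalescing(key: str) -> str:
    global origin_fetches
    if key not in in_flight:               # first request: go to origin
        origin_fetches += 1
        in_flight[key] = f"bytes-of-{key}"  # stands in for the origin response
    return in_flight[key]                   # everyone else gets the cached result

for _ in range(50_000):                     # album-drop stampede at one POP
    fetch_with_coalescing("new_album/track01.ogg")
print(origin_fetches)  # -> 1
```

The same dict doubles as the "now-cached result": coalescing is effectively a cache whose entries are created the instant the first miss starts, not when it finishes.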