Using HLS/DASH Segments for Audio Delivery
Very Common
Candidates apply video streaming patterns to music without considering that audio files are 1,000 times smaller than video files.
Why: Video streaming is taught first in most system design courses, so candidates default to HLS segment delivery for all media types without analyzing whether segmentation benefits small files.
WRONG: Splitting a 3.5-minute, 3.5 MB audio file into 2-second HLS segments creates 105 tiny 33 KB files. Each segment needs its own CDN cache entry, its own HTTP request, and a manifest file listing all 105 URLs. The overhead of 105 HTTP requests exceeds the content payload.
RIGHT: Serve the single audio file via byte-range requests with a seek table mapping timestamps to byte offsets. One CDN cache entry per track per quality tier. The client uses HTTP Range headers for seeking. Trade-off: we lose mid-track quality switching, but audio listeners prefer consistent quality over adaptive switching that causes audible artifacts.
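The byte-range approach reduces seeking to a table lookup plus a Range header. A minimal sketch, assuming a hypothetical seek-table format with one (timestamp, byte offset) entry every 10 seconds shipped in the track metadata:

```python
import bisect

def range_header_for_seek(seek_table, seek_seconds):
    """Map a seek timestamp to an HTTP Range request.

    seek_table: sorted (timestamp_sec, byte_offset) pairs shipped with
    the track metadata (hypothetical schema). Returns a header that
    streams from the mapped offset to the end of the file.
    """
    times = [t for t, _ in seek_table]
    # Find the last seek-table entry at or before the requested timestamp.
    i = max(bisect.bisect_right(times, seek_seconds) - 1, 0)
    _, offset = seek_table[i]
    return {"Range": f"bytes={offset}-"}
```

Seeking to 0:15 with 10-second granularity resolves to the offset recorded at 0:10; the client decodes and discards the extra five seconds, which is far cheaper than 105 segment requests.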
Single-Buffer Playback Without Gapless
Very Common
Candidates design a single decode buffer and accept 200-500ms gaps between tracks, not realizing that gapless playback is a core product requirement.
Why: Video players have natural breaks between episodes, so gaps are acceptable. Candidates apply video thinking to music, where live albums, concept albums, and DJ mixes require zero-gap transitions.
WRONG: A single playback buffer finishes the current track, flushes the decoder, initializes the new codec, fetches the next track's header, and begins decoding. This sequence takes 200 to 500ms, producing an audible gap that destroys the listening experience for albums designed with seamless transitions.
RIGHT: Use a dual-buffer architecture: Buffer A plays the current track while Buffer B begins decoding the next track 10 seconds early. Crossfade at the boundary in under 50ms. Prefetch the next track via CDN 30 seconds before the current track ends. Trade-off: 20 MB peak memory (two decoded buffers), negligible on modern devices.
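The timing half of the dual-buffer design can be sketched as pure scheduling logic, using the 30-second prefetch and 10-second decode-ahead thresholds above (function and key names are illustrative):

```python
PREFETCH_AHEAD_S = 30  # start the CDN fetch of the next track
DECODE_AHEAD_S = 10    # start decoding the next track into Buffer B

def background_actions(position_s, duration_s):
    """Given the current playback position, report which gapless-playback
    background steps should be running (thresholds from the design above)."""
    remaining = duration_s - position_s
    return {
        "prefetch_next_track": remaining <= PREFETCH_AHEAD_S,
        "decode_into_buffer_b": remaining <= DECODE_AHEAD_S,
    }
```

By the time the crossfade fires, Buffer B already holds decoded audio, so the handoff is a mixer operation rather than a fetch-and-decode sequence.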
At-Least-Once Play Counting Without Dedup
Common
Candidates use at-least-once message delivery for play events without realizing that each duplicate count triggers an incorrect royalty payment.
Why: Most system design topics treat duplicate messages as a minor inconvenience (show a notification twice). In music streaming, each duplicate play event means paying a rights holder for a play that did not happen.
WRONG: Using at-least-once Kafka delivery and hoping duplicates are rare. At 139K events/sec, even a 0.1% duplicate rate means 139 phantom plays per second, or 12 million incorrect royalty payments per day. Rights holders audit these numbers and will dispute systematic overcounting.
RIGHT: Use Kafka with exactly-once semantics (idempotent producer + transactional consumer). Each event carries a client-generated idempotency key. The pipeline deduplicates at ingestion. Trade-off: exactly-once adds 10-15% latency overhead, but for financial data, accuracy is non-negotiable. Back-pressure is preferable to incorrect payments.
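The ingestion-side dedup step can be sketched in a few lines. This is a minimal in-memory illustration: a production pipeline would pair Kafka's idempotent producer and transactional consumer with a TTL'd key store (e.g. Redis SETNX) rather than an unbounded Python set:

```python
class PlayEventDeduper:
    """Drop redelivered play events by client-generated idempotency key.

    In-memory sketch of the dedup-at-ingestion step; the key format and
    class name are illustrative, not from the original design.
    """

    def __init__(self):
        self._seen = set()

    def accept(self, event):
        key = event["idempotency_key"]
        if key in self._seen:
            return False  # duplicate delivery: drop, never bill twice
        self._seen.add(key)
        return True
```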
Assuming Video-Level CDN Cache Hit Ratio (95%)
Common
Candidates design origin infrastructure for 5% cache miss rates when music actually achieves 99%+ cache hits due to replay frequency.
Why: Video streaming designs commonly cite 95% cache hit ratios, and candidates apply the same number to music without considering that songs are replayed thousands of times while movies are watched once.
WRONG: Sizing origin for a 5% miss rate: 139K plays/sec × 5% = 6,950 origin requests/sec. This over-provisions origin by 5x, wasting infrastructure budget on servers that will sit idle because the actual miss rate is under 1%.
RIGHT: Size origin for a 1% miss rate: 139K × 1% = 1,390 origin requests/sec, a much smaller fleet. The 99%+ hit ratio comes from music's replay frequency: a Billboard Hot 100 track gets millions of plays. Use the savings to invest in edge cache capacity. Trade-off: origin must handle burst misses during album drops (use request coalescing).
Skipping Territorial Rights Checks on Play Request
Common
Candidates design the streaming path without a rights authorization check, forgetting that licensing varies by country and tracks can be region-locked.
Why: The technical streaming architecture (CDN, codec, buffering) is more interesting to discuss than legal constraints. Candidates skip rights because it feels like a business concern, not a technical one.
WRONG: Streaming a track directly from CDN without checking territorial rights. A track licensed only in the US gets played by a user in Germany. The platform is now liable for unauthorized distribution, risking lawsuits and catalog takedowns from rights holders.
RIGHT: Every play request passes through a rights check: look up (track_id, country_code) in the denormalized rights table (Redis cache, 20 GB). If no valid license exists for that territory, return a 403. Trade-off: adds ~1ms per play request (Redis lookup), but prevents legal exposure. Rights changes propagate via Kafka within 30 seconds.
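The rights gate itself is a single lookup. A minimal sketch, where `rights_cache` stands in for the denormalized Redis table and the (track_id, country_code) → license-expiry schema is a hypothetical illustration:

```python
import time

def authorize_play(rights_cache, track_id, country_code, now=None):
    """Territorial rights check on the play path.

    rights_cache maps (track_id, country_code) to a license expiry unix
    timestamp (hypothetical schema). Returns an HTTP-style status code.
    """
    now = time.time() if now is None else now
    expiry = rights_cache.get((track_id, country_code))
    if expiry is None or expiry <= now:
        return 403  # no valid license in this territory
    return 200
```

Because the table is denormalized per territory, the hot path never joins against licensing data; rights changes arriving via Kafka simply overwrite cache entries.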
No Track Prefetch for Sequential Playback
Common
Candidates wait until the current track finishes before fetching the next one, adding 200ms+ of perceived latency between tracks.
Why: In video streaming, the next episode requires a conscious user action (click next). In music, tracks auto-advance, so sequential listening is the default mode, and any delay is immediately audible.
WRONG: Fetching the next track only after the current track ends. Even with a CDN (20ms edge latency), the decoder initialization, header parsing, and initial buffer fill add 200-500ms of silence. For playlists and albums, this gap occurs every 3-4 minutes.
RIGHT: Prefetch the first 256 KB of the next 2-3 tracks when the current track has 30 seconds remaining. At 128 kbps, 256 KB is 16 seconds of audio, enough to fill Buffer B before the crossfade. Also prefetch 2-3 tracks ahead to handle skips. Trade-off: wastes ~512 KB per skipped track, but bandwidth is cheap and latency is expensive.
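The prefetch decision is pure planning logic: when the 30-second threshold is crossed, request the first 256 KB of the next few queued tracks. A sketch with illustrative names (issuing the HTTP requests is left to the player's network layer):

```python
PREFETCH_TRIGGER_S = 30      # start prefetching with 30 s left
PREFETCH_BYTES = 256 * 1024  # first 256 KB of each upcoming track
PREFETCH_DEPTH = 3           # look a few tracks ahead to absorb skips

def prefetch_plan(queue, current_index, remaining_s):
    """Return (track_id, Range header value) pairs to request now."""
    if remaining_s > PREFETCH_TRIGGER_S:
        return []
    upcoming = queue[current_index + 1 : current_index + 1 + PREFETCH_DEPTH]
    return [(tid, f"bytes=0-{PREFETCH_BYTES - 1}") for tid in upcoming]
```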
Using AES-CBC for Offline Encrypted Playback
Common
Candidates choose AES-CBC for offline DRM without realizing it breaks random seek within encrypted files.
Why: AES-CBC is the most commonly taught block cipher mode. Candidates default to it without considering that CBC requires decrypting all preceding blocks to access any position in the file.
WRONG: Encrypting offline tracks with AES-CBC. When the user seeks to 1:45, the player must decrypt all bytes from 0 to 1:45 (roughly 4 MB at 320 kbps) before it can play from that position. Seek operations take hundreds of milliseconds instead of being instant.
RIGHT: Use AES-128-CTR mode because CTR allows random access decryption: each block is independently decryptable using the counter value as the IV. Seeking to any byte offset requires decrypting only that block. Trade-off: CTR mode is slightly less resistant to bit-flipping attacks than CBC, but for DRM (where the threat is key extraction, not ciphertext manipulation), this is acceptable.
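The counter arithmetic is what makes CTR seekable: block n is keyed by base_counter + n, so the player can jump straight to the block containing the seek target. A sketch of that derivation (actual AES decryption, e.g. via the `cryptography` package, would start CTR from the returned block):

```python
AES_BLOCK = 16  # AES block size in bytes

def ctr_block_for_offset(base_counter: bytes, byte_offset: int):
    """Derive the CTR counter block covering an arbitrary byte offset,
    plus the number of bytes to skip inside that block."""
    block_index, intra_block_skip = divmod(byte_offset, AES_BLOCK)
    base = int.from_bytes(base_counter, "big")
    counter = (base + block_index) % (1 << 128)  # 128-bit wraparound
    return counter.to_bytes(AES_BLOCK, "big"), intra_block_skip
```

With CBC the equivalent operation requires the ciphertext of every preceding block, which is exactly why seeks degrade to sequential decryption.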
Global Play Counter Instead of Partitioned Events
Common
Candidates use a single global counter for play counts, creating a write hotspot at 139K increments per second.
Why: A single atomic counter seems like the simplest correct approach. Candidates do not consider that 139K increments per second on a single key exceeds any single node's write capacity.
WRONG: A single Redis key tracking total platform plays, incremented 139K times per second. Even Redis's single-threaded model caps at ~100K ops/sec per key. The counter becomes a write bottleneck and a single point of failure for the entire royalty pipeline.
RIGHT: Partition play events by track_id across Kafka topics. Each partition handles a subset of tracks, distributing the 139K events/sec across hundreds of partitions. Aggregate per-track counts in Cassandra. Trade-off: per-track aggregation requires a map-reduce step for global totals, but the hot path (per-track royalty counting) runs in parallel. Global totals are computed offline.
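The routing step can be sketched as a stable hash of track_id. The 512-partition count is illustrative, not from the design above; the point is that a deterministic hash keeps each track's events on one partition so per-track counts aggregate locally:

```python
import hashlib

NUM_PARTITIONS = 512  # illustrative partition count

def partition_for(track_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Route a play event to a Kafka partition by hashing track_id.

    Uses a stable hash (not Python's per-process salted hash) so the
    same track always lands on the same partition across producers.
    """
    digest = hashlib.sha1(track_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```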