Audio Codec Selection and Adaptive Bitrate

A listener on a morning commute enters a subway tunnel, and bandwidth drops from 20 Mbps to 200 kbps. A fixed 320 kbps stream cannot be sustained at that rate, so the player stalls for 10 to 15 seconds until the signal recovers.
The constraint: we cannot predict network conditions, so we must adapt in real time.
We chose Ogg Vorbis at 5 quality tiers (24, 96, 128, 160, 320 kbps) over AAC or MP3 because Ogg Vorbis is royalty-free and delivers perceptual quality comparable to AAC at the same bitrate, which avoids codec licensing costs at a scale of 100M+ tracks.
Why not HLS segments like video streaming? Because a 3.5-minute song at 128 kbps is only about 3.4 MB, roughly 1,000 times smaller than a typical video file. Segmenting a file that small into 2-second HLS chunks means about 105 separate segment requests per track, which adds HTTP overhead without meaningful benefit.
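To make the overhead argument concrete, here is a rough sketch that counts requests for one track under each delivery style. The 512 KiB byte-range chunk size is a hypothetical value chosen for illustration, not something the design above specifies, and the HLS count ignores the playlist fetch.

```python
import math

def hls_segment_count(duration_s: float, segment_s: float = 2.0) -> int:
    """Media segments needed to cover a track (playlist fetch not counted)."""
    return math.ceil(duration_s / segment_s)

def range_request_count(file_bytes: int, chunk_bytes: int) -> int:
    """Byte-range requests needed to fetch a file in fixed-size chunks."""
    return math.ceil(file_bytes / chunk_bytes)

duration_s = 3.5 * 60                        # 3.5-minute track
file_bytes = int(duration_s * 128_000 / 8)   # 128 kbps, bits -> bytes

segments = hls_segment_count(duration_s)     # 2-second HLS chunks
# Assumption: 512 KiB per range request (hypothetical chunk size).
ranges = range_request_count(file_bytes, 512 * 1024)

print(f"HLS segments: {segments}, byte-range requests: {ranges}")
```

Under these assumptions the segmented approach issues over ten times as many requests for the same audio.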
Instead, we stream the entire file via byte-range requests and let the client switch quality tiers between tracks, not mid-track. The client maintains a 20-second lookahead buffer and uses it to pick the tier for the next track: if the buffer has dropped below 5 seconds, we request the next lower tier; if it exceeds 15 seconds, we request the next higher tier; otherwise we stay put.
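The threshold rule above can be sketched as a small pure function. This is a minimal illustration, assuming the decision is evaluated once when the next track is requested; the names `TIERS_KBPS`, `LOW_WATER_S`, `HIGH_WATER_S`, and `next_track_tier` are hypothetical.

```python
# Quality tiers and buffer thresholds taken from the text above.
TIERS_KBPS = (24, 96, 128, 160, 320)
LOW_WATER_S = 5.0    # below this, step down one tier
HIGH_WATER_S = 15.0  # above this, step up one tier

def next_track_tier(current_kbps: int, buffer_s: float) -> int:
    """Pick the bitrate for the next track from the current buffer level."""
    i = TIERS_KBPS.index(current_kbps)
    if buffer_s < LOW_WATER_S:
        i = max(i - 1, 0)                    # clamp at the lowest tier
    elif buffer_s > HIGH_WATER_S:
        i = min(i + 1, len(TIERS_KBPS) - 1)  # clamp at the highest tier
    return TIERS_KBPS[i]

print(next_track_tier(320, 3.0))   # buffer nearly drained: step down
print(next_track_tier(128, 18.0))  # buffer healthy: step up
print(next_track_tier(128, 10.0))  # between thresholds: hold
```

The gap between the 5-second and 15-second thresholds acts as hysteresis, so a buffer hovering around one value does not cause the tier to flip on every track.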
Spotify has long used Ogg Vorbis in its desktop and mobile clients, with the 320 kbps tier reserved for premium subscribers. Average song at 128 kbps: $3.5\text{ min} \times 60\text{ s/min} \times 128\text{ kbps} / 8 = 3{,}360\text{ kB} = 3.36\text{ MB}$.
At 320 kbps: $3.5 \times 60 \times 320 / 8 = 8{,}400\text{ kB} = 8.4\text{ MB}$. Trade-off: we accept per-track switching granularity (not mid-track like video ABR) in exchange for simpler delivery with fewer HTTP requests.
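The per-track size arithmetic can be checked for every tier with a few lines. This sketch assumes 60 seconds per minute, 8 bits per byte, and decimal megabytes (1 MB = 1,000 kB); `track_size_mb` is a hypothetical helper name.

```python
def track_size_mb(duration_min: float, bitrate_kbps: int) -> float:
    """Approximate file size in decimal MB for a constant-bitrate track."""
    seconds = duration_min * 60
    kilobits = seconds * bitrate_kbps
    kilobytes = kilobits / 8
    return kilobytes / 1000

for kbps in (24, 96, 128, 160, 320):
    print(f"{kbps:>3} kbps: {track_size_mb(3.5, kbps):.2f} MB")
```

Real Vorbis files are variable bitrate, so actual sizes vary around these constant-bitrate estimates.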
What if the interviewer asks: why not switch quality mid-track like video? Because audio files are small enough to buffer entirely, so there is rarely a need to.
Mid-track switching can also cause audible artifacts at the transition point, which is unacceptable for music listeners, who tend to notice quality changes far more than video viewers do.
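The "small enough to buffer entirely" claim can be sanity-checked against the two bandwidth figures from the commute scenario. This is a back-of-the-envelope sketch that ignores latency, protocol overhead, and throughput variation; `download_time_s` is a hypothetical helper.

```python
def download_time_s(file_bytes: int, bandwidth_bps: float) -> float:
    """Idealized time to fetch a whole file at a steady bandwidth."""
    return file_bytes * 8 / bandwidth_bps

track_bytes = 3_360_000  # 3.5-minute track at 128 kbps

fast = download_time_s(track_bytes, 20_000_000)  # 20 Mbps before the tunnel
slow = download_time_s(track_bytes, 200_000)     # 200 kbps inside the tunnel

print(f"At 20 Mbps: {fast:.2f} s; at 200 kbps: {slow:.1f} s for 210 s of audio")
```

On a healthy connection the whole track arrives in about a second, and even at 200 kbps the 128 kbps tier downloads faster than it plays back.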
Why it matters in interviews
Interviewers expect us to explain why music streaming uses byte-range requests instead of HLS segments and why Ogg Vorbis wins on cost at scale. Describing the five-tier, buffer-threshold logic shows we understand adaptive delivery for audio, not just recycled video ABR knowledge.