Audio File Chunking and Byte-Range Seeking

A listener taps the progress bar to jump to the chorus at 1:45 in a 3.5-minute song. Without seek support, the player must download everything from byte 0 up to the desired position before it can start playing there.
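We solve this with byte-range requests using the HTTP Range header. The player uses a seek table, an index that maps timestamps to byte offsets, which is either embedded in the audio file or supplied alongside it. Formats differ here: FLAC has a SEEKTABLE metadata block, while plain Ogg Vorbis has no built-in index, so players locate a timestamp by bisecting on page granule positions. (The Vorbis comment header holds tags such as artist and title, not seek points.)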
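The timestamp-to-offset lookup can be sketched as a binary search over a sorted table. This is a minimal illustration, not any particular player's implementation: `SEEK_TABLE`, `offset_for`, and `range_header` are hypothetical names, and the table values are made up.

```python
from bisect import bisect_right

# Hypothetical seek table: (timestamp in seconds, byte offset), sorted by time.
# The numbers are invented for illustration; a real table would come from the
# file's own index or from a sidecar index built at ingest time.
SEEK_TABLE = [
    (0.0, 0),
    (30.0, 352_000),
    (60.0, 706_500),
    (90.0, 1_058_200),
    (105.0, 1_234_567),
    (120.0, 1_411_900),
]


def offset_for(seconds, table=SEEK_TABLE):
    """Return the byte offset of the last seek point at or before `seconds`."""
    times = [t for t, _ in table]
    i = bisect_right(times, seconds) - 1
    # Clamp so that negative timestamps fall back to the start of the file.
    return table[max(i, 0)][1]


def range_header(seconds):
    """Build the open-ended Range header for a seek to `seconds`."""
    return {"Range": f"bytes={offset_for(seconds)}-"}
```

A seek to 1:45 is `range_header(105.0)`. A seek that falls between two entries starts from the earlier one, so the player decodes and discards a little audio before reaching the exact target.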
The constraint: we need random access to any point in an audio file without downloading the preceding content, and HLS/DASH segmentation is not worth its overhead because audio files are too small to benefit from it.
When the user seeks to 1:45, the player looks up the byte offset in the seek table (say, byte 1,234,567) and sends: `Range: bytes=1234567-`. The server (or CDN) responds with `206 Partial Content` and returns only the requested range.
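On the serving side, handling the request amounts to parsing the header and slicing the file. The sketch below is an assumption-laden illustration rather than a real server: `serve_range` is a hypothetical function, the file is held in memory as `bytes`, and only the single-range forms `bytes=START-` and `bytes=START-END` are supported.

```python
import re
from typing import Dict, Optional, Tuple

# Only "bytes=START-" and "bytes=START-END" are recognized in this sketch.
# Suffix ranges ("bytes=-500") and multi-range requests are not handled.
_RANGE_RE = re.compile(r"^bytes=(\d+)-(\d*)$")

Response = Tuple[int, Dict[str, str], bytes]


def serve_range(body: bytes, range_value: Optional[str]) -> Response:
    """Return (status, headers, payload) for an optional Range header value."""
    total = len(body)
    full = (200, {"Accept-Ranges": "bytes", "Content-Length": str(total)}, body)

    if range_value is None:
        return full
    match = _RANGE_RE.match(range_value.strip())
    if match is None:
        # Unrecognized form: this sketch ignores the header and sends the file.
        return full

    start = int(match.group(1))
    if start >= total:
        # The requested start lies past the end of the file.
        return 416, {"Content-Range": f"bytes */{total}"}, b""

    end = int(match.group(2)) if match.group(2) else total - 1
    if end < start:
        # Inverted range: treated as invalid and ignored here.
        return full
    end = min(end, total - 1)

    chunk = body[start:end + 1]
    headers = {
        "Accept-Ranges": "bytes",
        "Content-Range": f"bytes {start}-{end}/{total}",
        "Content-Length": str(len(chunk)),
    }
    return 206, headers, chunk
```

For a 10-byte body, `serve_range(body, "bytes=4-")` yields status 206 with `Content-Range: bytes 4-9/10` and the last six bytes. In practice a CDN or web server does this work, and the origin only needs to advertise `Accept-Ranges: bytes`.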
This is fundamentally different from video streaming's approach: video uses HLS segments (separate 2-10 second files) because a 2-hour movie at 1080p is about 6 GB and needs chunked delivery. A 3.5-minute song at 320 kbps is only about 8.4 MB, small enough to serve as a single file with byte-range access.
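As a sanity check on these figures, file size is bitrate times duration. The sketch below assumes a constant bitrate and decimal megabytes (10^6 bytes); `stream_size_mb` is a hypothetical helper, and the 6 GB movie size is taken from the text rather than derived.

```python
def stream_size_mb(bitrate_kbps, duration_s):
    """Size in MB (10^6 bytes) of a constant-bitrate stream."""
    bits = bitrate_kbps * 1000 * duration_s
    return bits / 8 / 1_000_000


# 3.5-minute song at 320 kbps.
song_mb = stream_size_mb(320, 3.5 * 60)

# 2-hour 1080p movie, using the 6 GB figure quoted in the text.
movie_mb = 6_000

# How many times larger the movie is than the song.
size_ratio = movie_mb / song_mb

# Average bitrate implied by 6 GB over 2 hours, in kbps.
movie_kbps = movie_mb * 1_000_000 * 8 / (2 * 3600) / 1000

print(f"song:  {song_mb:.1f} MB")
print(f"movie: {movie_mb} MB ({movie_kbps:.0f} kbps average)")
print(f"ratio: about {size_ratio:.0f}x")
```

Under these assumptions the song comes to about 8.4 MB and the movie is several hundred times larger, which is the gap that makes segmentation worthwhile for video and unnecessary for audio.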
Spotify is commonly cited as using this approach: Ogg Vorbis files served via byte-range requests from its CDN. We chose to store tracks as single files (not pre-split segments) because a single file reduces storage metadata overhead, simplifies CDN caching (one cache key per track per quality tier), and maps naturally to the music catalog model.
Each track is stored at 2 to 5 quality tiers. At 2 tiers: $100\text{M tracks} \times 2\text{ tiers} \times 6.75\text{ MB avg} = 1.35\text{ PB}$. At 5 tiers: $100\text{M} \times 5 \times 3.5\text{ MB avg} = 1.75\text{ PB}$.
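The storage estimate is a straight multiplication, sketched here so the unit conversion is explicit. `catalog_storage_pb` is a hypothetical helper; it assumes decimal units (1 PB = 10^9 MB) and ignores replication and metadata overhead.

```python
MB_PER_PB = 1_000_000_000  # decimal units: 1 PB = 10^9 MB


def catalog_storage_pb(tracks, tiers, avg_mb_per_file):
    """Raw catalog size in PB, before replication or metadata overhead."""
    return tracks * tiers * avg_mb_per_file / MB_PER_PB


two_tier_pb = catalog_storage_pb(100_000_000, 2, 6.75)
five_tier_pb = catalog_storage_pb(100_000_000, 5, 3.5)

print(f"2 tiers: {two_tier_pb:.2f} PB")
print(f"5 tiers: {five_tier_pb:.2f} PB")
```

The average file size falls as tiers are added because the extra tiers are lower-bitrate encodings, so the total grows more slowly than the tier count.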
Trade-off: single-file storage means we cannot cache the "hot" first 30 seconds separately from the rest of the track, but the entire file is small enough to cache in full.

What if the interviewer asks about lossless formats like FLAC? Uncompressed CD audio runs at 1,411 kbps, about 37 MB for a 3.5-minute track, and FLAC typically compresses that to roughly 20 to 30 MB. Even at this size, byte-range requests work fine, since 30 MB is still 200x smaller than the 6 GB video file.
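The 1,411 kbps figure is the bitrate of CD-quality PCM (44.1 kHz, 16-bit, stereo), which the sketch below derives. The function names are hypothetical, and `ASSUMED_FLAC_RATIO` is an illustrative assumption: real compression varies by track.

```python
def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Bitrate of uncompressed PCM audio in kbps."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def size_mb(bitrate_kbps, duration_s):
    """Size in MB (10^6 bytes) at a constant bitrate."""
    return bitrate_kbps * 1000 * duration_s / 8 / 1_000_000


# Assumption for illustration only: FLAC output at 60% of the PCM size.
ASSUMED_FLAC_RATIO = 0.6

cd_kbps = pcm_bitrate_kbps(44_100, 16, 2)  # CD-quality stereo
pcm_mb = size_mb(cd_kbps, 3.5 * 60)        # uncompressed 3.5-minute track
flac_mb = pcm_mb * ASSUMED_FLAC_RATIO      # estimate under the assumed ratio

print(f"CD PCM bitrate: {cd_kbps:.1f} kbps")
print(f"uncompressed:   {pcm_mb:.1f} MB")
print(f"FLAC estimate:  {flac_mb:.1f} MB")
```

Whatever ratio is assumed, the result stays in the tens of megabytes, so the single-file, byte-range design still applies.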
Why it matters in interviews
This is the concept that differentiates music streaming from video streaming in an interview. Explaining why we use byte-range requests instead of HLS segments, backed by the file-size argument (a song is hundreds of times smaller than a movie), shows we understand the fundamental architectural difference.