Gapless Playback and Crossfade Engine


A listener plays Pink Floyd's "The Dark Side of the Moon", an album whose tracks flow seamlessly into one another. Even a 200 ms gap between tracks destroys the artistic intent.

We chose a dual-buffer architecture (not a single decode buffer) because a single buffer must flush, reinitialize, and refill between tracks. Buffer A plays the current track. When the current track has 10 seconds remaining, the client begins decoding the next track into Buffer B.
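The hand-off described above can be sketched as a small state machine. This is a minimal illustration, not Spotify's implementation: the class name DualBufferPlayer, the tick callback, and the log list are hypothetical, and no real decoding or audio output takes place. Only the 10-second threshold comes from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# From the text: start decoding the next track when 10 s remain.
PREFETCH_LEAD_SECONDS = 10.0


@dataclass
class DualBufferPlayer:
    """Toy model of the two-buffer hand-off (hypothetical, no audio I/O)."""
    active: str = "A"                     # buffer currently feeding the output
    standby_track: Optional[str] = None   # track being decoded into the other buffer
    log: List[str] = field(default_factory=list)

    def standby(self) -> str:
        """Name of the buffer that is not currently playing."""
        return "B" if self.active == "A" else "A"

    def tick(self, remaining_seconds: float, next_track: Optional[str]) -> None:
        """Called periodically with the time left in the current track."""
        if (next_track is not None
                and self.standby_track is None
                and remaining_seconds <= PREFETCH_LEAD_SECONDS):
            self.standby_track = next_track
            self.log.append(f"decode {next_track} into {self.standby()}")

    def on_track_end(self) -> bool:
        """Swap buffers at the boundary; returns True if the hand-off is gapless."""
        if self.standby_track is None:
            self.log.append("gap: standby buffer was empty")
            return False
        self.active = self.standby()
        self.log.append(f"switch output to {self.active}")
        self.standby_track = None
        return True
```

Calling tick(9.5, "track2") on a fresh player starts decoding into Buffer B. After on_track_end() the roles swap, so the following track is decoded into Buffer A.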

“The constraint: the audio decoder needs time to initialize for the next track, and network fetch adds latency, but the listener must hear zero silence between songs.”

At the track boundary, the audio output crossfades from Buffer A to Buffer B in under 50 milliseconds. The client has already prefetched the next track's first 256 KB from the CDN (see the track prefetch concept), so Buffer B has data ready before it needs to start decoding.
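To make the boundary crossfade concrete, here is a minimal sketch of a linear crossfade over decoded samples. It assumes mono floating-point PCM and a 44.1 kHz sample rate, neither of which is stated in the text; the function names are hypothetical. A production engine might prefer an equal-power curve over the linear ramp shown here.

```python
from typing import List


def fade_sample_count(fade_ms: int, sample_rate_hz: int) -> int:
    """Number of samples covered by a fade of fade_ms milliseconds."""
    return sample_rate_hz * fade_ms // 1000


def crossfade(tail_a: List[float], head_b: List[float]) -> List[float]:
    """Linearly blend the end of Buffer A into the start of Buffer B.

    Both inputs are equal-length runs of PCM samples. The first output
    sample is pure A and the last is pure B.
    """
    if len(tail_a) != len(head_b):
        raise ValueError("both runs must have the same number of samples")
    n = len(tail_a)
    mixed = []
    for i in range(n):
        # t ramps from 0.0 to 1.0 across the fade window.
        t = i / (n - 1) if n > 1 else 0.5
        mixed.append(tail_a[i] * (1.0 - t) + head_b[i] * t)
    return mixed
```

Under these assumptions, a 50 ms window spans fade_sample_count(50, 44_100) samples, which works out to 2205.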

Why not a single buffer? Because a single buffer must finish the current track, flush, initialize the decoder for the new codec or bitrate, fetch the next track's header, and begin decoding.

That sequence takes 200 to 500 ms, producing an audible gap. The dual-buffer approach overlaps these operations with playback of the current track, so they are finished before the boundary arrives.
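The timing argument reduces to one line of arithmetic. The sketch below is a simplified model with a hypothetical function name: prep_ms is the time the flush, initialize, fetch, and decode sequence needs, and lead_ms is how long before the end of the current track that work is allowed to start.

```python
def audible_gap_ms(prep_ms: float, lead_ms: float) -> float:
    """Silence heard at the track boundary.

    prep_ms: time needed to prepare the next track for playback.
    lead_ms: how far ahead of the boundary preparation begins.
    """
    return max(0.0, prep_ms - lead_ms)


# Single buffer: preparation cannot start until the current track ends.
single_buffer_gap = audible_gap_ms(prep_ms=500, lead_ms=0)

# Dual buffer: preparation starts with 10 s of the current track remaining.
dual_buffer_gap = audible_gap_ms(prep_ms=500, lead_ms=10_000)
```

With the worst-case 500 ms preparation time from the text, the single-buffer gap is the full 500 ms, while the dual-buffer gap is zero because the 10-second lead dwarfs the preparation time.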

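Spotify's crossfade engine also supports a user-configurable crossfade of 0 to 12 seconds for DJ-style transitions. The 0-second crossfade (gapless) is the default and the hardest to implement because it requires sample-accurate alignment: the first sample of the next track must follow the last sample of the current track with nothing inserted or dropped in between.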

We handle codec mismatch by normalizing both buffers to PCM (uncompressed audio) before crossfading, since crossfading compressed audio produces artifacts. Trade-off: dual-buffer doubles peak memory usage from ~10 MB to ~20 MB per active stream, which is negligible on modern devices.
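For a rough sense of scale, the sketch below estimates how much uncompressed audio fits in a buffer of that size. It assumes the ~10 MB holds 16-bit stereo PCM at 44.1 kHz, which the text does not state; the function names are hypothetical.

```python
def pcm_bytes_per_second(sample_rate_hz: int, channels: int,
                         bytes_per_sample: int) -> int:
    """Data rate of uncompressed PCM audio."""
    return sample_rate_hz * channels * bytes_per_sample


def seconds_of_audio(buffer_bytes: int, sample_rate_hz: int = 44_100,
                     channels: int = 2, bytes_per_sample: int = 2) -> float:
    """How many seconds of PCM fit in buffer_bytes (assumed CD-quality defaults)."""
    rate = pcm_bytes_per_second(sample_rate_hz, channels, bytes_per_sample)
    return buffer_bytes / rate


one_buffer = 10 * 1024 * 1024           # ~10 MB for a single buffer
two_buffers = 2 * one_buffer            # ~20 MB peak while both buffers are live
```

Under these assumptions PCM runs at 176,400 bytes per second, so a 10 MB buffer holds a little under a minute of audio, and the second buffer adds the same again only while both are live.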

What if the interviewer asks: what about live radio or shuffle mode? For shuffle, we cannot know the next track until the shuffle algorithm selects it, so we start prefetching the moment it does. As long as the selection happens at least 10 seconds before the current track ends, we keep the same lead time as sequential playback.

Why it matters in interviews

This is the concept that separates a music streaming design from a generic audio player. Explaining the dual-buffer architecture with sample-accurate crossfading shows we understand the real-time audio constraints that video streaming never faces.
