Thumbnail Generation


A viewer hovers over the seek bar and expects an instant preview image. Without pre-generated thumbnails, the player would need to decode the video at that timestamp on the fly, consuming significant client CPU and adding visible delay.

We generate thumbnails as a downstream task in the transcoding DAG, triggered after the first resolution finishes encoding. The pipeline extracts keyframes at 10-second intervals, producing a sprite sheet of candidate images per video.
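The sampling step can be sketched in a few lines. This is a minimal illustration, not the actual pipeline: the function names, the row-major sprite layout, and the 10-column sheet are assumptions made for the example. It computes the sample timestamps for a video and maps a seek-bar position to the tile the player would crop from the sprite sheet.

```python
import math


def thumbnail_timestamps(duration_s: float, interval_s: float = 10.0) -> list[float]:
    """Timestamps (in seconds) at which one candidate frame is sampled."""
    count = math.ceil(duration_s / interval_s)
    return [i * interval_s for i in range(count)]


def tile_index_for(position_s: float, interval_s: float, count: int) -> int:
    """Index of the tile covering a seek-bar position, clamped to the sheet."""
    index = int(position_s // interval_s)
    return max(0, min(index, count - 1))


def sprite_tile(index: int, columns: int, tile_w: int, tile_h: int) -> tuple[int, int, int, int]:
    """(x, y, width, height) of a tile in a row-major sprite sheet (assumed layout)."""
    row, col = divmod(index, columns)
    return (col * tile_w, row * tile_h, tile_w, tile_h)


# A 10-minute video sampled every 10 seconds gives 60 tiles.
stamps = thumbnail_timestamps(600)
index = tile_index_for(125.0, 10.0, len(stamps))
crop = sprite_tile(index, columns=10, tile_w=120, tile_h=90)
```

With one sprite sheet per video, the player makes a single image request and crops tiles locally as the viewer moves along the seek bar.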

“The constraint: we need thumbnails at every 10-second interval, at multiple resolutions, stored for low-latency retrieval by video ID.”

A 10-minute video at one frame per 10 seconds yields 60 candidates. For creator-facing thumbnails, an ML model scores each candidate on visual quality, face detection, and text clarity, then selects the top 3 to 5 options for the creator to choose from.
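The selection step amounts to a ranking over scored candidates. The sketch below assumes the model has already produced three per-frame scores in the range 0 to 1; the `Candidate` type, the weights, and the linear combination are hypothetical stand-ins for whatever the real model outputs.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    timestamp_s: float
    visual_quality: float  # assumed model output in [0, 1]
    face_score: float      # assumed model output in [0, 1]
    text_clarity: float    # assumed model output in [0, 1]


# Hypothetical weights; a real system would learn or tune these.
W_QUALITY, W_FACE, W_TEXT = 0.5, 0.3, 0.2


def score(c: Candidate) -> float:
    """Combine the three signals into one ranking score."""
    return W_QUALITY * c.visual_quality + W_FACE * c.face_score + W_TEXT * c.text_clarity


def top_candidates(candidates: list[Candidate], k: int = 5) -> list[Candidate]:
    """Return the k highest-scoring candidates, best first."""
    return heapq.nlargest(k, candidates, key=score)
```

`heapq.nlargest` avoids sorting all 60 candidates when only 3 to 5 are needed, although at this size a full sort would be just as acceptable.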

Each thumbnail is resized to multiple dimensions (120x90, 320x180, 480x360) and compressed to JPEG at roughly 5 KB per image. Implication: at 5 thumbnails times 5 KB, that is only 25 KB per size, or roughly 75 KB across all three sizes, making thumbnail storage negligible compared to video storage.
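The arithmetic is worth being able to reproduce on a whiteboard. The sketch below uses the figures from the text (5 selected thumbnails, about 5 KB each); the 100 MB video size in the usage lines is an assumed placeholder, not a number from this design. Counting one size per thumbnail gives 25 KB, and counting all three sizes gives 75 KB.

```python
def thumbnail_bytes_per_video(selected: int = 5, sizes: int = 1, bytes_per_image: int = 5_000) -> int:
    """Total thumbnail bytes stored for one video."""
    return selected * sizes * bytes_per_image


def thumbnail_fraction(video_bytes: int, sizes: int = 3) -> float:
    """Thumbnail storage as a fraction of the video's own storage."""
    return thumbnail_bytes_per_video(sizes=sizes) / video_bytes


one_size = thumbnail_bytes_per_video()            # 25_000 bytes
three_sizes = thumbnail_bytes_per_video(sizes=3)  # 75_000 bytes
# Assumed 100 MB of stored video, purely for illustration.
fraction = thumbnail_fraction(video_bytes=100_000_000)
```

Either way, the thumbnails are well under a tenth of a percent of the assumed video size, which is the point of the estimate.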

We store thumbnails in Bigtable (not a traditional SQL database or filesystem) because Bigtable provides low-latency random reads keyed by video ID (typically single-digit milliseconds), handles billions of rows without manual sharding configuration, and scales horizontally by adding nodes. A relational database would require explicit sharding at this volume.

A filesystem would lack the indexed lookup speed. Trade-off: we accept Bigtable's higher per-query cost compared to a filesystem in exchange for consistent low-latency reads at any scale.
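Bigtable keeps rows sorted lexicographically by row key, so the key layout decides which reads are cheap. The sketch below shows one possible layout, with an in-memory sorted store standing in for the table. The `video_id#WxH#index` format and the `SortedStore` class are assumptions for illustration; no real Bigtable client calls appear here.

```python
from bisect import bisect_left, insort
from typing import Iterator, Optional


def row_key(video_id: str, width: int, height: int, index: int) -> bytes:
    """Hypothetical key layout: all thumbnails of one video and size are adjacent."""
    return f"{video_id}#{width}x{height}#{index:05d}".encode()


class SortedStore:
    """In-memory stand-in for a table whose rows are sorted by key."""

    def __init__(self) -> None:
        self._keys: list[bytes] = []
        self._values: dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._values:
            insort(self._keys, key)  # keep keys in lexicographic order
        self._values[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        """Point lookup by exact row key."""
        return self._values.get(key)

    def scan_prefix(self, prefix: bytes) -> Iterator[tuple[bytes, bytes]]:
        """Yield rows whose key starts with prefix, in key order."""
        i = bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._values[self._keys[i]]
            i += 1
```

Leading the key with the video ID turns "all thumbnails for this video at 320x180" into one contiguous prefix scan, and a single thumbnail into a point lookup. Zero-padding the index keeps lexicographic order equal to numeric order.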

Netflix generates up to 30 personalized thumbnail variants per title, selected by an A/B testing framework that optimizes for click-through rate.

What if the interviewer asks: why generate thumbnails in the transcoding DAG rather than as a separate pipeline?

Because the transcoding DAG already decodes the video into frames. Extracting thumbnails from decoded frames is nearly free.

A separate pipeline would decode the video a second time, doubling the I/O cost for that step.
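The single-decode argument can be made concrete with a small fan-out loop. This is a sketch under simplifying assumptions: `Frame` is a placeholder tuple, and `encode` and `keep_thumbnail` are hypothetical callbacks standing in for the real DAG tasks. Each decoded frame is handed to the encoder, and the first frame in each 10-second window is also handed to the thumbnail step.

```python
from typing import Callable, Iterable

# Placeholder frame type: (timestamp in seconds, raw pixel data).
Frame = tuple[float, bytes]


def fan_out(
    frames: Iterable[Frame],
    encode: Callable[[Frame], None],
    keep_thumbnail: Callable[[Frame], None],
    interval_s: float = 10.0,
) -> int:
    """Consume decoded frames once, feeding both consumers; return frames decoded."""
    next_sample_s = 0.0
    decoded = 0
    for frame in frames:
        decoded += 1
        encode(frame)  # every frame goes to the encoder
        if frame[0] >= next_sample_s:
            keep_thumbnail(frame)  # first frame of each interval becomes a candidate
            # Jump to the start of the next interval after this frame.
            next_sample_s = (frame[0] // interval_s + 1) * interval_s
    return decoded
```

The decode count equals the number of frames in the video, regardless of how many thumbnails are kept. A separate pipeline would run the same loop a second time over the same input.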

Why it matters in interviews

Thumbnails seem trivial, but they directly affect click-through rates. Mentioning ML-based selection and explaining why we chose Bigtable over the alternatives shows that we think about downstream user engagement, not only the encoding pipeline.
