Video Streaming Cheat Sheet

Key concepts, trade-offs, and quick-reference notes for your interview prep.

ABR: HLS (.m3u8 + .ts) vs DASH (.mpd + .m4s)

#1
We chose HLS as the default streaming protocol (not DASH alone) because HLS works on every Apple device and most browsers without polyfills. DASH (MPEG standard) uses .mpd manifests with .m4s segments and offers more flexibility (codec-agnostic, multi-period), but it needs a polyfill on Safari. Both protocols split video into 2-10 second chunks so the player can switch resolution at each chunk boundary without rebuffering. YouTube serves both formats depending on the client. We generate both manifests during transcoding so we never lock ourselves to one protocol. Trade-off: dual-manifest generation adds ~5% to transcoding time, but eliminates client compatibility issues.

💡 We serve HLS to Apple clients and DASH to Android and smart TVs. Generating both costs 5% more transcoding time but removes all compatibility gaps.
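The HLS-to-Apple / DASH-elsewhere routing can be sketched in a few lines. This is a minimal sketch with simplified user-agent markers (an assumption for illustration; production client detection is more involved):

```python
# Sketch: route Apple clients to the HLS manifest, everyone else to DASH.
# Both manifests exist because we generate both during transcoding.
def manifest_for_client(user_agent: str) -> str:
    apple_markers = ("iPhone", "iPad", "Mac OS X")  # simplified assumption
    if any(m in user_agent for m in apple_markers):
        return "master.m3u8"   # HLS playlist, .ts segments
    return "manifest.mpd"      # DASH manifest, .m4s segments

print(manifest_for_client("Mozilla/5.0 (iPhone; CPU iPhone OS 17_0)"))  # master.m3u8
print(manifest_for_client("Mozilla/5.0 (Linux; Android 14)"))           # manifest.mpd
```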

Transcoding: Split, Parallel Encode, Merge

#2
We chose chunk-based parallel encoding (not single-file sequential encoding) because a 60-minute 4K video takes 4+ hours on one machine. We split the video into GOP-aligned chunks (2-10 seconds each), push each chunk as an independent task to a message queue (SQS or Kafka), and encode across 360 parallel FFmpeg workers. Total encoding time drops from 4 hours to under 5 minutes. If one worker crashes, we only re-encode that 10-second chunk, not the whole video. Trade-off: chunked encoding adds a lightweight merge step and requires GOP-aligned splits (non-trivial to get right), but the parallelism and fault isolation are worth it.
⚠️ Common mistake: Encoding the entire video as a single job on one machine. One 4K video blocks a worker for hours, and if the worker crashes at 95%, the entire encode restarts from scratch.
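The split / parallel encode / merge flow can be sketched with a thread pool standing in for the worker fleet; `encode_chunk` is a placeholder for the real FFmpeg invocation on one GOP-aligned chunk:

```python
from concurrent.futures import ThreadPoolExecutor

def encode_chunk(chunk_id: int) -> str:
    # Placeholder for one FFmpeg job on a single GOP-aligned chunk.
    # A worker crash here costs only this chunk's re-encode.
    return f"encoded_{chunk_id:04d}.ts"

def encode_video(num_chunks: int, workers: int = 360) -> list[str]:
    # Fan out one task per chunk, then merge. pool.map preserves input
    # order, so the result list is already in playback order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_chunk, range(num_chunks)))

print(encode_video(3, workers=2))
# ['encoded_0000.ts', 'encoded_0001.ts', 'encoded_0002.ts']
```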

CDN: 95%+ Cache Hit Ratio, Origin Pull vs Push

#3
We chose a tiered CDN strategy (not uniform caching for all videos) because video popularity follows a power-law distribution: the top 1% of videos get 80% of views. We use push pre-warming for viral and trending videos (replicate segments to all edge Points of Presence (POPs) before traffic arrives) and origin pull with long TTL (24h) for the long tail. YouTube achieves a 95%+ cache hit ratio because popular videos are watched repeatedly from the same regions. At 46K views/sec, even a 5% miss rate means 2,300 origin requests/sec. Trade-off: push pre-warming consumes edge storage proactively, but for the top 1% of videos it prevents thundering herd cache-miss storms on the origin.

💡 CDN is the single biggest performance lever in this design. Without it, every view hits the origin. With 46K views/sec at 2.5 MB/sec per stream, that is 115 GB/sec of bandwidth from the origin alone.
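The callout's arithmetic as a quick sanity check, using the numbers from this section:

```python
views_per_sec = 46_000
stream_mb_per_sec = 2.5        # MB/sec per stream
cache_hit_ratio = 0.95

# Requests that leak past the CDN to the origin.
origin_requests_per_sec = views_per_sec * (1 - cache_hit_ratio)
# Origin bandwidth if there were no CDN at all.
no_cdn_origin_gb_per_sec = views_per_sec * stream_mb_per_sec / 1_000

print(round(origin_requests_per_sec))   # 2300 requests/sec at a 5% miss rate
print(no_cdn_origin_gb_per_sec)         # 115.0 GB/sec without any CDN
```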

Video Chunking: 2-10 Sec, GOP-Aligned Segments

#4
We chose GOP-aligned segments (not arbitrary time splits) because each segment must start with an I-frame (full image) to be independently decodable. A Group of Pictures (GOP) begins with an I-frame followed by P-frames and B-frames (deltas). GOP alignment means no segment depends on data from the previous segment. We chose 4-second segments as our default (not 2s or 10s) because shorter segments (2s) give faster quality switching but increase manifest size and HTTP request overhead, while longer segments (10s) reduce overhead but delay quality adaptation by 10 seconds. Trade-off: 4-second segments are a middle ground that gives reasonable ABR responsiveness without excessive request overhead.
⚠️ Common mistake: Non-GOP-aligned segments. If a segment starts mid-GOP, the decoder needs the I-frame from the previous segment to render the first frames. This causes visual glitches during quality switches and breaks independent decodability.
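One way to derive GOP-aligned cut points, assuming the encoder's keyframe (I-frame) timestamps are known: take the first keyframe at or after each target-length boundary, so every segment starts on an I-frame. A sketch:

```python
def gop_aligned_cuts(keyframe_times: list[float], target: float = 4.0) -> list[float]:
    """Cut timestamps chosen so every segment begins on an I-frame:
    the first keyframe at or after each ~target-second boundary."""
    cuts, next_boundary = [], target
    for t in keyframe_times:
        if t >= next_boundary:
            cuts.append(t)
            next_boundary = t + target
    return cuts

# Keyframes every 2 s: cuts land exactly on 4 s multiples.
print(gop_aligned_cuts([0, 2, 4, 6, 8, 10, 12]))  # [4, 8, 12]
```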

Resumable Upload: Chunked Protocol, Byte Offset Tracking

#5
We chose a resumable upload protocol (not a single HTTP POST) because uploads of large video files (1-50 GB) rarely survive a single connection. A 10 GB upload on a 50 Mbps connection takes 27 minutes. We split the file into 5-10 MB chunks on the client, upload them sequentially, and track the last successful byte offset on the server. If the connection drops, the client queries the server for the offset and resumes from there. Maximum wasted transfer per interruption: 10 MB instead of the full file. We use Google's URI-based session pattern: POST to initiate, PUT to send chunks, GET to check progress. Trade-off: the server must persist per-session byte offset state (one row per active upload), but this is trivial compared to re-uploading gigabytes.

💡 We require resumable uploads for files over 100 MB. A single dropped connection without resume means re-uploading the entire file from scratch.
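The server-side offset-tracking contract can be sketched with an in-memory class standing in for the per-session row; the HTTP layer (POST/PUT/GET) is omitted:

```python
class UploadSession:
    """Sketch of server-side byte-offset tracking for resumable uploads."""

    def __init__(self, total_size: int):
        self.total_size = total_size
        self.offset = 0                # last successfully committed byte

    def put_chunk(self, offset: int, data: bytes) -> int:
        # Reject chunks that don't start at the committed offset; the
        # returned offset tells the client where to resume.
        if offset != self.offset:
            return self.offset
        self.offset += len(data)
        return self.offset

s = UploadSession(total_size=30)
s.put_chunk(0, b"0123456789")              # first 10 bytes land
s.put_chunk(20, b"oops")                   # stale offset after a drop: rejected
resume_at = s.put_chunk(10, b"0123456789") # client re-queried, resumed at 10
print(resume_at)  # 20
```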

Storage: 500 hrs/min Uploaded, 25 GB/sec Ingress

#6
YouTube receives 500 hours of video per minute. At 50 MB/min per resolution, raw ingress is 500 × 60 × 50 MB = 1.5 TB/min ≈ 25 GB/sec. After transcoding into 8 resolutions: 1.5 TB/min × 8 = 12 TB/min stored. Daily growth across all resolutions: 12 TB/min × 1,440 min = 17.3 PB/day. We store segments in object storage (S3 or GCS) with 3x replication across regions because object storage provides infinite horizontal scaling, native CDN integration, and lifecycle policies for cost optimization. We chose S3 Standard (not Glacier) for the hot path because segment fetches need sub-100ms latency. Trade-off: S3 Standard costs 3x more per GB than Glacier, but segments must be instantly accessible for streaming.
⚠️ Common mistake: Forgetting to multiply by the number of resolutions. The raw upload is 1.5 TB/min, but after transcoding to 8 renditions the stored data is 8x larger: 12 TB/min.
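The section's arithmetic, spelled out with the easy-to-forget 8x rendition multiplier:

```python
HOURS_UPLOADED_PER_MIN = 500
MB_PER_MIN_OF_VIDEO = 50       # per rendition
RENDITIONS = 8

raw_tb_per_min = HOURS_UPLOADED_PER_MIN * 60 * MB_PER_MIN_OF_VIDEO / 1_000_000
raw_gb_per_sec = raw_tb_per_min * 1_000 / 60
stored_tb_per_min = raw_tb_per_min * RENDITIONS    # the 8x multiplier
stored_pb_per_day = stored_tb_per_min * 1_440 / 1_000

print(raw_tb_per_min, raw_gb_per_sec)        # 1.5 TB/min, 25.0 GB/sec ingress
print(stored_tb_per_min, stored_pb_per_day)  # 12.0 TB/min, 17.28 PB/day stored
```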

Bitrate Ladder: 240p (300 Kbps) to 4K (20 Mbps)

#7
We chose per-title encoding (not a fixed bitrate ladder) because an animated show needs fewer bits than an action movie at the same perceived quality. A static scene at 720p might look perfect at 1 Mbps, but a fast-action sequence at 720p needs 3 Mbps. Netflix pioneered this approach: the bitrate ladder is customized per video. Our default ladder has 8-10 renditions: 240p at 300 Kbps, 360p at 500 Kbps, 480p at 1 Mbps, 720p at 2.5 Mbps, 1080p at 5 Mbps, 1440p at 10 Mbps, and 4K at 20 Mbps. The ABR algorithm picks the highest rendition that fits within the viewer's measured bandwidth. Trade-off: per-title encoding costs 2-3x more compute during transcoding, but it saves 20-30% on storage and CDN bandwidth over the video's lifetime.

💡 We do not use a fixed bitrate ladder for all content. Per-title encoding adds compute cost at upload time but saves significantly on long-term storage and bandwidth.
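The ABR selection rule above (highest rendition that fits measured bandwidth) can be sketched against the default ladder. The 0.8 safety margin below is an illustrative assumption, not from the source:

```python
# Default ladder from this section: (name, bitrate in Kbps).
LADDER = [("240p", 300), ("360p", 500), ("480p", 1_000), ("720p", 2_500),
          ("1080p", 5_000), ("1440p", 10_000), ("4K", 20_000)]

def pick_rendition(bandwidth_kbps: float, safety: float = 0.8) -> str:
    """Highest rendition whose bitrate fits in a safety-margined
    bandwidth estimate; falls back to the lowest rung."""
    budget = bandwidth_kbps * safety
    best = LADDER[0][0]
    for name, kbps in LADDER:
        if kbps <= budget:
            best = name
    return best

print(pick_rendition(8_000))   # 1080p (budget 6400 Kbps fits 5000, not 10000)
print(pick_rendition(100))     # 240p (nothing fits; lowest rung)
```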

View Count: Kafka Async, Batch Update Every 30 Sec

#8
We chose Kafka async ingestion + batch MySQL updates (not synchronous per-view database writes) because at 46K views/sec, incrementing a counter in MySQL on every view creates a write hotspot and InnoDB row-level lock contention. Each view event goes to a Kafka topic (fire-and-forget, sub-1ms). A consumer aggregates counts per video_id over a 30-second window and batch-updates MySQL in a single UPDATE per video. For the real-time display, we use a Redis INCR counter (handles 100K+ ops/sec per shard with sub-millisecond latency). Trade-off: the MySQL count lags up to 30 seconds behind reality, but the Redis counter shows the real-time number. We accept eventual consistency between the two.
⚠️ Common mistake: Synchronous increment in the database on every view. At 46K writes/sec to a single row (viral video), InnoDB row-level locking serializes all writes. Response time spikes from 5ms to 500ms+.
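The consumer-side window aggregation takes only a few lines; the Kafka plumbing and SQL execution are omitted in this sketch:

```python
from collections import Counter

def flush_window(view_events: list[str]) -> list[tuple[str, int]]:
    """Fold one 30-second window of raw view events into (video_id, delta)
    pairs: one batched UPDATE per video instead of one write per view."""
    return sorted(Counter(view_events).items())

window = ["v1", "v2", "v1", "v1", "v3", "v2"]
print(flush_window(window))  # [('v1', 3), ('v2', 2), ('v3', 1)]
# Each pair becomes one statement, e.g.:
#   UPDATE videos SET views = views + 3 WHERE video_id = 'v1';
```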

Metadata: ~10 KB/Video, Sharded by video_id, Redis Cached

#9
Each video has about 10 KB of metadata: title (256B), description (2 KB), tags (500B), user_id (8B), view/like counts (16B), timestamps (16B), thumbnail URLs (500B), resolution list (200B), and status fields. With 1 billion videos: 10^9 × 10 KB = 10 TB. We chose to shard by video_id (not user_id) because the hot path is metadata lookup by video_id during streaming. We cache hot metadata in Redis with a 1-hour TTL. At 46K reads/sec, a 95% cache hit ratio means only 2,300 MySQL queries/sec hit the database. Trade-off: sharding by video_id makes 'my videos' queries (by user_id) require scatter-gather across shards, but that path is 100x less frequent than video metadata reads.

💡 Metadata is tiny compared to video content (10 KB vs 500 MB per video). We keep it in a relational database for ACID guarantees and cache aggressively in Redis. The bottleneck is never metadata storage, always video bandwidth.
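Shard routing by video_id can be sketched as a stable hash-mod assignment. The shard count of 64 is an assumption for illustration:

```python
import hashlib

NUM_SHARDS = 64   # assumed shard count, not from the source

def shard_for(video_id: str) -> int:
    """Stable shard routing: the streaming hot path (lookup by video_id)
    hits exactly one shard; 'my videos' queries by user_id must
    scatter-gather across all of them."""
    digest = hashlib.md5(video_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

Using a hash of the key (rather than, say, a range on insertion time) spreads hot new videos across shards instead of piling them onto the newest one.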

Thumbnails: Bigtable, 5 KB Each, 5 Per Video

#10
We chose Bigtable (not S3 or a POSIX filesystem) for thumbnail storage because we need sub-10ms random reads for billions of small files. S3 adds 50-100ms latency per request because object storage is optimized for throughput, not latency. Filesystems hit inode limits at 5 billion files. Each video generates 5 thumbnails (auto-selected frames at 20%, 40%, 50%, 60%, 80% of duration), each about 5 KB. Row key = video_id, columns = thumb_1 through thumb_5. Total: 10^9 × 5 × 5 KB = 25 TB. Bigtable distributes this across tablet servers and delivers sub-10ms reads. Trade-off: Bigtable costs more per GB than S3, but the latency requirement makes S3 unsuitable for the thumbnail hot path.
⚠️ Common mistake: Storing thumbnails on a regular filesystem or in S3. Filesystems hit inode limits with billions of small files. S3 adds 50-100ms latency per request. Bigtable handles both the volume (25 TB) and the latency (sub-10ms) requirements.
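The row layout described above can be sketched as a key/columns pair. A real client would use the google-cloud-bigtable library; here a plain dict stands in for the row's column family:

```python
def thumbnail_row(video_id: str, frames: list[bytes]) -> tuple[str, dict]:
    """Row key = video_id, columns thumb_1..thumb_5, each ~5 KB of
    image bytes (frames auto-selected at 20/40/50/60/80% of duration)."""
    assert len(frames) == 5
    columns = {f"thumb_{i}": data for i, data in enumerate(frames, start=1)}
    return video_id, columns

row_key, columns = thumbnail_row("vid_123", [b"\xff\xd8jpeg"] * 5)
print(row_key, sorted(columns))
```

Keying the row by video_id keeps all five thumbnails in one row, so the player fetches them with a single sub-10ms row read.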