Standard walkthrough
Video Metadata Store
We have 2 billion video records with title, description, upload date, and view counts. The access pattern is heavily read-biased: a 200:1 read-to-write ratio.
We chose MySQL with Vitess (not Cassandra or DynamoDB) because our metadata queries are relational (join video with user, filter by status, sort by date) and Vitess provides transparent horizontal sharding over MySQL without rewriting our query layer. Cassandra would handle the write throughput but lacks efficient joins and secondary indexes for our query patterns.
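The relational query pattern described above might look like the following sketch. The schema and column names (`video`, `user`, `status`, `upload_timestamp`, `display_name`) are illustrative, not the actual schema; Vitess accepts standard MySQL SQL through any DB-API-style client, routing and scatter-gathering across shards as needed.

```python
# Hypothetical schema: video(video_id, user_id, title, status, upload_timestamp)
# and user(user_id, display_name). Names are illustrative.
RECENT_PUBLIC_VIDEOS_SQL = """
SELECT v.video_id, v.title, u.display_name, v.upload_timestamp
FROM video AS v
JOIN user AS u ON u.user_id = v.user_id
WHERE v.status = %s
ORDER BY v.upload_timestamp DESC
LIMIT %s
"""

def recent_public_videos(cursor, status="PUBLIC", limit=20):
    """Run the join through a DB-API cursor; with Vitess in front,
    the same SQL works whether the data lives on one shard or many."""
    cursor.execute(RECENT_PUBLIC_VIDEOS_SQL, (status, limit))
    return cursor.fetchall()
```

This is exactly the kind of join-filter-sort query that would force per-access-pattern denormalization in Cassandra or DynamoDB.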
The constraint: a single unsharded database cannot serve 200x more reads than writes at this scale, and view counts on viral videos create write-hot rows that lock out concurrent readers.
DynamoDB would require denormalizing every access pattern into separate tables. Trade-off: we accept the operational complexity of running Vitess (shard routing, schema migrations across shards) in exchange for keeping our relational query model.
We shard by video_id (not user_id) because our hot path is video lookup by ID. Sharding by user_id would put all videos from a prolific creator on one shard, creating hotspots.
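A minimal sketch of hash-based placement by video_id (Vitess actually maps rows to shards via keyspace-ID ranges rather than a fixed modulus, so the constants here are illustrative):

```python
import hashlib

NUM_SHARDS = 256  # illustrative; Vitess uses keyspace-ID ranges, not a fixed modulus

def shard_for(video_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a video_id to a shard via a stable hash.

    A keyed digest (not Python's per-process-randomized hash()) keeps the
    mapping deterministic across servers and restarts.
    """
    digest = hashlib.md5(str(video_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because placement is driven by video_id, a prolific creator's videos scatter across shards instead of piling onto the shard that would hold their user_id.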
Each row holds title (200B), description (5KB), user_id (8B), upload_timestamp, duration, and status flags. View counts and like counts are denormalized directly onto the video row but updated asynchronously via a counter service that batches increments every 5 seconds, avoiding write-hot rows under viral traffic.
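The batching idea can be sketched as an in-memory accumulator that collapses many increments into one UPDATE per video per flush interval. This is a sketch only: the class and its `flush_fn` hook are hypothetical, and a production counter service would also persist the pending buffer so increments survive a crash.

```python
import threading
from collections import defaultdict

class BatchedCounter:
    """Accumulate view-count increments in memory and emit one write per
    video per flush, turning N hot-row updates into 1 (sketch only)."""

    def __init__(self, flush_fn, interval_s=5.0):
        self._pending = defaultdict(int)
        self._lock = threading.Lock()
        self._flush_fn = flush_fn    # e.g. runs: UPDATE video SET views = views + n
        self._interval_s = interval_s  # matches the 5-second batch window

    def increment(self, video_id, n=1):
        with self._lock:
            self._pending[video_id] += n

    def flush(self):
        """Called every interval_s by a timer; swaps the buffer out under
        the lock so increments keep landing during the database writes."""
        with self._lock:
            batch, self._pending = self._pending, defaultdict(int)
        for video_id, n in batch.items():
            self._flush_fn(video_id, n)
```

Under viral traffic, a video receiving 100,000 views in a 5-second window costs one row write instead of 100,000.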
Implication: 5-second batching means displayed view counts may lag real time by up to 5 seconds; exact real-time accuracy is not required for display, and the lag eliminates row-level lock contention. The read path hits a caching layer first (Redis, not Memcached, because Redis's structured data types let us store and partially read metadata objects), achieving a cache hit rate above 99% for popular videos.
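The read path is a standard cache-aside pattern: check Redis, fall back to the shard on a miss, backfill. In this sketch, `cache` is a redis-py-style client (hgetall/hset/expire are real redis-py hash commands), `db_lookup` stands in for the sharded-MySQL fetch, and the key format and TTL are illustrative.

```python
CACHE_TTL_S = 3600  # illustrative TTL

def get_video_metadata(cache, db_lookup, video_id):
    """Cache-aside read for the video metadata hot path (sketch)."""
    key = f"video:{video_id}"
    cached = cache.hgetall(key)
    if cached:
        return cached                 # >99% of popular-video reads end here
    row = db_lookup(video_id)         # cache miss: go to the shard
    cache.hset(key, mapping=row)      # store as a Redis hash, not a blob,
    cache.expire(key, CACHE_TTL_S)    # so individual fields can be read later
    return row
```

Storing the row as a Redis hash (rather than a serialized blob) is what the Redis-over-Memcached choice buys: a caller that only needs the title can HGET one field without deserializing the 5KB description.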
For search, we replicate metadata into an Elasticsearch index that supports full-text queries on title and tags. YouTube reportedly shards across thousands of MySQL instances using Vitess.
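A search request against the replicated index might use a query-DSL body like the one below. The field names (`title`, `tags`, `status`) mirror the MySQL columns and are illustrative; the boost and result size are arbitrary choices, and the dict would be passed to an Elasticsearch client's search call.

```python
def title_and_tags_query(text, status="PUBLIC", size=20):
    """Build an Elasticsearch query-DSL body: full-text match on title and
    tags, title weighted higher, filtered to visible videos (sketch)."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {"query": text, "fields": ["title^2", "tags"]}
                },
                "filter": {"term": {"status": status}},
            }
        },
    }
```

The `filter` clause is non-scoring and cacheable by Elasticsearch, which suits a status predicate that almost every query repeats.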
What if the interviewer asks: why not use a single NoSQL store for both metadata and search? Because full-text search requires inverted indexes (Elasticsearch excels here), while transactional metadata updates require ACID guarantees (MySQL excels here).
Combining both in one store means neither workload is served well.