Whiteboard Scale›KV Store›Design Walkthrough

KV Store System Design Walkthrough

Complete design walkthrough with animated diagrams, capacity math, API design, schema, and failure modes.

Solution PathTarget: 30 min

We designed a Dynamo-class key-value store: 100 TB logical (100B keys) on 500 nodes, serving 20M reads/sec and 5M writes/sec. The LSM engine (WAL + memtable + SSTable) makes every disk write sequential: 10K writes/sec per node, paid for with 10x compaction amplification and bought back with 250 MB of bloom filters. A consistent-hash ring with 256 vnodes places keys and replicas (N=3) with no metadata service; quorums (R+W>N) tune freshness per request; sloppy quorums, hinted handoff, read repair, and weekly Merkle anti-entropy keep it writable through failures and convergent afterward.

1/10

What is KV Store?

Amazon, 2004: the relational databases behind the shopping cart kept buckling under write load and failing over with downtime, and the team asked a heretical question: what if the cart never refused a write, even during failures, even during partitions: and we paid for it with consistency instead of availability? The answer became Dynamo, and its descendants (Cassandra, Riak, DynamoDB, ScyllaDB) now hold sessions, carts, feeds, counters, and device state across the industry.

A distributed key-value store is a partitioned map: Get(key), Put(key, value): with three interlocking machines under the API. Machine one, the storage engine: an LSM tree that turns every write into sequential I/O (WAL + memtable + SSTable), so one node absorbs 10K+ writes/sec where a B-tree chokes.

Machine two, the placement layer: a consistent-hash ring with 256 vnodes per node that spreads 100 billion keys across 500 nodes, remaps only 1/N of them when topology changes, and yields replica placement (the preference list) for free. Machine three, the consistency protocol: quorum knobs (R + W > N when you need fresh reads, R=W=1 when you need speed), sloppy quorums with hinted handoff so writes survive dead replicas, and three repair mechanisms on three time horizons.

What it deliberately is NOT: a relational database (no joins, no secondary indexes, no cross-key transactions): and that refusal is where the scalability comes from. The design contrast worth naming: our distributed DB control plane (topic 12) chose consistency-first with Raft and range leases; Dynamo-class stores choose availability-first with quorums and reconciliation.

Same CAP theorem, opposite corners, both correct: for different data.

Dynamo's bet: never refuse a write, pay in consistency. Three machines: LSM engine (10K writes/sec/node), hash ring (100B keys, 1/N remap), tunable quorums (R+W>N to R=W=1). Deliberately not relational: the refusal IS the scalability. Availability-first vs topic 12's consistency-first.

A distributed key-value store is the storage workhorse behind carts, sessions, feeds, and device state: a horizontally partitioned map from key to value that keeps accepting writes through node failures and network partitions. It sounds like a hash map with servers. The real design is three interlocking machines: a storage engine that turns random writes into sequential I/O (the LSM tree), a placement layer that spreads 100 billion keys across 500 nodes with minimal reshuffling (the consistent-hash ring), and a consistency protocol that lets you buy exactly as much freshness as each workload needs (quorums, vector clocks, repair).

Scale anchor: 20M reads/sec, 5M writes/sec peak over 100 TB logical data (100B keys, ~1 KB values)
Write speed comes from the LSM tree: WAL + memtable + sequential SSTable flushes = 10K+ writes/sec per node
Consistency is tunable per operation: R + W > N for fresh reads, R=W=1 where staleness is fine
Availability through failures: sloppy quorums + hinted handoff keep writes flowing when replicas die