Whiteboard ScaleKV StoreCheat Sheet
Cheat Sheet

KV Store Cheat Sheet

Key concepts, trade-offs, and quick-reference notes for your interview prep.

LSM Write Path: WAL, Memtable, SSTable

#1
Every write appends to the write-ahead log (sequential disk, group commit ~1ms) and inserts into the in-memory memtable (skip list). At ~128 MB the memtable flushes to disk as an immutable sorted SSTable: again sequential. The disk NEVER sees random writes, which is why one LSM node sustains 10K+ writes/sec where a B-tree node chokes on random page updates. Background compaction merges SSTables and drops overwritten values and tombstones. Costs live elsewhere: read amplification (many SSTables to consult) and ~10x write amplification from compaction re-writes.

💡 Say the path in order: WAL for durability, memtable for speed, SSTable for persistence, compaction for hygiene.

Quorum Math: R + W > N

#2
With N = 3 replicas, write to W = 2, read from R = 2: 2+2>32 + 2 > 3 guarantees every read set overlaps every acknowledged write set in at least one fresh replica. One node can be slow or dead and both operations still succeed. Tune per use case: R=1, W=1 for speed with staleness; W=3, R=1 for read-heavy config data. The caveat that separates seniors: overlap guarantees the read set contains the newest value, not that replicas agree: versions + reconciliation + read repair handle disagreement.

💡 W = N kills availability: one slow replica blocks every write. W = 2 is the Dynamo compromise.

Ring Placement: 256 Vnodes, 1/N Remap

#3
Keys and nodes hash onto a consistent-hash ring; a key belongs to the first node clockwise. Topology change remaps only 1/N of keys. Raw rings load-skew badly (250%+ variance at 500 nodes), so each physical node projects 256 virtual nodes, cutting variance under 10%. Replication rides the ring for free: a key's preference list is the next N distinct physical nodes clockwise. Any node can coordinate: hash the key, fan out to the preference list, count acknowledgments.

💡 Same ring as the consistent-hashing topic; the KV store adds the preference list on top.

LWW vs Vector Clocks: Who Wins After a Partition

#4
Last-Writer-Wins: highest timestamp silently replaces the rest. Simple, but with clock skew (NTP is tens of ms at best) LWW can pick the losing write: silent data loss, the worst storage failure. Vector clocks: per-node counters record causal history; concurrent versions become siblings the application must merge (Dynamo cart union: both items survive). CRDTs: merge is commutative by construction, converges automatically, limited to expressible types. Position: LWW where losing a concurrent write is tolerable (caches, presence), vector clocks or CRDTs where it is not (carts, counters).

💡 The concrete failure: a cart item vanishes with no error. Name clock skew as the mechanism.

Sloppy Quorums and Hinted Handoff

#5
When canonical replicas are unreachable, the coordinator writes to the next healthy node on the ring with a hint: "deliver to replica 2 when it returns". Writes stay available through failures; on recovery, hints hand off and delete. Three costs: sloppy quorums weaken R + W > N (a canonical read quorum can miss a hinted write), hints are buffered work (cap at ~3 hours, ~100 GB across neighbors for a dead 10K writes/sec node), and mass recovery needs rate-limited handoff or it becomes its own incident.

💡 Hinted handoff is a temporary durability patch, not replication. Anti-entropy covers anything older than the hint TTL.

Three Repair Mechanisms, Three Time Horizons

#6
Read repair (seconds): every quorum read already holds R versions; on mismatch, resolve and asynchronously fix stale replicas. Free, but only heals what gets read. Hinted handoff (minutes): bridges short outages. Merkle-tree anti-entropy (weekly): replica pairs compare hash trees over their ranges: identical roots = identical data in one round trip; descend only differing subtrees (O(logn)O(\log n)), stream only differing buckets. Repair after long outages must be throttled: a repair storm is an incident wearing a safety vest.

💡 "Eventually consistent" is vague until you name the three mechanisms and their horizons.

Bloom Filters: 10 Bits/Key, 1% False Positives

#7
A read may need to consult every SSTable: that is read amplification. Each SSTable carries a bloom filter in RAM: k hash functions over a bit array; any zero bit means definitely absent (skip the file), all-ones means probably present (~1% false positive at 10 bits/key, k=7). No false negatives, so skipping is safe. Inside the one SSTable that survives filtering, a sparse index (one entry per 64 KB block) narrows to a single disk read. Budget: ~250 MB RAM/node of filters buys ~10x read-I/O reduction: the best RAM-for-IO trade in the design.

💡 Bloom filters answer membership only: they do nothing for range scans.

Compaction Strategies and Write Amplification

#8
Size-tiered (Cassandra default): merge similar-sized SSTables. Write amp ~4-6x, best for ingest-heavy, but keys span tiers (wider reads, transient 2x space). Leveled (RocksDB): non-overlapping ranges per level, each level 10x larger. Reads touch at most one SSTable per level, space overhead ~10%, but write amp ~10x: our 5 GB/sec logical ingest becomes ~50 GB/sec of physical compaction I/O. Choose per table: leveled for latency-sensitive ranges, size-tiered for write floods. The silent killer is compaction debt: ingest outrunning compaction piles up SSTables and degrades every read.

💡 Monitor pending compactions and SSTables-per-read: they degrade before p99 does.

Capacity Math: From Ops/sec to Node Count

#9
Anchor: 20M reads/sec, 5M writes/sec peak, values ~1 KB, 100 TB logical data = 100B keys. Write nodes: 5M/10K per node=5005\text{M} / 10\text{K per node} = 500. Storage: 100 TB×3 RF=300 TB100\text{ TB} \times 3\text{ RF} = 300\text{ TB}, at 2 TB usable per node = 150 nodes: writes dominate, so 500 nodes. Ingest: 5M/sec×1KB=5 GB/sec5\text{M/sec} \times 1\text{KB} = 5\text{ GB/sec} logical, ~50 GB/sec physical after compaction. RAM per node: 128 MB active memtable + ~250 MB bloom filters + block cache. WAL: sequential, recycled after flush.

💡 Size by the binding constraint: here write throughput (500 nodes), not storage (150).

The Metrics That Predict Trouble

#10
p99 read latency target <10ms, write <5ms: but these are lagging indicators. Leading ones: pending compactions (debt building), SSTables consulted per read (read amplification creeping), hinted handoff queue depth (a node quietly failing), clock skew across coordinators (LWW data-loss risk: alert past 100ms), tombstone ratio per range (delete-heavy workloads poisoning reads), disk headroom (compaction needs transient space: past 70% is an emergency, not a warning). Weekly: anti-entropy repair completion rate per range.

💡 Compaction debt degrades reads cluster-wide before any single metric pages. Watch the leading indicators.