TRICKYwalkthrough
LSM Trees: Turning Random Writes Sequential
The cluster must absorb 5 million writes per second, and every value must survive a power cut. A B-tree, the default index in PostgreSQL and MySQL, updates pages in place: each write becomes a random disk I/O, and even good SSDs sustain far fewer random writes than sequential ones.
Three candidates. Option one: B-tree with a big buffer pool.
“How do we make disks do what they are good at?”
Random page updates eventually flush to disk; under sustained write load, checkpointing becomes the bottleneck and p99 write latency spikes. Option two: append-only log with an in-memory hash index (Bitcask style).
Blazing writes, but the index must fit in RAM and range scans are impossible. Option three: the Log-Structured Merge tree.
Every write does two things: append to a write-ahead log (sequential, group-committed every ~1ms for durability) and insert into an in-memory memtable (a skip list). When the memtable reaches ~128 MB it flushes to disk as an immutable, sorted SSTable, again purely sequential I/O.
Background compaction merges SSTables, discarding overwritten values and tombstones. The result: the disk only ever sees sequential writes, and one node sustains 10K+ writes/sec where a B-tree node would choke.
Cassandra, RocksDB, LevelDB, and DynamoDB's storage layer all sit on this idea. The price is paid on two other axes.
Reads must consult the memtable plus potentially several SSTables (mitigated by bloom filters and sparse indexes), and compaction re-writes data repeatedly: write amplification around 10x for leveled compaction, which turns a 5 GB/sec logical ingest into ~50 GB/sec of physical disk writes across the cluster. What if the interviewer asks: why not just buy more B-tree nodes?
Because the failure is per-node economics: LSM gets ~10x more write throughput from the same SSD, so the B-tree cluster costs 10x more machines before replication even starts.
Related concepts