KV Store Failure Modes
What breaks, how to detect it, and how to fix it. Every failure includes detection metrics, mitigations, and severity rating.
Node loss (and the rebooting-replica blip)
Hardware dies or a node restarts for patching. Keys whose preference lists include it lose one of three replicas; strict quorums would start failing writes for W=2 the moment a second replica hiccups.
- Sloppy quorum: coordinator writes to the next healthy ring node with a hint (deliver-on-return, 3h TTL)
- On rejoin: rate-limited hint delivery restores the canonical replica without a traffic spike
- Past the hint TTL or on permanent loss: ring reassigns arcs and anti-entropy streams the ranges
Network partition with writes on both sides
A switch failure splits the cluster. Sloppy quorums keep both sides writable (by design), so the same key accepts divergent writes on both sides for the duration.
- Versions carry causal context: descendant histories merge automatically on heal
- Concurrent histories become siblings surfaced to the application merge (cart-union pattern) where configured
- LWW tables accept the documented risk: last timestamp wins, which is why cart-class data never uses LWW
Clock skew turns LWW into silent data loss
A coordinator's NTP drifts 800ms ahead. Its writes carry future timestamps; concurrent writes through healthy coordinators lose every LWW comparison for the next 800ms, and losing writes vanish without any error.
- Fence coordinators whose skew exceeds the bound: they proxy to healthy coordinators instead of stamping timestamps
- Vector clocks or CRDTs on tables where a lost write is unacceptable
- Hybrid logical clocks as the stamp source, bounding the damage a wall-clock jump can do
Compaction debt spiral
A traffic surge outruns compaction (or an operator throttles it to protect p99). SSTables accumulate, reads fan out wider, disks fill with dead versions, tombstones stop expiring, and read latency degrades cluster-wide.
- Backpressure ingest before throttling compaction: shedding writes beats poisoning every read
- Emergency: temporary size-tiered mode on affected ranges (cheaper merges), then re-level off-peak
- Provision NVMe bandwidth for ~10x write amplification as a capacity-planning line item, not a surprise
Hot key melts its preference list
A flash-sale inventory key takes 500K reads/sec. All of it lands on the key's three replicas; they saturate while the other 497 nodes idle, and co-located keys suffer collateral damage.
- Coordinator-layer hot-key cache with 50-200ms TTL absorbs read storms (staleness bounded and documented)
- Shard writable hot keys (key#0..key#7, scatter-gather on read) or move counters to a purpose-built service
- Request-level load shedding for the offending key before its replicas take unrelated keys down