Failure Modes

KV Store Failure Modes

What breaks, how to detect it, and how to fix it. Every failure includes detection metrics, mitigations, and severity rating.

Failure Modes

MEDIUM

Node loss (and the rebooting-replica blip)

Hardware dies or a node restarts for patching. Keys whose preference lists include it lose one of three replicas; strict quorums would start failing writes for W=2 the moment a second replica hiccups.

Gossip marks the node down within seconds; hinted-handoff queue depth rises on ring neighbors; per-range under-replication gauge.

Mitigation

Sloppy quorum: coordinator writes to the next healthy ring node with a hint (deliver-on-return, 3h TTL)
On rejoin: rate-limited hint delivery restores the canonical replica without a traffic spike
Past the hint TTL or on permanent loss: ring reassigns arcs and anti-entropy streams the ranges

HIGH

Network partition with writes on both sides

A switch failure splits the cluster. Sloppy quorums keep both sides writable (by design), so the same key accepts divergent writes on both sides for the duration.

Gossip ring views diverge; sibling-creation rate spikes on heal; partition detector on inter-rack heartbeats.

Mitigation

Versions carry causal context: descendant histories merge automatically on heal
Concurrent histories become siblings surfaced to the application merge (cart-union pattern) where configured
LWW tables accept the documented risk: last timestamp wins, which is why cart-class data never uses LWW

CRITICAL

Clock skew turns LWW into silent data loss

A coordinator's NTP drifts 800ms ahead. Its writes carry future timestamps; concurrent writes through healthy coordinators lose every LWW comparison for the next 800ms, and losing writes vanish without any error.

Per-node clock-skew gauge against cluster median (alert past 100ms); LWW overwrite-rate anomaly per range; NTP daemon health.

Mitigation

Fence coordinators whose skew exceeds the bound: they proxy to healthy coordinators instead of stamping timestamps
Vector clocks or CRDTs on tables where a lost write is unacceptable
Hybrid logical clocks as the stamp source, bounding the damage a wall-clock jump can do

HIGH

Compaction debt spiral

A traffic surge outruns compaction (or an operator throttles it to protect p99). SSTables accumulate, reads fan out wider, disks fill with dead versions, tombstones stop expiring, and read latency degrades cluster-wide.

Pending-compaction count and growth rate; SSTables-consulted-per-read p99; disk-space slope; tombstone ratio per range.

Mitigation

Backpressure ingest before throttling compaction: shedding writes beats poisoning every read
Emergency: temporary size-tiered mode on affected ranges (cheaper merges), then re-level off-peak
Provision NVMe bandwidth for ~10x write amplification as a capacity-planning line item, not a surprise

HIGH

Hot key melts its preference list

A flash-sale inventory key takes 500K reads/sec. All of it lands on the key's three replicas; they saturate while the other 497 nodes idle, and co-located keys suffer collateral damage.

Top-K hot-key sampler per node; per-replica CPU/queue divergence from cluster median; coordinator queue depth on the affected preference list.

Mitigation

Coordinator-layer hot-key cache with 50-200ms TTL absorbs read storms (staleness bounded and documented)
Shard writable hot keys (key#0..key#7, scatter-gather on read) or move counters to a purpose-built service
Request-level load shedding for the offending key before its replicas take unrelated keys down