Consistent Hashing Failure Modes
What breaks, how to detect it, and how to fix it. Every failure includes detection metrics, mitigations, and severity rating.
Node crash remaps 1% of keys; without replication, a cache-miss storm follows
MEDIUM severity with RF=3: the clockwise successor serves immediately.
Constraint: each node holds 1/100 of total keys (5M keys). The node crashes (OOM, hardware failure, bad deploy). Without replication (RF=1), those 5M keys are lost. Every request misses the cache and hits the database. At 115K ops/sec with 1% affected, that is 1,150 extra DB queries/sec. With RF=3, the clockwise successor already holds the data and takes over seamlessly. Detection: heartbeat stops for 30 seconds, coordinator marks node dead, increments epoch. Gateways refresh within 5 seconds.
- RF=3 replication: the successor already has the data. Zero misses reach the database on single-node failure.
- Ring coordinator detects failure within 30 seconds via heartbeat monitoring. Epoch increments, gateways update within 5 seconds.
- Request coalescing at the gateway: if multiple misses for the same key arrive simultaneously, only one database query is made.
- Circuit breaker: if miss rate exceeds 10x normal, reject new requests to protect the database.
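The RF=3 failover above relies on replica placement: a key's replicas are its primary plus the next distinct clockwise nodes on the ring. A minimal sketch, where `Ring`, `h`, and the node names are illustrative and MD5 stands in for whatever hash function the system actually uses:

```python
import bisect
import hashlib

def h(key: str) -> int:
    # Illustrative hash: map a string to a point on the ring via MD5.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, rf=3):
        self.rf = rf
        self.points = sorted((h(n), n) for n in nodes)

    def replicas(self, key):
        # Primary is the first point clockwise of h(key); the next
        # rf-1 distinct nodes are its successors/replicas.
        i = bisect.bisect(self.points, (h(key),))
        out = []
        while len(out) < self.rf:
            node = self.points[i % len(self.points)][1]
            if node not in out:
                out.append(node)
            i += 1
        return out

ring = Ring([f"node-{i}" for i in range(100)], rf=3)
primary, *successors = ring.replicas("user:42")
# On primary failure, successors[0] already holds the data: route there.
```

Because the successor list is derived purely from the ring, every gateway computes the same failover target with no coordination beyond the ring itself.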
Hot key overloads a single node
One overloaded node degrades latency for all keys it serves, not just the hot key.
Constraint: consistent hashing distributes keys evenly, but not request volume. A viral social media post is cached under one key. That key receives 80% of all reads (roughly 92K reads/sec to one node at 115K ops/sec) while the other 99 nodes handle ~230 reads/sec each. The hot node's CPU maxes out, latency degrades from 1ms to 50ms, and the node starts timing out requests. Virtual nodes do not help: they balance key count, not access patterns. Detection: per-node CPU divergence (one node at 90% while peers average 30%); cache latency p99 for keys on that node exceeds 50ms.
- Hot key splitting: split the key into 10 sub-keys (key:0 to key:9). Client appends rand(0,9), spreading reads across up to 10 nodes.
- Bounded-load consistent hashing: cap per-node load at (1+epsilon) x average. Overflows walk to the next node.
- Local cache on the gateway: for extremely hot keys, cache the value at the gateway layer for 1 second, absorbing burst reads without hitting the cache node.
- Auto-detect and alert: monitor CPU divergence, automatically flag hot keys for splitting.
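Hot key splitting (the first mitigation) can be sketched as below; the cache is a plain dict standing in for the real cache client, and `SPLIT` is a hypothetical tuning knob:

```python
import random

SPLIT = 10  # fan-out factor: key:0 .. key:9 (hypothetical tunable)

def write_hot_key(cache, key, value):
    # Writes fan out to every sub-key so any sub-key read sees the value.
    for i in range(SPLIT):
        cache[f"{key}:{i}"] = value

def read_hot_key(cache, key):
    # Each read picks a random sub-key; since sub-keys hash to different
    # ring positions, reads spread across up to SPLIT nodes.
    return cache.get(f"{key}:{random.randrange(SPLIT)}")

cache = {}
write_hot_key(cache, "post:viral", "payload")
assert read_hot_key(cache, "post:viral") == "payload"
```

The trade-off is write amplification (10 writes per update) in exchange for a 10x read fan-out, which is usually a good deal for read-heavy viral keys.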
Rebalancing overloads donor nodes during node addition
Live traffic degrades during rebalancing but recovers once transfer completes (~42-84 seconds).
Constraint: adding a node transfers 5.25 GB from existing donors at 1 Gbps. The donor nodes must serve live traffic while simultaneously streaming data to the new node. At peak (347K ops/sec), each donor handles 3.5K ops/sec. Adding the I/O load of streaming 5.25 GB saturates the donor's NIC. Live request latency degrades from 1ms to 20ms. What breaks: cache hit ratio drops as donors slow down, and more requests time out and fall through to the database. Detection: donor node network utilization exceeds 80%. Cache latency for keys on donor nodes spikes.
- Bandwidth throttling: limit rebalance transfer to 500 Mbps (50% of NIC capacity), leaving headroom for live traffic.
- Off-peak scheduling: trigger node additions during low-traffic windows (2-6 AM) when live traffic is 1/3 of peak.
- Incremental transfer: stream data in small batches with pauses between batches, allowing the donor to process queued live requests.
- Pre-warming: the new node can fetch popular keys first (by access frequency) to minimize the impact window.
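Bandwidth throttling and incremental transfer combine naturally: stream small batches and sleep between them so the average rate stays under the cap. A sketch, with `send` standing in for the actual transfer RPC and the batch size a hypothetical tuning choice:

```python
import time

def throttled_transfer(batches, send, batch_bytes=4 * 1024 * 1024,
                       cap_bps=500_000_000):
    # Pace batches so average throughput stays under cap_bps (500 Mbps
    # default, half of a 1 Gbps NIC), leaving headroom for live traffic.
    min_interval = batch_bytes * 8 / cap_bps  # seconds per batch
    for batch in batches:
        start = time.monotonic()
        send(batch)
        elapsed = time.monotonic() - start
        if elapsed < min_interval:
            # Donor drains queued live requests during the pause.
            time.sleep(min_interval - elapsed)
```

At the 500 Mbps cap, the 5.25 GB transfer takes ~84 seconds, the upper bound of the recovery window quoted above.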
Split brain in multi-datacenter ring
Divergent rings serve different data for the same keys. Silent data inconsistency until detected.
Constraint: the ring spans two datacenters (DC-A and DC-B) for geographic availability. The network link between DCs fails. Each DC believes the other's nodes are dead (heartbeats stop). DC-A's coordinator removes DC-B's 50 nodes from the ring. DC-B's coordinator removes DC-A's 50 nodes. Both DCs rebalance independently, creating two divergent rings. Writes to the same key in both DCs create conflicting values. Detection: sudden loss of exactly 50% of nodes (all in one DC) is a partition signal, not mass failure.
- Quorum writes: W=2 of 3 replicas. If a write cannot reach quorum (replicas in the partitioned DC are unreachable), the write fails rather than creating divergent state.
- Partition detection heuristic: if all nodes in one DC fail simultaneously, treat as network partition, not mass failure. Do not rebalance.
- Anti-entropy repair on reconnection: when DCs reconnect, run full-ring repair, using vector clocks to detect conflicting writes and last-write-wins to resolve them.
- Read repair: on quorum reads, if replicas return different values, the newest value is written back to stale replicas.
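The quorum-write rule can be sketched as follows; `replica.put` is a stand-in for the real replica write RPC, and unreachable replicas in the partitioned DC are modeled as raising `ConnectionError`:

```python
class QuorumError(Exception):
    pass

def quorum_write(replicas, key, value, w=2):
    # Try all RF replicas; succeed only if at least W acknowledge.
    # During a partition, the remote DC's replicas cannot ack, so the
    # write fails loudly instead of creating divergent state.
    acks = 0
    for replica in replicas:
        try:
            replica.put(key, value)
            acks += 1
        except ConnectionError:
            continue
    if acks < w:
        raise QuorumError(f"only {acks}/{w} acks")
    return acks
```

With RF=3 and replicas spread across both DCs, W=2 guarantees that at most one side of a partition can accept a given write, which is exactly what prevents divergence.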
Stale ring metadata on gateway routes to wrong node
Stale ring causes misrouting for up to 5 seconds. With retry and replica fallback, most requests succeed on the second attempt.
Constraint: gateways poll the ring coordinator every 5 seconds. A node fails and is removed from the ring. The coordinator updates the epoch. A gateway has not yet polled and still has the old ring (epoch N-1). For the next 5 seconds, this gateway routes keys from the failed node's range to the dead node. Reads fail (connection refused) and fall back to replicas (if RF>1). Writes to the dead node are lost unless the gateway retries to the successor. Detection: gateway logs show connection errors to the dead node IP. Ring epoch mismatch in gateway metrics.
- Epoch-based staleness detection: on any connection error, the gateway immediately polls the coordinator for a ring refresh instead of waiting for the next 5-second interval.
- Client-side retry: on 503, the client retries after 100ms. By then, the gateway likely has the updated ring.
- Replica fallback: on primary failure, the gateway falls back to the next clockwise replica node in the ring.
- Force-refresh endpoint: admin can trigger immediate ring refresh on all gateways via a broadcast API call.
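Epoch-based staleness detection might look like the sketch below; `coordinator.fetch_ring` and `ring.lookup` are hypothetical interfaces, not a real client library:

```python
class Gateway:
    # Caches the ring from the normal 5-second poll; a connection
    # error triggers an immediate refresh instead of waiting it out.
    def __init__(self, coordinator):
        self.coordinator = coordinator
        self.epoch, self.ring = coordinator.fetch_ring()

    def get(self, key):
        try:
            return self.ring.lookup(key).get(key)
        except ConnectionError:
            # Stale-ring signal: force-refresh, then retry once on the
            # new ring, which now points at the dead node's successor.
            self.epoch, self.ring = self.coordinator.fetch_ring()
            return self.ring.lookup(key).get(key)
```

This shrinks the misrouting window from the worst-case 5 seconds to a single failed request, at the cost of extra coordinator load during a failure.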
Cache miss thunderstorm cascading to database
Cascading failure can take down the database, turning a cache expiry into a full system outage.
Constraint: a popular key with TTL=3600s expires at T. At T+0, 10,000 concurrent requests arrive, all find a miss, all query the database with the same query. The database, sized for 5% miss rate (5,750 queries/sec), gets 10,000 identical queries in one second. Queue fills. Other queries are blocked. Latency spikes from 10ms to 5 seconds. Cascading failure: more entries expire while the DB is slow, causing more misses. Detection: miss rate spikes 10x in 5 seconds. DB query queue exceeds 1,000.
- Request coalescing (single-flight): the first miss acquires a lock, fetches from database, populates cache. Subsequent requests for the same key wait for the result. 10,000 misses become 1 database query.
- Staggered TTLs: add rand(0, 60s) jitter to base TTL. Keys expire at different times, preventing synchronized thundering herd.
- Circuit breaker: if miss rate exceeds 10x normal for 5 seconds, reject new cache-miss database queries. Return stale cached value if available (stale-while-revalidate).
- Pre-warm: for known popular keys, refresh the cache 60 seconds before expiry using a background process.
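Request coalescing can be sketched with one event per in-flight key; `SingleFlight` is an illustrative name, not a library class:

```python
import threading

class SingleFlight:
    # Coalesces concurrent misses: the first caller for a key runs the
    # loader; every other caller blocks on an event and shares the result.
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (event, result box)

    def do(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        event, box = entry
        if leader:
            try:
                box["value"] = loader()  # the one database query
            except Exception as exc:     # propagate failures to waiters too
                box["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()
        else:
            event.wait()
        if "error" in box:
            raise box["error"]
        return box["value"]
```

Staggered TTLs are a one-liner on top of this: set `ttl = base_ttl + random.uniform(0, 60)` when populating the cache, so keys written together do not expire together.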