Anti-Patterns

Consistent Hashing Anti-Patterns

Common design mistakes candidates make. Learn what goes wrong and how to avoid each trap in your interview.

Using Modular Hashing (hash % N) for Dynamic Clusters

Very Common

Modular hashing remaps 99% of keys on any topology change. For a 500M-key cache, one node failure triggers 495M cache misses cascading to the database.

Why: Modular hashing is taught in every CS101 hash table lecture. It is the first thing that comes to mind: hash the key, mod by server count, done. It works perfectly for static clusters. But caches are dynamic: nodes fail at 3 AM, new nodes are added during scaling events. The 99% remapping rate is catastrophic because it converts a single node failure into a cluster-wide cache invalidation event.

WRONG: Assign keys to servers via hash(key) % N. When N changes from 100 to 99 (one failure), 99% of keys remap to different servers. 495M cache misses flood the database in seconds. The database, sized for 5% of traffic (the normal miss rate), is overwhelmed by 20x its expected load.
RIGHT: Use a hash ring where keys and servers are mapped onto a circular space. On node removal, only keys in the failed node's arc move to the clockwise successor: 1/N = 1% of total keys. We chose the ring (not rendezvous hashing) because the ring supports efficient replica placement via clockwise walk and handles weighted nodes via vnode count. Trade-off accepted: ring requires 307 KB of metadata versus zero for modular hashing.
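The 99% remapping rate is easy to verify empirically. A minimal sketch (key names and cluster sizes are illustrative; md5 stands in for whatever stable hash the system uses):

```python
import hashlib

def bucket(key: str, n: int) -> int:
    # Stable 64-bit hash; Python's built-in hash() is salted per process.
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % n

keys = [f"user:{i}" for i in range(100_000)]
before = {k: bucket(k, 100) for k in keys}   # 100 healthy servers
after = {k: bucket(k, 99) for k in keys}     # one server fails: N drops to 99
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.1%} of keys remapped")  # roughly 99%
```

On a hash ring, the same failure would move only the keys in the failed node's arc, about 1% of the total.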

One Ring Position Per Physical Node (No Virtual Nodes)

Very Common

Without virtual nodes, 3 servers produce arc sizes varying by 250%. One server handles 60% of keys while another handles 10%.

Why: Candidates draw the consistent hashing ring correctly but place only one position per server. With 3 servers, the hash function produces 3 random positions on a $2^{32}$ ring. Random placement of 3 points creates arcs with high variance: the gaps between uniformly random points are approximately exponentially distributed, so the standard deviation of a single server's arc is roughly equal to its mean. In practice, arcs vary by 250% or more.

WRONG: Place one ring position per physical server. With 3 servers, arc sizes are random: one server might own 60% of the ring while another owns 10%. The overloaded server's cache fills up and starts evicting, while underloaded servers waste memory.
RIGHT: Place 256 virtual nodes per server on the ring. With 100 servers: 25,600 positions. The law of large numbers smooths per-server load: each server's share is the sum of 256 arcs, so its relative standard deviation is $1/\sqrt{256} \approx 6\%$. Measured load variance drops from 250% to under 10%. We chose 256 (not 32 or 1024) because 32 vnodes still leave $1/\sqrt{32} \approx 18\%$ variance while 1024 quadruples metadata size for only marginal improvement. Trade-off accepted: vnodes complicate rebalancing (a new node claims positions from multiple donors) but the load balance benefit is worth it.
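A ring with virtual nodes fits in a few lines. This is a sketch, not the production layout: node names, the md5-based hash, and the 10-node cluster are illustrative.

```python
import bisect
import hashlib
from collections import Counter

def h(s: str) -> int:
    # 64 bits of md5: stable across processes, good enough for a sketch.
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes, vnodes=256):
        # Each server appears at `vnodes` pseudo-random ring positions.
        self.ring = sorted((h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def owner(self, key: str) -> str:
        # First position clockwise of the key's hash (wrapping past the end).
        i = bisect.bisect(self.points, h(key)) % len(self.points)
        return self.ring[i][1]

ring = Ring([f"cache-{i}" for i in range(10)])
load = Counter(ring.owner(f"key:{i}") for i in range(100_000))
# With 256 vnodes per node, each of the 10 nodes owns close to 10,000 keys.
```

Dropping `vnodes` to 1 in this sketch reproduces the anti-pattern: one node's share balloons while another's shrinks.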

Placing All Replicas on the Same Physical Rack

Common

Blindly replicating to the next RF-1 ring positions can place all copies on one rack. A single rack failure loses all replicas.

Why: The replication algorithm walks clockwise and picks the next RF-1 positions. In a rack-unaware implementation, two consecutive vnodes might belong to servers in the same rack. With 4 racks and 100 servers, about 25% of consecutive vnode pairs are on the same rack. RF=3 replication without rack awareness has a non-trivial probability of placing all 3 copies in one rack. A rack-level failure (power supply, top-of-rack switch) then loses all copies.

WRONG: Walk clockwise and take the next RF-1 vnodes, regardless of physical location. With servers in 4 racks, there is a ~6% probability that all 3 replicas land in the same rack. A single rack power failure loses all copies of those keys.
RIGHT: Extend the clockwise walk: skip vnodes that belong to a physical node already selected OR in the same rack as an already-selected node. Continue until RF distinct nodes in distinct racks are found. Cassandra and DynamoDB both implement this rack-aware replica placement. Trade-off accepted: the walk may skip more vnodes, and in a 2-rack setup, RF=3 is impossible without cross-rack writes. We require at least 3 racks for RF=3.
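The rack-aware clockwise walk can be sketched as follows. The ring layout, node names, and rack assignments are made up for illustration; the skip logic is the point.

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

# Rack assignment per physical node, and a small vnode ring over those nodes.
RACK = {"a": "r1", "b": "r1", "c": "r2", "d": "r2", "e": "r3", "f": "r3"}
RING = sorted((h(f"{n}#{v}"), n) for n in RACK for v in range(64))
POINTS = [p for p, _ in RING]

def replicas(key: str, rf: int = 3):
    """Clockwise walk that skips already-selected nodes and racks."""
    start = bisect.bisect(POINTS, h(key))
    chosen, racks = [], set()
    for i in range(len(RING)):
        _, node = RING[(start + i) % len(RING)]
        if node in chosen or RACK[node] in racks:
            continue  # same node, or same rack as an earlier pick: skip
        chosen.append(node)
        racks.add(RACK[node])
        if len(chosen) == rf:
            break
    return chosen

owners = replicas("user:42")
# rf distinct nodes in rf distinct racks
```

Removing the `RACK[node] in racks` check reproduces the anti-pattern: consecutive vnodes from the same rack can all be selected.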

Synchronous Rebalancing That Blocks Client Requests

Common

Blocking all requests during rebalancing means 42 seconds of downtime. At 115K ops/sec, that is 4.8M failed requests cascading to the database.

Why: The simplest rebalancing implementation: (1) update the ring topology, (2) transfer data, (3) resume serving. Steps 1 through 3 take 42 seconds for a single node addition at 1 Gbps. During that window, keys in the transferring range have no valid owner. The gateway does not know whether to route to the old node (which is shedding data) or the new node (which does not have it yet).

WRONG: Stop accepting requests for the affected key range during rebalancing. At 115K ops/sec with ~1% of keys affected, that is 1,150 blocked requests/sec for 42 seconds = 48,300 failed requests. Alternatively, if the gateway keeps routing all keys during the rebalance, it sends requests to the new node, which returns misses and triggers unnecessary database queries.
RIGHT: Use a two-phase bootstrap: (1) new node copies data from donors in the background while the old owner continues serving, (2) once caught up, atomically update the ring topology and switch traffic. During phase 1, reads go to the old owner (always correct). During the atomic switch, there is a sub-second window where some gateways have the old ring and some have the new ring. Both the old and new node serve the overlapping range during this window. Trade-off accepted: brief double-serving wastes some bandwidth but ensures zero downtime.
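The routing rule for one migrating range can be sketched as a small state machine. This is a simplification: the real switch is a single ring-epoch bump, not a per-range flag, and the dual-write during the copy is an assumption on my part (one common way to keep the new node from missing updates that land after its snapshot).

```python
COPYING, SWITCHED = "copying", "switched"

class RangeMigration:
    """Routing for one key range during a two-phase bootstrap (sketch)."""
    def __init__(self, old_owner: str, new_owner: str):
        self.old, self.new = old_owner, new_owner
        self.phase = COPYING             # phase 1: background copy in progress

    def route_read(self) -> str:
        # Phase 1: the old owner still has everything, so it is always correct.
        return self.old if self.phase == COPYING else self.new

    def route_write(self):
        # Assumption: dual-write during the copy so the new node never
        # misses an update that arrives after its snapshot was taken.
        return [self.old, self.new] if self.phase == COPYING else [self.new]

    def switch(self):
        # Stands in for the atomic ring-epoch bump.
        self.phase = SWITCHED

m = RangeMigration("cache-3", "cache-17")
assert m.route_read() == "cache-3"       # reads stay on the old owner
m.switch()
assert m.route_read() == "cache-17"      # traffic moves only after catch-up
```

The anti-pattern corresponds to flipping to SWITCHED before the copy finishes, leaving the range with no node that actually holds the data.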

Ignoring Hot Keys Because 'The Ring Distributes Evenly'

Common

Virtual nodes balance key count, not request volume. A single viral key receiving 80% of reads overloads one node regardless of how many vnodes exist.

Why: After explaining virtual nodes and 10% load variance, candidates assume the system is balanced. They conflate key distribution with load distribution. Virtual nodes guarantee each server holds approximately $1/N$ of total keys. But if one key receives 80% of all reads, the server holding that key gets 80% of total read traffic. At 115K reads/sec, that is 92K reads/sec to one server while each of the other 99 servers handles ~230 reads/sec.

WRONG: Rely on virtual nodes for load balancing. Declare the system 'evenly distributed' because each node holds ~1% of keys. Ignore that a single hot key sends 92K reads/sec to one node while others handle 230 reads/sec. The hot node's latency degrades; cache hit ratio drops as it starts evicting entries under memory pressure.
RIGHT: Monitor per-node CPU divergence (one node at 90% while peers are at 30%). When detected, split the hot key into 10 sub-keys (key:0 through key:9). Clients append a random suffix to reads, spreading load across up to 10 nodes. For writes, update all 10 sub-keys (10x write amplification, acceptable for read-heavy hot keys). Alternative: use bounded-load consistent hashing to cap per-node load at $(1+\epsilon) \times$ the average. Trade-off accepted: hot key splitting requires client coordination and metadata about which keys are hot.
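The client-side splitting logic is small. A sketch, assuming the set of hot keys is distributed to clients out of band (how that metadata propagates is out of scope here):

```python
import random

SPLITS = 10  # sub-keys per hot key; matches the key:0 through key:9 scheme

def read_key(key: str, hot_keys: set) -> str:
    # Reads of a hot key pick a random sub-key, spreading load across
    # up to SPLITS different ring positions (and thus nodes).
    if key in hot_keys:
        return f"{key}:{random.randrange(SPLITS)}"
    return key

def write_keys(key: str, hot_keys: set) -> list:
    # Writes fan out to every sub-key: 10x write amplification,
    # acceptable for read-heavy hot keys.
    if key in hot_keys:
        return [f"{key}:{i}" for i in range(SPLITS)]
    return [key]

hot = {"video:viral"}
```

Cold keys pass through untouched, so the scheme only pays its write amplification on the handful of keys that need it.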

Storing Ring Topology Only in Memory (No Durable Backup)

Common

If the ring coordinator restarts, all topology is lost. Gateways with stale rings route to wrong nodes. A small (1.25 MB) MySQL table prevents this entirely.

Why: The ring metadata is small (307 KB) and changes rarely. Storing it in memory on the coordinator seems sufficient: fast reads, no database dependency. But if the coordinator process crashes or the machine reboots, the ring topology must be reconstructed from scratch. During reconstruction, gateways cannot refresh their rings. If a node failed during the outage, gateways still route to the dead node.

WRONG: Store ring topology only in the coordinator's memory. On coordinator restart, reconstruct the ring by querying all live nodes. Reconstruction takes 30-60 seconds (waiting for node responses). During that window, gateways use stale ring data. If a node failed during the outage, gateways route to the dead node, causing cache misses cascading to the database.
RIGHT: Persist ring topology in a MySQL ring_topology table (25,600 rows, 1.25 MB). On coordinator restart, load from MySQL in milliseconds. Gateways cache the ring locally and refresh every 5 seconds via epoch comparison. Even if the coordinator is down for minutes, gateways have a recent valid ring. We chose MySQL (not etcd) because ring changes happen a few times per week, and MySQL is already in the infrastructure. Trade-off accepted: MySQL dependency, but it is read-only in the hot path and the data fits in a single page.
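The gateway-side refresh loop might look like the following sketch. The two fetch callbacks stand in for the MySQL reads (one cheap epoch-row read, one full ring load) and are assumptions, not the system's actual API:

```python
import time

class GatewayRingCache:
    """Gateway-local ring cache refreshed by epoch comparison (sketch)."""
    def __init__(self, fetch_epoch, fetch_ring, interval: float = 5.0):
        self.fetch_epoch = fetch_epoch   # cheap: reads a single epoch row
        self.fetch_ring = fetch_ring     # heavier: loads all vnode rows
        self.interval = interval
        self.epoch = fetch_epoch()
        self.ring = fetch_ring()
        self._next = time.monotonic() + interval

    def maybe_refresh(self):
        # Called on the hot path; does nothing until the interval elapses.
        now = time.monotonic()
        if now < self._next:
            return
        self._next = now + self.interval
        e = self.fetch_epoch()
        if e > self.epoch:               # only reload the ring when it changed
            self.epoch = e
            self.ring = self.fetch_ring()

# Simulated coordinator state for a quick check:
state = {"epoch": 1}
loads = {"ring": 0}
def fetch_epoch(): return state["epoch"]
def fetch_ring():
    loads["ring"] += 1
    return f"ring@{state['epoch']}"

gw = GatewayRingCache(fetch_epoch, fetch_ring, interval=0.0)
state["epoch"] = 2                       # coordinator publishes a new topology
gw.maybe_refresh()
```

Because the epoch check is a single-row read, the 5-second polling costs almost nothing while the ring is stable.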

No Request Coalescing on Cache Miss Storms

Common

When a popular key expires, 10,000 concurrent misses all query the database. Request coalescing deduplicates them into 1 database query.

Why: Cache miss handling is straightforward: if the key is not in cache, fetch from database, store in cache, return. This works at low concurrency. But when a popular key expires, thousands of requests arrive within the same millisecond, all find a cache miss, and all independently query the database. At 10,000 concurrent requests for the same key, the database receives 10,000 identical queries. If the query takes 50ms, the database is blocked for 500,000 query-milliseconds.

WRONG: Every cache miss independently queries the database. A popular key expiring triggers 10,000 concurrent identical database queries. The database, sized for 5% miss rate (5,750 queries/sec), suddenly receives 10,000 queries in 1 second for one key. Other queries are starved; latency spikes to seconds.
RIGHT: Implement request coalescing (also called single-flight): the first miss for a key acquires a lock, fetches from database, populates cache, and unblocks all waiting requests. The 9,999 subsequent requests wait for the result instead of querying independently. We chose coalescing at the gateway (not at the cache node) because the gateway already deduplicates requests before they reach the cache. Combined with staggered TTLs (base TTL + random jitter of 0-60 seconds) to prevent synchronized expiry. Trade-off accepted: coalescing adds ~1ms of lock overhead per miss, but eliminates 9,999 redundant database queries.
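A minimal single-flight can be built from a lock and a per-key event. This is a sketch (error handling on a failed fetch is elided; key and timing values are illustrative):

```python
import threading
import time

class SingleFlight:
    """Deduplicate concurrent fetches of the same key (sketch)."""
    def __init__(self):
        self._mu = threading.Lock()
        self._inflight = {}              # key -> Event carrying the result

    def do(self, key, fetch):
        with self._mu:
            ev = self._inflight.get(key)
            if ev is None:               # first miss: this caller is the leader
                ev = threading.Event()
                self._inflight[key] = ev
                is_leader = True
            else:
                is_leader = False
        if is_leader:
            ev.result = fetch()          # the one real database query
            with self._mu:
                del self._inflight[key]
            ev.set()                     # wake every waiter
            return ev.result
        ev.wait()                        # followers block instead of querying
        return ev.result

calls = []
sf = SingleFlight()

def slow_fetch():
    calls.append(1)
    time.sleep(0.2)                      # simulate a 200ms database query
    return "value"

threads = [threading.Thread(target=lambda: sf.do("hot-key", slow_fetch))
           for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
# len(calls) is far below 50: concurrent misses collapsed into ~1 fetch
```

The result is assigned before `ev.set()`, so every waiter observes it; requests arriving after the leader finishes start a fresh fetch, which is the desired behavior once the cache is repopulated.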

Using Wall Clocks for Ring Versioning

Occasional

Wall clocks across servers can differ by seconds. Two concurrent ring updates with close timestamps create ambiguous ordering. Monotonic epoch numbers eliminate this.

Why: Using timestamps to version ring changes seems natural: the most recent change has the highest timestamp. But distributed systems have clock skew. NTP keeps clocks within 1-10ms, but during network partitions, skew can grow to seconds. Two ring updates at similar times could have reversed timestamps on different gateways, leading to one gateway accepting a ring version that another rejects.

WRONG: Version ring changes with wall clock timestamps. Two ring updates within 10ms of each other may appear in different order on different gateways due to clock skew. Gateway A sees update X as newer, gateway B sees update Y as newer. The cluster has inconsistent ring views.
RIGHT: Use monotonically increasing epoch numbers assigned by the single ring coordinator. Each topology change increments the epoch. Gateways always accept higher epochs. There is no ambiguity: epoch 42 is always newer than epoch 41 regardless of wall clock time. We chose a single coordinator (not distributed consensus) because ring changes are rare (a few per week) and the coordinator is not in the hot path. Trade-off accepted: the coordinator is a single point of failure for ring changes (not for cache reads). Failover to a standby coordinator with the same MySQL backend takes seconds.
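The gateway-side acceptance rule reduces to a single comparison. A sketch (class and field names are illustrative):

```python
class RingView:
    """Gateway-side ring state versioned by a monotonic epoch (sketch)."""
    def __init__(self):
        self.epoch = 0
        self.ring = None

    def apply(self, epoch: int, ring) -> bool:
        # Strictly-newer epochs win; equal or older updates are ignored.
        # No wall clocks involved, so two gateways can never disagree
        # about which of two updates is newer.
        if epoch <= self.epoch:
            return False
        self.epoch, self.ring = epoch, ring
        return True

gw = RingView()
assert gw.apply(41, "ring-v41")
assert gw.apply(42, "ring-v42")
assert not gw.apply(41, "ring-v41")   # late-arriving older update: rejected
```

With timestamp versioning, the last assertion could go either way depending on each machine's clock; with coordinator-assigned epochs, every gateway makes the same decision.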