Single Metadata Server Without Consensus
Very Common
Storing range metadata in one MySQL instance means a single point of failure for every routing decision in the cluster. If that instance dies, no queries can be routed.
Why: Candidates default to MySQL because it is familiar. They underestimate how critical range metadata is: every query depends on it. A single MySQL instance handles the 200 MB dataset easily, so the bottleneck is not size but availability. Without consensus replication, a crash takes down the entire routing layer.
WRONG: Store range metadata in a single MySQL instance. One server crash makes 1M ranges unroutable. Failover with async replication risks losing recent range assignments, causing split-brain: two nodes believe they own the same range.
RIGHT: Store range metadata in a 5-node Raft group for linearizable reads and F=2 fault tolerance. Every range assignment is committed through consensus before being applied. We chose embedded Raft (not ZooKeeper) to avoid a separate failure domain. Trade-off: Raft adds 2-5ms commit latency per metadata update, but metadata changes are infrequent (~1K/sec from heartbeats, ~100/hour from rebalances).
Gossip-Based Failure Detection for Range Leadership
Common
Using gossip to determine which node leads a range leaves a probabilistic convergence window where two nodes might simultaneously believe they are the leader.
Why: Gossip is well-known from Cassandra and distributed hash tables. It scales to thousands of nodes with O(log N) convergence rounds. But gossip trades consistency for availability. For range leadership, we need the opposite: it is better to delay detection by a few seconds than to allow two leaders to exist simultaneously.
WRONG: Use gossip-based failure detection (phi-accrual or similar). Gossip converges in O(log N) rounds, but convergence is probabilistic. During convergence, two nodes can serve writes for the same range, causing data divergence that requires expensive reconciliation.
RIGHT: Use lease-based detection with 9-second leases. The old leader checks its lease before every operation. When the lease expires, it stops serving, period. The new leader is elected only after the old lease is guaranteed to have expired. We chose leases (not gossip) because deterministic fencing prevents split-brain at the range level. Trade-off: slower detection (~12 seconds vs ~5 seconds for gossip), but zero ambiguity about leadership.
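A minimal sketch of the check-lease-before-every-operation rule, with an injectable clock so expiry can be simulated. The class and names are illustrative, not from any specific system:

```python
import time

class LeaseHolder:
    """Range leader that fences itself when its lease expires.

    Hypothetical sketch: the 9-second lease matches the text above,
    but the clock source and API are illustrative assumptions.
    """
    LEASE_SECONDS = 9

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lease_expiry = 0.0

    def renew(self):
        # A successful heartbeat from the control plane extends the lease.
        self._lease_expiry = self._clock() + self.LEASE_SECONDS

    def serve(self, op):
        # Check the lease before EVERY operation; once it has expired,
        # refuse to serve rather than risk two concurrent leaders.
        if self._clock() >= self._lease_expiry:
            raise RuntimeError("lease expired: stepping down")
        return op()
```

Because the old leader stops itself on expiry, the new leader can be elected without ever overlapping with the old one.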
Stop-the-World Range Splits
Common
Pausing all queries during a range split creates unavailability proportional to data size. At 512 MB and 14 MB/sec transfer, that is 36 seconds of downtime per split.
Why: The simplest split implementation pauses the range, copies data, updates metadata, and resumes. Candidates choose this because it avoids reasoning about concurrent queries during the split. But with ~100 splits/hour at steady state, the cluster would accumulate 60 minutes of downtime per hour, which is absurd.
WRONG: Pause all queries to the range, split the data, update metadata, resume. At ~100 splits/hour, each taking 36 seconds, the cluster spends 60 minutes/hour with some ranges unavailable. This makes the split mechanism itself a source of outages.
RIGHT: Use a two-phase split. Phase 1: the old range leader creates two new ranges locally while continuing to serve all queries. Phase 2: the control plane commits the split to Raft, incrementing the epoch. In-flight queries on the old range are served; new queries route to the new ranges via the updated epoch. We chose two-phase (not stop-the-world) because zero-downtime operations are non-negotiable for a production database.
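The two phases can be sketched as follows. The Raft commit is simulated here by an atomic metadata swap plus an epoch bump; the class and key names are illustrative assumptions:

```python
class RangeMetadata:
    """Toy metadata table: range_id -> (start_key, end_key).

    Hypothetical sketch: the consensus commit in phase 2 is modeled
    as a single atomic dict replacement, not a real Raft log entry.
    """
    def __init__(self):
        self.epoch = 1
        self.ranges = {"r1": ("a", "z")}

    def two_phase_split(self, range_id, split_key):
        start, end = self.ranges[range_id]
        # Phase 1: the leader prepares both halves locally. The old
        # range keeps serving queries; nothing is paused here.
        left = (start, split_key)
        right = (split_key, end)
        # Phase 2: commit the split (through Raft in a real system)
        # and increment the epoch so routers pick up the new ranges
        # while in-flight queries on the old epoch drain out.
        new_ranges = dict(self.ranges)
        del new_ranges[range_id]
        new_ranges[range_id + ".l"] = left
        new_ranges[range_id + ".r"] = right
        self.ranges = new_ranges
        self.epoch += 1
        return self.epoch
```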
Rebalancing on Every Minor Imbalance
Common
Triggering rebalance at 5% variance causes range churn: ranges bounce between nodes faster than they settle, wasting bandwidth and destabilizing the cluster.
Why: It seems logical to keep load perfectly balanced. But a 5% variance threshold on a 10K-node cluster triggers hundreds of moves per hour. Each move transfers up to 512 MB at 14 MB/sec. The aggregate bandwidth consumed by unnecessary rebalancing can saturate donor nodes and degrade query latency for all ranges they serve.
WRONG: Trigger rebalance at 5% variance. On a 10K-node cluster, natural load fluctuation means hundreds of ranges constantly exceed this threshold. The rebalancer enters a thrashing loop: move range A from node 1 to node 2, then node 2 becomes overloaded, move range B away. Net effect: continuous data shuffling with no stability.
RIGHT: Use a 20% trigger threshold with hysteresis: start rebalancing at 20% variance, stop at 15%. This prevents oscillation. Steady state: ~100 moves/hour at up to 512 MB each, consuming ~51 GB/hour of aggregate bandwidth (~14 MB/sec, trivial for 10 Gbps NICs). We chose hysteresis (not a fixed threshold) because it creates a dead zone where small fluctuations are absorbed without action.
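The hysteresis decision itself is a few lines. This sketch assumes variance is expressed as a fraction and that the rebalancer tracks whether it is mid-cycle; the names are illustrative:

```python
START_THRESHOLD = 0.20  # begin rebalancing at 20% variance
STOP_THRESHOLD = 0.15   # keep rebalancing until variance drops to 15%

def should_rebalance(variance, currently_rebalancing):
    """Hysteresis: the 15-20% band is a dead zone.

    Idle clusters ignore fluctuations below 20%; an active rebalance
    runs until variance falls below 15%, so the system cannot oscillate
    around a single threshold.
    """
    if currently_rebalancing:
        return variance > STOP_THRESHOLD
    return variance > START_THRESHOLD
```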
Locking the Entire Table for Schema Changes
Common
Running ALTER TABLE on a billion-row table acquires a metadata lock that blocks all reads and writes for the duration of the operation, potentially hours.
Why: ALTER TABLE is the standard SQL DDL command. On a single-node database, it works (with downtime). On a distributed database with 1M ranges, the lock would need to be held across all ranges simultaneously. Since ranges are on different nodes, a distributed lock for DDL is complex and fragile. Candidates default to ALTER TABLE because it is familiar.
WRONG: Run ALTER TABLE ADD COLUMN across all ranges. Each range acquires a metadata lock while applying the change. With 1M ranges processed serially at 1ms each: 17 minutes of table-wide blocking. No reads or writes can proceed during this time.
RIGHT: Use the ghost table approach: create a shadow table with the new schema, backfill in batches, capture changes via binlog, and atomically swap. Each range is locked for only 1ms during the final swap. Parallelized across 100 workers, the entire 1M-range cluster completes in under 30 seconds. We chose ghost tables (not native in-place DDL) because the approach has been battle-tested at GitHub scale (gh-ost).
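A rough sketch of the backfill-and-swap flow. Binlog change capture is omitted, and the function, batch size, and column names are hypothetical:

```python
def ghost_migrate(old_rows, transform, batch_size=2):
    """Shadow-table migration sketch: backfill in small batches so no
    batch holds a table-wide lock, then swap atomically at the end.

    In a real system the swap is a metadata rename and binlog capture
    replays writes that landed during the backfill; both are elided here.
    """
    shadow = []
    for i in range(0, len(old_rows), batch_size):
        # Each batch copies one slice into the shadow table; queries
        # against the old table continue unblocked in the meantime.
        shadow.extend(transform(row) for row in old_rows[i:i + batch_size])
    # Final step: the atomic swap, the only moment a brief lock is held.
    return shadow
```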
Ignoring Placement Constraints During Rebalancing
Common
Moving a range to the closest available node without checking rack/zone constraints can place all 3 replicas in the same rack, making a rack failure catastrophic.
Why: The rebalancer optimizes for speed: move the range to the nearest node with available capacity. But 'nearest' often means same rack (lowest network latency). After several rebalance cycles, replicas that were originally spread across racks may converge to a single rack. A rack power failure then loses all 3 copies.
WRONG: The rebalancer picks the target node with the lowest latency and most free disk space, ignoring physical topology. After 10 rebalance cycles, all 3 replicas of a range end up in rack 5. A single rack failure loses the range entirely.
RIGHT: The rebalancer queries the placement constraint solver before every move. The solver checks: target node is in a different rack/zone than existing replicas, has sufficient disk, and is below CPU threshold. We chose constraint-aware placement (not latency-optimized) because rack-level failures are common (power supply, top-of-rack switch) and losing all replicas is unrecoverable.
Polling All 1M Ranges for Heartbeat Status
Common
Iterating all 1M range entries to find dead leaders is O(N) per heartbeat cycle. At 1K heartbeats/sec, this wastes CPU scanning 999K healthy ranges.
Why: The naive implementation checks each range's leader status on every heartbeat: 'is this range's leader still alive?' With 1M ranges and 1K heartbeats/sec, that is 1 billion range checks per second. The correct approach flips the model: track node liveness, not range liveness. When a node dies, look up which ranges it led.
WRONG: On each heartbeat, scan all 1M range entries to check if their leaders are alive. At 1K heartbeats/sec, this is 1B range checks/sec. The metadata leader spends all its CPU on scanning instead of processing metadata updates.
RIGHT: Index the node_registry by (status, lease_expiry). The failure detector scans only nodes with expired leases, typically 0-2 nodes at any time. When a node is marked dead, use the index on range_metadata(leader_node_id) to find its ~100 ranges. We chose node-centric detection (not range-centric) because 10K nodes is 100x fewer entries to scan than 1M ranges.
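A sketch of the index-driven scan, assuming a sorted (lease_expiry, node_id) list standing in for the node_registry index and a dict standing in for the leader_node_id secondary index; both structures and names are hypothetical:

```python
import bisect

def find_dead_nodes(lease_index, now):
    """lease_index: list of (lease_expiry, node_id), kept sorted.

    Binary-search the node index instead of scanning 1M ranges; only
    entries left of the cut have expired leases (typically 0-2 nodes).
    """
    cut = bisect.bisect_left(lease_index, (now,))
    return [node_id for _, node_id in lease_index[:cut]]

def ranges_led_by(range_index, node_id):
    """range_index: dict of leader_node_id -> set of range ids,
    i.e. the secondary index on range_metadata(leader_node_id)."""
    return range_index.get(node_id, set())
```

Together the two lookups replace an O(N)-over-ranges scan with a binary search over nodes plus one index fetch for the dead node's ranges.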
Trusting Client-Reported Range Ownership
Common
Allowing nodes to self-report range ownership without control plane verification leads to split-brain when two nodes claim the same range after a network partition.
Why: In a decentralized design, each node tracks which ranges it owns and reports this to the cluster. After a network partition heals, two nodes may both claim ownership of a range. Without a central authority, there is no tiebreaker. This design confuses 'what I think I own' with 'what the control plane has assigned me.'
WRONG: Nodes self-report range ownership. After a network partition, two nodes claim range R. Other nodes receive conflicting membership updates and cannot determine the true owner. Queries to range R return inconsistent results depending on which node answers.
RIGHT: The control plane is the sole authority for range ownership. Every assignment goes through Raft consensus. Nodes verify their ownership via epoch checks: if a node's epoch is lower than the control plane's, its ownership claim is stale. Clients reject responses tagged with lower epochs. We chose centralized authority (not decentralized) because range ownership requires linearizability, which only consensus can provide.
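The epoch checks might be sketched like this; the function names and the shapes of the claim and response records are assumptions for illustration:

```python
def ownership_is_current(control_plane_epoch, node_claim):
    """A node's ownership claim is stale if its epoch is lower than the
    epoch the control plane has committed through consensus."""
    return node_claim["epoch"] >= control_plane_epoch

def client_accepts(client_known_epoch, response):
    """Client-side fencing: reject any response tagged with a lower
    epoch than the newest epoch the client has already observed."""
    return response["epoch"] >= client_known_epoch
```

After a partition heals, the node holding the stale (lower) epoch fails both checks, so neither the control plane nor clients can be fooled into honoring its claim on the range.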