Automatic Range Rebalancing

When one node holds 150 ranges and the cluster average is 100, the control plane must act. The rebalancer monitors three signals per node: range count, disk usage, and CPU utilization.
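As a rough illustration of the check described above, the sketch below flags any node whose signal exceeds the cluster mean by more than a threshold. The field names (`range_count`, `disk_used_gb`, `cpu_util`), the function name, and the choice to apply the same 20% threshold to all three signals are assumptions made for this example, not details from the source.

```python
from statistics import mean

THRESHOLD = 0.20  # the 20% figure quoted in this section; applying it to every signal is an assumption

# Hypothetical signal names for the three monitored quantities.
SIGNALS = ("range_count", "disk_used_gb", "cpu_util")


def overloaded_signals(nodes, threshold=THRESHOLD):
    """Return {node: [signals above cluster mean by more than threshold]}."""
    means = {s: mean(stats[s] for stats in nodes.values()) for s in SIGNALS}
    result = {}
    for name, stats in nodes.items():
        over = [
            s for s in SIGNALS
            if means[s] > 0 and stats[s] > means[s] * (1 + threshold)
        ]
        if over:
            result[name] = over
    return result


# The 150-vs-100 example: three nodes whose range counts average to 100.
cluster = {
    "n1": {"range_count": 150, "disk_used_gb": 400, "cpu_util": 0.5},
    "n2": {"range_count": 75, "disk_used_gb": 400, "cpu_util": 0.5},
    "n3": {"range_count": 75, "disk_used_gb": 400, "cpu_util": 0.5},
}
flagged = overloaded_signals(cluster)
```

Here only `n1` is flagged, and only on range count: 150 is 50% above the mean of 100, well past the 20% line, while its disk and CPU readings sit exactly at the mean.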

At steady state, the cluster performs roughly 100 range moves per hour. Each move transfers up to 512 MB of data, with bandwidth capped at 14 MB/sec (about 112 Mbps, roughly 1% of the donor node's 10 Gbps NIC) so that rebalancing traffic cannot saturate the link.
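The numbers above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: it assumes decimal units (1 MB = 8 Mb, 1 Gbps = 1000 Mbps) and treats every move as a full 512 MB transfer, which is an upper bound rather than a typical case. The function names are illustrative.

```python
MOVE_SIZE_MB = 512      # maximum data per move, from the text
RATE_MB_PER_S = 14      # per-move bandwidth cap, from the text
NIC_GBPS = 10           # donor NIC speed, from the text
MOVES_PER_HOUR = 100    # steady-state move rate, from the text


def move_seconds(size_mb=MOVE_SIZE_MB, rate_mb_per_s=RATE_MB_PER_S):
    """Time to copy one range at the capped rate."""
    return size_mb / rate_mb_per_s


def nic_fraction(rate_mb_per_s=RATE_MB_PER_S, nic_gbps=NIC_GBPS):
    """Share of the NIC consumed by one capped transfer (decimal units)."""
    return (rate_mb_per_s * 8) / (nic_gbps * 1000)


def hourly_volume_gb(moves=MOVES_PER_HOUR, size_mb=MOVE_SIZE_MB):
    """Upper bound on data moved per hour, in GB."""
    return moves * size_mb / 1000


print(f"one move: {move_seconds():.1f} s")
print(f"NIC share: {nic_fraction():.2%}")
print(f"hourly volume: {hourly_volume_gb():.1f} GB")
```

A full-size move takes about 37 seconds and uses just over 1% of the NIC. At 100 such moves per hour, the total transfer time is about one hour, so on average roughly one move is in flight cluster-wide at any moment.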

“We chose a 20% variance threshold (not 5%) because lower thresholds cause range churn: ranges bounce between nodes faster than they settle, wasting bandwidth and destabilizing the cluster.”

The move uses a two-phase transfer: the target node catches up by replaying the range's Raft log, then the control plane atomically swaps metadata via Raft. The source continues serving reads throughout.
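The two phases can be modeled in a few lines. This is a toy sketch, not CockroachDB's implementation: `Replica`, `RangeMeta`, and `move_range` are hypothetical names, the log is a plain list, and the Raft-committed metadata swap is represented by a single assignment.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    applied: list = field(default_factory=list)  # log entries applied so far


@dataclass
class RangeMeta:
    owner: str  # node currently responsible for the range


def move_range(raft_log, source, target, meta, batch=2):
    """Two-phase move: catch up the target, then swap ownership."""
    # Phase 1: the target replays the log in batches. Ownership is unchanged,
    # so the source keeps serving reads for the whole phase.
    while len(target.applied) < len(raft_log):
        assert meta.owner == source.name
        start = len(target.applied)
        target.applied.extend(raft_log[start:start + batch])
    # Phase 2: flip ownership in one step. A real system would commit this
    # change through Raft; a single assignment stands in for that here.
    meta.owner = target.name
    return meta


log = ["put a=1", "put b=2", "put a=3"]
src = Replica("n1", list(log))
dst = Replica("n2")
meta = RangeMeta(owner="n1")
move_range(log, src, dst, meta)
```

The ordering is the point: ownership flips only after the target holds every log entry, so no reader is ever directed to a replica that is behind.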

CockroachDB uses hysteresis to prevent oscillation: rebalancing triggers when a node deviates by more than 20% and stops only once the deviation falls back to 15%. The trade-off is slower convergence to perfect balance in exchange for cluster stability.
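A minimal sketch of the hysteresis rule, assuming "deviation" means a node's range count relative to the cluster mean and modeling only the overloaded side. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HysteresisRebalancer:
    trigger: float = 0.20  # start shedding ranges above this deviation
    stop: float = 0.15     # stop once deviation falls to this level
    active: bool = False

    def update(self, node_ranges: int, cluster_mean: float) -> bool:
        """Return True while the node should keep shedding ranges."""
        deviation = (node_ranges - cluster_mean) / cluster_mean
        if not self.active and deviation > self.trigger:
            self.active = True
        elif self.active and deviation <= self.stop:
            self.active = False
        return self.active


r = HysteresisRebalancer()
states = [r.update(n, 100) for n in (150, 118, 114, 118)]
```

With a cluster mean of 100, the sequence 150, 118, 114, 118 yields `[True, True, False, False]`. The same reading of 118 produces different decisions depending on history: it keeps an active rebalance going but does not start a new one. That gap between the two thresholds is what keeps ranges from bouncing.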

Why it matters in interviews

Rebalancer questions test whether we understand when NOT to act. Explaining the 20% threshold with hysteresis and the two-phase transfer shows we can reason about the trade-off between stability and responsiveness.
