DB Control Plane Cheat Sheet

Key concepts, trade-offs, and quick-reference notes for your interview prep.

Range Metadata Fits in Memory: 1M Ranges x 200B = 200 MB

#1
Each range entry stores range_id (8B), start_key (64B), end_key (64B), leader_node (8B), 3 replica_nodes (24B), epoch (8B), and timestamps (24B) = 200 bytes. At 1M ranges: 1,000,000 × 200 B = 200 MB. This fits entirely in memory on each Raft replica. Implication: range lookups never touch disk, guaranteeing sub-millisecond latency for every routing decision.

💡 1M ranges x 200B = 200 MB. Fits in memory. No disk I/O for routing.

Paginating range metadata from disk. At 200 MB total, the entire dataset fits in memory. Disk-backed lookups add 1-5 ms per query, which is unacceptable on a routing hot path.
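The sizing above can be checked in a few lines (field widths taken directly from the entry layout described in this item):

```python
# Back-of-the-envelope sizing for the in-memory range map.
FIELD_BYTES = {
    "range_id": 8,
    "start_key": 64,
    "end_key": 64,
    "leader_node": 8,
    "replica_nodes": 24,  # 3 replicas x 8B node IDs
    "epoch": 8,
    "timestamps": 24,
}

ENTRY_BYTES = sum(FIELD_BYTES.values())   # 200 bytes per range entry
TOTAL_BYTES = 1_000_000 * ENTRY_BYTES     # 200 MB for 1M ranges

print(ENTRY_BYTES, TOTAL_BYTES // 1_000_000)  # 200 200
```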

Range Splits at 512 MB: Median Key by Data Volume

#2
When a range exceeds 512 MB, the control plane picks the median key by data volume, not the key-space midpoint. Why? The midpoint splits by key order, but skewed distributions can put 90% of data in one half. The volume median ensures each new range holds roughly 256 MB. Trade-off: finding the median requires scanning the range, adding ~1 second to the split. CockroachDB processes ~100 splits/hour at steady state.

💡 Split at 512 MB. Median by data volume, not key-space midpoint. ~100 splits/hour.

Splitting at the key-space midpoint. With skewed key distributions, one child range holds 90% of the data and needs to be split again immediately.

Lease-Based Failure Detection: 9-Second Leases, Not Gossip

#3
We chose leases (not gossip) because leases provide a deterministic fencing boundary. When a 9-second lease expires, the old leader is guaranteed to have stopped serving. Gossip converges probabilistically in O(log N) rounds with uncertainty about when convergence completes. Failure timeline: lease expires (9s) + election (2s) + propagation (1s) = 12 seconds typical detection latency.

💡 9s lease + 2s election + 1s propagation = 12s total. Deterministic fencing.

Using gossip for range leadership decisions. Gossip convergence is probabilistic, leaving a window where two nodes believe they are the leader of the same range.
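The fencing property and the detection timeline can be sketched like this (a simplified model; real implementations renew leases continuously and account for clock uncertainty):

```python
LEASE_SECONDS = 9
ELECTION_SECONDS = 2
PROPAGATION_SECONDS = 1

class Lease:
    """Deterministic fencing: the holder may serve only while the
    lease is live. After expiry it is guaranteed to have stopped."""
    def __init__(self, holder, granted_at):
        self.holder = holder
        self.granted_at = granted_at

    def is_valid(self, now):
        return now < self.granted_at + LEASE_SECONDS

# Worst-case detection latency if the holder dies right after renewal:
detection = LEASE_SECONDS + ELECTION_SECONDS + PROPAGATION_SECONDS
print(detection)  # 12
```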

Raft for Metadata: 5-Node Group Tolerates 2 Failures

#4
Range metadata must be linearizable: two nodes must never believe they own the same range. A 5-node Raft group provides F = 2 fault tolerance. We chose embedded Raft (not Paxos, not ZooKeeper) because CockroachDB and TiKV embed meta-ranges in the same Raft implementation as data ranges. One fewer external dependency, one fewer failure domain. Raft log throughput: 200 KB/sec steady state, 2 MB/sec during rebalancing.

💡 5-node Raft, F=2 tolerance, embedded (not external). 200 KB/sec steady Raft log.

Using a single MySQL instance for range metadata. No consensus means no split-brain prevention. Two concurrent metadata updates can assign the same range to different nodes.
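The fault-tolerance arithmetic follows directly from Raft's majority-quorum rule:

```python
def fault_tolerance(n):
    """A Raft group of n nodes tolerates floor((n - 1) / 2) failures,
    because progress requires a majority quorum."""
    return (n - 1) // 2

def quorum(n):
    """Smallest majority of an n-node group."""
    return n // 2 + 1

print(fault_tolerance(5), quorum(5))  # 2 3
```

With 2 of 5 nodes down, the surviving 3 still form a quorum, so metadata writes keep committing.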

Epoch Versioning: Monotonic Counter Prevents Stale Routes

#5
Every metadata update increments a monotonically increasing epoch. Query routers cache the range map locally and compare their epoch to the metadata store on each refresh. If stale, they pull the full 200 MB range map. Clients reject responses tagged with a lower epoch than they have seen. Target: all routers converge to the latest epoch within 10 seconds. This prevents stale routers from sending queries to nodes that no longer own a range.

💡 Monotonic epoch. Routers refresh on mismatch. 10-second convergence target.

Using wall clocks for range versioning. Clock skew between nodes means a 'newer' update might have a lower timestamp, causing routers to prefer stale metadata.
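The router-side epoch logic can be sketched as follows (hypothetical names; `fetch_range_map` stands in for pulling the full 200 MB map from the metadata store):

```python
class Router:
    """Query router that never goes backwards in epoch."""
    def __init__(self):
        self.epoch = 0
        self.range_map = {}

    def maybe_refresh(self, store_epoch, fetch_range_map):
        # Stale local cache: pull the full range map and adopt the epoch.
        if store_epoch > self.epoch:
            self.range_map = fetch_range_map()
            self.epoch = store_epoch

    def accept_response(self, response_epoch):
        # Reject responses tagged with a lower epoch than already seen.
        return response_epoch >= self.epoch

router = Router()
router.maybe_refresh(5, lambda: {"range-1": "node-7"})
print(router.epoch, router.accept_response(4))  # 5 False
```

Because the epoch is a monotonic counter committed through the metadata Raft group, comparisons are total and immune to clock skew.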

Two-Phase Range Transfer: Source Serves Until Target Catches Up

#6
Range moves use a two-phase protocol. Phase 1: the target node catches up by replaying the range's Raft log entries. The source continues serving all reads and writes. Phase 2: once the target is caught up, the control plane commits an atomic metadata swap via Raft. After the swap, the epoch increments and routers redirect to the target. Zero downtime because the source never stops serving until the target is ready. If the target crashes mid-transfer, abort and retry with a different target.

💡 Target catches up via log replay. Source serves throughout. Atomic swap at the end.

Pausing the source range during transfer. This creates a downtime window proportional to data size. At 512 MB and 14 MB/sec, that is 36 seconds of unavailability.
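The two phases can be sketched as a small state machine (a toy model; `Replica` and `commit_swap` stand in for real Raft machinery):

```python
class Replica:
    """Minimal stand-in for a range replica applying Raft log entries."""
    def __init__(self, applied_index=0):
        self.applied_index = applied_index

    def apply(self, index):
        self.applied_index = index

def transfer_range(source, target, raft_log, commit_swap):
    """Two-phase move: the source serves until the target catches up."""
    # Phase 1: target replays the range's Raft log; source keeps serving.
    for index in raft_log:
        target.apply(index)
    if target.applied_index != source.applied_index:
        return "abort"  # target crashed or fell behind: retry elsewhere
    # Phase 2: atomic metadata swap committed via Raft; epoch increments
    # and routers redirect to the target.
    commit_swap()
    return "committed"

src, tgt = Replica(applied_index=3), Replica()
print(transfer_range(src, tgt, raft_log=[1, 2, 3], commit_swap=lambda: None))
# committed
```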

Ghost Table Schema Change: Shadow, Backfill, Swap

#7
Adding a column to a billion-row table cannot lock it. The ghost table approach: (1) create a shadow table with the new schema, (2) backfill rows in batches, (3) capture changes via binlog during backfill, (4) atomic swap. Parallelized across 100 workers, 1M ranges complete in under 30 seconds. Vitess (PlanetScale) invented this for YouTube's 300K MySQL instances. Trade-off: doubles storage temporarily during backfill.

💡 Shadow + backfill + swap. 100 workers = 30 seconds for 1M ranges. No table lock.

Running ALTER TABLE directly. A metadata lock blocks all reads and writes for the duration of the schema change, which can take hours on a billion-row table.
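Steps 2-4 of the ghost-table approach can be sketched as a driver loop (the callbacks are hypothetical stand-ins for real database operations; step 1, creating the shadow table, is assumed done):

```python
def ghost_table_migration(copy_batch, apply_binlog, atomic_swap, batches):
    """Ghost-table schema change: backfill, catch up, swap."""
    for batch in batches:    # (2) backfill in batches, never locking the table
        copy_batch(batch)
    apply_binlog()           # (3) replay changes captured during the backfill
    atomic_swap()            # (4) swap the shadow table over the original

log = []
ghost_table_migration(
    copy_batch=lambda b: log.append(("copy", b)),
    apply_binlog=lambda: log.append("binlog"),
    atomic_swap=lambda: log.append("swap"),
    batches=[0, 1],
)
print(log)  # [('copy', 0), ('copy', 1), 'binlog', 'swap']
```

Parallelism comes from running the backfill loop across workers; the swap remains a single atomic step.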

Rebalancer Targets 20% Load Variance Threshold

#8
The rebalancer triggers when per-node load variance exceeds 20% from the cluster average. Why not 5%? Lower thresholds cause range churn: ranges bounce between nodes faster than they settle, wasting 14 MB/sec of bandwidth per move and destabilizing cache locality. CockroachDB uses hysteresis: trigger at 20%, stop rebalancing at 15%. This prevents oscillation. Steady state: ~100 range moves/hour.

💡 20% trigger, 15% stop (hysteresis). ~100 moves/hour. Lower thresholds cause churn.

Rebalancing on every minor imbalance. A 5% variance threshold triggers hundreds of unnecessary range moves per hour, saturating network bandwidth and degrading query latency.
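The hysteresis rule is small enough to state directly in code (thresholds from this item):

```python
TRIGGER = 0.20  # start rebalancing above 20% variance from cluster average
STOP = 0.15     # once started, keep going until variance drops below 15%

def should_rebalance(variance, currently_rebalancing):
    """Hysteresis: separate start/stop thresholds prevent oscillation
    around a single cutoff."""
    if currently_rebalancing:
        return variance > STOP
    return variance > TRIGGER

# 18% variance: not enough to start, but enough to keep an active
# rebalance running.
print(should_rebalance(0.18, False), should_rebalance(0.18, True))  # False True
```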

Heartbeat Carries Load Metrics, Not Just Liveness

#9
Each heartbeat from the 10K nodes carries CPU utilization, disk usage percentage, range count, and current QPS. This means the failure detector and rebalancer share the same data channel. At 10K nodes every 10 seconds: 1K heartbeats/sec. Each heartbeat is ~200 bytes, so 200 KB/sec of metadata ingestion. The rebalancer reads these metrics to decide which ranges to move and where. Without load metrics, the rebalancer would need a separate monitoring channel.

💡 Heartbeats carry CPU, disk, range count. 1K/sec at 200 KB/sec. Dual-purpose signal.

Treating heartbeats as liveness-only signals. Without embedded load metrics, the control plane needs a separate monitoring system to inform rebalancing decisions, doubling operational complexity.
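A sketch of the dual-purpose heartbeat payload and the ingestion math (the field set comes from this item; the exact encoding is illustrative):

```python
from dataclasses import dataclass

@dataclass
class Heartbeat:
    """One heartbeat doubles as liveness signal and load report."""
    node_id: int
    cpu_pct: float
    disk_pct: float
    range_count: int
    qps: int

NODES = 10_000
INTERVAL_S = 10
HEARTBEAT_BYTES = 200

rate = NODES // INTERVAL_S        # 1,000 heartbeats/sec
ingest = rate * HEARTBEAT_BYTES   # 200 KB/sec of metadata ingestion
print(rate, ingest)  # 1000 200000
```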

Placement Constraints: Spread Across Racks, Zones, Regions

#10
RF=3 replicas must land on nodes in distinct racks at minimum, ideally in distinct availability zones. The placement constraint solver checks: different rack than existing replicas, sufficient disk space, and CPU below threshold. Spanner goes further, placing replicas across continents. Trade-off: cross-region replication adds 50-100ms write latency because the Raft leader waits for majority ack. Follower reads in the local region avoid this cost for reads.

💡 Distinct racks minimum, distinct zones ideal. Cross-region adds 50-100ms write latency.

Moving a range to the closest available node during rebalancing. Without placement constraint checks, all 3 replicas can end up in the same rack, making a rack failure catastrophic.
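The constraint check described above can be sketched as a predicate over candidate nodes (a simplified model; nodes are plain dicts and the thresholds are illustrative defaults):

```python
def valid_target(node, existing_replicas, min_free_disk_pct=10, max_cpu_pct=80):
    """Placement check: distinct rack from every existing replica,
    sufficient disk headroom, and CPU below threshold."""
    if node["rack"] in {r["rack"] for r in existing_replicas}:
        return False  # same rack as an existing replica: rack failure kills both
    if 100 - node["disk_used_pct"] < min_free_disk_pct:
        return False  # not enough free disk to absorb the range
    return node["cpu_pct"] < max_cpu_pct

replicas = [{"rack": "r1"}, {"rack": "r2"}]
candidate = {"rack": "r3", "disk_used_pct": 60, "cpu_pct": 40}
print(valid_target(candidate, replicas))  # True
```

A zone- or region-aware solver applies the same shape of check at each level of the failure-domain hierarchy.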