DB Control Plane Failure Modes
What breaks, how to detect it, and how to fix it. Each failure mode includes an impact summary, mitigations, and detection metrics.
Metadata store leader failure halts all control plane operations
2-5 seconds of control plane read-only mode. Data plane unaffected (cached routes).
Constraint: the Raft leader handles all metadata writes (range assignments, splits, rebalances). The leader node crashes (OOM, hardware failure, kernel panic). Until a new leader is elected, no metadata can be updated. Range splits queue up, rebalance transfers cannot complete their atomic swap, and new node registrations are blocked. The data plane continues serving reads from cached range maps, but any operation requiring a metadata write stalls.
- Raft elects a new leader in 2-5 seconds. The 5-node group tolerates 2 failures, so one leader crash does not lose quorum.
- In-flight metadata updates retry against the new leader automatically via Raft client redirection.
- Data plane queries are unaffected: routers use cached range maps. Only control plane mutations stall.
- Monitor Raft leader election frequency. More than 3 elections per hour signals instability.
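The retry-via-redirection behavior above can be sketched as follows. This is a minimal illustrative client, not the system's actual API: `NotLeaderError`, `write_metadata`, and the node-callable shape are all assumptions for the example.

```python
import time

class NotLeaderError(Exception):
    """Hypothetical error a follower raises, carrying a hint to the new leader."""
    def __init__(self, leader_hint=None):
        self.leader_hint = leader_hint

def write_metadata(nodes, current_leader, op, max_wait=10.0):
    """Retry a metadata write across a leader election (sketch).

    `nodes` maps node id -> callable that applies `op`, or raises
    NotLeaderError when the node is not the current Raft leader.
    """
    deadline = time.monotonic() + max_wait
    leader = current_leader
    while time.monotonic() < deadline:
        try:
            return nodes[leader](op)
        except NotLeaderError as e:
            # Follow the redirect hint if one is known; otherwise back
            # off and retry while the 2-5 second election completes.
            leader = e.leader_hint or leader
            time.sleep(0.1)
    raise TimeoutError("no metadata leader elected within max_wait")
```

The key property is that callers never see the election directly: a write started against the old leader either commits on the new one or times out.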
Split-brain: two nodes claim leadership of the same range
Up to 9 seconds of stale reads if client cache points to old leader. Writes during partition are rolled back.
Constraint: a network partition isolates the old range leader from the Raft group. The old leader still holds a valid lease and continues serving reads and writes. Meanwhile, the Raft group on the other side of the partition elects a new leader for the range. For up to 9 seconds (lease duration), two nodes believe they are the leader. Writes to both leaders create divergent state.
- Lease-based fencing: the old leader checks its lease before every operation. When the 9-second lease expires, it stops serving immediately.
- The range epoch increases monotonically with each leadership change. Clients reject responses carrying a lower epoch than any they have previously seen.
- After partition heals, the old leader discovers the new epoch and steps down. Any writes it accepted during the partition are rolled back via Raft log reconciliation.
- Monitor for epoch divergence across nodes. Alert on any node serving with an epoch more than 1 behind the cluster.
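The client-side epoch check is small enough to sketch in full. This is an illustrative model of the fencing rule, assuming each response carries the leader's epoch; the class and method names are not from the system itself.

```python
class EpochFencedClient:
    """Reject responses from a deposed range leader by epoch (sketch).

    Epochs increase monotonically with each leadership change, so a
    response carrying a lower epoch than any previously seen must come
    from a stale leader on the wrong side of a partition.
    """
    def __init__(self):
        self.highest_epoch = 0

    def accept(self, epoch, payload):
        if epoch < self.highest_epoch:
            return None  # stale leader: drop the response
        self.highest_epoch = epoch
        return payload
```

Note that the client only needs to remember a single integer per range; no coordination with the Raft group is required to enforce the rule.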
Range rebalance stall: target crashes mid-transfer
Source continues serving. Zero user impact. Transfer is retried with a different target.
Constraint: the target node crashes after receiving 400 MB of a 512 MB range transfer. The range is in 'transferring' status in the metadata store. Neither source nor target is the definitive owner of the new state. The source holds the authoritative data (it never stopped serving), but the metadata shows a transfer in progress. If not handled, the range stays in limbo indefinitely.
- Source continues serving throughout the transfer. The two-phase protocol means the atomic swap never happened, so the source remains the owner.
- On timeout or target failure, the control plane aborts the transfer: reset range status to 'active', mark target as suspect, discard partial data on target.
- Retry with a different target node that passes health checks.
- Monitor transfer completion rate. Alert if more than 5% of transfers fail in an hour.
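The abort path can be sketched as a single metadata transition. This is an assumed shape for the range record (`RangeMeta`, `abort_transfer`, and the field names are illustrative), showing why the abort is safe: the atomic swap never ran, so the source is still the owner.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RangeMeta:
    status: str                      # 'active' or 'transferring'
    owner: str                       # source node; kept serving throughout
    target: Optional[str] = None     # set only while a transfer is in flight

def abort_transfer(rng, suspect_nodes):
    """Abort a stalled transfer (sketch). The two-phase protocol never
    performed its atomic swap, so resetting status to 'active' restores
    the exact pre-transfer state with the source as owner."""
    if rng.status != "transferring":
        return rng
    suspect_nodes.add(rng.target)    # crashed target: exclude from retry
    rng.target = None                # partial data on the target is discarded
    rng.status = "active"
    return rng
```

A retry then simply starts a fresh transfer toward a different, health-checked target; nothing from the aborted attempt is reused.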
Heartbeat storm after network partition recovery
Delayed failure detection for 5-10 seconds during the burst. No data plane impact if grace period is applied.
Constraint: a brief network partition resolves. The 5K nodes that were partitioned all send heartbeats simultaneously. The metadata leader receives 5K heartbeats in 1 second instead of the normal 1K/sec. The Raft log backs up. Commit latency spikes from 2ms to 200ms. The heartbeat processing queue grows beyond 10K entries. Lease extensions are delayed, causing false-positive failure detections for nodes that are actually healthy.
- Jittered heartbeat retry: each node adds a random 0-5 second delay to its next heartbeat after detecting reconnection, spreading the burst over 5 seconds.
- Rate limiting on heartbeat ingestion: cap at 3K/sec on the metadata leader. Excess heartbeats are queued, not dropped.
- Extend lease grace period by 10 seconds during detected partition recovery to prevent false-positive failure detections.
- Monitor for correlated heartbeat failures. If 50%+ of nodes miss heartbeats simultaneously, treat as partition, not mass failure.
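The first and last mitigations above fit in a few lines. This is a sketch under the section's stated numbers (0-5s jitter, 50% correlation threshold); the function names are illustrative.

```python
import random

def next_heartbeat_delay(just_reconnected, base_interval=1.0, jitter=5.0):
    """Delay before a node's next heartbeat. After detecting
    reconnection, each node adds a random 0-5s jitter so a 5K-node
    burst spreads over ~5 seconds instead of landing in one second."""
    if just_reconnected:
        return base_interval + random.uniform(0.0, jitter)
    return base_interval

def classify_missed_heartbeats(missed, total, threshold=0.5):
    """Correlated-failure heuristic: if 50%+ of nodes miss heartbeats
    simultaneously, treat it as a network partition, not mass failure."""
    return "partition" if missed >= threshold * total else "node-failure"
```

With 5K nodes jittering uniformly over 5 seconds, the leader sees roughly 1K extra heartbeats per second, which sits under the 3K/sec ingestion cap.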
Online schema change partial failure: 500 ranges fail, 999,500 succeed
Mixed schemas cause query inconsistency. Requires manual intervention to either complete or roll back.
Constraint: a schema change is applied to 999,500 of 1M ranges successfully. 500 ranges fail (disk full on their nodes, incompatible schema version, node temporarily unreachable). The cluster now has mixed schemas: most ranges have the new column, 500 do not. Queries spanning old-schema and new-schema ranges return inconsistent results. A SELECT with the new column succeeds on 999,500 ranges but fails on 500.
- Route queries that reference the new column only to ranges with the new schema until all ranges converge; old-schema ranges continue serving reads through a backward-compatible schema mapping.
- Retry failed ranges after fixing the root cause (free disk space, upgrade schema version, wait for node recovery).
- If retry is not feasible, roll back all 999,500 successful ranges to the old schema. The control plane tracks per-range status for both forward and backward paths.
- Require pre-flight checks before schema changes: verify disk space on all nodes, check schema version compatibility, ensure no ranges are in splitting/transferring state.
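The retry-then-rollback decision above can be sketched as a convergence driver over per-range status. This is an illustrative model, not the control plane's actual code: `converge_schema`, the `apply_new`/`revert_old` callbacks, and the 'new'/'old' status values are all assumptions.

```python
def converge_schema(range_status, apply_new, revert_old, max_retries=3):
    """Drive mixed-schema ranges to a single schema version (sketch).

    `range_status` maps range id -> 'new' or 'old'. First retry the
    forward path on failed ranges; if any still fail after max_retries
    (e.g. disk still full), roll back every upgraded range so the
    cluster returns to a uniform old schema.
    """
    failed = [r for r, s in range_status.items() if s == "old"]
    for _ in range(max_retries):
        still_failed = []
        for r in failed:
            if apply_new(r):
                range_status[r] = "new"   # forward path succeeded
            else:
                still_failed.append(r)
        failed = still_failed
        if not failed:
            return "converged-new"
    # Forward path is stuck: revert every range on the new schema.
    for r, s in range_status.items():
        if s == "new":
            revert_old(r)
            range_status[r] = "old"
    return "rolled-back-old"
```

Either exit leaves the cluster on a single schema version, which is the invariant queries depend on; the mixed state exists only while the driver runs.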