DB Control Plane Failure Modes

What breaks, how to detect it, and how to fix it. Every failure mode includes detection metrics, mitigations, and a severity rating.

Failure Modes
MEDIUM

Metadata store leader failure halts all control plane operations

2-5 seconds of control plane read-only mode. Data plane unaffected (cached routes).

Constraint: the Raft leader handles all metadata writes (range assignments, splits, rebalances). The leader node crashes (OOM, hardware failure, kernel panic). Until a new leader is elected, no metadata can be updated. Range splits queue up, rebalance transfers cannot complete their atomic swap, and new node registrations are blocked. The data plane continues serving reads from cached range maps, but any operation requiring a metadata write stalls.

Detection
Raft followers detect the missing heartbeat within one election timeout (2 seconds) and begin an election. Metadata write latency spikes to infinity. Pending split and rebalance operations time out.
Mitigation
  1. Raft elects a new leader in 2-5 seconds. The 5-node group tolerates 2 failures, so one leader crash does not lose quorum.
  2. In-flight metadata updates retry against the new leader automatically via Raft client redirection.
  3. Data plane queries are unaffected: routers use cached range maps. Only control plane mutations stall.
  4. Monitor Raft leader election frequency. More than 3 elections per hour signals instability.
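
The retry path in step 2 can be sketched as a client loop that follows leader-redirect hints. This is a minimal sketch: `NotLeaderError`, `cluster.leader()`, and `cluster.send()` are hypothetical stand-ins for the real Raft client API, not documented names.

```python
import time

class NotLeaderError(Exception):
    """Raised when a metadata write hits a node that is no longer leader."""
    def __init__(self, hint=None):
        super().__init__(hint)
        self.hint = hint  # node id of the new leader, if the old one knows it

def write_metadata(cluster, key, value, deadline_s=10.0):
    """Retry a metadata write across leader changes.

    `cluster` is any object with `leader()` and `send(node, key, value)`.
    During an election (2-5 s), sends fail and the loop backs off; once a
    new leader exists, the write lands without caller involvement.
    """
    target = cluster.leader()
    deadline = time.monotonic() + deadline_s
    backoff = 0.05
    while time.monotonic() < deadline:
        try:
            return cluster.send(target, key, value)
        except NotLeaderError as e:
            # Follow the redirect hint if available, otherwise re-resolve
            # the leader (which may block while the election runs).
            target = e.hint or cluster.leader()
            time.sleep(backoff)
            backoff = min(backoff * 2, 1.0)
    raise TimeoutError(f"metadata write {key!r} did not complete before deadline")
```
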
HIGH

Split-brain: two nodes claim leadership of the same range

Up to 9 seconds of stale reads if a client's cache points to the old leader. Writes accepted during the partition are rolled back.

Constraint: a network partition isolates the old range leader from the Raft group. The old leader still holds a valid lease and continues serving reads and writes. Meanwhile, the Raft group on the other side of the partition elects a new leader for the range. For up to 9 seconds (lease duration), two nodes believe they are the leader. Writes to both leaders create divergent state.

Detection
Epoch mismatch when a client talks to the stale leader: the stale leader's epoch is lower than the new leader's. Network monitoring detects the partition. Raft commit latency on the old leader spikes because it cannot reach a majority.
Mitigation
  1. Lease-based fencing: the old leader checks its lease before every operation. When the 9-second lease expires, it stops serving immediately.
  2. Epoch monotonically increases. Clients reject responses with a lower epoch than they have previously seen.
  3. After partition heals, the old leader discovers the new epoch and steps down. Any writes it accepted during the partition are rolled back via Raft log reconciliation.
  4. Monitor for epoch divergence across nodes. Alert on any node serving with an epoch more than 1 behind the cluster.
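
Steps 1 and 2 amount to fencing on both sides of the protocol: the leader checks its lease before serving, and the client rejects epoch regressions. A minimal sketch with illustrative names (`FencedClient`, `serve_if_leased`), not the actual API:

```python
import time

class StaleEpochError(Exception):
    """Response carried an epoch lower than one already observed."""

class FencedClient:
    """Client-side fencing: track the highest epoch seen per range and
    reject any response from a leader with an older epoch."""
    def __init__(self):
        self.max_epoch = {}  # range_id -> highest epoch seen so far

    def accept(self, range_id, epoch, payload):
        seen = self.max_epoch.get(range_id, 0)
        if epoch < seen:
            # A leader on the wrong side of the partition answered with an
            # epoch we know is obsolete; drop the response and retry elsewhere.
            raise StaleEpochError(f"range {range_id}: epoch {epoch} < {seen}")
        self.max_epoch[range_id] = epoch
        return payload

def serve_if_leased(lease_expiry, op, clock=time.monotonic):
    """Server-side fencing: the leader checks its lease before every
    operation and stops serving the moment the lease has expired."""
    if clock() >= lease_expiry:
        raise PermissionError("lease expired; stepping down")
    return op()
```
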
LOW

Range rebalance stall: target crashes mid-transfer

Source continues serving. Zero user impact. Transfer is retried with a different target.

Constraint: the target node crashes after receiving 400 MB of a 512 MB range transfer. The range is in 'transferring' status in the metadata store. Neither source nor target is the definitive owner of the new state. The source holds the authoritative data (it never stopped serving), but the metadata shows a transfer in progress. If not handled, the range stays in limbo indefinitely.

Detection
Transfer timeout fires after 120 seconds. The source node reports the transfer incomplete, the target node's heartbeat stops, and the range status stays stuck in 'transferring' longer than expected.
Mitigation
  1. Source continues serving throughout the transfer. The two-phase protocol means the atomic swap never happened, so the source remains the owner.
  2. On timeout or target failure, the control plane aborts the transfer: reset range status to 'active', mark target as suspect, discard partial data on target.
  3. Retry with a different target node that passes health checks.
  4. Monitor transfer completion rate. Alert if more than 5% of transfers fail in an hour.
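
The abort-and-retry path in steps 2 and 3 can be sketched as one reconciliation step. Here `meta` is a hypothetical metadata-store client, and a real placement policy would score candidate targets rather than take the first healthy one:

```python
def handle_transfer_timeout(meta, range_id, healthy_nodes):
    """Abort a stalled transfer and retry against a new target.

    The two-phase protocol guarantees the source is still the owner,
    because the atomic swap never committed; we only need to reset
    metadata and pick a new target.
    """
    rec = meta.get(range_id)
    if rec["status"] != "transferring":
        return rec  # transfer completed or was already aborted; nothing to do
    failed_target = rec.pop("target")
    meta.mark_suspect(failed_target)   # exclude from placement until health checks pass
    rec["status"] = "active"           # source remains authoritative
    candidates = [n for n in healthy_nodes
                  if n not in (rec["source"], failed_target)]
    if candidates:
        rec["status"] = "transferring"
        rec["target"] = candidates[0]  # illustrative: real placement scores candidates
    meta.set(range_id, rec)
    return rec
```

If no healthy candidate exists, the range simply returns to 'active' on the source, which matches the zero-user-impact property above.
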
MEDIUM

Heartbeat storm after network partition recovery

Delayed failure detection for 5-10 seconds during the burst. No data plane impact if the lease grace period is applied.

Constraint: a brief network partition resolves. 5K nodes that were partitioned simultaneously send heartbeats. The metadata leader receives 5K heartbeats in 1 second instead of the normal 1K/sec. The Raft log backs up. Commit latency spikes from 2ms to 200ms. The heartbeat processing queue grows beyond 10K entries. Lease extensions are delayed, causing false-positive failure detections for nodes that are actually healthy.

Detection
Raft commit latency spikes above 100ms. Heartbeat processing queue depth exceeds 10K. False-positive node failure alerts spike, and the affected nodes all fall within the same partition boundary.
Mitigation
  1. Jittered heartbeat retry: each node adds a random 0-5 second delay to its next heartbeat after detecting reconnection, spreading the burst over 5 seconds.
  2. Rate limiting on heartbeat ingestion: cap at 3K/sec on the metadata leader. Excess heartbeats are queued, not dropped.
  3. Extend lease grace period by 10 seconds during detected partition recovery to prevent false-positive failure detections.
  4. Monitor for correlated heartbeat failures. If 50%+ of nodes miss heartbeats simultaneously, treat as partition, not mass failure.
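
Steps 1 and 2 can be sketched as a jittered retry delay on the node side plus a token bucket on the leader side that queues, rather than drops, excess heartbeats. The names and the simplified bucket are illustrative; only the 0-5 second jitter window and the 3K/sec cap come from the text above.

```python
import random

def next_heartbeat_delay(base_interval_s, just_reconnected, jitter_s=5.0):
    """Node side: after detecting reconnection, add a random 0-5 s delay
    so 5K simultaneous heartbeats spread over the jitter window."""
    if just_reconnected:
        return base_interval_s + random.uniform(0.0, jitter_s)
    return base_interval_s

class HeartbeatLimiter:
    """Leader side: token bucket capping heartbeat ingestion.
    Heartbeats over the cap wait in a queue instead of being dropped."""
    def __init__(self, rate_per_s=3000, burst=3000):
        self.rate = rate_per_s
        self.tokens = float(burst)
        self.burst = float(burst)
        self.queue = []
        self.last = 0.0

    def ingest(self, hb, now):
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.queue.append(hb)
        admitted = []
        while self.queue and self.tokens >= 1.0:
            self.tokens -= 1.0
            admitted.append(self.queue.pop(0))
        return admitted
```
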
HIGH

Online schema change partial failure: 500 ranges fail, 999,500 succeed

Mixed schemas cause query inconsistency. Requires manual intervention to either complete or roll back.

Constraint: a schema change is applied to 999,500 of 1M ranges successfully. 500 ranges fail (disk full on their nodes, incompatible schema version, node temporarily unreachable). The cluster now has mixed schemas: most ranges have the new column, 500 do not. Queries spanning old-schema and new-schema ranges return inconsistent results. A SELECT with the new column succeeds on 999,500 ranges but fails on 500.

Detection
schema_change_log shows 500 rows with status=failed. Monitoring alerts on the incomplete schema change. Queries referencing the new column error on the affected ranges.
Mitigation
  1. Serve new-column queries only from ranges with the new schema until all ranges converge; old-schema ranges continue serving reads through a backward-compatible schema mapping.
  2. Retry failed ranges after fixing the root cause (free disk space, upgrade schema version, wait for node recovery).
  3. If retry is not feasible, roll back all 999,500 successful ranges to the old schema. The control plane tracks per-range status for both forward and backward paths.
  4. Require pre-flight checks before schema changes: verify disk space on all nodes, check schema version compatibility, ensure no ranges are in splitting/transferring state.
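
The retry-or-rollback decision in steps 2 and 3 can be sketched as a policy over per-range statuses read from schema_change_log. The 0.1% threshold is an illustrative knob, not a documented default; at that setting, 500 failures out of 1M ranges (0.05%) lands on the retry path.

```python
def reconcile_schema_change(statuses, max_failed_ratio=0.001):
    """Decide what to do with a partially applied schema change.

    `statuses` maps range_id -> 'applied' | 'failed'. Returns an
    (action, ranges) pair: retry the failed residue when it is small,
    otherwise roll back every range that already applied the change.
    """
    failed = [r for r, s in statuses.items() if s == "failed"]
    if not failed:
        return ("done", [])
    if len(failed) / len(statuses) <= max_failed_ratio:
        # Small residue: fix the root cause (disk space, version,
        # node recovery) and retry only the failed ranges.
        return ("retry", failed)
    applied = [r for r, s in statuses.items() if s == "applied"]
    # Too many failures: revert the successful ranges to the old schema.
    return ("rollback", applied)
```
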