STANDARDwalkthrough

Failure Detection and Lease-Based Liveness

3 of 8
3 related
How does the control plane know a node is dead, not just slow? With 10K nodes sending heartbeats every 10 seconds, the metadata leader processes 1K heartbeats/sec.
Gossip protocols like Cassandra's phi-accrual detector converge probabilistically, leaving a window where two nodes might believe they are the leader. The failure timeline: node dies at t=0t=0, last heartbeat was at t=Tt=-T (up to 10s ago), lease expires at t=(10T)+9=t=(10-T)+9= up to 19s19s, new election takes 2 seconds, propagation takes 1 second.
We chose lease-based detection (not gossip-based) because leases give a deterministic fencing boundary: when a 9-second lease expires, the old leader is guaranteed to have stopped serving (it checks its own lease before every operation).
Typical total: 12 seconds from death to new leader serving. CockroachDB uses 9-second leases for exactly this reason.
Trade-off: leases add latency to failure detection (must wait for full lease duration) but prevent split-brain at the range level.
Why it matters in interviews
Interviewers probe the gossip vs lease trade-off. Explaining that leases provide deterministic fencing while gossip only offers probabilistic convergence shows staff-level distributed systems understanding.
Related concepts