Metadata Store (Raft-Based Consensus)
Where does the cluster store the mapping of 1M ranges to their leader nodes? We need a store that is strongly consistent because two nodes must never believe they own the same range simultaneously.
The math: 1M ranges × 200 bytes per entry (range_id 8 B, start_key 64 B, end_key 64 B, leader 8 B, 3 replica IDs 24 B, epoch 8 B, timestamps 24 B) = 200 MB, which fits entirely in memory on each replica. We chose a Raft-based metadata store embedded in the database rather than an external system such as etcd or ZooKeeper; CockroachDB and TiKV set the precedent here, embedding their meta-ranges in the same Raft implementation that serves data ranges.
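The sizing above is easy to verify mechanically. A minimal sketch (field names mirror the entry layout described in the text; the constants are the stated per-field sizes, not a real on-disk format):

```python
# Back-of-envelope sizing for the in-memory metadata table.
# Field widths follow the entry layout given in the text.
ENTRY_FIELDS = {
    "range_id": 8,
    "start_key": 64,
    "end_key": 64,
    "leader": 8,
    "replicas": 24,  # 3 replica IDs x 8 bytes
    "epoch": 8,
    "timestamps": 24,
}

NUM_RANGES = 1_000_000

entry_bytes = sum(ENTRY_FIELDS.values())
total_bytes = entry_bytes * NUM_RANGES

print(entry_bytes)                    # 200 bytes per entry
print(total_bytes // 1_000_000)       # 200 MB total
```

At 200 MB, the full table fits comfortably in memory on every replica, so lookups never touch disk.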
An external dependency would add a separate failure domain and operational complexity. The metadata store instead forms a 5-node Raft group, which tolerates 2 simultaneous node failures while the remaining 3-node majority keeps committing writes.
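The fault-tolerance claim follows directly from Raft's majority-quorum rule. A minimal sketch of the arithmetic (helper names are illustrative, not from any particular Raft library):

```python
def majority(n: int) -> int:
    """Quorum size for an n-node Raft group: a strict majority."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Nodes that can fail while the group still commits entries."""
    return n - majority(n)

# A 5-node group commits once 3 nodes acknowledge,
# so it survives the loss of any 2 nodes.
print(majority(5), tolerated_failures(5))  # 3 2
```

This is why 5 nodes (not 4 or 6) is the sweet spot: an even-sized group pays for an extra replica without tolerating any additional failures.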
Every range assignment, split, or rebalance is committed through Raft, which is what makes reads of the metadata linearizable. Google Spanner takes a similar approach, using a dedicated meta-range that maps range IDs to leader nodes.
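Once the table is in memory, resolving a key to its range is a binary search over the sorted start keys. A minimal sketch of that lookup path (the `RangeEntry`/`MetaTable` names and the epoch field's use are illustrative assumptions, not the source's API):

```python
import bisect
from dataclasses import dataclass

@dataclass
class RangeEntry:
    range_id: int
    start_key: bytes   # inclusive
    end_key: bytes     # exclusive
    leader: str
    epoch: int         # bumped on every leader change, guards stale claims

class MetaTable:
    """Hypothetical in-memory meta-range: entries sorted by start_key
    so a key resolves to its owning range in O(log n)."""

    def __init__(self, entries: list[RangeEntry]):
        self.entries = sorted(entries, key=lambda e: e.start_key)
        self.starts = [e.start_key for e in self.entries]

    def lookup(self, key: bytes) -> RangeEntry:
        # Rightmost range whose start_key <= key.
        i = bisect.bisect_right(self.starts, key) - 1
        entry = self.entries[i]
        assert entry.start_key <= key < entry.end_key, "key outside known ranges"
        return entry

table = MetaTable([
    RangeEntry(1, b"a", b"m", "node-1", epoch=4),
    RangeEntry(2, b"m", b"z", "node-2", epoch=7),
])
print(table.lookup(b"cat").leader)  # node-1
```

The epoch field is what prevents two nodes from ever believing they own the same range: a node's claim to leadership is only honored if its epoch matches the latest Raft-committed entry.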
Related concepts