Metadata Store (Raft-Based Consensus)
Where does the cluster store the mapping of 1M ranges to their leader nodes? We need a store that is strongly consistent because two nodes must never believe they own the same range simultaneously.
The math: 1M ranges × 200 bytes per entry (range_id 8 B, start_key 64 B, end_key 64 B, leader 8 B, 3 replica IDs 24 B, epoch 8 B, timestamps 24 B) = 200 MB, which fits entirely in memory on each replica. We chose a Raft-based metadata store embedded in the database rather than an external system such as etcd or ZooKeeper; CockroachDB and TiKV set the precedent here, embedding their meta-ranges in the same Raft implementation that serves data ranges.
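The sizing above is easy to verify mechanically. A minimal sketch (field names mirror the entry layout described in the text; the constants are the stated per-field sizes, not a real on-disk format):

```python
# Back-of-envelope sizing for the in-memory metadata table.
# Field widths follow the entry layout given in the text.
ENTRY_FIELDS = {
    "range_id": 8,
    "start_key": 64,
    "end_key": 64,
    "leader": 8,
    "replicas": 24,  # 3 replica IDs x 8 bytes
    "epoch": 8,
    "timestamps": 24,
}

NUM_RANGES = 1_000_000

entry_bytes = sum(ENTRY_FIELDS.values())
total_bytes = entry_bytes * NUM_RANGES

print(entry_bytes)                    # 200 bytes per entry
print(total_bytes // 1_000_000)       # 200 MB total
```

At 200 MB, the full table fits comfortably in memory on every replica, so lookups never touch disk.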
An external dependency would add a separate failure domain and operational complexity. The metadata store instead forms a 5-node Raft group, which tolerates 2 simultaneous node failures while the remaining 3-node majority keeps committing writes.
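The fault-tolerance claim follows directly from Raft's majority-quorum rule. A minimal sketch of the arithmetic (helper names are illustrative, not from any particular Raft library):

```python
def majority(n: int) -> int:
    """Quorum size for an n-node Raft group: a strict majority."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Nodes that can fail while the group still commits entries."""
    return n - majority(n)

# A 5-node group commits once 3 nodes acknowledge,
# so it survives the loss of any 2 nodes.
print(majority(5), tolerated_failures(5))  # 3 2
```

This is why 5 nodes (not 4 or 6) is the sweet spot: an even-sized group pays for an extra replica without tolerating any additional failures.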
Every range assignment, split, or rebalance is committed through Raft, which is what makes reads of the metadata linearizable. Google Spanner takes a similar approach, using a dedicated meta-range that maps range IDs to leader nodes.
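Once the table is in memory, resolving a key to its range is a binary search over the sorted start keys. A minimal sketch of that lookup path (the `RangeEntry`/`MetaTable` names and the epoch field's use are illustrative assumptions, not the source's API):

```python
import bisect
from dataclasses import dataclass

@dataclass
class RangeEntry:
    range_id: int
    start_key: bytes   # inclusive
    end_key: bytes     # exclusive
    leader: str
    epoch: int         # bumped on every leader change, guards stale claims

class MetaTable:
    """Hypothetical in-memory meta-range: entries sorted by start_key
    so a key resolves to its owning range in O(log n)."""

    def __init__(self, entries: list[RangeEntry]):
        self.entries = sorted(entries, key=lambda e: e.start_key)
        self.starts = [e.start_key for e in self.entries]

    def lookup(self, key: bytes) -> RangeEntry:
        # Rightmost range whose start_key <= key.
        i = bisect.bisect_right(self.starts, key) - 1
        entry = self.entries[i]
        assert entry.start_key <= key < entry.end_key, "key outside known ranges"
        return entry

table = MetaTable([
    RangeEntry(1, b"a", b"m", "node-1", epoch=4),
    RangeEntry(2, b"m", b"z", "node-2", epoch=7),
])
print(table.lookup(b"cat").leader)  # node-1
```

The epoch field is what prevents two nodes from ever believing they own the same range: a node's claim to leadership is only honored if its epoch matches the latest Raft-committed entry.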
Related concepts