Cluster Topology and Node Lifecycle
A new node joins the cluster. It registers with the control plane, providing its IP, port, datacenter, and rack.
The control plane assigns initial ranges (either new splits or transfers from overloaded nodes), and the node starts serving. Decommissioning is the reverse and more interesting: the control plane must drain all ranges from the departing node to other nodes before removing it.
With 100 ranges per node at 512 MB each, draining transfers ~51 GB. At 14 MB/sec of rebalance bandwidth, this takes about 60 minutes.
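The drain estimate can be checked with a short sketch. The code below is illustrative only: `plan_drain`, `DrainPlan`, the node names, and the per-node range counts are hypothetical, and the least-loaded placement policy is an assumption rather than something the walkthrough specifies. It assumes a single 14 MB/sec transfer stream, as in the estimate.

```python
import heapq
from dataclasses import dataclass


@dataclass
class DrainPlan:
    moves: list          # (range_id, target_node) pairs
    total_mb: int        # total data to transfer
    est_seconds: float   # transfer time at the given bandwidth


def plan_drain(range_ids, range_size_mb, target_loads, bandwidth_mb_per_sec):
    """Send each range on the departing node to the least-loaded target.

    target_loads maps node name -> number of ranges it currently holds.
    """
    if not target_loads:
        raise ValueError("no target nodes available to receive ranges")
    # Min-heap keyed on current range count; ties break on node name.
    heap = [(load, node) for node, load in target_loads.items()]
    heapq.heapify(heap)
    moves = []
    for rid in range_ids:
        load, node = heapq.heappop(heap)
        moves.append((rid, node))
        heapq.heappush(heap, (load + 1, node))
    total_mb = len(moves) * range_size_mb
    return DrainPlan(moves, total_mb, total_mb / bandwidth_mb_per_sec)


# Numbers from the text: 100 ranges of 512 MB, 14 MB/sec rebalance bandwidth.
plan = plan_drain(range(100), 512, {"n2": 95, "n3": 100, "n4": 105}, 14)
print(plan.total_mb, round(plan.est_seconds / 60))  # 51200 61
```

The result, 51,200 MB in roughly 61 minutes, matches the "~51 GB in about 60 minutes" figure. Raising the rebalance bandwidth shortens the drain proportionally, at the cost of more interference with foreground traffic.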
The node continues serving reads during the drain.
We chose a layered approach (not pure consensus for everything), following CockroachDB, which uses gossip for node discovery (all nodes learn about each other) and Raft for authoritative state (the metadata store decides range ownership).
Gossip handles the chattiness of 10K nodes discovering each other, while Raft handles the consistency requirement that range assignments must be linearizable.
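To give a feel for why gossip suits discovery at this scale, here is a toy push-gossip simulation. It is a sketch under stated assumptions: the function name `gossip_rounds`, the fanout of 3, and the uniform random peer selection are all hypothetical choices, not CockroachDB's actual protocol. The Raft side is not modeled.

```python
import random


def gossip_rounds(num_nodes, fanout=3, seed=0):
    """Count push-gossip rounds until every node has heard about node 0.

    Each round, every node that already knows about the new node tells
    `fanout` peers chosen uniformly at random.
    """
    rng = random.Random(seed)
    informed = {0}  # node 0 is the newly joined node
    rounds = 0
    while len(informed) < num_nodes:
        newly_told = set()
        for _ in range(len(informed)):
            newly_told.update(rng.randrange(num_nodes) for _ in range(fanout))
        informed |= newly_told
        rounds += 1
    return rounds


rounds = gossip_rounds(10_000)
print(f"10,000 nodes learned about the new node in {rounds} rounds")
```

Because the informed set can grow by at most a factor of `fanout + 1` per round, the round count grows roughly logarithmically with cluster size, and no single node has to contact all 10K peers. Gossip gives only eventual, possibly stale views, which is why range assignments go through Raft instead.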
Related concepts