ID Generator Failure Modes
What breaks, how to detect it, and how to fix it. Each failure mode comes with detection metrics, mitigations, and a severity rating.
Clock steps backward on a generator host
NTP in step mode corrects a fast clock, a VM migration or hypervisor pause lands the guest in the past, or a leap-second smear is misconfigured. The worker's wall clock now reads a millisecond it has already used, so continuing to generate would mint duplicates.
- Policy ladder: regression under 20ms -> spin-wait it out; larger -> refuse and page (down beats duplicating)
- Optional grace: freeze the timestamp and borrow remaining sequence numbers (up to 4,096 IDs) before refusing
- Prevention: slew-only NTP on generator hosts; persist last_timestamp so restarts wait instead of replaying
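The policy ladder above can be sketched as follows. This is a minimal illustration, not the production generator: the class and method names (`Generator`, `next_pair`, `ClockRegressionError`) are hypothetical, the optional grace mode is left out, and a regression of exactly 20 ms is treated as "refuse", which the text does not specify.

```python
import time

MAX_SPIN_MS = 20        # regressions smaller than this are waited out
SEQUENCE_LIMIT = 4096   # sequence numbers available per millisecond


class ClockRegressionError(RuntimeError):
    """The wall clock is too far behind the last issued timestamp."""


class Generator:
    def __init__(self, now_ms=lambda: time.time_ns() // 1_000_000,
                 sleep=time.sleep):
        # Clock and sleep are injectable so the ladder can be tested.
        self._now_ms = now_ms
        self._sleep = sleep
        self.last_timestamp = -1
        self.sequence = 0

    def _wait_until(self, target_ms):
        """Block until the wall clock reads at least target_ms."""
        now = self._now_ms()
        while now < target_ms:
            self._sleep(0.0005)
            now = self._now_ms()
        return now

    def next_pair(self):
        """Return a (timestamp_ms, sequence) pair that was never issued."""
        now = self._now_ms()
        if now < self.last_timestamp:
            regression = self.last_timestamp - now
            if regression >= MAX_SPIN_MS:
                # Large step backward: refuse; the caller pages a human.
                raise ClockRegressionError(
                    f"clock regressed {regression} ms; refusing to mint")
            # Small step backward: wait for the clock to catch up.
            now = self._wait_until(self.last_timestamp)
        if now == self.last_timestamp:
            if self.sequence + 1 >= SEQUENCE_LIMIT:
                # Millisecond exhausted: roll over to the next one.
                now = self._wait_until(self.last_timestamp + 1)
                self.sequence = 0
            else:
                self.sequence += 1
        else:
            self.sequence = 0
        self.last_timestamp = now
        return now, self.sequence
```

After a short regression the generator resumes in the millisecond it last used and continues the sequence from where it stopped, so no pair is issued twice.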
Two workers hold the same worker ID
A GC pause outlives the lease: worker 7 freezes, its lease expires, a new node claims 7, old-7 thaws and resumes minting. Or the low-tech version: a copy-pasted manifest hardcodes the same ID twice.
- Fencing deadline: generation stops at lease TTL minus margin, BEFORE expiry can hand the ID to someone else
- Assignment by construction: ephemeral leases or StatefulSet ordinals, never hand-maintained config
- On canary hit: park both claimants immediately, quarantine the ID, audit the overlap window downstream
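A fencing deadline of this kind might look like the sketch below. The class name `LeaseFence` and its methods are hypothetical. The sketch assumes a monotonic clock that keeps advancing while the process is frozen, which holds for GC pauses but may not hold across host-level suspends on every platform.

```python
import time


class LeaseFence:
    """Stops minting at lease TTL minus a safety margin."""

    def __init__(self, ttl_s, margin_s, monotonic=time.monotonic):
        if margin_s >= ttl_s:
            raise ValueError("margin must be smaller than the lease TTL")
        self._ttl_s = ttl_s
        self._margin_s = margin_s
        self._monotonic = monotonic
        self._deadline = None  # no lease held yet

    def renewed(self, sent_at):
        """Record a successful renewal.

        sent_at is a monotonic reading taken BEFORE the renewal request
        was sent, so the local deadline can only be earlier than the
        registry's view of the expiry, never later.
        """
        self._deadline = sent_at + self._ttl_s - self._margin_s

    def may_mint(self):
        """True only while the fencing deadline has not passed."""
        return (self._deadline is not None
                and self._monotonic() < self._deadline)
```

The generator would call `may_mint()` immediately before issuing each ID or batch. A pause that lands between the check and the mint can still slip through if it is longer than the margin, which is why the margin should be sized against the worst observed pause and why the canary remains necessary as a backstop.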
Coordination store (ZooKeeper) outage
The registry cluster loses quorum during a network event. New workers cannot claim IDs; running workers cannot renew leases and their fencing deadlines approach.
- Running workers keep minting until their individual deadlines: a short registry blip costs nothing
- Extended outage: workers park in waves as their deadlines pass, which is loud, safe degradation rather than silent risk
- Capacity: keep enough spare claimed-but-idle workers that parking a wave leaves headroom; the registry runs a 5-node ensemble, which keeps quorum through two node failures
Sequence saturation under a burst writer
One caller (a backfill job, a tight retry loop) demands more than 4,096 IDs/ms from a single worker. Every millisecond saturates; the generator spin-waits constantly and becomes the caller's rate limiter.
- Steer to GenerateBatch: one RPC per 500-4,096 IDs removes the per-call ceiling
- Per-caller rate limits so one bad citizen cannot make the fleet look slow
- Spread hot callers across workers (client-side load balancing) or hand them library mode
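One common way to implement per-caller rate limits is a token bucket keyed by caller, sketched below. The technique is chosen here for illustration and is not named by the text; the class name, the caller keys, and the rate and burst figures are all hypothetical.

```python
import time


class CallerRateLimiter:
    """Token bucket per caller, measured in IDs rather than requests."""

    def __init__(self, ids_per_second, burst, monotonic=time.monotonic):
        self._rate = ids_per_second
        self._burst = burst
        self._monotonic = monotonic
        self._buckets = {}  # caller -> (tokens, time of last refill)

    def allow(self, caller, ids_requested):
        """Return True and spend tokens if the caller is within budget."""
        now = self._monotonic()
        # A caller seen for the first time starts with a full bucket.
        tokens, last = self._buckets.get(caller, (self._burst, now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(self._burst, tokens + (now - last) * self._rate)
        if ids_requested > tokens:
            self._buckets[caller] = (tokens, now)
            return False
        self._buckets[caller] = (tokens - ids_requested, now)
        return True
```

Counting IDs instead of calls means a batch request of 4,096 IDs and 4,096 single-ID requests draw the same budget, so steering a caller to batching does not let it bypass the limit.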
Restart replays a hot millisecond
A worker crashes mid-millisecond and supervisord restarts it within the same ms (or onto a marginally slow clock). Without persisted state it reuses timestamp+sequence pairs it already issued.
- Persist last_timestamp locally at every flush interval and on clean shutdown; on boot, spin until the wall clock exceeds it
- If local state is lost (disk gone), refuse to start until the current wall clock exceeds the registry's best-effort copy plus a safety margin
- Boot-time NTP sanity check before the first mint
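The persist-and-wait steps above could be sketched as follows. The function names and the file layout are hypothetical. Because a crash can land between two flushes, the persisted value may lag the last timestamp actually issued; the sketch therefore adds one flush interval to the persisted value before starting, which is an assumption of this sketch rather than something the text specifies.

```python
import os
import tempfile
import time


def persist_last_timestamp(path, last_timestamp_ms):
    """Write the value atomically: temp file, fsync, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(str(last_timestamp_ms))
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # readers see the old or new value, never half
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_last_timestamp(path):
    """Return the persisted value, or None if local state is gone."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return None


def wait_for_safe_start(persisted_ms, flush_interval_ms,
                        now_ms=lambda: time.time_ns() // 1_000_000,
                        sleep=time.sleep):
    """Block until the wall clock is past every timestamp that may
    have been issued before the crash, then return the current time."""
    floor = persisted_ms + flush_interval_ms  # covers the unflushed window
    now = now_ms()
    while now <= floor:
        sleep(0.001)
        now = now_ms()
    return now
```

When `load_last_timestamp` returns `None`, the caller would take the lost-state path described above and compare against the registry's copy instead of starting immediately.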