ID Generator Failure Modes

What breaks, how to detect it, and how to fix it. Each failure mode lists its detection metrics, its mitigations, and a severity rating.

Failure Modes

CRITICAL

Clock steps backward on a generator host

Three common causes: NTP in step mode corrects a fast clock; a VM migration or hypervisor pause lands the guest in the past; a leap-second smear is misconfigured. In each case the worker's wall clock now reads a millisecond it has already used, so if it keeps generating it mints duplicates.

Monotonic-vs-wall-clock regression check on every mint; backward-step event counter (any occurrence alerts); fleet clock-skew gauge vs median.

Mitigation

Policy ladder: regression under 20 ms -> spin-wait it out; 20 ms or more -> refuse to mint and page (being down beats minting duplicates)
Optional grace: freeze the timestamp and borrow remaining sequence numbers (up to 4,096 IDs) before refusing
Prevention: slew-only NTP on generator hosts; persist last_timestamp so restarts wait instead of replaying
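
The policy ladder above can be sketched roughly as follows. This is a minimal illustration, not the production generator: the class and field names are hypothetical, the 10-bit worker field and the use of the raw Unix-epoch millisecond are assumptions (only the 12-bit, 4,096-per-ms sequence comes from the text), and the optional frozen-timestamp grace step is left out.

```python
import time
from typing import Callable

MAX_SPIN_MS = 20                 # threshold from the policy ladder
SEQUENCE_BITS = 12               # 4,096 IDs per millisecond
WORKER_BITS = 10                 # assumption: layout not stated in this document
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class ClockRegressionError(RuntimeError):
    """The wall clock is too far behind to wait out; caller should page."""


def wall_clock_ms() -> int:
    return time.time_ns() // 1_000_000


class Generator:
    def __init__(self, worker_id: int,
                 now_ms: Callable[[], int] = wall_clock_ms,
                 sleep: Callable[[float], None] = time.sleep):
        self.worker_id = worker_id
        self.now_ms = now_ms
        self.sleep = sleep
        self.last_ts = -1
        self.sequence = 0
        self.backward_steps = 0  # any increment should alert

    def mint(self) -> int:
        ts = self.now_ms()
        if ts < self.last_ts:
            # Regression check on every mint.
            self.backward_steps += 1
            regression = self.last_ts - ts
            if regression >= MAX_SPIN_MS:
                raise ClockRegressionError(f"clock is {regression} ms behind")
            while ts < self.last_ts:       # small step: wait it out
                self.sleep(0.001)
                ts = self.now_ms()
        if ts == self.last_ts:
            self.sequence = (self.sequence + 1) & MAX_SEQUENCE
            if self.sequence == 0:         # millisecond exhausted: wait for the next
                while ts <= self.last_ts:
                    ts = self.now_ms()
        else:
            self.sequence = 0
        self.last_ts = ts
        return ((ts << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self.sequence)
```

Injecting `now_ms` and `sleep` keeps the regression paths testable without touching the host clock. A caller would treat `ClockRegressionError` as "refuse and page" rather than retrying in a loop.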

CRITICAL

Two workers hold the same worker ID

A GC pause outlives the lease: worker 7 freezes, its lease expires, a new node claims 7, old-7 thaws and resumes minting. Or the low-tech version: a copy-pasted manifest hardcodes the same ID twice.

Duplicate canary (SETNX on every Nth ID) is the backstop; registry audit comparing live workers to claimed znodes; per-worker mint counters showing two sources for one ID.
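
A minimal sketch of the duplicate canary, under stated assumptions. The `InMemoryStore` class is a stand-in for a shared store with set-if-absent semantics (the text names SETNX; with a Redis client this would probably be a `SET` with the NX option), and all names here are hypothetical. The sketch samples on the ID value rather than on a per-worker counter, so that two claimants of the same worker ID check the same keys; that choice is an assumption, not something this document specifies.

```python
from typing import Dict, Optional

SAMPLE_EVERY = 100  # hypothetical N; tune against store load


class InMemoryStore:
    """Stand-in for a shared key-value store with set-if-absent semantics."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set_if_absent(self, key: str, value: str) -> bool:
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class DuplicateCanary:
    def __init__(self, store: InMemoryStore, instance: str,
                 sample_every: int = SAMPLE_EVERY) -> None:
        self.store = store
        self.instance = instance      # e.g. hostname, recorded for the audit
        self.sample_every = sample_every
        self.duplicates = 0           # any increment should page

    def observe(self, new_id: int) -> bool:
        """Return True if a sampled ID had already been claimed."""
        if new_id % self.sample_every != 0:
            return False              # not sampled
        if self.store.set_if_absent(f"canary:{new_id}", self.instance):
            return False              # first time this ID has been seen
        self.duplicates += 1
        return True
```

On a hit, `store.get` returns the instance that claimed the ID first, which gives the downstream audit a starting point for the overlap window.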

Mitigation

Fencing deadline: generation stops at lease TTL minus margin, BEFORE expiry can hand the ID to someone else
Assignment by construction: ephemeral leases or StatefulSet ordinals, never hand-maintained config
On canary hit: park both claimants immediately, quarantine the ID, audit the overlap window downstream
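
The fencing deadline can be sketched as below. The names are hypothetical and the TTL and margin values are left to the caller. Two assumptions are worth calling out: the deadline is measured on the monotonic clock so that wall-clock steps cannot move it, and it is computed from the time the renewal request was sent rather than when the reply arrived, which errs on the early side.

```python
import time
from typing import Callable


class LeaseFence:
    """Tracks the last moment this worker may safely mint under its lease."""

    def __init__(self, ttl_s: float, margin_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if not 0 <= margin_s < ttl_s:
            raise ValueError("margin must be non-negative and shorter than the TTL")
        self.ttl_s = ttl_s
        self.margin_s = margin_s
        self.clock = clock
        self.deadline = float("-inf")   # no minting before the first renewal

    def renewed(self, sent_at: float) -> None:
        """Record a successful renewal; sent_at is the clock reading taken
        just before the renewal request was sent."""
        self.deadline = sent_at + self.ttl_s - self.margin_s

    def may_mint(self) -> bool:
        return self.clock() < self.deadline
```

A pause that lands between `may_mint()` and the mint itself can still slip past this check, so the margin needs to cover the longest pause the worker is expected to suffer, and the duplicate canary remains the backstop.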

HIGH

Coordination store (ZooKeeper) outage

The registry cluster loses quorum during a network event. New workers cannot claim IDs; running workers cannot renew leases and their fencing deadlines approach.

Registry health from the platform; lease-renewal latency climbing toward the fencing deadline on every worker simultaneously.

Mitigation

Running workers keep minting until their individual deadlines: a short registry blip costs nothing
Extended outage: workers park in waves as their deadlines pass, which is loud, safe degradation rather than silent risk
Capacity: keep enough spare claimed-but-idle workers that parking a wave leaves headroom; the registry runs as a 5-node ensemble, so it keeps quorum with up to two nodes down

MEDIUM

Sequence saturation under a burst writer

One caller (a backfill job, a tight retry loop) demands more than 4,096 IDs/ms from a single worker. Every millisecond saturates; the generator spin-waits constantly and becomes the caller's rate limiter.

Sequence saturation ratio per worker (fraction of ms hitting 4,095) alerting at 10% sustained; per-caller QPS attribution.
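
One way to compute the saturation ratio, as a rough sketch. The class name and window length are hypothetical, and the denominator is assumed to be milliseconds in which the worker minted at least one ID; the document does not say whether idle milliseconds count.

```python
from collections import deque
from typing import Deque

MAX_SEQUENCE = 4095        # last sequence value in a millisecond
ALERT_THRESHOLD = 0.10     # alert level from the detection line above


class SaturationGauge:
    """Fraction of recent active milliseconds that used the whole sequence space."""

    def __init__(self, window_ms: int = 1000) -> None:
        # One flag per active millisecond; old entries fall off the left.
        self.window: Deque[bool] = deque(maxlen=window_ms)

    def record_ms(self, peak_sequence: int) -> None:
        """Call once per millisecond with the highest sequence value issued."""
        self.window.append(peak_sequence >= MAX_SEQUENCE)

    def ratio(self) -> float:
        if not self.window:
            return 0.0
        return sum(self.window) / len(self.window)

    def should_alert(self, threshold: float = ALERT_THRESHOLD) -> bool:
        return self.ratio() >= threshold
```

The bounded window is what makes the alert "sustained": a single saturated millisecond barely moves the ratio, while a backfill job that pins the worker pushes it over the threshold within the window.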

Mitigation

Steer to GenerateBatch: one RPC per 500-4,096 IDs removes the per-call ceiling
Per-caller rate limits so one bad citizen cannot make the fleet look slow
Spread hot callers across workers (client-side load balancing) or hand them library mode

HIGH

Restart replays a hot millisecond

A worker crashes mid-millisecond and supervisord restarts it within the same ms (or onto a marginally slow clock). Without persisted state it reuses timestamp+sequence pairs it already issued.

Duplicate canary; startup log asserting the wait-for-clock step; last_timestamp snapshot age at boot.

Mitigation

Persist last_timestamp locally on every flush interval and on clean shutdown; on boot, spin until wall clock exceeds it
If local state is lost (disk gone), refuse to start until the current wall clock exceeds the registry's best-effort copy plus a safety margin
Boot-time NTP sanity check before the first mint
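
A minimal sketch of the persist-and-wait steps, with hypothetical function names and file format. One assumption goes beyond the text: because the snapshot is written once per flush interval, it can lag the newest timestamp actually used, so the boot wait here adds one flush interval as a margin. The lost-state path (falling back to the registry's copy) is not covered; a caller that gets `None` from `load_last_timestamp` would take that path instead.

```python
import os
import tempfile
import time
from typing import Callable, Optional


def wall_clock_ms() -> int:
    return time.time_ns() // 1_000_000


def save_last_timestamp(path: str, last_ts_ms: int) -> None:
    """Atomically persist the newest timestamp this worker has used."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(str(last_ts_ms))
        f.flush()
        os.fsync(f.fileno())      # make sure the bytes reach the disk
    os.replace(tmp, path)         # atomic swap: readers never see a partial file


def load_last_timestamp(path: str) -> Optional[int]:
    """Return the persisted timestamp, or None if it is missing or unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def wait_for_clock(last_ts_ms: int, flush_interval_ms: int,
                   now_ms: Callable[[], int] = wall_clock_ms,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Block until the wall clock is past anything the previous process
    could have issued, then return the current time in milliseconds."""
    floor = last_ts_ms + flush_interval_ms   # assumed margin for snapshot lag
    while (now := now_ms()) <= floor:
        sleep((floor - now + 1) / 1000)
    return now
```

Logging the value returned by `wait_for_clock` alongside the loaded snapshot gives the startup log line and the snapshot-age metric mentioned in the detection list.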