ID Generator Failure Modes

What breaks, how to detect it, and how to fix it. Every failure mode includes detection metrics, mitigations, and a severity rating.

Failure Modes
CRITICAL

Clock steps backward on a generator host

NTP in step mode corrects a fast clock; a VM migration or hypervisor pause lands the guest in the past; a leap-second smear is misconfigured. In each case the worker's wall clock now reads a millisecond it has already used, so continuing to generate mints duplicate IDs.

Monotonic-vs-wall-clock regression check on every mint; backward-step event counter (any occurrence alerts); fleet clock-skew gauge vs median.
Mitigation
  1. Policy ladder: regression under 20 ms -> spin-wait it out; larger -> refuse and page (being down beats minting duplicates)
  2. Optional grace: freeze the timestamp and borrow remaining sequence numbers (up to 4,096 IDs) before refusing
  3. Prevention: slew-only NTP on generator hosts; persist last_timestamp so restarts wait instead of replaying
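
The policy ladder above can be sketched as a single timestamp function. This is a minimal illustration, not the generator's actual code: the names `next_timestamp_ms` and `ClockRegressionError` are hypothetical, the clock and sleep functions are injectable only so the logic can be exercised without a real clock step, and the optional grace step (freezing the timestamp) is left out.

```python
import time

MAX_SPIN_MS = 20  # threshold taken from the policy ladder above


class ClockRegressionError(RuntimeError):
    """Raised when the wall clock is too far behind the last minted timestamp."""


def next_timestamp_ms(last_ts_ms,
                      now_ms=lambda: time.time_ns() // 1_000_000,
                      sleep=time.sleep):
    """Return a timestamp >= last_ts_ms, or refuse if the clock regressed too far."""
    now = now_ms()
    if now >= last_ts_ms:
        return now

    regression = last_ts_ms - now
    if regression >= MAX_SPIN_MS:
        # Large step: refuse to mint; the caller is expected to page.
        raise ClockRegressionError(f"clock is {regression} ms behind last mint")

    # Small step: wait until the wall clock catches up with the last mint.
    while now < last_ts_ms:
        sleep(0.001)
        now = now_ms()
    return now
```

A caller would pass the timestamp of its most recent mint as `last_ts_ms` and treat `ClockRegressionError` as the signal to stop serving and alert.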
CRITICAL

Two workers hold the same worker ID

A GC pause outlives the lease: worker 7 freezes, its lease expires, a new node claims 7, old-7 thaws and resumes minting. Or the low-tech version: a copy-pasted manifest hardcodes the same ID twice.

Duplicate canary (SETNX on every Nth ID) is the backstop; registry audit comparing live workers to claimed znodes; per-worker mint counters showing two sources for one ID.
Mitigation
  1. Fencing deadline: generation stops at lease TTL minus margin, BEFORE expiry can hand the ID to someone else
  2. Assignment by construction: ephemeral leases or StatefulSet ordinals, never hand-maintained config
  3. On canary hit: park both claimants immediately, quarantine the ID, audit the overlap window downstream
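
The fencing deadline in mitigation 1 could look roughly like the sketch below. The class name `FencedLease` and its methods are hypothetical, and the TTL and margin values are placeholders. The sketch assumes the deadline is measured on a monotonic clock from the moment the renewal request was sent, so that network delay can only make the worker stop earlier, never later.

```python
import time


class FencedLease:
    """Tracks the local deadline after which a worker must stop minting."""

    def __init__(self, ttl_s, margin_s, clock=time.monotonic):
        if margin_s >= ttl_s:
            raise ValueError("margin must be smaller than the lease TTL")
        self._ttl_s = ttl_s
        self._margin_s = margin_s
        self._clock = clock
        self._deadline = None  # no lease held yet

    def renewed(self, sent_at):
        """Record a successful renewal.

        sent_at is the clock reading taken BEFORE the renewal request was
        sent, which keeps the local deadline conservative.
        """
        self._deadline = sent_at + self._ttl_s - self._margin_s

    def may_mint(self):
        """True only while the fencing deadline has not yet passed."""
        return self._deadline is not None and self._clock() < self._deadline
```

The check only helps if `may_mint()` is consulted immediately before each mint; a pause that lands between the check and the mint is still covered only by the margin, which is why the duplicate canary remains the backstop.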
HIGH

Coordination store (ZooKeeper) outage

The registry cluster loses quorum during a network event. New workers cannot claim IDs; running workers cannot renew leases and their fencing deadlines approach.

Registry health from the platform; lease-renewal latency climbing toward the fencing deadline on every worker simultaneously.
Mitigation
  1. Running workers keep minting until their individual deadlines: a short registry blip costs nothing
  2. Extended outage: workers park in waves as their deadlines pass, a loud but safe degradation rather than silent risk
  3. Capacity: keep enough spare claimed-but-idle workers that parking a wave leaves headroom; registry runs 5-node quorum
MEDIUM

Sequence saturation under a burst writer

One caller (a backfill job, a tight retry loop) demands more than 4,096 IDs/ms from a single worker. Every millisecond saturates; the generator spin-waits constantly and becomes the caller's rate limiter.

Sequence saturation ratio per worker (fraction of milliseconds in which the sequence reaches 4,095), alerting at 10% sustained; per-caller QPS attribution.
Mitigation
  1. Steer to GenerateBatch: one RPC per 500-4,096 IDs removes the per-call ceiling
  2. Per-caller rate limits so one bad citizen cannot make the fleet look slow
  3. Spread hot callers across workers (client-side load balancing) or hand them library mode
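
As a rough illustration of the saturation-ratio metric, the sketch below computes it over one observation window. The function names and the input shape (a mapping from millisecond timestamp to the highest sequence number issued in that millisecond) are assumptions made for the example; the "sustained" part of the alert would need the ratio tracked across several windows, which is not shown.

```python
MAX_SEQUENCE = 4095  # highest sequence value when 4,096 IDs fit in one millisecond


def saturation_ratio(peak_sequence_by_ms):
    """Fraction of observed milliseconds whose sequence reached MAX_SEQUENCE."""
    if not peak_sequence_by_ms:
        return 0.0
    saturated = sum(
        1 for peak in peak_sequence_by_ms.values() if peak >= MAX_SEQUENCE
    )
    return saturated / len(peak_sequence_by_ms)


def should_alert(ratio, threshold=0.10):
    """True when the ratio meets the 10% alerting threshold."""
    return ratio >= threshold
```

For example, a window of four milliseconds in which two hit 4,095 yields a ratio of 0.5, well above the alert threshold.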
HIGH

Restart replays a hot millisecond

A worker crashes mid-millisecond and supervisord restarts it within the same ms (or onto a marginally slow clock). Without persisted state it reuses timestamp+sequence pairs it already issued.

Duplicate canary; startup log asserting the wait-for-clock step; last_timestamp snapshot age at boot.
Mitigation
  1. Persist last_timestamp locally at every flush interval and on clean shutdown; on boot, spin until the wall clock exceeds it
  2. If local state is lost (disk gone), refuse to start until the current wall clock exceeds the registry's best-effort copy plus a safety margin
  3. Boot-time NTP sanity check before the first mint
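
A minimal sketch of the persist-and-wait step follows. The helper names and the one-integer file format are hypothetical; the write goes through a temporary file and `os.replace` so a crash mid-write cannot leave a truncated snapshot. When `load_last_timestamp` returns `None`, the caller is expected to fall back to mitigation 2 rather than start minting.

```python
import os
import time


def save_last_timestamp(path, ts_ms):
    """Atomically persist the last minted timestamp (milliseconds)."""
    tmp = f"{path}.tmp"
    with open(tmp, "w") as f:
        f.write(str(ts_ms))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename over the old snapshot


def load_last_timestamp(path):
    """Return the persisted timestamp, or None if it is missing or unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def wait_for_clock(last_ts_ms,
                   now_ms=lambda: time.time_ns() // 1_000_000,
                   sleep=time.sleep):
    """Block until the wall clock is strictly past last_ts_ms, then return it."""
    now = now_ms()
    while now <= last_ts_ms:
        sleep(0.001)
        now = now_ms()
    return now
```

Because the snapshot is only as fresh as the last flush, a real boot path would likely add the flush interval (plus a margin) to the loaded value before waiting.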