Clock Backward: The Failure That Defines the Design


A Snowflake worker's uniqueness rests on one assumption: time only moves forward. Then NTP notices the machine's clock is 300ms fast and steps it backward.

This is the failure mode that separates candidates who memorized the bit layout from those who can operate it. First, know the two NTP behaviors: slew (adjusting the clock rate by a tiny fraction until the offset is gone, which is safe) and step (jumping the clock straight to the corrected time, which is dangerous when the jump is backward).

“The worker now sees a millisecond it has already used. If it keeps generating, it will mint duplicate IDs, the one absolute failure an ID generator must never have.”

Production generators handle a backward step with a policy ladder. Small step (a few ms): spin-wait until the wall clock passes the last-used timestamp. Generation pauses briefly, and no duplicates are minted.
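The first rungs of that ladder can be sketched in a few lines. This is a minimal illustration, not a production generator: the class name `SnowflakeWorker`, the `max_wait_ms` threshold, and the injectable `now_ms` / `sleep` hooks are assumptions made here for testability, and the 41/10/12 bit layout with Twitter's published epoch is used only as a familiar example. A regression at or below the threshold is waited out; anything larger raises instead of minting.

```python
import time

EPOCH_MS = 1_288_834_974_657      # example custom epoch (Twitter's); any fixed epoch works
WORKER_BITS, SEQ_BITS = 10, 12    # assumed classic layout: 41-bit time, 10-bit worker, 12-bit sequence
MAX_SEQ = (1 << SEQ_BITS) - 1


class ClockMovedBackward(RuntimeError):
    """The wall clock regressed further than this worker is willing to wait."""


class SnowflakeWorker:
    def __init__(self, worker_id, now_ms=None, sleep=time.sleep, max_wait_ms=5):
        self.worker_id = worker_id
        # Clock and sleep are injectable so a test can simulate an NTP step.
        self.now_ms = now_ms or (lambda: time.time_ns() // 1_000_000)
        self.sleep = sleep
        self.max_wait_ms = max_wait_ms  # hypothetical threshold between "wait" and "refuse"
        self.last_ms = -1
        self.seq = 0

    def next_id(self):
        now = self.now_ms()
        if now < self.last_ms:
            behind = self.last_ms - now
            if behind > self.max_wait_ms:
                # Large regression: refuse loudly rather than risk a duplicate.
                raise ClockMovedBackward(f"clock is {behind} ms behind last-used timestamp")
            # Small regression: wait until the wall clock catches up.
            while now < self.last_ms:
                self.sleep(0.001)
                now = self.now_ms()
        if now == self.last_ms:
            self.seq = (self.seq + 1) & MAX_SEQ
            if self.seq == 0:
                # Sequence exhausted within this millisecond: wait for the next one.
                while now <= self.last_ms:
                    now = self.now_ms()
        else:
            self.seq = 0
        self.last_ms = now
        return ((now - EPOCH_MS) << (WORKER_BITS + SEQ_BITS)) | (self.worker_id << SEQ_BITS) | self.seq
```

A caller would treat `ClockMovedBackward` as a page-worthy event and fail the request, rather than retry in a tight loop.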

Large step (seconds): refuse to generate and alert. An ID generator that is down is an incident; one that duplicates is a catastrophe, because every downstream system silently corrupts. Better still, prevent the hazard: run NTP in slew-only mode on generator hosts, use the kernel's monotonic clock to detect wall-clock regressions, and persist the last-used timestamp so a crashed worker that restarts within the same millisecond (or onto a clock that is running behind) waits before minting.
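The monotonic-clock check can be sketched as follows. The idea is to record the wall clock and the monotonic clock together at startup, then compare how far each has moved: if the wall clock has advanced noticeably less than the monotonic clock, it was stepped backward. The class name `StepDetector` and the 10 ms tolerance are assumptions of this sketch; `time.time_ns` and `time.monotonic_ns` are the standard-library calls, and Python documents the monotonic clock as unaffected by system clock updates.

```python
import time


class StepDetector:
    """Compares wall-clock progress against monotonic progress since a baseline."""

    def __init__(self, wall_ns=time.time_ns, mono_ns=time.monotonic_ns, tolerance_ms=10):
        # Both clocks are injectable so a test can simulate a step.
        self.wall_ns = wall_ns
        self.mono_ns = mono_ns
        self.tolerance_ns = tolerance_ms * 1_000_000  # hypothetical tolerance for slew and jitter
        self.wall0 = wall_ns()
        self.mono0 = mono_ns()

    def _offset_ns(self):
        wall_elapsed = self.wall_ns() - self.wall0
        mono_elapsed = self.mono_ns() - self.mono0
        # Negative: the wall clock has fallen behind where monotonic time says it should be.
        return wall_elapsed - mono_elapsed

    def offset_ms(self):
        return self._offset_ns() / 1_000_000

    def stepped_backward(self):
        return self._offset_ns() < -self.tolerance_ns
```

A generator could call `stepped_backward()` before each batch of IDs and switch to its wait-or-refuse policy when it returns `True`. The detector only sees steps that happen while the process is running, which is why persisting the last-used timestamp is still needed for restarts.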

The elegant escape hatch some systems choose is to borrow from the sequence: if the clock reads backward, keep the last timestamp and continue incrementing the sequence until it overflows, buying up to 4,096 IDs of grace (the capacity of a 12-bit sequence). Logical time briefly outruns physical time, a tiny hybrid logical clock. The trade-off across all policies is availability versus safety, and the right answer is always safety: a paused generator degrades loudly; a duplicating one corrupts quietly.
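Sequence borrowing is small enough to show in full. The sketch below tracks only the (timestamp, sequence) pair and leaves bit-packing aside. The names `BorrowingClock` and `GraceExhausted` are hypothetical, and raising once the sequence is used up is one possible choice; a real generator might fall back to waiting or refusing at that point.

```python
import time

SEQ_BITS = 12
MAX_SEQ = (1 << SEQ_BITS) - 1  # 4095, so 4,096 slots per logical millisecond


class GraceExhausted(RuntimeError):
    """The sequence overflowed while the wall clock had not moved past last_ms."""


class BorrowingClock:
    """Hands out (timestamp_ms, sequence) pairs that never repeat, even if now_ms() regresses."""

    def __init__(self, now_ms=None):
        self.now_ms = now_ms or (lambda: time.time_ns() // 1_000_000)
        self.last_ms = -1
        self.seq = 0

    def next_slot(self):
        now = self.now_ms()
        if now > self.last_ms:
            # Normal case: physical time advanced, so start a fresh sequence.
            self.last_ms, self.seq = now, 0
        elif self.seq < MAX_SEQ:
            # Clock is at or behind last_ms: keep the logical timestamp and
            # borrow the next sequence value instead of trusting the wall clock.
            self.seq += 1
        else:
            # Grace used up; the caller must wait or refuse (assumption of this sketch).
            raise GraceExhausted(f"sequence exhausted at logical time {self.last_ms}")
        return self.last_ms, self.seq
```

The grace is whatever remains of the current millisecond's sequence, so a worker that had already issued many IDs in that millisecond gets fewer than 4,096 borrowed slots.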

What if the interviewer asks: how do you TEST this? Fault-inject clock steps in staging (libfaketime or chaos jobs) and keep a duplicate-detecting canary in production: a Redis SETNX on every Nth generated ID, where a failed set means that ID was already minted.
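A canary along those lines might look like the sketch below. Several things here are assumptions rather than anything the walkthrough prescribes: the class names, the `idcanary:` key prefix, the TTL, and the in-memory stand-in that lets the example run without a Redis server. The stand-in mimics the `set(name, value, nx=True, ex=...)` call of the redis-py client, which maps to `SET ... NX EX` (the modern form of SETNX with an expiry) and returns `None` when the key already exists. The sketch also samples by a hash of the ID instead of counting every Nth call.

```python
import zlib


class InMemoryStore:
    """Stand-in for a Redis client so the sketch runs without a server."""

    def __init__(self):
        self._keys = set()

    def set(self, name, value, nx=False, ex=None):
        # Mimics SET with NX: refuse to overwrite an existing key.
        if nx and name in self._keys:
            return None
        self._keys.add(name)
        return True


class DuplicateCanary:
    def __init__(self, store, sample_every=100, ttl_seconds=3600):
        self.store = store
        self.sample_every = sample_every  # roughly 1 in N IDs is checked
        self.ttl_seconds = ttl_seconds    # hypothetical retention for canary keys

    def is_sampled(self, new_id):
        # Hash the ID so the decision does not depend on its low (sequence) bits.
        return zlib.crc32(str(new_id).encode()) % self.sample_every == 0

    def observe(self, new_id):
        """Return False if a sampled ID was seen before, True otherwise."""
        if not self.is_sampled(new_id):
            return True
        created = self.store.set(f"idcanary:{new_id}", 1, nx=True, ex=self.ttl_seconds)
        return bool(created)
```

Sampling on the ID itself is a deliberate choice in this sketch: with a per-call counter, a duplicate is caught only if both copies happen to land on an Nth call, whereas a deterministic function of the ID checks both copies or neither. A `False` from `observe` should page someone, since it means two workers (or one worker twice) minted the same ID.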

Why it matters in interviews

Clock-backward handling is where this topic connects to the LWW clock-skew data loss from the key-value topic: same root cause, different blast radius. The spin-wait / refuse / sequence-borrow ladder plus slew-only NTP is the operational answer interviewers rarely hear complete.
