The Two Sins: Firing Early and Firing Late
A scheduler has exactly two ways to be wrong about time, and they are not symmetric. Firing late is a degradation: measurable, budgetable, recoverable. The SLO is a lateness distribution (ours: p99 within 5 seconds of scheduled time, p99.9 within 60 seconds), and lateness has honest causes (herd pulses, worker backlogs, takeover replays) with honest remedies (capacity, priority lanes, catch-up ordering). Firing early is a correctness violation, full stop. A payment retry that fires before its backoff window has elapsed re-hits a struggling processor; a market-open task that fires at 9:29:58 acts on a closed market; a reminder that arrives before the time it was scheduled for is simply wrong.
A scheduler instance whose clock runs fast fires early while believing itself punctual, so the design reuses the ID generator's clock hygiene wholesale: slew-only NTP on scheduler hosts; per-instance skew monitoring against the fleet median, with fencing past 100 ms (a skewed instance stops firing, because late is recoverable and early is not); and bucket handoffs that carry the previous holder's high-water timestamp, so a replacement with a slow clock cannot re-enter time the bucket has already left. Lateness then gets engineered honestly. Catch-up after an outage is a policy moment. Firing a two-hour backlog in timestamp order is wrong when fresh :00 tasks are also due, so the catch-up path interleaves (fresh tasks ride priority lanes; the backlog drains at a controlled rate) and re-checks misfire policies (that cache-warm task from 90 minutes ago should be skipped, not fired, because its moment has passed).
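The catch-up policy above can be sketched in a few lines. This is a minimal illustration, not the design's actual code: `Task`, `MisfirePolicy`, `plan_catch_up`, the 5-second freshness window, and the one-backlog-task-per-fresh-task drain rate are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class MisfirePolicy(Enum):
    FIRE_LATE = "fire_late"          # still worth running after a delay
    SKIP_IF_STALE = "skip_if_stale"  # drop once max_staleness_s has passed


@dataclass(frozen=True)
class Task:
    name: str
    scheduled_at: float              # epoch seconds
    policy: MisfirePolicy = MisfirePolicy.FIRE_LATE
    max_staleness_s: float = 0.0     # only consulted for SKIP_IF_STALE


def plan_catch_up(tasks, now, fresh_window_s=5.0, backlog_per_fresh=1):
    """Return (dispatch_order, skipped) for everything due at `now`.

    Tasks scheduled after `now` are never included: nothing fires early.
    """
    due = sorted((t for t in tasks if t.scheduled_at <= now),
                 key=lambda t: t.scheduled_at)
    fresh, backlog, skipped = [], [], []
    for t in due:
        age = now - t.scheduled_at
        if age <= fresh_window_s:
            fresh.append(t)          # priority lane
        elif t.policy is MisfirePolicy.SKIP_IF_STALE and age > t.max_staleness_s:
            skipped.append(t)        # its moment passed
        else:
            backlog.append(t)        # drains at a controlled rate

    # Interleave: each fresh task is followed by a bounded slice of backlog.
    order, b = [], 0
    for t in fresh:
        order.append(t)
        order.extend(backlog[b:b + backlog_per_fresh])
        b += backlog_per_fresh
    order.extend(backlog[b:])        # remaining backlog once fresh work is placed
    return order, skipped
```

With a two-hour outage ending at `now = 7200.0`, two old tasks, a stale cache-warm task, two fresh tasks, and one future task, this plan dispatches the fresh tasks first in each pair, skips the cache-warm task, and leaves the future task alone. A production version would presumably rate-limit the backlog by time rather than by count; the count-based ratio here only keeps the sketch short.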
“The invariant, never fire before scheduled time, costs discipline in exactly one place: clocks.”
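To make the clock discipline concrete, here is a rough sketch of the two guards described earlier: fencing on skew against the fleet median, and a bucket cursor seeded with the previous holder's high-water timestamp. The names `is_fenced`, `BucketCursor`, and `window_to_fire` are hypothetical, and the sketch assumes the instance already has a sample of fleet clock readings (`peer_nows`); how those readings are collected is out of scope here.

```python
import statistics

SKEW_FENCE_S = 0.100  # fencing threshold from the text: 100 ms


def is_fenced(local_now, peer_nows):
    """True when this instance's clock is too far from the fleet median."""
    return abs(local_now - statistics.median(peer_nows)) > SKEW_FENCE_S


class BucketCursor:
    """Tracks how far through time a bucket has already been processed."""

    def __init__(self, inherited_high_water):
        # Assumption: the handoff delivers the previous holder's high-water mark.
        self.high_water = inherited_high_water

    def next_window(self, local_now):
        """Return (start, end]: the scheduled times that may fire now.

        If the local clock is behind the high-water mark, the window is
        empty, so a slow replacement waits instead of re-entering old time.
        """
        start = self.high_water
        end = max(start, local_now)
        self.high_water = end
        return start, end


def window_to_fire(cursor, local_now, peer_nows):
    """Return the window to fire, or None when the instance is fenced."""
    if is_fenced(local_now, peer_nows):
        return None  # stop firing: late is recoverable, early is not
    return cursor.next_window(local_now)
```

The window's upper bound is the local clock, so a task is only eligible once `scheduled_at <= local_now`; the fence exists because that comparison is only as trustworthy as the clock behind it.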
And measurement must be honest: lateness is the delta between scheduled time and enqueue at the dispatcher, plus queue wait, plus the time to execution start. Reporting only the first while workers drown is how dashboards lie.
What if the interviewer asks: why is p99.9 lateness 60 seconds, not 5?
Because tail lateness during takeover replays and midnight pulses is real, and an SLO that ignores the workload's known pulses is a promise designed to be broken.
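The measurement rule and the two-tier SLO can be sketched together. This is an illustrative example under stated assumptions: the timestamp names (`enqueued_at`, `dequeued_at`, `started_at`), the helper names, and the nearest-rank percentile method are choices made for the sketch, and "execution start" is read here as the gap between dequeue and the moment the worker actually begins running the task.

```python
# SLO from the text, keyed in per-mille to avoid float rounding:
# p99 (990) within 5 s, p99.9 (999) within 60 s.
SLO_PERMILLE = {990: 5.0, 999: 60.0}


def lateness_breakdown(scheduled_at, enqueued_at, dequeued_at, started_at):
    """Split end-to-end lateness into the three components named in the text."""
    parts = {
        "dispatch_delta": enqueued_at - scheduled_at,
        "queue_wait": dequeued_at - enqueued_at,
        "start_delay": started_at - dequeued_at,
    }
    parts["total"] = sum(parts.values())
    return parts


def percentile(samples, permille):
    """Nearest-rank percentile; `permille` is an integer such as 990 or 999."""
    ordered = sorted(samples)
    rank = max(1, -(-permille * len(ordered) // 1000))  # integer ceiling
    return ordered[rank - 1]


def slo_violations(total_lateness_samples):
    """Return {permille: observed} for every SLO tier that is exceeded."""
    return {
        pm: observed
        for pm, limit in SLO_PERMILLE.items()
        if (observed := percentile(total_lateness_samples, pm)) > limit
    }
```

Feeding `slo_violations` the `total` values rather than `dispatch_delta` alone is the point: a dispatcher can enqueue every task on time while the queue wait quietly consumes the whole budget.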