The Two Sins: Firing Early and Firing Late


A scheduler has exactly two ways to be wrong about time, and they are not symmetric. Firing late is a degradation: measurable, budgetable, recoverable. The SLO is a lateness distribution (ours: p99 within 5 seconds of scheduled time, p99.9 within 60 seconds), and lateness has honest causes (herd pulses, worker backlogs, takeover replays) with honest remedies (capacity, priority lanes, catch-up ordering). Firing early is a correctness violation, full stop. A payment retry that fires before its backoff window has elapsed re-hits a struggling processor; a market-open task that fires at 9:29:58 acts on a closed market; a reminder that arrives before the moment it was scheduled for is simply wrong.
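The asymmetry can be made concrete in how the SLO is evaluated. The following is a minimal sketch, not a definitive implementation: `evaluate`, `SloReport`, and `nearest_rank` are hypothetical names, and the nearest-rank percentile is an assumed choice. Only the 5-second and 60-second budgets come from the text. Lateness is checked against a distribution, while a single early fire fails the check outright.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    early_fires: int   # count of fires before scheduled time (must be zero)
    p99_s: float       # 99th percentile lateness, seconds
    p999_s: float      # 99.9th percentile lateness, seconds
    ok: bool


def nearest_rank(sorted_vals, per_mille):
    """Nearest-rank percentile; per_mille=990 means p99, 999 means p99.9."""
    n = len(sorted_vals)
    rank = -(-(n * per_mille) // 1000)  # integer ceiling of n * per_mille / 1000
    return sorted_vals[max(rank, 1) - 1]


def evaluate(deltas_s, p99_budget_s=5.0, p999_budget_s=60.0):
    """deltas_s: actual fire time minus scheduled time, in seconds.

    A negative delta is an early fire. Early fires are counted, never
    averaged into the lateness distribution.
    """
    if not deltas_s:
        raise ValueError("no fire records to evaluate")
    early = sum(1 for d in deltas_s if d < 0)
    lateness = sorted(max(d, 0.0) for d in deltas_s)
    p99 = nearest_rank(lateness, 990)
    p999 = nearest_rank(lateness, 999)
    ok = early == 0 and p99 <= p99_budget_s and p999 <= p999_budget_s
    return SloReport(early, p99, p999, ok)
```

A window with a 45-second outlier in one fire out of a thousand can still pass; a window with one fire two seconds early cannot, however good its percentiles look.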

A scheduler instance whose clock runs fast fires early while believing itself punctual, so the design reuses the ID generator's clock hygiene wholesale: slew-only NTP on scheduler hosts, per-instance skew monitoring against the fleet median with fencing past 100 ms (a skewed instance stops firing, because late is recoverable and early is not), and bucket handoffs that carry the previous holder's high-water timestamp so a replacement with a slow clock cannot re-enter time the bucket has already left. Lateness then gets engineered honestly. Catch-up after an outage is a policy moment: firing a two-hour backlog in strict timestamp order is wrong when fresh :00 tasks are also due. The catch-up path instead interleaves (fresh tasks ride priority lanes; backlog drains at a controlled rate) and re-checks misfire policies (that cache-warm task from 90 minutes ago should be skipped, not fired, because its moment has passed).
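The catch-up behavior described above might look like the following for a single dispatcher tick. This is a sketch under stated assumptions: `Task`, `Misfire`, `plan_tick`, the `grace_s` field, the 5-second freshness window, and the per-tick backlog cap are all hypothetical illustrations, not part of the original design.

```python
from dataclasses import dataclass
from enum import Enum


class Misfire(Enum):
    FIRE_LATE = "fire_late"  # still worth running, however late
    SKIP = "skip"            # worthless once its grace period has passed


@dataclass(frozen=True)
class Task:
    name: str
    scheduled_at: float               # epoch seconds
    misfire: Misfire = Misfire.FIRE_LATE
    grace_s: float = 0.0              # how late a SKIP task may still fire


def plan_tick(due, now, fresh_window_s=5.0, backlog_per_tick=2):
    """Split due tasks into (fire_now, deferred, skipped) for one tick.

    Fresh tasks always fire this tick (the priority lane). Backlog drains
    oldest-first but capped per tick (the controlled rate). SKIP tasks
    past their grace period are dropped instead of fired.
    """
    fresh, backlog, skipped = [], [], []
    for task in sorted(due, key=lambda t: t.scheduled_at):
        lateness = now - task.scheduled_at
        if lateness < 0:
            # Never fire early: an undue task must not reach the planner.
            raise ValueError(f"{task.name} is not due yet")
        if task.misfire is Misfire.SKIP and lateness > task.grace_s:
            skipped.append(task)
        elif lateness <= fresh_window_s:
            fresh.append(task)
        else:
            backlog.append(task)
    fire_now = fresh + backlog[:backlog_per_tick]
    deferred = backlog[backlog_per_tick:]
    return fire_now, deferred, skipped
```

With a two-hour backlog plus one task that came due a second ago, the fresh task goes first, the two oldest backlog tasks follow, the rest wait for the next tick, and a 90-minute-old cache-warm task with a five-minute grace period lands in `skipped`.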

“The invariant, never fire before scheduled time, costs discipline in exactly one place: clocks.”
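The clock side of that invariant can be sketched as a guard that runs before anything fires. The names here (`BucketLease`, `Fenced`, `due_window`, `fire_due`) are hypothetical, and the sketch assumes the high-water mark means "every scheduled time up to here has already been handled by some holder of this bucket". The 100 ms threshold is the one given in the text. Takeover replays of unfinished work are out of scope.

```python
from dataclasses import dataclass

MAX_SKEW_S = 0.100  # fencing threshold from the text: 100 ms


class Fenced(Exception):
    """Raised when this instance's clock is too far from the fleet median."""


@dataclass
class BucketLease:
    high_water: float  # scheduled times <= this are already handled


def due_window(local_now, fleet_median_now, lease):
    """Return (start, end]: the scheduled times this instance may fire now."""
    if abs(local_now - fleet_median_now) > MAX_SKEW_S:
        # A skewed instance stops firing: late is recoverable, early is not.
        raise Fenced(f"skew {local_now - fleet_median_now:+.3f}s exceeds limit")
    start = lease.high_water
    # A slow clock yields an empty window instead of re-entering past time.
    end = max(local_now, start)
    return start, end


def fire_due(tasks, local_now, fleet_median_now, lease):
    """tasks: iterable of (name, scheduled_at). Returns the names to fire."""
    start, end = due_window(local_now, fleet_median_now, lease)
    fired = [name for name, at in tasks if start < at <= end]
    lease.high_water = end  # the mark only ever moves forward
    return fired
```

A replacement whose clock reads earlier than the inherited high-water mark fires nothing until its clock catches up. That costs lateness, which is the recoverable side of the trade.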

And measurement must be honest: lateness is the dispatcher's scheduled-to-enqueue delta plus queue wait plus the delay until execution actually starts. Reporting only the first while workers drown is how dashboards lie.

What if the interviewer asks: why is p99.9 lateness 60 seconds, not 5?

Because tail lateness during takeover replays and midnight pulses is real, and an SLO that ignores the workload's known pulses is a promise designed to be broken.
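To make the end-to-end lateness measurement concrete, here is a small sketch of a per-fire record that keeps all three components instead of only the dispatcher's. `FireRecord` and its field names are hypothetical; the decomposition itself (dispatch delay, queue wait, execution start) follows the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FireRecord:
    scheduled_at: float  # when the task was supposed to fire
    enqueued_at: float   # dispatcher put it on the worker queue
    dequeued_at: float   # a worker picked it up
    started_at: float    # the handler actually began executing

    @property
    def dispatch_delay(self):
        return self.enqueued_at - self.scheduled_at

    @property
    def queue_wait(self):
        return self.dequeued_at - self.enqueued_at

    @property
    def start_delay(self):
        return self.started_at - self.dequeued_at

    @property
    def lateness(self):
        # End-to-end: what the SLO should be measured against.
        return self.started_at - self.scheduled_at


# Dispatcher was punctual; the worker queue was not.
rec = FireRecord(scheduled_at=1000.0, enqueued_at=1000.5,
                 dequeued_at=1042.5, started_at=1043.0)
print(rec.dispatch_delay, rec.queue_wait, rec.start_delay, rec.lateness)
```

A dashboard that charts only `dispatch_delay` would show this fire as half a second late, when the task actually started 43 seconds after its scheduled time.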

Why it matters in interviews

The asymmetry (late is a budget, early is a bug) is the topic's crispest design principle, and it links clock discipline (the ID generator) to SLO engineering. The catch-up interleaving and misfire re-check show operational depth beyond the happy path.
