Task Scheduler Failure Modes
What breaks, how to detect it, and how to fix it. Every failure includes detection metrics, mitigations, and severity rating.
Dispatcher dies mid-bucket
The instance owning bucket 19:35 crashes after firing 40K of 3.5M tasks. Its lease expires; until takeover, the bucket is unowned and nothing in it fires.
- Takeover within the lease TTL: replacement reloads the bucket, skips below the checkpoint, re-fires the uncertain suffix
- Duplicates from the suffix collapse against idempotency keys downstream: at-least-once doing its job
- Unowned-bucket time budgeted under 30s; fencing guarantees the dead instance cannot double-fire during the window
The midnight pulse outruns the workers
500K firings hit the queue in one second. Dispatchers handle it (microsecond work), but execution backs up: queue depth explodes and fresh latency-sensitive tasks queue behind bulk.
- Clock-based pre-scaling: the pulse is on the calendar: workers scale ahead, not reactively
- Priority lanes: medication reminders never sit behind 400K cache refreshes
- Default-on stable jitter drains the herd at its source over time (each new cron spreads itself)
Clock skew threatens early fires
A dispatcher's clock drifts +3s. It believes 19:35:00 has arrived at 19:34:57 and fires the slot: payment retries hit early, pre-market tasks act early: the one unforgivable direction.
- Slew-only NTP on dispatcher hosts; skewed instances fence themselves (stop firing: late beats early)
- Bucket handoffs carry the prior holder's high-water timestamp: a successor cannot re-enter time already fired
- Wheel ticks driven by a monotonic clock disciplined against the fenced wall clock
Poison task destabilizes the worker fleet
A malformed payload crashes its worker on every attempt. Retry machinery faithfully reschedules it; each cycle kills another worker; multiplied by a tenant's batch of such tasks, fleet capacity bleeds.
- Crash threshold (2 worker deaths) short-circuits to DLQ regardless of remaining retry budget
- DLQ alerts on arrival by task class: quarantine with payload, attempts, and stack traces for diagnosis
- Per-tenant retry budgets stop one tenant's poison batch from becoming everyone's outage
Extended outage creates a catch-up storm
Two hours down; 80M firings owed. Blind replay would bury fresh work, flood downstreams with stale tasks, and execute jobs whose moment has passed.
- Interleaved drain: fresh tasks ride priority lanes untouched; backlog releases at a controlled rate
- Misfire policies re-checked per task: FIRE_ALL replays, FIRE_ONCE collapses the series, SKIP drops expired moments
- Drain rate adaptive to downstream error rates: catch-up must not become the second outage