Task Scheduler Failure Modes

What breaks, how to detect it, and how to fix it. Every failure includes detection metrics, mitigations, and severity rating.

Failure Modes

HIGH

Dispatcher dies mid-bucket

The instance owning bucket 19:35 crashes after firing 40K of 3.5M tasks. Its lease expires; until takeover, the bucket is unowned and nothing in it fires.

Lease-coverage monitor (any unowned bucket alerts within seconds); checkpoint lag gauge; the every-minute canary task going silent.

Mitigation

Takeover within the lease TTL: replacement reloads the bucket, skips below the checkpoint, re-fires the uncertain suffix
Duplicates from the suffix collapse against idempotency keys downstream: at-least-once doing its job
Unowned-bucket time budgeted under 30s; fencing guarantees the dead instance cannot double-fire during the window

HIGH

The midnight pulse outruns the workers

500K firings hit the queue in one second. Dispatchers handle it (microsecond work), but execution backs up: queue depth explodes and fresh latency-sensitive tasks queue behind bulk.

Queue depth and oldest-message age per lane; lateness sawtooth by second-of-minute; pre-scale readiness check before :00.

Mitigation

Clock-based pre-scaling: the pulse is on the calendar: workers scale ahead, not reactively
Priority lanes: medication reminders never sit behind 400K cache refreshes
Default-on stable jitter drains the herd at its source over time (each new cron spreads itself)

CRITICAL

Clock skew threatens early fires

A dispatcher's clock drifts +3s. It believes 19:35:00 has arrived at 19:34:57 and fires the slot: payment retries hit early, pre-market tasks act early: the one unforgivable direction.

Per-instance skew vs fleet median (fence at 100ms); the early-fire counter (scheduled > actual): any nonzero tick pages as an incident.

Mitigation

Slew-only NTP on dispatcher hosts; skewed instances fence themselves (stop firing: late beats early)
Bucket handoffs carry the prior holder's high-water timestamp: a successor cannot re-enter time already fired
Wheel ticks driven by a monotonic clock disciplined against the fenced wall clock

HIGH

Poison task destabilizes the worker fleet

A malformed payload crashes its worker on every attempt. Retry machinery faithfully reschedules it; each cycle kills another worker; multiplied by a tenant's batch of such tasks, fleet capacity bleeds.

Crash-count per task (distinct from clean failures); worker-death rate correlated to task class; retry amplification ratio.

Mitigation

Crash threshold (2 worker deaths) short-circuits to DLQ regardless of remaining retry budget
DLQ alerts on arrival by task class: quarantine with payload, attempts, and stack traces for diagnosis
Per-tenant retry budgets stop one tenant's poison batch from becoming everyone's outage

MEDIUM

Extended outage creates a catch-up storm

Two hours down; 80M firings owed. Blind replay would bury fresh work, flood downstreams with stale tasks, and execute jobs whose moment has passed.

Backlog size and age at recovery; misfire-policy mix of the backlog; downstream health during drain.

Mitigation

Interleaved drain: fresh tasks ride priority lanes untouched; backlog releases at a controlled rate
Misfire policies re-checked per task: FIRE_ALL replays, FIRE_ONCE collapses the series, SKIP drops expired moments
Drain rate adaptive to downstream error rates: catch-up must not become the second outage