Failure Modes

Task Scheduler Failure Modes

What breaks, how to detect it, and how to fix it. Every failure includes detection metrics, mitigations, and severity rating.

Failure Modes
HIGH

Dispatcher dies mid-bucket

The instance owning bucket 19:35 crashes after firing 40K of 3.5M tasks. Its lease expires; until takeover, the bucket is unowned and nothing in it fires.

Lease-coverage monitor (any unowned bucket alerts within seconds); checkpoint lag gauge; the every-minute canary task going silent.
Mitigation
  1. Takeover within the lease TTL: replacement reloads the bucket, skips below the checkpoint, re-fires the uncertain suffix
  2. Duplicates from the suffix collapse against idempotency keys downstream: at-least-once doing its job
  3. Unowned-bucket time budgeted under 30s; fencing guarantees the dead instance cannot double-fire during the window
HIGH

The midnight pulse outruns the workers

500K firings hit the queue in one second. Dispatchers handle it (microsecond work), but execution backs up: queue depth explodes and fresh latency-sensitive tasks queue behind bulk.

Queue depth and oldest-message age per lane; lateness sawtooth by second-of-minute; pre-scale readiness check before :00.
Mitigation
  1. Clock-based pre-scaling: the pulse is on the calendar: workers scale ahead, not reactively
  2. Priority lanes: medication reminders never sit behind 400K cache refreshes
  3. Default-on stable jitter drains the herd at its source over time (each new cron spreads itself)
CRITICAL

Clock skew threatens early fires

A dispatcher's clock drifts +3s. It believes 19:35:00 has arrived at 19:34:57 and fires the slot: payment retries hit early, pre-market tasks act early: the one unforgivable direction.

Per-instance skew vs fleet median (fence at 100ms); the early-fire counter (scheduled > actual): any nonzero tick pages as an incident.
Mitigation
  1. Slew-only NTP on dispatcher hosts; skewed instances fence themselves (stop firing: late beats early)
  2. Bucket handoffs carry the prior holder's high-water timestamp: a successor cannot re-enter time already fired
  3. Wheel ticks driven by a monotonic clock disciplined against the fenced wall clock
HIGH

Poison task destabilizes the worker fleet

A malformed payload crashes its worker on every attempt. Retry machinery faithfully reschedules it; each cycle kills another worker; multiplied by a tenant's batch of such tasks, fleet capacity bleeds.

Crash-count per task (distinct from clean failures); worker-death rate correlated to task class; retry amplification ratio.
Mitigation
  1. Crash threshold (2 worker deaths) short-circuits to DLQ regardless of remaining retry budget
  2. DLQ alerts on arrival by task class: quarantine with payload, attempts, and stack traces for diagnosis
  3. Per-tenant retry budgets stop one tenant's poison batch from becoming everyone's outage
MEDIUM

Extended outage creates a catch-up storm

Two hours down; 80M firings owed. Blind replay would bury fresh work, flood downstreams with stale tasks, and execute jobs whose moment has passed.

Backlog size and age at recovery; misfire-policy mix of the backlog; downstream health during drain.
Mitigation
  1. Interleaved drain: fresh tasks ride priority lanes untouched; backlog releases at a controlled rate
  2. Misfire policies re-checked per task: FIRE_ALL replays, FIRE_ONCE collapses the series, SKIP drops expired moments
  3. Drain rate adaptive to downstream error rates: catch-up must not become the second outage