STANDARDwalkthrough

Retries, Backoff, and the Poison Task

6 of 8
3 related
A task fires on time and its handler fails: the payment API is down, the webhook 500s, the container OOMs. The scheduler's contract does not end at firing; it ends at resolution: success, exhausted retries, or quarantine.
And then the poison task: the one that crashes its worker every single attempt: a malformed payload, a pathological input, a handler bug. Unbounded, one poison task can cycle through and destabilize the entire worker fleet; the defense is a crash-count threshold (distinct from failure count: a clean error is a failure; a worker death is a crash) after which the task moves to the dead-letter queue: quarantined with its full context (payload, attempt history, stack traces) for human diagnosis, and critically, alerting by task class, because a DLQ nobody reads is a black hole with good intentions.
The retry machinery reuses this course's greatest hits deliberately. Exponential backoff with jitter: retry at 1m, 4m, 16m... with randomization, because synchronized retries after a downstream outage are a self-inflicted herd: the same lesson as the notification gateways, now on the execution side. Retry as reschedule: a retry is just a new scheduled task (same idempotency lineage, attempt counter incremented) written into a future time bucket: the scheduler retries through itself, which means retries inherit bucketing, leasing, at-least-once, and monitoring for free instead of needing a parallel bespoke path. Retry budgets: per task (max attempts from the template, then terminal failure) and per destination: when one webhook endpoint is down, ten thousand tasks retrying against it must not consume the worker fleet: a circuit breaker per destination converts them to fast-fail reschedules, exactly the provider-pool pattern from notifications.
Numbers to carry: retries typically add 10-20% to firing volume in steady state and multiples of that during downstream outages: capacity planning that ignores retry amplification under-provisions exactly when it matters. What if the interviewer asks: should retries respect the original priority?
Mostly yes, but decayed: a reminder retried for the fifth time an hour late has lost its urgency claim: priority decay per attempt is a cheap fairness win.
Why it matters in interviews
Retry design is where three earlier patterns (backoff + jitter, circuit breakers, DLQs) compose into one flow, and retry-as-reschedule is the elegant move that makes the scheduler self-hosting. The poison-task/crash-count distinction is an operational scar interviewers recognize.
Related concepts