Task Scheduler Anti-Patterns

Common design mistakes candidates make. Learn what goes wrong and how to avoid each trap in your interview.

Polling the Database for Due Tasks

Very Common

Every scheduler instance runs SELECT ... WHERE fire_at <= now() FOR UPDATE SKIP LOCKED on a loop.

Why: It is the obvious design, works perfectly at startup scale, and every ORM tutorial shows it.

WRONG: At 10B rows and 11.6K firings/sec (roughly 1B firings per day), the hot end of the fire_at index holds the most contended pages in the database: polls fight inserts, instances fight each other, and the poll interval becomes the floor on your worst-case lateness.

RIGHT: Time-bucket partitioning: tasks write into 5-minute buckets; instances lease whole buckets and load them into memory ahead of time. The query disappears entirely: firing reads RAM.
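
As a rough illustration of the bucket idea, the sketch below maps each fire time to a 5-minute bucket and lets one instance lease and preload a whole bucket. BucketStore, add_task, and lease_bucket are hypothetical names, the store is an in-memory stand-in for a durable one, and lease expiry is left out for brevity.

```python
from collections import defaultdict
from datetime import datetime, timezone

BUCKET_SECONDS = 300  # 5-minute buckets, as described in the text


def bucket_id(fire_at: datetime) -> int:
    """Map a fire time to the integer id of its 5-minute bucket."""
    return int(fire_at.timestamp()) // BUCKET_SECONDS


class BucketStore:
    """In-memory stand-in for the durable bucket store (hypothetical)."""

    def __init__(self):
        self._buckets = defaultdict(list)  # bucket id -> [(fire_at, task_id)]
        self._leases = {}                  # bucket id -> owning instance

    def add_task(self, task_id: str, fire_at: datetime) -> None:
        # Writes spread across buckets instead of one hot index range.
        self._buckets[bucket_id(fire_at)].append((fire_at, task_id))

    def lease_bucket(self, bucket: int, owner: str):
        """Return the bucket's tasks sorted by fire time, or None if
        another instance already holds the lease."""
        if self._leases.setdefault(bucket, owner) != owner:
            return None
        return sorted(self._buckets[bucket])


if __name__ == "__main__":
    store = BucketStore()
    nine = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
    store.add_task("email-42", nine)
    print(store.lease_bucket(bucket_id(nine), "instance-1"))
```

Once a bucket is leased and loaded, firing is a walk over an in-memory sorted list; the store is touched once per bucket rather than once per poll.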

sleep() and In-Process Timers as the Scheduler

Very Common

Application servers hold future work as in-memory timers (setTimeout, ScheduledExecutorService) until it is due.

Why: It is one line of code and there is no infrastructure to stand up: the deploy that erases every pending timer is next week's problem.

WRONG: A deploy, crash, or scale-down silently vaporizes every scheduled action on that instance: no error, no retry, no record. Reminders just never happen.

RIGHT: Durable-first: a task exists in the store before anything acknowledges scheduling it. In-memory structures (timing wheels) are serving caches over durable buckets, rebuilt on every restart.
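
A minimal sketch of durable-first scheduling, assuming SQLite as a stand-in for the durable store and a binary heap in place of a timing wheel (a simpler structure playing the same serving-cache role). DurableScheduler and the tasks table schema are hypothetical.

```python
import heapq
import sqlite3


class DurableScheduler:
    """The store is the source of truth; the heap is only a cache."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS tasks ("
            "id TEXT PRIMARY KEY, fire_at REAL NOT NULL, "
            "done INTEGER NOT NULL DEFAULT 0)"
        )
        self.conn.commit()
        self.heap = []
        self.rebuild()

    def rebuild(self) -> None:
        """Reload every pending task from the store, as on a restart."""
        rows = self.conn.execute(
            "SELECT fire_at, id FROM tasks WHERE done = 0"
        ).fetchall()
        self.heap = list(rows)
        heapq.heapify(self.heap)

    def schedule(self, task_id: str, fire_at: float) -> str:
        # Commit to the store BEFORE acknowledging the caller.
        self.conn.execute(
            "INSERT INTO tasks (id, fire_at) VALUES (?, ?)",
            (task_id, fire_at),
        )
        self.conn.commit()
        heapq.heappush(self.heap, (fire_at, task_id))
        return task_id


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    DurableScheduler(conn).schedule("reminder-1", 1_700_000_000.0)
    # A "restarted" instance sees the task again after rebuilding.
    print(DurableScheduler(conn).heap)
```

Losing the process loses only the heap; a new instance reconstructs it from the rows that were committed before any acknowledgement went out.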

One Cron Box (and Its Twin, Two Cron Boxes)

Very Common

A single server runs crontab for the whole platform; when it worries someone, a second identical box is added.

Why: Cron is right there, and the failure mode (that box dies, silence) takes months to manifest.

WRONG: One box: a reboot during the nightly billing run means billing silently did not happen. Two boxes: everything runs twice, and double-billing is worse than none. There is no third option without real design.

RIGHT: Leased ownership with fencing (one live owner per bucket, by construction), at-least-once firing with idempotent handlers, and takeover replay: the minimum machinery that makes 'run it once-ish, reliably' true.
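
One way to picture leased ownership with fencing is sketched below. LeaseTable and FencedSink are hypothetical names, the 30-second TTL is an arbitrary assumption, and both classes are in-memory stand-ins for what would be a coordination service and a downstream queue.

```python
class LeaseTable:
    """Grants one live owner per bucket and a monotonically
    increasing fencing token with every grant."""

    def __init__(self):
        self._leases = {}      # bucket -> (owner, token, expires_at)
        self._next_token = 0

    def acquire(self, bucket: str, owner: str, now: float, ttl: float = 30.0):
        current = self._leases.get(bucket)
        if current and current[0] != owner and current[2] > now:
            return None  # someone else holds a live lease
        self._next_token += 1
        self._leases[bucket] = (owner, self._next_token, now + ttl)
        return self._next_token


class FencedSink:
    """Downstream that rejects firings carrying a stale token."""

    def __init__(self):
        self.highest = {}   # bucket -> highest token seen so far
        self.accepted = []

    def fire(self, bucket: str, token: int, task_id: str) -> bool:
        if token < self.highest.get(bucket, 0):
            return False  # an old owner woke up late; drop its write
        self.highest[bucket] = token
        self.accepted.append(task_id)
        return True


if __name__ == "__main__":
    table, sink = LeaseTable(), FencedSink()
    old = table.acquire("bucket-9:00", "box-A", now=0.0)
    new = table.acquire("bucket-9:00", "box-B", now=31.0)  # A's lease expired
    print(sink.fire("bucket-9:00", new, "billing-run"))
    print(sink.fire("bucket-9:00", old, "billing-run"))    # stale token
```

The token is what separates this from "two cron boxes": a paused or partitioned former owner can still try to fire, but its writes are refused. Handlers stay idempotent because takeover replay can still deliver a task more than once.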

Tolerating Early Fires

Common

Treating firing time as approximately correct in both directions, so clock skew or eager prefetch fires tasks seconds early.

Why: Lateness gets all the attention; nobody writes a test for 'did not fire early', and a fast clock feels punctual to itself.

WRONG: A dispatcher with +3s skew fires payment retries before their backoff expires (hammering a struggling processor) and market-open tasks before the market opens. Nothing alerts, because everything looks on time to the skewed instance.

RIGHT: Never-early is an invariant: slew-only NTP, per-instance skew fencing at 100ms, handoffs carrying high-water timestamps, and an early-fire counter that must read zero forever: any tick is an incident.
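
The never-early invariant can be sketched as a guard in front of every firing. EarlyFireGuard is a hypothetical name; measured_skew is assumed to come from the instance's clock-sync daemon, and reference_now from a trusted reference clock used only for auditing. The high-water handoff is not modeled here.

```python
MAX_SKEW_SECONDS = 0.100  # per-instance skew fence from the text


class EarlyFireGuard:
    def __init__(self):
        self.early_fire_count = 0  # must stay at zero in production

    def may_fire(self, fire_at: float, local_now: float,
                 measured_skew: float) -> bool:
        """Allow a firing only if this instance's clock is trusted
        and the task's fire time has actually arrived."""
        if abs(measured_skew) > MAX_SKEW_SECONDS:
            return False  # instance fences itself until the clock recovers
        return local_now >= fire_at

    def record_fire(self, fire_at: float, reference_now: float) -> None:
        """Audit a firing against the reference clock."""
        if reference_now < fire_at:
            self.early_fire_count += 1  # any increment is an incident


if __name__ == "__main__":
    guard = EarlyFireGuard()
    # The +3s skewed dispatcher from the example is fenced outright.
    print(guard.may_fire(fire_at=100.0, local_now=103.0, measured_skew=3.0))
```

The guard is deliberately asymmetric: lateness is tolerated and measured elsewhere, while earliness is refused at the gate and counted if it ever slips through.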

Materializing the Whole Recurrence

Common

Expanding 'every weekday at 9am' into months or years of concrete task rows at creation time.

Why: Pre-expansion makes the read path uniform (everything is just a task) and hides the recurrence math in one place.

WRONG: A million crons × 260 weekday instances per year = 260M rows of pure liability; editing a schedule becomes a mass delete-and-recreate; and the horizon question (how far ahead?) has no right answer.

RIGHT: Template + exactly one materialized next instance, re-materialized at each firing. One pending row per cron forever; edits touch one template and one row; recurring tasks cost the same as one-shot ones.
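
A small sketch of "template plus one next instance" for the 'every weekday at 9am' example. next_weekday_fire, on_fire, and the pending dict are hypothetical; datetimes are naive, so time zones and DST are deliberately ignored.

```python
from datetime import datetime, time, timedelta


def next_weekday_fire(after: datetime, at: time = time(9, 0)) -> datetime:
    """Return the first weekday occurrence of `at` strictly after `after`."""
    candidate = datetime.combine(after.date(), at)
    if candidate <= after:
        candidate += timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return candidate


# Exactly one pending row per cron: cron_id -> next fire time.
pending = {}


def on_fire(cron_id: str, fired_at: datetime) -> None:
    """At each firing, overwrite the single row with the next instance."""
    pending[cron_id] = next_weekday_fire(fired_at)


if __name__ == "__main__":
    on_fire("standup-reminder", datetime(2024, 1, 5, 9, 0))  # a Friday
    print(pending)  # next instance lands on Monday 2024-01-08 09:00
```

With this shape, a million crons occupy a million pending rows instead of 260M, and editing a schedule means updating the template and recomputing one row.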

Synchronized, Unbudgeted Retries

Common

Failed tasks retry on a fixed short interval with no jitter, no per-destination budget, and no cap.

Why: Fixed intervals are easy to reason about, and retry behavior is only ever observed one task at a time in development.

WRONG: A webhook endpoint dies; 10,000 tasks fail together and retry together every 60 seconds: a synchronized battering ram that keeps the endpoint down and starves the worker fleet in pulses.

RIGHT: Exponential backoff with jitter desynchronizes the wave; per-destination circuit breakers convert a dead endpoint's retries to fast-fail reschedules; per-task attempt budgets end the story deterministically in the DLQ.
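
A sketch of the backoff and budget pieces, using the "full jitter" variant (delay drawn uniformly between zero and the exponential ceiling). The base, cap, and max_attempts values are assumptions chosen for illustration, and the per-destination circuit breaker is omitted.

```python
import random


def retry_delay(attempt: int, base: float = 1.0, cap: float = 3600.0,
                rng=random) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


def plan_retry(attempt: int, max_attempts: int = 8, rng=random):
    """Return ("retry", delay) while budget remains, else ("dlq", None)."""
    if attempt >= max_attempts:
        return ("dlq", None)
    return ("retry", retry_delay(attempt, rng=rng))


if __name__ == "__main__":
    # 10,000 tasks that failed together no longer retry together.
    delays = [retry_delay(attempt=5) for _ in range(10_000)]
    print(f"spread over {min(delays):.2f}s .. {max(delays):.2f}s")
```

Because each task draws its own delay, the retry wave flattens into a spread across the whole window, and the attempt budget guarantees every task eventually leaves the retry loop.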

The Dispatcher Executes the Task

Common

The scheduler process that determines what is due also runs the task's handler inline.

Why: It skips a queue and a worker fleet: for small handlers it looks like elegant simplification.

WRONG: One slow handler (a 30s report) blocks the dispatch loop; everything behind it in the wheel fires late. Task code crashes take the scheduler down with them: the platform's timekeeper dies of a customer's bug.

RIGHT: Fire = enqueue. The dispatcher's per-task cost stays at microseconds; execution isolation, scaling, and crash containment live in a separate worker fleet consuming the queue.
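
"Fire = enqueue" can be sketched with the standard library's queue and a worker thread. In a real deployment the queue would be a durable broker and the workers separate processes; dispatch_due, worker, and the handlers mapping are hypothetical names.

```python
import queue
import threading


def dispatch_due(due_tasks, work_queue: queue.Queue) -> None:
    """The dispatcher's whole job: hand each due task to the queue."""
    for task in due_tasks:
        work_queue.put(task)


def worker(work_queue: queue.Queue, handlers: dict, results: list) -> None:
    """Run handlers away from the dispatch loop; contain their crashes."""
    while True:
        task = work_queue.get()
        if task is None:  # sentinel: shut down
            work_queue.task_done()
            break
        try:
            results.append((task, handlers[task]()))
        except Exception as exc:
            # A handler bug is recorded, not propagated to the scheduler.
            results.append((task, exc))
        finally:
            work_queue.task_done()


if __name__ == "__main__":
    q, results = queue.Queue(), []
    handlers = {"report": lambda: "done", "buggy": lambda: 1 / 0}
    t = threading.Thread(target=worker, args=(q, handlers, results))
    t.start()
    dispatch_due(["buggy", "report"], q)  # returns immediately
    q.put(None)
    t.join()
    print(results)
```

The dispatcher never calls a handler, so a slow report or a crashing customer task costs worker capacity, not timekeeping.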

Blind Catch-Up After Downtime

Occasional

After an outage, firing the entire missed backlog immediately, in timestamp order, at full speed.

Why: Replaying history in order feels like correctness: the system is 'doing what it would have done'.

WRONG: A 2-hour outage ends and 80M missed firings blast the queue ahead of everything fresh: tasks due at the current top of the hour fire late, downstreams get a tsunami of stale work, and 90-minute-old cache warms execute pointlessly.

RIGHT: Interleaved catch-up: fresh tasks ride priority lanes untouched; backlog drains at a controlled rate; every replayed task re-checks its misfire policy (fire-all / fire-once / skip) because for many, the moment has simply passed.
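
A rough sketch of interleaved catch-up follows. next_batch and apply_misfire_policy are hypothetical names; treating "fire-once" as "fire only the most recent missed occurrence" is an assumption, since the text names the policies without defining them.

```python
from collections import deque


def apply_misfire_policy(missed: list, policy: str) -> list:
    """Decide which missed occurrences of one task still get replayed."""
    if policy == "fire-all":
        return list(missed)
    if policy == "fire-once":
        return list(missed[-1:])  # assumption: keep only the latest one
    if policy == "skip":
        return []
    raise ValueError(f"unknown misfire policy: {policy}")


def next_batch(fresh: deque, backlog: deque, backlog_rate: int) -> list:
    """One dispatch tick: every fresh task first, then at most
    `backlog_rate` replayed tasks from the backlog."""
    batch = list(fresh)
    fresh.clear()
    for _ in range(min(backlog_rate, len(backlog))):
        batch.append(backlog.popleft())
    return batch


if __name__ == "__main__":
    missed_warms = ["warm@09:00", "warm@09:30", "warm@10:00"]
    backlog = deque(apply_misfire_policy(missed_warms, "fire-once"))
    fresh = deque(["invoice@11:00"])
    print(next_batch(fresh, backlog, backlog_rate=100))
```

Fresh work is never queued behind history, and the backlog_rate knob bounds how much stale load downstreams absorb per tick.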