
Task Scheduler System Design Walkthrough

Complete design walkthrough with animated diagrams, capacity math, API design, schema, and failure modes.

Solution Path (target: 28 min)

We designed a distributed task scheduler holding 10 billion pending tasks (2 TB) and firing 1 billion per day: 11.6K/sec on average, with 500K/sec pulses at midnight because humans schedule at :00. The key insight is that no fire_at index exists anywhere. Time-bucket partitioning converts the 10B-row time query into a bounded bucket read, served from O(1) timing wheels under leased ownership with fencing. Firing is at-least-once (checkpoint after enqueue, with idempotency keys of hash(task, scheduled_time)); never-early is a clock invariant enforced with 100ms skew fencing. Crons materialize one next instance; retries reschedule through the system itself; the herd gets default-on jitter, calendar-based pre-scaling, and priority lanes.
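The headline numbers can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: the inputs come from the summary above, the variable names are illustrative, and the 2 TB figure is assumed to mean decimal terabytes.

```python
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400

pending_tasks = 10_000_000_000      # 10B pending tasks (from the text)
pending_bytes = 2 * 10**12          # 2 TB, assumed to be decimal terabytes
firings_per_day = 1_000_000_000     # 1B firings per day (from the text)
peak_per_sec = 500_000              # midnight pulse (from the text)

avg_per_sec = firings_per_day / SECONDS_PER_DAY   # average firing rate
bytes_per_task = pending_bytes / pending_tasks    # implied row size
peak_to_avg = peak_per_sec / avg_per_sec          # how violent the pulse is

print(f"average firing rate: {avg_per_sec:,.0f}/sec")
print(f"bytes per pending task: {bytes_per_task:.0f}")
print(f"peak-to-average ratio: {peak_to_avg:.1f}x")
```

The average works out to about 11.6K/sec, the implied budget is about 200 bytes per pending task, and the midnight pulse is roughly 43 times the average, which sits inside the 10-50x range quoted for top-of-the-minute pulses.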


What is a Task Scheduler?

Every product you use is quietly full of appointments with the future: the reminder Slack owes you Tuesday at 9, the payment retry due in 90 seconds, the report that must exist by the 1st, the cache that refreshes hourly, the escalation that fires if nobody acks the page in 15 minutes. Something has to remember all of it and act at the right moment. That something is the distributed task scheduler, the platform's alarm clock.

The naive versions fail in instructive ways: sleep() in an app server vaporizes every pending timer on the next deploy; cron on one box turns a reboot into silently missed billing; cron on two boxes runs billing twice. The gap between those failures and a real system is exactly this topic. Three hard problems live under the humble API. The first is finding what fires now, among 10 billion pending tasks, without scanning them; it is solved by making time itself the partition key (buckets) and serving the final seconds from timing wheels. The second is crash-consistent firing: a dispatcher that dies mid-bucket must neither lose its remaining tasks nor blindly double-run the ones it already fired. It is solved with leases, checkpoints, and at-least-once firing against idempotent handlers, which is the delivery-semantics lesson from notifications, wearing a clock. The third is the workload's shape: humans schedule at :00, so the system's peak is not a surprise but a calendar entry, with 10-50x pulses at the top of every minute and hour and 500K firings/sec at midnight UTC. That demands jitter by default, clock-based pre-scaling, and priority lanes.
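To make the first problem concrete, here is a minimal in-memory sketch of time-bucket partitioning feeding a timing wheel. Everything in it is an assumption made for illustration: the 60-second bucket width, the one-second slot granularity, and the names BucketStore and TimingWheel are hypothetical, and a real store would be durable and sharded rather than a dictionary.

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # hypothetical bucket width; the text does not fix one


def bucket_id(fire_at: int) -> int:
    """Map an epoch-second fire time to its time bucket (the partition key)."""
    return fire_at // BUCKET_SECONDS


class BucketStore:
    """In-memory stand-in for durable storage partitioned by time bucket."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def schedule(self, task_id: str, fire_at: int) -> None:
        self._buckets[bucket_id(fire_at)].append((task_id, fire_at))

    def read_bucket(self, bucket: int):
        # A bounded read of one partition; no scan over all pending tasks.
        return list(self._buckets.get(bucket, []))


class TimingWheel:
    """One slot per second of a single bucket; a tick is one slot lookup."""

    def __init__(self, bucket: int, tasks):
        self.start = bucket * BUCKET_SECONDS
        self.slots = [[] for _ in range(BUCKET_SECONDS)]
        for task_id, fire_at in tasks:
            self.slots[fire_at - self.start].append(task_id)

    def tick(self, now: int):
        """Return and clear the tasks scheduled for exactly second `now`."""
        idx = now - self.start
        if not 0 <= idx < BUCKET_SECONDS:
            return []
        due, self.slots[idx] = self.slots[idx], []
        return due


store = BucketStore()
store.schedule("a", 125)
store.schedule("b", 125)
store.schedule("c", 190)            # lands in the next bucket

wheel = TimingWheel(2, store.read_bucket(2))   # bucket 2 covers seconds 120..179
early = wheel.tick(124)             # nothing is due yet
due = wheel.tick(125)               # both tasks scheduled for second 125
again = wheel.tick(125)             # slot was cleared, so nothing fires twice here
```

The point of the sketch is the shape of the read path: the dispatcher asks for one bucket by key, loads it into slots, and then each tick touches a single slot, so the cost of finding what fires now does not depend on how many tasks are pending overall.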

The scope boundary that sharpens the design: this is tasks, not workflows. The scheduler handles single-shot, time-triggered execution with tiny terminal state, while DAGs, joins, and human approvals belong to the Temporal/Airflow layer that consumes this system's timers as its primitives. The one-line contract to open an interview with: never early, at most briefly late, never silently lost. Everything in the next forty minutes exists to keep those three promises.

The platform's alarm clock: reminders, retries, reports, escalations. Naive versions fail instructively (sleep() = deploy amnesia; 1 cron box = silence; 2 = double billing). Three problems: find what fires (buckets), crash-consistent firing (leases + at-least-once), the :00 workload (500K/sec calendar pulses). Contract: never early, briefly late, never silently lost.

A distributed task scheduler is the platform's alarm clock: services hand it work with a timestamp (send this reminder Tuesday 9am, retry this payment in 90 seconds, run billing on the 1st) and it fires each task at its moment, once-ish, reliably, at any scale. The contract sounds humble and hides three hard problems: finding what fires now among 10 billion pending tasks without scanning them (time-bucket partitioning + timing wheels), surviving crashes mid-fire without losing or silently duplicating work (leases, checkpoints, idempotency), and absorbing the workload's built-in violence. Humans schedule at :00, so the top of every minute is a 10-50x pulse and midnight UTC is a 500K/sec instant.
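The crash-survival half of that contract can be sketched as follows. This is a simplified, single-process model under stated assumptions, not the walkthrough's implementation: CheckpointStore, Handler, and dispatch are hypothetical names, the fencing token is a plain integer that grows with each new lease, and "enqueue" is modeled as a direct call into an idempotent handler. The idempotency key follows the hash(task, scheduled_time) idea from the summary, using SHA-256 as an arbitrary choice of hash.

```python
import hashlib


def idempotency_key(task_id: str, scheduled_time: int) -> str:
    """Same task and scheduled time always produce the same key."""
    return hashlib.sha256(f"{task_id}:{scheduled_time}".encode()).hexdigest()


class CheckpointStore:
    """Accepts checkpoint writes only from the newest fencing token seen."""

    def __init__(self):
        self.highest_token = 0
        self.checkpoint = -1          # index of the last task known to be enqueued

    def write(self, token: int, index: int) -> bool:
        if token < self.highest_token:
            return False              # stale lease holder is fenced off
        self.highest_token = token
        self.checkpoint = index
        return True


class Handler:
    """Idempotent consumer: a repeated key is acknowledged but not re-run."""

    def __init__(self):
        self.seen = set()
        self.runs = []

    def handle(self, key: str, task_id: str) -> None:
        if key in self.seen:
            return
        self.seen.add(key)
        self.runs.append(task_id)


def dispatch(tasks, store, token, handler, crash_before_checkpoint_at=None):
    """Fire every task after the checkpoint; checkpoint only after enqueue."""
    for i in range(store.checkpoint + 1, len(tasks)):
        task_id, scheduled_time = tasks[i]
        handler.handle(idempotency_key(task_id, scheduled_time), task_id)
        if i == crash_before_checkpoint_at:
            return False              # simulated crash: fired but not checkpointed
        if not store.write(token, i):
            return False              # lost the lease; stop immediately
    return True


tasks = [("t1", 100), ("t2", 100), ("t3", 101)]
store, handler = CheckpointStore(), Handler()

first = dispatch(tasks, store, token=1, handler=handler, crash_before_checkpoint_at=1)
checkpoint_after_crash = store.checkpoint
second = dispatch(tasks, store, token=2, handler=handler)   # new lease holder resumes
stale_write = store.write(1, 2)                             # old holder wakes up late
```

Because the checkpoint is written after the enqueue, the crash leaves t2 fired but unrecorded, so the new owner fires it again; the handler's key check turns that duplicate into a no-op. Checkpointing before the enqueue would flip the failure mode to silent loss, which the contract forbids.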

Scale: 10B pending tasks (2 TB), 1B firings/day = 11.6K/sec average with 500K/sec midnight pulses
The key move: time-bucket partitioning, so the 10B-row time query becomes a bounded bucket read into a timing wheel
Guarantees: at-least-once firing + idempotent handlers; never early (a clock invariant), late within a budgeted SLO
Recurrence: cron templates materialize one next instance, giving infinite schedules with finite rows
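The recurrence rule can be illustrated with a short sketch. It swaps real cron expressions for a fixed interval to stay small, so it shows the "one next instance" bookkeeping rather than cron parsing; Template, next_instance, and fire_and_rematerialize are hypothetical names, and the pending dictionary stands in for the task store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    template_id: str
    interval_seconds: int   # stand-in for a cron expression (assumption)
    anchor: int             # first scheduled time, in epoch seconds


def next_instance(t: Template, after: int) -> int:
    """Smallest scheduled time strictly greater than `after`."""
    if after < t.anchor:
        return t.anchor
    elapsed = after - t.anchor
    return t.anchor + (elapsed // t.interval_seconds + 1) * t.interval_seconds


def fire_and_rematerialize(t: Template, pending: dict, scheduled_time: int) -> None:
    """When an instance fires, replace it with exactly one successor."""
    pending.pop(t.template_id, None)
    pending[t.template_id] = next_instance(t, scheduled_time)


hourly = Template("hourly-refresh", interval_seconds=3600, anchor=0)
pending = {hourly.template_id: hourly.anchor}

fire_and_rematerialize(hourly, pending, 0)       # the 0 instance fired
after_first = pending[hourly.template_id]
fire_and_rematerialize(hourly, pending, 3600)    # the 3600 instance fired
after_second = pending[hourly.template_id]
```

However long the schedule runs, the template contributes one pending row at a time. Computing the successor from the scheduled time rather than the actual firing time keeps a late firing from drifting the whole schedule.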