Distributed Task Scheduler
COMMON
Task scheduler design is asked at Google, Amazon, Uber, and Airbnb (who built Airflow) because it hides three hard problems under a humble API: run this at 9am. It is the platform alarm clock behind Slack reminders, payment retries, and every cron job that outgrew one box. You will find what fires now among 10 billion pending tasks without scanning them (time-bucket partitioning into timing wheels), survive dispatcher crashes without losing or double-running work (leases, checkpoints, at-least-once), and absorb the workload's built-in violence: humans schedule at :00, so midnight UTC is a 500K/sec pulse that recurs forever.
- Kill the database poll: time-bucket partitioning turns a 10B-row time query into a bounded bucket read
- Fire at-least-once across crashes: leases with fencing, checkpoint-after-enqueue, idempotent handlers
- Design for the calendar, not the mean: 500K/sec midnight pulses, default-on jitter, clock-based pre-scaling
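To make the first bullet concrete, here is a minimal sketch of time-bucket partitioning. It assumes a 5-minute bucket width and uses an in-memory dictionary as a stand-in for a partitioned store; `BucketStore`, `bucket_key`, and `load_bucket` are hypothetical names for illustration, not part of any real system.

```python
from collections import defaultdict
from datetime import datetime, timezone

BUCKET_SECONDS = 5 * 60  # assumed bucket width: 5 minutes


def bucket_key(fire_at: datetime) -> int:
    """Return the start of the bucket containing fire_at, in epoch seconds."""
    epoch = int(fire_at.timestamp())
    return epoch - (epoch % BUCKET_SECONDS)


class BucketStore:
    """Hypothetical in-memory stand-in for a store partitioned by bucket key."""

    def __init__(self) -> None:
        self._partitions: dict[int, list[tuple[datetime, str]]] = defaultdict(list)

    def schedule(self, task_id: str, fire_at: datetime) -> None:
        # The write path picks the partition from the fire time itself.
        self._partitions[bucket_key(fire_at)].append((fire_at, task_id))

    def load_bucket(self, now: datetime) -> list[tuple[datetime, str]]:
        # The read path touches one partition; no scan over all pending tasks.
        return sorted(self._partitions.get(bucket_key(now), []))


store = BucketStore()
store.schedule("a", datetime(2024, 1, 1, 9, 2, 0, tzinfo=timezone.utc))
store.schedule("b", datetime(2024, 1, 1, 9, 4, 59, tzinfo=timezone.utc))
store.schedule("c", datetime(2024, 1, 1, 9, 5, 0, tzinfo=timezone.utc))
current = store.load_bucket(datetime(2024, 1, 1, 9, 0, 0, tzinfo=timezone.utc))
```

In this sketch the cost of `load_bucket` depends on how many tasks share the bucket, not on the total number of pending tasks, which is the property the bullet describes.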
Visual Solutions
Step-by-step animated walkthroughs with capacity estimation, API design, database schema, and failure modes built in.
Cheat Sheet
Key concepts, trade-offs, and quick-reference notes for Task Scheduler. Everything you need at a glance.
Anti-Patterns
Common design mistakes candidates make. Wrong approaches vs correct approaches for each trap.
Failure Modes
What breaks in production, how to detect it, and how to fix it. Detection metrics, mitigations, and severity ratings.
Start simple. Build to staff-level.
“I would design a task scheduler holding 10 billion pending tasks in 2 TB and firing 1 billion per day: 11.6K/sec average with 500K/sec midnight pulses, because humans schedule everything at :00. The key move is that no fire_at index exists anywhere. Time-bucket partitioning writes each task into the 5-minute bucket containing its fire time, so finding what fires is a bounded partition read, served from O(1) timing wheels under leased ownership with fencing. Firing is at-least-once: checkpoint after enqueue, with idempotency keys of hash(task, scheduled_time). Never-early is a clock invariant enforced by 100ms skew fencing. Crons materialize exactly one next instance, retries reschedule through the system itself, and the herd gets default-on jitter, calendar-based pre-scaling, and priority lanes.”
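The at-least-once part of that pitch can be sketched in a few lines. This is an illustration under stated assumptions, not a definitive implementation: `Dispatcher` and `Handler` are hypothetical classes, the queue is a plain list, the checkpoint is a single scheduled-time watermark, and SHA-256 stands in for the unspecified hash. The `crash_before_checkpoint` flag simulates a dispatcher dying after it enqueued work but before it recorded progress.

```python
import hashlib


def idempotency_key(task_id: str, scheduled_time: int) -> str:
    """hash(task, scheduled_time): identical for every re-fire of one instance."""
    return hashlib.sha256(f"{task_id}:{scheduled_time}".encode()).hexdigest()


class Dispatcher:
    """Hypothetical dispatcher that enqueues first and checkpoints second."""

    def __init__(self, queue: list) -> None:
        self.queue = queue
        self.checkpoint = -1  # latest scheduled_time known to be fully enqueued

    def fire(self, due: list[tuple[str, int]], crash_before_checkpoint: bool = False) -> None:
        for task_id, scheduled_time in due:
            if scheduled_time <= self.checkpoint:
                continue  # already covered by a durable checkpoint
            self.queue.append((idempotency_key(task_id, scheduled_time), task_id))
        if crash_before_checkpoint:
            return  # simulated crash: work is enqueued, progress is not recorded
        if due:
            self.checkpoint = max(self.checkpoint, max(t for _, t in due))


class Handler:
    """Hypothetical idempotent handler: a repeated key is acknowledged, not re-run."""

    def __init__(self) -> None:
        self.seen: set[str] = set()
        self.runs: list[str] = []

    def handle(self, key: str, task_id: str) -> bool:
        if key in self.seen:
            return False  # duplicate delivery, side effect skipped
        self.seen.add(key)
        self.runs.append(task_id)
        return True


queue: list = []
dispatcher = Dispatcher(queue)
due = [("invoice-42", 1704099600)]
dispatcher.fire(due, crash_before_checkpoint=True)  # enqueued, then "crashed"
dispatcher.fire(due)                                # replay enqueues it again
handler = Handler()
results = [handler.handle(key, task_id) for key, task_id in queue]
```

Checkpointing after the enqueue means a crash can duplicate a task but cannot drop it; the idempotency key is what makes the duplicate harmless on the handler side.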
Time-Bucket Partitioning
TRICKY
Time is the partition key: the 10B-row query becomes a bounded bucket read
Core Scheduler Design
Timing Wheels
STANDARD
One slot per second, O(1) insert, tick-and-fire: derived, not name-dropped
Core Scheduler Design
At-Least-Once Firing
TRICKY
Checkpoint after enqueue; key = hash(task, scheduled_time); fencing before expiry
High Level System Design
Cron Materialization
STANDARD
Templates are factories: exactly one next instance, re-materialized on fire
Database Schema
The Top-of-Minute Herd
TRICKY
Humans schedule at :00: 10-50x pulses; jitter by default, scale on the clock
Scale and Numbers
Retries and Dead Letters
STANDARD
Retry = reschedule through yourself; crash-count -> DLQ that alerts on arrival
Replication and Fault Tolerance
Late vs Early
STANDARD
Late is a budgeted SLO; early is a bug: clock fencing makes the invariant hold
Monitoring and Complete System
Tasks vs Workflows
EASY
Tiny terminal state vs living DAG state: Temporal is a customer, not a competitor
What is a Task Scheduler
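The Timing Wheels card above compresses the data structure into one line: one slot per second, O(1) insert, tick-and-fire. A minimal single-level sketch follows, assuming the wheel covers one 5-minute bucket (300 slots) and that times are integer epoch seconds. `TimingWheel` and its method names are hypothetical, and leases, fencing, and multi-level wheels are left out.

```python
class TimingWheel:
    """Single-level wheel: one slot per second over a fixed horizon."""

    def __init__(self, horizon_seconds: int, start: int) -> None:
        self.horizon = horizon_seconds
        self.slots: list[list[str]] = [[] for _ in range(horizon_seconds)]
        self.now = start  # the next second that tick() will fire

    def insert(self, task_id: str, fire_at: int) -> None:
        """O(1): the slot index is the fire time modulo the wheel size."""
        if not self.now <= fire_at < self.now + self.horizon:
            raise ValueError("fire_at is outside the wheel's horizon")
        self.slots[fire_at % self.horizon].append(task_id)

    def tick(self) -> list[str]:
        """Fire the current second's slot, empty it, and advance by one second."""
        index = self.now % self.horizon
        due, self.slots[index] = self.slots[index], []
        self.now += 1
        return due


wheel = TimingWheel(horizon_seconds=300, start=1000)
wheel.insert("a", 1000)
wheel.insert("b", 1002)
wheel.insert("c", 1002)
fired = [wheel.tick(), wheel.tick(), wheel.tick()]
```

Rejecting inserts outside the horizon keeps each slot unambiguous: a slot only ever holds tasks for one specific second, so a tick never needs to compare timestamps.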