
Distributed Task Scheduler


Task scheduler design is asked at Google, Amazon, Uber, and Airbnb (which built Airflow) because it hides three hard problems under a humble API: "run this at 9am." It is the platform alarm clock behind Slack reminders, payment retries, and every cron job that outgrew one box. You have to find what fires now among 10 billion pending tasks without scanning them (time-bucket partitioning into timing wheels), survive dispatcher crashes without losing work or letting a duplicate delivery take effect twice (leases, checkpoints, at-least-once delivery with idempotent handlers), and absorb the workload's built-in violence: humans schedule at :00, so midnight UTC is a 500K/sec pulse that recurs forever.

  • Kill the database poll: time-bucket partitioning turns a 10B-row time query into a bounded bucket read
  • Fire at-least-once across crashes: leases with fencing, checkpoint-after-enqueue, idempotent handlers
  • Design for the calendar, not the mean: 500K/sec midnight pulses, default-on jitter, clock-based pre-scaling
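
The first bullet can be sketched in a few lines. This is a minimal illustration, not the production design: `BucketStore` is a hypothetical in-memory stand-in for a partitioned store, the names `bucket_start`, `schedule`, and `load_bucket` are assumptions, and the 5-minute width is taken from the elevator pitch. The point it shows is that the partition key is computed from the fire time, so "what fires now" is a lookup of one partition rather than a time-range query over every row.

```python
from collections import defaultdict
from datetime import datetime, timezone

BUCKET_SECONDS = 300  # 5-minute buckets, as in the elevator pitch


def bucket_start(fire_at_epoch: int, width: int = BUCKET_SECONDS) -> int:
    """Floor a fire time (epoch seconds) to the start of its bucket."""
    return fire_at_epoch - (fire_at_epoch % width)


class BucketStore:
    """Hypothetical in-memory stand-in for a store partitioned by bucket."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def schedule(self, task_id: str, fire_at_epoch: int) -> int:
        # The write path picks the partition; no secondary time index is kept.
        key = bucket_start(fire_at_epoch)
        self._partitions[key].append((fire_at_epoch, task_id))
        return key

    def load_bucket(self, now_epoch: int):
        """Read only the partition covering `now_epoch`, sorted by fire time."""
        return sorted(self._partitions.get(bucket_start(now_epoch), []))


store = BucketStore()
nine_am = int(datetime(2030, 1, 1, 9, 0, tzinfo=timezone.utc).timestamp())
store.schedule("reminder-1", nine_am + 42)    # lands in the 09:00 bucket
store.schedule("reminder-2", nine_am + 600)   # lands in the 09:10 bucket
due = store.load_bucket(nine_am)              # reads the 09:00 bucket only
```

In a real system the loaded bucket would presumably be handed to a timing wheel owned by one leased dispatcher; the sketch stops at the bounded read.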
Google, Amazon, Uber, Airbnb, Netflix, Slack
Elevator Pitch: 3-minute interview summary

I would design a task scheduler holding 10 billion pending tasks in 2 TB (about 200 bytes per task) and firing 1 billion per day: 11.6K/sec on average, with 500K/sec midnight pulses because humans schedule everything at :00. The key move is that no fire_at index exists anywhere. Time-bucket partitioning writes each task into the 5-minute bucket containing its fire time, so finding what fires is a bounded partition read, served from O(1) timing wheels under leased ownership with fencing. Firing is at-least-once, with checkpoint after enqueue and idempotency keys of hash(task, scheduled_time), and never-early is a clock invariant enforced by 100ms skew fencing. Crons materialize exactly one next instance, retries reschedule through the system itself, and the herd gets default-on jitter, calendar-based pre-scaling, and priority lanes.
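
The at-least-once claim in the pitch can be illustrated with a small simulation. It is a sketch under stated assumptions: SHA-256 stands in for the unspecified `hash`, and `dispatch`, `Handler`, and the in-memory `queue` are hypothetical names, not part of any real scheduler. The dispatcher enqueues first and checkpoints second, so a crash between the two steps causes redelivery, and the handler uses the idempotency key to make the duplicate harmless.

```python
import hashlib


def idempotency_key(task_id: str, scheduled_time: int) -> str:
    """hash(task, scheduled_time); SHA-256 is an assumed choice of hash."""
    return hashlib.sha256(f"{task_id}:{scheduled_time}".encode()).hexdigest()


class Handler:
    """Hypothetical worker that ignores deliveries it has already processed."""

    def __init__(self):
        self.seen = set()
        self.executions = []

    def handle(self, task_id: str, scheduled_time: int) -> bool:
        key = idempotency_key(task_id, scheduled_time)
        if key in self.seen:
            return False  # duplicate delivery, no side effect
        self.seen.add(key)
        self.executions.append((task_id, scheduled_time))
        return True


def dispatch(tasks, checkpoint, queue, crash_before_checkpoint=False):
    """Enqueue everything past the checkpoint, then advance the checkpoint."""
    for task in tasks[checkpoint:]:
        queue.append(task)
    if crash_before_checkpoint:
        return checkpoint  # simulated crash: progress was never persisted
    return len(tasks)


tasks = [("invoice-7", 1_900_000_000), ("invoice-8", 1_900_000_000)]
queue, handler = [], Handler()
cp = dispatch(tasks, 0, queue, crash_before_checkpoint=True)  # enqueued, then crashed
cp = dispatch(tasks, cp, queue)                               # restart re-enqueues both
results = [handler.handle(*t) for t in queue]
```

Because the key includes the scheduled time, two instances of the same cron task get different keys and both run, while a redelivery of one instance is dropped. Checkpointing before enqueue would reverse the failure mode: a crash would lose tasks instead of duplicating them.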

Concepts Unlocked: 8 concepts in this topic