Distributed Task Scheduler

Task scheduler design is asked at Google, Amazon, Uber, and Airbnb (which built Airflow) because it hides three hard problems under a humble API: run this at 9am. It is the platform alarm clock behind Slack reminders, payment retries, and every cron job that outgrew one box. You must find what fires now among 10 billion pending tasks without scanning them (time-bucket partitioning feeding timing wheels), survive dispatcher crashes without losing work or letting duplicate runs cause harm (leases, checkpoints, at-least-once firing with idempotent handlers), and absorb the workload's built-in violence: humans schedule at :00, so midnight UTC is a 500K/sec pulse that recurs every day.

Kill the database poll: time-bucket partitioning turns a 10B-row time query into a bounded bucket read
Fire at-least-once across crashes: leases with fencing, checkpoint-after-enqueue, idempotent handlers
Design for the calendar, not the mean: 500K/sec midnight pulses, default-on jitter, clock-based pre-scaling
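
To make the first point concrete, here is a minimal sketch of time-bucket partitioning. It is an illustration under stated assumptions, not the design's actual storage layer: `BucketStore` is a hypothetical in-memory stand-in for a partitioned store, fire times are integer epoch seconds, and the 300-second width mirrors the 5-minute bucket mentioned in the elevator pitch.

```python
from collections import defaultdict

BUCKET_SECONDS = 300  # assumed bucket width: 5 minutes


def bucket_id(fire_at: int) -> int:
    """Map an epoch-seconds fire time to the bucket that contains it."""
    return fire_at // BUCKET_SECONDS


class BucketStore:
    """Hypothetical in-memory stand-in for a store partitioned by bucket id."""

    def __init__(self) -> None:
        # bucket id -> list of (fire_at, task_id) pairs
        self._buckets = defaultdict(list)

    def schedule(self, task_id: str, fire_at: int) -> None:
        """Write the task into the single bucket containing its fire time."""
        self._buckets[bucket_id(fire_at)].append((fire_at, task_id))

    def due(self, now: int) -> list:
        """Read only the bucket containing `now`; no other bucket is touched."""
        bucket = self._buckets.get(bucket_id(now), [])
        return [task_id for fire_at, task_id in sorted(bucket) if fire_at <= now]


store = BucketStore()
store.schedule("reminder-a", 600)
store.schedule("reminder-b", 899)
store.schedule("reminder-c", 900)  # lands in the next bucket
ready = store.due(700)             # reads one bucket, returns only "reminder-a"
```

The cost of `due` depends on the size of one bucket, not on the total number of pending tasks. This sketch deliberately omits what a real dispatcher would also need: sweeping buckets in order with a checkpoint so that unfired tasks in an earlier bucket are not skipped, and removing tasks once they fire.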

Google, Amazon, Uber, Airbnb, Netflix, Slack

Visual Solutions

Step-by-step animated walkthroughs with capacity estimation, API design, database schema, and failure modes built in.

Cheat Sheet

Key concepts, trade-offs, and quick-reference notes for Task Scheduler. Everything you need at a glance.

Anti-Patterns

Common design mistakes candidates make. Wrong approaches vs correct approaches for each trap.

8 anti-patterns

Failure Modes

What breaks in production, how to detect it, and how to fix it. Detection metrics, mitigations, and severity ratings.

5 failure modes

Difficulty Ladder

Start simple. Build to staff-level.

Level 1

Junior / Basics

Core concepts, single-service design, straightforward requirements

Level 2

Mid-Level Interview

Multi-service architecture, trade-off discussions, standard scaling

Level 3

Senior / Deep Dive

Complex distributed systems, failure modes, consistency guarantees

Level 4

Staff+ / FAANG Hard

Planet-scale design, novel architectures, cross-cutting concerns

Elevator Pitch

3-minute interview summary

“I would design a task scheduler holding 10 billion pending tasks in 2 TB and firing 1 billion per day: 11.6K/sec on average, with 500K/sec midnight pulses because humans schedule everything at :00. The key move is that no fire_at index exists anywhere. Time-bucket partitioning writes each task into the 5-minute bucket containing its fire time, so finding what fires is a bounded partition read, served from O(1) timing wheels under leased ownership with fencing. Firing is at-least-once: checkpoint after enqueue, with idempotency keys of hash(task, scheduled_time). Never-early is a clock invariant enforced by 100ms skew fencing. Crons materialize exactly one next instance, retries reschedule through the system itself, and the herd gets default-on jitter, calendar-based pre-scaling, and priority lanes.”
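
The numbers and the idempotency key in the pitch can be checked with a short sketch. Everything named here is an assumption for illustration: 2 TB is read as decimal terabytes, SHA-256 stands in for the unspecified `hash`, and `DedupingHandler` is a hypothetical consumer that keeps seen keys in memory, where a real one would use a durable store.

```python
import hashlib

SECONDS_PER_DAY = 86_400

fires_per_day = 1_000_000_000
pending_tasks = 10_000_000_000
storage_bytes = 2 * 10**12   # 2 TB, assuming decimal terabytes
peak_per_sec = 500_000

avg_per_sec = fires_per_day / SECONDS_PER_DAY    # about 11,574/sec, i.e. ~11.6K
bytes_per_task = storage_bytes / pending_tasks   # 200 bytes per pending task
peak_to_mean = peak_per_sec / avg_per_sec        # about 43x the average rate


def idempotency_key(task_id: str, scheduled_time: int) -> str:
    """Derive a stable key from the task and its scheduled (not actual) fire time.

    SHA-256 is an assumption; the pitch only says hash(task, scheduled_time).
    """
    return hashlib.sha256(f"{task_id}:{scheduled_time}".encode()).hexdigest()


class DedupingHandler:
    """Hypothetical handler that makes at-least-once delivery safe to repeat."""

    def __init__(self) -> None:
        self.seen = set()  # in-memory here; a durable store in practice
        self.runs = 0

    def handle(self, task_id: str, scheduled_time: int) -> bool:
        """Run the task once per key; return False for a duplicate delivery."""
        key = idempotency_key(task_id, scheduled_time)
        if key in self.seen:
            return False
        self.seen.add(key)
        self.runs += 1
        return True


handler = DedupingHandler()
first = handler.handle("pay-retry-7", 1_700_000_000)
second = handler.handle("pay-retry-7", 1_700_000_000)  # redelivery after a crash
```

Keying on the scheduled time rather than the delivery time is what lets a redelivered instance collapse onto the original, while the next instance of the same cron task gets a different key.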

Concepts Unlocked

8 concepts in this topic