Notification System
VERY COMMON
Notification system design is asked at Google, Apple, Amazon, and Meta because it tests queueing, third-party integration, and product judgment in one problem. It is how platforms deliver 10 billion pushes, emails, and SMS per day without an OTP ever waiting behind a marketing campaign. You will design physically isolated priority tiers that keep OTPs under 5 seconds during a 100M-recipient campaign burst, an at-least-once pipeline with deterministic idempotency keys, and provider gateways sized by concurrency that survive an FCM outage.
- Design priority tier isolation that keeps OTPs under 5s during a 100M-recipient campaign
- Build at-least-once delivery with deterministic idempotency keys that survives consumer crashes
- Size a provider gateway fleet by concurrency (580K/sec x 50ms = 29K in flight) with circuit breakers
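The concurrency figure in the last objective is Little's law applied to the gateway: requests in flight equal the send rate multiplied by the time each provider call takes. A minimal sketch of that arithmetic follows. The rate and latency come from the text; the per-server concurrency of 290 is an assumption chosen only to illustrate a fleet of about 100 servers, not a measured limit.

```python
import math

def in_flight_requests(sends_per_sec: int, latency_ms: int) -> float:
    """Little's law: average concurrency = arrival rate x time in system."""
    return sends_per_sec * latency_ms / 1000

def servers_needed(in_flight: float, concurrent_per_server: int) -> int:
    """Round up, since a fraction of a server still means one more machine."""
    return math.ceil(in_flight / concurrent_per_server)

PEAK_SENDS_PER_SEC = 580_000   # from the text
PROVIDER_LATENCY_MS = 50       # from the text
CONCURRENT_PER_SERVER = 290    # ASSUMPTION: illustrative per-server capacity

concurrency = in_flight_requests(PEAK_SENDS_PER_SEC, PROVIDER_LATENCY_MS)
fleet = servers_needed(concurrency, CONCURRENT_PER_SERVER)
print(f"{concurrency:.0f} requests in flight, {fleet} gateway servers")
```

Sizing by concurrency rather than by throughput matters because provider latency, not CPU, sets how many connections must be open at once: if latency doubles during a provider slowdown, the in-flight count doubles at the same send rate.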
Visual Solutions
Step-by-step animated walkthroughs with capacity estimation, API design, database schema, and failure modes built in.
Cheat Sheet
Key concepts, trade-offs, and quick-reference notes for Notifications. Everything you need at a glance.
Anti-Patterns
Common design mistakes candidates make. Wrong approaches vs correct approaches for each trap.
Failure Modes
What breaks in production, how to detect it, and how to fix it. Detection metrics, mitigations, and severity ratings.
Start simple. Build to staff-level.
“I would design a notification platform delivering 10 billion sends per day across 1 billion device tokens on push, email, SMS, and in-app channels. Three physically isolated priority tiers guarantee OTPs land in under 5 seconds even while a 100M-recipient campaign bursts at 167K sends per second through two-stage chunked fanout. Delivery is at-least-once with deterministic idempotency keys in Redis, so a message replayed after a consumer crash is dropped instead of reaching the user twice. Coalescing windows and collapse keys cut engagement volume 60%, and per-user budgets cap marketing at 2 pushes per day. Provider gateways, sized by concurrency (580K/sec x 50ms = 29K in flight, about 100 servers), terminate at APNs and FCM behind circuit breakers with channel failover for P0.”
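The idempotency claim in the pitch can be sketched as follows. The key is derived only from fields that are identical on every replay, and a set-if-absent operation decides whether this attempt is the first. Everything named here is hypothetical: `InMemoryDedupStore` stands in for Redis so the example runs without a server, and the key fields, the `idem:` prefix, and the 24-hour TTL are assumptions. With the redis-py client, the equivalent claim would be `r.set(key, 1, nx=True, ex=ttl_sec)`.

```python
import hashlib
import time

class InMemoryDedupStore:
    """Hypothetical stand-in for Redis; mimics SET key value NX EX ttl."""

    def __init__(self):
        self._expiry = {}

    def set_if_absent(self, key: str, ttl_sec: int) -> bool:
        now = time.time()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False          # key already claimed and not yet expired
        self._expiry[key] = now + ttl_sec
        return True

def idempotency_key(event_id: str, recipient_id: str, channel: str) -> str:
    """Deterministic: the same logical send always hashes to the same key."""
    raw = f"{event_id}:{recipient_id}:{channel}".encode()
    return "idem:" + hashlib.sha256(raw).hexdigest()

def deliver_once(store, send, event_id, recipient_id, channel, ttl_sec=86_400):
    """Send only if this (event, recipient, channel) has not been claimed."""
    key = idempotency_key(event_id, recipient_id, channel)
    if not store.set_if_absent(key, ttl_sec):
        return False              # replay after a crash or redelivery: skip
    send(recipient_id, channel)
    return True

sent = []
store = InMemoryDedupStore()
record_send = lambda recipient, channel: sent.append((recipient, channel))

first = deliver_once(store, record_send, "evt-42", "user-7", "push")
replay = deliver_once(store, record_send, "evt-42", "user-7", "push")
print(first, replay, len(sent))
```

One design choice to call out: this sketch claims the key before sending, so a crash between the claim and the send would suppress the retry. A fuller design would release the key on failure or record a pending state; which of those the course uses is not stated here.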
Priority Queue Isolation
TRICKY
Separate topics and worker pools per tier make OTP starvation structurally impossible
High Level System Design
Device Token Lifecycle
STANDARD
1B tokens, 1.5%/week churn, and the 410 feedback loop that protects sender reputation
Database Schema
Campaign Fanout
STANDARD
Two-stage chunked fanout turns one API call into 100M sends at 167K/sec
Core Notification Design
At-Least-Once + Idempotency
TRICKY
Deterministic keys + Redis SETNX make crash-replays invisible to users
Core Notification Design
Collapse Keys and Coalescing
STANDARD
50 likes become one buzz: coalescing windows server-side, collapse keys device-side
Core Notification Design
Per-User Budgets
STANDARD
Attention is a budget: 2 marketing pushes/day, then degrade down the channel ladder
Replication and Fault Tolerance
Provider Gateways and Backpressure
TRICKY
Circuit breakers, retry topics, and the retryable-vs-permanent error split at the APNs/FCM boundary
Replication and Fault Tolerance
Quiet Hours and Timezones
EASY
Local-time deferral, and the jittered 08:00 release that flattens the morning herd
Monitoring and Complete System
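The local-time deferral and jittered 08:00 release described under Quiet Hours and Timezones could look roughly like the sketch below. Only the 08:00 release time comes from the text. The 22:00 quiet-hours start, the 30-minute jitter window, and the hash-based jitter (used so a retried message lands on the same slot) are assumptions, and the fixed UTC offset stands in for a real timezone database lookup that would handle daylight saving time.

```python
import hashlib
from datetime import datetime, time, timedelta, timezone, tzinfo

QUIET_START = time(22, 0)    # ASSUMPTION: quiet hours begin at 22:00 local
QUIET_END = time(8, 0)       # 08:00 release, from the text
JITTER_WINDOW_SEC = 1800     # ASSUMPTION: spread releases over 30 minutes

def in_quiet_hours(local: datetime) -> bool:
    """The quiet window wraps past midnight, so it is an OR, not an AND."""
    t = local.time()
    return t >= QUIET_START or t < QUIET_END

def jitter_seconds(user_id: str) -> int:
    """Deterministic per-user offset so retries do not reshuffle the slot."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % JITTER_WINDOW_SEC

def release_time(now_utc: datetime, user_tz: tzinfo, user_id: str) -> datetime:
    """Return when to send: now if allowed, else jittered 08:00 local time."""
    local = now_utc.astimezone(user_tz)
    if not in_quiet_hours(local):
        return now_utc
    # Before 08:00 the release is today; after 22:00 it is tomorrow.
    day = local.date() if local.time() < QUIET_END else local.date() + timedelta(days=1)
    base = datetime.combine(day, QUIET_END, tzinfo=user_tz)
    return (base + timedelta(seconds=jitter_seconds(user_id))).astimezone(timezone.utc)

utc_plus_9 = timezone(timedelta(hours=9))   # stand-in for a real zone lookup
midnight_local = datetime(2024, 1, 1, 15, 0, tzinfo=timezone.utc)
noon_local = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)

deferred = release_time(midnight_local, utc_plus_9, "user-7")
immediate = release_time(noon_local, utc_plus_9, "user-7")
print(deferred.isoformat(), immediate.isoformat())
```

Without the jitter, every deferred notification in a timezone would release at exactly 08:00:00, producing the morning herd the card mentions; spreading releases across a window trades a small delay for a flatter load on the provider gateways.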