Notifications Cheat Sheet

Key concepts, trade-offs, and quick-reference notes for your interview prep.

Three Priority Tiers, Three Physical Queues

#1
We split traffic into physically separate Kafka topics with dedicated worker pools. P0 transactional (OTP, security alerts, ride updates): 50M/day, p99 under 5s, bypasses budgets and quiet hours. P1 engagement (messages, mentions): 3B/day, target 30s, coalescing applies. P2 marketing and digests (7B/day): batched, deferred by quiet hours, capped by per-user budgets. A priority FIELD in a shared queue does not work: Kafka consumes partitions in order, so 100M campaign messages ahead of an OTP delay it regardless of any field. Isolation makes starvation structurally impossible.

💡 One shared queue with a priority field is the classic wrong answer. Physical isolation per tier is the fix.
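
A minimal sketch of the routing rule and the head-of-line arithmetic behind it. The topic names and the type-to-tier table are hypothetical, and the 167K/sec drain rate is borrowed from the campaign fanout card purely for illustration.

```python
from enum import Enum


class Tier(Enum):
    P0 = "transactional"
    P1 = "engagement"
    P2 = "marketing"


# Hypothetical topic names: one physical topic (and worker pool) per tier.
TOPIC_BY_TIER = {
    Tier.P0: "notif.p0.transactional",
    Tier.P1: "notif.p1.engagement",
    Tier.P2: "notif.p2.marketing",
}

# Hypothetical classification table, using the examples named in this card.
TIER_BY_TYPE = {
    "otp": Tier.P0, "security_alert": Tier.P0, "ride_update": Tier.P0,
    "message": Tier.P1, "mention": Tier.P1,
    "campaign": Tier.P2, "digest": Tier.P2,
}


def topic_for(notification_type: str) -> str:
    """Choose the physical topic from the notification type at produce time."""
    return TOPIC_BY_TIER[TIER_BY_TYPE[notification_type]]


def head_of_line_wait_s(backlog_ahead: int, drain_per_sec: float) -> float:
    """Seconds a message waits behind `backlog_ahead` messages consumed in order."""
    return backlog_ahead / drain_per_sec


# An OTP stuck behind a 100M-message campaign in a shared queue (assumed drain rate).
shared_wait = head_of_line_wait_s(100_000_000, 167_000)
# The same OTP in its own P0 topic with no campaign backlog ahead of it.
isolated_wait = head_of_line_wait_s(0, 167_000)
```

Under these assumptions the shared queue holds the OTP for roughly ten minutes against a 5s target, which is the whole argument for separate topics.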

Scale Math: 10B/day to Gateway Fleet Size

#2
Volume: $10\text{B} / 86{,}400 \approx 116\text{K/sec}$ average, 580K/sec at 5x peak (breaking news plus campaign overlap). Provider round trip is ~50ms median, so peak in-flight requests: $580\text{K} \times 0.05 = 29\text{K}$. One gateway server with 8 APNs HTTP/2 connections at ~4,000 streams each holds 32K streams theoretical, ~300 concurrent requests practical per worker process. We provision ~100 gateway servers for 3x headroom across regions and providers. Kafka ingest: $116\text{K/sec} \times 500\text{B} = 58\text{ MB/sec}$, trivial for a modest cluster.

💡 Derive the fleet from concurrency (rate x latency), not from raw QPS. 580K/sec x 50ms = 29K in flight.
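
The arithmetic above can be replayed in a few lines. This is a sketch using only the card's numbers; the variable names are illustrative, and how the worker-process floor maps to physical servers depends on processes per server, which the card does not specify.

```python
import math

DAILY_SENDS = 10_000_000_000
PEAK_MULTIPLIER = 5
PROVIDER_RTT_S = 0.050            # ~50ms median provider round trip
PER_WORKER_CONCURRENCY = 300      # practical concurrent requests per worker process
AVG_MESSAGE_BYTES = 500

avg_rate = DAILY_SENDS / 86_400                    # about 116K sends/sec
peak_rate = avg_rate * PEAK_MULTIPLIER             # about 580K sends/sec
in_flight = peak_rate * PROVIDER_RTT_S             # rate x latency, about 29K
min_workers = math.ceil(in_flight / PER_WORKER_CONCURRENCY)  # floor before headroom
kafka_mb_per_sec = avg_rate * AVG_MESSAGE_BYTES / 1e6        # about 58 MB/sec

print(f"avg={avg_rate:,.0f}/s peak={peak_rate:,.0f}/s in_flight={in_flight:,.0f}")
print(f"min_workers={min_workers} kafka={kafka_mb_per_sec:.0f} MB/s")
```

The in-flight figure is Little's law (concurrency = arrival rate x time in system), which is why the fleet size tracks provider latency as much as it tracks QPS.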

Device Token Lifecycle and the 410 Feedback Loop

#3
500M users hold ~1B device tokens (2 devices each): $1\text{B} \times 150\text{B} = 150\text{ GB}$. Churn is 1.5% per week = 15M dead tokens/week from uninstalls and OS rotation. Never-pruned tokens waste throughput AND damage sender reputation with Apple/Google, throttling delivery for everyone. The loop: APNs 410 Unregistered or FCM UNREGISTERED at send time publishes an invalidation event; a pruning consumer deletes the token within minutes. Clients re-register their current token on every app launch; a 270-day last_seen backstop expires abandoned devices.

💡 410/Unregistered responses are permanent errors: route them to token pruning, never to retry.
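
A sketch of the lifecycle, assuming an in-memory dict in place of the real token store. `ProviderResponse`, `TokenStore`, and their methods are hypothetical names, and the invalidation event plus pruning consumer are collapsed into one direct call for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

BACKSTOP = timedelta(days=270)


@dataclass
class ProviderResponse:
    status: int        # HTTP status returned by the provider
    reason: str = ""   # e.g. "Unregistered" (APNs) or "UNREGISTERED" (FCM)


def is_dead_token(resp: ProviderResponse) -> bool:
    """Permanent 'this token is gone' signal: prune, never retry."""
    return resp.status == 410 or resp.reason.upper() == "UNREGISTERED"


class TokenStore:
    def __init__(self) -> None:
        self.last_seen: dict[str, datetime] = {}

    def register(self, token: str, now: datetime) -> None:
        """Called on every app launch; refreshes last_seen."""
        self.last_seen[token] = now

    def on_send_result(self, token: str, resp: ProviderResponse) -> str:
        """Stand-in for 'publish invalidation event, pruning consumer deletes'."""
        if is_dead_token(resp):
            self.last_seen.pop(token, None)
            return "pruned"
        return "kept"

    def expire_abandoned(self, now: datetime) -> int:
        """Backstop sweep: drop tokens not seen for more than 270 days."""
        stale = [t for t, seen in self.last_seen.items() if now - seen > BACKSTOP]
        for token in stale:
            del self.last_seen[token]
        return len(stale)


store = TokenStore()
launch = datetime(2024, 1, 1, tzinfo=timezone.utc)
store.register("tok-a", launch)
store.register("tok-b", launch)
print(store.on_send_result("tok-a", ProviderResponse(410, "Unregistered")))
print(store.expire_abandoned(launch + timedelta(days=271)))
```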

At-Least-Once + Deterministic Idempotency Keys

#4
Exactly-once is impossible across the provider boundary (no transactional API at APNs/FCM), and at-most-once drops OTPs. We run at-least-once with a deterministic idempotency key: `hash(user_id, event_id, channel)`. The gateway does a Redis SET with NX and a 24h TTL before each send (plain SETNX cannot attach an expiry atomically); a crashed worker's re-read batch hits existing keys and sends nothing. The key must be derived from the event, not a random UUID: a random key changes on retry and defeats dedup. The OS collapse identifier is the client-side second seatbelt.

💡 Random UUIDs as idempotency keys are a bug: retries mint new keys. Hash the event identity instead.
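
A sketch of the dedup check, assuming an in-memory class in place of Redis so the example runs anywhere. `DedupStore.claim` mimics "set if absent, with expiry"; SHA-256 is an illustrative choice of hash, and all names here are hypothetical.

```python
import hashlib
import time
import uuid


def idempotency_key(user_id: str, event_id: str, channel: str) -> str:
    """Deterministic: the same event always maps to the same key."""
    raw = f"{user_id}|{event_id}|{channel}".encode()
    return hashlib.sha256(raw).hexdigest()


class DedupStore:
    """In-memory stand-in for a Redis 'set if not exists, with TTL' call."""

    def __init__(self, ttl_s: float = 24 * 3600, clock=time.monotonic) -> None:
        self._expiry: dict[str, float] = {}
        self._ttl_s = ttl_s
        self._clock = clock

    def claim(self, key: str) -> bool:
        """True if this caller is the first to claim the key inside the TTL."""
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False
        self._expiry[key] = now + self._ttl_s
        return True


def send_batch(batch, store: DedupStore, sent: list) -> None:
    for user_id, event_id, channel in batch:
        if store.claim(idempotency_key(user_id, event_id, channel)):
            sent.append((user_id, event_id, channel))  # provider call would go here


store, sent = DedupStore(), []
batch = [("u1", "evt-9", "push"), ("u2", "evt-9", "push")]
send_batch(batch, store, sent)
send_batch(batch, store, sent)   # a crashed worker re-reads the same batch
print(len(sent))                 # 2: the replay sends nothing

# The bug from the tip: a random key differs on every attempt, so nothing dedups.
print(uuid.uuid4().hex == uuid.uuid4().hex)  # False
```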

Coalescing Windows + Provider Collapse Keys

#5
50 likes in 2 minutes must not be 50 buzzes. Server side: a 30-120s coalescing window keyed by (user, event_type, object) folds events into one payload: "Priya and 49 others liked your post". Provider side: apns-collapse-id / collapse_key makes queued messages replace each other on the device, so a phone leaving a dead zone gets the final state, not a 20-push burst. Coalescing removes ~60% of engagement volume (3B to 1.2B P1 sends/day), directly cutting provider spend. Direct messages never coalesce; likes and badge counts always do. FCM allows at most 4 distinct collapse keys per device at any one time.

💡 Coalescing is a latency trade: up to 120s added to collapsible types. Classify types explicitly.
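
A sketch of the server-side fold, assuming the caller passes in the events collected during one window. The event tuple layout, the `COALESCIBLE` set, and the message templates are hypothetical.

```python
# Types that fold; everything else (e.g. direct messages) passes through untouched.
COALESCIBLE = {"like"}

# Hypothetical copy templates per event type.
VERBS = {"like": "liked your post", "dm": "sent you a message"}


def render(event_type: str, actors: list) -> str:
    first, others = actors[0], len(actors) - 1
    verb = VERBS[event_type]
    if others == 0:
        return f"{first} {verb}"
    noun = "other" if others == 1 else "others"
    return f"{first} and {others} {noun} {verb}"


def coalesce(events: list) -> list:
    """events: (recipient, event_type, object_id, actor) tuples from one window.

    Returns (recipient, text) payloads: one per coalescing key, plus one per
    non-coalescible event.
    """
    out, groups = [], {}
    for recipient, event_type, object_id, actor in events:
        if event_type not in COALESCIBLE:
            out.append((recipient, render(event_type, [actor])))
            continue
        actors = groups.setdefault((recipient, event_type, object_id), [])
        if actor not in actors:
            actors.append(actor)
    for (recipient, event_type, _object_id), actors in groups.items():
        out.append((recipient, render(event_type, actors)))
    return out


likes = [("u42", "like", "post-7", "Priya")]
likes += [("u42", "like", "post-7", f"user{i}") for i in range(49)]
dms = [("u42", "dm", "thread-1", "Sam"), ("u42", "dm", "thread-1", "Sam")]
payloads = coalesce(likes + dms)
```

Here 52 events become 3 payloads: both direct messages are delivered individually, and the 50 likes fold into a single push.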

Per-User Daily Budgets (Fatigue Control)

#6
Uncoordinated product teams collectively overwhelm users, and a user who disables push is unreachable forever, including for future campaigns. We enforce per-user, per-tier token buckets at send time: P2 marketing 2/day, P1 engagement 10/day after coalescing, P0 unlimited. Counters: one Redis hash per user per day, $500\text{M} \times 40\text{B} = 20\text{ GB}$, 48h TTL. Budget exhausted means degrade down the channel ladder (push to in-app inbox to nothing), not silent drop. Check at send time, after coalescing and quiet-hours deferral, so deferred messages draw from the day they actually deliver.

💡 Opt-out is the real cost function. Past ~2 marketing pushes/day, disable rates climb sharply.
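
A sketch of the send-time check, assuming a plain per-day counter (matching "one hash per user per day") held in a dict instead of Redis. `BudgetTracker` is a hypothetical name, and the final "nothing" rung of the ladder is left out because the card does not give an inbox cap.

```python
from typing import Optional

# Daily push limits per tier from this card; None means unlimited.
DAILY_PUSH_LIMIT: dict[str, Optional[int]] = {"P0": None, "P1": 10, "P2": 2}


class BudgetTracker:
    def __init__(self) -> None:
        # (user_id, local_day, tier) -> pushes already sent that day
        self.counts: dict[tuple, int] = {}

    def decide(self, user_id: str, tier: str, local_day: str) -> str:
        """Return the channel to use for this send: 'push' or 'inbox'."""
        limit = DAILY_PUSH_LIMIT[tier]
        if limit is None:
            return "push"                      # P0 bypasses budgets
        key = (user_id, local_day, tier)
        used = self.counts.get(key, 0)
        if used < limit:
            self.counts[key] = used + 1        # draw from the delivery day's budget
            return "push"
        return "inbox"                         # degrade down the ladder, no silent drop


budget = BudgetTracker()
decisions = [budget.decide("u1", "P2", "2024-01-15") for _ in range(3)]
print(decisions)  # ['push', 'push', 'inbox']
```

Passing the delivery day (not the enqueue day) as `local_day` is what lets a deferred message draw from the day it actually arrives.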

Two-Stage Campaign Fanout

#7
One campaign call must become 100M device sends within a 10-minute window: $100\text{M} / 600\text{s} \approx 167\text{K/sec}$ of fanout on top of baseline. Stage one: resolve the segment into 10,000-user recipient chunks (10K chunks), each a single compact message. Stage two: fanout workers expand chunks, re-check preferences and budgets at send time, emit per-device messages to P2. Chunking gives parallelism, checkpointed retries (one chunk of blast radius, not the campaign), and a place to filter close to the data. 200 workers x 850 sends/sec sustains the wave.

💡 Chunk size balances checkpoint granularity vs message overhead. 10K users/chunk is the sweet spot.
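
The two stages can be sketched as a pair of generators. The callbacks `tokens_for` and `may_send` are hypothetical stand-ins for the device-token lookup and the send-time preference/budget check.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator

CHUNK_SIZE = 10_000


def chunk_recipients(user_ids: Iterable, size: int = CHUNK_SIZE) -> Iterator[list]:
    """Stage one: cut the resolved segment into fixed-size recipient chunks."""
    it = iter(user_ids)
    while chunk := list(islice(it, size)):
        yield chunk


def expand_chunk(
    chunk: list,
    tokens_for: Callable[[object], list],
    may_send: Callable[[object], bool],
) -> Iterator[tuple]:
    """Stage two: re-check each user at send time, then emit per-device messages."""
    for user_id in chunk:
        if not may_send(user_id):
            continue
        for token in tokens_for(user_id):
            yield (user_id, token)


# Tiny demo: 25 users in chunks of 10, two devices each, odd user ids opted out.
chunks = list(chunk_recipients(range(25), size=10))
messages = list(
    expand_chunk(
        chunks[0],
        tokens_for=lambda u: [f"tok-{u}-a", f"tok-{u}-b"],
        may_send=lambda u: u % 2 == 0,
    )
)
print([len(c) for c in chunks], len(messages))
```

A failed worker retries only the chunk it held, so the unit of rework is one chunk rather than the whole campaign.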

Gateway Backpressure: Circuit Breakers + Retry Topics

#8
When FCM error rates spike to 40%, naive immediate retries triple traffic against a struggling dependency. Correct posture: each provider pool tracks a rolling error rate; past threshold the circuit opens and calls fail fast for a cool-down. Unsent messages park in a retry topic with exponential backoff and jitter (1m, 5m, 25m), preserving at-least-once. Error classification is load-bearing: 5xx/timeouts retry; token-level 4xx (410 Unregistered, 400 BadDeviceToken) are permanent and route to pruning; 429 rate-limit responses are the one 4xx that retries with backoff. P0 fails over across channels: OTP push falls back to SMS after two failed attempts within 30s.

💡 Push has no alternative carrier: only Apple delivers to iPhones. Redundancy is across channels, not providers.
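
A sketch of the three pieces: error classification, the backoff schedule, and a rolling-window breaker. The threshold, window size, cool-down, and jitter range are assumed values, and retrying on 429 is an assumption added here (rate limiting is transient) rather than something the card states.

```python
import random
from collections import deque
from typing import Optional


def classify(status: Optional[int]) -> str:
    """Map a provider result to an action; status None models a timeout."""
    if status is None or status >= 500:
        return "retry"
    if status == 429:
        return "retry"        # assumption: rate limiting is transient
    if status >= 400:
        return "permanent"    # e.g. 410 Unregistered, 400 BadDeviceToken
    return "ok"


BACKOFF_BASE_S = (60, 300, 1500)  # 1m, 5m, 25m


def retry_delay_s(attempt: int, rng: random.Random) -> Optional[float]:
    """Delay before retry number `attempt` (0-based), or None when exhausted."""
    if attempt >= len(BACKOFF_BASE_S):
        return None
    return BACKOFF_BASE_S[attempt] * rng.uniform(0.8, 1.2)  # +/-20% jitter (assumed)


class CircuitBreaker:
    """Opens when the rolling error rate crosses the threshold, then cools down."""

    def __init__(self, threshold=0.4, window=100, cooldown_s=30.0, min_samples=20):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.min_samples = min_samples
        self.outcomes = deque(maxlen=window)   # True = success, False = error
        self.open_until = 0.0

    def allow(self, now: float) -> bool:
        """False while the circuit is open: fail fast and park in the retry topic."""
        return now >= self.open_until

    def record(self, ok: bool, now: float) -> None:
        self.outcomes.append(ok)
        if len(self.outcomes) < self.min_samples:
            return
        error_rate = self.outcomes.count(False) / len(self.outcomes)
        if error_rate >= self.threshold:
            self.open_until = now + self.cooldown_s
            self.outcomes.clear()              # start fresh after the cool-down


breaker = CircuitBreaker()
for _ in range(20):
    breaker.record(ok=False, now=0.0)
print(breaker.allow(now=1.0), breaker.allow(now=31.0))  # False True
```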

Quiet Hours: Local Time, Deferred Not Dropped

#9
Non-urgent sends respect 22:00-08:00 in the recipient's IANA timezone (stored per user, refreshed each app launch), never the server's timezone or UTC. In-window messages are deferred with a delivery-after timestamp at next local 08:00, then re-enter the pipeline and draw from that day's budget. The morning boundary creates a rolling thundering herd: deferred volume becomes eligible at 08:00 per timezone. Flatten it with 30-60 min jittered release and pre-sharding campaigns by timezone bucket ("recipient's 9 AM" = 24+ staggered waves). P0 always bypasses: a 3 AM OTP was requested at 3 AM.

💡 Order matters: defer first, budget-check at release. Otherwise deferred messages double-draw budgets.
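
A sketch of the deferral decision, assuming the user's timezone arrives as a `tzinfo` object (in practice something like `zoneinfo.ZoneInfo("Asia/Kolkata")` built from the stored IANA name). The function names and the 45-minute default jitter spread are hypothetical.

```python
import random
from datetime import datetime, time, timedelta, timezone, tzinfo
from typing import Optional

QUIET_START = time(22, 0)
QUIET_END = time(8, 0)


def deliver_after(now_utc: datetime, user_tz: tzinfo, tier: str) -> Optional[datetime]:
    """Return None to send now, else the UTC instant of the next local 08:00."""
    if tier == "P0":
        return None                                  # P0 always bypasses quiet hours
    local = now_utc.astimezone(user_tz)
    if QUIET_END <= local.time() < QUIET_START:
        return None                                  # outside the quiet window
    release = local.replace(hour=8, minute=0, second=0, microsecond=0)
    if local.time() >= QUIET_START:
        release += timedelta(days=1)                 # evening: 08:00 is tomorrow
    return release.astimezone(timezone.utc)


def jittered(release: datetime, rng: random.Random, spread_min: float = 45.0) -> datetime:
    """Spread releases over a window after 08:00 to flatten the morning herd."""
    return release + timedelta(minutes=rng.uniform(0, spread_min))


# Fixed UTC-5 offset keeps the demo independent of the system timezone database.
utc_minus_5 = timezone(timedelta(hours=-5))
late_evening = datetime(2024, 1, 15, 4, 0, tzinfo=timezone.utc)   # 23:00 local, Jan 14
print(deliver_after(late_evening, utc_minus_5, "P2"))  # 2024-01-15 13:00:00+00:00
print(deliver_after(late_evening, utc_minus_5, "P0"))  # None
```

The budget check is deliberately absent from this function: it runs when the message re-enters the pipeline at release time.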

The Five Metrics That Matter

#10
End-to-end latency per tier: P0 p99 under 5s (alert at 8s), measured ingestion to provider-accepted. Delivery rate: provider-accepted / attempted, target 99%+ after pruning. Provider error rate: rolling per-provider; feeds the circuit breaker and pages at 10%. Token invalidation rate: baseline 1.5%/week; a spike means a bad app release or provider incident. Opt-out rate: the silent quality metric; a step change after a campaign means fatigue budgets failed. Weekly: reconcile provider-accepted counts vs receipts to catch silent delivery loss.

💡 Opt-out rate is the metric nobody dashboards until it is too late. A push disabled is a user lost.
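
A sketch of how the first three metrics could be turned into alert checks, using the thresholds quoted in this card. The function names, the nearest-rank percentile method, and the alert labels are assumptions; treating a delivery rate below the 99% target as an alert is also an assumption.

```python
def p99(values: list) -> float:
    """Nearest-rank 99th percentile: the value at rank ceil(0.99 * n)."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100   # integer ceil(0.99 * n)
    return ordered[rank - 1]


def evaluate(p0_latencies_s: list, accepted: int, attempted: int,
             provider_errors: int, provider_calls: int) -> list:
    """Return the names of the alerts that should fire for this window."""
    alerts = []
    if p99(p0_latencies_s) > 8.0:                 # target 5s, alert at 8s
        alerts.append("p0_latency")
    if accepted / attempted < 0.99:               # provider-accepted / attempted
        alerts.append("delivery_rate")
    if provider_errors / provider_calls >= 0.10:  # pages at 10%
        alerts.append("provider_error_rate")
    return alerts


healthy = evaluate([1.0] * 99 + [9.0], 995, 1000, 20, 1000)
degraded = evaluate([1.0] * 98 + [9.0] * 2, 950, 1000, 150, 1000)
print(healthy)
print(degraded)
```

One slow send in a hundred leaves p99 untouched, while two in a hundred push it past the 8s alert line.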