Notifications Cheat Sheet

Key concepts, trade-offs, and quick-reference notes for your interview prep.

Three Priority Tiers, Three Physical Queues

We split traffic into physically separate Kafka topics with dedicated worker pools. P0 transactional (OTP, security alerts, ride updates): 50M/day, p99 under $5\text{s}$, bypasses budgets and quiet hours. P1 engagement (messages, mentions): 3B/day, target 30s, coalescing applies. P2 marketing and digests (7B/day): batched, deferred by quiet hours, capped by per-user budgets. A priority FIELD in a shared queue does not work: Kafka consumes partitions in order, so 100M campaign messages ahead of an OTP delay it regardless of any field. Isolation makes starvation structurally impossible.

💡 One shared queue with a priority field is the classic wrong answer. Physical isolation per tier is the fix.
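
To make the isolation concrete, here is a minimal routing sketch. The topic names, the type-to-tier table, and the fallback for unknown types are all illustrative assumptions, not part of the design above; the only point is that the tier decides the physical topic before anything is enqueued.

```python
from enum import Enum


class Tier(Enum):
    P0 = "transactional"
    P1 = "engagement"
    P2 = "marketing"


# Hypothetical topic names: one physical Kafka topic per tier.
TOPIC_BY_TIER = {
    Tier.P0: "notif.p0.transactional",
    Tier.P1: "notif.p1.engagement",
    Tier.P2: "notif.p2.marketing",
}

# Hypothetical classification table built from the examples in the text.
TIER_BY_TYPE = {
    "otp": Tier.P0,
    "security_alert": Tier.P0,
    "ride_update": Tier.P0,
    "message": Tier.P1,
    "mention": Tier.P1,
    "campaign": Tier.P2,
    "digest": Tier.P2,
}


def topic_for(notification_type: str) -> str:
    """Pick the physical topic; unknown types fall to P2 (an assumption)."""
    tier = TIER_BY_TYPE.get(notification_type, Tier.P2)
    return TOPIC_BY_TIER[tier]
```

Because an OTP and a campaign message never share a topic, no amount of campaign backlog can sit ahead of the OTP in a partition.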

Scale Math: 10B/day to Gateway Fleet Size

Volume: $10\text{B} / 86{,}400 \approx 116\text{K/sec}$ average, 580K/sec at 5x peak (breaking news plus campaign overlap). Provider round trip is ~50ms median, so peak in-flight requests: $580\text{K} \times 0.05 = 29\text{K}$. One gateway server with 8 APNs HTTP/2 connections at ~4,000 streams each holds 32K streams in theory; in practice a worker process sustains ~300 concurrent requests. We provision ~100 gateway servers for 3x headroom across regions and providers. Kafka ingest: $116\text{K/sec} \times 500\text{B} = 58\text{ MB/sec}$, trivial for a modest cluster.

💡 Derive the fleet from concurrency (rate x latency), not from raw QPS. 580K/sec x 50ms = 29K in flight.
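
The arithmetic above can be checked with a few lines. This is only a sketch of the rate x latency calculation (Little's law); the variable names are ours, and the text does not say how many worker processes run on each gateway server, so the sketch stops at worker processes rather than servers.

```python
SECONDS_PER_DAY = 86_400


def in_flight(rate_per_sec: float, latency_sec: float) -> float:
    """Little's law: concurrent requests = arrival rate x time in system."""
    return rate_per_sec * latency_sec


daily_volume = 10_000_000_000
avg_rate = daily_volume / SECONDS_PER_DAY        # ~115.7K/sec, rounded to 116K in the text
peak_rate = 5 * avg_rate                         # ~579K/sec (580K when computed from 116K)
peak_in_flight = in_flight(peak_rate, 0.050)     # ~28.9K concurrent provider requests
workers_at_peak = peak_in_flight / 300           # ~96 worker processes at 300 requests each
ingest_mb_per_sec = avg_rate * 500 / 1_000_000   # ~57.9 MB/sec into Kafka
```

Doubling the provider latency doubles the in-flight count and therefore the fleet, even though QPS is unchanged. That is why the sizing starts from concurrency.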

Device Token Lifecycle and the 410 Feedback Loop

500M users hold ~1B device tokens (2 devices each): $1\text{B} \times 150\text{B} = 150\text{ GB}$. Churn is 1.5% per week = 15M dead tokens/week from uninstalls and OS rotation. Never-pruned tokens waste throughput AND damage sender reputation with Apple/Google, throttling delivery for everyone. The loop: APNs 410 Unregistered or FCM UNREGISTERED at send time publishes an invalidation event; a pruning consumer deletes the token within minutes. Clients re-register their current token on every app launch; a 270-day last_seen backstop expires abandoned devices.

💡 410/Unregistered responses are permanent errors: route them to token pruning, never to retry.
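
A small in-memory sketch of the lifecycle follows. `TokenRegistry` and `handle_send_result` are hypothetical names, the dictionary stands in for the real token store, and the direct `prune` call collapses the "publish an invalidation event, consume it, delete" hop into one step for brevity.

```python
from datetime import datetime, timedelta, timezone

BACKSTOP = timedelta(days=270)

# Provider reasons the text treats as permanent token failures.
PERMANENT_TOKEN_ERRORS = {"Unregistered", "UNREGISTERED", "BadDeviceToken"}


class TokenRegistry:
    """In-memory stand-in for the device token store (illustrative only)."""

    def __init__(self):
        self._last_seen = {}  # token -> datetime of last registration

    def register(self, token, now):
        # Called on every app launch; refreshes last_seen.
        self._last_seen[token] = now

    def prune(self, token):
        self._last_seen.pop(token, None)

    def expire_abandoned(self, now):
        # Backstop: drop tokens not re-registered within 270 days.
        stale = [t for t, seen in self._last_seen.items() if now - seen > BACKSTOP]
        for t in stale:
            del self._last_seen[t]
        return stale

    def active(self):
        return set(self._last_seen)


def handle_send_result(registry, token, error_reason):
    """Route permanent token errors to pruning instead of retry."""
    if error_reason in PERMANENT_TOKEN_ERRORS:
        registry.prune(token)
        return "pruned"
    return "ok" if error_reason is None else "not_token_error"
```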

At-Least-Once + Deterministic Idempotency Keys

Exactly-once is impossible across the provider boundary (no transactional API at APNs/FCM), and at-most-once drops OTPs. We run at-least-once with a deterministic idempotency key: $\text{hash}(\text{user\_id}, \text{event\_id}, \text{channel})$. The gateway does Redis SETNX with 24h TTL before each send; a crashed worker's re-read batch hits existing keys and sends nothing. The key must be derived from the event, not a random UUID: a random key changes on retry and defeats dedup. The OS collapse identifier is the client-side second seatbelt.

💡 Random UUIDs as idempotency keys are a bug: retries mint new keys. Hash the event identity instead.
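
The dedup step can be sketched as below. SHA-256 and the separator byte are our choices (the text only says "hash"), and `InMemorySetNX` is a hypothetical stand-in that imitates Redis set-if-absent-with-TTL semantics so the example runs without a server.

```python
import hashlib
import time


def idempotency_key(user_id: str, event_id: str, channel: str) -> str:
    """Deterministic key: the same event always hashes to the same value."""
    raw = "\x1f".join((user_id, event_id, channel))  # unit separator avoids ambiguous joins
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class InMemorySetNX:
    """Stand-in for Redis SETNX with a TTL (illustrative, single process)."""

    def __init__(self, clock=time.monotonic):
        self._expiry = {}
        self._clock = clock

    def set_if_absent(self, key: str, ttl_sec: float) -> bool:
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False  # key already claimed and not yet expired
        self._expiry[key] = now + ttl_sec
        return True


def send_once(store, user_id, event_id, channel, send) -> bool:
    """Claim the key first; call send() only if this is the first claim."""
    key = idempotency_key(user_id, event_id, channel)
    if not store.set_if_absent(key, ttl_sec=24 * 3600):
        return False
    send()
    return True
```

A worker that crashes and re-reads its batch calls `send_once` with the same arguments, derives the same key, and gets `False` for everything already claimed.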

Coalescing Windows + Provider Collapse Keys

50 likes in 2 minutes must not be 50 buzzes. Server side: a 30-120s coalescing window keyed by (user, event_type, object) folds events into one payload: "Priya and 49 others liked your post". Provider side: apns-collapse-id / collapse_key makes queued messages replace each other on the device, so a phone leaving a dead zone gets the final state, not a 20-push burst. Coalescing removes ~60% of engagement volume (3B to 1.2B P1 sends/day), directly cutting provider spend. Direct messages never coalesce; likes and badge counts always do. FCM keeps only 4 collapse keys per device offline.

💡 Coalescing is a latency trade: up to 120s added to collapsible types. Classify types explicitly.
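
Server-side folding might look roughly like this. The `Event` shape, the `COLLAPSIBLE` set, and the tumbling-window rule (a window opens at the first event for a key) are assumptions made for the sketch; a production version would flush on a timer rather than over a finished list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user: str        # recipient
    event_type: str  # e.g. "like", "dm"
    obj: str         # the post or thread the event is about
    actor: str       # who triggered it
    ts: float        # seconds


# Explicit classification: direct messages are deliberately not listed.
COLLAPSIBLE = {"like", "badge"}


def coalesce(events, window_sec=120.0):
    """Fold collapsible events sharing (user, event_type, obj) into one batch per window."""
    open_batches = {}  # key -> (window_start_ts, actors)
    out = []           # list of (key, actors), in order of first occurrence
    for ev in sorted(events, key=lambda e: e.ts):
        key = (ev.user, ev.event_type, ev.obj)
        if ev.event_type not in COLLAPSIBLE:
            out.append((key, [ev.actor]))  # never coalesced
            continue
        batch = open_batches.get(key)
        if batch is None or ev.ts - batch[0] >= window_sec:
            batch = (ev.ts, [])
            open_batches[key] = batch
            out.append((key, batch[1]))
        batch[1].append(ev.actor)
    return out


def render(actors, verb="liked your post"):
    if len(actors) == 1:
        return f"{actors[0]} {verb}"
    others = len(actors) - 1
    return f"{actors[0]} and {others} other{'s' if others != 1 else ''} {verb}"
```

Fifty likes spaced two seconds apart land in one batch and render as a single "Priya and 49 others liked your post" payload, while two direct messages in the same thread stay as two sends.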

Per-User Daily Budgets (Fatigue Control)

Uncoordinated product teams collectively overwhelm users, and a user who disables push is unreachable forever, including for future campaigns. We enforce per-user, per-tier token buckets at send time: P2 marketing 2/day, P1 engagement 10/day after coalescing, P0 unlimited. Counters: one Redis hash per user per day, $500\text{M} \times 40\text{B} = 20\text{ GB}$, 48h TTL. Budget exhausted means degrade down the channel ladder (push to in-app inbox to nothing), not silent drop. Check at send time, after coalescing and quiet-hours deferral, so deferred messages draw from the day they actually deliver.

💡 Opt-out is the real cost function. Past ~2 marketing pushes/day, disable rates climb sharply.
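
A minimal sketch of the send-time check, using the limits quoted above. The in-memory dict stands in for the per-user Redis hash, and `inbox_enabled` is a hypothetical flag: the text does not say what moves a message from "inbox" to "nothing", so the sketch assumes it is inbox availability.

```python
# Daily limits from the text; None means unlimited.
DAILY_LIMITS = {"P0": None, "P1": 10, "P2": 2}


class BudgetCounter:
    """In-memory stand-in for one Redis hash per user per day."""

    def __init__(self):
        self._counts = {}  # (user_id, day, tier) -> sends so far

    def try_consume(self, user_id: str, day: str, tier: str) -> bool:
        limit = DAILY_LIMITS[tier]
        if limit is None:
            return True  # P0 bypasses budgets
        key = (user_id, day, tier)
        used = self._counts.get(key, 0)
        if used >= limit:
            return False
        self._counts[key] = used + 1
        return True


def choose_channel(counter, user_id, day, tier, inbox_enabled=True) -> str:
    """Walk the channel ladder: push, then in-app inbox, then nothing."""
    if counter.try_consume(user_id, day, tier):
        return "push"
    return "inbox" if inbox_enabled else "none"
```

`day` should be the delivery day, computed after any quiet-hours deferral, so a message deferred overnight draws from the morning's budget.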

Two-Stage Campaign Fanout

One campaign call must become 100M device sends within a 10-minute window: $100\text{M} / 600\text{s} \approx 167\text{K/sec}$ of fanout on top of baseline. Stage one: resolve the segment into 10,000-user recipient chunks (10K chunks in total), each a single compact message. Stage two: fanout workers expand chunks, re-check preferences and budgets at send time, and emit per-device messages to P2. Chunking gives parallelism, checkpointed retries (one chunk of blast radius, not the campaign), and a place to filter close to the data. 200 workers x 850 sends/sec = 170K/sec sustains the wave.

💡 Chunk size balances checkpoint granularity vs message overhead. 10K users/chunk is the sweet spot.
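
The two stages can be sketched as a pair of generators. `chunked`, `expand_chunk`, and the `devices_for` / `allowed` callbacks are hypothetical names for illustration; in the real pipeline each chunk would be a Kafka message and the callbacks would be store lookups.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple


def chunked(recipients: Iterable[str], size: int = 10_000) -> Iterator[List[str]]:
    """Stage one: split a resolved segment into fixed-size recipient chunks."""
    it = iter(recipients)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def expand_chunk(
    chunk: List[str],
    devices_for: Callable[[str], List[str]],
    allowed: Callable[[str], bool],
) -> Iterator[Tuple[str, str]]:
    """Stage two: re-check each user at send time, then emit one message per device."""
    for user in chunk:
        if not allowed(user):  # preferences and budget check, close to the data
            continue
        for token in devices_for(user):
            yield (user, token)


# Throughput check using the figures from the text.
required_rate = 100_000_000 / 600   # ~166.7K sends/sec
fleet_capacity = 200 * 850          # 170K sends/sec
```

If a fanout worker dies mid-chunk, only that chunk is replayed, and the idempotency keys described earlier absorb the duplicates.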

Gateway Backpressure: Circuit Breakers + Retry Topics

When FCM error rates spike to 40%, naive immediate retries triple traffic against a struggling dependency. Correct posture: each provider pool tracks a rolling error rate; past threshold the circuit opens and calls fail fast for a cool-down. Unsent messages park in a retry topic with exponential backoff and jitter (1m, 5m, 25m), preserving at-least-once. Error classification is load-bearing: 5xx, timeouts, and 429 throttling responses retry; token-level 4xx errors (410 Unregistered, 400 BadDeviceToken) are permanent and route to pruning. P0 fails over across channels: OTP push falls back to SMS after two failed attempts within 30s.

💡 Push has no alternative carrier: only Apple delivers to iPhones. Redundancy is across channels, not providers.
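
A compact sketch of the breaker and the backoff schedule follows. The window size, 25% threshold, 30s cool-down, minimum sample count, and plus/minus 20% jitter are all placeholder assumptions; the text specifies only the rolling error rate, the fail-fast cool-down, and the 1m / 5m / 25m ladder.

```python
import random
from collections import deque


class CircuitBreaker:
    """Opens when the rolling error rate crosses a threshold, then fails fast."""

    def __init__(self, window=100, threshold=0.25, cooldown_sec=30.0, min_samples=20):
        self._results = deque(maxlen=window)  # True = success, False = error
        self.threshold = threshold
        self.cooldown_sec = cooldown_sec
        self.min_samples = min_samples
        self._open_until = 0.0

    def allow(self, now: float) -> bool:
        """False while the circuit is open; callers park the message in the retry topic."""
        return now >= self._open_until

    def record(self, ok: bool, now: float) -> None:
        self._results.append(ok)
        if len(self._results) < self.min_samples:
            return
        error_rate = self._results.count(False) / len(self._results)
        if error_rate >= self.threshold:
            self._open_until = now + self.cooldown_sec
            self._results.clear()  # start fresh after the cool-down


def retry_delay(attempt: int, rng: random.Random, base_sec=60.0, factor=5.0, jitter=0.2) -> float:
    """Exponential backoff (60s, 300s, 1500s for attempts 0, 1, 2) with multiplicative jitter."""
    return base_sec * factor ** attempt * rng.uniform(1 - jitter, 1 + jitter)
```

Jitter matters because every message parked during the same outage would otherwise come due at the same instant and re-create the spike.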

Quiet Hours: Local Time, Deferred Not Dropped

Non-urgent sends respect 22:00-08:00 in the recipient's IANA timezone (stored per user, refreshed each app launch), never the server's timezone or UTC. In-window messages are deferred with a delivery-after timestamp at next local 08:00, then re-enter the pipeline and draw from that day's budget. The morning boundary creates a rolling thundering herd: deferred volume becomes eligible at 08:00 per timezone. Flatten it with 30-60 min jittered release and pre-sharding campaigns by timezone bucket ("recipient's 9 AM" = 24+ staggered waves). P0 always bypasses: a 3 AM OTP was requested at 3 AM.

💡 Order matters: defer first, budget-check at release. Otherwise deferred messages double-draw budgets.
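
The deferral rule can be sketched with the standard library. Function names are ours, and the uniform 0 to 60 minute jitter is one assumed way to spread the 08:00 release; the quiet window itself (22:00 to 08:00 local) is from the text. `zoneinfo` requires Python 3.9+ and an available IANA database.

```python
import random
from datetime import datetime, time, timedelta, timezone, tzinfo
from typing import Optional
from zoneinfo import ZoneInfo

QUIET_START = time(22, 0)
QUIET_END = time(8, 0)


def deliver_after(now_utc: datetime, user_tz: tzinfo) -> Optional[datetime]:
    """Return the UTC release time if the user is in quiet hours, else None."""
    local = now_utc.astimezone(user_tz)
    t = local.time()
    if QUIET_END <= t < QUIET_START:
        return None  # daytime for the recipient: send now
    # Before 08:00 releases today; from 22:00 onward releases tomorrow.
    release_day = local.date() + timedelta(days=1) if t >= QUIET_START else local.date()
    release_local = datetime.combine(release_day, QUIET_END, tzinfo=user_tz)
    return release_local.astimezone(timezone.utc)


def deliver_after_for_zone(now_utc: datetime, tz_name: str) -> Optional[datetime]:
    """Convenience wrapper for a stored IANA name such as "America/New_York"."""
    return deliver_after(now_utc, ZoneInfo(tz_name))


def with_jitter(release_utc: datetime, rng: random.Random, max_minutes: float = 60.0) -> datetime:
    """Spread the morning release so each timezone's 08:00 is not a single spike."""
    return release_utc + timedelta(minutes=rng.uniform(0, max_minutes))
```

P0 traffic never calls `deliver_after`; the budget check runs only when the deferred message is released, against that day's counter.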

The Five Metrics That Matter

End-to-end latency per tier: P0 p99 under 5s (alert at 8s), measured ingestion to provider-accepted. Delivery rate: provider-accepted / attempted, target 99%+ after pruning. Provider error rate: rolling per-provider; feeds the circuit breaker and pages at 10%. Token invalidation rate: baseline 1.5%/week; a spike means a bad app release or provider incident. Opt-out rate: the silent quality metric; a step change after a campaign means fatigue budgets failed. Weekly: reconcile provider-accepted counts vs receipts to catch silent delivery loss.

💡 Opt-out rate is the metric nobody dashboards until it is too late. A push disabled is a user lost.