Notifications Failure Modes
What breaks, how to detect it, and how to fix it. Every failure includes detection metrics, mitigations, and a severity rating.
Provider outage (FCM returns 40% errors for 2 hours)
FCM has a regional incident. Naive gateways retry immediately, tripling traffic against a struggling dependency; worker threads pile up on timeouts and the gateway fleet itself becomes unavailable, taking APNs delivery down with it.
- Circuit breaker opens per provider pool: fail fast, stop burning threads on a known-bad dependency
- Park unsent messages in the retry topic with exponential backoff + jitter (1m, 5m, 25m); 24h buffer capacity
- P0 fails over across channels after two attempts: OTP push becomes SMS within 30 seconds
- Isolation by provider pool: APNs delivery is unaffected by the FCM breaker
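A minimal sketch of the per-provider breaker and the retry schedule described above. The class name, failure threshold, cooldown, and 20% jitter width are assumptions for illustration; only the 1m/5m/25m schedule and the per-pool isolation come from the list above.

```python
import random
import time


class CircuitBreaker:
    """Per-provider breaker: opens after too many consecutive failures."""

    def __init__(self, failure_threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value
        self.cooldown_s = cooldown_s                # assumed value
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self):
        """Return True if a provider call may be attempted right now."""
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, calls are let through again as
        # probes; a failed probe re-opens the breaker in record().
        return self.clock() - self.opened_at >= self.cooldown_s

    def record(self, ok):
        """Report the outcome of a provider call."""
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()


def retry_delay_s(attempt, base_s=60.0, factor=5.0, jitter=0.2, rng=random):
    """Delay before retry `attempt` (0-based): 1m, 5m, 25m, each +/- 20%."""
    nominal = base_s * factor ** attempt
    return nominal * rng.uniform(1 - jitter, 1 + jitter)


# One breaker per provider pool, so an open FCM breaker never blocks APNs.
breakers = {"fcm": CircuitBreaker(), "apns": CircuitBreaker()}
```

A send worker would check `breakers[provider].allow()` before each call and, when it returns False, park the message in the retry topic with `retry_delay_s(attempt)` instead of waiting on a timeout.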
Campaign burst starves transactional sends
A 100M-recipient campaign coincides with peak organic traffic. In a shared-queue design, OTPs queue behind millions of promotional messages and login flows time out platform-wide.
- Physical tier isolation: campaigns can only enter the P2 topic and P2 worker pool by API contract (403 otherwise)
- Campaign chunk emission throttled to 167K/sec regardless of requested window
- P0 worker pool autoscales on latency, not throughput, and never shares hosts with P2 workers
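At 167K/sec, a 100M-recipient campaign takes roughly 10 minutes to emit (100,000,000 / 167,000 ≈ 599 s), however short the requested window. The sketch below shows that cap together with the tier gate. The function names, the `source` and `tier` strings, and the 202 success status are illustrative assumptions; the 403 and the 167K/sec cap come from the list above.

```python
CAMPAIGN_MAX_RATE = 167_000  # messages/sec, the chunk-emission throttle


def effective_rate(recipients, requested_window_s):
    """Emission rate in messages/sec, capped at the campaign throttle."""
    requested = recipients / requested_window_s
    return min(requested, CAMPAIGN_MAX_RATE)


def emission_time_s(recipients, requested_window_s):
    """Seconds the campaign actually takes to emit under the cap."""
    return recipients / effective_rate(recipients, requested_window_s)


def admit(source, tier):
    """Return an HTTP-style status: campaigns may only target P2.

    202 for accepted requests is an assumption; the source only
    specifies the 403 rejection.
    """
    if source == "campaign" and tier != "P2":
        return 403
    return 202
```

A campaign that asks for a 60-second window is stretched to about 599 seconds, while one that asks for an hour is left at its requested pace.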
Duplicate notification storm after consumer crash
A fanout worker sends a batch, crashes before committing its Kafka offset, and the replacement re-reads the batch. Without dedup, hundreds of users receive duplicate notifications; payment confirmations look like double charges and support tickets spike.
- Redis SETNX on the deterministic idempotency key before every provider call; replayed batches skip
- Collapse IDs on device as second defense: same-key notifications replace, not stack
- Offset commit AFTER dedup-mark, so the mark itself is what makes replays safe
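The ordering above (mark, send, then commit) can be sketched as follows. An in-memory store stands in for Redis so the example is self-contained; with redis-py the equivalent of `set_if_absent` would be a `SET` with `nx=True` and an expiry. The key fields (notification ID plus device token) and all function names are assumptions, since the source only says the key is deterministic.

```python
import hashlib


class InMemoryDedupStore:
    """Stand-in for Redis; set_if_absent mimics SET key value NX."""

    def __init__(self):
        self._keys = set()

    def set_if_absent(self, key):
        """Return True if the key was newly set, False if it existed."""
        if key in self._keys:
            return False
        self._keys.add(key)
        return True


def idempotency_key(notification_id, device_token):
    """Deterministic key: the same inputs always hash to the same key."""
    raw = f"{notification_id}:{device_token}".encode()
    return hashlib.sha256(raw).hexdigest()


def process_batch(batch, store, send, commit_offset):
    """Send each message at most once, then commit the batch offset."""
    sent = 0
    for notification_id, device_token in batch:
        key = idempotency_key(notification_id, device_token)
        if not store.set_if_absent(key):
            continue  # already marked by an earlier attempt: skip
        send(notification_id, device_token)
        sent += 1
    commit_offset()  # only after every message in the batch is marked
    return sent
```

If the worker crashes anywhere before `commit_offset()`, the replacement replays the batch and skips every message that was already marked. Because the mark precedes the provider call, a crash between the two drops that one message rather than duplicating it.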
Dead-token backlog degrades sender reputation
The 410 pruning consumer silently falls behind (or a deploy breaks it). The share of sends going to uninstalled apps climbs toward double-digit percentages; Apple and Google interpret the unregistered rate as spam behavior and throttle the sender, cutting delivery for every user.
- Prune on 410/UNREGISTERED within 5 minutes via a dedicated consumer with its own lag alarm
- 270-day last_seen expiry as the backstop for tokens that die without any send ever observing the failure
- Weekly hygiene job reconciles provider unregistered rates against pruning throughput
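The two pruning rules above reduce to a small predicate, sketched here under assumptions: the function names are hypothetical, and `provider_status` is taken to be either the numeric 410 or the string `UNREGISTERED` exactly as written in the list. The rate helper is the kind of figure the weekly hygiene job would compare against pruning throughput.

```python
from datetime import datetime, timedelta, timezone

LAST_SEEN_EXPIRY = timedelta(days=270)  # backstop from the list above


def should_prune(provider_status, last_seen, now):
    """Prune on an explicit provider signal, else on last_seen age."""
    if provider_status in (410, "UNREGISTERED"):
        return True
    return now - last_seen > LAST_SEEN_EXPIRY


def unregistered_rate(unregistered, total):
    """Fraction of sends that hit a dead token (0.0 when nothing was sent)."""
    return unregistered / total if total else 0.0
```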
Quiet-hours timezone bug wakes users at 3 AM
A scheduler change computes quiet hours in UTC instead of recipient local time. The next campaign wave lands at 3 AM across half the planet; opt-outs spike within the hour and the damage is permanent for every user who disables push.
- Local-hour assertion in the send path itself: a P2 send computed to land in the recipient's quiet window is rejected, not just logged
- Canary waves: campaigns release to 1% of each timezone bucket first with an opt-out-rate gate before the full wave
- Deferred-release jitter (30-60 min): sends held during quiet hours are released over a randomized window, so the 08:00 boundary does not become its own incident
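A sketch of the send-path assertion, assuming a quiet window of 22:00 to 08:00 recipient local time. The 08:00 end matches the boundary mentioned above; the 22:00 start, the exception type, and the function names are assumptions.

```python
from datetime import datetime, time, timedelta, timezone

QUIET_START = time(22, 0)  # assumed start of the quiet window
QUIET_END = time(8, 0)     # matches the 08:00 boundary above


class QuietHoursViolation(Exception):
    """Raised when a P2 send would land in the recipient's quiet window."""


def in_quiet_hours(send_at_utc, recipient_tz):
    """Evaluate the window in the recipient's local time, never in UTC."""
    local = send_at_utc.astimezone(recipient_tz).time()
    # The window wraps midnight: quiet at/after the start OR before the end.
    return local >= QUIET_START or local < QUIET_END


def assert_sendable(tier, send_at_utc, recipient_tz):
    """Reject (not just log) a P2 send that lands in quiet hours."""
    if tier == "P2" and in_quiet_hours(send_at_utc, recipient_tz):
        raise QuietHoursViolation(
            f"P2 send at {send_at_utc.isoformat()} falls in quiet hours"
        )
```

`recipient_tz` can be any `tzinfo`, for example `zoneinfo.ZoneInfo("Asia/Tokyo")`. A send at 18:00 UTC passes a UTC-based check but lands at 03:00 in Tokyo, which is exactly the bug this assertion is meant to catch.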