Notifications Failure Modes

What breaks, how to detect it, and how to fix it. Each failure mode lists detection signals, mitigations, and a severity rating.

Failure Modes

HIGH

Provider outage (FCM returns 40% errors for 2 hours)

FCM has a regional incident. Naive gateways retry immediately, tripling traffic against a struggling dependency; worker threads pile up on timeouts and the gateway fleet itself becomes unavailable, taking APNs delivery down with it.

Per-provider rolling error rate crosses 10% (pages on-call); circuit-breaker state-change events; retry-topic depth growth rate.

Mitigation

Circuit breaker opens per provider pool: fail fast, stop burning threads on a known-bad dependency
Park unsent messages in the retry topic with exponential backoff + jitter (1m, 5m, 25m); 24h buffer capacity
P0 fails over across channels after two attempts: OTP push becomes SMS within 30 seconds
Isolation by provider pool: APNs delivery is unaffected by the FCM breaker
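
A minimal sketch of the per-provider breaker and the retry schedule described above. The class name, window size, minimum sample count, cooldown, and the +/-20% jitter range are all assumptions chosen for illustration; only the 10% threshold and the 1m/5m/25m steps come from the text.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional

BACKOFF_BASE_S = (60, 300, 1500)  # 1m, 5m, 25m, as listed above


@dataclass
class CircuitBreaker:
    """Rolling error-rate breaker for a single provider pool."""
    error_threshold: float = 0.10  # assumed to match the 10% paging threshold
    window: int = 100              # assumed: number of recent calls tracked
    min_calls: int = 20            # assumed: samples needed before tripping
    cooldown_s: float = 30.0       # assumed: time open before probes are allowed
    results: List[bool] = field(default_factory=list)
    opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        """Closed: always allow. Open: fail fast until the cooldown elapses."""
        return self.opened_at is None or now - self.opened_at >= self.cooldown_s

    def record(self, ok: bool, now: float) -> None:
        if self.opened_at is not None:
            # Result of a probe: success closes the breaker, failure re-opens it.
            self.opened_at = None if ok else now
            self.results.clear()
            return
        self.results = (self.results + [ok])[-self.window:]
        errors = self.results.count(False)
        enough = len(self.results) >= self.min_calls
        if enough and errors / len(self.results) > self.error_threshold:
            self.opened_at = now


def retry_delay(attempt: int, rng: random.Random) -> float:
    """Delay in seconds before retry number `attempt` (0-based), with jitter."""
    base = BACKOFF_BASE_S[min(attempt, len(BACKOFF_BASE_S) - 1)]
    return base * rng.uniform(0.8, 1.2)  # +/-20% jitter is an assumption


# One breaker per provider pool, so an open FCM breaker never blocks APNs.
breakers = {"fcm": CircuitBreaker(), "apns": CircuitBreaker()}
```

A sender would call breakers[provider].allow(now) before each provider call and, when it returns False, publish the message to the retry topic with retry_delay(attempt, rng) instead of holding a worker thread on a timeout.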

CRITICAL

Campaign burst starves transactional sends

A 100M-recipient campaign coincides with peak organic traffic. In a shared-queue design, OTPs queue behind millions of promotional messages and login flows time out platform-wide.

P0 end-to-end p99 breaches 5s (alert at 8s); per-tier queue-depth divergence: a P2 backlog that grows while P0 latency rises is the smoking gun of leakage between tiers.

Mitigation

Physical tier isolation: campaigns can only enter the P2 topic and P2 worker pool by API contract (403 otherwise)
Campaign chunk emission throttled to 167K/sec regardless of requested window
P0 worker pool autoscales on latency, not throughput, and never shares hosts with P2 workers
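
The admission rule and the emission cap above can be sketched as follows. The function names, the topic naming scheme, and the exception type are hypothetical; the 403 behavior and the 167K/sec cap are taken from the text.

```python
CAMPAIGN_MAX_RATE = 167_000  # messages/sec cap from the mitigation above


class Forbidden(Exception):
    """Stands in for an HTTP 403 response in this sketch."""
    status = 403


def admit(source: str, tier: str) -> str:
    """Return the topic for a send request, rejecting campaigns outside P2."""
    if source == "campaign" and tier != "P2":
        raise Forbidden(f"campaign traffic may only enter P2, got {tier}")
    return f"notifications.{tier.lower()}"  # hypothetical topic naming


def emission_rate(recipients: int, requested_window_s: float) -> float:
    """Chunk emission rate: the requested rate, capped at CAMPAIGN_MAX_RATE."""
    return min(recipients / requested_window_s, CAMPAIGN_MAX_RATE)


def min_duration_s(recipients: int) -> float:
    """Shortest possible campaign duration under the cap."""
    return recipients / CAMPAIGN_MAX_RATE
```

Under this cap a 100M-recipient campaign takes at least 100,000,000 / 167,000, or roughly 599 seconds (about 10 minutes), however short the requested window is.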

HIGH

Duplicate notification storm after consumer crash

A fanout worker sends a batch, crashes before committing its Kafka offset, and the replacement re-reads the batch. Without dedup, hundreds of users get doubles; payment confirmations look like double charges and support tickets spike.

Duplicate-send counter (idempotency-key conflict rate) jumps above the 0.01% baseline; support-ticket keyword monitor on 'twice'.

Mitigation

Redis SETNX on the deterministic idempotency key before every provider call; replayed batches skip any message whose key is already set
Collapse IDs on device as second defense: same-key notifications replace, not stack
Offset commit AFTER dedup-mark, so the mark itself is what makes replays safe
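
A minimal sketch of the mark-then-send ordering above. To keep it runnable without a Redis server, an in-memory class stands in for the store; with redis-py the equivalent of set_if_absent would be r.set(key, "1", nx=True, ex=ttl_seconds). The key fields, the message shape, and all function names are assumptions.

```python
import hashlib
from typing import Callable, Dict, Iterable


class InMemoryDedupStore:
    """Stand-in for Redis; set_if_absent mirrors SET key value NX."""

    def __init__(self) -> None:
        self._keys = set()

    def set_if_absent(self, key: str) -> bool:
        if key in self._keys:
            return False
        self._keys.add(key)
        return True


def idempotency_key(msg: Dict[str, str]) -> str:
    """Deterministic key: the same logical send always hashes to the same value."""
    raw = f"{msg['notification_id']}|{msg['device_token']}|{msg['channel']}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def process_batch(
    batch: Iterable[Dict[str, str]],
    store: InMemoryDedupStore,
    send: Callable[[Dict[str, str]], None],
) -> int:
    """Send each message at most once; return how many were actually sent."""
    sent = 0
    for msg in batch:
        if not store.set_if_absent(idempotency_key(msg)):
            continue  # already marked by an earlier run: skip on replay
        send(msg)
        sent += 1
    return sent  # the caller commits the Kafka offset only after this returns
```

Because the mark is written before the provider call, a crash between the two drops that one message instead of duplicating it; this sketch accepts that trade-off in favor of never double-sending.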

MEDIUM

Dead-token backlog degrades sender reputation

The 410 pruning consumer silently falls behind (or a deploy breaks it). Sends to uninstalled apps climb toward a double-digit percentage of traffic; Apple and Google interpret the unregistered rate as spam behavior and throttle the sender, cutting delivery for every user.

Token invalidation rate departs the 1.5%/week baseline; unregistered-error percentage of provider responses trends up week over week; pruning-consumer lag alert.

Mitigation

Prune on 410/UNREGISTERED within 5 minutes via a dedicated consumer with its own lag alarm
270-day last_seen expiry as the backstop for tokens that die without a send ever observing it
Weekly hygiene job reconciles provider unregistered rates against pruning throughput
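
The pruning rules above can be sketched as a pure function over a token table. The data shapes and function names are hypothetical, and provider responses are reduced to the two code strings used in the text; real APNs and FCM responses carry more structure.

```python
from datetime import datetime, timedelta
from typing import Dict, Set

LAST_SEEN_EXPIRY = timedelta(days=270)    # backstop from the mitigation above
DEAD_TOKEN_CODES = {"410", "UNREGISTERED"}  # simplified provider signals


def tokens_to_prune(
    last_seen: Dict[str, datetime],
    responses: Dict[str, str],
    now: datetime,
) -> Set[str]:
    """Tokens a provider reported dead, plus tokens unseen for 270+ days."""
    reported = {t for t, code in responses.items() if code in DEAD_TOKEN_CODES}
    stale = {t for t, seen in last_seen.items() if now - seen >= LAST_SEEN_EXPIRY}
    return reported | stale


def pruning_gap(provider_unregistered: int, pruned: int) -> int:
    """Weekly reconciliation: dead tokens the provider saw that were not pruned."""
    return max(0, provider_unregistered - pruned)
```

A persistently non-zero pruning_gap suggests the pruning consumer is dropping events or lagging, which is the condition the weekly hygiene job is meant to surface.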

HIGH

Quiet-hours timezone bug wakes users at 3 AM

A scheduler change computes quiet hours in UTC instead of recipient local time. The next campaign wave lands at 3 AM across half the planet; opt-outs spike within the hour and the damage is permanent for every user who disables push.

Send-volume histogram by recipient local hour (should be near zero between 22:00 and 08:00); opt-out rate step change; app-store review sentiment monitor.

Mitigation

Local-hour assertion in the send path itself: a P2 send computed to land in the recipient's quiet window is rejected, not just logged
Canary waves: campaigns release to 1% of each timezone bucket first with an opt-out-rate gate before the full wave
Deferred-release jitter (30-60 min) keeps the 08:00 boundary from becoming its own incident
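
A minimal sketch of the send-path assertion and the jittered deferral described above, using the 22:00-08:00 window and the 30-60 minute jitter from the text. Function and exception names are hypothetical. The recipient timezone is passed as a tzinfo object; fixed offsets work for a quick check, while a real system would more likely pass zoneinfo.ZoneInfo("Area/City") so daylight-saving changes are handled.

```python
import random
from datetime import datetime, time, timedelta, tzinfo

QUIET_START = time(22, 0)
QUIET_END = time(8, 0)


class QuietHoursViolation(Exception):
    """Raised when a P2 send would land inside the recipient's quiet window."""


def in_quiet_hours(send_at_utc: datetime, recipient_tz: tzinfo) -> bool:
    """Evaluate the window in recipient local time, never in UTC."""
    local = send_at_utc.astimezone(recipient_tz).time()
    return local >= QUIET_START or local < QUIET_END  # window wraps midnight


def check_send(tier: str, send_at_utc: datetime, recipient_tz: tzinfo) -> None:
    """Reject (not just log) a P2 send computed to land in quiet hours."""
    if tier == "P2" and in_quiet_hours(send_at_utc, recipient_tz):
        raise QuietHoursViolation(f"P2 send at {send_at_utc.isoformat()} rejected")


def deferred_release(
    send_at_utc: datetime, recipient_tz: tzinfo, rng: random.Random
) -> datetime:
    """Next 08:00 local plus 30-60 min of jitter, to spread the boundary load."""
    local = send_at_utc.astimezone(recipient_tz)
    window_end = local.replace(hour=8, minute=0, second=0, microsecond=0)
    if local.time() >= QUIET_START:
        window_end += timedelta(days=1)  # evening sends wait for tomorrow 08:00
    return window_end + timedelta(minutes=rng.uniform(30, 60))
```

Only P2 is rejected here because that is the tier the mitigation names; letting other tiers pass through unchecked is an assumption of this sketch.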