Failure Modes

Notifications Failure Modes

What breaks, how to detect it, and how to fix it. Every failure includes detection metrics, mitigations, and a severity rating.

HIGH

Provider outage (FCM returns 40% errors for 2 hours)

FCM has a regional incident. Naive gateways retry immediately, tripling traffic against a struggling dependency; worker threads pile up on timeouts and the gateway fleet itself becomes unavailable, taking APNs delivery down with it.

Detection
Per-provider rolling error rate crossing 10% (pages on-call); circuit-breaker state-change events; retry-topic depth growth rate.
Mitigation
  1. Circuit breaker opens per provider pool: fail fast, stop burning threads on a known-bad dependency
  2. Park unsent messages in the retry topic with exponential backoff + jitter (1m, 5m, 25m); 24h buffer capacity
  3. P0 fails over across channels after two attempts: OTP push becomes SMS within 30 seconds
  4. Isolation by provider pool: APNs delivery is unaffected by the FCM breaker
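
The retry schedule and per-provider breaker described above can be sketched roughly as follows. This is a minimal illustration, not the gateway's actual code: `ProviderBreaker`, `retry_delay`, the ±20% jitter width, the 30-second cooldown, and the use of the 10% figure as the breaker threshold are all assumptions made for the example.

```python
import random
import time
from collections import deque

BASE_DELAY_S = 60    # first retry after 1 minute
BACKOFF_FACTOR = 5   # yields the 1m, 5m, 25m schedule


def retry_delay(attempt, rng=random):
    """Delay in seconds before retry `attempt` (1-based), with +/-20% jitter.

    The jitter width is an assumption; the source only says "jitter".
    """
    nominal = BASE_DELAY_S * BACKOFF_FACTOR ** (attempt - 1)
    return nominal * rng.uniform(0.8, 1.2)


class ProviderBreaker:
    """Rolling-window circuit breaker for one provider pool (hypothetical)."""

    def __init__(self, threshold=0.10, window=100, cooldown_s=30.0,
                 clock=time.monotonic):
        self.threshold = threshold           # assumed: reuse the 10% error rate
        self.results = deque(maxlen=window)  # True means the call failed
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.opened_at = None

    def record(self, ok):
        """Record one provider call outcome and open the breaker if needed."""
        self.results.append(not ok)
        window_full = len(self.results) == self.results.maxlen
        if window_full and sum(self.results) / len(self.results) > self.threshold:
            self.opened_at = self.clock()

    def allow(self):
        """Return True if a call may be attempted, False to fail fast."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: close again and start a fresh window.
            self.opened_at = None
            self.results.clear()
            return True
        return False
```

Keeping one breaker instance per provider pool (for example a dict keyed by `"fcm"` and `"apns"`) is what gives the isolation in step 4: an open FCM breaker never affects `allow()` for APNs.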
CRITICAL

Campaign burst starves transactional sends

A 100M-recipient campaign coincides with peak organic traffic. In a shared-queue design, OTPs queue behind millions of promotional messages and login flows time out platform-wide.

Detection
P0 end-to-end p99 breaches 5s (alert at 8s); per-tier queue-depth divergence: a P2 backlog growing while P0 latency rises is the smoking gun of leakage between tiers.
Mitigation
  1. Physical tier isolation: campaigns can only enter the P2 topic and P2 worker pool by API contract (403 otherwise)
  2. Campaign chunk emission throttled to 167K/sec regardless of requested window
  3. P0 worker pool autoscales on latency, not throughput, and never shares hosts with P2 workers
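
As a rough sketch of the first two mitigations, the admission check and the emission cap might look like this. The function names, the `source` field, and the topic naming scheme are hypothetical; only the P2-only rule, the 403, and the 167K/sec cap come from the list above.

```python
MAX_EMIT_PER_SEC = 167_000  # campaign chunk emission cap


class TierViolation(Exception):
    """Raised when a caller targets a tier it may not use; maps to HTTP 403."""
    status = 403


def admit(source, tier):
    """Return the topic a request may enter, or raise TierViolation.

    `source` and the topic names are illustrative assumptions.
    """
    if source == "campaign" and tier != "P2":
        raise TierViolation(f"campaigns may only enter P2, not {tier}")
    return f"notifications.{tier.lower()}"


def emission_seconds(recipients, requested_window_s):
    """Effective emission time: the requested window, stretched when it
    would otherwise exceed the per-second cap."""
    floor_s = recipients / MAX_EMIT_PER_SEC
    return max(requested_window_s, floor_s)
```

At the cap, a 100M-recipient campaign takes just under 600 seconds (about 10 minutes) to emit, however short the requested window is.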
HIGH

Duplicate notification storm after consumer crash

A fanout worker sends a batch, crashes before committing its Kafka offset, and the replacement re-reads the batch. Without dedup, hundreds of users get doubles; payment confirmations look like double charges and support tickets spike.

Detection
Duplicate-send counter (idempotency-key conflict rate) jumps above the 0.01% baseline; support-ticket keyword monitor on 'twice'.
Mitigation
  1. Redis SETNX on the deterministic idempotency key before every provider call; replayed batches skip
  2. Collapse IDs on device as second defense: same-key notifications replace, not stack
  3. Offset commit AFTER dedup-mark, so the mark itself is what makes replays safe
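
The ordering in steps 1 and 3 can be illustrated with a small sketch. It uses an in-memory stand-in whose `set(key, value, nx=True, ex=...)` mirrors the Redis `SET ... NX EX` semantics; the key layout, the 24-hour TTL, and the `send` and `commit` callables are assumptions made for the example.

```python
import hashlib


class InMemoryStore:
    """Stand-in for Redis; set(..., nx=True) only writes if the key is absent."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, nx=False, ex=None):
        # `ex` (TTL in seconds) is accepted for signature parity but ignored here.
        if nx and key in self._data:
            return None
        self._data[key] = value
        return True


def idempotency_key(user_id, event_id, channel):
    """Deterministic key: the same logical notification always hashes the same."""
    raw = f"{user_id}:{event_id}:{channel}".encode()
    return "notif:" + hashlib.sha256(raw).hexdigest()


def process_batch(batch, store, send, commit):
    """Send each message at most once, then commit the offset.

    Returns the number of messages actually handed to the provider.
    """
    sent = 0
    for msg in batch:
        key = idempotency_key(msg["user_id"], msg["event_id"], msg["channel"])
        # Dedup-mark first: a replayed message finds the key and is skipped.
        if store.set(key, "1", nx=True, ex=86_400):
            send(msg)
            sent += 1
    # Commit only after every message in the batch is marked.
    commit()
    return sent
```

If the worker crashes before `commit()`, the replacement re-reads the same batch, but every `set(..., nx=True)` fails and no provider call is repeated. Because the mark precedes the send, a crash between the two drops that one message in this sketch rather than duplicating it.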
MEDIUM

Dead-token backlog degrades sender reputation

The 410 pruning consumer silently falls behind (or a deploy breaks it). The share of sends going to uninstalled apps climbs toward double-digit percentages; Apple and Google interpret the unregistered rate as spam behavior and throttle the sender, cutting delivery for every user.

Detection
Token invalidation rate departs the 1.5%/week baseline; unregistered-error percentage of provider responses trends up week over week; pruning-consumer lag alert.
Mitigation
  1. Prune on 410/UNREGISTERED within 5 minutes via a dedicated consumer with its own lag alarm
  2. 270-day last_seen expiry as the backstop for tokens that die without a send ever observing it
  3. Weekly hygiene job reconciles provider unregistered rates against pruning throughput
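
The pruning rules above reduce to two checks, sketched below. The status labels `"410"` and `"UNREGISTERED"`, the token dict shape, and the function names are assumptions made for illustration; only the 270-day `last_seen` expiry and the prune-on-unregistered rule come from the list.

```python
from datetime import datetime, timedelta, timezone

LAST_SEEN_TTL = timedelta(days=270)
DEAD_TOKEN_SIGNALS = {"410", "UNREGISTERED"}  # assumed response labels


def should_prune(token, now, provider_status=None):
    """True if the token should be deleted.

    `token` is a dict with a timezone-aware `last_seen` datetime.
    """
    if provider_status in DEAD_TOKEN_SIGNALS:
        return True  # the provider told us the app is gone
    # Backstop: nothing has been sent to this token, so no send observed its death.
    return now - token["last_seen"] > LAST_SEEN_TTL


def unregistered_rate(statuses):
    """Fraction of provider responses that reported a dead token."""
    if not statuses:
        return 0.0
    dead = sum(1 for s in statuses if s in DEAD_TOKEN_SIGNALS)
    return dead / len(statuses)
```

A weekly hygiene job could compare `unregistered_rate` over the week's responses with the number of tokens actually pruned; a rising rate alongside flat pruning throughput suggests the consumer has fallen behind.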
HIGH

Quiet-hours timezone bug wakes users at 3 AM

A scheduler change computes quiet hours in UTC instead of recipient local time. The next campaign wave lands at 3 AM across half the planet; opt-outs spike within the hour and the damage is permanent for every user who disables push.

Detection
Send-volume histogram by recipient local hour (should be near zero between 22:00 and 08:00 local time); opt-out rate step change; app-store review sentiment monitor.
Mitigation
  1. Local-hour assertion in the send path itself: a P2 send computed to land in the recipient's quiet window is rejected, not just logged
  2. Canary waves: campaigns release to 1% of each timezone bucket first with an opt-out-rate gate before the full wave
  3. Deferred-release jitter (30-60 min) keeps the 08:00 boundary from becoming its own incident
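
A minimal sketch of the send-path assertion in step 1, assuming the 22:00-08:00 quiet window from the detection metric and a timezone-aware UTC send time. `in_quiet_hours`, `assert_sendable`, and `QuietHoursViolation` are hypothetical names.

```python
from datetime import datetime, time, timedelta, timezone

QUIET_START = time(22, 0)
QUIET_END = time(8, 0)


class QuietHoursViolation(Exception):
    """Raised when a P2 send would land inside the recipient's quiet window."""


def in_quiet_hours(send_at_utc, recipient_tz):
    """Check the send time in the recipient's local time, not in UTC.

    `send_at_utc` must be timezone-aware; `recipient_tz` is any tzinfo.
    """
    local = send_at_utc.astimezone(recipient_tz).time()
    # The window wraps midnight, so it is the union of two ranges.
    return local >= QUIET_START or local < QUIET_END


def assert_sendable(tier, send_at_utc, recipient_tz):
    """Reject (not just log) P2 sends that fall in quiet hours."""
    if tier == "P2" and in_quiet_hours(send_at_utc, recipient_tz):
        raise QuietHoursViolation(
            f"P2 send at {send_at_utc.isoformat()} is in quiet hours for {recipient_tz}"
        )
```

In practice `recipient_tz` would typically be an IANA zone such as `zoneinfo.ZoneInfo("Asia/Tokyo")` so that daylight-saving shifts are handled. The UTC bug described above is equivalent to calling `in_quiet_hours(send_at_utc, timezone.utc)` for every recipient.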