Provider Gateways and Backpressure
Everything downstream of our queues ends at someone else's API: APNs for Apple devices, FCM for Android, an email provider, an SMS aggregator. The gateway fleet is where our throughput meets their rules.
At a 50ms median provider round trip, sustaining the 580K/sec global peak needs roughly 29K requests in flight (580K/sec × 0.05s, by Little's law), so roughly 100 gateway servers give us the concurrency with 3x headroom across regions and providers. Now the failure: FCM has a partial outage and error rates jump to 40%.
“APNs runs over HTTP/2 and allows on the order of 4,000 concurrent streams per connection; a gateway server holding 8 connections can keep about 32K requests in flight.”
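The in-flight requirement follows from Little's law (concurrency = arrival rate × time in system). The sketch below reruns that arithmetic with the figures quoted in this section. It treats the median round trip as if it were the mean, which is an approximation, and the variable names are illustrative only.

```python
import math

PEAK_RPS = 580_000          # global peak sends per second (from the text)
MEDIAN_RTT_S = 0.050        # 50 ms median provider round trip (from the text)
STREAMS_PER_CONN = 4_000    # per-connection stream figure quoted in the text
CONNS_PER_SERVER = 8        # connections held per gateway server (from the text)


def in_flight(rate_per_s: float, latency_s: float) -> float:
    """Little's law: concurrency = arrival rate * time in system."""
    return rate_per_s * latency_s


needed = in_flight(PEAK_RPS, MEDIAN_RTT_S)
per_server = STREAMS_PER_CONN * CONNS_PER_SERVER
# Servers needed if stream concurrency were the only constraint.
min_servers = math.ceil(needed / per_server)

print(f"requests in flight at peak: {needed:,.0f}")
print(f"per-server stream ceiling:  {per_server:,}")
print(f"servers for raw concurrency: {min_servers}")
```

On these numbers the stream ceiling is not the binding constraint by itself, so a fleet of about 100 servers is better read as covering regions, providers, and headroom than as the minimum needed for concurrency.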
What does a naive gateway do? It retries immediately, tripling traffic against a struggling dependency and burning worker threads on timeouts: a self-inflicted retry storm.
The correct posture is backpressure plus circuit breaking. Each provider pool tracks a rolling error rate; past a threshold the circuit opens and the gateway stops calling FCM entirely for a cool-down, failing fast.
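A minimal sketch of such a breaker, assuming a time-based rolling window. The threshold, window length, cool-down, and minimum sample count are all placeholder values chosen for illustration, and the class and method names are hypothetical. After the cool-down this version simply closes again with an empty window; a production breaker would more likely probe with a few trial calls first.

```python
import time
from collections import deque


class CircuitBreaker:
    """Per-provider breaker driven by a rolling error rate (illustrative)."""

    def __init__(self, threshold=0.25, window_s=10.0, cooldown_s=30.0,
                 min_samples=20, clock=time.monotonic):
        self.threshold = threshold      # assumed: open above 25% errors
        self.window_s = window_s        # assumed: 10 s rolling window
        self.cooldown_s = cooldown_s    # assumed: 30 s cool-down
        self.min_samples = min_samples  # avoid tripping on a handful of calls
        self.clock = clock
        self.samples = deque()          # (timestamp, ok) pairs
        self.opened_at = None           # None means the circuit is closed

    def allow(self) -> bool:
        """Return False while the circuit is open so callers fail fast."""
        now = self.clock()
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown_s:
                return False
            # Cool-down elapsed: close the circuit and start a fresh window.
            self.opened_at = None
            self.samples.clear()
        return True

    def record(self, ok: bool) -> None:
        """Record one provider call outcome and open the circuit if needed."""
        now = self.clock()
        self.samples.append((now, ok))
        # Drop samples that have aged out of the rolling window.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()
        if self.opened_at is None and len(self.samples) >= self.min_samples:
            errors = sum(1 for _, good in self.samples if not good)
            if errors / len(self.samples) > self.threshold:
                self.opened_at = now
```

A caller would check `allow()` before each provider request, send the message to the retry path when it returns False, and call `record()` with the outcome of every request it does make.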
Unsent messages are not dropped: they park in a retry topic with exponential backoff and jitter (1m, 5m, 25m), preserving at-least-once semantics. P0 messages get one extra move: after two failed provider attempts they fail over to the secondary channel, so an OTP falls back from push to SMS within 30 seconds.
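The retry schedule and the P0 failover rule can be sketched as follows. The 1m, 5m, 25m schedule and the two-attempt failover rule come from the text; the ±20% jitter width, the function names, and the `exhausted` outcome after the third retry are assumptions, since the text does not say what happens once the schedule runs out.

```python
import random
from typing import Optional, Tuple

BASE_DELAY_S = 60   # first retry after about 1 minute (from the text)
FACTOR = 5          # 1m -> 5m -> 25m (from the text)
MAX_RETRIES = 3     # three scheduled retries (from the text)
JITTER = 0.2        # assumed: spread each delay by +/-20%


def retry_delay(attempt: int, rng=None) -> Optional[float]:
    """Seconds to wait before retry number `attempt` (0-based), or None when exhausted."""
    if attempt >= MAX_RETRIES:
        return None
    rng = rng or random
    nominal = BASE_DELAY_S * FACTOR ** attempt
    # Jitter keeps a burst of failed messages from retrying in lockstep.
    return nominal * rng.uniform(1 - JITTER, 1 + JITTER)


def next_action(priority: str, provider_failures: int,
                retries_so_far: int, rng=None) -> Tuple[str, Optional[float]]:
    """Decide what to do with a message the provider did not accept."""
    if priority == "P0" and provider_failures >= 2:
        # For example, an OTP moves from push to SMS instead of waiting.
        return ("failover", 0.0)
    delay = retry_delay(retries_so_far, rng)
    if delay is None:
        return ("exhausted", None)   # assumed outcome; not specified in the text
    return ("retry", delay)
```

For example, `next_action("P1", 2, 0)` returns a `retry` with a delay near one minute, while `next_action("P0", 2, 0)` returns `failover` immediately.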
Error classification is the subtle part: 5xx and timeouts are retryable; 4xx like 410 Unregistered or 400 BadDeviceToken are permanent and must route to token pruning, never to retry. Retrying permanent errors is the classic mistake that melts a gateway during provider incidents.
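One way to encode that classification is a single function that maps a provider response to a disposition. This is a sketch, not a complete table of provider errors: the two permanent token errors are the ones named above, while treating 429 as retryable, the `REJECT` bucket for other 4xx responses, and all names here are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class Disposition(Enum):
    DELIVERED = "delivered"
    RETRY = "retry"              # goes to the retry topic
    PRUNE_TOKEN = "prune_token"  # goes to token pruning, never retried
    REJECT = "reject"            # assumed bucket: permanent, logged, not retried


# Permanent token errors named in the text, as (status, reason) pairs.
PERMANENT_TOKEN_ERRORS = {
    (410, "Unregistered"),
    (400, "BadDeviceToken"),
}


def classify(status: Optional[int], reason: str = "") -> Disposition:
    """Map a provider response to a disposition; status None means a timeout."""
    if status is None or status >= 500:
        return Disposition.RETRY
    if status == 429:
        # Assumption: rate limiting is transient, so back off and retry.
        return Disposition.RETRY
    if (status, reason) in PERMANENT_TOKEN_ERRORS:
        return Disposition.PRUNE_TOKEN
    if 400 <= status < 500:
        return Disposition.REJECT
    return Disposition.DELIVERED
```

Keeping the permanent cases in an explicit set means an unrecognized 4xx is rejected rather than retried, which is the safer default during a provider incident.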
What if the interviewer asks: why not multi-provider redundancy for push? Unlike email or SMS, push has no alternative carrier: only Apple can deliver to an iPhone.
Redundancy exists across channels, not providers.