Provider Gateways and Backpressure

Everything downstream of our queues ends at someone else's API: APNs for Apple devices, FCM for Android, an email provider, an SMS aggregator. The gateway fleet is where our throughput meets their rules.

At a 50ms median provider round trip, sustaining the 580K/sec global peak needs 580K/sec × 0.05 sec = 29K requests in flight, so roughly 100 gateway servers give us the concurrency with 3x headroom across regions and providers.

Now the failure: FCM has a partial outage and error rates jump to 40%.

“APNs runs over HTTP/2 and allows on the order of 4,000 concurrent streams per connection; a gateway server holding 8 connections can keep about 32K requests in flight.”
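The sizing arithmetic can be checked in a few lines. This is a minimal sketch that only reproduces the two numbers quoted in the text; the function names are hypothetical, and it does not attempt to derive the 100-server figure, which also has to cover other providers and regions.

```python
def requests_in_flight(rate_per_sec: int, latency_ms: int) -> float:
    """Little's law: average concurrency = arrival rate x time each request stays open."""
    return rate_per_sec * latency_ms / 1000


def stream_budget(connections: int, streams_per_connection: int) -> int:
    """Upper bound on concurrent HTTP/2 requests one gateway server can hold open."""
    return connections * streams_per_connection


PEAK_RATE_PER_SEC = 580_000  # global peak, from the text
MEDIAN_RTT_MS = 50           # median provider round trip, from the text

needed = requests_in_flight(PEAK_RATE_PER_SEC, MEDIAN_RTT_MS)
per_server = stream_budget(connections=8, streams_per_connection=4_000)

print(f"in flight at peak: {needed:,.0f}")               # in flight at peak: 29,000
print(f"APNs stream budget per server: {per_server:,}")  # APNs stream budget per server: 32,000
```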

What does a naive gateway do? It retries immediately, tripling traffic against a struggling dependency and burning worker threads on timeouts: a self-inflicted retry storm.

The correct posture is backpressure plus circuit breaking. Each provider pool tracks a rolling error rate; past a threshold the circuit opens and the gateway stops calling FCM entirely for a cool-down, failing fast.
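A per-provider breaker along these lines might look like the sketch below. It is an illustration under stated assumptions, not a production design: the class name, the threshold, window size, minimum sample count, and cool-down length are all hypothetical, and it skips the half-open probing state that most real breakers add.

```python
import time
from collections import deque


class CircuitBreaker:
    """Opens when the error rate over the last `window` calls exceeds `threshold`."""

    def __init__(self, threshold=0.25, window=100, min_calls=20,
                 cooldown_sec=30.0, clock=time.monotonic):
        # All defaults are illustrative assumptions, not values from the text.
        self.threshold = threshold
        self.min_calls = min_calls
        self.cooldown_sec = cooldown_sec
        self.clock = clock
        self.results = deque(maxlen=window)  # True marks a failed call
        self.opened_at = None                # None means the circuit is closed

    def allow(self) -> bool:
        """Return True if the gateway may call the provider right now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_sec:
            # Cool-down elapsed: close the circuit and start a fresh window.
            self.opened_at = None
            self.results.clear()
            return True
        return False  # fail fast without touching the provider

    def record(self, failed: bool) -> None:
        """Record one provider call outcome and open the circuit if needed."""
        self.results.append(failed)
        if len(self.results) < self.min_calls:
            return  # too few samples to judge
        error_rate = sum(self.results) / len(self.results)
        if error_rate > self.threshold:
            self.opened_at = self.clock()
```

A gateway worker would call `allow()` before each send and `record()` after it; when `allow()` returns False the message goes straight to the retry path instead of waiting on a timeout.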

Unsent messages are not dropped: they park in a retry topic with exponential backoff and jitter (1m, 5m, 25m), preserving at-least-once semantics. P0 messages get one extra move: after two failed provider attempts they fail over to the secondary channel. An OTP, for example, falls back from push to SMS within 30 seconds.
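The retry schedule and the P0 failover rule can be sketched as two small functions. The 1m, 5m, 25m schedule and the two-attempt failover rule come from the text; the ±20% jitter range, the function names, and the returned labels are assumptions made for illustration, and what happens after the third retry is left open because the text does not say.

```python
import random
from typing import Optional

BASE_DELAY_SEC = 60  # first retry after about 1 minute
MULTIPLIER = 5       # 1m -> 5m -> 25m
MAX_RETRIES = 3


def retry_delay(attempt: int, rng: Optional[random.Random] = None) -> Optional[float]:
    """Delay in seconds before retry number `attempt` (1-based).

    Returns None once the schedule is exhausted; the follow-up action
    is not specified in the text.
    """
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if attempt > MAX_RETRIES:
        return None
    if rng is None:
        rng = random.Random()
    nominal = BASE_DELAY_SEC * MULTIPLIER ** (attempt - 1)
    # Assumed jitter of +/-20% so parked messages do not all wake at once.
    return nominal * rng.uniform(0.8, 1.2)


def route_after_failure(priority: str, failed_attempts: int) -> str:
    """Pick where a failed message goes next (labels are hypothetical)."""
    if priority == "P0" and failed_attempts >= 2:
        return "secondary_channel"  # e.g. an OTP moves from push to SMS
    return "retry_topic"
```

Jitter matters here because a provider outage fails many messages at the same moment; without it they would all come back to the gateway in the same second.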

Error classification is the subtle part: 5xx responses and timeouts are retryable; 4xx responses such as 410 Unregistered or 400 BadDeviceToken are permanent and must route to token pruning, never to retry. Retrying permanent errors is the classic mistake that melts a gateway during provider incidents.
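The split can be expressed as a single classification function that every provider response passes through. This is a hedged sketch: the `Action` enum and `classify` are hypothetical names, the function assumes it is called only for failed sends, and treating 429 as retryable and other 4xx responses as permanent failures without pruning are assumptions that go beyond what the text states.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY = "retry"                    # park in the retry topic
    PRUNE_TOKEN = "prune_token"        # device token is dead, remove it
    FAIL_PERMANENT = "fail_permanent"  # do not retry, do not prune


# Status/reason pairs named in the text as permanent token errors.
TOKEN_ERRORS = {(410, "Unregistered"), (400, "BadDeviceToken")}


def classify(status: Optional[int], reason: str = "") -> Action:
    """Map a failed provider response to the gateway's next action.

    `status` is None when the request timed out before any response.
    """
    if status is None or status >= 500:
        return Action.RETRY
    if status == 429:
        # Assumption: rate-limit responses are transient, so retry later.
        return Action.RETRY
    if (status, reason) in TOKEN_ERRORS:
        return Action.PRUNE_TOKEN
    # Assumption: any other 4xx is a request problem that a retry cannot fix.
    return Action.FAIL_PERMANENT
```

Keeping this logic in one place makes it harder for a new error code to fall through to the retry path by default.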

What if the interviewer asks: why not multi-provider redundancy for push? Unlike email or SMS, push has no alternative carrier: only Apple can deliver to an iPhone.

Redundancy exists across channels, not providers.

Why it matters in interviews

This concept tests integration discipline: HTTP/2 stream budgets, circuit breakers, retry topics with backoff, and the retryable vs permanent error split. It is where candidates reveal whether they have actually operated a system that depends on someone else's API.
