Rate Limiter Failure Modes
What breaks, how to detect it, and how to fix it. Each failure mode covers the user impact, recovery path, and mitigations.
Redis cluster failure causes total rate limiter outage
Without fail-open, a Redis blip causes a full API outage for all users, not only abusers. Recovery: automatic failover in 15 seconds. User impact with fail-open: none visible. User impact with fail-closed: complete API outage.
Constraint: all rate-check decisions depend on Redis being reachable within 5ms. What breaks: a network partition isolates the Redis primary from all replicas. The Redis Cluster enters split-brain: the primary accepts writes on one side, a replica gets promoted on the other. When the partition heals, one side's counters are lost. During the outage (15-60 seconds), all rate-check requests time out. If the gateway is configured fail-closed, every API request returns 429 even though no user is actually over their limit.
- Fail-open policy: if Redis does not respond within 5ms, allow the request and log a warning
- Local in-memory fallback counters with conservative limits (global limit / number of servers)
- Redis Sentinel or Cluster automatic failover promotes a replica within 15 seconds
- Alert on-call if fallback mode persists for more than 60 seconds
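The fail-open path with a conservative local fallback can be sketched as follows. This is an illustrative in-memory model, not a production client: `redis_check` stands in for the real Redis call, and the class name, 60-second window, and 5ms budget are assumptions drawn from the constraints above.

```python
import time

class FailOpenLimiter:
    """Sketch: allow requests when Redis fails, falling back to a
    conservative per-node counter (global limit / number of servers)."""

    def __init__(self, global_limit, num_servers, redis_check, timeout=0.005):
        self.redis_check = redis_check                   # callable(key, timeout) -> bool
        self.timeout = timeout                           # 5ms Redis budget
        self.local_limit = global_limit // num_servers   # conservative local limit
        self.local_counts = {}                           # key -> (window, count)

    def allow(self, key, now=None):
        now = now if now is not None else time.time()
        try:
            return self.redis_check(key, self.timeout)
        except Exception:
            # Redis unreachable: fall back to the local fixed-window counter.
            window = int(now) // 60
            start, count = self.local_counts.get(key, (window, 0))
            if start != window:                          # new window, reset
                start, count = window, 0
            count += 1
            self.local_counts[key] = (start, count)
            return count <= self.local_limit
```

A real implementation would also emit the fallback-mode metric that drives the 60-second on-call alert above.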
Race condition on concurrent counter updates
Users consistently get 2-3x their rate limit under concurrent load, which defeats the purpose of rate limiting. Recovery: deploy the atomic Lua script. User impact before the fix: abusers can overwhelm protected endpoints.
Constraint: multiple gateway nodes can read and increment the same counter simultaneously. What breaks: two requests from the same user arrive at two different gateway nodes at the same millisecond. Both read the counter as 99 (limit: 100). Both increment to 100 and allow the request. The actual count is now 101, but neither node rejected. At high concurrency, this check-then-act race lets users exceed their limit by the number of concurrent requests, potentially 2-3x the intended rate.
- Lua script that atomically increments, sets the TTL on first use, and returns the count: EVAL 'local c = redis.call("INCR", KEYS[1]); if c == 1 then redis.call("EXPIRE", KEYS[1], ARGV[1]) end; return c' 1 <window-key> <ttl-seconds>
- The script returns the new count. The gateway compares it to the limit. If over, the request was already counted but gets rejected. The counter is slightly inflated but never under-counted.
- INCR-first approach: increment first, check after. Worst case: one extra count. No under-counting.
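The INCR-first logic can be modeled without Redis. The in-memory counter below stands in for Redis INCR (an atomic read-modify-write that returns the new count); names are illustrative.

```python
class AtomicWindowCounter:
    """In-memory stand-in for the atomic INCR-first approach:
    increment first, compare after. Never under-counts."""

    def __init__(self):
        self.counters = {}  # key -> count (Redis would also hold a TTL)

    def incr(self, key):
        # Models Redis INCR: atomic, returns the new count.
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key]

def check(counter, key, limit):
    # Gateway-side decision: the request is already counted by the
    # time we compare, so a rejected request still inflates the counter
    # by one -- the acceptable worst case described above.
    return counter.incr(key) <= limit
```

Because the increment happens before the comparison, two concurrent requests can never both observe 99 and both pass: each sees a distinct post-increment value.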
Clock skew across gateway nodes breaks window alignment
Silently doubles the effective rate limit. Hard to detect because each individual counter looks correct. Recovery: switch to Redis TIME in the Lua script. User impact: none during the fix rollout.
Constraint: all gateway nodes must compute the same window timestamp for a given moment in time. What breaks: gateway node A thinks it is 14:00:58, node B thinks it is 14:01:02 (4-second clock skew). They compute different window timestamps for the Redis key. Node A writes to key rl:user_42:1710428400 (the 14:00 window). Node B writes to rl:user_42:1710428460 (the 14:01 window). The user's requests split across two counters, and neither reaches the limit. The user effectively gets 2x the allowed rate.
- Use the Redis server's time (TIME command) instead of the gateway's local clock for window computation
- Alternatively, embed the TIME call in the Lua script so the timestamp and increment are atomic
- Run NTP daemon on all gateway nodes with drift alerts at 500ms threshold
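Window-key computation from a single authoritative clock can be sketched in a few lines. The function name and `rl:<id>:<window>` key format mirror the examples above; the point is that both nodes must feed it the same timestamp (Redis TIME), not their local clocks.

```python
def window_key(identifier, server_time_seconds, window_seconds=60):
    """Compute the fixed-window key from one shared clock. If every
    gateway passes the Redis server's TIME here, all nodes agree on
    the window regardless of local clock skew."""
    window_start = int(server_time_seconds) - int(server_time_seconds) % window_seconds
    return f"rl:{identifier}:{window_start}"
```

Fed local clocks 4 seconds apart (14:00:58 vs 14:01:02), this function reproduces the split described above: the two nodes land in adjacent windows. Fed the same server time, they always agree.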
Hot key on celebrity or viral endpoint
Affects all users on the same shard, not only the hot key owner. Cross-user impact makes this urgent. Recovery: deploy key splitting within hours. User impact: elevated latency for co-located users until fix is deployed.
Constraint: all rate-check operations for one identifier:endpoint pair hash to one Redis shard. What breaks: a single API key belonging to a large enterprise customer generates 50K RPS. That shard handles 50K ops/s while the other 6 shards are nearly idle. The hot shard's CPU hits 100%, latency spikes to 50ms, and rate checks for other users on the same shard start timing out.
- Key splitting: append a random suffix (0-9) to hot keys, creating 10 sub-counters. Sum them in the Lua script. Spreads load across shards.
- Local counter tier: each gateway node tracks the hot key locally, syncing to Redis every 100ms instead of every request. Reduces Redis ops by 99%.
- Dedicated Redis instance for known hot keys (enterprise customers with guaranteed high traffic)
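The key-splitting mitigation can be sketched as a write path and a read path. The dict stands in for Redis, and the 10-way split matches the suffix range above; in production the summation would run server-side in the Lua script.

```python
import random

SPLIT = 10  # number of sub-counters per hot key

def split_key(base_key):
    """Write path: increment a randomly chosen sub-key so writes
    spread across shards instead of hammering one."""
    return f"{base_key}:{random.randrange(SPLIT)}"

def total_count(counters, base_key):
    """Read path: the true count is the sum of all sub-counters.
    `counters` stands in for Redis GETs on each sub-key."""
    return sum(counters.get(f"{base_key}:{i}", 0) for i in range(SPLIT))
```

The trade-off: reads cost 10 lookups instead of 1, which is why this is applied only to identified hot keys rather than to every counter.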
Rule misconfiguration allows unbounded traffic
Does not cause data loss, but can lead to cascading service degradation if exploited. Recovery: revert the rule via the admin API (takes effect within 5 seconds). User impact: search degradation until the rule is reverted.
Constraint: rate limit rules are mutable via the admin API with no guardrails. What breaks: an engineer updates the rule for /v1/search from 100 req/min to 100,000 req/min, intending to test a batch job. They forget to revert. A week later, an abusive client discovers this and sends 50K req/min to the search endpoint, overloading the search service and the database behind it.
- Require two-person approval (pull request review) for rule changes in production
- Automated guardrails: reject any rule where maxRequests exceeds 10x the current value without explicit override flag
- Rule change audit log with automatic revert after a configurable TTL (e.g., test rules expire in 1 hour)
- Alerting on rule changes that increase limits by more than 5x
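The 10x guardrail from the list above is a one-function check. This is a sketch with hypothetical names; a real admin API would also record the actor and TTL for the audit log.

```python
def validate_rule_change(current_max, proposed_max, override=False):
    """Reject any limit increase beyond 10x the current value unless
    an explicit override flag is set. Returns (allowed, reason)."""
    if proposed_max > current_max * 10 and not override:
        return False, (f"increase from {current_max} to {proposed_max} "
                       f"exceeds 10x; set override to force")
    return True, "ok"
```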
Counter desync after Redis failover
A few extra requests per user during a rare failover event. Recovery: automatic, no manual intervention needed. User impact: negligible. Not worth adding complexity to prevent unless protecting a financial API.
Constraint: Redis replication is asynchronous by default, so the replica can be seconds behind the primary. What breaks: the Redis primary fails. A replica is promoted, but it was 2 seconds behind due to async replication. Those 2 seconds of counter increments are lost. Users who were at 98/100 in their window are now at 96/100 on the new primary. They get 4 extra requests before being throttled.
- Accept the small overshoot: 2-4 extra requests per user per failover is acceptable for most systems
- Redis WAIT command blocks each write until at least one replica acknowledges it (semi-synchronous replication), reducing lag to near zero at the cost of ~0.5ms extra latency per write
- After failover, briefly tighten limits by 10% for one window duration to compensate for lost counts
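The temporary-tightening mitigation amounts to a small limit adjustment. This sketch assumes the caller knows whether the current window follows a failover; the 10% fraction matches the figure above.

```python
def effective_limit(base_limit, post_failover_window, tighten_fraction=0.10):
    """Tighten the limit by 10% for one window after a failover to
    absorb the handful of increments lost to async replication."""
    if post_failover_window:
        return int(base_limit * (1 - tighten_fraction))
    return base_limit
```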