Rate Limiter Failure Modes

What breaks, how to detect it, and how to fix it. Each failure mode includes detection signals, mitigations, and a severity rating.

Failure Modes
CRITICAL

Redis cluster failure causes total rate limiter outage

Without fail-open, a Redis blip causes a full API outage for all users, not only abusers. Recovery: automatic failover within 15 seconds. User impact with fail-open: none visible. User impact with fail-closed: complete API outage.

Constraint: all rate-check decisions depend on Redis being reachable within 5ms. What breaks: a network partition isolates the Redis primary from all replicas. The Redis Cluster enters split-brain: the primary accepts writes on one side, a replica gets promoted on the other. When the partition heals, one side's counters are lost. During the outage (15-60 seconds), all rate-check requests time out. If the gateway is configured fail-closed, every API request returns 429 even though no user is actually over their limit.

Detection
Redis Cluster health check failures. Rate-check latency spikes above 100ms. A sudden spike in 429 responses with no corresponding increase in inbound traffic. User impact: with fail-closed, every user sees 429 errors; with fail-open, users are unaffected but rate limits are not enforced.
Mitigation
  1. Fail-open policy: if Redis is unreachable within 5ms, allow the request and log a warning
  2. Local in-memory fallback counters with conservative limits (global limit / number of servers)
  3. Redis Sentinel or Cluster automatic failover promotes a replica within 15 seconds
  4. Alert on-call if fallback mode persists for more than 60 seconds
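The fail-open path and the local fallback tier can be sketched together. This is a minimal Python sketch, not the production implementation: `redis_check` stands in for the real Redis rate-check call, and the class name and limits are illustrative.

```python
import time
from collections import defaultdict

class FailOpenLimiter:
    """Sketch of fail-open with conservative in-memory fallback counters."""

    def __init__(self, redis_check, global_limit, num_servers, window_s=60):
        self.redis_check = redis_check   # returns True if allowed; raises if Redis is unreachable
        # Conservative fallback limit: global limit divided across servers.
        self.local_limit = max(1, global_limit // num_servers)
        self.window_s = window_s
        self.local_counts = defaultdict(int)  # in-memory fallback counters
        self.fallback_since = None            # feeds the 60-second on-call alert

    def allow(self, user_id):
        try:
            allowed = self.redis_check(user_id)
            self.fallback_since = None        # Redis is back; leave fallback mode
            return allowed
        except Exception:
            # Fail open: Redis unreachable, enforce the conservative local limit.
            if self.fallback_since is None:
                self.fallback_since = time.monotonic()
            window = int(time.time() // self.window_s)
            key = (user_id, window)
            self.local_counts[key] += 1
            return self.local_counts[key] <= self.local_limit

    def fallback_duration(self):
        """Seconds spent in fallback mode; alert if this exceeds 60."""
        if self.fallback_since is None:
            return 0.0
        return time.monotonic() - self.fallback_since
```

With a 100 req/min global limit spread over 10 gateway nodes, each node allows at most 10 requests per user per window while Redis is down, bounding the worst-case aggregate at the original global limit.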
HIGH

Race condition on concurrent counter updates

Users consistently get 2-3x their rate limit under concurrent load. Defeats the purpose of rate limiting. Recovery: deploy the Lua script. User impact before fix: abusers overwhelm protected endpoints.

Constraint: multiple gateway nodes can read and increment the same counter simultaneously. What breaks: two requests from the same user arrive at two different gateway nodes at the same millisecond. Both read the counter as 99 (limit: 100). Both increment to 100 and allow the request. The actual count is now 101, but neither node rejected. At high concurrency, this check-then-act race lets users exceed their limit by the number of concurrent requests, potentially 2-3x the intended rate.

Detection
Counter values in Redis exceeding the configured max_requests. Rate limit overshoot metrics (actual allowed / intended limit) trending above 1.05. User impact: abusers get 2-3x their limit, which can overload downstream services.
Mitigation
  1. Lua script that atomically increments the counter and sets its TTL on first use (EVAL takes the script, a key count of 1, the counter key, and the window TTL in seconds): EVAL 'local c = redis.call("INCR", KEYS[1]); if c == 1 then redis.call("EXPIRE", KEYS[1], ARGV[1]) end; return c' 1 <key> <ttl>
  2. The script returns the new count. The gateway compares it to the limit. If over, the request was already counted but gets rejected. The counter is slightly inflated but never under-counted.
  3. INCR-first approach: increment first, check after. Worst case: one extra count. No under-counting.
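The INCR-first approach can be simulated to show why it never under-counts. This is a sketch only: AtomicCounterStore stands in for Redis (INCR is atomic in Redis, so every request observes a unique count) and omits expiry handling.

```python
class AtomicCounterStore:
    """Stands in for Redis INCR; TTL bookkeeping is omitted in this sketch."""

    def __init__(self):
        self.counts = {}

    def incr(self, key):
        # Atomic in real Redis: no two requests can observe the same value.
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

def check_rate(store, key, limit):
    count = store.incr(key)   # increment first...
    return count <= limit     # ...check after; worst case is one extra count

store = AtomicCounterStore()
# 101 requests against a limit of 100: exactly 100 are allowed.
decisions = [check_rate(store, "rl:user_42", limit=100) for _ in range(101)]
```

Contrast with check-then-act: two concurrent requests could both read 99 and both pass. With INCR-first, one of them necessarily receives 101 and is rejected; the counter is slightly inflated (it counted the rejected request) but the limit is never exceeded.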
MEDIUM

Clock skew across gateway nodes breaks window alignment

Silently doubles the effective rate limit. Hard to detect because each individual counter looks correct. Recovery: switch to Redis TIME in the Lua script. User impact: none during the fix rollout.

Constraint: all gateway nodes must compute the same window timestamp for a given moment in time. What breaks: gateway node A thinks it is 15:00:58 UTC, node B thinks it is 15:01:02 (4-second clock skew). They compute different window timestamps for the Redis key. Node A writes to key rl:user_42:1710428400 (the 15:00 window). Node B writes to rl:user_42:1710428460 (the 15:01 window). The user's requests split across two counters, and neither reaches the limit. The user effectively gets 2x the allowed rate.

Detection
NTP drift alerts exceeding 1 second. Users consistently hitting exactly 2x their expected limit. Counter keys with unexpectedly low values despite high traffic. User impact: rate limits silently fail for all users on skewed nodes.
Mitigation
  1. Use the Redis server's time (TIME command) instead of the gateway's local clock for window computation
  2. Alternatively, embed the TIME call in the Lua script so the timestamp and increment are atomic
  3. Run NTP daemon on all gateway nodes with drift alerts at 500ms threshold
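The window computation itself is a pure function of a timestamp, which is why sourcing that timestamp from Redis (the TIME command, or redis.call("TIME") inside the Lua script) fixes the problem: every node uses the same clock, so every node derives the same key. A sketch, using the rl:<id>:<window> key format from the example above:

```python
def window_key(identifier, now_epoch_s, window_s=60):
    """Align a timestamp to its window boundary and build the counter key."""
    window_start = (now_epoch_s // window_s) * window_s
    return f"rl:{identifier}:{window_start}"

# With a single server-side timestamp (e.g. from Redis TIME), skewed
# local clocks on the gateway nodes no longer matter:
server_now = 1710428410            # one timestamp shared by all nodes
key_a = window_key("user_42", server_now)  # node A
key_b = window_key("user_42", server_now)  # node B -- identical key
```

Fed local clocks instead, the same function reproduces the failure: timestamps 4 seconds apart that straddle a window boundary (15:00:58 vs 15:01:02) yield two different keys, splitting the user's counter.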
HIGH

Hot key on celebrity or viral endpoint

Affects all users on the same shard, not only the hot key owner. Cross-user impact makes this urgent. Recovery: deploy key splitting within hours. User impact: elevated latency for co-located users until fix is deployed.

Constraint: all rate-check operations for one identifier:endpoint pair hash to one Redis shard. What breaks: a single API key belonging to a large enterprise customer generates 50K RPS. That shard handles 50K ops/s while the other 6 shards are nearly idle. The hot shard's CPU hits 100%, latency spikes to 50ms, and rate checks for other users on the same shard start timing out.

Detection
Redis SLOWLOG showing repeated operations on the same key prefix. Per-shard CPU utilization diverging (one shard at 90%+, others at 20%). User impact: all users whose keys hash to the hot shard experience elevated latency and potential timeouts.
Mitigation
  1. Key splitting: append a random suffix (0-9) to hot keys, creating 10 sub-counters. Sum them in the Lua script. Spreads load across shards.
  2. Local counter tier: each gateway node tracks the hot key locally, syncing to Redis every 100ms instead of every request. Reduces Redis ops by 99%.
  3. Dedicated Redis instance for known hot keys (enterprise customers with guaranteed high traffic)
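Key splitting can be sketched as follows. This is illustrative only: the dict stands in for Redis, the key name is made up, and in a real Redis Cluster the summing step needs care because randomly suffixed sub-keys hash to different slots (hash tags or per-shard reads would be needed rather than a single cross-slot Lua call).

```python
import random

NUM_SPLITS = 10  # the 0-9 suffix range described above

def split_key(base_key):
    """Route each write to one random sub-counter."""
    return f"{base_key}:{random.randrange(NUM_SPLITS)}"

def sub_keys(base_key):
    """All sub-counters that must be summed on read."""
    return [f"{base_key}:{i}" for i in range(NUM_SPLITS)]

def incr_and_total(store, base_key):
    # store stands in for Redis; writes spread across 10 keys,
    # reads sum them to recover the true count.
    key = split_key(base_key)
    store[key] = store.get(key, 0) + 1
    return sum(store.get(k, 0) for k in sub_keys(base_key))

store = {}
totals = [incr_and_total(store, "rl:hotuser") for _ in range(500)]
```

The trade-off: writes to the hot identifier now fan out over 10 keys (and, with suitable placement, 10 shards), while each read costs 10 lookups instead of 1; that is a good exchange when a single shard is CPU-bound.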
MEDIUM

Rule misconfiguration allows unbounded traffic

Does not cause data loss, but can lead to cascading service degradation if exploited. Recovery: revert the rule via the admin API (takes effect within 5 seconds). User impact: search degradation until the rule is reverted.

Constraint: rate limit rules are mutable via the admin API with no guardrails. What breaks: an engineer updates the rule for /v1/search from 100 req/min to 100,000 req/min, intending to test a batch job. They forget to revert. A week later, an abusive client discovers this and sends 50K req/min to the search endpoint, overloading the search service and the database behind it.

Detection
Traffic volume on /v1/search exceeding historical baseline by 100x. Search service CPU and latency alerts. Rule change audit log showing the modification. User impact: search endpoint degrades for all users due to resource exhaustion.
Mitigation
  1. Require two-person approval (pull request review) for rule changes in production
  2. Automated guardrails: reject any rule where max_requests exceeds 10x the current value without explicit override flag
  3. Rule change audit log with automatic revert after a configurable TTL (e.g., test rules expire in 1 hour)
  4. Alerting on rule changes that increase limits by more than 5x
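The 10x guardrail is a one-line check. A sketch, with assumed field names and return shape since the admin API's actual schema is not specified here:

```python
def validate_rule_change(current_max, proposed_max, override=False):
    """Reject limit increases beyond 10x the current value unless overridden.

    Returns (accepted, reason). Field names are illustrative; the real
    admin API schema may differ.
    """
    if proposed_max > current_max * 10 and not override:
        return False, "increase exceeds 10x current limit; set override to force"
    return True, "ok"
```

The incident above (100 to 100,000 req/min, a 1000x jump) would have been rejected at submission time unless the engineer passed the override flag, which is exactly the kind of deliberate action a reviewer can catch.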
LOW

Counter desync after Redis failover

A few extra requests per user during a rare failover event. Recovery: automatic, no manual intervention needed. User impact: negligible. Not worth adding complexity to prevent unless protecting a financial API.

Constraint: Redis replication is asynchronous by default, so the replica can be seconds behind the primary. What breaks: the Redis primary fails. A replica is promoted, but it was 2 seconds behind due to async replication. Those 2 seconds of counter increments are lost. A user who was at 98/100 in their window is now at 96/100 on the new primary, so they get 2 extra requests through before being throttled.

Detection
Rate limit overshoot metric spikes immediately after a failover event. Redis replication lag metric exceeded 1 second before the failover. User impact: each user gets a few extra requests through. At scale (10M active counters), the aggregate overshoot is noticeable but not dangerous.
Mitigation
  1. Accept the small overshoot: 2-4 extra requests per user per failover is acceptable for most systems
  2. Redis WAIT command forces at least one replica to acknowledge each write (semi-synchronous), reducing lag to near zero at the cost of ~0.5ms extra latency
  3. After failover, briefly tighten limits by 10% for one window duration to compensate for lost counts
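Mitigation 3 amounts to scaling the limit for one window after a failover. A sketch; the 10% figure comes from the text, while the function shape and timing fields are assumptions:

```python
import time

def effective_limit(base_limit, failover_at, window_s=60, now=None, tighten=0.10):
    """Return the limit to enforce, tightened briefly after a failover.

    failover_at: epoch seconds of the last failover, or None if none occurred.
    For one window after failover, the limit drops by `tighten` (10%) to
    absorb the counts lost to async replication.
    """
    now = time.time() if now is None else now
    if failover_at is not None and now - failover_at < window_s:
        return int(base_limit * (1 - tighten))
    return base_limit
```

For a 100 req/min limit this enforces 90 during the first window after failover, roughly cancelling the handful of lost increments, then returns to 100 automatically with no manual intervention.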