
Rate Limiter Anti-Patterns

Common design mistakes candidates make. Learn what goes wrong and how to avoid each trap in your interview.

Rate Limiting at App Layer Instead of Gateway

Very Common

We centralize rate limiting at the API gateway (not inside each microservice) because per-service implementations duplicate logic, create inconsistency, and let malicious traffic penetrate deep into the system before being rejected.

Why: Teams add rate limiting where the pain is felt: inside the service that is overloaded. But this means every service implements its own version with different algorithms, different Redis keys, and different limits. A request that passes the gateway still consumes network, TLS, deserialization, and auth resources before being rejected.

WRONG: Add rate-limit middleware to each microservice individually. Service A uses fixed window, Service B uses token bucket, and Service C has no rate limiting at all. Inconsistent behavior across the API surface.
RIGHT: We centralize rate limiting at the API gateway (Kong, Envoy, or AWS API Gateway, not custom middleware per service). One configuration file defines all rules. Requests are rejected at the edge before consuming any backend resources. Trade-off: the gateway becomes a critical dependency, but we mitigate this with fail-open fallback.
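
A minimal sketch of what "one configuration, enforced at the edge" looks like, assuming a Python gateway middleware; the rule table, paths, and the in-memory counter are illustrative stand-ins (a real deployment would use the gateway's own plugin configuration and a shared store such as Redis):

    import time
    from collections import defaultdict

    # One rule table for the whole API surface, enforced once at the edge.
    RATE_RULES = {"/search": 10, "/users/profile": 1000, "default": 100}
    _counters = defaultdict(int)   # stand-in for the shared counter store

    def gateway_allow(path: str, user_id: str) -> bool:
        """Runs at the gateway, before any backend sees the request."""
        limit = RATE_RULES.get(path, RATE_RULES["default"])
        window = int(time.time() // 60)        # 1-minute window, keyed per user+path
        _counters[(user_id, path, window)] += 1
        return _counters[(user_id, path, window)] <= limit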

Fixed Window Without Understanding Boundary Burst

Very Common

We chose a sliding window counter (not fixed window) because fixed window counters allow up to 2x the intended rate at window boundaries, which can overwhelm downstream services.

Why: Fixed window is the simplest algorithm: one counter per minute, reset at :00. Candidates pick it for simplicity and never consider what happens when 100 requests arrive at 0:59 and another 100 at 1:00. Both windows allow 100 each, so 200 requests pass in 2 seconds against a 100/min limit.

WRONG: Use a fixed window counter and set the limit to 100 requests per minute. At the window boundary, a client sends 100 requests at 0:59 and 100 more at 1:00. All 200 pass, doubling the intended rate.
RIGHT: We use a sliding window counter that weights the previous window by how much of it the sliding window still covers. If 70% of the sliding window still falls in the previous window, the effective count is 0.7 × old + new. This eliminates the boundary burst with only 2 counters instead of a full log. Trade-off: slightly more computation per check (one multiply and one add), but the accuracy improvement is worth it.
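
A minimal in-memory sketch of that weighted calculation, assuming a 60-second window; the function name and dict-based store are illustrative:

    import time

    _counts: dict = {}   # maps (key, window_index) -> request count; in-memory for the sketch

    def sliding_window_allow(key: str, limit: int, window: int = 60) -> bool:
        """Weight the previous fixed window by how much of it the sliding window still covers."""
        now = time.time()
        idx = int(now // window)
        elapsed = (now % window) / window            # fraction of the current window elapsed
        prev = _counts.get((key, idx - 1), 0)
        curr = _counts.get((key, idx), 0)
        effective = prev * (1 - elapsed) + curr      # e.g. 0.7 * old + new when 30% elapsed
        if effective >= limit:
            return False
        _counts[(key, idx)] = curr + 1
        return True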

No Retry-After Header in 429 Response

Very Common

We always include Retry-After on every 429 (not as optional) because returning 429 without it forces clients into blind retry loops, amplifying the very overload we are trying to prevent.

Why: Developers return the 429 status code and consider their job done. But without Retry-After, every SDK, mobile app, and script retries immediately. A well-intentioned exponential backoff still starts at 1 second. Multiply that by thousands of throttled clients and the retry storm is worse than the original traffic.

WRONG: Return 429 Too Many Requests with a generic error body. No Retry-After header. Clients retry immediately or with aggressive backoff, creating a retry storm that doubles the request volume.
RIGHT: We include Retry-After: N (seconds until window reset) in every 429 response. Well-behaved clients wait exactly N seconds. We also add X-RateLimit-Remaining: 0 so clients see they are at the limit before the next request. Trade-off: none meaningful. This is a zero-cost improvement.
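
A small framework-agnostic sketch of building that 429; the function name and response body format are illustrative, and the tuple would be adapted to whatever framework serves the response:

    import time

    def too_many_requests(window_reset_epoch: float) -> tuple:
        """Build a 429 that tells the client exactly how long to wait."""
        retry_after = max(1, int(window_reset_epoch - time.time()))
        headers = {
            "Retry-After": str(retry_after),     # seconds until the window resets
            "X-RateLimit-Remaining": "0",        # the client is at its limit right now
        }
        body = {"error": "rate limit exceeded", "retry_after_seconds": retry_after}
        return 429, headers, body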

Local Memory State Instead of Distributed Store

Very Common

We use Redis as the shared counter store (not in-memory HashMaps per server) because local counters let each server track independently, allowing N times the intended limit across N servers.

Why: In-memory counters are fast and straightforward. On a single server, they work perfectly. But behind a load balancer with 10 servers, a user hitting round-robin gets 10 separate counters. A 100 req/min limit becomes effectively 1,000 req/min because no server knows about the others.

WRONG: Store counters in a HashMap on each app server. With 10 servers behind a load balancer, a user's requests spread across all 10. Each server sees only 10% of the traffic, so the user gets 10x the allowed rate.
RIGHT: We use Redis as the shared counter store (not local memory, not MySQL). All servers read and write the same key. A Lua script guarantees atomicity. One Redis node handles 100K+ rate-check operations per second with sub-millisecond latency. Trade-off: we add a network hop per check (~0.5ms), but we gain accurate global counting.
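
A sketch of the shared-key pattern using redis-py, with a plain INCR for brevity (the fully atomic check-and-increment via a Lua script appears under the next anti-pattern); the host name, key layout, and limits are illustrative:

    import time
    import redis   # redis-py

    r = redis.Redis(host="rate-limit-redis", port=6379)

    def allow(user_id: str, limit: int = 100, window: int = 60) -> bool:
        """Every app server increments the same key, so the count is global."""
        key = f"rl:{user_id}:{int(time.time() // window)}"
        count = r.incr(key)          # atomic on the Redis side
        if count == 1:
            r.expire(key, window)    # first hit in the window sets the TTL
        return count <= limit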

Distributed Locks Instead of Atomic Redis Operations

Common

We use Lua scripts (not distributed locks) for counter updates because Redlock or ZooKeeper adds 5-10ms of latency per request and creates a bottleneck.

Why: Candidates think: 'concurrent counter updates need a lock.' They reach for Redlock or ZooKeeper. But a distributed lock requires 3 round-trips (acquire, operate, release) at 2-3ms each. At 100K RPS, that is 100K lock acquisitions per second, and lock contention causes queuing that pushes p99 above 50ms.

WRONG: Acquire a Redlock before incrementing the counter. Each lock acquisition takes 5-10ms. At 100K RPS, 100K locks per second causes massive contention. p99 latency spikes to 200ms+.
RIGHT: We use a Lua script that reads, checks, and increments the counter atomically in 0.1ms (not Redlock at 5-10ms, not ZooKeeper at 10-20ms). Redis is single-threaded, so the script executes without contention. No locks needed. p99 stays under 2ms. Trade-off: Lua scripts are limited to one Redis key (or keys on the same shard), but our key design already ensures this.
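
A sketch of the atomic check-and-increment using redis-py's register_script; the connection details, key layout, and limits are illustrative:

    import time
    import redis

    r = redis.Redis(host="rate-limit-redis", port=6379)

    # Read, check, and increment in one server-side step; Redis executes the
    # script single-threaded, so no lock is ever taken.
    RATE_LIMIT_LUA = """
    local current = tonumber(redis.call('GET', KEYS[1]) or '0')
    if current >= tonumber(ARGV[1]) then
      return 0
    end
    redis.call('INCR', KEYS[1])
    redis.call('EXPIRE', KEYS[1], ARGV[2])
    return 1
    """
    check_and_increment = r.register_script(RATE_LIMIT_LUA)

    def allow(user_id: str, limit: int = 100, window: int = 60) -> bool:
        key = f"rl:{user_id}:{int(time.time() // window)}"
        return check_and_increment(keys=[key], args=[limit, window]) == 1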

Same Limit for All API Endpoints

Common

We define per-endpoint limits (not a blanket global limit) because a single rate for all endpoints either throttles cheap reads too aggressively or lets expensive writes run unchecked.

Why: It is simpler to configure one rule: 100 req/min for everything. But a GET /user/profile costs 1ms while a POST /search costs 200ms and scans millions of rows. If both share the same limit, the search endpoint can be hammered 100 times/min, consuming 100 × 200 ms = 20 s of server time per user per minute.

WRONG: One global rule: 100 req/min per user for all endpoints. The /search endpoint (200 ms per query) gets the same allowance as /health (1 ms). A single user can consume 20 seconds of compute per minute on search alone.
RIGHT: We define per-endpoint limits: 1000 req/min for cheap reads, 10 req/min for expensive queries, 50 req/min for writes. We store rules in a rules table keyed by (endpoint, user_tier). Trade-off: more rules to manage and a slightly larger rule cache in gateway memory, but the protection for expensive endpoints is non-negotiable.
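
A sketch of the rules lookup, assuming the rules are cached in gateway memory as a Python dict keyed by (endpoint, user_tier); the endpoints and numbers are illustrative:

    # Illustrative rule table; real rules would live in a config store and be
    # cached in gateway memory.
    RULES = {
        ("/search", "free"): 10,            # expensive query: tight limit
        ("/search", "pro"): 60,
        ("/users/profile", "free"): 1000,   # cheap read: generous limit
        ("/orders", "free"): 50,            # write path
    }
    DEFAULT_LIMIT = 100

    def limit_for(endpoint: str, user_tier: str) -> int:
        return RULES.get((endpoint, user_tier), DEFAULT_LIMIT)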

Fail-Closed Without Fallback

Common

We chose fail-open with a local fallback (not bare fail-closed) because when the rate limiter's Redis cluster is unreachable, rejecting all requests turns a cache failure into a full API outage.

Why: Fail-closed sounds secure: if we cannot check the rate, block the request. But Redis goes down for reasons unrelated to abuse (network partition, maintenance, OOM). During those minutes, every legitimate user gets 429, and the API is effectively offline. The rate limiter, a protective system, becomes the single point of failure.

WRONG: Fail-closed on Redis timeout. A 30-second Redis maintenance window returns 429 to every user. Customer-facing dashboards go blank. Support tickets flood in. The rate limiter caused more damage than any abuser ever could.
RIGHT: We use fail-open with a local in-memory fallback (not bare fail-open with no limits). If Redis is unreachable, we switch to per-server local counters with conservative limits (global limit / number of servers). We log the fallback event. We alert the on-call engineer. We resume global limiting when Redis recovers. Trade-off: during fallback, limits are approximate (per-server, not global), but the API stays up.
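
A sketch of the fallback path, assuming redis-py and a known server count; the host name, timeout, and NUM_SERVERS value are illustrative:

    import time
    from collections import defaultdict
    import redis

    r = redis.Redis(host="rate-limit-redis", port=6379, socket_timeout=0.05)

    NUM_SERVERS = 10                   # illustrative; discovered from deployment in practice
    _local_counts = defaultdict(int)   # per-server fallback counters

    def allow(user_id: str, limit: int = 100, window: int = 60) -> bool:
        key = f"rl:{user_id}:{int(time.time() // window)}"
        try:
            count = r.incr(key)
            if count == 1:
                r.expire(key, window)
            return count <= limit
        except redis.RedisError:
            # Fail open onto conservative per-server counters (global limit / N servers).
            # Logging the fallback and alerting on-call would happen here.
            _local_counts[key] += 1
            return _local_counts[key] <= max(1, limit // NUM_SERVERS)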

Not Rate Limiting Internal Service Calls

Common

We apply service-level quotas to internal callers (not external-only limits) because a single runaway microservice can cascade-fail the entire backend when internal calls are unthrottled.

Why: Internal services are trusted. Teams assume internal calls are always well-behaved. Then a batch job spins up 100 threads and hammers the user service at 50K RPS. The user service saturates its connection pool, starts timing out, and every other service that depends on it fails too.

WRONG: Rate limit only external API traffic. Internal services call each other with no limits. A misconfigured batch job sends 50K RPS to the user service, exhausting its connection pool and cascading failures to 5 downstream services.
RIGHT: We apply service-level quotas (not external-only limits): each internal caller gets a named quota (e.g., batch-service: 1000 RPS, web-service: 5000 RPS). We enforce via service mesh (Istio or Envoy sidecar proxies, not application-level middleware). A runaway batch job gets throttled without affecting web traffic. Trade-off: service mesh adds ~1ms of latency per hop, but we prevent cascading failures.
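
A conceptual sketch of the per-caller quota model only; in production the enforcement lives in mesh configuration (Istio/Envoy sidecars), not in application code, and the service names and numbers here are illustrative:

    # Named quotas per internal caller, as data the enforcement layer would consume.
    SERVICE_QUOTAS_RPS = {
        "batch-service": 1000,
        "web-service": 5000,
    }
    DEFAULT_INTERNAL_RPS = 100

    def quota_for(caller_identity: str) -> int:
        """Caller identity would come from mTLS (e.g. a SPIFFE ID) or a trusted header."""
        return SERVICE_QUOTAS_RPS.get(caller_identity, DEFAULT_INTERNAL_RPS)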