Rate Limiter Cheat Sheet

Key concepts, trade-offs, and quick-reference notes for your interview prep.

Token Bucket vs Sliding Window

#1
We chose token bucket (not sliding window counter) for APIs that tolerate short bursts. A bucket holds up to B tokens and refills at R tokens/sec. Each request costs 1 token. Sliding window counter gives smoother throttling with no burst allowance. Trade-off: token bucket lets up to B requests through in a single second, which can spike downstream load, but it matches real user behavior better (e.g., 10 requests in 1 second against a 60/min limit).

💡 AWS API Gateway and Stripe chose token bucket. Cloudflare chose sliding window for DDoS protection. Name these in the interview to show awareness.
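
A minimal single-process sketch of the refill logic, assuming Python with an in-memory bucket (capacity and rate correspond to B and R above; names are illustrative, and a shared deployment would hold this state in Redis instead):

```python
import time

class TokenBucket:
    """Bucket holds up to `capacity` (B) tokens, refilled at `rate` (R) tokens/sec."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity  # B: maximum burst size
        self.rate = rate          # R: refill rate, tokens per second
        self.tokens = capacity    # start full so the first burst is allowed
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Lazy refill: add tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With capacity=10 and rate=1 (a 60/min limit with a 10-request burst allowance), the 10-requests-in-1-second pattern above is admitted, after which the client is paced to roughly 1 req/sec.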

Redis INCR + EXPIRE Atomic Pair

#2
We use MULTI/EXEC or a Lua script (not separate INCR and EXPIRE commands) because two separate commands create a crash window. If the process dies between INCR and EXPIRE, the key lives forever and the user is blocked permanently. Trade-off: Lua scripts block the Redis event loop during execution, but at ~0.1ms per script, that cost is negligible compared to the crash-safety guarantee.
Running INCR and EXPIRE as separate commands. A crash between them creates an immortal key that permanently blocks the user.
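
A sketch of the atomic pair using redis-py's transactional pipeline (the client setup, key names, and the Redis >= 7.0 `nx` flag on EXPIRE are assumptions for illustration):

```python
import redis

r = redis.Redis()  # assumed local Redis; adjust host/port for your setup

def hit(key: str, window_seconds: int) -> int:
    # MULTI/EXEC: INCR and EXPIRE reach Redis together and apply atomically,
    # so a client crash can never leave a counter without a TTL
    # (the "immortal key" that blocks the user forever).
    pipe = r.pipeline(transaction=True)
    pipe.incr(key)
    # nx=True (Redis >= 7.0) sets the TTL only if none exists, so repeat
    # hits do not keep pushing the expiry forward.
    pipe.expire(key, window_seconds, nx=True)
    count, _ = pipe.execute()
    return count
```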

Memory: Fixed Window Counter

#3
We considered fixed window first because it is the cheapest by memory: one counter per key per window. Each entry is user_id + window_ts as key, count as value, costing ~20 bytes in Redis (8B key hash + 8B value + 4B overhead). For 10M users, that is 10M × 20B = 200 MB, which fits on a single Redis node. Trade-off: we gave up boundary accuracy. At window edges, 2× the limit can pass in 2 seconds.

💡 Fixed window is the cheapest algorithm by memory. We chose sliding window counter instead because the boundary burst problem is unacceptable for our use case.
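
A fixed window sketch under the same assumptions (redis-py; the key scheme mirrors the user_id + window_ts layout above and is illustrative):

```python
import time
import redis

r = redis.Redis()

def fixed_window_allow(user_id: str, limit: int, window: int = 60) -> bool:
    # One counter per key per window: the key embeds the window timestamp,
    # so each new window automatically starts from a fresh counter.
    window_ts = int(time.time()) // window
    key = f"rl:{user_id}:{window_ts}"
    pipe = r.pipeline(transaction=True)
    pipe.incr(key)
    pipe.expire(key, window * 2)  # keep it one extra window, then reclaim
    count, _ = pipe.execute()
    return count <= limit
```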

Sliding Window Counter Saves 86% Memory

#4
We chose sliding window counter (not sliding window log) because it keeps two fixed window counters and weights them by overlap. Memory is 2 × 20B = 40B per key, versus a full sliding log that stores every timestamp at ~300B per key (for 30 entries at 10B each). That is 1 − 40B ÷ 300B ≈ 86% less memory. Trade-off: the weighted approximation can be off by a few percent at the boundary, but in practice accuracy is around 99.7%.

💡 Sliding window counter is the sweet spot: nearly as accurate as sliding log at a fraction of the memory. Name the 86% savings in the interview.
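
A sketch of the weighted two-counter math (same assumed redis-py setup; note the GET/INCR sequence here is not atomic — a production version would fold this into the Lua script from #7):

```python
import time
import redis

r = redis.Redis()

def sliding_window_allow(user_id: str, limit: int, window: int = 60) -> bool:
    now = time.time()
    cur_win = int(now) // window
    cur_key = f"rl:{user_id}:{cur_win}"
    prev_key = f"rl:{user_id}:{cur_win - 1}"
    cur = int(r.get(cur_key) or 0)
    prev = int(r.get(prev_key) or 0)
    # Weight the previous window by how much of it still overlaps the
    # sliding window ending now: full weight at the window's start,
    # zero at its end.
    elapsed = now - cur_win * window
    weighted = prev * (1 - elapsed / window) + cur
    if weighted >= limit:
        return False
    pipe = r.pipeline(transaction=True)
    pipe.incr(cur_key)
    pipe.expire(cur_key, window * 2)  # previous counter must survive one more window
    pipe.execute()
    return True
```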

HTTP 429 + Retry-After Header

#5
We always return HTTP 429 Too Many Requests with a Retry-After header (in seconds). Without Retry-After, clients guess when to retry and often retry immediately, creating a thundering herd. We include Retry-After on every 429 (not as an optional extra) because the header costs zero extra bytes of server-side state and prevents retry storms. Trade-off: none meaningful, which is why omitting it is a red flag in interviews.
Returning 429 without Retry-After. Clients have no idea when to retry, so they retry immediately in a tight loop, amplifying the overload.
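
From the client side, the header is what turns a tight retry loop into a polite backoff. A sketch assuming the `requests` library (URL handling and the attempt cap are illustrative):

```python
import time
import requests

def get_with_backoff(url: str, max_attempts: int = 5) -> requests.Response:
    resp = requests.get(url)
    for attempt in range(max_attempts):
        if resp.status_code != 429:
            return resp
        # Honor the server's Retry-After (seconds) instead of retrying
        # immediately; fall back to exponential backoff if it is missing.
        wait = int(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
        resp = requests.get(url)
    return resp
```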

X-RateLimit Headers

#6
We include three headers on every response (not only on 429): X-RateLimit-Limit (max allowed), X-RateLimit-Remaining (left in window), X-RateLimit-Reset (UTC epoch when window resets). These let well-behaved clients self-throttle before hitting 429. We chose to include them on all responses (not only throttled ones) because proactive clients reduce our 429 rate by up to 40%. Trade-off: three extra headers per response add ~80 bytes to every HTTP response.

💡 GitHub, Twitter, and Stripe all return these three headers. They are not in any RFC but are a de facto standard.
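
A minimal helper for attaching the three headers (the function name and framework-neutral dict are illustrative; in practice this would run in response middleware):

```python
def rate_limit_headers(limit: int, remaining: int, reset_epoch: int) -> dict[str, str]:
    # Attached to ALL responses, not only 429s, so well-behaved clients
    # can self-throttle before they ever hit the limit.
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),  # UTC epoch seconds
    }
```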

Lua Script for Atomicity

#7
We chose a Lua script (not MULTI/EXEC, not distributed locks) because it runs atomically inside Redis: read the counter, check the limit, increment, and set TTL in one shot. No race window. The script takes ~0.1ms to execute, which caps a single node at roughly 10K scripts/sec, so at 100K RPS we shard the keyspace across nodes; a node handling 1K scripts/sec spends 1K × 0.1ms = 0.1s of single-threaded Lua time per second, leaving 90% of its capacity for other operations. Trade-off: Lua scripts are harder to debug than simple commands, but the atomicity guarantee eliminates an entire class of race conditions.

💡 Redis caches the script server-side; EVALSHA invokes it by SHA, so we send the SHA (not the full script) on every request to save bandwidth.
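
A sketch of the atomic check-and-increment as a Lua script, registered via redis-py (which handles EVALSHA and the script cache; the key scheme and default limits are illustrative):

```python
import redis

r = redis.Redis()

# KEYS[1] = counter key, ARGV[1] = limit, ARGV[2] = window TTL in seconds.
# Read, check, increment, and set TTL in one atomic step inside Redis.
CHECK_AND_INCR = """
local count = tonumber(redis.call('GET', KEYS[1]) or '0')
if count >= tonumber(ARGV[1]) then
  return 0
end
count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[2])
end
return 1
"""

# register_script loads the script once, then invokes it by SHA (EVALSHA).
check_and_incr = r.register_script(CHECK_AND_INCR)

def is_allowed(key: str, limit: int = 100, window: int = 60) -> bool:
    return check_and_incr(keys=[f"rl:{key}"], args=[limit, window]) == 1
```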

Fail-Open vs Fail-Closed

#8
We chose fail-open (not fail-closed) because a Redis blip should not cause a full API outage for all users. Fail-open: if Redis is down, we allow all requests. The API stays up but is unprotected. Fail-closed: if Redis is down, we reject all requests. We are safe from abuse but the API is effectively down. Trade-off: during a Redis outage, abusers get through unchecked. We accept this because outages last seconds to minutes, while a fail-closed incident causes visible customer impact.
Defaulting to fail-closed without understanding that a Redis blip now causes a full API outage for all users, not only abusers.
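
A fail-open wrapper is a few lines, assuming the `is_allowed` sketch from #7:

```python
import redis

def allow_fail_open(key: str, limit: int, window: int = 60) -> bool:
    try:
        # is_allowed is the Lua-backed check sketched in #7; its Redis client
        # should use a short socket_timeout (e.g. 50ms) so a dead Redis
        # fails fast instead of stalling every request.
        return is_allowed(key, limit, window)
    except redis.RedisError:
        # Fail-open: a Redis blip must not become a full API outage.
        # Log and alert here; abusers pass unchecked until Redis recovers.
        return True
```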

Per-API vs Per-User vs Per-IP

#9
We layer three granularity levels (not a single flat limit): per-IP catches unauthenticated abuse (DDoS), per-user prevents authenticated abuse (scraping), per-API protects expensive endpoints. A single user might be allowed 100 req/min globally but only 10 req/min on the /search endpoint because it runs a 200ms query. Trade-off: three layers mean three Redis lookups per request instead of one, adding ~1.5ms of latency. We accept this because the per-endpoint protection prevents a single user from overloading expensive backends.

💡 Layer all three: per-IP at the load balancer, per-user at the gateway, per-API at the service level.
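
The three layers compose as independent checks, reusing the `is_allowed` sketch from #7 (the per-user and per-API limits mirror the example numbers above; the per-IP limit is purely illustrative):

```python
def allow_request(ip: str, user_id: str, endpoint: str) -> bool:
    # Any one layer can reject; key prefixes keep the counters separate.
    checks = [
        (f"ip:{ip}", 1000, 60),                  # per-IP: unauthenticated abuse
        (f"user:{user_id}", 100, 60),            # per-user: 100 req/min global
        (f"user:{user_id}:{endpoint}", 10, 60),  # per-API: 10 req/min on /search
    ]
    # all() short-circuits, so a rejected request does not consume
    # quota in the later layers.
    return all(is_allowed(key, limit, window) for key, limit, window in checks)
```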

Local + Global Two-Tier Limiting

#10
We chose a two-tier approach (not pure Redis, not pure local) to cut Redis round-trips by 80%. Each app server keeps a local in-memory counter for the first 80% of the limit. When the local counter is exhausted, we check the global Redis counter. For a 100 req/min limit across 5 servers, each server allows 16 requests locally before consulting Redis. Trade-off: a user could get up to 5 × 16 = 80 requests through before any Redis check. We accept this because the limit is soft and the latency savings are substantial.

💡 The two-tier approach trades precision for latency. Acceptable when the limit is advisory, not a hard security boundary.
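
A sketch of the two-tier flow (thread-safe local counter; the window reset hook and the `is_allowed` global check from #7 are assumptions, and the 16-request budget matches the 5-server example above):

```python
import threading

class TwoTierLimiter:
    """Serve the first slice of the limit from local memory; only consult
    the global Redis counter once the local budget is spent."""

    def __init__(self, local_budget: int = 16):  # 80% of 100, split 5 ways
        self.local_budget = local_budget
        self.local_count = 0
        self.lock = threading.Lock()

    def allow(self, key: str, limit: int, window: int = 60) -> bool:
        with self.lock:
            if self.local_count < self.local_budget:
                self.local_count += 1
                return True  # no Redis round-trip for this request
        # Local budget exhausted: fall through to the global counter
        # (the Lua-backed is_allowed sketched in #7).
        return is_allowed(key, limit, window)

    def reset(self) -> None:
        # Call once per window (e.g. from a timer) to restore the local budget.
        with self.lock:
            self.local_count = 0
```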