URL Shortener Failure Modes

What breaks, how to detect it, and how to fix it. Each failure mode below includes detection metrics, mitigations, and a severity rating.

Severity: HIGH

Cache stampede on popular URL TTL expiry

Blast radius: can cascade to a full database outage within minutes if multiple popular keys expire at once.

Constraint: we set a 24-hour TTL on Redis cache entries to bound memory usage. What breaks: a viral tweet shares a short URL that gets 50K clicks per minute. The Redis cache entry for that URL expires, and every in-flight request misses the cache at once and falls through to MySQL. The read replica, sized for 100K QPS of mixed traffic, suddenly absorbs a sustained flood of identical queries for a single hot row until the cache is repopulated. User impact: redirect latency spikes from 10ms to 500ms+ for all users on that replica, not only those clicking the viral link.

Detection: MySQL read latency spikes above 100ms combined with cache hit ratio dropping below 85%.
Mitigation:
  1. Request coalescing: only one goroutine fetches from DB, others wait for the result
  2. Stale-while-revalidate: serve the expired cache entry while refreshing in the background
  3. TTL jitter: randomize expiry by +/- 10% to prevent synchronized expiration across keys
Severity: CRITICAL

Counter service crash loses ID range state

Blast radius: collisions mean two different long URLs share one short code. One of them silently breaks.

Constraint: we pre-allocate ranges of 10,000 IDs to app servers for zero-contention writes. What breaks: the counter service crashes halfway through allocating a range. On restart, it does not know which IDs were already handed out. If it re-issues the same range, two different app servers now hold overlapping IDs. Both generate URLs, and the second INSERT fails with a duplicate key error, or worse, silently overwrites the first URL. User impact: one of the two short URLs silently redirects to the wrong destination. The original creator's link breaks with no error message.

Detection: duplicate key errors on INSERT to the urls table. Counter service health check failures in ZooKeeper.
Mitigation:
  1. Write-Ahead Log (WAL) persists each range allocation before acknowledging
  2. On recovery, skip to the next range beyond the last WAL entry to avoid reissuing any IDs
  3. ZooKeeper ephemeral nodes detect counter service failure within 3 heartbeat intervals (~6 seconds)
Severity: HIGH

MySQL primary failure during write spike

Blast radius: new URLs cannot be created during the failover window (up to 30 seconds). Reads continue from cache and replicas.

Constraint: all writes go to a single MySQL primary for strong consistency. What breaks: the primary MySQL node goes down at peak write volume (87K RPS at 3x). Writes queue up in the application layer. Counter ranges keep incrementing locally, but the URLs are not persisted. If the app servers exhaust their current range before the primary recovers, they request new ranges, leaving gaps in the keyspace. User impact: new URL creation fails for up to 30 seconds during failover. URLs created but not yet persisted return 404 on redirect.

Detection: write error rate exceeds 1%. Replication lag on replicas stops advancing.
Mitigation:
  1. Semi-synchronous replication so at least one replica has the latest committed writes
  2. Orchestrator promotes a read replica to primary within 30 seconds
  3. Application retries failed writes with exponential backoff (max 3 retries, 1s base delay)
Severity: MEDIUM

Hot shard from viral URL creation campaign

Blast radius: affects one shard. Other shards continue serving reads and writes normally.

Constraint: we shard by short_code hash to distribute writes evenly. What breaks: a regression ships a user_id-based sharding scheme, so a marketing platform creating 10 million URLs in one hour sends every row to a single shard. That shard's write throughput maxes out while the other 15 shards sit idle. User impact: URL creation latency increases from 5ms to 200ms+ for every user whose rows land on the overloaded shard.

Detection: single shard write latency exceeds 200ms while other shards report normal latency. Shard CPU at 95%+.
Mitigation:
  1. Shard by short_code hash, not user_id, so counter-based codes distribute evenly across all shards
  2. Rate limit URL creation to 100/min/user at the API gateway
  3. Queue burst writes in Kafka and drain at a controlled rate to the database
Severity: HIGH

Redis cluster node failure drops cache partition

Blast radius: temporary surge in DB load until the cache warms back up, typically 2 to 5 minutes.

Constraint: we use Redis to absorb 95% of read traffic (289K RPS). What breaks: one of 8 Redis nodes fails. Without consistent hashing, a full rehash redistributes the entire keyspace. Every node misses on keys that previously belonged to the failed node, and MySQL sees a sudden surge in read traffic. At 289K RPS, even a partial cache miss wave can push MySQL past its 100K QPS capacity. User impact: redirect latency spikes from 10ms to 100ms+ for 2 to 5 minutes while the cache warms back up.

Detection: cache hit ratio drops from 95% to 60% within seconds. Redis Cluster MOVED errors spike.
Mitigation:
  1. Consistent hashing with 150 virtual nodes per physical node limits key remapping to 1/N of the keyspace
  2. Redis Cluster automatic failover promotes a replica within 15 seconds
  3. Circuit breaker on the cache path: if cache is down, fall back to DB with a connection pool limit to prevent overload
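A sketch of mitigation 1, a consistent-hash ring with virtual nodes (node names are made up). It measures how much of the keyspace actually moves when one of 8 nodes fails, versus the near-total remap of naive modulo hashing:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// ring places each physical node at many points (virtual nodes), so
// losing one node remaps only ~1/N of the keyspace.
type ring struct {
	points []uint32          // sorted positions on the ring
	owner  map[uint32]string // position -> physical node
}

func newRing(nodes []string, vnodes int) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, n := range nodes {
		for v := 0; v < vnodes; v++ {
			p := hash32(fmt.Sprintf("%s#%d", n, v))
			r.points = append(r.points, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Node returns the owner of key: the first ring point at or after the
// key's hash, wrapping around to the start of the ring.
func (r *ring) Node(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

// remappedFraction builds an 8-node ring, removes one node, and reports
// what fraction of 10,000 keys changed owner.
func remappedFraction(vnodes int) float64 {
	nodes := []string{"redis-1", "redis-2", "redis-3", "redis-4",
		"redis-5", "redis-6", "redis-7", "redis-8"}
	before := newRing(nodes, vnodes)
	after := newRing(nodes[:7], vnodes) // redis-8 fails

	moved := 0
	const keys = 10000
	for i := 0; i < keys; i++ {
		k := fmt.Sprintf("code%d", i)
		if before.Node(k) != after.Node(k) {
			moved++
		}
	}
	return float64(moved) / keys
}

func main() {
	fmt.Printf("keys remapped after losing 1 of 8 nodes: %.1f%%\n",
		100*remappedFraction(150)) // roughly 1/8 of the keyspace
}
```

Only the keys whose successor point belonged to redis-8 move; every other key keeps its owner, so the surviving 7 nodes keep their warm cache entries.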
Severity: MEDIUM

Stale read after write due to replication lag

Blast radius: affects only the user who created the URL, and only for the first few seconds.

Constraint: we use asynchronous replication from MySQL primary to read replicas for performance. What breaks: a user creates a short URL via the POST endpoint. The write goes to the MySQL primary. The user immediately clicks the short link, but the redirect request hits a read replica that has not yet replicated the new row. The user sees a 404 for a URL they created one second ago. User impact: the creating user sees a broken link for the first few seconds after creation, eroding trust in the service.

Detection: support tickets for '404 on newly created URL.' Replication lag metric exceeding 1 second on any replica.
Mitigation:
  1. Write-through cache: on successful POST, we write the URL mapping to Redis immediately so reads bypass the replica entirely
  2. Read-your-writes consistency: route the creating user's reads to the primary for 5 seconds after a write
  3. Monitor replication lag and alert if it exceeds 500ms. Switch to semi-synchronous replication if lag persists.
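Mitigation 1 in miniature, with in-memory maps standing in for Redis, the MySQL primary, and a lagging replica (all names and types here are illustrative):

```go
package main

import "fmt"

// service models the three stores involved in a read-after-write.
type service struct {
	cache   map[string]string // Redis
	primary map[string]string // MySQL primary
	replica map[string]string // read replica, replicated asynchronously
}

func newService() *service {
	return &service{map[string]string{}, map[string]string{}, map[string]string{}}
}

// Create writes the row to the primary, then writes through to the
// cache before acknowledging, so the creator's first click is a cache
// hit even though replication has not caught up.
func (s *service) Create(code, longURL string) {
	s.primary[code] = longURL
	s.cache[code] = longURL // write-through: never wait for replication
}

// Redirect reads the cache first; only a miss falls back to the
// replica, which may briefly lag behind the primary.
func (s *service) Redirect(code string) (string, bool) {
	if u, ok := s.cache[code]; ok {
		return u, true
	}
	u, ok := s.replica[code]
	return u, ok
}

func main() {
	s := newService()
	s.Create("abc123", "https://example.com/launch")
	// Replication lag: the replica has not received the row, but the
	// redirect still succeeds via the write-through cache entry.
	u, ok := s.Redirect("abc123")
	fmt.Println(ok, u) // true https://example.com/launch
}
```

The write-through entry also gets the 24-hour TTL from the caching design, so it doubles as cache warming for the URLs most likely to be clicked immediately.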