
News Feed Failure Modes

What breaks, how to detect it, and how to fix it. Every failure includes detection metrics, mitigations, and severity rating.

Failure Modes
CRITICAL

Celebrity fanout storm overwhelms the write queue

Affects timeline freshness for all users, not only the celebrity's followers. A 5-minute delay makes the platform feel broken.

Constraint: fanout-on-write capped at 10K followers. What breaks: misconfiguration lets a celebrity with 30M followers trigger push fanout into 30M timeline caches. At 100K writes/sec per shard, that is 300 seconds of sustained writes for one tweet. Every other tweet sits in the Kafka fanout queue, unable to drain. Detect: consumer lag above 1M messages, fanout latency above 30s, Redis write throughput at 100% on multiple shards. Recover: circuit breaker switches ALL users to fanout-on-read until queue drains below 100K. Impact: all users' tweets delayed 5+ minutes.

Detection
Kafka consumer lag exceeding 1M messages. Fanout latency (tweet-to-timeline-appearance) spiking above 30 seconds. Redis write throughput saturating at 100% on multiple shards.
Mitigation
  1. Hybrid fanout: users with 10K+ followers skip fanout-on-write. Their tweets are merged at read time.
  2. Priority queues in Kafka: normal users' fanout gets high priority, celebrity fanout gets low priority so it does not block others.
  3. Fanout rate limiting: we cap the write rate per-tweet at 50K writes/sec, spreading a 30M-follower fanout over 10 minutes instead of attempting it all at once.
  4. Circuit breaker on the fanout service: if queue depth exceeds 5M, we temporarily switch ALL users to fanout-on-read until the queue drains.
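The first and fourth mitigations boil down to one routing decision per tweet. A minimal sketch, using the thresholds from the text (10K-follower cap, 5M-message breaker); the function and constant names are illustrative, not a real API:

```python
FANOUT_FOLLOWER_CAP = 10_000       # above this, skip fanout-on-write
CIRCUIT_BREAKER_DEPTH = 5_000_000  # above this, ALL users read-merge

def fanout_strategy(follower_count: int, queue_depth: int) -> str:
    """Pick a delivery strategy for a newly posted tweet."""
    if queue_depth > CIRCUIT_BREAKER_DEPTH:
        # Queue is drowning: stop pushing for everyone until it drains.
        return "fanout-on-read"
    if follower_count >= FANOUT_FOLLOWER_CAP:
        # Celebrity path: followers merge this tweet at read time.
        return "fanout-on-read"
    return "fanout-on-write"
```

Note the circuit-breaker check comes first: once the queue is past the threshold, even small accounts stop pushing, which is what lets the backlog actually drain.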
HIGH

Timeline cache eviction causes cold-start latency spikes

Cascading load on the database can trigger a full outage if rebuilds are not throttled.

Constraint: Redis holds 10M timeline caches per node for sub-10ms reads. What breaks: a node restarts after a crash. Those 10M users now have empty timelines. Each miss triggers a cold rebuild: query followees, fetch tweets per shard, merge, sort, populate. Each rebuild takes 200-500ms and hits the DB. If 1M users open the app within 5 minutes, the DB gets 3,333 QPS of multi-shard reads on top of normal load. Detect: miss rate spiking from 2% to 50%+, DB CPU rising, timeline p99 jumping from 50ms to 500ms+. Recover: RDB snapshots reload in 30 seconds, recovering 95%+ of caches.

Detection
Cache miss rate spiking from baseline 2% to 50%+ on the affected Redis shard. Database CPU and query latency increasing. Timeline API p99 jumping from 50ms to 500ms+.
Mitigation
  1. Redis persistence (RDB snapshots every 60 seconds + AOF): restarts reload from disk, recovering 95%+ of timeline caches in under 30 seconds.
  2. Warm-up service: on Redis restart, we proactively rebuild timelines for the most active users (top 1M by last_active) before they request them.
  3. Stale-while-revalidate: we serve the last known cached timeline (from a secondary replica) while rebuilding in the background.
  4. Rate-limit cold rebuilds to 1,000/sec to protect the database, returning a 'timeline loading' placeholder for excess requests.
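The rebuild rate limit in mitigation 4 is a classic token bucket. A minimal sketch, assuming a single-process limiter (a real deployment would need a shared counter, e.g. in Redis, across API servers); the class name is illustrative:

```python
import time

class RebuildLimiter:
    """Token bucket capping cold timeline rebuilds at `rate` per second.
    Requests over the cap get a 'timeline loading' placeholder instead
    of hitting the database."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last call.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Usage: construct once with `RebuildLimiter(rate=1_000, burst=1_000)`; on each cache miss, rebuild only if `allow()` returns True, otherwise serve the placeholder.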
MEDIUM

Social graph inconsistency after unfollow

Temporary inconsistency (2-5 seconds). Filtered at read time as a safety net. No data loss.

Constraint: follows sync across MySQL and Redis via async Kafka events. What breaks: user A unfollows B. The API deletes from MySQL and removes B from A's following set. But the Kafka event to update B's followers set lags 3 seconds. B tweets in that window. Fanout reads B's followers from Redis, still sees A, pushes B's tweet into A's timeline. A sees a tweet from someone they unfollowed. Detect: unfollowed-account tweet reports, MySQL follows vs Redis SET cardinality delta above 0.1% for 30+ seconds. Recover: read-time filter drops tweets from users not in the current following set.

Detection
User reports of seeing tweets from unfollowed accounts. Inconsistency metric: delta between MySQL follows count and Redis SET cardinality exceeding 0.1% for more than 30 seconds.
Mitigation
  1. On unfollow, we synchronously remove from both MySQL and the local Redis following set (A's side). The followers set (B's side) can be async.
  2. At timeline read time, we filter out tweets from users not in the reader's current following set. One extra SISMEMBER check per tweet in the response.
  3. Periodic reconciliation job: every 10 minutes, we sample 10K users and compare MySQL follow counts with Redis SET cardinality. Fix discrepancies.
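The read-time safety net in mitigation 2 is a single membership check per tweet. A minimal sketch with an in-memory set standing in for the reader's Redis following set (in production this would be one SISMEMBER per tweet, or a pipelined batch); the function name is illustrative:

```python
def filter_timeline(tweets: list[dict], following: set[int]) -> list[dict]:
    """Drop tweets whose author is no longer in the reader's
    following set; stale fanout entries never reach the client."""
    return [t for t in tweets if t["author_id"] in following]
```

Because a timeline page is small (tens of tweets), the extra checks add little latency, and the filter also covers any other fanout race, not just unfollow.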
HIGH

Snowflake clock skew causes duplicate or out-of-order IDs

Duplicate IDs cause INSERT failures (lost tweets). Out-of-order IDs cause timeline misordering. Both erode user trust.

Constraint: Snowflake IDs depend on monotonic timestamps across generator nodes. What breaks: a node's NTP sync drifts by 2 seconds, generating IDs with past timestamps that sort before IDs from other nodes. If the clock jumps forward after correction, the node reuses a timestamp, creating duplicate IDs. Two tweets sharing an ID cause a primary key collision on INSERT. Detect: PK collision errors, Snowflake timestamp diverging from wall-clock by 500ms+, NTP drift alerts. Recover: affected node halts ID generation until the clock catches up. Impact: lost tweets and timeline misordering.

Detection
Primary key collision errors on tweet INSERT. Snowflake ID timestamp component diverging from wall-clock time by more than 500ms. NTP drift monitoring alerts.
Mitigation
  1. Snowflake clock guard: if the system clock moves backward, the ID generator refuses to issue IDs until the clock catches up. This halts writes on that node but prevents duplicates.
  2. We run NTP daemon with aggressive sync (every 30 seconds) and drift alerting at 100ms threshold.
  3. Each Snowflake node tracks its last issued timestamp. If the current timestamp equals the last, increment the sequence. If current < last, wait or reject.
  4. We deploy Snowflake nodes on machines with hardware-synced clocks (AWS Time Sync Service, Google TrueTime) for sub-millisecond accuracy.
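Mitigations 1 and 3 together form the generator's core loop: track the last issued timestamp, increment the sequence within a millisecond, and refuse to issue IDs if the clock moves backward. A minimal sketch, assuming the common 41-bit-timestamp / 10-bit-node / 12-bit-sequence layout (the class is illustrative, not Twitter's actual implementation):

```python
import time

class SnowflakeGenerator:
    EPOCH = 1288834974657  # Twitter's Snowflake epoch, in ms

    def __init__(self, node_id: int, clock=lambda: int(time.time() * 1000)):
        self.node_id = node_id
        self.clock = clock      # injectable for testing
        self.last_ts = -1
        self.seq = 0

    def next_id(self) -> int:
        ts = self.clock()
        if ts < self.last_ts:
            # Clock moved backward: refuse to issue IDs rather than
            # risk duplicates. This halts writes on this node.
            raise RuntimeError("clock moved backward; refusing to issue IDs")
        if ts == self.last_ts:
            self.seq = (self.seq + 1) & 0xFFF  # 12-bit sequence
            if self.seq == 0:
                # Sequence exhausted in this millisecond: spin until
                # the clock advances.
                while ts <= self.last_ts:
                    ts = self.clock()
        else:
            self.seq = 0
        self.last_ts = ts
        return ((ts - self.EPOCH) << 22) | (self.node_id << 12) | self.seq
```

Injecting the clock makes the backward-jump path easy to exercise in tests without touching the system clock.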
HIGH

Fanout queue backlog from Kafka consumer lag

Users experience stale timelines for several minutes. During major events, the backlog compounds rapidly.

Constraint: 12 Kafka partitions process fanout writes. What breaks: one consumer pod crashes, and K8s takes 60s to reschedule it. The dead consumer's 4 partitions accumulate 4/12 × 5,787 × 60 ≈ 115,740 unprocessed tweets. Each fans out to 200 followers: roughly 23M pending writes. The replacement needs about 4 minutes to drain them at 100K writes/sec. Detect: consumer lag above 50K, pod restarts, fanout latency above 60s. Recover: auto-scaling spins up replacements; cooperative-sticky rebalancing reassigns partitions in 5 seconds. Impact: stale timelines for minutes, compounding during major events.

Detection
Kafka consumer group lag exceeding 50K messages. Consumer pod restart events in Kubernetes. Fanout latency (publish-to-cache-write) exceeding 60 seconds.
Mitigation
  1. Over-provision consumer pods: we run 2x the minimum needed so losing one pod still leaves enough capacity to handle the full stream.
  2. Kafka partition rebalancing with cooperative-sticky assignor: reduces rebalance time from 30 seconds to under 5 seconds.
  3. Backpressure mechanism: if lag exceeds 100K, we temporarily skip fanout for users with <10 followers (they rarely check their timeline in real time).
  4. Auto-scaling consumer pods based on lag metrics: scale up when lag > 20K, scale down when lag < 1K.
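The scaling policy in mitigation 4 is a simple hysteresis band: scale up above 20K lag, scale down below 1K, and hold steady in between to avoid flapping. A minimal sketch of that decision (the function name and step-by-one policy are illustrative; a real HPA would scale proportionally to lag):

```python
SCALE_UP_LAG = 20_000
SCALE_DOWN_LAG = 1_000

def desired_replicas(current: int, lag: int, max_replicas: int = 24) -> int:
    """Return the target consumer pod count for the observed lag."""
    if lag > SCALE_UP_LAG:
        return min(current + 1, max_replicas)
    if lag < SCALE_DOWN_LAG:
        return max(current - 1, 1)
    return current  # inside the hysteresis band: hold steady
```

The wide gap between the two thresholds is deliberate: lag is noisy, and a narrow band would cause the deployment to oscillate.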
LOW

Stale timeline from eventual consistency window

Expected behavior in an eventually consistent system. Noticeable only if the user posts and immediately checks another device.

Constraint: eventual consistency for fanout with a 2-5 second target. What breaks: user A posts a tweet; it is written to MySQL and published to Kafka. User B refreshes 1 second later. The fanout consumer has not processed A's tweet yet (fanout latency is 2-5 seconds), so B sees a stale feed; on a refresh 10 seconds later, the tweet appears. Detect: fanout latency p50 at 2s and p99 at 8s, or timeline reads served within 1 second of a tweet missing that tweet 90% of the time. Recover: write-through inserts A's tweet into A's own timeline immediately; client-side optimistic insertion shows it locally before fanout confirms.

Detection
User-reported 'missing tweets' that appear on refresh. Fanout latency histogram showing p50 at 2 seconds and p99 at 8 seconds.
Mitigation
  1. Write-through: when user A posts, we immediately add the tweet to A's own timeline cache (so A sees their own tweet instantly).
  2. On timeline read, if the user's last_tweet_at is within 5 seconds, we append the user's own recent tweets to the response (self-merge).
  3. Client-side optimistic insertion: the mobile app inserts the tweet into the local timeline immediately after a successful POST, before the server confirms fanout.
  4. We document the 2-5 second delivery SLA. We do not promise real-time delivery. Set user expectations correctly.
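Mitigations 1 and 2 can be sketched together: write-through puts the author's tweet in their own timeline at post time, and self-merge unions any very recent own tweets into the read path in case fanout has not landed. A minimal in-memory sketch (a dict stands in for the Redis timeline caches; since Snowflake IDs are time-ordered, sorting IDs descending yields reverse-chronological order):

```python
def post_tweet(author_id: int, tweet_id: int,
               timelines: dict[int, list[int]]) -> None:
    """Write-through: the author sees their own tweet instantly."""
    timelines.setdefault(author_id, []).insert(0, tweet_id)
    # ...then publish to Kafka for async fanout to followers.

def read_timeline(user_id: int, timelines: dict[int, list[int]],
                  recent_own: list[int]) -> list[int]:
    """Self-merge: union the cached timeline with the user's very
    recent tweets, covering the fanout window."""
    merged = set(timelines.get(user_id, [])) | set(recent_own)
    return sorted(merged, reverse=True)
```

The union-then-sort also deduplicates: once fanout catches up and the tweet is in the cache, merging the same tweet again is harmless.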