Photo Sharing Failure Modes

What breaks, how to detect it, and how to fix it. Every failure mode includes detection metrics, mitigations, and a severity rating.

Failure Modes
CRITICAL

Celebrity fanout storm

Affects feed freshness for all users, not just the celebrity's followers.

Constraint: fanout-on-write processes 100K writes/sec per Redis shard. A celebrity with 50M followers posts a photo, pushing the photo ID into 50M timeline caches. At 100K writes/sec, that is 50M / 100K = 500 seconds (8+ minutes) for one photo. What breaks: every other user's fanout queues behind this backlog. Feed freshness degrades platform-wide. Detection: Kafka consumer lag exceeds 1M messages, fanout latency spikes above 30s. Recovery: hybrid fanout routes 10K+ follower accounts to fanout-on-read. User impact: stale feeds for 8+ minutes until the queue drains.

Detection
  1. Kafka consumer lag exceeding 1M messages.
  2. Fanout latency (post-to-cache-appearance) spiking above 30 seconds.
  3. Redis write throughput saturating at 100% on multiple shards.
Mitigation
  1. Hybrid fanout: users with 10K+ followers skip fanout-on-write. Their photos are merged at read time.
  2. Priority queues in Kafka: normal users' fanout gets high priority, celebrity fanout gets low priority so it does not block others.
  3. Rate limiting per-post fanout at 50K writes/sec, spreading a 50M-follower fanout over 16 minutes instead of attempting it all at once.
  4. Circuit breaker: if queue depth exceeds 5M, temporarily switch ALL users to fanout-on-read until the queue drains.
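The routing decision behind mitigations 1 and 4 can be sketched as a small pure function. This is an illustrative sketch, not the production implementation; the function name, the `PostItem`-free signature, and the exact parameter names are assumptions, while the 10K-follower threshold and 5M circuit-breaker depth come from the mitigations above.

```python
# Hybrid fanout routing sketch (names are illustrative; thresholds from the doc).
FANOUT_THRESHOLD = 10_000          # followers; above this, skip fanout-on-write
CIRCUIT_BREAKER_DEPTH = 5_000_000  # queue depth that trips the breaker

def route_post(author_follower_count: int, fanout_queue_depth: int) -> str:
    """Decide how a new post reaches follower timelines."""
    if fanout_queue_depth > CIRCUIT_BREAKER_DEPTH:
        # Breaker tripped: stop write-side fanout for everyone until it drains.
        return "fanout-on-read"
    if author_follower_count >= FANOUT_THRESHOLD:
        # Celebrity path: merge this author's posts at feed-read time instead.
        return "fanout-on-read"
    return "fanout-on-write"
```

A normal account with a healthy queue takes the write path; a 50M-follower account, or anyone posting while the breaker is tripped, takes the read path.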
HIGH

Image processing backlog

Photos appear as placeholders for extended periods. Users think the upload failed and retry, compounding the problem.

Constraint: resize workers handle 12K jobs/sec against a normal load of 9,260 jobs/sec. A viral event spikes uploads from 2,315/sec to 15K/sec. Each upload triggers 4 resize jobs, so demand jumps to 60K jobs/sec. What breaks: queue depth grows by 48K jobs/sec. After 5 minutes, 14.4M jobs are queued. Users see grey placeholders instead of photos. Detection: queue depth exceeds 100K, p99 processing time exceeds 30s, thumbnail availability drops below 95%. Recovery: auto-scale workers, prioritize thumbnails. User impact: grey placeholders for 5-15 minutes; users retry, thinking uploads failed.

Detection
  1. Processing queue depth exceeding 100K.
  2. P99 processing time exceeding 30 seconds.
  3. Thumbnail availability rate dropping below 95%.
Mitigation
  1. Auto-scale resize workers based on queue depth. Scale up when queue > 50K, scale down when queue < 5K.
  2. Process thumbnail first (highest priority). Users see the grid immediately. Full-size variants process in background.
  3. Backpressure on upload endpoint: if queue depth exceeds 500K, return 503 with Retry-After header. Better to reject uploads temporarily than show broken feeds.
  4. Pre-warm worker capacity during predictable events (New Year's Eve, Super Bowl).
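Mitigations 1 and 3 are both pure threshold checks on queue depth, which makes them easy to sketch. A minimal sketch, assuming the thresholds above; the function names and return shapes are illustrative, not a real API.

```python
# Queue-depth-driven scaling and backpressure (thresholds from the doc).
SCALE_UP_DEPTH = 50_000    # add resize workers above this
SCALE_DOWN_DEPTH = 5_000   # remove workers below this
REJECT_DEPTH = 500_000     # shed uploads above this

def scaling_action(queue_depth: int) -> str:
    """Autoscaler decision for the resize-worker pool."""
    if queue_depth > SCALE_UP_DEPTH:
        return "scale_up"
    if queue_depth < SCALE_DOWN_DEPTH:
        return "scale_down"
    return "hold"

def admit_upload(queue_depth: int) -> tuple[int, dict]:
    """Upload endpoint admission: (HTTP status, extra headers)."""
    if queue_depth > REJECT_DEPTH:
        # Backpressure: a temporary 503 beats an unbounded backlog of
        # grey placeholders. Retry-After value here is an assumption.
        return 503, {"Retry-After": "120"}
    return 202, {}
```

The gap between the scale-down and scale-up thresholds gives the autoscaler hysteresis, so it does not flap on a noisy queue.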
HIGH

CDN cache miss thundering herd

Origin overload cascades: S3 throttling causes 5xx errors, CDN retries amplify the load, feed pages show broken images.

Constraint: the CDN serves 350K reads/sec at a 95%+ hit ratio, so the origin absorbs ~17K QPS of misses. A viral photo hits 10M views/hour and its cache entry gets evicted. What breaks: 1,000 concurrent requests arrive within 100ms, all miss the cache, and all hit the S3 origin simultaneously. If multiple popular photos expire together, origin QPS jumps from 17K to 200K+. Detection: origin QPS spikes above 5x baseline, CDN miss rate exceeds 10% for 60+ seconds. Recovery: request coalescing collapses concurrent misses into one origin fetch. User impact: photos load in 200-500ms instead of sub-50ms during a 1-2 second window.

Detection
  1. Origin QPS spiking above 5x baseline.
  2. CDN cache miss rate exceeding 10% for more than 60 seconds.
  3. S3 5xx error rate increasing.
Mitigation
  1. Enable request coalescing on the CDN: collapse concurrent requests for the same URL into one origin fetch.
  2. Use stale-while-revalidate: serve the stale cached copy to all current requesters while fetching a fresh copy in the background.
  3. Add an origin shield layer (CDN mid-tier cache) between edge POPs and S3 to absorb miss storms.
  4. For the most popular photos (top 0.01%), proactively warm the CDN cache before TTL expiry.
MEDIUM

Feed staleness after follow

Temporary UX gap (2-5 seconds). No data loss. Resolved on next feed refresh.

Constraint: fanout-on-write builds timeline caches at post time, not follow time. User A follows B and opens their feed expecting B's photos. What breaks: A's cache was built before the follow, so B's photos are missing. The feed looks unchanged; A thinks the follow failed. Detection: user reports of missing content post-follow, cache miss rate for newly followed accounts, A/B tests showing engagement drops 60 seconds after follow. Recovery: backfill B's last 20 photo IDs into A's cache on follow. User impact: stale feed for 2-5 seconds until backfill completes.

Detection
  1. User reports of missing content after following new accounts.
  2. Cache miss rate for newly followed accounts.
  3. A/B testing showing feed engagement drops in the 60 seconds after a follow action.
Mitigation
  1. On follow, backfill: fetch B's last 20 photo IDs and inject them into A's timeline cache with their original timestamps.
  2. Async merge: publish a 'follow' event to Kafka, a consumer fetches B's recent photos and merges them into A's cache within 2-5 seconds.
  3. Client-side: after a follow, the mobile app fetches B's recent photos directly and merges them into the local feed view immediately.
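The backfill in mitigation 1 is a merge of two already-sorted-ish streams keyed by timestamp. A minimal sketch, assuming the cache holds `(timestamp, photo_id)` pairs newest-first; in production the timeline would live in a Redis list or sorted set, and the function name is illustrative.

```python
def backfill_on_follow(timeline: list[tuple[int, int]],
                       recent_photos: list[tuple[int, int]],
                       limit: int = 20) -> list[tuple[int, int]]:
    """Inject B's recent photos into A's cached timeline.

    Each entry is (unix_timestamp, photo_id). B's photos keep their
    original timestamps, so they land at the right position rather
    than piling up at the top of A's feed.
    """
    merged = timeline + recent_photos[:limit]
    merged.sort(key=lambda entry: entry[0], reverse=True)  # newest first
    return merged
```

Injecting with original timestamps matters: a user who follows an account that last posted a week ago should see those photos a week down the feed, not above today's posts.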
MEDIUM

Like counter race condition

Incorrect like counts erode user trust but do not affect core functionality.

Constraint: 146K likes/sec peak requires atomic increments. Two users like the same photo at the same millisecond. Without atomics, both threads read count=100, increment to 101, write 101. One like is lost. What breaks: at 146K/sec, race conditions compound. After 24 hours, Redis and PostgreSQL counts drift by thousands. Users see inconsistent counts across views. Detection: count drift between Redis and PostgreSQL exceeds 1%, reconciliation job flags divergent counts. Recovery: reconciliation runs every 10 minutes, trusting the higher count. User impact: incorrect like counts erode trust.

Detection
  1. Count drift between Redis and PostgreSQL exceeding 1% for any photo.
  2. Periodic reconciliation job flagging divergent counts.
  3. User reports of like counts decreasing after refresh.
Mitigation
  1. Use Redis INCR for all like increments. INCR is atomic and O(1), no read-modify-write cycle, no race conditions.
  2. Periodic reconciliation: every 10 minutes, sample 10K photos and compare Redis count with PostgreSQL. Fix discrepancies by trusting the higher count.
  3. Async flush from Redis to PostgreSQL every 30 seconds via a background worker. PostgreSQL count is the durable source of truth for analytics.
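The reconciliation pass in mitigation 2 reduces to: compare the two counters, and where drift exceeds the threshold, trust the higher value (likes are effectively monotonic, so a lost update always leaves the lower count wrong). An illustrative sketch with plain dicts standing in for Redis and PostgreSQL; real code would use INCR reads and SQL updates.

```python
def reconcile(redis_counts: dict[int, int],
              pg_counts: dict[int, int],
              drift_threshold: float = 0.01) -> list[int]:
    """Repair >1% drift between hot and durable counts; return repaired IDs."""
    repaired = []
    for photo_id, durable in pg_counts.items():
        hot = redis_counts.get(photo_id, 0)
        baseline = max(durable, 1)  # avoid dividing by zero on new photos
        if abs(hot - durable) / baseline > drift_threshold:
            truth = max(hot, durable)  # higher count wins: likes only get lost
            redis_counts[photo_id] = truth
            pg_counts[photo_id] = truth
            repaired.append(photo_id)
    return repaired
```

Running this over a 10K-photo sample every 10 minutes bounds how long drift can persist without scanning the full corpus.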
MEDIUM

S3 upload partial failure

No user-facing impact if cleanup runs properly. Storage cost grows if orphaned objects are not cleaned up.

Constraint: mobile uploads face a 0.5% connection drop rate. A user uploads a 15MB photo; the connection drops at 80% (12MB sent). What breaks: the S3 multipart upload has 12MB of parts but no completion signal. The photo row stays status='uploading'. The user retries, creating a duplicate. Without cleanup, orphans accumulate: at a 0.5% failure rate, that is 1M orphaned objects/day, wasting 15TB of S3 storage daily. Detection: upload completion rate below 99.5%, orphaned S3 object count growing, photo rows stuck in 'uploading' for 1+ hours. Recovery: an S3 lifecycle rule aborts incomplete uploads after 24 hours.

Detection
  1. Upload completion rate dropping below 99.5%.
  2. Orphaned S3 object count growing (S3 lifecycle report).
  3. Photo rows stuck in 'uploading' status for more than 1 hour.
Mitigation
  1. Use S3 multipart upload with resumable uploads. The client can resume from the last completed part, not restart from scratch.
  2. S3 lifecycle policy: abort incomplete multipart uploads after 24 hours. This automatically cleans up orphaned parts.
  3. Cleanup job: every hour, find photo rows with status='uploading' older than 1 hour, abort their S3 multipart upload, and delete the photo row.
  4. Client-side retry with idempotency key: the same upload attempt always writes to the same photo_id, preventing duplicates.
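The idempotency key in mitigation 4 works because the photo ID is derived deterministically from the user and the upload attempt, so a retry targets the same object instead of minting a new one. A minimal sketch; the function name and the SHA-256-to-UUID derivation are illustrative assumptions, not a prescribed scheme.

```python
import hashlib
import uuid

def photo_id_for_attempt(user_id: int, idempotency_key: str) -> str:
    """Derive a stable photo ID from (user, upload-attempt key).

    The same attempt always maps to the same ID, so a client retry
    resumes/overwrites the same S3 object and photo row instead of
    creating a duplicate.
    """
    digest = hashlib.sha256(f"{user_id}:{idempotency_key}".encode()).digest()
    # Fold the hash into a UUID-shaped identifier for storage keys.
    return str(uuid.UUID(bytes=digest[:16]))
```

The client generates the idempotency key once per user-visible upload (not per network attempt) and reuses it on every retry of that upload.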