Anti-Patterns

Photo Sharing Anti-Patterns

Common design mistakes candidates make. Learn what goes wrong and how to avoid each trap in your interview.

Fanning Out Full Photo Objects

Very Common

Pushing full photo data into every follower's cache instead of photo_ids. Candidates think it saves a read, but a single 4KB photo object in 10M caches = 40GB of redundant data.

Why: It feels like an optimization: skip the hydration step and serve the photo metadata directly from the timeline cache. With 500 entries per user, storing 4KB objects means 500 × 4KB = 2MB per user. For 500M users: 500M × 2MB = 1PB. Storing 8-byte IDs instead: 500 × 8 bytes = 4KB per user, so 500M × 4KB = 2TB. That is a 500x memory difference. The hydration step (batch MGET for 20 photo objects) adds only 2-3ms.

WRONG: Store full photo objects (4KB each) in every follower's timeline cache. When a popular user posts, the fanout pushes 4KB per follower instead of 8 bytes. With 10M followers, one post writes 40GB of redundant data across the Redis cluster. This approach copies Memcached-style denormalization, which fails at Instagram's scale.
RIGHT: Store only photo IDs (8 bytes each) in the timeline cache. On feed read, batch-fetch the top 20 photo objects from a separate photo metadata cache via MGET. One extra round-trip (~2ms) saves 500x memory. We chose IDs-only because memory is the binding constraint at 500M users, and 2ms of hydration latency is invisible to users. Trade-off accepted: every feed read requires one extra cache round-trip.
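The IDs-only pattern can be sketched in a few lines of Python. Plain dicts stand in for the Redis timeline cache and the separate photo metadata cache; the names (`timeline_cache`, `fan_out`, `hydrate_feed`) are illustrative, not part of any real system.

```python
timeline_cache = {}  # user_id -> list of photo_ids (8 bytes each), newest first
photo_cache = {}     # photo_id -> full photo metadata (~4KB object in production)

def fan_out(photo_id, follower_ids):
    """Write path: push only the 8-byte ID into each follower's timeline."""
    for uid in follower_ids:
        timeline_cache.setdefault(uid, []).insert(0, photo_id)

def hydrate_feed(user_id, page_size=20):
    """Read path: fetch the top-N IDs, then batch-fetch metadata
    (a single MGET round-trip against Redis in production)."""
    ids = timeline_cache.get(user_id, [])[:page_size]
    return [photo_cache[pid] for pid in ids if pid in photo_cache]

photo_cache[101] = {"id": 101, "url": "s3://photos/101/medium.jpg"}
photo_cache[102] = {"id": 102, "url": "s3://photos/102/medium.jpg"}
fan_out(101, follower_ids=[1, 2, 3])
fan_out(102, follower_ids=[1, 2])
feed = hydrate_feed(1)  # newest first: photo 102, then 101
```

The key property: the fanout loop moves 8 bytes per follower, and the 4KB objects live exactly once in `photo_cache`, regardless of follower count.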

Storing Only Original Resolution

Very Common

Not generating multiple sizes upfront. Candidates assume the CDN or client can resize on the fly. Reality: real-time resizing at 350K reads/sec costs ~35,000 CPU cores.

Why: Dynamic resizing sounds elegant: store one file, resize at the CDN edge or via an image proxy. Some CDNs offer this feature. But at 350K reads/sec, each resize takes 50-200ms of CPU time. That is 350K × 100ms = 35,000 CPU-seconds per second, roughly 35,000 cores dedicated to resizing. Pre-generating 4 sizes at upload time costs 4 seconds of CPU once, then serves immutable files from cache forever.

WRONG: Store only the original 3MB photo and rely on a CDN image transformation service (like Cloudinary or Imgix) for on-the-fly resizing. At 350K reads/sec, the CDN spins up 35,000 CPU cores for real-time resizing. Monthly compute bill: millions of dollars. Cache hit ratio drops because each resolution+quality combination creates a unique cache key.
RIGHT: Pre-generate 4 resolution variants at upload time (one-time cost: ~4 seconds). Store all 5 files (original + 4 variants) in S3. CDN serves immutable files with 1-year TTL. Zero compute on the read path. 95%+ cache hit ratio. We chose pre-generation over dynamic resizing because the read path runs 50x hotter than the write path, and amortizing resize cost across one write beats paying it on every read. Trade-off accepted: 37% more storage per photo (4.1MB vs 3MB).
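A quick back-of-envelope check of the trade-off, using the figures from the text (350K reads/sec, ~100ms per resize, 3MB original, ~1.1MB of combined variants):

```python
# Cost of dynamic resizing: CPU-seconds of resize work needed per wall-clock second.
reads_per_sec = 350_000
resize_cpu_ms = 100  # midpoint of the 50-200ms range in the text
cores_for_dynamic = reads_per_sec * resize_cpu_ms // 1000  # ~cores pinned on resizing

# Cost of pre-generation: extra storage per photo.
original_mb = 3.0
variants_mb = 1.1  # the 4 pre-generated sizes combined (4.1MB total - 3MB original)
storage_overhead_pct = round(variants_mb / original_mb * 100)

print(cores_for_dynamic)     # 35000
print(storage_overhead_pct)  # 37
```

The asymmetry is the whole argument: 35,000 cores forever versus 37% more storage per photo, paid once.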

Using UUIDs for Photo IDs

Very Common

UUIDs are 128 bits, not time-sortable, and cause B-tree fragmentation. Instagram's 64-bit IDs are half the size, time-sortable, and generated inside PostgreSQL with no external coordinator.

Why: UUIDs are the default choice in many frameworks. UUID v4 is random, globally unique without coordination. But 128 bits is twice the size of a 64-bit ID. In indexes holding billions of rows, that doubles memory usage. Worse, random UUIDs fragment B-tree indexes because inserts land at random leaf pages. Sequential (time-sorted) IDs append to the end of the B-tree, keeping write amplification low.

WRONG: Use UUID v4 (the framework default) for photo IDs. Each ID is 128 bits (16 bytes vs 8 bytes). Billions of photos with random UUIDs cause B-tree page splits: write amplification increases 3-5x. To display photos chronologically, you need a separate created_at index, doubling index storage.
RIGHT: Use Instagram's 64-bit ID: 41-bit timestamp + 13-bit shard + 10-bit sequence. Half the bytes, time-sortable by default (no secondary index needed), generated inside PostgreSQL with zero external coordination. B-tree inserts are sequential, minimizing page splits. We chose this over Snowflake IDs because we avoid the dependency on an external coordination service. Trade-off accepted: the ID format is coupled to PostgreSQL's PL/pgSQL, making a future database migration harder.
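The 41/13/10 bit layout can be sketched in Python with plain bit shifts. The custom epoch constant and function names here are illustrative, not Instagram's actual values; in production this logic runs inside a PL/pgSQL function on each shard.

```python
import time

CUSTOM_EPOCH_MS = 1_300_000_000_000  # illustrative epoch, not Instagram's constant

def make_photo_id(shard_id, sequence, now_ms=None):
    """Pack 41-bit ms timestamp | 13-bit shard | 10-bit sequence into 64 bits."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    ts = (now_ms - CUSTOM_EPOCH_MS) & ((1 << 41) - 1)
    return (ts << 23) | ((shard_id & 0x1FFF) << 10) | (sequence & 0x3FF)

def unpack_photo_id(photo_id):
    """Recover (epoch-ms, shard, sequence) from a packed ID."""
    ts = photo_id >> 23
    shard = (photo_id >> 10) & 0x1FFF
    seq = photo_id & 0x3FF
    return ts + CUSTOM_EPOCH_MS, shard, seq

# Because the timestamp occupies the high bits, plain integer ordering
# is chronological ordering -- no secondary created_at index needed.
earlier = make_photo_id(shard_id=5, sequence=1, now_ms=CUSTOM_EPOCH_MS + 1_000)
later = make_photo_id(shard_id=9, sequence=0, now_ms=CUSTOM_EPOCH_MS + 2_000)
```

Note the budget: 41 + 13 + 10 = 64 bits exactly, with 41 bits of milliseconds giving roughly 69 years of IDs from the chosen epoch.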

Pure Fanout-on-Write for All Users

Very Common

One celebrity photo upload triggers 50M cache writes, blocking the fanout queue for minutes. We use hybrid: push for users under 10K followers, pull for celebrities.

Why: Fanout-on-write is clean and makes the read path trivial. Candidates test it with typical users (200-500 followers) and it works perfectly. Then a celebrity with 50M followers posts. At 100K writes/sec, pushing that one photo ID into 50M caches takes 50M / 100K = 500 seconds (over 8 minutes). Every other user's fanout queues behind this backlog. The entire platform's feed freshness degrades.

WRONG: Apply fanout-on-write uniformly to all users, treating a 50M-follower celebrity the same as a 200-follower user. The fanout service queues 50M cache inserts. At 100K writes/sec, the queue drains in 8+ minutes. Every other user's photo delivery stalls behind this single post.
RIGHT: Use a hybrid model with a 10K follower threshold. Users below 10K: fanout-on-write (instant push). Users above: fanout-on-read, their photos are fetched and merged at request time. 99% of users get instant push. We chose 10K as the threshold (not 1K or 100K) because it balances write amplification against read-time merge cost. Trade-off accepted: celebrity followers see posts with ~50ms additional read-time merge latency.
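The hybrid routing decision can be sketched as follows. Dicts stand in for the timeline cache and the celebrity post store, and all names (`publish`, `read_feed`, `FANOUT_THRESHOLD`) are illustrative:

```python
FANOUT_THRESHOLD = 10_000  # followers: push below this, pull above it

timelines = {}        # user_id -> photo_ids pushed at write time
celebrity_posts = {}  # celebrity_id -> their photo_ids, pulled at read time

def publish(author_id, photo_id, follower_ids):
    """Write path: route by follower count instead of fanning out uniformly."""
    if len(follower_ids) < FANOUT_THRESHOLD:
        for uid in follower_ids:          # cheap: bounded by the threshold
            timelines.setdefault(uid, []).append(photo_id)
    else:
        celebrity_posts.setdefault(author_id, []).append(photo_id)  # one write

def read_feed(user_id, followed_celebrities):
    """Read path: merge the pushed timeline with pulled celebrity posts.
    Time-sortable IDs make the merge a plain sort."""
    pushed = timelines.get(user_id, [])
    pulled = [pid for c in followed_celebrities
              for pid in celebrity_posts.get(c, [])]
    return sorted(pushed + pulled, reverse=True)

publish(author_id=7, photo_id=100, follower_ids=[1, 2])
publish(author_id=99, photo_id=200, follower_ids=list(range(50_000)))
```

A normal user's post costs one cache write per follower; a celebrity's post costs exactly one write, and the merge cost moves to read time for their followers only.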

Synchronous Image Resizing on Upload

Common

Making the user wait for all 4 variants to finish before returning success. We return 202 immediately after S3 write. Resizing happens async via an event-triggered worker.

Why: It feels correct to wait until the photo is fully processed before confirming the upload. But generating 4 variants takes 4-10 seconds. On a mobile network with spotty connectivity, holding the HTTP connection open for 10 seconds risks timeout. The user stares at a spinner. If the connection drops at second 8, the upload appears to fail even though S3 already has the original.

WRONG: The upload handler writes to S3, generates all 4 resolution variants synchronously, then returns 200 OK. Total response time: 10+ seconds. On mobile networks, the connection times out. Users retry, creating duplicate uploads. This mirrors the synchronous processing pattern that works at low QPS but collapses at 2,300 uploads/sec.
RIGHT: Write the original to S3, return 202 Accepted with the photo_id immediately (under 500ms). An S3 event triggers an async resize worker. Show a placeholder until the thumbnail is ready (~1 second). We chose 202 (not 200) because the resource is not yet fully created. Trade-off accepted: users see a placeholder for ~1 second, but the upload response is 20x faster than synchronous processing.
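A sketch of the async upload flow, with a dict standing in for S3 and an in-process queue standing in for the S3-event trigger. All names (`handle_upload`, `resize_worker`) are illustrative, and the "resize" step is a copy rather than a real downscale:

```python
from queue import Queue

s3 = {}                 # key -> bytes, stand-in for the object store
resize_queue = Queue()  # stand-in for the S3-event -> worker pipeline
variants = {}           # photo_id -> {size_name: s3_key}

def handle_upload(photo_id, data):
    """Write the original, enqueue resizing, return immediately.
    HTTP 202 semantics: accepted, processing not yet complete."""
    s3[f"original/{photo_id}"] = data
    resize_queue.put(photo_id)
    return 202, {"photo_id": photo_id, "status": "processing"}

def resize_worker():
    """Async worker, runs after the response has already been sent."""
    photo_id = resize_queue.get()
    for size in ("thumb", "small", "medium", "large"):
        key = f"{size}/{photo_id}"
        s3[key] = s3[f"original/{photo_id}"]  # real code would downscale here
        variants.setdefault(photo_id, {})[size] = key

status, body = handle_upload(42, b"jpeg-bytes")  # returns before any resizing
resize_worker()                                  # happens later, off the hot path
```

The client's round-trip ends at the `return 202` line; everything below it is decoupled, which is exactly why a dropped mobile connection no longer loses the upload.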

Serving Photos Directly from S3

Common

S3 can handle the throughput, but latency from a single region is 200-500ms globally. We place a CDN in front to reduce latency to sub-50ms with 95%+ cache hit ratio.

Why: S3 is durable (11 nines) and can handle massive throughput. Candidates reason that S3 is already 'in the cloud' and skip the CDN layer. But S3 buckets live in one region. A user in Tokyo requesting a photo from us-east-1 sees 200-500ms latency per image. A feed page with 20 photos means 4-10 seconds of load time. CDN edge POPs in 300+ cities serve cached photos in sub-50ms.

WRONG: Serve all photos directly from S3 in us-east-1, assuming S3's throughput is sufficient. Users in Asia, Europe, and South America experience 200-500ms latency per photo. A feed page with 20 photos takes 4-10 seconds to load. This mirrors a common single-region architecture mistake.
RIGHT: Place a CDN (we chose CloudFront, not Akamai, for native S3 integration) in front of S3. Photos are immutable, so the CDN caches them with a 1-year TTL. 300+ edge POPs serve photos in sub-50ms globally. 95%+ of reads never hit origin. Trade-off accepted: CDN costs at 70GB/sec bandwidth, but origin infrastructure stays small (~17K QPS instead of 350K QPS).
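The origin-offload claim is simple arithmetic on the numbers in the text:

```python
total_read_qps = 350_000   # global photo reads per second
cache_hit_ratio = 0.95     # CDN hit ratio for immutable, long-TTL objects

# Only cache misses reach the S3 origin.
origin_qps = round(total_read_qps * (1 - cache_hit_ratio))
print(origin_qps)  # 17500 -- matches the ~17K origin QPS above
```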

Storing Follower Graph in MySQL

Common

Follow/unfollow is write-heavy and the graph has 100B edges. We chose Cassandra (not MySQL) because its LSM-tree handles writes at O(1) regardless of table size.

Why: MySQL is the default relational choice. The follows table is straightforward: (follower_id, followee_id, created_at). But follow/unfollow is write-heavy: 50M follow events per day, each requiring an INSERT or DELETE. With 100B total edges, the table is massive. MySQL secondary indexes slow writes as the table grows. Cassandra's Log-Structured Merge-tree (LSM-tree) storage engine handles writes at O(1) (append-only) regardless of table size.

WRONG: Store all follow relationships in MySQL, the default relational choice. With 100B edges, the secondary index on followee_id is 800GB+. Every INSERT triggers an index update on two indexes. At 50M writes/day, MySQL replication lag grows to minutes. Follower count queries require COUNT(*) scans across the index.
RIGHT: Use Cassandra with two denormalized tables: followers_of (partition by followee_id) for fanout lookups, and following (partition by follower_id) for the following list. We chose Cassandra over DynamoDB because we need tunable consistency and control over partition layout. Cassandra's LSM-tree handles writes at O(1). Trade-off accepted: eventual consistency means a 1-2 second delay in follower count updates, but that is invisible to users.
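The dual-write pattern behind the two denormalized tables can be sketched with dicts standing in for the Cassandra partitions. In production the two writes would go through a logged batch for atomicity; this sketch omits that:

```python
followers_of = {}  # followee_id -> set of follower_ids (serves fanout lookups)
following = {}     # follower_id -> set of followee_ids (serves profile pages)

def follow(follower_id, followee_id):
    """One logical follow = two denormalized writes, one per query pattern.
    Each write is an O(1) append in an LSM-tree, regardless of table size."""
    followers_of.setdefault(followee_id, set()).add(follower_id)
    following.setdefault(follower_id, set()).add(followee_id)

def unfollow(follower_id, followee_id):
    followers_of.get(followee_id, set()).discard(follower_id)
    following.get(follower_id, set()).discard(followee_id)

follow(1, 42)
follow(2, 42)
follow(1, 7)
unfollow(2, 42)
```

Each table is partitioned by the key it is queried with, so "who follows X?" and "who does X follow?" are both single-partition reads, at the cost of writing every edge twice.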

Missing CDN Request Coalescing

Common

When a viral photo's cache expires, thousands of simultaneous requests hit origin. Without request coalescing, the origin gets a thundering herd that can cascade into 5xx errors.

Why: CDN cache entries have a TTL. Even with a 1-year TTL, popular photos eventually expire or get purged. When a viral photo with millions of views per hour has its cache entry expire, the next thousand requests all miss the cache simultaneously. Each miss forwards to the origin (S3). S3 handles it, but the latency spikes from sub-50ms (CDN hit) to 200ms+ (S3 direct) for those requests. Worse, the origin bandwidth spikes.

WRONG: No request coalescing configured on the CDN, relying on the default pass-through-on-miss behavior. A viral photo's cache expires. 1,000 concurrent requests all miss and all 1,000 hit the origin simultaneously. Origin bandwidth spikes by 1,000x for that object. If multiple popular photos expire in the same window, origin gets overwhelmed.
RIGHT: Enable request coalescing (also called request collapsing) on the CDN. When multiple requests arrive for the same URL during a cache miss, the CDN sends only one request to the origin and holds the others. Also enable stale-while-revalidate: serve the stale cached copy while fetching a fresh one in the background. We chose both mechanisms together (not just one) because coalescing handles burst misses while stale-while-revalidate prevents the miss from happening at all. Trade-off accepted: stale-while-revalidate serves slightly outdated content for a few seconds, but photos are immutable so staleness does not matter.
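The coalescing mechanism itself can be sketched in Python with a thread-safe in-flight map. This illustrates what a CDN does internally on a burst of misses, not any real CDN's API; all names are illustrative:

```python
import threading
import time

class CoalescingCache:
    """Concurrent misses for the same key trigger exactly one origin fetch;
    the other requests wait on an Event and reuse the fetched value."""

    def __init__(self, origin_fetch):
        self.origin_fetch = origin_fetch
        self.cache = {}
        self.in_flight = {}  # key -> Event set when the leader's fetch completes
        self.lock = threading.Lock()

    def get(self, key):
        with self.lock:
            if key in self.cache:
                return self.cache[key]           # plain cache hit
            event = self.in_flight.get(key)
            if event is None:                    # first miss: become the leader
                event = self.in_flight[key] = threading.Event()
                leader = True
            else:                                # concurrent miss: wait instead
                leader = False
        if leader:
            value = self.origin_fetch(key)       # the ONE origin request
            with self.lock:
                self.cache[key] = value
                del self.in_flight[key]
            event.set()
            return value
        event.wait()
        with self.lock:
            return self.cache[key]

calls = []
def slow_origin(key):
    time.sleep(0.05)  # simulate a ~50ms fetch from S3
    calls.append(key)
    return f"bytes-of-{key}"

cdn = CoalescingCache(slow_origin)
results = []
workers = [threading.Thread(target=lambda: results.append(cdn.get("viral.jpg")))
           for _ in range(20)]
for t in workers:
    t.start()
for t in workers:
    t.join()
```

Twenty simultaneous misses produce a single origin call; the other nineteen requests block for at most one fetch duration instead of stampeding S3.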