Photo Sharing Cheat Sheet

Key concepts, trade-offs, and quick-reference notes for your interview prep.

Photo Resolution Variants

#1
We pre-generate 4 sizes per photo (not dynamic resizing) because at 350K reads/sec, real-time resizing would cost ~35,000 CPU cores. Sizes: 150px thumbnail (15KB), 320px small (50KB), 640px medium (200KB), 1080px full (800KB). The 3MB original is stored in S3 but never served directly. Total storage per photo: 3MB + 15KB + 50KB + 200KB + 800KB ≈ 4.1MB. The CDN URL encodes the resolution: /photos/id/1080.jpg. Trade-off: we spend 4.1MB of storage per photo instead of 3MB, but eliminate all compute on the read path.

💡 We generate the thumbnail first (not all 4 in parallel from the start) because the grid view needs the 150px variant within 1 second. The other 3 sizes process in parallel after the thumbnail is ready.
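The thumbnail-first ordering can be sketched in Python (the `resize` helper is a hypothetical stand-in for a real libvips call):

```python
from concurrent.futures import ThreadPoolExecutor

def resize(photo_id: str, width: int) -> str:
    # Stand-in for a real libvips resize; returns the CDN path
    # the finished variant would be uploaded to.
    return f"/photos/{photo_id}/{width}.jpg"

def process_upload(photo_id: str) -> list[str]:
    # Thumbnail first: the grid view needs the 150px variant within ~1s.
    paths = [resize(photo_id, 150)]
    # The other 3 sizes run in parallel only after the thumbnail exists.
    with ThreadPoolExecutor(max_workers=3) as pool:
        paths += pool.map(resize, [photo_id] * 3, [320, 640, 1080])
    return paths
```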

Daily Storage Growth

#2
200M photos/day at 4.1MB each: 200M × 4.1MB = 820TB/day. That is 820TB × 365 ≈ 299PB/year. We chose S3 lifecycle policies (not a single storage tier) because 80% of photo accesses happen within the first 48 hours. Move originals to S3 Glacier after 90 days (80% cost reduction), keep only the 4 variants in standard storage. Trade-off: retrieving a glaciered original takes minutes, but photos older than 90 days are requested less than 0.1% of the time.
⚠️ Common mistake: Not accounting for all 4 variants in the storage estimate. Candidates often quote 3MB × 200M = 600TB, missing the 220TB of resized variants that actually get served.
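The storage arithmetic as a quick sanity check (decimal units, matching the text's rounding):

```python
# Per-photo footprint: 3MB original + the 4 variants (15 + 50 + 200 + 800 KB).
per_photo_mb = 3.0 + (15 + 50 + 200 + 800) / 1_000   # ≈ 4.07MB, quoted as ~4.1MB
daily_tb = 200_000_000 * 4.1 / 1_000_000             # using the rounded 4.1MB figure
yearly_pb = daily_tb * 365 / 1_000
naive_tb = 200_000_000 * 3.0 / 1_000_000             # common mis-estimate: originals only
```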

Instagram 64-bit ID

#3
We chose Instagram's 64-bit ID (not UUID v4) because UUIDs are 128 bits, not time-sortable, and cause B-tree fragmentation. Layout: 41-bit timestamp (milliseconds since 2011-01-01, good for 69 years) + 13-bit shard (8,192 logical shards) + 10-bit sequence (1,024 IDs per millisecond per shard). Generated inside PostgreSQL via PL/pgSQL. No external coordinator (not ZooKeeper, not a dedicated ID service), zero network hops. Total capacity: 8,192 × 1,024 × 1,000 ≈ 8.39 billion IDs/sec. Trade-off: we tie our ID format to PostgreSQL, but avoid the operational complexity of a separate coordination service.

💡 The timestamp occupies the high-order bits, so IDs are time-sortable by default. ORDER BY photo_id = ORDER BY created_at. We skip the secondary index on created_at entirely.
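A minimal Python sketch of the bit layout (the production version runs as PL/pgSQL inside PostgreSQL, per the text):

```python
EPOCH_MS = 1_293_840_000_000  # 2011-01-01T00:00:00Z in Unix milliseconds

def make_photo_id(now_ms: int, shard: int, seq: int) -> int:
    # 41-bit timestamp | 13-bit shard | 10-bit sequence = 64 bits.
    # Timestamp in the high bits makes IDs time-sortable by default.
    return ((now_ms - EPOCH_MS) << 23) | (shard << 10) | seq

def decompose(photo_id: int) -> tuple[int, int, int]:
    ts_ms = (photo_id >> 23) + EPOCH_MS
    shard = (photo_id >> 10) & 0x1FFF     # 13 bits -> 8,192 shards
    seq = photo_id & 0x3FF                # 10 bits -> 1,024 per ms per shard
    return ts_ms, shard, seq
```

Because the timestamp occupies the high-order bits, a plain integer comparison on IDs is also a chronological comparison.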

Celebrity Fanout Threshold

#4
We chose a hybrid fanout model (not pure push or pure pull) with a 10K follower threshold. Below 10K: fanout-on-write pushes the photo ID into every follower's timeline cache at post time. Above 10K: fanout-on-read, merged at request time. Why 10K and not 1K or 100K? A celebrity with 50M followers would trigger 50M cache writes for one photo. At 100K writes/sec, that is 500 seconds of sustained writes, blocking the entire fanout queue. Setting it at 1K pushes too many users into pull mode, degrading feed latency for millions. 99% of users fall below 10K, so push works for the vast majority. Trade-off: celebrity followers see posts with ~50ms extra latency due to the read-time merge.
⚠️ Common mistake: Setting the threshold too low (e.g., 1K). That forces too many users into pull mode, degrading feed latency for millions of accounts that could easily use push.
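The threshold decision itself is a one-line branch; here is a sketch where `push_queue` and `celebrity_posts` are hypothetical stand-ins for the fanout worker queue and the read-time merge store:

```python
FANOUT_THRESHOLD = 10_000

def fanout(author_id: int, photo_id: int, follower_count: int,
           push_queue: list, celebrity_posts: dict) -> str:
    if follower_count < FANOUT_THRESHOLD:
        # Fanout-on-write: workers push the ID into every follower's feed cache.
        push_queue.append((author_id, photo_id))
        return "push"
    # Fanout-on-read: stored once, merged into follower feeds at request time.
    celebrity_posts.setdefault(author_id, []).append(photo_id)
    return "pull"
```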

Feed Cache Sizing

#5
We store each user's feed in a Redis sorted set (not Memcached) because we need sorted range queries for chronological pagination. Each set holds 500 photo IDs at 8 bytes each. Memory per user: 500 × 8B = 4KB. For 500M users: 500M × 4KB = 2TB. At 64GB per Redis node: 2,000GB / 64GB ≈ 32 nodes minimum. We plan for ~50 nodes with replication (1 primary + 1 replica per shard). Why 500 IDs and not 1,000? Most users never scroll past 200 photos. 500 gives a 2.5x buffer without doubling memory cost. Trade-off: users who scroll past 500 photos hit a cold read from the database.

💡 We store photo IDs only (not photo objects) in the feed cache. Hydrate at read time via a batch MGET from the photo metadata cache. One extra round-trip (~2ms) saves 500x memory.
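A plain-Python simulation of the capped feed (real code would use Redis ZADD followed by ZREMRANGEBYRANK to trim):

```python
FEED_CAP = 500  # 500 IDs x 8 bytes ≈ 4KB per user

def push_to_feed(feed: list[int], photo_id: int) -> list[int]:
    # Mimics the sorted-set behavior: newest first (IDs are
    # time-sortable), trimmed so old entries fall off the end.
    feed.append(photo_id)
    feed.sort(reverse=True)
    del feed[FEED_CAP:]
    return feed
```

After 600 posts land in one feed, only the newest 500 IDs remain; anything older requires a cold database read.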

CDN Cache Hit Ratio

#6
We target 95%+ cache hit ratio by exploiting photo immutability: once uploaded, bytes never change, so we set Cache-Control: max-age=31536000 (1 year). We chose path-based resolution URLs (/photos/id/1080.jpg) instead of query parameters because some CDNs strip query strings and would serve the wrong size. At 350K peak reads/sec and ~200KB average photo size, total bandwidth is 350K × 200KB = 70GB/sec. CDN edge Points of Presence (POPs) distribute this globally. Origin sees only the 5% misses: ~17K QPS. Trade-off: we cannot update a photo after upload (no edits), but this lets us cache aggressively.
⚠️ Common mistake: Forgetting that CDN URLs must be unique per resolution. If all 4 sizes share the same URL with a query param, some CDNs strip query strings and serve the wrong size.
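A sketch of the URL scheme and cache header (`photo_url` and `VALID_WIDTHS` are illustrative names, not from the text):

```python
VALID_WIDTHS = {150, 320, 640, 1080}
CACHE_CONTROL = "public, max-age=31536000"  # 1 year: photo bytes never change

def photo_url(photo_id: int, width: int) -> str:
    # Path-based resolution: each size gets a distinct cache key
    # even on CDNs that strip query strings.
    if width not in VALID_WIDTHS:
        raise ValueError(f"no pre-generated variant at {width}px")
    return f"/photos/{photo_id}/{width}.jpg"
```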

Upload and Read QPS

#7
Uploads: 200M / 86,400 ≈ 2,315 QPS, peak at 3x: ~6,945 QPS. Reads: 500M × 20 / 86,400 ≈ 115,741 QPS, peak at 3x: ~347,222 QPS. Read-to-write ratio: 115,741 / 2,315 ≈ 50:1. This is a read-heavy system, which is why we invest in CDN and caching before anything else. The CDN absorbs 95%+ of reads, so origin servers handle only ~17K QPS at peak. We sized the upload servers for 7K concurrent multipart uploads (not more) because writes are the smaller problem. Trade-off: we over-provision the read path and accept higher CDN costs, but keep origin infrastructure lean.

💡 The 50:1 ratio means optimizing the read path (CDN, cache) has 50x more impact than optimizing writes. We design the CDN layer first, upload pipeline second.
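The QPS arithmetic, reproduced as a worked example:

```python
SECONDS_PER_DAY = 86_400
uploads_qps = 200_000_000 / SECONDS_PER_DAY        # ≈ 2,315
reads_qps = 500_000_000 * 20 / SECONDS_PER_DAY     # ≈ 115,741
peak_reads_qps = reads_qps * 3                     # ≈ 347,222
origin_qps = peak_reads_qps * 0.05                 # 5% CDN misses ≈ 17K
ratio = reads_qps / uploads_qps                    # 50:1 read-heavy
```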

Like Counter Strategy

#8
4.2B likes/day: 4.2B / 86,400 ≈ 48,611 QPS, peak at 3x: ~145,833 QPS. We chose Redis INCR (not application-level read-modify-write) because INCR is atomic and O(1), with no race conditions. Async flush to PostgreSQL every 30 seconds via a background worker. If Redis crashes, the last 30 seconds of likes are lost but recovered from the Write-Ahead Log (WAL). Why not direct database writes? 146K/sec of UPDATE ... SET likes = likes + 1 would saturate any PostgreSQL cluster. Trade-off: like counts in PostgreSQL lag by up to 30 seconds, but users see real-time counts from Redis.
⚠️ Common mistake: Using SELECT + UPDATE in the application layer. Two concurrent likes read the same count, both increment by 1, and one like is lost. Always use atomic INCR.
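A plain-Python sketch of the increment-and-batch-flush pattern (a lock stands in for Redis's single-threaded atomicity; the real system calls INCR and a worker flushes every 30 seconds):

```python
import threading

class LikeCounter:
    def __init__(self):
        self._lock = threading.Lock()       # stands in for Redis atomicity
        self._pending: dict[int, int] = {}  # photo_id -> unflushed delta

    def incr(self, photo_id: int) -> None:
        # Atomic, like INCR: concurrent likes can never be lost.
        with self._lock:
            self._pending[photo_id] = self._pending.get(photo_id, 0) + 1

    def flush(self) -> dict[int, int]:
        # Background worker drains deltas into PostgreSQL; thousands of
        # increments collapse into one UPDATE per hot photo.
        with self._lock:
            deltas, self._pending = self._pending, {}
        return deltas
```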

Image Processing Target

#9
All 4 resolution variants must be ready within 10 seconds of upload. We chose an async pipeline (not synchronous processing): (1) upload original to S3, return photo_id immediately, (2) S3 event triggers a resize worker, (3) generate thumbnail first (150px, ~200ms) for instant grid display, (4) process remaining 3 sizes in parallel (~2-3 seconds each). We chose libvips (not ImageMagick) because libvips is 4-8x faster and uses 10x less memory. At 2,315 uploads/sec with 4 resize jobs each, the cluster runs ~9,260 resize jobs/sec. Trade-off: users see a placeholder for ~1 second until the thumbnail is ready, but the upload response returns in under 500ms.

💡 The user sees a placeholder until the thumbnail is ready. Once the 150px variant exists (under 1 second), the grid view populates. Full-size loads lazily when the user taps.
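A capacity back-of-envelope for the resize cluster, applying Little's law to the rough per-job durations above (the 2.5s midpoint for the non-thumbnail sizes is an assumption):

```python
UPLOADS_PER_SEC = 2_315
JOBS_PER_UPLOAD = 4
job_rate = UPLOADS_PER_SEC * JOBS_PER_UPLOAD   # 9,260 resize jobs/sec

# ~200ms for the thumbnail, ~2.5s each for the other 3 sizes.
avg_job_s = (0.2 + 3 * 2.5) / JOBS_PER_UPLOAD  # ≈ 1.93s weighted mean
concurrent_jobs = job_rate * avg_job_s         # Little's law: L = lambda * W
```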

Logical vs Physical Sharding

#10
We chose logical sharding (not physical hash-mod-N sharding) with 8,192 logical shards mapped to ~30 physical PostgreSQL servers. Each physical server hosts ~273 logical shards. To scale: add a new physical server and move entire logical shards to it. No row-level resharding needed. Each logical shard is a separate PostgreSQL schema: shard0001.photos, shard0001.users. The shard ID is embedded in the 64-bit photo ID (bits 10-22), so the application routes queries to the correct physical server by extracting the shard from the ID. Trade-off: we accept 8,192 as a hard upper limit on logical shards, but that supports billions of users before we hit the ceiling.
⚠️ Common mistake: Using physical sharding (hash mod N servers). Adding one server rehashes every row. Logical sharding lets you move atomic units (entire shards) to new hardware with zero downtime.
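A sketch of shard routing and shard moves (`shard_map` and the modulo assignment are hypothetical; a real deployment keeps this table in config):

```python
LOGICAL_SHARDS = 8_192
PHYSICAL_SERVERS = 30

# Routing table: logical shard -> physical server. Moving a shard
# means flipping one entry here, never rehashing rows.
shard_map = {s: s % PHYSICAL_SERVERS for s in range(LOGICAL_SHARDS)}

def route(photo_id: int) -> tuple[int, int]:
    # The 13 shard bits sit just above the 10-bit sequence in the 64-bit ID.
    shard = (photo_id >> 10) & 0x1FFF
    return shard, shard_map[shard]

def move_shard(shard: int, new_server: int) -> None:
    # Scale-out: copy the shard's schema (e.g. its photos/users tables)
    # to the new server, then flip this one entry.
    shard_map[shard] = new_server
```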