Follower Graph Storage
At 500M users with an average of 200 follows each, we have 100 billion edges. We need to answer "who follows User X?" in under 5ms for every photo upload's fanout.
MySQL's B-tree indexes degrade as the table grows past 10B rows, slowing writes from ~1ms to 50ms+. Cassandra's LSM-tree storage keeps write latency near-constant regardless of table size.
“We store the follower graph in Cassandra (not MySQL) because follow/unfollow is write-heavy at 50M events/day.”
Trade-off: we accept eventual consistency, meaning a follower count might lag by 1-2 seconds after a follow action. We denormalize into two tables: followers_of (partitioned by followee_id) and following (partitioned by follower_id).
Why two tables instead of one? Because a single table partitioned by followee_id answers "who follows X" but requires a full-cluster scan for "who does X follow." Duplication means every follow writes twice, but reads are always single-partition lookups in under 5ms.
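The dual-write pattern above can be sketched in memory. This is a minimal illustration of the denormalization, not the production schema; the table names `followers_of` and `following` come from the text, and a real deployment would issue two CQL INSERTs instead of updating Python dicts.

```python
from collections import defaultdict

# Two denormalized "tables", each keyed by its partition key:
# followers_of is partitioned by followee_id, following by follower_id.
followers_of = defaultdict(set)  # followee_id -> set of follower_ids
following = defaultdict(set)     # follower_id -> set of followee_ids

def follow(follower_id: int, followee_id: int) -> None:
    """Every follow writes twice, so both read patterns stay single-partition."""
    followers_of[followee_id].add(follower_id)
    following[follower_id].add(followee_id)

def unfollow(follower_id: int, followee_id: int) -> None:
    followers_of[followee_id].discard(follower_id)
    following[follower_id].discard(followee_id)

follow(1, 42)
follow(2, 42)
print(sorted(followers_of[42]))  # "who follows 42" -> [1, 2]
print(sorted(following[1]))      # "who does 1 follow" -> [42]
```

Both queries hit exactly one partition key, which is what keeps reads under the 5ms budget; the cost is the second write on every follow.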
Follower counts live in the users table, updated asynchronously by a counter service. Why not COUNT queries? Because counting 10M followers on every read would take seconds.
What if the interviewer asks: why not a graph database like Neo4j?
Neo4j works well for multi-hop traversals (friends-of-friends), but our access pattern is single-hop (direct followers only). Cassandra's partition reads are faster for this pattern and scale horizontally.
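The asynchronous counter service described above can be sketched as a buffer-and-flush loop. This is an assumed design for illustration: events are buffered, deltas are folded together, and the denormalized count is updated in a batch, which is exactly why a count can lag a follow by 1-2 seconds.

```python
from collections import Counter, defaultdict

# Denormalized follower_count, standing in for the column on the users table.
follower_count = defaultdict(int)
# Buffered (followee_id, delta) events: +1 for follow, -1 for unfollow.
event_buffer: list[tuple[int, int]] = []

def record(followee_id: int, delta: int) -> None:
    """Called on the follow/unfollow write path; cheap, no count update yet."""
    event_buffer.append((followee_id, delta))

def flush() -> None:
    """Periodic batch apply; until this runs, reads see a slightly stale count."""
    deltas = Counter()
    for followee_id, delta in event_buffer:
        deltas[followee_id] += delta
    event_buffer.clear()
    for followee_id, delta in deltas.items():
        follower_count[followee_id] += delta

record(42, +1)
record(42, +1)
record(42, -1)
flush()
print(follower_count[42])  # 1
```

Reading a follower count is now a single-row lookup instead of a COUNT over millions of rows, at the cost of the bounded staleness the trade-off paragraph already accepts.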
Related concepts