HTTP Polling Instead of WebSocket
Very Common
Polling for new messages every few seconds wastes bandwidth and adds latency. At 500M users polling every 5 seconds, that is 100M HTTP requests/sec for mostly empty responses.
Why: HTTP polling is simpler to implement: standard request-response, no connection state, works with any load balancer. Candidates default to what they know. But at 500M users polling every 5 seconds: 500M/5=100M requests/sec. With 200 bytes of HTTP headers per request, that is 20 GB/sec of wasted bandwidth for mostly empty responses ('no new messages'). Message delivery latency averages half the polling interval (2.5 seconds), unacceptable for a chat system targeting 200ms.
WRONG: Poll the server via HTTP GET /messages every 5 seconds. At 500M users, that is 100M requests/sec with 20 GB/sec of header overhead. Average delivery latency: 2.5 seconds. Over 95% of responses are empty. The server wastes CPU parsing and responding to 95M useless requests per second.
RIGHT: Use WebSocket for persistent full-duplex connections. After the initial HTTP upgrade handshake (one-time cost), messages flow with 2-6 bytes of frame overhead instead of 200 bytes of HTTP headers. Delivery latency: under 200ms. Zero wasted requests. We chose WebSocket over Server-Sent Events (SSE) because SSE is unidirectional (server-to-client only), so sending a message would still require a separate HTTP POST. Trade-off accepted: WebSocket connections are stateful, requiring a connection registry for routing.
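A minimal sketch of the gateway side, assuming Python and the websockets library (the port, the identify-yourself first frame, and the route_message stub are illustrative, not part of the design):

```python
# Gateway sketch using the `websockets` library. After the one-time HTTP
# upgrade handshake, each message costs a 2-6 byte frame header instead
# of ~200 bytes of HTTP headers per poll.
import asyncio
import websockets

connections = {}  # user_id -> live WebSocket held by this gateway

async def handle(ws):
    user_id = await ws.recv()            # illustrative: first frame identifies the user
    connections[user_id] = ws
    try:
        async for frame in ws:           # full duplex: client sends any time, no polling
            await route_message(user_id, frame)
    finally:
        connections.pop(user_id, None)   # clean up on disconnect

async def route_message(sender_id, frame):
    ...  # look up the recipient's gateway in the connection registry and forward

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8080):
        await asyncio.Future()           # run forever

asyncio.run(main())
```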
Global Sequence Counter for Message Ordering
Very Common
A single auto-increment counter for all messages creates a serialization bottleneck at 231K/sec. Per-conversation counters eliminate cross-conversation contention entirely.
Why: A global counter gives total ordering: every message has a unique, strictly increasing number. It is the simplest mental model. But at 231K messages/sec, every increment serializes behind a single lock. Even Redis INCR, which handles 100K ops/sec per shard, needs 3 dedicated shards. More critically, the counter becomes a single point of failure: if it goes down, no messages can be ordered. Candidates do not realize that chat messages only need ordering within a conversation, not across all conversations.
WRONG: Use a global Redis counter incremented for every message. At 231K/sec, the counter serializes writes with 5-10ms of contention latency per increment. If the counter node fails, all messages across all conversations are unorderable until failover completes.
RIGHT: Use per-conversation sequence numbers: one Redis counter per conversation. At 20B messages across ~10B conversations, each counter increments ~2 times/day on average. Contention per counter is effectively zero. We chose this over Lamport timestamps because Lamport clocks provide causal ordering but not total ordering, leaving ambiguity for concurrent messages. Trade-off accepted: no global ordering across conversations, but users never compare message order across conversations.
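A sketch of the per-conversation counter, assuming redis-py (the key naming is illustrative):

```python
# Per-conversation sequence numbers. INCR is atomic per key, so contention
# exists only among concurrent writers in the *same* conversation
# (~2 increments/day per counter on average), not across all messages.
import redis

r = redis.Redis()

def next_seq(conversation_id: str) -> int:
    return r.incr(f"seq:{conversation_id}")

# Usage: stamp each message with its conversation-local order before persisting.
seq = next_seq("conv:12345")
```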
Eager Presence Broadcast
Very Common
Broadcasting every status change to all contacts generates 417M events/sec at morning peak. Lazy presence with subscriber-only push reduces this by 100x.
Why: It feels correct: when User A comes online, notify all 500 contacts immediately so their UI updates. The feature works perfectly in testing with 10 users. But at morning peak, 500M users come online over 10 minutes. Each triggers 500 notifications: 500M×500/600s=417M presence events/sec. No pub/sub system handles this. Even Kafka at 10M/sec would need 42 partitions just for presence, and consumers could not keep up.
WRONG: When a user's status changes, broadcast to all 500 contacts via the pub/sub layer. At morning peak: 417M events/sec. The pub/sub system (Redis Pub/Sub, Kafka) saturates. Presence updates back up. Users see stale online/offline status for minutes.
RIGHT: Use lazy presence: store status in Redis with 60-second TTL. Only push status changes to users who are actively viewing a chat with that contact. When User A opens a chat with User B, A subscribes to B's presence channel. This reduces presence traffic by 100x because only active viewers (~0.1% of contacts) receive push updates. We chose TTL-based auto-expiry (not explicit offline notifications) because network disconnects are common and the TTL handles implicit offline detection without extra logic. Trade-off accepted: presence accuracy has a 30-60 second lag.
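A sketch of lazy presence, assuming redis-py (key and channel names are illustrative):

```python
# Lazy presence: a missing key means offline, so a dropped connection
# goes "offline" automatically when the TTL lapses, with no explicit
# offline notification needed.
import redis

r = redis.Redis()
PRESENCE_TTL = 60  # seconds

def heartbeat(user_id: str) -> None:
    # Client pings periodically while the app is open.
    r.set(f"presence:{user_id}", "online", ex=PRESENCE_TTL)

def open_chat(viewer_id: str, contact_id: str):
    # Subscribe only while actively viewing the chat: the ~0.1% of
    # contacts who are watching, not all 500.
    pubsub = r.pubsub()
    pubsub.subscribe(f"presence:{contact_id}:events")
    return pubsub

def is_online(user_id: str) -> bool:
    return r.exists(f"presence:{user_id}") == 1
```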
Storing Messages in MySQL
Very Common
MySQL's B-tree indexes require random disk I/O for each insert. At 231K writes/sec, replication lag grows to minutes. Cassandra's LSM-tree handles append-only writes at O(1).
Why: MySQL is the default choice for structured data. A messages table with indexes on (conversation_id, created_at) seems straightforward, and at low volume it works. But at 231K inserts/sec, B-tree index maintenance forces constant page splits, and the resulting random I/O kills throughput. MySQL replication lag grows from milliseconds to minutes. Secondary indexes double the write cost. Cassandra's LSM-tree converts random writes to sequential appends, handling 231K writes/sec on a modest cluster.
WRONG: Store all messages in MySQL with indexes on (conversation_id, created_at). At 231K writes/sec, B-tree indexes require random disk I/O for each insert. Replication lag grows to minutes. A 3 TB/day growth rate fills a MySQL instance in days.
RIGHT: Use Cassandra with partition key = conversation_id, clustering key = message_id DESC. Cassandra's LSM-tree converts random writes to sequential appends at O(1) amortized. Co-locating messages by conversation_id means 'load chat history' is a single-partition sequential read. We chose Cassandra over DynamoDB because DynamoDB's provisioned throughput pricing at 231K writes/sec is ~$100K/month, and we lose control over compaction strategy. Trade-off accepted: Cassandra is eventually consistent, so two users may see slightly different message counts for a brief window.
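A sketch of the table and the history read, assuming the DataStax Python driver (keyspace, table, and column names are illustrative):

```python
# Message-table sketch. Partition key conversation_id co-locates a
# conversation's messages; clustering on message_id DESC makes "load
# recent history" one sequential read from one partition.
from cassandra.cluster import Cluster

session = Cluster(["cassandra-seed"]).connect("chat")

session.execute("""
    CREATE TABLE IF NOT EXISTS messages (
        conversation_id text,
        message_id      timeuuid,
        sender_id       text,
        body            text,
        PRIMARY KEY ((conversation_id), message_id)
    ) WITH CLUSTERING ORDER BY (message_id DESC)
""")

# Newest 50 messages for one chat: a single-partition read, no scatter-gather.
rows = session.execute(
    "SELECT * FROM messages WHERE conversation_id = %s LIMIT 50",
    ("conv:12345",),
)
```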
Broadcasting Messages to All Servers
Common
Without a connection registry, every message is broadcast to all 10K gateway servers. That is 2.31B network calls/sec. A Redis lookup costs 1ms and eliminates 9,999 wasted calls per message.
Why: Broadcasting avoids maintaining a connection registry. Every server receives every message and checks if the recipient is connected locally. This works with 10 servers. At 10,000 servers and 231K messages/sec: 231K×10,000=2.31B network calls/sec. Each server receives 231K messages/sec regardless of how many recipients it holds. CPU is wasted on 99.99% of messages that have no local recipient.
WRONG: Broadcast every message to all 10,000 gateway servers. Each server checks if the recipient is locally connected. Total network calls: 2.31B/sec. Each server processes 231K irrelevant messages/sec, wasting CPU on discard logic.
RIGHT: Maintain a connection registry in Redis: key = conn:user_id, value = server_id, TTL = 90 seconds. On each message, one Redis GET (1ms) identifies the recipient's gateway. Forward directly. We chose Redis over an in-memory service mesh because Redis handles 100K+ lookups/sec per shard with sub-millisecond latency and provides TTL-based auto-cleanup when connections drop. Trade-off accepted: the registry adds a 1ms lookup per message and a Redis dependency, but eliminates 9,999 wasted network calls per message.
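A sketch of the registry, assuming redis-py (the offline-delivery and gateway-forwarding helpers are hypothetical):

```python
# Connection registry: one ~1ms GET replaces a broadcast to all 10,000
# gateways. The TTL expires entries for connections that die silently.
import redis

r = redis.Redis()
REGISTRY_TTL = 90  # seconds; refreshed by heartbeats

def register(user_id: str, server_id: str) -> None:
    # Called on WebSocket connect and on each heartbeat.
    r.set(f"conn:{user_id}", server_id, ex=REGISTRY_TTL)

def route(recipient_id: str, message: bytes) -> None:
    server_id = r.get(f"conn:{recipient_id}")
    if server_id is None:
        queue_offline(recipient_id, message)             # hypothetical offline path
    else:
        forward_to_gateway(server_id.decode(), message)  # hypothetical internal RPC
```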
Synchronous Cassandra Write Before Ack
Common
Writing to Cassandra before acking the sender adds 50ms of p99 latency to delivery. Kafka decouples delivery from persistence, keeping ack time at 12ms.
Why: It feels safe: persist the message before telling the sender it was delivered. Guarantees durability. But Cassandra write latency is 5ms at p50 and 50ms at p99. Adding 50ms to the delivery path pushes total latency above 60ms at p99, and during compaction storms, Cassandra writes can spike to 200ms+. The sender's UI freezes until the ack arrives. Users perceive the chat as 'laggy'.
WRONG: Chat service writes to Cassandra synchronously, waits for ack, then delivers to the recipient and acks the sender. Path: routing (5ms) + Cassandra write (50ms p99) + gateway push (5ms) = 60ms minimum. During compaction: 200ms+. Sender perceives lag.
RIGHT: Publish to Kafka (2ms), then deliver to recipient and ack the sender in parallel. Total: routing (5ms) + Kafka publish (2ms) + gateway push (5ms) = 12ms. Cassandra consumers write asynchronously from Kafka with days of retention as buffer. We chose Kafka over an in-memory queue because Kafka provides durable retention: if Cassandra is down for hours, no messages are lost. Trade-off accepted: there is a brief window where the message is delivered to the recipient but not yet persisted in Cassandra. If Cassandra is down during that window and the Kafka consumer crashes, the message exists only in Kafka, which has its own replication (RF=3).
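A sketch of the publish-then-ack path, assuming kafka-python (the topic name and the delivery/ack helpers are hypothetical; delivery and sender ack can run in parallel, shown sequentially for brevity):

```python
# Ack before persist: a synchronous Kafka publish with acks="all" is
# durable (RF=3) in a few milliseconds, so the sender can be acked
# without waiting on the 50ms-p99 Cassandra write.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda m: json.dumps(m).encode(),
    acks="all",                      # wait for in-sync replicas before returning
)

def handle_send(message: dict) -> None:
    producer.send("messages", message).get(timeout=1)  # durable in Kafka now
    deliver_to_recipient(message)    # hypothetical gateway push (~5ms)
    ack_sender(message["id"])        # hypothetical; no Cassandra on this path
```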
No Deduplication on Message Retry
Common
Without dedup, a lost ack causes the sender to retry and the recipient sees the same message twice. Idempotency keys in Redis prevent duplicate delivery with zero user impact.
Why: In the happy path, messages are delivered exactly once. Candidates test the happy path and move on. But the network is unreliable. The server delivers to User B, writes to Cassandra, but the ack to User A is lost in transit. A's client retries after 3 seconds. Without dedup, B sees the message twice. At a 0.1% retry rate and 231K messages/sec, that is 231 duplicate messages per second, roughly 20M duplicates per day visible to users.
WRONG: No deduplication logic. The server processes every incoming message as new. At 0.1% retry rate: 20M duplicate messages/day visible to recipients. Users see repeated messages and lose trust in the platform.
RIGHT: Client attaches an idempotency_key (UUID) to every message. Server checks Redis SET (dedup:key, TTL 24h) before processing. If found, return the original ack without re-delivering. Dedup cache: 231K×86,400×40B≈800 GB across ~15 Redis nodes. We chose client-generated UUIDs over server-generated dedup tokens because the client needs the key before the first send attempt, not after. Trade-off accepted: 800 GB of Redis for the dedup cache, but this prevents 20M daily visible duplicates.
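A sketch of the dedup check, assuming redis-py (the delivery and ack-storage helpers are hypothetical):

```python
# Idempotency-key dedup. SET with nx=True is atomic: only the first
# delivery attempt wins the key; a retry after a lost ack is re-acked,
# never re-delivered to the recipient.
import redis

r = redis.Redis()
DEDUP_TTL = 24 * 3600  # retries arrive within seconds; 24h is a wide margin

def process(idempotency_key: str, message: dict):
    if r.set(f"dedup:{idempotency_key}", 1, nx=True, ex=DEDUP_TTL):
        deliver_and_persist(message)           # hypothetical normal path
        return build_ack(idempotency_key)      # hypothetical
    return load_original_ack(idempotency_key)  # hypothetical: ack again, no redelivery
```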
Treating Gateways as Stateless
Common
Gateways hold WebSocket connections in memory. Round-robin load balancing sends a message to a random gateway that does not hold the recipient's connection. Use the connection registry for targeted routing.
Why: Stateless services are the default in modern architectures. Load balancers distribute requests evenly via round-robin. But gateways are inherently stateful: each holds a set of WebSocket connections. A message for User B must reach the specific gateway holding B's connection. Round-robin sends the message to any of 10,000 gateways, and 9,999 of them cannot deliver it. The message is lost or must be re-routed, adding 10-50ms of latency.
WRONG: Deploy gateway servers behind a round-robin load balancer like any stateless service. Messages for User B land on a random gateway. 99.99% of the time, that gateway does not hold B's connection. Messages are lost or require an internal re-routing layer that adds 10-50ms latency.
RIGHT: Accept that gateways are stateful. Use a connection registry (Redis) to map user_id to gateway server_id. Route messages directly to the correct gateway. The load balancer is used only for initial connection establishment (via a least-connections algorithm, not round-robin). Once connected, all messages for that user bypass the load balancer and go directly to the gateway via internal routing. Trade-off accepted: stateful gateways complicate deployments (rolling restarts disconnect users), mitigated by graceful drain.
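A sketch of such a graceful drain, with the reconnect opcode and deregistration helper assumed rather than specified by the design:

```python
# Graceful drain for rolling restarts of a stateful gateway. Precondition:
# this instance already fails its load-balancer health check, so no new
# WebSocket connections arrive while draining.
import asyncio

async def drain(connections: dict, batch: int = 1000, pause: float = 1.0):
    for i, (user_id, ws) in enumerate(list(connections.items())):
        await ws.send('{"op": "reconnect"}')  # illustrative: client reconnects via the LB
        await ws.close()
        deregister(user_id)                   # hypothetical: drop conn:user_id from Redis
        if (i + 1) % batch == 0:
            await asyncio.sleep(pause)        # pace closures to avoid a reconnect storm
```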