Chat System System Design Walkthrough
Complete design walkthrough with animated diagrams, capacity math, API design, schema, and failure modes.
Solution PathTarget: 30 min
We designed a chat system for 500M DAU exchanging 20B messages/day at 231K/sec. 10,000 stateful gateway servers hold 50K WebSocket connections each. Per-conversation sequence numbers (not a global counter) order messages with zero contention. At-least-once delivery with client-side idempotency keys gives exactly-once semantics. Cassandra stores 3 TB/day partitioned by conversation_id. Redis handles presence at 16.7M heartbeats/sec with 60-second TTL auto-expiry. Kafka decouples delivery latency from storage, keeping p99 under 200ms.
1/10
1.
What is Chat System?
We are designing a real-time messaging system. Think WhatsApp: you type a message and it appears on your friend's screen in under 200ms. 100 billion messages per day flow through WhatsApp alone, with 500M users connected simultaneously.
The system sounds simple, but the real challenge is the tension between statefulness and scalability. Each user holds a persistent WebSocket to a specific server.
When that server crashes, 50,000 users lose their connections simultaneously, reconnect in seconds, and any messages sent during that window must not be lost. The second challenge is message ordering.
When two people type at the same time, every device must display messages in the same order. A global counter would work, but at 231K msg/sec it serializes every message.
We need ordering consistent within each conversation without global coordination.
20B messages/day, 500M concurrent WebSocket connections, two hard problems: stateful connection management across 10K servers + message ordering without global coordination.
We are designing a real-time messaging system where users send and receive text messages instantly. The system sounds simple, but the real challenge is maintaining 500 million persistent WebSocket connections across 10,000 servers while guaranteeing that every message arrives exactly once, in the correct order, even when servers crash mid-delivery. Each message must be stored durably, delivered to the recipient in under 200ms if online, or queued for push notification if offline. The engineering tension is between connection statefulness (each server holds user connections in memory) and horizontal scalability (we need thousands of servers that can fail independently).
- 500M DAU exchanging 20B messages/day at 231K messages/sec
- WebSocket connections held across 10K servers at 50K connections/server
- 3 TB/day message storage in Cassandra partitioned by conversation_id
- Sub-200ms delivery latency with at-least-once guarantee and client-side dedup