Chat System / Messenger
VERY COMMON — Chat system design comes up at every FAANG company. It is how WhatsApp delivers roughly 100 billion messages per day to 2 billion users with sub-200ms latency. You will design a WebSocket connection layer that holds 500 million persistent connections, a message ordering system built on per-conversation sequence numbers, and a Cassandra-backed storage layer that handles 231K writes per second.
- Design WebSocket connection management across 10,000 stateful servers
- Build message ordering with per-conversation sequence numbers and at-least-once delivery
- Avoid the presence broadcast storm that melts the pub/sub layer
Visual Solutions
Step-by-step animated walkthroughs with capacity estimation, API design, database schema, and failure modes built in.
Cheat Sheet
Key concepts, trade-offs, and quick-reference notes for Chat System. Everything you need at a glance.
Anti-Patterns
Common design mistakes candidates make. Wrong approaches vs correct approaches for each trap.
Failure Modes
What breaks in production, how to detect it, and how to fix it. Detection metrics, mitigations, and severity ratings.
Start simple. Build to staff-level.
“Chat system for 500M DAU exchanging 20B messages/day at 231K msg/sec. Each message averages 150 bytes; daily storage is 3 TB in Cassandra partitioned by conversation_id. 500M WebSocket connections across 10K servers at 50K each. Per-conversation sequence numbers (not global counters) for ordering because a global counter bottlenecks at 231K writes/sec. At-least-once delivery with client-side idempotency keys for dedup. Presence on Redis with 30-second heartbeats. Kafka buffers for async persistence, keeping delivery under 200ms p99.”
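The capacity figures in the pitch can be sanity-checked with quick back-of-envelope arithmetic. A sketch using only the numbers quoted above:

```python
# Back-of-envelope check for the capacity numbers in the pitch.
DAU = 500_000_000
MESSAGES_PER_DAY = 20_000_000_000
AVG_MESSAGE_BYTES = 150
GATEWAY_SERVERS = 10_000
SECONDS_PER_DAY = 86_400

writes_per_sec = MESSAGES_PER_DAY // SECONDS_PER_DAY            # ~231K msg/sec
daily_storage_tb = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES / 1e12  # ~3 TB/day
conns_per_server = DAU // GATEWAY_SERVERS                       # 50K per gateway

print(writes_per_sec, daily_storage_tb, conns_per_server)
```

This is where the "231K writes/sec", "3 TB/day", and "50K connections per server" figures come from: the same three inputs (20B messages/day, 150 bytes each, 10K servers) drive all of them.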
WebSocket Connection Management
STANDARD — We chose WebSocket (not HTTP polling, not SSE) for full-duplex messaging: 500M connections across 10K servers at 50K each. A connection registry in Redis maps each user to their gateway for targeted routing.
High Level Design · Message Ordering & Delivery
TRICKY — We chose per-conversation sequence numbers (not a global counter) because per-counter contention is near zero at ~2 increments per conversation per day. At-least-once delivery plus client-side idempotency keys for dedup yields effectively-exactly-once semantics.
Core Feature Design · Presence Service
STANDARD — We chose lazy presence (not eager broadcast) because broadcasting every status change to all contacts generates 417M events/sec. Redis keys with a 30-second heartbeat TTL; push updates only to users actively viewing a conversation.
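"Lazy" means presence is computed only when someone actually looks at it. A sketch of the TTL check, with a dict standing in for Redis keys set via `SETEX presence:<user> 30` (names assumed):

```python
HEARTBEAT_TTL = 30.0  # seconds; a user is offline once this window lapses

# last_heartbeat stands in for Redis keys with a 30s expiry.
last_heartbeat: dict[str, float] = {}

def heartbeat(user_id: str, now: float) -> None:
    last_heartbeat[user_id] = now          # SETEX presence:<user> 30 "1"

def is_online(user_id: str, now: float) -> bool:
    # Lazy read: evaluated on demand, never broadcast on every change.
    ts = last_heartbeat.get(user_id)
    return ts is not None and now - ts < HEARTBEAT_TTL

heartbeat("alice", now=100.0)
print(is_online("alice", now=120.0))  # True: 20s since last heartbeat
print(is_online("alice", now=140.0))  # False: heartbeat expired
```

The broadcast storm disappears because a status change writes one key; nobody is notified unless they are already subscribed to that conversation's view.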
High Level Design · Chat Server Architecture
STANDARD — We separated stateful gateways (connection-bound, epoll, 50K connections each) from stateless chat services (CPU-bound routing logic): different scaling axes, independent failure modes.
High Level Design · Message Storage (Cassandra)
EASY — We chose Cassandra (not MySQL) because the 231K writes/sec are append-only and reads are sequential per conversation — a natural fit for LSM-tree storage, where writes are amortized sequential appends. Partition by conversation_id.
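The access pattern that motivates this choice can be sketched as a partition-per-conversation model: partition key `conversation_id`, clustering key `seq`, so every read is a contiguous scan inside one partition. A sketch (column names are illustrative; the CQL in the comment is an assumed equivalent):

```python
from collections import defaultdict

# Storage model sketch. In CQL this would be roughly:
#   CREATE TABLE messages (
#     conversation_id text, seq bigint, sender text, body text,
#     PRIMARY KEY ((conversation_id), seq)
#   );
# Partition key = conversation_id, clustering key = seq.
messages = defaultdict(list)  # conversation_id -> [(seq, sender, body), ...]

def write(conv_id: str, seq: int, sender: str, body: str) -> None:
    messages[conv_id].append((seq, sender, body))  # append-only, LSM-friendly

def read_after(conv_id: str, after_seq: int, limit: int = 50):
    """Sequential read within one partition: messages newer than after_seq."""
    return [m for m in messages[conv_id] if m[0] > after_seq][:limit]

write("c1", 1, "alice", "hi")
write("c1", 2, "bob", "hello")
print(read_after("c1", after_seq=1))  # [(2, 'bob', 'hello')]
```

Because a conversation's messages live in one partition sorted by `seq`, the hot read path ("latest N messages in this chat") never touches a second node's data.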
Database Schema · Group Chat Fan-Out
STANDARD — Store once, deliver many: write the message body once to Cassandra and fan out only 16-byte delivery notifications to online members. Cap groups at 500 members to bound the linear fan-out cost.
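The "store once, deliver many" split can be sketched as follows: the body is written a single time, and only tiny (conversation_id, seq) pointers fan out to whoever is online (function and field names assumed):

```python
MAX_GROUP_SIZE = 500  # cap bounds the linear per-message fan-out cost

def fan_out(conv_id: str, seq: int, members: list[str],
            online: set[str]) -> list[tuple[str, str, int]]:
    """One small notification per online member; the body is stored once.
    Offline members fetch it later via catch-up sync."""
    if len(members) > MAX_GROUP_SIZE:
        raise ValueError("group exceeds fan-out cap")
    return [(member, conv_id, seq) for member in members if member in online]

notes = fan_out("c1", 7, ["alice", "bob", "carol"], online={"alice", "carol"})
print(notes)  # [('alice', 'c1', 7), ('carol', 'c1', 7)]
```

The cap matters because fan-out work grows linearly with group size: at 500 members, one send is at most 500 pointer-sized notifications, never 500 copies of the body.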
Core Feature Design · End-to-End Encryption
TRICKY — Signal Protocol (Double Ratchet + X3DH). The server is a blind relay that stores only ciphertext. Forward secrecy: compromising one message key reveals no other messages. Trade-off: no server-side search or moderation.
Core Feature Design · Push Notifications (Offline)
EASY — Offline users get push via APNs/FCM within 5 seconds. On reconnection, catch-up sync: the client sends its last_seen_seq and the server replays everything newer from Cassandra. No messages are lost.
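Catch-up sync is just a range read keyed off the client's cursor. A sketch (names are illustrative):

```python
# On reconnect the client reports the highest sequence number it has seen;
# the server replays everything newer, in order. Because sequence numbers
# are dense per conversation, nothing can be silently skipped.
def catch_up(stored: list[tuple[int, str]],
             last_seen_seq: int) -> list[tuple[int, str]]:
    """Replay messages with seq > last_seen_seq, oldest first."""
    return [m for m in stored if m[0] > last_seen_seq]

conversation = [(1, "hi"), (2, "hello"), (3, "you there?")]
print(catch_up(conversation, last_seen_seq=1))
# [(2, 'hello'), (3, 'you there?')]
```

This pairs with the Cassandra layout above: since each conversation's partition is clustered by `seq`, the replay is a single sequential scan starting at the client's cursor.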
Replication and Fault Tolerance