Chat System / Messenger
VERY COMMON — Chat system design comes up at every FAANG company. It is how WhatsApp delivers roughly 100 billion messages per day to 2 billion users with sub-200ms latency. You will design a WebSocket connection layer that holds 500 million persistent connections, a message ordering system built on per-conversation sequence numbers, and a Cassandra-backed storage layer that handles 231K writes per second.
- Design WebSocket connection management across 10,000 stateful servers
- Build message ordering with per-conversation sequence numbers and at-least-once delivery
- Avoid the presence broadcast storm that melts the pub/sub layer
Visual Solutions
Step-by-step animated walkthroughs with capacity estimation, API design, database schema, and failure modes built in.
Cheat Sheet
Key concepts, trade-offs, and quick-reference notes for Chat System. Everything you need at a glance.
Anti-Patterns
Common design mistakes candidates make. Wrong approaches vs correct approaches for each trap.
Failure Modes
What breaks in production, how to detect it, and how to fix it. Detection metrics, mitigations, and severity ratings.
Start simple. Build to staff-level.
“Chat system for 500M DAU exchanging 20B messages/day at 231K msg/sec. Each message averages 150 bytes; daily storage is 3 TB in Cassandra partitioned by conversation_id. 500M WebSocket connections across 10K servers at 50K each. Per-conversation sequence numbers (not global counters) for ordering because a global counter bottlenecks at 231K writes/sec. At-least-once delivery with client-side idempotency keys for dedup. Presence on Redis with 30-second heartbeats. Kafka buffers for async persistence, keeping delivery under 200ms p99.”
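The capacity figures in the pitch can be sanity-checked with quick back-of-envelope arithmetic. A sketch using only the numbers quoted above:

```python
# Back-of-envelope check for the capacity numbers in the pitch.
DAU = 500_000_000
MESSAGES_PER_DAY = 20_000_000_000
AVG_MESSAGE_BYTES = 150
GATEWAY_SERVERS = 10_000
SECONDS_PER_DAY = 86_400

writes_per_sec = MESSAGES_PER_DAY // SECONDS_PER_DAY            # ~231K msg/sec
daily_storage_tb = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES / 1e12  # ~3 TB/day
conns_per_server = DAU // GATEWAY_SERVERS                       # 50K per gateway

print(writes_per_sec, daily_storage_tb, conns_per_server)
```

This is where the "231K writes/sec", "3 TB/day", and "50K connections per server" figures come from: the same three inputs (20B messages/day, 150 bytes each, 10K servers) drive all of them.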
WebSocket Connection Management
STANDARD — We chose WebSocket (not HTTP polling, not SSE) for full-duplex messaging: 500M connections across 10K servers at 50K each. A connection registry in Redis maps each user to their gateway for targeted routing.
High Level Design · Message Ordering & Delivery
TRICKY — We chose per-conversation sequence numbers (not a global counter) because per-counter contention is near zero at ~2 increments per conversation per day. At-least-once delivery plus client-side idempotency keys for dedup yields effectively-exactly-once semantics.
Core Feature Design · Presence Service
STANDARD — We chose lazy presence (not eager broadcast) because broadcasting every status change to all contacts generates 417M events/sec. Redis keys with a 30-second heartbeat TTL; push updates only to users actively viewing a conversation.
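"Lazy" means presence is computed only when someone actually looks at it. A sketch of the TTL check, with a dict standing in for Redis keys set via `SETEX presence:<user> 30` (names assumed):

```python
HEARTBEAT_TTL = 30.0  # seconds; a user is offline once this window lapses

# last_heartbeat stands in for Redis keys with a 30s expiry.
last_heartbeat: dict[str, float] = {}

def heartbeat(user_id: str, now: float) -> None:
    last_heartbeat[user_id] = now          # SETEX presence:<user> 30 "1"

def is_online(user_id: str, now: float) -> bool:
    # Lazy read: evaluated on demand, never broadcast on every change.
    ts = last_heartbeat.get(user_id)
    return ts is not None and now - ts < HEARTBEAT_TTL

heartbeat("alice", now=100.0)
print(is_online("alice", now=120.0))  # True: 20s since last heartbeat
print(is_online("alice", now=140.0))  # False: heartbeat expired
```

The broadcast storm disappears because a status change writes one key; nobody is notified unless they are already subscribed to that conversation's view.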
High Level Design · Chat Server Architecture
STANDARD — We separated stateful gateways (connection-bound, epoll, 50K connections each) from stateless chat services (CPU-bound routing logic): different scaling axes, independent failure modes.
High Level Design · Message Storage (Cassandra)
EASY — We chose Cassandra (not MySQL) because the 231K writes/sec are append-only and reads are sequential per conversation — a natural fit for LSM-tree storage, where writes are amortized sequential appends. Partition by conversation_id.
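The access pattern that motivates this choice can be sketched as a partition-per-conversation model: partition key `conversation_id`, clustering key `seq`, so every read is a contiguous scan inside one partition. A sketch (column names are illustrative; the CQL in the comment is an assumed equivalent):

```python
from collections import defaultdict

# Storage model sketch. In CQL this would be roughly:
#   CREATE TABLE messages (
#     conversation_id text, seq bigint, sender text, body text,
#     PRIMARY KEY ((conversation_id), seq)
#   );
# Partition key = conversation_id, clustering key = seq.
messages = defaultdict(list)  # conversation_id -> [(seq, sender, body), ...]

def write(conv_id: str, seq: int, sender: str, body: str) -> None:
    messages[conv_id].append((seq, sender, body))  # append-only, LSM-friendly

def read_after(conv_id: str, after_seq: int, limit: int = 50):
    """Sequential read within one partition: messages newer than after_seq."""
    return [m for m in messages[conv_id] if m[0] > after_seq][:limit]

write("c1", 1, "alice", "hi")
write("c1", 2, "bob", "hello")
print(read_after("c1", after_seq=1))  # [(2, 'bob', 'hello')]
```

Because a conversation's messages live in one partition sorted by `seq`, the hot read path ("latest N messages in this chat") never touches a second node's data.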
Database Schema · Group Chat Fan-Out
STANDARD — Store once, deliver many: write the message body once to Cassandra and fan out only 16-byte delivery notifications to online members. Cap groups at 500 members to bound the linear fan-out cost.
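The "store once, deliver many" split can be sketched as follows: the body is written a single time, and only tiny (conversation_id, seq) pointers fan out to whoever is online (function and field names assumed):

```python
MAX_GROUP_SIZE = 500  # cap bounds the linear per-message fan-out cost

def fan_out(conv_id: str, seq: int, members: list[str],
            online: set[str]) -> list[tuple[str, str, int]]:
    """One small notification per online member; the body is stored once.
    Offline members fetch it later via catch-up sync."""
    if len(members) > MAX_GROUP_SIZE:
        raise ValueError("group exceeds fan-out cap")
    return [(member, conv_id, seq) for member in members if member in online]

notes = fan_out("c1", 7, ["alice", "bob", "carol"], online={"alice", "carol"})
print(notes)  # [('alice', 'c1', 7), ('carol', 'c1', 7)]
```

The cap matters because fan-out work grows linearly with group size: at 500 members, one send is at most 500 pointer-sized notifications, never 500 copies of the body.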
Core Feature Design · End-to-End Encryption
TRICKY — Signal Protocol (Double Ratchet + X3DH). The server is a blind relay that stores only ciphertext. Forward secrecy: compromising one message key reveals no other messages. Trade-off: no server-side search or moderation.
Core Feature Design · Push Notifications (Offline)
EASY — Offline users get push via APNs/FCM within 5 seconds. On reconnection, catch-up sync: the client sends its last_seen_seq and the server replays everything newer from Cassandra. No messages are lost.
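Catch-up sync is just a range read keyed off the client's cursor. A sketch (names are illustrative):

```python
# On reconnect the client reports the highest sequence number it has seen;
# the server replays everything newer, in order. Because sequence numbers
# are dense per conversation, nothing can be silently skipped.
def catch_up(stored: list[tuple[int, str]],
             last_seen_seq: int) -> list[tuple[int, str]]:
    """Replay messages with seq > last_seen_seq, oldest first."""
    return [m for m in stored if m[0] > last_seen_seq]

conversation = [(1, "hi"), (2, "hello"), (3, "you there?")]
print(catch_up(conversation, last_seen_seq=1))
# [(2, 'hello'), (3, 'you there?')]
```

This pairs with the Cassandra layout above: since each conversation's partition is clustered by `seq`, the replay is a single sequential scan starting at the client's cursor.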
Replication and Fault Tolerance