Chat System Failure Modes
What breaks, how to detect it, and how to fix it. Each failure mode includes its triggering constraint, detection signals, and mitigations.
Chat server crash loses 50K active connections
50K users lose connectivity. Without catch-up sync, messages during the outage window are lost.
Constraint: each gateway holds 50K WebSocket connections in memory. A crash (OOM, bad deploy) drops all 50K instantly. The Redis connection registry still points to the dead server for up to 90s (TTL). What breaks: the routing layer sends messages to a dead server, and 50K users are unreachable until they reconnect. Detection: gateway heartbeat stops; the health checker marks it dead within 10s. Recovery: clients reconnect with exponential backoff + jitter, typically within 2-5s. Messages sent during the gap are already stored in Cassandra; on reconnection, the client sends last_seen_sequence_number and the server replays what it missed.
- Clients reconnect with exponential backoff + jitter (base 1s, max 30s, random 0-1s jitter) to prevent thundering herd.
- Cassandra stores all messages durably. Catch-up sync on reconnection replays missed messages using last_seen_sequence_number.
- Connection registry TTL (90s) auto-expires stale entries, preventing persistent misrouting.
- Graceful drain for planned restarts: stop accepting new connections, then wait 30 seconds for existing connections to close naturally or for clients to migrate to another gateway.
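The reconnect schedule above can be sketched as a small delay function. This is a minimal illustration of exponential backoff with jitter using the stated parameters (base 1s, max 30s, 0-1s uniform jitter); the names `BASE`, `MAX_DELAY`, and `reconnect_delay` are illustrative, not from any real client library.

```python
import random

BASE = 1.0       # seconds: delay before the first retry
MAX_DELAY = 30.0  # cap so late retries do not grow unboundedly
JITTER = 1.0     # uniform 0-1s added so clients do not retry in lockstep

def reconnect_delay(attempt: int) -> float:
    """Delay in seconds before reconnect attempt `attempt` (0-indexed)."""
    delay = min(BASE * (2 ** attempt), MAX_DELAY)
    return delay + random.uniform(0, JITTER)
```

Attempt 0 waits roughly 1s, attempt 1 roughly 2s, and so on, capping at 30s plus jitter. The jitter term is what prevents 50K clients from hammering the surviving gateways at the same instant.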
Message delivered but ack lost (duplicate delivery)
Duplicate messages erode user trust. Without dedup, 20M duplicates/day are visible to users.
Constraint: network between server and client is unreliable. Server delivers a message to User B, writes to Cassandra, sends ack to User A. The ack packet is lost (network blip, mobile handoff). A's client retries after 3s. Without dedup, B sees the message twice. What breaks: at 0.1% retry rate and 231K messages/sec, that is ~20M duplicates per day. Detection: dedup cache hit rate in Redis exceeds 0.1% (normal is near 0%). Recovery: idempotency_key check in Redis dedup cache prevents re-delivery. Fallback: client-side dedup using conversation sequence number.
- Client generates a UUID idempotency_key per message. Server checks Redis SET (dedup:key, TTL 24h) before processing.
- If dedup cache is unavailable, client-side dedup: drop messages with sequence number <= last_seen.
- Periodic reconciliation: sample conversations and verify no duplicate sequence numbers.
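The idempotency check can be sketched as follows. A plain dict with expiry timestamps stands in for the Redis dedup cache (in production this would be an atomic Redis SET with NX and a 24h TTL); the class and method names are illustrative.

```python
import time

DEDUP_TTL = 24 * 3600  # seconds: matches the 24h Redis TTL in the design

class DedupCache:
    """In-memory stand-in for the Redis dedup cache."""

    def __init__(self):
        self._seen = {}  # idempotency_key -> expiry timestamp

    def first_delivery(self, idempotency_key: str, now=None) -> bool:
        """True if this key has not been processed within the TTL."""
        now = now if now is not None else time.time()
        expiry = self._seen.get(idempotency_key)
        if expiry is not None and expiry > now:
            return False  # duplicate: the ack was lost and the sender retried
        self._seen[idempotency_key] = now + DEDUP_TTL
        return True
```

The server calls `first_delivery` before processing; a retried message with the same key within 24h is dropped, and the server simply re-sends the ack.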
Presence heartbeat storm during mass reconnect
Millions of users appear offline to their contacts despite being connected. Creates confusion and reduces engagement.
Constraint: presence Redis is sized for 16.7M heartbeats/sec steady state. A major ISP partition disconnects 10M users at once. When connectivity is restored, all 10M reconnect within 5s: a burst of roughly 2M reconnect heartbeats/sec, concentrated on a few shards. What breaks: Redis CPU spikes above 90%, heartbeat latency exceeds 50ms, new heartbeats time out. TTLs expire for still-online users, making them appear offline. Detection: Redis CPU >90% on presence nodes. Heartbeat timeout rate exceeds 1%. Mass false-offline events despite active connections. Recovery: jittered backoff spreads reconnection over 30s.
- Clients use jittered exponential backoff on reconnection: spread heartbeats over a 30-second window instead of all at T+0.
- Gateway rate-limits heartbeat forwarding to presence service: max 100K heartbeats/sec per gateway, buffering excess.
- Presence TTL set to 60 seconds (2x heartbeat interval) provides buffer. Missing one heartbeat does not trigger offline status.
- Circuit breaker: if heartbeat error rate exceeds 5%, gateways stop forwarding heartbeats for 10 seconds, letting the Redis cluster recover.
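The circuit breaker in the last bullet can be sketched as a gateway-side guard over a sliding window of recent heartbeat outcomes. The 5% error threshold and 10-second open window come from the text; the class name, window size, and method names are illustrative assumptions.

```python
import time

class HeartbeatBreaker:
    """Pauses heartbeat forwarding when the recent error rate is too high."""

    def __init__(self, error_threshold=0.05, open_seconds=10.0, window=100):
        self.error_threshold = error_threshold
        self.open_seconds = open_seconds
        self.window = window      # number of recent outcomes to consider
        self.results = []         # recent True/False forwarding outcomes
        self.open_until = 0.0     # timestamp until which forwarding is paused

    def allow(self, now=None) -> bool:
        """May the gateway forward a heartbeat right now?"""
        now = now if now is not None else time.time()
        return now >= self.open_until

    def record(self, success: bool, now=None):
        """Record one forwarding outcome; trip the breaker on high error rate."""
        now = now if now is not None else time.time()
        self.results.append(success)
        if len(self.results) > self.window:
            self.results.pop(0)
        error_rate = self.results.count(False) / len(self.results)
        if len(self.results) >= self.window and error_rate > self.error_threshold:
            self.open_until = now + self.open_seconds
            self.results.clear()  # start a fresh window after recovery
```

While the breaker is open, the gateway drops heartbeats on the floor for 10 seconds; the 60-second presence TTL absorbs the missed beats without flipping users offline.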
Cassandra write timeout at peak message volume
Messages are still delivered in real time via WebSocket. Only persistence is delayed. Risk of data loss only if Kafka retention is also exceeded.
Constraint: Cassandra is sized for 694K writes/sec (3x peak). A viral event (New Year's Eve, sports final) spikes volume 5x to 1.16M writes/sec. Memtables fill faster than flushes. Compaction falls behind. Write latency climbs from 5ms to 500ms+. Kafka consumers cannot keep up; lag grows. What breaks: messages are delivered in real time via WebSocket but not persisted. If a consumer crashes, messages beyond Kafka retention are lost. Detection: write latency p99 exceeding 100ms. Compaction pending exceeding 100 SSTables. Kafka consumer lag exceeding 1M.
- Auto-scale Cassandra cluster: add nodes when write latency p99 exceeds 50ms for 5 minutes. Cassandra rebalances automatically.
- Kafka provides a durable buffer with days of retention. Even if Cassandra is slow, no messages are lost as long as Kafka retains them.
- Reduce write consistency from LOCAL_QUORUM (2 replicas) to ONE (1 replica) temporarily. Faster writes, accept higher risk of data loss from node failure.
- Pre-scale Cassandra capacity before predictable traffic events (holidays, major broadcasts).
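The auto-scale trigger ("p99 above 50ms for 5 minutes") can be sketched as a check over a time series of latency samples. This assumes metric samples arrive as (timestamp, p99) pairs from a scraper; the constants and function name are illustrative.

```python
THRESHOLD_MS = 50.0     # p99 write latency threshold from the design
SUSTAIN_SECONDS = 300   # breach must persist for 5 minutes before scaling

def should_scale(samples) -> bool:
    """samples: list of (timestamp_seconds, p99_latency_ms), oldest first.

    Returns True once the threshold has been continuously breached for
    SUSTAIN_SECONDS, so a brief latency spike never triggers a scale-out.
    """
    breach_start = None
    for ts, p99 in samples:
        if p99 > THRESHOLD_MS:
            if breach_start is None:
                breach_start = ts  # breach window opens
            if ts - breach_start >= SUSTAIN_SECONDS:
                return True
        else:
            breach_start = None    # any healthy sample resets the window
    return False
```

The sustain window is the important design choice: compaction stalls often self-resolve in under a minute, and adding Cassandra nodes mid-incident itself costs I/O for rebalancing, so the trigger should only fire on sustained pressure.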
Split-brain: user connected to two servers
Messages may be delivered to a stale connection. No data loss because catch-up sync recovers them on the active connection.
Constraint: mobile devices switch networks frequently (Wi-Fi to cellular). User A's phone switches to cellular, but the old Wi-Fi WebSocket has not timed out (TCP keepalive 30s). The phone opens a new WebSocket. Now A has connections on two gateways. The registry points to the newer one, but the old gateway holds a stale connection. What breaks: messages are delivered to the old connection the phone is no longer reading. Detection: registry overwrite for the same user within 60s. Recovery: on new connection, the chat service sends a CLOSE frame to the old gateway, terminating the stale socket.
- Session token with monotonically increasing version. New connection always has a higher version. Old connection is invalidated on version mismatch.
- On new connection, chat service sends CLOSE to old gateway. Old gateway terminates the stale WebSocket and removes its local state.
- Client-side: on reconnect, the client discards the old WebSocket and only reads from the new one.
- Catch-up sync: even if messages were delivered to the stale connection, the client replays them from Cassandra on the new connection using last_seen_sequence_number.
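The session-version scheme in the first bullet can be sketched as a small table keyed by user. A frame arriving on a connection whose version is below the newest registered one is treated as stale; the class and method names are illustrative assumptions.

```python
class SessionTable:
    """Tracks the newest session version per user to detect stale sockets."""

    def __init__(self):
        self.current = {}  # user_id -> highest session version seen

    def register(self, user_id: str, version: int) -> bool:
        """Register a new connection; False if an equal-or-newer one exists."""
        if version <= self.current.get(user_id, -1):
            return False  # the "new" connection is itself already stale
        self.current[user_id] = version
        return True

    def is_stale(self, user_id: str, version: int) -> bool:
        """Should a gateway close the socket carrying this version?"""
        return version < self.current.get(user_id, -1)
```

On the Wi-Fi-to-cellular switch, the cellular connection registers version N+1; the old gateway then sees its version N socket as stale and closes it, independent of TCP keepalive timing.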
Group message fan-out delay for 500-member groups
Group messages are delayed by seconds, but 1:1 messages are unaffected if queues are separated.
Constraint: each group message fans out to 499 members. A 500-member group at 100 messages/min generates ~50K deliveries/min (499 × 100). Multiple active large groups push total fan-out into hundreds of thousands/sec. What breaks: fan-out queries conversation_members, checks presence per member, routes to gateways. If undersized, group messages lag 5-10s while 1:1 stays fast. Detection: group fan-out latency exceeding 2s. Queue depth growing. Recovery: dedicated fan-out workers for groups with priority queues so 1:1 is never blocked.
- Separate fan-out queues for 1:1 and group messages. Group fan-out never blocks 1:1 delivery.
- Cache conversation_members for active groups (invalidate on member join/leave). Avoid re-querying MySQL on every message.
- For very active groups, batch fan-out: accumulate 5 seconds of messages and deliver them as a batch to each member, reducing per-message routing overhead.
- Cap group size at 500 members. For larger communities, switch to a pub/sub channel model (like Discord) instead of direct fan-out.
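The 5-second batch fan-out for very active groups can be sketched as a per-group accumulator: messages buffer until the window closes, then deliver as one batch per member instead of one routing pass per message. The window length comes from the text; the class and method names are illustrative.

```python
from collections import defaultdict

BATCH_WINDOW = 5.0  # seconds: accumulate this long before fanning out

class GroupBatcher:
    """Buffers group messages and releases them in per-group batches."""

    def __init__(self):
        self.pending = defaultdict(list)  # group_id -> buffered messages
        self.window_start = {}            # group_id -> time batch was opened

    def add(self, group_id, message, now):
        """Buffer a message; the first message in a batch opens its window."""
        self.pending[group_id].append(message)
        self.window_start.setdefault(group_id, now)

    def flush_due(self, now):
        """Return {group_id: messages} for every batch whose window closed."""
        due = {g for g, t in self.window_start.items()
               if now - t >= BATCH_WINDOW}
        out = {g: self.pending.pop(g) for g in due}
        for g in due:
            del self.window_start[g]
        return out
```

With 100 messages/min in a 500-member group, batching cuts routing passes from ~100/min to ~12/min at the cost of up to 5s extra latency, a trade that only makes sense for high-volume groups, not 1:1 chats.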