TRICKYwalkthrough
Message Ordering and Delivery
Two users send messages at the same millisecond in the same conversation. Which one appears first?
We chose per-conversation sequence numbers (not a global counter, not Lamport timestamps) because each conversation is an independent ordering domain. The chat server increments a conversation-level counter in Redis using INCR, which is atomic and O(1).
“A global sequence counter would give a definitive answer, but at 231K messages/sec, a single counter becomes a contention bottleneck: every message write must acquire a lock, serialize, and increment.”
At our scale, 20B messages across 10B conversations means each conversation averages 2 messages/day, so per-conversation contention is near zero. The sender's chat server assigns the sequence number, attaches it to the message, writes to Kafka, and sends the ack back to the sender.
The recipient receives the message with its sequence number and inserts it in order. What if the ack is lost?
The sender retries with the same idempotency key (a client-generated UUID). The server checks a dedup cache (Redis SET with 24-hour TTL): if the key exists, it returns the original ack without re-processing.
This gives us at-least-once delivery at the transport layer with exactly-once semantics at the application layer. Why not exactly-once at the transport layer?
Because network partitions make true exactly-once delivery impossible without two-phase commit, which adds 50-100ms of latency per message. Trade-off: clients must handle dedup logic, but this is a simple check against the last-seen sequence number.
Related concepts