At-Least-Once Delivery with Idempotency Keys
A fanout worker sends 300 pushes, then crashes before committing its Kafka offset. A replacement worker re-reads the same batch and sends them again: 300 users get duplicate notifications.
“For a like alert that is annoying; for a payment confirmation it looks like a double charge.”
Can we just deliver exactly once? Three positions.
Option one, at-most-once: commit the offset before sending. A crash after the commit but before the send silently drops notifications. Dropping an OTP or a security alert is the worst possible failure, so this option is rejected.
Option two, exactly-once: end-to-end exactly-once requires the provider call, the log write, and the offset commit to share one transaction. APNs and FCM are external systems with no transactional API, so true exactly-once is impossible past our boundary.
Option three, at-least-once plus idempotency: the standard answer.
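To make the commit-ordering difference concrete, here is a toy simulation of the crash scenario, written as a sketch rather than real consumer code. The `Partition`, `Outbox`, and `run_worker` names are hypothetical stand-ins for a Kafka partition, the push provider, and the fanout worker; no actual Kafka client is involved.

```python
from dataclasses import dataclass, field


class Crash(Exception):
    """Simulated worker crash."""


@dataclass
class Partition:
    """A toy log partition with one committed offset (stand-in for Kafka)."""
    messages: list
    committed: int = 0  # index of the next unprocessed message


@dataclass
class Outbox:
    """Records every push handed to the provider."""
    sent: list = field(default_factory=list)


def run_worker(partition, outbox, commit_first, crash=False):
    """Process one batch: everything from the committed offset to the end."""
    start = partition.committed
    batch = partition.messages[start:]
    if commit_first:
        # At-most-once: the offset moves before any push goes out.
        partition.committed = start + len(batch)
        if crash:
            raise Crash("died after commit, before send")
        outbox.sent.extend(batch)
    else:
        # At-least-once: pushes go out before the offset moves.
        outbox.sent.extend(batch)
        if crash:
            raise Crash("died after send, before commit")
        partition.committed = start + len(batch)


def simulate(commit_first):
    """Crash the first worker, then let a replacement take over."""
    partition = Partition(messages=[f"push-{i}" for i in range(300)])
    outbox = Outbox()
    try:
        run_worker(partition, outbox, commit_first, crash=True)
    except Crash:
        pass
    run_worker(partition, outbox, commit_first)  # replacement worker
    return len(outbox.sent)


at_most_once = simulate(commit_first=True)    # 0 sends: all 300 dropped
at_least_once = simulate(commit_first=False)  # 600 sends: all 300 duplicated
```

Committing first loses the whole batch, while sending first repeats it. At-least-once is the only ordering that never drops a notification, which is why the remaining work is suppressing the repeats.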
Every notification carries a deterministic idempotency key, a hash derived from the event. Before sending, the gateway does a conditional insert into a dedup store (Redis SETNX with a 24h TTL, in practice one atomic SET with the NX and EX options; 116K/sec average writes, roughly of keys at peak retention, trimmed by TTL).
If the key exists, the send is skipped. A retried batch after a crash hits 300 existing keys and sends nothing.
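The check-then-send step can be sketched as follows. This is a minimal illustration that uses an in-memory `DedupStore` (a hypothetical class) in place of Redis so it runs without a server; `claim` mimics the "set only if absent, with expiry" semantics, and `send_batch` and the key format are likewise assumptions made for the example.

```python
import time


class DedupStore:
    """In-memory stand-in for a Redis-style 'set if absent, with TTL' store."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expires_at = {}  # key -> expiry timestamp

    def claim(self, key):
        """Return True if the key was absent (or expired) and is now claimed."""
        now = self.clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # live key: this notification was already handled
        self._expires_at[key] = now + self.ttl
        return True


def send_batch(batch, store, provider_send):
    """Send each (key, payload) pair only if its key can be claimed."""
    sent = 0
    for key, payload in batch:
        if store.claim(key):
            provider_send(payload)
            sent += 1
    return sent


store = DedupStore()
delivered = []
batch = [(f"key-{i}", f"push-{i}") for i in range(300)]

first_attempt = send_batch(batch, store, delivered.append)  # original worker
retry_attempt = send_batch(batch, store, delivered.append)  # replacement re-reads the batch
```

The first pass sends 300 pushes and the retry sends none. Against a real Redis, the claim would be a single `SET key value NX EX 86400`, which sets the key and its expiry atomically. As a rough sizing check, 116K writes/sec sustained over the 24h TTL is about 116,000 × 86,400 ≈ 10 billion live keys, assuming every write creates a distinct key.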
The client adds a second seatbelt: the OS collapses notifications carrying the same collapse identifier, so even a dedup-store miss usually does not produce two banners.
The trade-off: the dedup check adds one Redis round trip (~1ms) to every send, and losing Redis downgrades us to raw at-least-once until the store refills.
We accept both.
What if the interviewer asks: why hash instead of a random UUID?
A random key changes on every retry, which defeats dedup. The key must be derived from the event, so every retry of the same logical notification produces the same key.
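A short sketch of the difference, under stated assumptions: the fields hashed here (`event_id`, `user_id`, `channel`) are hypothetical, since the exact key formula is not given above, and reusing the key as the collapse identifier is also an assumption made for illustration.

```python
import hashlib
import uuid


def idempotency_key(event_id, user_id, channel):
    """Hash event-derived fields into a stable key (field choice is hypothetical)."""
    # The unit separator keeps ("ab", "c") and ("a", "bc") from hashing alike.
    material = "\x1f".join([event_id, user_id, channel])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def build_push(event_id, user_id, channel, body):
    """Attach the key to a push; reusing it as the collapse identifier is an assumption."""
    key = idempotency_key(event_id, user_id, channel)
    return {"idempotency_key": key, "collapse_id": key, "body": body}


first_try = build_push("evt-42", "user-7", "push", "Alice liked your photo")
retry = build_push("evt-42", "user-7", "push", "Alice liked your photo")
other_user = build_push("evt-42", "user-8", "push", "Alice liked your photo")

# Same logical notification -> same key, so the dedup store recognizes the retry.
same_key_on_retry = first_try["idempotency_key"] == retry["idempotency_key"]
# Different recipient -> different key, so fanout to other users is not suppressed.
distinct_per_user = first_try["idempotency_key"] != other_user["idempotency_key"]
# Two random UUIDs never match in practice, so a retry would look like a new send.
random_keys_differ = str(uuid.uuid4()) != str(uuid.uuid4())
```

On the provider side, the collapse identifier corresponds to the `apns-collapse-id` header on APNs and `collapse_key` on FCM. A SHA-256 hex digest is 64 ASCII characters, which fits within the 64-byte limit documented for `apns-collapse-id`.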
Related concepts