Notifications Anti-Patterns

Common design mistakes candidates make. Learn what goes wrong and how to avoid each trap in your interview.

One Shared Queue for All Priorities

Very Common

All notification types flow through a single queue, with (at best) a priority field in the payload that consumers are supposed to respect.

Why: One queue is simpler to build and the starvation failure only appears the first time a large campaign coincides with transactional traffic.

WRONG: Single Kafka topic for everything. A 100M-recipient campaign enqueues ahead of an OTP; partitions are consumed in order, so the OTP waits 20 minutes. The user's login timed out at 60 seconds.

RIGHT: Physically separate topics and worker pools per tier: P0 transactional (p99 <5s), P1 engagement (<30s), P2 bulk (minutes, deferrable). Starvation becomes structurally impossible instead of operationally managed.
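The tier split can be sketched as a routing function that picks a topic before anything is enqueued. This is a minimal illustration, not a prescribed layout: the topic names and the type-to-tier mapping are hypothetical, and each topic is assumed to be drained by its own worker pool.

```python
from enum import Enum


class Tier(Enum):
    # Hypothetical topic names; one topic (and one worker pool) per tier.
    P0 = "notif.p0.transactional"
    P1 = "notif.p1.engagement"
    P2 = "notif.p2.bulk"


# Assumed mapping from notification type to tier, for illustration only.
TIER_BY_TYPE = {
    "otp": Tier.P0,
    "payment_confirmation": Tier.P0,
    "like": Tier.P1,
    "follow": Tier.P1,
    "campaign": Tier.P2,
}


def topic_for(notification_type: str) -> str:
    """Return the topic a notification should be produced to.

    Unknown types fall back to the bulk tier so that an unclassified
    sender can never queue ahead of transactional traffic.
    """
    return TIER_BY_TYPE.get(notification_type, Tier.P2).value
```

Because an OTP and a campaign never share a topic, the campaign's backlog cannot sit in front of the OTP regardless of its size.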

Synchronous Send in the Request Path

Very Common

The application calls APNs/FCM inline while handling a user request, coupling checkout latency to a third-party push provider.

Why: It is the shortest code path: order placed, call FCM, return. Works in the demo, fails the first time the provider has a slow day.

WRONG: Checkout handler awaits the push provider. FCM p99 degrades to 4 seconds during an incident and checkout p99 follows it; a provider outage takes payments down with it.

RIGHT: The request path only enqueues (one fast local write to Kafka) and returns. Delivery is asynchronous with its own retries and failover. User-facing latency never depends on a provider.
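A minimal sketch of the enqueue-and-return shape, assuming an in-process queue as a stand-in for the Kafka producer. The function names (place_order, delivery_worker_step) and the event fields are hypothetical.

```python
import queue

# Stand-in for a Kafka producer; in production this would be a topic.
outbox = queue.Queue()


def place_order(order_id: str, user_id: str) -> dict:
    """Request path: record the intent to notify, then return immediately."""
    # ... persist the order here ...
    outbox.put_nowait(
        {"event_id": f"order-{order_id}", "user_id": user_id, "type": "order_placed"}
    )
    return {"order_id": order_id, "status": "confirmed"}


def delivery_worker_step(send) -> bool:
    """Worker path: deliver one queued event; returns False when idle.

    `send` is whatever talks to the push provider. Its latency and
    failures stay on this side of the queue.
    """
    try:
        event = outbox.get_nowait()
    except queue.Empty:
        return False
    send(event)
    return True
```

The handler never calls the provider, so a slow or failing `send` can only delay the notification, not the checkout response.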

Retries Without Idempotency Keys

Very Common

At-least-once delivery is configured (correctly), but nothing dedupes retried sends, so crash-replays produce duplicate notifications.

Why: Duplicates are invisible in low-volume testing; the first consumer crash during a burst sends hundreds of doubles, including payment confirmations.

WRONG: Worker sends 300 pushes, crashes before committing its offset; the replacement re-reads and re-sends all 300. Attaching a random UUID per attempt as an "idempotency" key does not help: the key changes on retry and dedupes nothing.

RIGHT: Deterministic key derived from event identity: hash(user_id, event_id, channel). Claim the key in Redis before each send using SET with the NX option and a 24h expiry (plain SETNX cannot set a TTL atomically); retried batches hit existing keys and skip. Collapse IDs on the device as the second seatbelt.
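A minimal sketch of the deterministic key and the claim-before-send check. DedupeStore is a hypothetical in-memory stand-in for the Redis claim (set-if-absent with an expiry); the helper names are assumptions for illustration.

```python
import hashlib
import time


def idempotency_key(user_id: str, event_id: str, channel: str) -> str:
    """Same inputs always give the same key, so a retry maps to the same slot."""
    raw = "\x1f".join((user_id, event_id, channel))  # separator avoids ambiguous joins
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DedupeStore:
    """In-memory stand-in for a Redis set-if-absent with a TTL."""

    def __init__(self, ttl_seconds: float = 24 * 3600, clock=time.time):
        self._expiry = {}
        self._ttl = ttl_seconds
        self._clock = clock

    def claim(self, key: str) -> bool:
        """Return True only for the first caller within the TTL window."""
        now = self._clock()
        expires = self._expiry.get(key)
        if expires is not None and expires > now:
            return False
        self._expiry[key] = now + self._ttl
        return True


def send_once(store: DedupeStore, user_id: str, event_id: str, channel: str, send) -> bool:
    """Send unless this (user, event, channel) was already claimed."""
    if not store.claim(idempotency_key(user_id, event_id, channel)):
        return False  # replayed batch: skip silently
    send(user_id, event_id, channel)
    return True
```

Claiming before sending means a crash between the claim and the send can lose that one notification; whether that trade is acceptable per tier is a design decision this sketch does not settle.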

No Per-User Send Budget

Very Common

Every product team sends what it wants; no aggregate cap exists per user, so attention is spent like a free resource.

Why: Each team's sends look individually justified, and the cost (opt-outs) lands on a shared metric no single team owns.

WRONG: Growth, commerce, and social each send 2 pushes on the same Tuesday. The user gets 6, disables push permanently, and is now unreachable even for OTPs delivered via push.

RIGHT: Per-user, per-tier daily token buckets enforced at send time (marketing ~2/day), with channel degradation (push -> in-app -> drop) instead of silent loss. Governance: per-team quotas inside the shared budget.
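A minimal sketch of budget enforcement with channel degradation. For brevity it uses a fixed daily counter rather than a true token bucket, and the in-app cap is an assumed value; only the marketing push limit of about 2 per day comes from the text above.

```python
from collections import defaultdict
from datetime import date

# Push limit from the text; the in-app limit is an assumption.
LIMITS = {"marketing": {"push": 2, "in_app": 5}}


class SendBudget:
    """Per-user, per-tier, per-day counters checked at send time."""

    def __init__(self, limits):
        self._limits = limits
        self._used = defaultdict(int)

    def choose_channel(self, user_id: str, tier: str, today: date) -> str:
        """Return 'push', 'in_app', or 'drop' and charge the budget."""
        tier_limits = self._limits.get(tier)
        if tier_limits is None:
            return "push"  # tiers without a budget (e.g. transactional) pass through
        for channel in ("push", "in_app"):  # degrade in this order
            key = (user_id, tier, channel, today)
            if self._used[key] < tier_limits.get(channel, 0):
                self._used[key] += 1
                return channel
        return "drop"
```

Returning an explicit "drop" makes the loss observable: the caller can count and attribute it to the sending team instead of losing it silently.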

Never Pruning Dead Device Tokens

Common

The token table only ever grows. Uninstalls and OS token rotation leave millions of dead rows that the system keeps sending to.

Why: Sends to dead tokens do not error loudly in aggregate dashboards; the damage (wasted throughput, sender reputation) accrues silently.

WRONG: Ignore 410 Unregistered responses. Within a year, roughly half of sends target dead devices; Apple and Google notice the unregistered rate and throttle the sender, degrading delivery for every real user.

RIGHT: Treat APNs 410 Unregistered and FCM UNREGISTERED as permanent signals: publish an invalidation event, prune within minutes. Re-register the current token on every app launch; expire tokens unseen for 270 days.
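A minimal sketch of the three token lifecycle rules (refresh on launch, prune on a permanent provider signal, expire after 270 days unseen). TokenRegistry and its method names are hypothetical, and an in-memory dict stands in for the token table.

```python
from datetime import datetime, timedelta, timezone

MAX_UNSEEN = timedelta(days=270)  # expiry threshold from the text


class TokenRegistry:
    def __init__(self):
        self._last_seen = {}  # token -> last time the app reported it

    def register(self, token: str, now: datetime) -> None:
        """Called on every app launch; refreshes the last-seen time."""
        self._last_seen[token] = now

    def handle_provider_response(self, token: str, status: int, reason: str = "") -> bool:
        """Prune on a permanent 'unregistered' signal; return True if pruned."""
        permanent = status == 410 or reason.upper() == "UNREGISTERED"
        if permanent:
            self._last_seen.pop(token, None)
        return permanent

    def expire_stale(self, now: datetime) -> list:
        """Remove and return tokens not seen within MAX_UNSEEN."""
        stale = [t for t, seen in self._last_seen.items() if now - seen > MAX_UNSEEN]
        for token in stale:
            del self._last_seen[token]
        return stale

    def active_tokens(self) -> set:
        return set(self._last_seen)
```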

One Push per Event, No Coalescing

Common

Every like, follow, and badge update becomes its own push notification, with no aggregation window and no provider collapse keys.

Why: Event-to-push is the natural first implementation, and the spam only appears when a post goes viral or a device reconnects after a dead zone.

WRONG: 50 likes in 2 minutes = 50 buzzes. A phone offline for a minute gets the whole backlog on reconnect. Users learn the app is noisy and disable push.

RIGHT: Coalescing window (30-120s) per (user, event_type, object) folds events server-side: "Priya and 49 others". apns-collapse-id / collapse_key makes queued provider messages replace each other. Cuts engagement volume ~60%.
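A minimal sketch of server-side coalescing keyed by (user, event_type, object). The Coalescer class, the message fields, and the collapse ID format are hypothetical; timestamps are plain seconds so the window logic is easy to follow.

```python
def summarize(actors: list) -> str:
    """Fold a list of actors into 'Priya and 49 others' style text."""
    others = len(actors) - 1
    if others == 0:
        return actors[0]
    noun = "other" if others == 1 else "others"
    return f"{actors[0]} and {others} {noun}"


class Coalescer:
    def __init__(self, window_seconds: float = 60.0):
        self._window = window_seconds
        self._pending = {}  # key -> (first_event_time, [actors])

    def add(self, user_id: str, event_type: str, object_id: str, actor: str, now: float) -> None:
        """Buffer an event; the window starts at the first event for the key."""
        key = (user_id, event_type, object_id)
        if key not in self._pending:
            self._pending[key] = (now, [])
        actors = self._pending[key][1]
        if actor not in actors:
            actors.append(actor)

    def flush(self, now: float) -> list:
        """Emit one message per key whose window has elapsed."""
        due = [k for k, (start, _) in self._pending.items() if now - start >= self._window]
        messages = []
        for key in due:
            _, actors = self._pending.pop(key)
            user_id, event_type, object_id = key
            messages.append({
                "user_id": user_id,
                # Reused as the provider collapse identifier so queued
                # messages for the same object replace each other.
                "collapse_id": f"{event_type}:{object_id}",
                "text": summarize(actors),
            })
        return messages
```

Fifty likes inside one window become a single message, and the shared collapse ID lets the provider replace any earlier undelivered one.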

Quiet Hours in Server Time

Common

The "do not disturb at night" window is computed in UTC or the company's home timezone instead of each recipient's local time.

Why: Server time is what the scheduler naturally has; per-user timezones require storing and refreshing device timezone data.

WRONG: Campaign scheduled for 9 AM Pacific fires globally at once: Mumbai receives it at 9:30 PM, Tokyo at 1 AM, and the 3 AM push in Sydney converts directly into opt-outs.

RIGHT: Store IANA timezone per user (refreshed on app launch). Defer in-window sends to next local 08:00 with jittered release; pre-shard campaigns into per-timezone waves. P0 bypasses quiet hours by definition.
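A minimal sketch of the deferral rule using Python's zoneinfo module. The 22:00 start of the quiet window is an assumption (the text only fixes the 08:00 release), and release_time is a hypothetical helper name.

```python
import random
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

QUIET_START = time(22, 0)  # assumed start of quiet hours
QUIET_END = time(8, 0)     # release time from the text


def release_time(now_utc: datetime, user_tz: str, priority: str,
                 jitter_seconds: float = 0.0, rng=random) -> datetime:
    """Return the UTC instant at which a notification may be sent."""
    if priority == "P0":
        return now_utc  # transactional traffic bypasses quiet hours

    local = now_utc.astimezone(ZoneInfo(user_tz))
    t = local.time()
    if not (t >= QUIET_START or t < QUIET_END):
        return now_utc  # outside the quiet window: send now

    # Before midnight the next 08:00 is tomorrow; after midnight it is today.
    day = local.date() + timedelta(days=1) if t >= QUIET_START else local.date()
    release_local = datetime.combine(day, QUIET_END, tzinfo=local.tzinfo)

    # Jitter spreads the 08:00 release so one timezone does not fire at once.
    jitter = timedelta(seconds=rng.uniform(0, jitter_seconds))
    return release_local.astimezone(timezone.utc) + jitter
```

For example, a bulk send evaluated at 17:00 UTC on a January day is released immediately for a user in America/Los_Angeles (09:00 local) but deferred to 08:00 local for a user in Asia/Tokyo.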

Retrying Permanent Errors Like Transient Ones

Common

The gateway retries every provider failure identically, including 4xx responses that will never succeed, amplifying load during incidents.

Why: A single retry policy is easier to write than an error taxonomy, and the difference only matters under provider stress.

WRONG: FCM has a partial outage; the gateway retries 400s and 410s alongside 503s with no backoff. Traffic triples against a struggling provider, threads burn on futile calls, and the retry storm outlives the outage.

RIGHT: Classify: 5xx/timeout -> retry topic with exponential backoff and jitter (1m, 5m, 25m) behind a circuit breaker; 410/400 token errors -> pruning pipeline, never retried; P0 -> channel failover (push to SMS) after two attempts.
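A minimal sketch of the error taxonomy and retry decision. Several details are assumptions: the jitter range (half to full delay), the dead-letter outcome for exhausted retries and unlisted statuses, and the action names. The circuit breaker is omitted for brevity.

```python
import random

BACKOFF_SCHEDULE_SECONDS = (60, 300, 1500)  # 1m, 5m, 25m from the text


def classify(status: int, timed_out: bool = False) -> str:
    """Map a provider outcome to 'retry', 'prune', or 'dead_letter'."""
    if timed_out or 500 <= status <= 599:
        return "retry"   # transient: provider may recover
    if status in (400, 410):
        return "prune"   # permanent token error: never retried
    return "dead_letter"  # assumption: anything else goes to inspection


def next_action(status: int, attempt: int, priority: str,
                timed_out: bool = False, rng=random):
    """Decide what to do after a failed send.

    `attempt` is the number of attempts already made (1 after the first
    failure). Returns (action, delay_seconds_or_None).
    """
    kind = classify(status, timed_out)
    if kind != "retry":
        return (kind, None)
    if priority == "P0" and attempt >= 2:
        return ("failover_sms", None)  # P0 switches channel after two attempts
    if attempt > len(BACKOFF_SCHEDULE_SECONDS):
        return ("dead_letter", None)   # retries exhausted
    base = BACKOFF_SCHEDULE_SECONDS[attempt - 1]
    # Jitter keeps retries from many workers from landing in lockstep.
    return ("retry", rng.uniform(base * 0.5, base))
```

Separating classification from scheduling keeps permanent errors out of the retry topic entirely, so they add no load to a struggling provider.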