Notification Service

The constraint: when a user saves a file, their other devices must learn about the change within 1 second. Polling every second wastes bandwidth (100M users with 3 devices each is 300M mostly empty requests/sec) and still adds up to 1 second of latency, which consumes the entire budget.
We chose long polling (not WebSockets, not SSE) as the sweet spot: each device opens an HTTP request that the server holds open for up to 60 seconds. When a change occurs, the server responds immediately with a lightweight change event (file_id, new_version, timestamp). If nothing changes before the 60 seconds elapse, the server returns an empty response and the device reconnects right away.
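The hold-and-respond behavior described above can be sketched with asyncio. This is a minimal in-memory illustration, not the service's actual implementation: `LongPollHub`, `poll`, and `notify` are hypothetical names, and a real server would sit behind an HTTP framework and learn about changes from the fan-out layer rather than from a direct method call.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ChangeEvent:
    # The lightweight payload named in the text.
    file_id: str
    new_version: int
    timestamp: float


class LongPollHub:
    """Hypothetical in-memory registry of held long-poll requests."""

    def __init__(self, hold_seconds: float = 60.0) -> None:
        self.hold_seconds = hold_seconds
        self._waiters: Dict[str, asyncio.Future] = {}

    async def poll(self, device_id: str) -> Optional[ChangeEvent]:
        """Hold the request until a change arrives or the hold window expires."""
        future = asyncio.get_running_loop().create_future()
        self._waiters[device_id] = future
        try:
            return await asyncio.wait_for(future, timeout=self.hold_seconds)
        except asyncio.TimeoutError:
            return None  # empty response; the device is expected to reconnect
        finally:
            # Remove the entry only if it still belongs to this request.
            if self._waiters.get(device_id) is future:
                del self._waiters[device_id]

    def notify(self, device_id: str, event: ChangeEvent) -> bool:
        """Complete the held request, if any. Returns False when no request is waiting."""
        future = self._waiters.get(device_id)
        if future is None or future.done():
            return False
        future.set_result(event)
        return True
```

When `notify` returns False the device is between requests, which is the window where a change has to wait for the next reconnect. A production design would need to buffer or re-derive such changes (for example from a per-user version cursor); that part is outside this sketch.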
WebSockets deliver instant push but require maintaining millions of stateful connections.
Trade-off: long polling has slightly higher latency than WebSockets (up to 1 second vs. near-instant), because a change that lands while a device is between requests has to wait for the reconnect. In exchange, it eliminates connection affinity requirements and works through corporate firewalls that block WebSocket upgrades. At 100M DAU with 3 devices each, that is 300 million long-poll connections.
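The scale numbers in this section follow from simple arithmetic, reproduced below as a back-of-the-envelope sketch. It assumes the worst case where every device is online at once, takes the ~1M connections per server figure quoted later in this section, and assumes idle connections time out evenly across the 60-second hold window. The function names are illustrative only.

```python
import math

DAILY_ACTIVE_USERS = 100_000_000
DEVICES_PER_USER = 3
HOLD_SECONDS = 60                   # long-poll hold window from the text
CONNECTIONS_PER_SERVER = 1_000_000  # per-server figure from the text


def open_connections(users: int = DAILY_ACTIVE_USERS,
                     devices: int = DEVICES_PER_USER) -> int:
    """Concurrent long-poll connections if every device is online."""
    return users * devices


def polling_requests_per_sec(connections: int, interval_seconds: float = 1.0) -> float:
    """Request rate if every device polled on a fixed interval instead."""
    return connections / interval_seconds


def idle_reconnects_per_sec(connections: int, hold_seconds: int = HOLD_SECONDS) -> float:
    """Reconnect rate when nothing changes and timeouts are spread evenly."""
    return connections / hold_seconds


def servers_needed(connections: int, per_server: int = CONNECTIONS_PER_SERVER) -> int:
    return math.ceil(connections / per_server)


if __name__ == "__main__":
    conns = open_connections()
    print(f"open connections:         {conns:,}")
    print(f"1-second polling:         {polling_requests_per_sec(conns):,.0f} req/s")
    print(f"idle long-poll reconnects: {idle_reconnects_per_sec(conns):,.0f} req/s")
    print(f"servers at 1M each:       {servers_needed(conns)}")
```

Under these assumptions, idle long polling generates 5M reconnects per second, a 60x reduction from the 300M requests per second of 1-second polling, and the fleet needs about 300 servers.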
The notification service fans out through Kafka partitioned by user_id, so each partition handles changes for a subset of users. The hard case is a viral shared folder: one edit in a folder shared with 10,000 users triggers 10,000 notifications.
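Partitioning by user_id and the shared-folder fan-out can be illustrated as follows. This is a sketch under assumptions: the partition count is made up, and CRC32 stands in for whatever hash the Kafka client applies to the message key. The point is only that a stable hash of user_id sends all of one user's notifications to the same partition, while one folder edit becomes one message per member.

```python
import zlib
from typing import Dict, List, Tuple

NUM_PARTITIONS = 64  # hypothetical; the text does not give a partition count


def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash of user_id so a user's events always share a partition."""
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions


def fan_out(folder_members: List[str], event: dict) -> Dict[int, List[Tuple[str, dict]]]:
    """Expand one folder change into one notification per member, grouped by partition."""
    by_partition: Dict[int, List[Tuple[str, dict]]] = {}
    for user_id in folder_members:
        by_partition.setdefault(partition_for(user_id), []).append((user_id, event))
    return by_partition


if __name__ == "__main__":
    members = [f"user-{i}" for i in range(10_000)]
    event = {"file_id": "f-123", "new_version": 7, "timestamp": 1700000000.0}
    grouped = fan_out(members, event)
    total = sum(len(batch) for batch in grouped.values())
    print(f"{total} notifications across {len(grouped)} partitions")
```

A single edit in a 10,000-member folder produces 10,000 messages spread over the partitions, which is why unthrottled edits in such a folder multiply quickly.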
We rate-limit per-folder notifications to 1 per second and batch the changes that accumulate in between, which prevents the notification storm that would otherwise overwhelm the fan-out layer.
What if the interviewer asks: 'How do we handle 300M connections?' We distribute long-poll connections across hundreds of stateless notification servers behind a load balancer. Each server holds ~1M connections, so the fleet is about 300 servers. Since long polling keeps no session state between requests (no session affinity), any server can handle any reconnection.
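The per-folder rate limiting and batching mentioned above could look roughly like this. It is a sketch, not the service's actual code: `FolderBatcher` is a hypothetical name, time is passed in explicitly to keep the logic testable, and a real deployment would call `flush` from a periodic timer so that buffered changes are never stranded.

```python
from collections import defaultdict
from typing import Dict, Optional


class FolderBatcher:
    """Emit at most one batch per folder per interval, coalescing changes in between."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval
        self._last_sent: Dict[str, float] = {}
        # folder_id -> {file_id: latest new_version seen since the last batch}
        self._pending: Dict[str, Dict[str, int]] = defaultdict(dict)

    def record_change(self, folder_id: str, file_id: str,
                      new_version: int, now: float) -> Optional[Dict[str, int]]:
        """Buffer a change and return a batch if the folder may notify right now."""
        pending = self._pending[folder_id]
        pending[file_id] = max(new_version, pending.get(file_id, new_version))
        return self.flush(folder_id, now)

    def flush(self, folder_id: str, now: float) -> Optional[Dict[str, int]]:
        """Return the buffered batch if the interval has elapsed, else None."""
        pending = self._pending.get(folder_id)
        if not pending:
            return None
        last = self._last_sent.get(folder_id)
        if last is not None and now - last < self.min_interval:
            return None  # still inside the rate-limit window; keep buffering
        self._last_sent[folder_id] = now
        return self._pending.pop(folder_id)
```

Each returned batch is fanned out once to the folder's members, so a 10,000-member folder generates at most 10,000 notifications per second no matter how many edits arrive in that second.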
Why it matters in interviews
Interviewers ask why we chose long polling over WebSockets. The answer is operational simplicity: stateless, no connection affinity, works through corporate firewalls. Describing the Kafka fan-out and per-folder rate limiting shows we think about the shared-folder edge case.