Notification Service
The constraint: when a user saves a file, their other devices must learn about the change within 1 second. Having every device poll once per second wastes bandwidth (300M mostly empty requests/sec) and still adds up to 1 second of latency.
We chose long polling (rather than WebSockets or SSE) as the sweet spot: each device opens an HTTP request that the server holds open for up to 60 seconds. When a change occurs, the server responds immediately with a lightweight change event (file_id, new_version, timestamp). If nothing changes before the hold expires, the server returns an empty response and the device re-issues the request.
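The hold-then-respond behavior can be sketched in a few lines of asyncio. This is a minimal in-memory illustration, not the service's actual implementation: the names LongPollHub, wait_for_change, and publish are hypothetical, and a real server would sit behind an HTTP framework and receive events from the fan-out layer rather than from a direct method call.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    # Fields taken from the change event described above.
    file_id: str
    new_version: int
    timestamp: float


class LongPollHub:
    """In-memory stand-in for one notification server (hypothetical)."""

    def __init__(self) -> None:
        # user_id -> futures for requests currently held open
        self._waiters: dict[str, set[asyncio.Future]] = {}

    async def wait_for_change(self, user_id: str, hold_seconds: float = 60.0):
        """Hold the request until a change arrives or the hold expires."""
        fut = asyncio.get_running_loop().create_future()
        self._waiters.setdefault(user_id, set()).add(fut)
        try:
            return await asyncio.wait_for(fut, timeout=hold_seconds)
        except asyncio.TimeoutError:
            return None  # empty response; the device re-issues the request
        finally:
            self._waiters.get(user_id, set()).discard(fut)

    def publish(self, user_id: str, event: ChangeEvent) -> int:
        """Answer every request currently held open for this user."""
        woken = 0
        for fut in list(self._waiters.get(user_id, ())):
            if not fut.done():
                fut.set_result(event)
                woken += 1
        return woken
```

Each held request costs one pending future, and nothing is kept for a device once its request has been answered or has timed out.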
“WebSockets deliver instant push but require maintaining millions of stateful connections.”
Trade-off: long polling has slightly higher latency than WebSockets (up to 1 second vs. near-instant) but eliminates connection affinity requirements and works through corporate firewalls that block WebSocket upgrades. At 100M DAU with 3 devices each, that is 300 million long-poll connections.
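The load figures can be checked with quick arithmetic. The sketch below uses only the numbers stated in this walkthrough (100M DAU, 3 devices each, a 60-second hold, 1-second polling) and assumes the idle worst case, where no changes occur and every long-poll request runs to its full timeout before being re-issued.

```python
DAU = 100_000_000
DEVICES_PER_USER = 3
HOLD_SECONDS = 60           # long-poll hold time
POLL_INTERVAL_SECONDS = 1   # naive polling interval

connections = DAU * DEVICES_PER_USER
polling_rps = connections // POLL_INTERVAL_SECONDS
# Assumption: with no changes, each connection reconnects once per hold period.
idle_long_poll_rps = connections // HOLD_SECONDS

print(f"open connections:        {connections:,}")
print(f"1-second polling:        {polling_rps:,} requests/sec")
print(f"idle long-poll reissues: {idle_long_poll_rps:,} requests/sec")
```

Under that assumption, idle traffic drops from 300M to 5M requests/sec, a 60x reduction, while the number of open connections stays at 300M.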
The notification service fans out through Kafka partitioned by user_id, so each partition handles changes for a subset of users. The hard case is a viral shared folder: one edit in a folder shared with 10,000 users triggers 10,000 notifications.
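To illustrate keying by user_id, the sketch below maps each recipient to a partition and groups a shared-folder fan-out by partition. It is a stand-in under stated assumptions: NUM_PARTITIONS, partition_for, and fan_out are hypothetical names, and SHA-256 is used only as a stable hash for illustration. Kafka clients apply their own key hash when a record is produced with user_id as its key.

```python
import hashlib

NUM_PARTITIONS = 64  # hypothetical partition count


def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable mapping from user_id to a partition (illustrative hash only)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def fan_out(folder_members: list[str], event: dict) -> dict[int, list[tuple[str, dict]]]:
    """Expand one folder edit into per-user notifications, grouped by partition."""
    by_partition: dict[int, list[tuple[str, dict]]] = {}
    for user_id in folder_members:
        by_partition.setdefault(partition_for(user_id), []).append((user_id, event))
    return by_partition


# One edit in a folder shared with 10,000 users becomes 10,000 notifications.
members = [f"user-{i}" for i in range(10_000)]
batches = fan_out(members, {"file_id": "f42", "new_version": 7})
total = sum(len(batch) for batch in batches.values())
```

Because the key is the recipient's user_id, all notifications for one user land on the same partition, and the 10,000 notifications spread across partitions rather than piling onto one.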
We rate-limit notifications to one per folder per second and batch the changes that arrive in between, which prevents the notification storm that would otherwise overwhelm the fan-out layer.
What if the interviewer asks: 'How do we handle 300M connections?' We distribute long-poll connections across hundreds of stateless notification servers behind a load balancer. At ~1M connections per server, 300M connections need roughly 300 servers. Because long polling requires no session affinity, any server can handle any reconnection.
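The per-folder rate limit and batching mentioned above can be sketched as a small coalescer. This is one possible shape, not the service's actual design: FolderCoalescer, record_change, and flush_due are hypothetical names, and the caller is assumed to pass in the current time (for example from time.monotonic()) and to call flush_due periodically.

```python
from collections import defaultdict


class FolderCoalescer:
    """Batch changes per folder and emit at most one batch per interval."""

    def __init__(self, min_interval_seconds: float = 1.0) -> None:
        self.min_interval = min_interval_seconds
        self._last_sent: dict[str, float] = {}
        # folder_id -> {file_id: latest version seen since the last flush}
        self._pending: dict[str, dict[str, int]] = defaultdict(dict)

    def record_change(self, folder_id: str, file_id: str, new_version: int) -> None:
        """Remember only the newest version of each changed file."""
        pending = self._pending[folder_id]
        pending[file_id] = max(new_version, pending.get(file_id, new_version))

    def flush_due(self, now: float) -> dict[str, dict[str, int]]:
        """Return batches for folders whose rate-limit window has elapsed."""
        due: dict[str, dict[str, int]] = {}
        for folder_id in list(self._pending):
            last = self._last_sent.get(folder_id)
            if last is None or now - last >= self.min_interval:
                due[folder_id] = self._pending.pop(folder_id)
                self._last_sent[folder_id] = now
        return due
```

With this shape, a burst of edits inside one second costs a single fan-out per folder, and each recipient is told only the latest version of each changed file.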