Standard walkthrough
WebSocket Connection Management
How do we hold 500 million persistent connections across 10,000 servers without losing track of who is connected where? We chose WebSocket (not HTTP long polling, not Server-Sent Events) because chat requires full-duplex communication: both the client and the server must be able to push messages at any time.
Server-Sent Events are unidirectional (server to client only), so sending a message still requires a separate HTTP POST. With WebSocket, the TCP connection stays open after the initial HTTP upgrade handshake, and both sides send frames with only 2-6 bytes of overhead per frame versus 200+ bytes of HTTP headers.
Long polling wastes bandwidth on repeated HTTP handshakes: at 500M users, that is roughly 16.7M handshakes per second just for 30-second polls.
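The arithmetic behind those two claims can be checked directly. This is back-of-envelope math using only the figures quoted above (500M users, 30-second polls, 2-6 bytes of WebSocket frame overhead versus 200+ bytes of HTTP headers); none of it is measured data.

```python
# Figures taken from the text above; illustrative, not measured.
USERS = 500_000_000
POLL_INTERVAL_S = 30

# Long polling: every user completes one HTTP handshake per poll interval.
handshakes_per_s = USERS / POLL_INTERVAL_S
print(f"{handshakes_per_s / 1e6:.1f}M handshakes/s")  # ~16.7M

# Per-message framing overhead: WebSocket frame header vs. typical HTTP headers.
WS_FRAME_OVERHEAD_B = 6      # bytes, upper end of the 2-6 byte range
HTTP_HEADER_OVERHEAD_B = 200  # bytes, conservative lower bound
savings_factor = HTTP_HEADER_OVERHEAD_B / WS_FRAME_OVERHEAD_B
print(f"~{savings_factor:.0f}x less framing overhead per message")
```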
Each server holds 50K connections in memory at roughly 10KB per connection (socket buffers + user metadata), totaling 500MB of RAM per server. A connection registry in Redis maps each user_id to the server_id holding their connection: key = conn:user_id, value = server_id, TTL = 90 seconds refreshed by heartbeat.
When a message arrives for User B, the chat service looks up B's server in Redis and forwards the message directly. Why not broadcast to all servers?
Because broadcasting a message to 10,000 servers when only one holds the recipient wastes 9,999 network calls. Trade-off: the Redis registry adds a lookup hop per message (~1ms), but eliminates broadcast overhead.
If Redis is unavailable, we fall back to consistent hashing of user_id to guess the server, accepting occasional misroutes that trigger client reconnection.
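The routing decision (prefer the registry, fall back to a consistent-hash guess when Redis is down) might look like the sketch below. The ring uses virtual nodes so every chat node computes the same guess without coordination; the server names, vnode count, and hash choice are illustrative assumptions, not details from the source.

```python
import hashlib
from bisect import bisect

def _hash(key: str) -> int:
    # Any stable hash works; md5 is used here only for determinism across nodes.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Maps user_id onto a ring of server ids; vnodes smooth the distribution."""

    def __init__(self, server_ids, vnodes: int = 100):
        self._ring = sorted(
            (_hash(f"{s}#{i}"), s) for s in server_ids for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def server_for(self, user_id: str) -> str:
        idx = bisect(self._keys, _hash(user_id)) % len(self._ring)
        return self._ring[idx][1]

def route(user_id, registry_lookup, ring):
    """Prefer the Redis registry; fall back to the hash ring if Redis is down.
    A misroute from the fallback guess just triggers a client reconnection."""
    try:
        server = registry_lookup(user_id)
    except ConnectionError:  # Redis unavailable
        server = None
    return server or ring.server_for(user_id)
```

Note the fallback is only a guess: after a failover or rebalance the hashed server may no longer hold the connection, which is exactly the misroute-then-reconnect case the text accepts.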
Related concepts