Serving the Numbers: OLAP for a Million Dashboards
The last mile is a query problem: an advertiser opens a dashboard and asks "clicks for campaign X, by hour, last 30 days, split by country", and ten thousand advertisers do this concurrently while the stream writes two million fresh aggregates a minute. Two wrong answers frame the right one. Querying the raw log recomputes billions of events per dashboard paint; the log is truth, not a serving layer. Pre-computing every report (the autocomplete move) fails here because the query space is not enumerable: dimensions multiply (ad x time x geo x device x placement...) and advertisers slice ad hoc.
Freshness comes from the upsert contract established by the watermark design: corrections overwrite (ad, window) cells, so a dashboard refreshed after a late-click correction shows the amended number with no special handling. The performance frame worth quoting: minute-grain aggregates for 2M active windows/minute come to ~3 GB/day of aggregate rows, five hundred times smaller than the raw log. OLAP serves interactive queries precisely because the stream already paid the aggregation cost.
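The upsert contract can be sketched in a few lines. This is a minimal illustration, assuming an in-memory dict stands in for the serving store; the `upsert` helper, the `ad-42` id, and the timestamp are hypothetical names chosen for the example, not part of any real OLAP client.

```python
from typing import Dict, Tuple

# Hypothetical stand-in for the serving store: one cell per (ad_id, window_start).
Cell = Tuple[str, int]


def upsert(store: Dict[Cell, int], ad_id: str, window_start: int, clicks: int) -> None:
    """Overwrite the (ad, window) cell with the latest total emitted by the stream."""
    store[(ad_id, window_start)] = clicks


store: Dict[Cell, int] = {}
window = 1_700_000_040  # assumed epoch-second start of a one-minute window

upsert(store, "ad-42", window, 17)  # first emission when the window closes
upsert(store, "ad-42", window, 19)  # correction after two late clicks arrive

# A dashboard reading after the correction sees the amended number.
latest = store[("ad-42", window)]
```

Because the write replaces the cell rather than appending to it, the reader needs no merge or deduplication logic: whatever is in the cell is the current answer.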
“The right shape is a columnar OLAP store (Druid, ClickHouse, Pinot), built for exactly this middle ground: it ingests the stream's (ad, minute, dimensions) aggregates continuously, stores them column by column (a query touching 3 of 40 columns reads 3), partitions by time and shards by ad, and materializes rollups for the common coarser views (hourly, daily), so a 30-day dashboard reads ~720 hourly rows instead of 43,200 minute rows.”
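The rollup arithmetic is easy to check. The sketch below assumes one minute-grain row per minute for a single ad over 30 days; `rollup_hourly` is a hypothetical helper written for this example, not an API of Druid, ClickHouse, or Pinot, which materialize rollups through their own mechanisms.

```python
from collections import defaultdict

MINUTES_PER_HOUR = 60


def rollup_hourly(minute_rows):
    """Sum minute-grain rows (ad_id, minute_index, clicks) into hourly cells."""
    hourly = defaultdict(int)
    for ad_id, minute_index, clicks in minute_rows:
        hourly[(ad_id, minute_index // MINUTES_PER_HOUR)] += clicks
    return dict(hourly)


days = 30
# Assumed input: one row per minute, one click each, for a single ad.
minute_rows = [("ad-42", m, 1) for m in range(days * 24 * MINUTES_PER_HOUR)]
hourly_rows = rollup_hourly(minute_rows)

print(len(minute_rows))  # 43200 minute rows
print(len(hourly_rows))  # 720 hourly rows
```

The totals are unchanged by the rollup; only the number of rows a 30-day, by-hour query has to read drops by a factor of 60.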
The boundary that keeps it fast: dashboards query aggregates and rollups only; the rare investigation that needs raw events (a fraud dispute, a debugging session) goes to a separate slow path over the log, never through the dashboard store.
What if the interviewer asks: why not a key-value store keyed by (ad, window)?
Point lookups, yes. But the moment queries slice by dimensions you did not key on ("all campaigns, by country"), KV degenerates to scans; columnar storage IS the index for ad-hoc analytical slicing.
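A toy contrast makes the difference concrete. This is a sketch under simplifying assumptions: both layouts are plain Python structures, the rows and the extra `device` column are made up for illustration, and real stores add compression, indexes, and parallelism on top.

```python
from collections import defaultdict

# KV layout: keyed by (ad, window); every other field lives inside the value.
kv = {
    ("ad-1", 0): {"country": "US", "device": "mobile", "clicks": 5},
    ("ad-1", 1): {"country": "DE", "device": "desktop", "clicks": 3},
    ("ad-2", 0): {"country": "US", "device": "mobile", "clicks": 7},
}

# The keyed question is a single lookup.
point = kv[("ad-1", 0)]["clicks"]


def kv_clicks_by_country(store):
    """Slicing by an unkeyed dimension forces a visit to every full value."""
    totals = defaultdict(int)
    for value in store.values():
        totals[value["country"]] += value["clicks"]
    return dict(totals)


# Columnar layout: one array per column, aligned by row position.
columns = {
    "ad": ["ad-1", "ad-1", "ad-2"],
    "window": [0, 1, 0],
    "country": ["US", "DE", "US"],
    "device": ["mobile", "desktop", "mobile"],
    "clicks": [5, 3, 7],
}


def columnar_clicks_by_country(cols):
    """Reads only the two columns the query names; the rest are never touched."""
    totals = defaultdict(int)
    for country, clicks in zip(cols["country"], cols["clicks"]):
        totals[country] += clicks
    return dict(totals)


by_country_kv = kv_clicks_by_country(kv)
by_country_col = columnar_clicks_by_country(columns)
```

Both functions return the same totals, but the KV version deserializes every value in full, while the columnar version reads two of the five columns and ignores the others.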