Click Aggregator System Design Walkthrough

Complete design walkthrough with animated diagrams, capacity math, API design, schema, and failure modes.

Solution Path (target: 28 min)
We designed an ad click aggregator for 8.6 billion clicks a day (500K/sec peak) serving three masters with incompatible needs: dashboards in seconds, budget enforcement where lag is priced in cash, and invoices exact under audit. The key insight is to count twice. A Flink stream path (event-time windows, watermarks with 15-minute lateness, two-stage salted aggregation for viral ads, checkpoint-atomic exactly-once with an idempotent upsert sink) serves decisions, and a nightly batch over the immutable raw log (exhaustive dedup, mature fraud verdicts) serves money, with reconciliation paging when the two diverge by more than 0.1%. A 1-2 second budget lane fails closed and paces predictively. Duplicates need identity; fraud needs judgment, and the pipeline marks fraudulent clicks rather than deleting them.

What is a Click Aggregator?

A user clicks an ad. That single event has to satisfy three masters with incompatible demands.
The advertiser's dashboard wants it counted within seconds: campaign managers watching a launch tune bids in real time, and a laggy dashboard is a product complaint. The budget system wants it counted even faster, because the click costs money against a finite cap, and every second of counting lag on a viral campaign is cash the platform eats when spend blows past the budget unnoticed.
The invoice wants it counted exactly (deduplicated, fraud-judged, late arrivals included) because the count becomes a bill, bills get audited, and "our counter said so" is not an answer a dispute accepts. One event, three consumers, three different points on the speed-versus-certainty curve. That tension is the entire topic.
The scale makes it a systems problem: 8.6 billion clicks a day, 500K per second at peak. The workload's special cruelty is the viral ad, such as a Super Bowl spot concentrating 100K clicks/sec onto a single aggregation key exactly when everyone is watching the dashboard. The data makes it an integrity problem: raw click streams run 10-30% fraud (bots, click farms, competitors draining budgets), mobile clicks arrive minutes late (the subway problem), and browsers double-fire events. The pipeline must therefore distinguish duplicates (the same click seen twice, solved by identity) from fraud (real clicks that must not be paid for, solved by judgment), and it must file every click into the minute it happened, not the minute it arrived.
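The two integrity rules in the paragraph above (dedup by identity, bucket by event time) can be sketched in a few lines of Python. This is an illustrative in-memory version, not the Flink job itself: the field names `click_id`, `ad_id`, `event_ts`, and `arrival_ts` are assumptions, and a real pipeline would bound the `seen` set with a TTL or keyed state rather than letting it grow forever.

```python
from collections import defaultdict
from datetime import datetime, timezone


def minute_bucket(event_ts: float) -> str:
    """Truncate an epoch-seconds timestamp to its UTC minute."""
    dt = datetime.fromtimestamp(event_ts, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M")


def aggregate(clicks):
    """Count clicks per (ad_id, event-time minute), ignoring repeated click_ids."""
    seen = set()               # hypothetical dedup state; unbounded in this sketch
    counts = defaultdict(int)
    for click in clicks:
        if click["click_id"] in seen:
            continue           # same click delivered twice: identity resolves it
        seen.add(click["click_id"])
        # Bucket by when the click happened, not by when it arrived.
        counts[(click["ad_id"], minute_bucket(click["event_ts"]))] += 1
    return dict(counts)


clicks = [
    {"click_id": "a", "ad_id": "ad1", "event_ts": 30, "arrival_ts": 31},
    {"click_id": "a", "ad_id": "ad1", "event_ts": 30, "arrival_ts": 32},   # browser double-fire
    {"click_id": "b", "ad_id": "ad1", "event_ts": 45, "arrival_ts": 400},  # late mobile click
    {"click_id": "c", "ad_id": "ad1", "event_ts": 61, "arrival_ts": 62},
]
result = aggregate(clicks)
```

The late click `b` lands in the 00:00 minute even though it arrived more than six minutes later, and the double-fired click `a` is counted once.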
The architecture that answers all of it is to count twice: a stream path for speed, a batch path over an immutable raw log for truth, and reconciliation between them, plus a dedicated budget lane where latency is literally priced. This is the streaming-systems interview in its most concrete costume: every abstraction (watermarks, exactly-once, hot keys) has a dollar amount attached.
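As a rough illustration of the reconciliation step, the check below compares a stream count against the batch count for the same key and flags divergence above the 0.1% threshold quoted in the summary. The function names and the choice of the batch count as the denominator are assumptions made for this sketch.

```python
def divergence(stream_count: int, batch_count: int) -> float:
    """Relative gap between the stream count and the batch (truth) count."""
    if batch_count == 0:
        # No batch clicks: any stream clicks are pure divergence.
        return 0.0 if stream_count == 0 else float("inf")
    return abs(stream_count - batch_count) / batch_count


def should_page(stream_count: int, batch_count: int, threshold: float = 0.001) -> bool:
    """Page when the two paths disagree by more than the threshold (0.1% by default)."""
    return divergence(stream_count, batch_count) > threshold
```

For example, `should_page(10_005, 10_000)` is False (0.05% divergence) while `should_page(1_002, 1_000)` is True (0.2%).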
One click, three masters: dashboards (seconds), budgets (faster, because lag is priced in cash), invoices (exact under audit). Scale: 8.6B/day, 500K/sec, viral ads concentrating 100K/sec on one key. Integrity: 10-30% fraud, minutes-late mobile clicks, browser doubles. Answer: count twice, plus a budget lane.
An ad click aggregator turns a firehose of raw click events (500K per second at peak) into three products with incompatible requirements: dashboards that must be fresh within seconds, budget enforcement where every second of lag is literally priced in dollars, and invoices that must be exactly right under audit. The resolving idea is to count twice: a stream path (Kafka -> Flink -> OLAP) that is fast and approximately right, a batch path that nightly recomputes truth from the immutable raw log, and reconciliation that pages when the two disagree. Around that skeleton live the streaming interview's greatest hits: event-time windows and watermarks, two-stage aggregation for viral ads, the three-link exactly-once chain, and fraud as judgment rather than dedup.
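Two-stage aggregation for a hot key can be sketched as follows: stage one appends a random salt to the key so a viral ad's clicks spread over several partial counters, and stage two strips the salt and sums the partials. This is a plain-Python illustration of the idea, not Flink code; `num_salts` and the function names are assumptions.

```python
import random
from collections import Counter


def stage_one(ad_ids, num_salts, rng):
    """Pre-aggregate on (ad_id, salt) so one hot ad spreads over num_salts keys."""
    partial = Counter()
    for ad_id in ad_ids:
        salt = rng.randrange(num_salts)  # random salt in [0, num_salts)
        partial[(ad_id, salt)] += 1
    return partial


def stage_two(partial):
    """Drop the salt and sum the partial counts back into one total per ad."""
    total = Counter()
    for (ad_id, _salt), count in partial.items():
        total[ad_id] += count
    return total


rng = random.Random(7)  # seeded only to make the sketch repeatable
ad_ids = ["viral"] * 1000 + ["quiet"] * 3
partials = stage_one(ad_ids, num_salts=8, rng=rng)
totals = stage_two(partials)
```

The final totals are unchanged by salting; what changes is that no single stage-one key has to absorb all of the viral ad's traffic, and stage two only sees at most `num_salts` partial rows per ad.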
  • Scale: 8.6B clicks/day, 500K/sec peak; raw log 1.3 TB/day; aggregates 500x smaller
  • The key move: count twice, with stream for dashboards and budgets, batch over the raw log for billing, and reconciliation between them
  • The money metric: budget lane staleness, where seconds of lag x spend velocity = dollars the platform eats
  • Hygiene: duplicates need identity (click_id), fraud needs judgment (tiered verdicts; marks, never deletes)
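The headline numbers above can be sanity-checked with back-of-envelope arithmetic. The sketch below derives the average rate, the peak-to-average ratio, and the implied bytes per raw click from the stated figures, then applies the money metric. The cost-per-click value in the usage line is a made-up illustration, not a figure from this walkthrough.

```python
CLICKS_PER_DAY = 8.6e9
SECONDS_PER_DAY = 86_400
PEAK_PER_SEC = 500_000
RAW_LOG_BYTES_PER_DAY = 1.3e12  # 1.3 TB/day

avg_per_sec = CLICKS_PER_DAY / SECONDS_PER_DAY            # roughly 99.5K clicks/sec
peak_to_avg = PEAK_PER_SEC / avg_per_sec                  # roughly 5x
bytes_per_click = RAW_LOG_BYTES_PER_DAY / CLICKS_PER_DAY  # roughly 150 bytes


def overspend_dollars(lag_seconds, clicks_per_sec, cost_per_click):
    """Money metric: seconds of lag x spend velocity = dollars the platform eats."""
    spend_velocity = clicks_per_sec * cost_per_click  # dollars per second
    return lag_seconds * spend_velocity


# Hypothetical: a viral ad at 100K clicks/sec, $0.50 per click, 2 seconds of lag.
exposure = overspend_dollars(2, 100_000, 0.50)
```

Under those assumed inputs, two seconds of staleness on a single viral campaign is already a six-figure exposure, which is why the budget lane gets its own latency target.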