Ad Click Aggregator
COMMON
Click aggregation is asked at Google, Meta, and Amazon because it is the streaming interview with dollar amounts attached: 8.6 billion clicks a day feeding dashboards that must be fresh in seconds, budget enforcement where every second of lag is priced in cash, and invoices that must be exact under audit. You will count twice (a Flink stream path for decisions, a nightly batch over the immutable raw log for money), survive the viral ad with two-stage salted aggregation, walk the three-link exactly-once chain, and separate duplicates (identity) from fraud (judgment).
- Count twice: stream for dashboards and budgets, batch over the raw log for billing, reconciliation between the two
- Survive the Super Bowl ad: two-stage salted aggregation bounds the merge at 16 msg/sec regardless of volume
- Exactly-once as a three-link chain: event identity, atomic state+offsets, and the sink everyone forgets
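The three links in the last bullet can be sketched as a toy simulation in plain Python. This is not Flink: `Checkpoint`, `run`, `crash_and_recover`, and the in-memory `sink` dict are hypothetical names introduced only for illustration. The sketch assumes every click carries a unique `click_id` (link one), that state and the input offset are snapshotted together (link two), and that the sink is keyed by `(ad_id, window)` and receives the running total, so a replayed write overwrites instead of adding (link three).

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """State and input offset saved together, so they can never disagree."""
    offset: int = 0
    counts: dict = field(default_factory=dict)  # (ad_id, window) -> clicks
    seen: set = field(default_factory=set)      # click_ids already counted


def run(log, checkpoint, sink, idempotent=True, crash_at=None, every=3):
    """Process `log` from `checkpoint` and return the last durable checkpoint.

    `crash_at` simulates a failure just before reading that offset: in-memory
    work since the last checkpoint is lost, but sink writes already happened.
    """
    offset = checkpoint.offset
    counts = dict(checkpoint.counts)
    seen = set(checkpoint.seen)
    durable = checkpoint
    while offset < len(log):
        if offset == crash_at:
            return durable
        click_id, ad_id, window = log[offset]
        key = (ad_id, window)
        if click_id not in seen:                  # link 1: event identity
            seen.add(click_id)
            counts[key] = counts.get(key, 0) + 1
            if idempotent:
                sink[key] = counts[key]           # link 3: upsert the total
            else:
                sink[key] = sink.get(key, 0) + 1  # naive: add an increment
        offset += 1
        if offset % every == 0:                   # link 2: state + offset as one
            durable = Checkpoint(offset, dict(counts), set(seen))
    return Checkpoint(offset, dict(counts), set(seen))


# (click_id, ad_id, window); "c2" is delivered twice by the client.
LOG = [("c1", "ad1", 0), ("c2", "ad1", 0), ("c2", "ad1", 0),
       ("c3", "ad2", 0), ("c4", "ad1", 0)]


def crash_and_recover(idempotent):
    """Crash before offset 4, then restart from the last checkpoint."""
    sink = {}
    cp = run(LOG, Checkpoint(), sink, idempotent, crash_at=4)
    run(LOG, cp, sink, idempotent)
    return sink
```

Under these assumptions, `crash_and_recover(True)` ends with 3 clicks for `ad1` and 1 for `ad2`. `crash_and_recover(False)` reports 2 for `ad2`, because the replayed `c3` is added to the sink a second time. That is the failure the third link exists to prevent.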
Start simple. Build to staff-level.
“I would design a click aggregator for 8.6 billion clicks a day, 500K per second at peak, serving three masters (dashboards, budget enforcement, and billing) on different points of the speed-certainty curve. The key move: count twice. A Flink stream path feeds dashboards in seconds and a budget lane fresh within 1-2 seconds, because staleness times spend velocity is dollars the platform eats. That path uses event-time windows with 15-minute lateness, two-stage salted aggregation (a viral ad costs the merge layer 16 messages a second, not 100K), checkpoint-atomic state, and an idempotent upsert sink. A nightly batch over the immutable raw log (1.3 TB/day, about $30 of S3) recomputes truth with exhaustive dedup and mature fraud verdicts, and reconciliation pages the on-call when the stream and batch counts diverge by more than 0.1%.”
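The two-stage salted aggregation mentioned in the pitch can be illustrated with a minimal single-process sketch. It assumes stage one emits one partial count per `(ad_id, salt)` pair per one-second window and that the salt comes from a stable hash of the click ID. The function names and the choice of `zlib.crc32` are assumptions made for this example, not part of the original design. In a real deployment the 16 salted keys would be spread across parallel workers.

```python
import zlib
from collections import Counter

NUM_SALTS = 16  # from the text: the hot ad is spread across 16 partials


def salt_for(click_id: str) -> int:
    """Deterministic salt; crc32 is an assumption, any stable hash would do."""
    return zlib.crc32(click_id.encode("utf-8")) % NUM_SALTS


def stage_one(clicks):
    """Pre-aggregate one window by (ad_id, salt).

    `clicks` is an iterable of (click_id, ad_id) pairs from the same window.
    Returns at most NUM_SALTS partial counts per ad.
    """
    partials = Counter()
    for click_id, ad_id in clicks:
        partials[(ad_id, salt_for(click_id))] += 1
    return partials


def stage_two(partials):
    """Merge the partials back into one count per ad."""
    totals = Counter()
    for (ad_id, _salt), count in partials.items():
        totals[ad_id] += count
    return totals


def merge_messages_per_ad(partials):
    """How many messages stage two receives for each ad in this window."""
    return Counter(ad_id for ad_id, _salt in partials)


# One second of traffic: a viral ad with 100,000 clicks and a quiet one with 3.
window = [(f"click-{i}", "viral_ad") for i in range(100_000)]
window += [(f"quiet-{i}", "quiet_ad") for i in range(3)]

partials = stage_one(window)
totals = stage_two(partials)
fan_in = merge_messages_per_ad(partials)
```

However many clicks the viral ad receives in the window, `fan_in["viral_ad"]` cannot exceed `NUM_SALTS`, so the merge stage sees at most 16 messages for it while `totals["viral_ad"]` still equals the full click count.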
Count Twice
TRICKY
Stream for decisions, batch over the raw log for money, reconciliation between the two
Core Aggregation Design
Exactly-Once Counting
TRICKY
Three links: event identity, atomic state+offsets, and the sink everyone forgets
Core Aggregation Design
Windows and Watermarks
STANDARD
Event time, not processing time: watermark -> fire -> lateness -> correction -> side-output
High Level System Design
Two-Stage Aggregation
STANDARD
Salt the hot ad across 16 partials; stage two merges 16 msg/sec no matter the volume
High Level System Design
The Raw Log
STANDARD
Views are disposable, truth is not: replay, reprocess, audit, bootstrap
Database Schema
The Budget Race
TRICKY
Staleness × spend velocity = dollars at risk: fail closed, pace predictively
Replication and Fault Tolerance
Dedup vs Fraud
STANDARD
Duplicates need identity (click_id); fraud needs judgment (tiered verdicts, marks not deletes)
Monitoring and Complete System
OLAP Serving
EASY
Columnar + rollups: 30-day dashboards read 720 rows, not 43,200 or billions
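Two of the cards above carry arithmetic that is easy to check by hand. The sketch below assumes the 720 and 43,200 figures correspond to hourly and per-minute rollup rows over 30 days, which matches the numbers but is an inference. It treats "dollars at risk" as the plain product named on the Budget Race card. The spend rate of $50 per second and the 60-second comparison lane are hypothetical values chosen only to make the product concrete.

```python
# Rollup rows scanned by a 30-day dashboard query for one ad.
DAYS = 30
hourly_rows = DAYS * 24        # one pre-aggregated row per hour
minute_rows = DAYS * 24 * 60   # one pre-aggregated row per minute


def dollars_at_risk(staleness_seconds: float, spend_per_second: float) -> float:
    """Overspend exposure: what an ad can spend before the budget check
    sees counts that are `staleness_seconds` old."""
    return staleness_seconds * spend_per_second


# Hypothetical campaign spending $50 per second at peak.
SPEND_PER_SECOND = 50.0
budget_lane = dollars_at_risk(2, SPEND_PER_SECOND)   # 2 s stale budget lane
slow_lane = dollars_at_risk(60, SPEND_PER_SECOND)    # 60 s stale, for contrast
```

With these inputs the hourly rollup reads 720 rows instead of 43,200, and tightening staleness from 60 seconds to 2 seconds cuts the exposure from $3,000 to $100 for this one hypothetical campaign.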
What is an Ad Click Aggregator