Click Aggregator Cheat Sheet

Key concepts, trade-offs, and quick-reference notes for your interview prep.

Count Twice: The Lambda Answer, Defended

Stream path: Flink over Kafka, per-(ad, minute) counters, fresh within seconds. It powers dashboards and budgets and is allowed to be approximately right. Batch path: nightly recompute over the immutable raw log with exhaustive dedup, mature fraud verdicts, and all late events; it produces the ONLY number billing sees. Reconciliation diffs the two; divergence past 0.1% pages. Present kappa honestly (one exactly-once stream + replay), then defend lambda on audit grounds: "we re-ran arithmetic over the immutable log" convinces auditors; "we replayed the stream job" does not.

💡 Fast path for decisions, slow path for money. Two codebases is a real cost: name it before the interviewer does.
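
A minimal sketch of the reconciliation check, assuming per-campaign click counts from each path are already available as dictionaries. The function name `reconcile` and the campaign IDs are illustrative; only the 0.1% paging threshold comes from the notes above. The batch count is used as the reference because it is the number billing sees.

```python
from typing import Dict, List, Tuple

DIVERGENCE_THRESHOLD = 0.001  # 0.1%, the paging threshold from the notes


def reconcile(
    stream_counts: Dict[str, int],
    batch_counts: Dict[str, int],
    threshold: float = DIVERGENCE_THRESHOLD,
) -> List[Tuple[str, float]]:
    """Return (campaign, relative divergence) for every campaign past the threshold."""
    offenders = []
    for campaign in sorted(set(stream_counts) | set(batch_counts)):
        stream = stream_counts.get(campaign, 0)
        batch = batch_counts.get(campaign, 0)
        if batch == 0:
            # No batch clicks: any stream clicks count as unbounded divergence.
            divergence = 0.0 if stream == 0 else float("inf")
        else:
            divergence = abs(stream - batch) / batch
        if divergence > threshold:
            offenders.append((campaign, divergence))
    return offenders


# camp-a is off by 50 in ~100K (under 0.1%); camp-b is off by 200 in 100K (0.2%).
to_page = reconcile(
    {"camp-a": 100_000, "camp-b": 100_200},
    {"camp-a": 100_050, "camp-b": 100_000},
)
```

Only `camp-b` ends up in `to_page`; a real job would also record which direction the stream erred in, since over-counting and under-counting point to different bugs.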

Exactly-Once: The Three-Link Chain

"Kafka has exactly-once" is a checkbox, not a design. The chain: 1) event identity: a click_id minted at capture (Snowflake ID), so duplication is detectable at all; 2) atomic state + offsets: Flink checkpoint barriers snapshot counter state aligned with input offsets, so a crash rewinds both together; 3) the sink, the forgotten link: emit non-transactionally and outputs duplicate even with perfect state. Our sink is idempotent by design: upsert keyed by (ad, window). Break any link and the pipeline silently downgrades to at-least-once.

💡 The sink is where exactly-once dies quietly. Upsert-by-window makes re-emission naturally safe.
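
To make the third link concrete, here is a small in-memory sketch contrasting an upsert sink with an incrementing one. Both classes and the `emit_twice` helper are hypothetical stand-ins, not a real connector API; the assumption is that each emission carries the full count for its window, which is what makes overwriting safe.

```python
from typing import Dict, Tuple

WindowKey = Tuple[str, str]  # (ad_id, window start), e.g. ("ad-7", "12:00")


class UpsertSink:
    """Idempotent: writing the same (ad, window) row again just overwrites it."""

    def __init__(self) -> None:
        self.rows: Dict[WindowKey, int] = {}

    def emit(self, ad_id: str, window: str, count: int) -> None:
        self.rows[(ad_id, window)] = count


class IncrementSink:
    """Counter-example: adds on every emit, so a replayed emission double counts."""

    def __init__(self) -> None:
        self.rows: Dict[WindowKey, int] = {}

    def emit(self, ad_id: str, window: str, count: int) -> None:
        key = (ad_id, window)
        self.rows[key] = self.rows.get(key, 0) + count


def emit_twice(sink) -> int:
    """Simulate a crash right after emitting: the restarted job re-emits the window."""
    sink.emit("ad-7", "12:00", 42)
    sink.emit("ad-7", "12:00", 42)
    return sink.rows[("ad-7", "12:00")]
```

With these definitions, `emit_twice(UpsertSink())` leaves the row at 42 while `emit_twice(IncrementSink())` leaves it at 84, which is the silent downgrade to at-least-once in miniature.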

Event Time, Watermarks, Late Clicks

The subway click: happens at 12:00:59, arrives at 12:03:31. Windows key by event time (for billing, the click belongs to the 12:00 window). The watermark asserts "all events up to T are (probably) here": when it passes 12:01, the window fires, then lingers for 15 min of allowed lateness, during which stragglers emit corrections (hence the upsert sink). Later still -> side output: lost to the dashboard, never to billing (the batch has no deadline). Tumbling windows, never sliding: billing needs each click to have exactly one home.

💡 Watermark -> fire -> lateness -> correction -> side-output: know the full lifecycle, not just the word.
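
The lifecycle can be walked through with a plain-Python simulation. This is not Flink's API; the class `TumblingWindowCounter`, the helper `hms`, and the exact boundary rule for "too late" are assumptions made for illustration. Times are seconds since midnight.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

WINDOW_SEC = 60             # one-minute tumbling windows
ALLOWED_LATENESS_SEC = 900  # 15 minutes, as in the notes


def hms(h: int, m: int, s: int) -> int:
    """Seconds since midnight, to keep the example readable."""
    return h * 3600 + m * 60 + s


class TumblingWindowCounter:
    def __init__(self) -> None:
        self.watermark = 0
        self.counts: Dict[Tuple[str, int], int] = defaultdict(int)
        self.fired: Set[Tuple[str, int]] = set()
        self.emitted: List[Tuple[str, int, int, str]] = []  # (ad, window_start, count, kind)
        self.side_output: List[Tuple[str, int]] = []         # (ad, event_time)

    def on_event(self, ad_id: str, event_time: int) -> None:
        window_start = event_time - event_time % WINDOW_SEC  # exactly one home per click
        window_end = window_start + WINDOW_SEC
        if self.watermark >= window_end + ALLOWED_LATENESS_SEC:
            self.side_output.append((ad_id, event_time))     # too late for the stream
            return
        key = (ad_id, window_start)
        self.counts[key] += 1
        if key in self.fired:
            # Window already fired: re-emit the full count as a correction.
            self.emitted.append((ad_id, window_start, self.counts[key], "correction"))

    def advance_watermark(self, watermark: int) -> None:
        self.watermark = max(self.watermark, watermark)
        for key in sorted(self.counts):
            if key not in self.fired and key[1] + WINDOW_SEC <= self.watermark:
                self.fired.add(key)
                self.emitted.append((key[0], key[1], self.counts[key], "fire"))


w = TumblingWindowCounter()
w.on_event("ad-1", hms(12, 0, 10))   # on-time click
w.advance_watermark(hms(12, 1, 0))   # window [12:00, 12:01) fires with count 1
w.on_event("ad-1", hms(12, 0, 59))   # the subway click: correction with count 2
w.advance_watermark(hms(12, 17, 0))  # past 12:01 plus 15 min of lateness
w.on_event("ad-1", hms(12, 0, 30))   # later still: goes to the side output
```

Because corrections carry the full window count rather than a delta, they are safe to write through an upsert sink.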

Two-Stage Aggregation: The Viral Ad

Keying by ad_id routes a Super Bowl ad's 100K clicks/sec to ONE worker: the hot key melts the pipeline. Fix: salt. Stage one keys by (ad_id, salt 0-15); each worker keeps a partial count and emits it every ~second. Stage two keys by ad_id and merges at most 16 partials/sec per ad, regardless of click volume. Legal because counts are associative: partial sums merge losslessly. Distinct-counts need HyperLogLog sketches to stay mergeable; averages ship (sum, count) pairs: never average averages.

💡 This is MapReduce's combiner in streaming clothes. Write hot spots salt; read hot spots cache.
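
A single-process sketch of the two stages, assuming a random salt per click and treating each dictionary key as the unit a worker would own. The function names are illustrative; a real pipeline would express the same thing as two keyed operators.

```python
import random
from collections import Counter
from typing import Dict, Iterable, List, Tuple

NUM_SALTS = 16  # salt range 0-15, as in the notes


def stage_one(clicks: Iterable[str], rng: random.Random) -> Dict[Tuple[str, int], int]:
    """Key by (ad_id, salt): a hot ad spreads over up to NUM_SALTS partial counters."""
    partials: Counter = Counter()
    for ad_id in clicks:
        partials[(ad_id, rng.randrange(NUM_SALTS))] += 1
    return partials


def stage_two(partials: Dict[Tuple[str, int], int]) -> Dict[str, int]:
    """Key by ad_id: merge at most NUM_SALTS partials per ad, whatever the click volume."""
    totals: Counter = Counter()
    for (ad_id, _salt), count in partials.items():
        totals[ad_id] += count
    return totals


def merge_means(pairs: Iterable[Tuple[float, int]]) -> float:
    """Merge (sum, count) pairs into one mean; averaging the averages would be wrong."""
    items: List[Tuple[float, int]] = list(pairs)
    return sum(s for s, _ in items) / sum(n for _, n in items)


clicks = ["superbowl"] * 100_000 + ["small"] * 10
totals = stage_two(stage_one(clicks, random.Random(0)))
```

For the averages point: `merge_means([(10.0, 1), (300.0, 3)])` gives 77.5, whereas averaging the two per-partition means (10 and 100) would give 55.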

The Raw Log: Views Are Disposable, Truth Is Not

Every counter, rollup, and invoice is a derived view; the truth is the append-only raw log: Kafka 7-day hot, S3 archive partitioned by hour (~1.3 TB/day: cheap next to its value). Buys: replay (bug miscounts Tuesday -> recompute Tuesday), reprocessing (new fraud rules re-run over history), audit (disputes answered with specific clicks, not counter values), bootstrap (new consumers replay history). Disciplines: durable-first capture; never edit (fraud is marked in a verdict stream, never deleted); additive schema changes only.

💡 Event sourcing, applied narrowly: aggregates are snapshots, the log is the events, replay is the recovery story.
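
A minimal sketch of "views are disposable", assuming the log fits in memory and is already deduplicated. `Click`, `build_view`, and the rule passed in are hypothetical names; the point is only that a view is a pure function of the log, so a changed rule means re-running the function, not editing the log.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)  # frozen: log records are never edited in place
class Click:
    click_id: str
    ad_id: str
    event_time: int  # epoch seconds


def build_view(
    log: Sequence[Click],
    counts_as_valid: Callable[[Click], bool] = lambda click: True,
) -> Dict[Tuple[str, int], int]:
    """Derive per-(ad, minute) counts from the raw log under a given rule."""
    view: Dict[Tuple[str, int], int] = {}
    for click in log:
        if counts_as_valid(click):
            key = (click.ad_id, click.event_time - click.event_time % 60)
            view[key] = view.get(key, 0) + 1
    return view


log = (
    Click("c1", "ad-1", 0),
    Click("c2", "ad-1", 30),
    Click("c3", "ad-1", 61),
)
original_view = build_view(log)
# Reprocessing: a new rule is applied by replaying; the log itself is untouched.
revised_view = build_view(log, lambda click: click.click_id != "c2")
```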

The Budget Race: Latency Priced in Dollars

$1,000 budget, $2 clicks, 500 clicks/sec viral: exhausted in one second, and every click during the detection lag is money the platform eats. The budget lane: spend counters from stage-two aggregates within 1-2s, consulted by ad serving. Inverted properties: slightly-wrong is fine (err toward stopping early), slow is not, and it fails closed for money: an unreachable budget service pauses high-spend campaigns. Residual overage (~1K clicks at true virality) is handled by predictive pacing (throttle before exhaustion) and a published overage tolerance: an SLO with a literal invoice.

💡 Under-delivery costs goodwill; over-delivery costs cash. That asymmetry decides every tie.
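
The arithmetic and the fail-closed rule can be sketched as follows, reading the example figures as a $1,000 budget at $2 per click (an assumption about the units). The function names are illustrative, and letting low-spend campaigns keep serving when the budget service is unreachable is an assumption; the notes only say that high-spend campaigns pause.

```python
from typing import Optional


def seconds_to_exhaust(budget: float, cost_per_click: float, clicks_per_sec: float) -> float:
    """How long a budget lasts at a steady click rate."""
    return budget / (cost_per_click * clicks_per_sec)


def overage_clicks(clicks_per_sec: float, detection_lag_sec: float) -> float:
    """Clicks served after exhaustion but before ad serving sees the spend."""
    return clicks_per_sec * detection_lag_sec


def should_serve(remaining_budget: Optional[float], high_spend: bool) -> bool:
    """remaining_budget is None when the budget service cannot be reached."""
    if remaining_budget is None:
        return not high_spend  # fail closed where the money is
    return remaining_budget > 0


lifetime = seconds_to_exhaust(budget=1_000, cost_per_click=2, clicks_per_sec=500)
overage = overage_clicks(clicks_per_sec=500, detection_lag_sec=2)
```

Here `lifetime` is 1.0 second and `overage` is 1,000 clicks at the slow end of the 1-2s lane, which matches the ~1K residual overage quoted above.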

Duplicates Need Identity; Fraud Needs Judgment

Duplicates: the same physical click twice (browser double-fire, retry, replay) -> fixed by click_id identity: capture-edge SETNX for the doubles, exhaustive batch dedup for the rest. Fraud: real clicks that must not be paid (bots, farms, budget-draining competitors) -> each is unique, so it must be judged, in latency tiers: inline rules at capture (no impression token, inhuman rates), near-real-time scoring feeding the budget lane, mature verdicts before billing. Fraud runs 10-30% of raw clicks: a first-class subtraction, not an edge case.

💡 Marks, never deletes: the fraudulent click stays in the log, flagged in the verdict stream. Evidence and judgment stay separate.
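
A small sketch of both halves, using an in-memory set as a stand-in for a SETNX-style store. `SeenClicks` and `billable_counts` are hypothetical names, and a production store would also expire keys. Fraud verdicts are modeled as a separate set of click_ids so the log itself is never modified.

```python
from typing import Dict, Iterable, Set, Tuple

Click = Tuple[str, str]  # (click_id, ad_id)


class SeenClicks:
    """In-memory stand-in for set-if-absent at the capture edge."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def set_if_absent(self, click_id: str) -> bool:
        """True the first time a click_id is seen, False for a duplicate."""
        if click_id in self._seen:
            return False
        self._seen.add(click_id)
        return True


def billable_counts(log: Iterable[Click], fraud_verdicts: Set[str]) -> Dict[str, int]:
    """Batch view: exhaustive dedup by click_id, then subtract flagged clicks."""
    unique: Dict[str, str] = {}
    for click_id, ad_id in log:
        unique.setdefault(click_id, ad_id)  # keep the first copy of each click
    counts: Dict[str, int] = {}
    for click_id, ad_id in unique.items():
        if click_id not in fraud_verdicts:
            counts[ad_id] = counts.get(ad_id, 0) + 1
    return counts


raw_log = [("c1", "ad-1"), ("c1", "ad-1"), ("c2", "ad-1"), ("c3", "ad-2")]
billable = billable_counts(raw_log, fraud_verdicts={"c2"})
```

The duplicate `c1` and the flagged `c2` both drop out of `billable`, while `raw_log` still holds all four records as evidence.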

OLAP Serving: The Right Read Path

Dashboards slice ad hoc (campaign x hour x geo x device): the combinations are not enumerable, so no precompute-everything; not raw-log queries either. Columnar OLAP (Druid/ClickHouse/Pinot): ingests stream aggregates continuously, reads only queried columns, partitions by time, and rolls up minute -> hour -> day so a 30-day dashboard reads 720 hourly rows, not 43,200 minute rows. Corrections upsert (ad, window) cells: amended numbers appear with no special handling. Aggregates run ~3 GB/day, roughly 500x smaller than raw, because the stream already paid the aggregation cost.

💡 KV stores die on dimensions you did not key. Columnar IS the index for ad-hoc slicing.
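
The rollup row counts can be checked with a toy example: one ad, one row per minute for 30 days, rolled up in memory. The `rollup` function is an illustrative sketch of the idea, not how any of the named OLAP engines implement it.

```python
from typing import Dict, Tuple

Row = Tuple[str, int]  # (ad_id, bucket start in epoch seconds)


def rollup(rows: Dict[Row, int], bucket_sec: int) -> Dict[Row, int]:
    """Sum counts into coarser time buckets; legal because counts are additive."""
    out: Dict[Row, int] = {}
    for (ad_id, start), count in rows.items():
        key = (ad_id, start - start % bucket_sec)
        out[key] = out.get(key, 0) + count
    return out


MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200
minute_rows = {("ad-1", m * 60): 1 for m in range(MINUTES_IN_30_DAYS)}
hour_rows = rollup(minute_rows, 3_600)   # 30 * 24 = 720 rows
day_rows = rollup(hour_rows, 86_400)     # 30 rows
```

The total count is the same at every level; only the number of rows a dashboard query has to touch changes.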

Capacity: 8.6B Clicks/Day

Peak 500K clicks/sec, average ~100K/sec = 8.6B/day. Events ~150 bytes each -> Kafka ingest 75 MB/sec at peak, raw log 1.3 TB/day to S3. Active windows: ~2M (ad, minute) pairs/min -> aggregate stream ~3 GB/day into OLAP. Flink: 200 workers in stage one (salted), 50 in stage two; checkpoints every 10s (so only seconds of replay on a crash). Budget lane: 10M campaign counters x 50 bytes = 500 MB in Redis. Batch: one nightly Spark pass over 1.3 TB, about an hour on a modest cluster.

💡 The raw log dwarfs everything downstream: aggregation is a 500x compression that makes serving cheap.
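
The headline numbers follow from a few multiplications, reproduced here as a back-of-envelope check. It assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and reads the event and counter sizes as bytes; the constant names are illustrative.

```python
AVG_CLICKS_PER_SEC = 100_000
PEAK_CLICKS_PER_SEC = 500_000
EVENT_BYTES = 150
SECONDS_PER_DAY = 86_400
CAMPAIGN_COUNTERS = 10_000_000
COUNTER_BYTES = 50
AGGREGATE_BYTES_PER_DAY = 3_000_000_000  # ~3 GB/day, taken from the notes

clicks_per_day = AVG_CLICKS_PER_SEC * SECONDS_PER_DAY          # 8.64 billion
peak_ingest_bytes_per_sec = PEAK_CLICKS_PER_SEC * EVENT_BYTES  # 75 MB/sec
raw_log_bytes_per_day = clicks_per_day * EVENT_BYTES           # ~1.3 TB/day
budget_lane_bytes = CAMPAIGN_COUNTERS * COUNTER_BYTES          # 500 MB
compression_ratio = raw_log_bytes_per_day / AGGREGATE_BYTES_PER_DAY
```

With these inputs the raw-to-aggregate ratio works out to a little over 400x, the same order of magnitude as the ~500x quoted.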

The Metrics That Matter

Reconciliation divergence (stream vs batch, per campaign): past 0.1% it pages, because one of the paths is lying. Watermark lag (event time behind wall clock): the freshness truth behind every dashboard promise. Budget lane staleness: seconds of lag x spend velocity = dollars at risk, the only latency metric with a currency. Correction rate (late-click amendments): rising means upstream delays or watermark mistuning. Fraud subtraction rate by inventory source: a step change means a bot wave or a broken model. Checkpoint duration/failure: the exactly-once heartbeat. Hot-key detector: stage-two per-key load divergence.

💡 Budget staleness is the metric to show an interviewer: latency literally denominated in dollars.
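
Two of these metrics reduce to one-line formulas, sketched below. `dollars_at_risk` is the staleness product stated above; `hot_keys` is only one possible reading of "per-key load divergence" (a key whose load exceeds a multiple of the median), and both the definition and the default factor of 10 are assumptions.

```python
from statistics import median
from typing import Dict, List


def dollars_at_risk(staleness_sec: float, spend_velocity_per_sec: float) -> float:
    """Budget lane staleness priced in currency: seconds of lag x spend velocity."""
    return staleness_sec * spend_velocity_per_sec


def hot_keys(load_by_key: Dict[str, float], factor: float = 10.0) -> List[str]:
    """Keys whose load exceeds `factor` times the median per-key load."""
    if not load_by_key:
        return []
    typical = median(load_by_key.values())
    return sorted(k for k, load in load_by_key.items() if load > factor * typical)


# 500 clicks/sec at $2 per click is $1,000/sec of spend velocity.
at_risk = dollars_at_risk(staleness_sec=2, spend_velocity_per_sec=1_000)
flagged = hot_keys({"ad-a": 100, "ad-b": 120, "ad-c": 90, "ad-viral": 100_000})
```

Two seconds of staleness at that velocity puts $2,000 at risk, and only `ad-viral` is flagged as hot.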