Click Aggregator Anti-Patterns

Common design mistakes candidates make. Learn what goes wrong and how to avoid each trap in your interview.

UPDATE counters SET clicks = clicks + 1

Very CommonFORMULA

Counting clicks by incrementing a database row per (ad, day) directly in the click-serving path.

Why: It is the shortest path to a working demo, and row contention is invisible until an ad gets popular.

WRONG: Every click takes a row lock on its ad's counter; a viral ad serializes 100K clicks/sec onto one row: the database becomes the ad's bottleneck, and a crashed transaction mid-flight leaves counts silently wrong with no way back.

RIGHT: Capture appends to an immutable log (durable-first, microseconds, no contention); aggregation happens downstream in a stream processor built for keyed counting, with the log available to recompute anything.

Discarding Raw Events After Aggregation

Very CommonFORMULA

Keeping only the aggregated counts and letting raw click events age out of Kafka into nothing.

Why: Raw events look like exhaust once the counters exist: 1.3 TB/day of storage for data 'we already counted'.

WRONG: A fraud model update, an aggregation bug, or a billing dispute arrives three weeks later: and there is nothing to recompute from. The wrong numbers are permanent, and the dispute settles on apology terms.

RIGHT: S3 archive of the full raw log, partitioned by hour, kept for years: ~$30/day of storage that converts every downstream mistake from permanent to replayable, and every dispute from a shrug to an evidence list.

Exactly-Once by Checkbox

CommonFORMULA

Enabling the framework's exactly-once mode and assuming end-to-end correctness without examining the sink.

Why: The config flag is called exactly_once; the gap between state semantics and end-to-end semantics is documentation fine print.

WRONG: Flink checkpoints are pristine, but aggregates emit to the OLAP store via plain inserts. A crash between emit and checkpoint re-emits the window: dashboards and (if trusted) invoices double-count, while every internal metric says exactly-once.

RIGHT: Audit the full chain: identity (click_id), atomic state+offsets (checkpoints), AND a cooperative sink: idempotent upserts keyed by (ad, window), or transactional two-phase writes. Then let billing re-derive from the log anyway.

Windowing by Processing Time

CommonFORMULA

Assigning clicks to the minute they arrived at the aggregator rather than the minute they occurred.

Why: Processing time needs no watermarks, no lateness handling, no corrections: it is dramatically simpler, and it looks identical in a demo where events arrive instantly.

WRONG: A mobile network hiccup shifts thousands of 11:59 clicks into the 12:03 bucket. Budgets close on wrong minutes, dashboards show phantom spikes, and billing assigns spend to intervals where it did not happen: honestly counted, systematically misfiled.

RIGHT: Event-time windows with watermarks and allowed lateness; corrections upsert amended aggregates. Processing time is only acceptable for operational metrics where 'when we saw it' is genuinely the question.

Single-Stage Keyed Aggregation

CommonFORMULA

Partitioning the stream by ad_id alone, so each ad's counter lives on exactly one worker.

Why: It is the textbook keyed-stream topology, correct by construction, and load-skew only appears when an ad actually goes viral.

WRONG: The Super Bowl ad routes 100K clicks/sec to one worker while 199 idle. Backpressure cascades upstream, checkpoints time out, and the whole pipeline: every ad, not just the hot one: lags during the exact minutes the dashboard matters most.

RIGHT: Two-stage aggregation: stage one keyed by (ad_id, salt) emits partial counts; stage two merges at most salt-count partials per second per ad. The viral ad costs stage two the same 16 messages/sec as a sleepy one.

Enforcing Budgets from the Batch Path

CommonFORMULA

Checking campaign spend against budgets using the nightly (or hourly) batch totals.

Why: The batch numbers are the correct ones, and reusing them avoids building a separate fast lane.

WRONG: A $1,000 campaign goes viral at 9 AM; the hourly job notices at 10. Fifty thousand dollars of clicks ran on a thousand-dollar budget, none billable past the cap: the platform ate 49x the campaign's value because correctness arrived an hour late.

RIGHT: A dedicated budget lane: spend counters within 1-2 seconds of stage-two aggregates, fail-closed semantics, and predictive pacing that throttles before exhaustion. Slightly-wrong-but-immediate beats exactly-right-but-late when every second is priced.

Dashboards Querying Raw Events

CommonFORMULA

Pointing the advertiser dashboard's queries directly at the raw click log or event table.

Why: The raw data answers every possible question, so why maintain aggregates that answer only some?

WRONG: Every dashboard paint scans billions of rows; a conference demo with ten open dashboards takes down the cluster; and p95 dashboard latency is measured in coffee breaks. The truth layer was never built to be a serving layer.

RIGHT: Stream-maintained aggregates in a columnar OLAP store with time rollups: dashboards read hundreds of pre-aggregated rows. Raw events serve investigations through a separate, deliberately slow path.

Deduplicating by (user, ad) Heuristics

OccasionalFORMULA

Treating repeated clicks from the same user on the same ad as duplicates and dropping them.

Why: Without event identity, heuristics are all you have, and 'same user, same ad, same minute' feels like a duplicate.

WRONG: A user legitimately clicks a product ad twice while comparison shopping: dropped. A bot rotates devices and passes untouched. The heuristic under-counts honest engagement and over-trusts dishonest traffic: wrong in both directions at once.

RIGHT: Mint a click_id at capture: identity makes true duplicates (same physical click) exactly detectable. Repeat engagement is legitimate signal; abusive repetition is a FRAUD question for the judgment tiers, not a dedup rule.