Dedup and Fraud: Not Every Click Is a Click

Raw click streams are dirty in two distinct ways, and conflating them is the classic mistake. Duplicates are honest noise: a double-fired browser event, a mobile retry after a timeout, or a stream replay, each of which counts the same physical click twice. The fix is identity. Every click gets a click_id minted at capture and stamped into the redirect (the impression that led to the click carries a signed token, so the click inherits a verifiable lineage). Capture-edge dedup catches the browser doubles within a short window (Redis SETNX on click_id, the course's oldest friend), and the batch path deduplicates exhaustively by identity, whereas the stream path relies on checkpoint atomicity. Fraud is dishonest signal: clicks that physically happened but should not be paid for, such as bots, click farms, competitors draining budgets, and publishers inflating their own revenue.
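The capture-edge dedup window described above can be sketched as a set-if-absent check with expiry. This is a minimal illustration, not the course's implementation: `DedupWindow`, `first_seen`, and the five-second TTL are hypothetical names and values, and the in-memory dict stands in for a Redis key written with set-if-not-exists semantics plus a TTL.

```python
import time


class DedupWindow:
    """In-memory stand-in for a set-if-absent store with expiry.

    Assumption: in production this role would be played by Redis
    (set-if-not-exists on click_id with a short TTL). The dict here
    only mimics that behaviour for illustration.
    """

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the window is testable
        self._expires_at = {}       # click_id -> expiry timestamp

    def first_seen(self, click_id):
        """Return True only for the first sighting inside the window."""
        now = self.clock()
        expiry = self._expires_at.get(click_id)
        if expiry is not None and expiry > now:
            return False            # duplicate: same click_id, window still open
        self._expires_at[click_id] = now + self.ttl
        return True


# Usage: a double-fired browser event carries the same click_id twice.
window = DedupWindow(ttl_seconds=5)
accepted = [cid for cid in ["c1", "c1", "c2"] if window.first_seen(cid)]
print(accepted)
```

The window only needs to be long enough to absorb browser doubles and quick retries; anything that slips past it is left to the exhaustive batch dedup by click_id.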

The judging happens in tiers matched to latency. Inline rules at capture reject the crudest cases instantly: impossible geometry, clicks with no matching impression token, and click rates beyond human capability from one device. Near-real-time models score streams within minutes and feed the budget lane (do not let a bot spend a real budget). Mature verdicts, which come from batch models with hours of context, cross-campaign patterns, and human review, land before billing runs. The architectural rule that keeps everything reconcilable: fraud marks, never deletes.
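As a rough sketch of the inline tier, the two cheapest rules (a missing or forged impression token, and an inhuman click rate from one device) might look like the following. Everything named here is an assumption for illustration: the HMAC-SHA256 token scheme, the `SECRET` key, the field names on the click record, and the thresholds of 5 clicks per 10 seconds.

```python
import hashlib
import hmac
from collections import defaultdict, deque

SECRET = b"demo-signing-key"      # hypothetical key; a real one lives in a secret store
MAX_CLICKS_PER_WINDOW = 5         # illustrative threshold, not a recommended value
WINDOW_SECONDS = 10.0


def sign_impression(impression_id):
    """Token stamped on the impression so a later click can prove its lineage."""
    return hmac.new(SECRET, impression_id.encode(), hashlib.sha256).hexdigest()


class InlineRules:
    """Cheap per-click checks that can run at capture time."""

    def __init__(self):
        self._recent = defaultdict(deque)   # device_id -> recent click timestamps

    def verdict(self, click):
        # Rule 1: the click must carry a token matching its impression.
        expected = sign_impression(click["impression_id"]).encode()
        presented = str(click.get("token", "")).encode()
        if not hmac.compare_digest(expected, presented):
            return "reject:no_valid_impression"

        # Rule 2: sliding-window rate check per device.
        recent = self._recent[click["device_id"]]
        now = click["ts"]
        while recent and now - recent[0] > WINDOW_SECONDS:
            recent.popleft()
        recent.append(now)
        if len(recent) > MAX_CLICKS_PER_WINDOW:
            return "reject:inhuman_rate"
        return "pass"
```

A "pass" here only means the click survived the crudest tier; the near-real-time and batch tiers can still mark it later with more context.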

“Fraud cannot be deduplicated away because each fraudulent click is unique: it must be judged.”

A fraudulent click stays in the raw log forever, flagged in a separate verdict stream keyed by click_id. The invoice excludes it, the evidence retains it, and when the advertiser asks why Tuesday's count dropped, the answer is a list of specific judged clicks, not a shrug.

What if the interviewer asks: what fraction is fraud?

Industry estimates run 10-30% of raw clicks depending on inventory quality, which means the fraud pipeline is not an edge-case filter; it is a first-class subtraction that moves invoices by double-digit percentages.
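The marks-never-deletes rule, and the subtraction it produces on the invoice, can be sketched as a raw log that is never mutated plus a verdict table keyed by click_id. The records, prices, and function names below are made up for illustration; the one flagged click out of five is chosen only so the subtraction lands at 20%, inside the range quoted above.

```python
# Immutable raw log: a tuple, so nothing in this sketch can delete a click.
raw_log = (
    {"click_id": "c1", "campaign": "spring", "cpc_cents": 40},
    {"click_id": "c2", "campaign": "spring", "cpc_cents": 40},
    {"click_id": "c3", "campaign": "spring", "cpc_cents": 40},
    {"click_id": "c4", "campaign": "spring", "cpc_cents": 40},
    {"click_id": "c5", "campaign": "spring", "cpc_cents": 40},
)

# Separate verdict stream, keyed by click_id (hypothetical schema).
verdicts = {}


def mark_fraud(click_id, tier, reason):
    """Record a verdict; the raw click itself is left untouched."""
    verdicts[click_id] = {"tier": tier, "reason": reason}


def invoice_cents(campaign):
    """Bill every raw click for the campaign that carries no fraud verdict."""
    return sum(
        c["cpc_cents"]
        for c in raw_log
        if c["campaign"] == campaign and c["click_id"] not in verdicts
    )


def excluded_clicks(campaign):
    """The answer to 'why did the count drop?': specific judged clicks."""
    return [
        (c["click_id"], verdicts[c["click_id"]]["reason"])
        for c in raw_log
        if c["campaign"] == campaign and c["click_id"] in verdicts
    ]


gross = sum(c["cpc_cents"] for c in raw_log)          # 200 cents before verdicts
mark_fraud("c2", tier="batch", reason="click_farm_pattern")
net = invoice_cents("spring")                          # 160 cents after verdicts
print(gross, net, excluded_clicks("spring"))
```

Because the verdict lives beside the log rather than inside it, a later reversal (for example after human review) is just another write to the verdict table, and the invoice can be recomputed from the same raw evidence.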

Why it matters in interviews

Duplicates need identity; fraud needs judgment. The distinction structures the whole cleaning pipeline, and the tiered-latency verdicts (inline, near-real-time, mature) map onto the count-twice architecture. Marks-never-deletes is the auditability habit that makes disputes answerable.
