Click Aggregator Failure Modes
What breaks, how to detect it, and how to fix it. Every failure includes detection metrics, mitigations, and severity rating.
Stream job crash and replay
A Flink task manager dies mid-window. The job restarts from the last checkpoint: state and offsets rewind together: and replays the last ~10 seconds of events into counters that must not double-count.
- Checkpoint-atomic state+offsets make replay idempotent for counter state (the exactly-once chain, links 1-2)
- Idempotent sink (upsert by (ad, window)) makes re-emission harmless (link 3)
- Billing is derived from the raw log regardless: stream imperfection moves dashboards, not invoices
Viral ad melts the hot key
A Super Bowl spot drives 100K clicks/sec into one ad_id. Single-stage keying routes it all to one worker; backpressure cascades and the whole pipeline lags during its most-watched minutes.
- Two-stage aggregation: (ad_id, salt 0-15) partials at stage one; stage two merges <=16 msg/sec per ad regardless of volume
- Adaptive salting where implemented: only detected-hot keys pay the second hop
- Budget lane consumes stage-two output, so spend control keeps its 1-2s freshness through the spike
Budget overspend during the detection lag
A 2. Even a 2-second lane leaves ~$2,000 of over-delivery; a degraded lane (10s+) turns a budget into a rounding error on the losses.
- Predictive pacing: probabilistic throttling begins when velocity says exhaustion falls within the lag horizon
- Fail closed: budget lane unreachable -> high-spend campaigns pause (under-delivery costs goodwill, over-delivery costs cash)
- Published overage tolerance; platform absorbs beyond-cap: an SLO priced and owned by finance
Late-event burst misfiles windows
A mobile carrier hiccup or SDK bug delays millions of clicks by 10+ minutes. Event-time windows have fired; the burst arrives as a wall of stragglers.
- 15-minute allowed lateness absorbs the common tail: corrections upsert amended aggregates, dashboards self-heal
- Beyond-lateness events side-output for investigation: and remain fully counted by the batch path (no deadline)
- Watermark tolerance is tunable per source: known-laggy inventory gets looser bounds
Reconciliation divergence past threshold
The nightly batch disagrees with stream totals by >0.1% for some campaigns: a dedup gap, a sink duplication, a fraud-verdict timing skew, or a genuine stream bug.
- Billing always ships batch numbers: divergence delays confidence, never corrupts invoices
- Explained categories (late events, fraud timing) auto-annotate; unexplained residue opens an incident with the campaign-day evidence set
- Replay tooling recomputes any window from the raw log to localize which path lied