Failure Modes

Click Aggregator Failure Modes

What breaks, how to detect it, and how to fix it. Every failure includes detection metrics, mitigations, and a severity rating.

MEDIUM

Stream job crash and replay

A Flink task manager dies mid-window. The job restarts from the last checkpoint (state and offsets rewind together) and replays the last ~10 seconds of events into counters that must not double-count.

Checkpoint failure/duration alerts; watermark lag spike during recovery; reconciliation catches anything the mechanisms missed.
Mitigation
  1. Checkpoint-atomic state+offsets make replay idempotent for counter state (the exactly-once chain, links 1-2)
  2. Idempotent sink (upsert by (ad, window)) makes re-emission harmless (link 3)
  3. Billing is derived from the raw log regardless: stream imperfection moves dashboards, not invoices
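
A minimal sketch of why the upsert sink in mitigation 2 makes re-emission harmless. The UpsertSink class and its in-memory dict are hypothetical stand-ins for the real aggregate store, which this page does not specify; the point is only that the write overwrites by (ad, window) rather than incrementing.

```python
from typing import Dict, Tuple

# Hypothetical key shape: (ad_id, window_start_epoch_seconds).
Key = Tuple[str, int]


class UpsertSink:
    """In-memory stand-in for an aggregate store keyed by (ad, window)."""

    def __init__(self) -> None:
        self.rows: Dict[Key, int] = {}

    def emit(self, ad_id: str, window_start: int, count: int) -> None:
        # Overwrite, never increment: emitting the same window result twice
        # leaves the stored value unchanged.
        self.rows[(ad_id, window_start)] = count


sink = UpsertSink()
sink.emit("ad-1", 1_700_000_000, 42)
sink.emit("ad-1", 1_700_000_000, 42)  # same result re-emitted after recovery
```

An incrementing sink would store 84 here; the upsert stores 42 no matter how many times the window is re-emitted.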
HIGH

Viral ad melts the hot key

A Super Bowl spot drives 100K clicks/sec into one ad_id. Single-stage keying routes it all to one worker; backpressure cascades and the whole pipeline lags during its most-watched minutes.

Per-key load divergence at stage two; backpressure and checkpoint-timeout alarms; watermark lag on the affected partition.
Mitigation
  1. Two-stage aggregation: (ad_id, salt 0-15) partials at stage one; stage two merges <=16 msg/sec per ad regardless of volume
  2. Adaptive salting where implemented: only detected-hot keys pay the second hop
  3. Budget lane consumes stage-two output, so spend control keeps its 1-2s freshness through the spike
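
The two-stage aggregation in mitigation 1 can be sketched as below. This is an illustration under stated assumptions, not the pipeline's actual code: stage_one and stage_two are hypothetical names, the salt is drawn at random, and a single call stands in for one flush interval. If stage one flushes once per second, stage two sees at most 16 partials per second for any ad.

```python
import random
from collections import Counter
from typing import Iterable

SALT_BUCKETS = 16  # salt values 0-15, as in mitigation 1


def stage_one(ad_ids: Iterable[str], rng: random.Random) -> Counter:
    """Partial counts keyed by (ad_id, salt); a hot ad spreads over 16 keys."""
    partials: Counter = Counter()
    for ad_id in ad_ids:
        salt = rng.randrange(SALT_BUCKETS)
        partials[(ad_id, salt)] += 1
    return partials


def stage_two(partials: Counter) -> Counter:
    """Merge the salted partials back into one total per ad_id."""
    totals: Counter = Counter()
    for (ad_id, _salt), count in partials.items():
        totals[ad_id] += count
    return totals


clicks = ["superbowl"] * 100_000 + ["quiet"] * 10
partials = stage_one(clicks, random.Random(0))
totals = stage_two(partials)
hot_partials = sum(1 for ad_id, _salt in partials if ad_id == "superbowl")
```

However many clicks the hot ad receives, hot_partials never exceeds SALT_BUCKETS, which is what keeps the stage-two merge cheap.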
CRITICAL

Budget overspend during the detection lag

A $1,000 campaign goes viral at 500 clicks/sec x $2. Even a 2-second lane leaves ~$2,000 of over-delivery; a degraded lane (10s+) turns a budget into a rounding error on the losses.

Budget-lane staleness x spend velocity dashboard (dollars at risk, live); campaign spend crossing cap in stream aggregates; finance overage line item trending.
Mitigation
  1. Predictive pacing: probabilistic throttling begins when velocity says exhaustion falls within the lag horizon
  2. Fail closed: budget lane unreachable -> high-spend campaigns pause (under-delivery costs goodwill, over-delivery costs cash)
  3. Published overage tolerance; the platform absorbs spend beyond the cap, an SLO priced and owned by finance
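
The dollars-at-risk arithmetic and the pacing trigger in mitigation 1 can be sketched as follows. The function names and the specific throttle rule are assumptions made for illustration: the text says only that probabilistic throttling begins once exhaustion falls within the lag horizon, so the drop fraction used here (scale projected spend down to the remaining budget) is one plausible choice, not the documented one.

```python
def dollars_at_risk(clicks_per_sec: float, cost_per_click: float, lag_sec: float) -> float:
    """Spend that can land before the budget lane reacts."""
    return clicks_per_sec * cost_per_click * lag_sec


def throttle_probability(remaining_budget: float, spend_per_sec: float, lag_sec: float) -> float:
    """Hypothetical rule: fraction of clicks to drop so that projected spend
    over the lag horizon fits inside the remaining budget."""
    if remaining_budget <= 0:
        return 1.0  # budget already exhausted: drop everything
    projected = spend_per_sec * lag_sec
    if projected <= remaining_budget:
        return 0.0  # exhaustion is beyond the lag horizon: no throttling yet
    return 1.0 - remaining_budget / projected


# Scenario from the text: 500 clicks/sec at $2 with a 2-second lane.
healthy_lane = dollars_at_risk(500, 2.0, 2.0)    # 2000.0
degraded_lane = dollars_at_risk(500, 2.0, 10.0)  # 10000.0
drop_fraction = throttle_probability(1000.0, 1000.0, 2.0)  # 0.5
```

With $1,000 remaining and $1,000/sec of spend, a 2-second horizon projects $2,000, so this rule drops half the clicks until velocity falls.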
MEDIUM

Late-event burst misfiles windows

A mobile carrier hiccup or SDK bug delays millions of clicks by 10+ minutes. Event-time windows have fired; the burst arrives as a wall of stragglers.

Correction-rate spike; watermark lag vs wall clock; side-output volume alarm; per-source event-time skew profiling.
Mitigation
  1. 15-minute allowed lateness absorbs the common tail: corrections upsert amended aggregates, dashboards self-heal
  2. Beyond-lateness events go to a side output for investigation and remain fully counted by the batch path (no deadline)
  3. Watermark tolerance is tunable per source: known-laggy inventory gets looser bounds
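
A rough sketch of how an arriving event is routed under the 15-minute allowed lateness. Route and route_event are hypothetical names, and the boundary handling is simplified relative to what a real stream engine does; the per-source tuning in mitigation 3 corresponds to passing a different allowed_lateness.

```python
from enum import Enum

ALLOWED_LATENESS_SEC = 15 * 60  # 15-minute allowed lateness from mitigation 1


class Route(Enum):
    ON_TIME = "on_time"          # window has not fired yet
    CORRECTION = "correction"    # window fired; amended aggregate is upserted
    SIDE_OUTPUT = "side_output"  # too late for the stream; batch still counts it


def route_event(window_end: int, watermark: int,
                allowed_lateness: int = ALLOWED_LATENESS_SEC) -> Route:
    """Decide where an event goes, given the end of its event-time window
    and the current watermark (both in epoch seconds)."""
    if watermark < window_end:
        return Route.ON_TIME
    if watermark < window_end + allowed_lateness:
        return Route.CORRECTION
    return Route.SIDE_OUTPUT


on_time = route_event(window_end=600, watermark=500)
straggler = route_event(window_end=600, watermark=900)
too_late = route_event(window_end=600, watermark=600 + ALLOWED_LATENESS_SEC)
```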
HIGH

Reconciliation divergence past threshold

The nightly batch disagrees with stream totals by >0.1% for some campaigns: a dedup gap, a sink duplication, a fraud-verdict timing skew, or a genuine stream bug.

The reconciliation job itself: per-campaign divergence, categorized (late events vs fraud timing vs unexplained); unexplained residue pages.
Mitigation
  1. Billing always ships batch numbers: divergence delays confidence, never corrupts invoices
  2. Explained categories (late events, fraud timing) auto-annotate; unexplained residue opens an incident with the campaign-day evidence set
  3. Replay tooling recomputes any window from the raw log to localize which path lied
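
The threshold check at the heart of the reconciliation job might look like the sketch below. The function name and the dict inputs are assumptions; the text specifies only the 0.1% per-campaign threshold and that batch totals are the billing reference. Categorizing the divergence (late events, fraud timing, unexplained) is out of scope here.

```python
from typing import Dict

DIVERGENCE_THRESHOLD = 0.001  # 0.1%, per campaign


def divergent_campaigns(batch: Dict[str, int], stream: Dict[str, int],
                        threshold: float = DIVERGENCE_THRESHOLD) -> Dict[str, float]:
    """Return campaigns whose stream total differs from the batch total
    by more than `threshold`, measured relative to the batch total."""
    flagged: Dict[str, float] = {}
    for campaign in sorted(set(batch) | set(stream)):
        batch_total = batch.get(campaign, 0)
        stream_total = stream.get(campaign, 0)
        if batch_total == 0:
            # No batch clicks: any stream clicks count as unbounded divergence.
            divergence = 0.0 if stream_total == 0 else float("inf")
        else:
            divergence = abs(stream_total - batch_total) / batch_total
        if divergence > threshold:
            flagged[campaign] = divergence
    return flagged


batch_totals = {"camp-a": 100_000, "camp-b": 100_000}
stream_totals = {"camp-a": 100_050, "camp-b": 100_200}
flagged = divergent_campaigns(batch_totals, stream_totals)
```

Here camp-a is off by 0.05% and passes, while camp-b is off by 0.2% and is flagged for categorization.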