Autocomplete Failure Modes

What breaks, how to detect it, and how to fix it. Every failure mode lists its detection signals, its mitigations, and a severity rating.

Failure Modes
Severity: HIGH

Hot-prefix melt during a breaking event

A global event funnels millions of users into the same few prefixes within seconds. Edge TTLs expire mid-spike; thousands of simultaneous misses converge on the shard owning the hot prefix: the classic cache stampede aimed at one node.

Detection
Per-shard QPS divergence from fleet median; edge hit-ratio drop on specific keys; origin p99 breach localized to one shard.
Mitigation
  1. Edge request coalescing: one origin fetch per (prefix, locale) key regardless of concurrent misses
  2. Trending prefixes carry shorter TTLs BY DESIGN, so the stampede window is small and every miss inside it is coalesced
  3. In-process hot map on every node can serve any prefix during shard duress (all nodes hold the head)
Severity: MEDIUM

Index build fails or produces a bad artifact

The aggregation job crashes, or worse, succeeds with corrupted output: a truncated shard, a scoring regression that ranks garbage first, a blocklist that silently failed to apply.

Detection
Build validation gates: checksum, size-delta bounds (a shard 40% smaller than yesterday is wrong), blocklist regression tests, sampled score sanity checks. Post-promotion: canary CTR guardrails.
Mitigation
  1. Serve the previous snapshot indefinitely: stale suggestions beat broken ones; alert on index age past 2 cycles
  2. Promotion is canaried (5%) with automatic rollback on CTR/error guardrail breach
  3. Rollback is a pointer swap to the retained N-1 snapshot: seconds, not a rebuild
Severity: HIGH

Trending overlay manipulation campaign

A coordinated botnet sustains velocity on a chosen query (spam, a scam domain, or targeted harassment), attempting to buy a global suggestion slot for the price of some traffic.

Detection
Source-concentration analysis on spiking queries (organic spikes are source-diverse); velocity pattern anomalies (bot ramps are too smooth); kill-switch hit rate and abuse reports.
Mitigation
  1. Per-source contribution caps make concentrated campaigns expensive by construction
  2. Sanity delay + velocity-vs-baseline + the 2-3 slot merge cap bound what a successful campaign can even win
  3. Kill switch removes a surfaced entry in seconds; overlay decay erases the residue within minutes
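The detection and the first two mitigations above combine into one admission check on a velocity window. The example below is added here and is not the page's own code: it assumes events are (query, source, time-bucket) tuples, where "source" stands for whatever bucket the per-source caps key on, and the thresholds and sample data are made up for illustration.

```python
import itertools
from collections import Counter
from statistics import mean, pstdev
N_BUCKETS = 6  # time buckets in one velocity window

def make_events(query, sources, per_bucket):
    """Constructed data: per_bucket[t] events for `query` in bucket t, sources round-robin."""
    src = itertools.cycle(sources)
    return [(query, next(src), t) for t, n in enumerate(per_bucket) for _ in range(n)]

# Made-up events: a jagged, source-diverse organic burst and a smooth
# two-source campaign of the same raw size (120 events each).
ORGANIC = make_events("quake downtown", [f"src{i}" for i in range(12)], [2, 5, 30, 50, 25, 8])
CAMPAIGN = make_events("buy-cheap-pills", ["bot-a", "bot-b"], [20] * N_BUCKETS)
EVENTS = ORGANIC + CAMPAIGN

def capped_velocity(events, query, cap):
    """Event count for `query` after each source is limited to `cap` contributions."""
    per_source = Counter(s for q, s, _ in events if q == query)
    return sum(min(n, cap) for n in per_source.values())

def top_source_share(events, query):
    """Fraction of the query's events that come from its single largest source."""
    per_source = Counter(s for q, s, _ in events if q == query)
    total = sum(per_source.values())
    return max(per_source.values()) / total if total else 0.0

def ramp_cv(events, query):
    """CV of per-bucket counts: organic bursts are jagged (high), scripted ramps evenly paced (near 0)."""
    counts = Counter(t for q, _, t in events if q == query)
    series = [counts.get(t, 0) for t in range(N_BUCKETS)]
    m = mean(series)
    return pstdev(series) / m if m else 0.0

def screen(events, cap=10, min_velocity=60, max_top_share=0.3, min_cv=0.25):
    """Per-query overlay decision with the metrics behind it (thresholds are illustrative)."""
    report = {}
    for q in {q for q, _, _ in events}:
        capped, share, cv = capped_velocity(events, q, cap), top_source_share(events, q), ramp_cv(events, q)
        reasons = []
        if capped < min_velocity:
            reasons.append("capped velocity below threshold")
        if share > max_top_share:
            reasons.append("source-concentrated")
        if cv < min_cv:
            reasons.append("ramp too smooth")
        report[q] = {"raw": sum(1 for e in events if e[0] == q), "capped": capped,
                     "top_source_share": round(share, 3), "ramp_cv": round(cv, 3),
                     "decision": "hold" if reasons else "eligible", "reasons": reasons}
    return report
```

Both queries arrive with 120 raw events; the campaign is cut to 20 by the per-source cap, and its concentration and flat ramp give the explain path a recorded reason for the hold.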
Severity: CRITICAL

Offensive or defamatory suggestion surfaces

A filtering gap (a novel slur, an embedding trick, a person's name completing to an accusation) passes the build-time gates, ranks on genuine frequency, and screenshots begin circulating.

Detection
Abuse-report spike on the suggest surface; per-prefix CTR anomaly (users recoil: impressions up, clicks down); social monitoring; the explain endpoint confirms source and build.
Mitigation
  1. Serve-time kill switch: deny entry propagated fleet-wide in under 60 seconds, no rebuild required
  2. Blocklist updated and the NEXT build re-validated against a regression test containing the incident
  3. Auditability (explain endpoint) answers "why did we say that" in minutes for comms and legal
Severity: MEDIUM

Shard loss (node or AZ)

A serving node dies or an AZ partition takes a replica set slice offline; the prefixes hashed to that shard lose capacity or, at 3-replica loss, availability.

Detection
Health checks and per-shard replica count; rising tail latency for the affected key range.
Mitigation
  1. 3x replication absorbs single losses with zero user impact
  2. Replacement nodes hydrate from the object-storage snapshot (durable copy) in minutes: no peer streaming
  3. Full-shard loss degrades gracefully: edge caches keep serving the head; origin returns empty-but-fast for the tail (never a spinner)