Autocomplete Failure Modes

What breaks, how to detect it, and how to fix it. Every failure includes detection metrics, mitigations, and a severity rating.

Failure Modes

HIGH

Hot-prefix melt during a breaking event

A global event funnels millions of users into the same few prefixes within seconds. Edge TTLs expire mid-spike; thousands of simultaneous misses converge on the shard owning the hot prefix: the classic cache stampede aimed at one node.

Per-shard QPS divergence from fleet median; edge hit-ratio drop on specific keys; origin p99 breach localized to one shard.
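
The first signal, per-shard QPS divergence from the fleet median, can be sketched as below. This is a minimal illustration, not the production detector: the function name hot_shards and the 3x ratio threshold are assumptions chosen for the example.

```python
from statistics import median


def hot_shards(qps_by_shard: dict[str, float], ratio_threshold: float = 3.0) -> list[str]:
    """Return shards whose QPS is at least ratio_threshold times the fleet median.

    ratio_threshold is a hypothetical tuning knob, not a value from the design.
    """
    if not qps_by_shard:
        return []
    fleet_median = median(qps_by_shard.values())
    if fleet_median <= 0:
        return []
    return sorted(
        shard for shard, qps in qps_by_shard.items()
        if qps / fleet_median >= ratio_threshold
    )
```

The median is used rather than the mean so that the melting shard itself does not drag the baseline upward.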

Mitigation

Edge request coalescing: one origin fetch per (prefix, locale) key regardless of concurrent misses
Trending prefixes carry shorter TTLs by design, so each stampede window is small and its misses are coalesced
The in-process hot map on every node can serve any head prefix during shard duress (all nodes hold the head)
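
Request coalescing from the first mitigation can be sketched as a single-flight wrapper: the first miss for a (prefix, locale) key becomes the leader and fetches from origin, and every concurrent miss for the same key waits for that one result. The class names Coalescer and _Call and the fetch callback are hypothetical; a real edge tier would typically get this behavior from its proxy or CDN configuration rather than from application code.

```python
import threading
from typing import Callable


class _Call:
    """State shared by the leader and the waiters of one in-flight fetch."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result = None
        self.error = None


class Coalescer:
    def __init__(self, fetch: Callable[[str, str], list]) -> None:
        self._fetch = fetch              # hypothetical origin fetch(prefix, locale)
        self._lock = threading.Lock()
        self._inflight = {}              # (prefix, locale) -> _Call

    def get(self, prefix: str, locale: str) -> list:
        key = (prefix, locale)
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if leader:
            try:
                call.result = self._fetch(prefix, locale)
            except Exception as exc:     # waiters see the same failure
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.result
```

Misses that arrive after the leader finishes start a new fetch, so this bounds concurrency per key rather than acting as a cache; the TTL'd edge cache still sits in front of it.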

MEDIUM

Index build fails or produces a bad artifact

The aggregation job crashes, or worse, succeeds with corrupted output: a truncated shard, a scoring regression that ranks garbage first, a blocklist that silently failed to apply.

Build validation gates: checksum, size-delta bounds (a shard 40% smaller than yesterday is wrong), blocklist regression tests, sampled score sanity checks. Post-promotion: canary CTR guardrails.
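
Three of these gates (checksum, size-delta bounds, blocklist regression) could look roughly like the sketch below. The function names are hypothetical, the 40% shrink bound mirrors the example above, and the symmetric growth bound is an assumption added for illustration.

```python
import hashlib


def validate_shard(data: bytes, expected_sha256: str, previous_size: int,
                   max_shrink: float = 0.4, max_growth: float = 0.4) -> list[str]:
    """Return the names of the gates this shard artifact fails (empty list = pass)."""
    failures = []
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        failures.append("checksum")
    if previous_size > 0:
        # Relative size change versus the previous build's shard.
        delta = (len(data) - previous_size) / previous_size
        if delta <= -max_shrink or delta >= max_growth:
            failures.append("size-delta")
    return failures


def blocklist_leaks(suggestions: list[str], blocked_terms: list[str]) -> list[str]:
    """Return built suggestions that still contain a blocked term (should be empty)."""
    blocked = {term.casefold() for term in blocked_terms}
    return [s for s in suggestions if any(b in s.casefold() for b in blocked)]
```

Any non-empty result would block promotion, leaving the previous snapshot in service.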

Mitigation

Serve the previous snapshot indefinitely: stale suggestions beat broken ones; alert on index age past 2 cycles
Promotion is canaried (5%) with automatic rollback on CTR/error guardrail breach
Rollback is a pointer swap to the retained N-1 snapshot: seconds, not a rebuild
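
A sketch of why rollback takes seconds: serving reads a pointer to the current snapshot, and rollback only rewrites that pointer to the retained N-1 snapshot. The names SnapshotPointer and guardrail_breached, and the guardrail thresholds, are assumptions for illustration only.

```python
from typing import Optional


class SnapshotPointer:
    """Tracks the serving snapshot and the retained N-1 snapshot."""

    def __init__(self, current: str, previous: Optional[str] = None) -> None:
        self.current = current
        self.previous = previous

    def promote(self, new_snapshot: str) -> None:
        # The old serving snapshot becomes the retained fallback.
        self.previous, self.current = self.current, new_snapshot

    def rollback(self) -> None:
        if self.previous is None:
            raise RuntimeError("no retained snapshot to roll back to")
        # Pointer swap only: no index rebuild happens here.
        self.current = self.previous


def guardrail_breached(baseline_ctr: float, canary_ctr: float, canary_error_rate: float,
                       max_ctr_drop: float = 0.10, max_error_rate: float = 0.01) -> bool:
    """True if the canary's relative CTR drop or its error rate exceeds its bound."""
    ctr_drop = (baseline_ctr - canary_ctr) / baseline_ctr if baseline_ctr > 0 else 0.0
    return ctr_drop > max_ctr_drop or canary_error_rate > max_error_rate
```

In this sketch an automatic rollback is just guardrail_breached(...) returning True followed by rollback().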

HIGH

Trending overlay manipulation campaign

A coordinated botnet sustains velocity on a chosen query (spam, a scam domain, or targeted harassment), attempting to buy a global suggestion slot for the price of some traffic.

Source-concentration analysis on spiking queries (organic spikes are source-diverse); velocity pattern anomalies (bot ramps are too smooth); kill-switch hit rate and abuse reports.
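
One way to quantify source concentration is a Herfindahl-Hirschman index over the per-source share of a spiking query's events. This is a swapped-in, generic concentration measure rather than necessarily the one the system uses, and what counts as a "source" (an IP range, an ASN, an account) is left as an assumption.

```python
from collections import Counter


def source_concentration(sources: list[str]) -> float:
    """Herfindahl-Hirschman index of source shares.

    Returns 1.0 when a single source produced every event and 1/n when
    n sources contributed equally; higher means more concentrated.
    """
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((count / total) ** 2 for count in counts.values())
```

An organic spike spread over many sources scores near zero, while a campaign driven from a handful of sources scores much higher; the alerting cutoff would need to be tuned against real traffic.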

Mitigation

Per-source contribution caps make concentrated campaigns expensive by construction
The sanity delay, the velocity-vs-baseline check, and the 2-3 slot merge cap together bound what even a successful campaign can win
Kill switch removes a surfaced entry in seconds; overlay decay erases the residue within minutes
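
The per-source contribution cap can be sketched as follows: within one counting window, each source adds at most a fixed number of events to a query's velocity, so a campaign needs many distinct sources rather than a lot of traffic. The function name capped_velocity and the cap of 3 are hypothetical.

```python
from collections import Counter


def capped_velocity(events: list[tuple[str, str]], cap_per_source: int = 3) -> Counter:
    """Count events per query for one window, capping each source's contribution.

    events is a list of (source_id, query) pairs; cap_per_source is an
    illustrative value, not one taken from the design.
    """
    per_source = Counter()   # (source_id, query) -> events counted so far
    totals = Counter()       # query -> capped velocity
    for source_id, query in events:
        if per_source[(source_id, query)] < cap_per_source:
            per_source[(source_id, query)] += 1
            totals[query] += 1
    return totals
```

With this cap, one bot sending a query 1,000 times counts for less than four ordinary users each sending it once.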

CRITICAL

Offensive or defamatory suggestion surfaces

A filtering gap (a novel slur, an embedding trick, a person's name completing to an accusation) passes build-time gates, ranks on genuine frequency, and screenshots begin circulating.

Abuse-report spike on the suggest surface; per-prefix CTR anomaly (users recoil: impressions up, clicks down); social monitoring; the explain endpoint confirms source and build.

Mitigation

Serve-time kill switch: deny entry propagated fleet-wide in under 60 seconds, no rebuild required
Blocklist is updated and the next build is re-validated against a regression test containing the incident
Auditability (explain endpoint) answers "why did we say that" in minutes for comms and legal
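
The serve-time kill switch amounts to a deny set checked on every response, which is why no rebuild is needed. The sketch below shows only the per-node filter; fleet-wide propagation is out of scope, and the KillSwitch class and its normalization rule are assumptions.

```python
def _normalize(text: str) -> str:
    """Case-fold and collapse whitespace so trivial variants match."""
    return " ".join(text.casefold().split())


class KillSwitch:
    """Per-node deny set applied to suggestions at serve time."""

    def __init__(self) -> None:
        self._denied = set()

    def deny(self, suggestion: str) -> None:
        self._denied.add(_normalize(suggestion))

    def filter(self, suggestions: list[str]) -> list[str]:
        # The index is untouched; denied entries are dropped from the response.
        return [s for s in suggestions if _normalize(s) not in self._denied]
```

Because the index itself is unchanged, the entry would reappear if the deny set were lost, which is why the blocklist and the next build still have to catch up.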

MEDIUM

Shard loss (node or AZ)

A serving node dies or an AZ partition takes a slice of a replica set offline; the prefixes hashed to that shard lose capacity or, if all 3 replicas are lost, availability.

Health checks and per-shard replica count; rising tail latency for the affected key range.

Mitigation

3x replication absorbs single losses with zero user impact
Replacement nodes hydrate from the object-storage snapshot (durable copy) in minutes, with no peer streaming
Full-shard loss degrades gracefully: edge caches keep serving the head; origin returns empty-but-fast for the tail (never a spinner)
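
"Empty-but-fast" can be sketched as a hard deadline around the shard lookup: if the shard is down or slow, the caller gets an empty list immediately instead of waiting. The function name suggest_with_deadline, the lookup callback, and the 50 ms deadline are assumptions for illustration.

```python
import time
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable


def suggest_with_deadline(executor: Executor, lookup: Callable[[str], list],
                          prefix: str, deadline_s: float = 0.05) -> list:
    """Return lookup(prefix), or [] if it fails or misses the deadline."""
    future = executor.submit(lookup, prefix)
    try:
        return future.result(timeout=deadline_s)
    except Exception:
        # Covers both a timeout and a lookup error such as a dead shard.
        future.cancel()  # only has an effect if the task has not started yet
        return []


if __name__ == "__main__":
    def slow_lookup(prefix: str) -> list:
        time.sleep(0.5)  # simulates an unresponsive shard
        return [prefix + " results"]

    with ThreadPoolExecutor(max_workers=2) as pool:
        print(suggest_with_deadline(pool, slow_lookup, "zz"))
```

Head prefixes never reach this path during a full-shard loss, since the edge caches keep serving them; the deadline only shapes what the tail sees.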