Autocomplete Failure Modes
What breaks, how to detect it, and how to fix it. Every failure includes detection metrics, mitigations, and severity rating.
Hot-prefix melt during a breaking event
A global event funnels millions of users into the same few prefixes within seconds. Edge TTLs expire mid-spike; thousands of simultaneous misses converge on the shard owning the hot prefix: the classic cache stampede aimed at one node.
- Edge request coalescing: one origin fetch per (prefix, locale) key regardless of concurrent misses
- Trending prefixes carry shorter TTLs BY DESIGN, so the stampede window is small and coalesced
- In-process hot map on every node can serve any prefix during shard duress (all nodes hold the head)
Index build fails or produces a bad artifact
The aggregation job crashes, or worse, succeeds with corrupted output: a truncated shard, a scoring regression that ranks garbage first, a blocklist that silently failed to apply.
- Serve the previous snapshot indefinitely: stale suggestions beat broken ones; alert on index age past 2 cycles
- Promotion is canaried (5%) with automatic rollback on CTR/error guardrail breach
- Rollback is a pointer swap to the retained N-1 snapshot: seconds, not a rebuild
Trending overlay manipulation campaign
A coordinated botnet sustains velocity on a chosen query (spam, a scam domain, or targeted harassment), attempting to buy a global suggestion slot for the price of some traffic.
- Per-source contribution caps make concentrated campaigns expensive by construction
- Sanity delay + velocity-vs-baseline + the 2-3 slot merge cap bound what a successful campaign can even win
- Kill switch removes a surfaced entry in seconds; overlay decay erases the residue within minutes
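The overlay defenses listed above, sketched as an added example that is not the page's own code: per-source contribution caps, a sanity delay, a velocity-vs-baseline test, a slot cap on what can surface, and a serve-time kill switch. The cap, multiplier, delay and distinct-source thresholds are assumed values chosen for illustration.

```python
from collections import defaultdict


class TrendingOverlay:
    """Velocity-based trending overlay with the manipulation controls described
    above. Thresholds are assumed illustrative values, not the page's numbers."""

    def __init__(self, per_source_cap=3, velocity_multiplier=5.0,
                 min_distinct_sources=20, sanity_delay_s=300, max_slots=2):
        self.per_source_cap = per_source_cap
        self.velocity_multiplier = velocity_multiplier
        self.min_distinct_sources = min_distinct_sources
        self.sanity_delay_s = sanity_delay_s
        self.max_slots = max_slots
        self.baseline = {}                                   # query -> expected events per window
        self.counts = defaultdict(lambda: defaultdict(int))  # query -> source -> accepted events
        self.first_seen = {}                                 # query -> timestamp of first event
        self.killed = set()                                  # serve-time kill switch entries

    def observe(self, query, source_id, ts):
        """Record one query event; returns False when the source is over its cap."""
        self.first_seen.setdefault(query, ts)
        per_source = self.counts[query]
        if per_source[source_id] >= self.per_source_cap:
            return False                  # excess from one source is discarded, not counted
        per_source[source_id] += 1
        return True

    def kill(self, query):
        """Remove a surfaced entry immediately; no rebuild of the overlay needed."""
        self.killed.add(query)

    def surfaced(self, now):
        """Return at most max_slots queries eligible for the overlay at time `now`."""
        ranked = []
        for query, per_source in self.counts.items():
            if query in self.killed:
                continue
            if now - self.first_seen[query] < self.sanity_delay_s:
                continue                                     # sanity delay not yet elapsed
            velocity = sum(per_source.values())
            baseline = max(self.baseline.get(query, 0.0), 1.0)
            if velocity < self.velocity_multiplier * baseline:
                continue                                     # not trending relative to its own baseline
            if len(per_source) < self.min_distinct_sources:
                continue                                     # too concentrated to be organic
            ranked.append((velocity / baseline, query))
        ranked.sort(reverse=True)
        return [q for _, q in ranked[:self.max_slots]]
```

A campaign from a handful of sources saturates its caps almost immediately and never reaches the distinct-source threshold, while a genuinely broad event clears every gate once the sanity delay has passed.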
Offensive or defamatory suggestion surfaces
A suggestion slips through a filtering gap (a novel slur, an embedding trick, a person's name completing to an accusation): it passes build-time gates, ranks on genuine frequency, and screenshots begin circulating.
- Serve-time kill switch: deny entry propagated fleet-wide in under 60 seconds, no rebuild required
- Blocklist updated and the NEXT build re-validated against a regression test containing the incident
- Auditability (explain endpoint) answers "why did we say that" in minutes for comms and legal
Shard loss (node or AZ)
A serving node dies or an AZ partition takes a replica set slice offline; the prefixes hashed to that shard lose capacity or, at 3-replica loss, availability.
- 3x replication absorbs single losses with zero user impact
- Replacement nodes hydrate from the object-storage snapshot (durable copy) in minutes: no peer streaming
- Full-shard loss degrades gracefully: edge caches keep serving the head; origin returns empty-but-fast for the tail (never a spinner)