Google Maps Failure Modes

What breaks, how to detect it, and how to fix it. Every failure includes detection metrics, mitigations, and a severity rating.

HIGH:

Tile CDN cache miss storm on map update rollout

Navigation becomes impossible without map rendering

Every 2 weeks, tiles are regenerated from updated satellite imagery and road data. If the CDN purges all tiles simultaneously across all POPs, 521K tile requests per second hit the origin, which is provisioned for only 10.4K/sec. The origin tile servers cannot absorb roughly 50x their provisioned capacity, so users see blank tiles or loading spinners instead of maps.

CDN cache hit ratio drops below 80% across multiple POPs. Origin tile server CPU exceeds 90%. Tile load latency p99 exceeds 500ms. Client telemetry shows map rendering failures.
Mitigation
  1. Staggered rollout: invalidate tiles POP by POP over 24 hours, not globally
  2. Request coalescing: concurrent requests for the same uncached tile at one POP trigger a single origin fetch
  3. Keep old tiles as fallback: serve stale tiles with a warning badge rather than blank tiles during cache warm-up
  4. Pre-warm critical tiles: push zoom levels 0 to 12 (the most commonly viewed) to all POPs before invalidating higher zoom levels
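Request coalescing (mitigation 2) can be sketched as follows. This is a minimal single-process illustration, not how any particular CDN implements it: the class name `CoalescingTileCache`, the `fetch_from_origin` callable, and the tile key format are all assumptions made for the example.

```python
import threading
from concurrent.futures import Future


class CoalescingTileCache:
    """Per-POP cache that collapses concurrent misses for one tile into one origin fetch."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin  # hypothetical callable: tile_key -> bytes
        self._cache = {}                 # tile_key -> tile bytes
        self._in_flight = {}             # tile_key -> Future shared by all waiters
        self._lock = threading.Lock()

    def get(self, tile_key):
        with self._lock:
            if tile_key in self._cache:
                return self._cache[tile_key]
            future = self._in_flight.get(tile_key)
            is_leader = future is None
            if is_leader:
                # First request for this tile: register a Future for followers.
                future = Future()
                self._in_flight[tile_key] = future

        if not is_leader:
            # Followers block until the leader's origin fetch completes.
            return future.result()

        try:
            tile = self._fetch(tile_key)
        except Exception as exc:
            with self._lock:
                del self._in_flight[tile_key]
            future.set_exception(exc)  # propagate the failure to followers
            raise

        with self._lock:
            self._cache[tile_key] = tile
            del self._in_flight[tile_key]
        future.set_result(tile)
        return tile
```

With this shape, N simultaneous requests for one uncached tile cost the origin one render instead of N, which is what keeps a purge from multiplying origin load by the number of waiting clients.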
HIGH:

CH graph stale after road closure

Safety risk and significant time waste for affected users

A highway closes for emergency construction. The contraction hierarchies (CH) graph still contains this road with its original travel time weight, so routing sends drivers onto the closed road. Drivers encounter the closure and must U-turn, adding 15 to 30 minutes to their trip. Without a live override, the graph does not reflect the closure for up to 24 hours (until the next graph re-contraction).

Traffic service detects zero GPS probes on a normally busy segment for 10+ minutes. Waze users report the road closure. Re-route requests spike from drivers who encounter the closure.
Mitigation
  1. Set the closed road's edge weight to infinity in the Redis overlay within 60 seconds of detection
  2. Automatic closure detection: if a segment with >100 historical probes/hour drops to zero probes for 10 minutes, flag for review
  3. Waze integration: crowd-sourced road closure reports trigger automatic edge weight override
  4. Nightly full re-contraction to incorporate permanent road changes into the CH structure
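The detection rule and the overlay override above can be sketched as follows. A plain dict stands in for the Redis overlay, and the function names and the edge ID are hypothetical. How the routing query consults the overlay while traversing CH shortcuts is out of scope for this sketch.

```python
import math

PROBE_BASELINE_PER_HOUR = 100  # threshold from mitigation 2
SILENCE_SECONDS = 10 * 60      # 10 minutes with zero probes


def is_suspected_closure(historical_probes_per_hour, seconds_since_last_probe):
    """Flag a normally busy segment that has gone silent for review."""
    return (historical_probes_per_hour > PROBE_BASELINE_PER_HOUR
            and seconds_since_last_probe >= SILENCE_SECONDS)


def mark_closed(edge_id, overlay):
    """Write an infinite weight into the overlay (a dict standing in for Redis)."""
    overlay[edge_id] = math.inf


def effective_weight(edge_id, base_weights, overlay):
    """A live overlay entry wins over the static weight baked into the graph."""
    return overlay.get(edge_id, base_weights[edge_id])


# Usage: a busy segment goes quiet, is flagged, then overridden.
base_weights = {"I-80:seg42": 95.0}  # hypothetical edge, travel time in seconds
overlay = {}
if is_suspected_closure(450, 600):
    mark_closed("I-80:seg42", overlay)
```

Keeping the override in a separate overlay means the closure takes effect without waiting for the nightly re-contraction, and removing the key restores the original weight.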
MEDIUM:

GPS probe pipeline lag causes stale traffic

Routes may not avoid new congestion. No safety risk, but it degrades user trust

Kafka consumer lag exceeds 2 minutes due to a slow processing node or partition rebalance. At 4M probes/sec, a 2-minute lag means 480 million unprocessed probes. Traffic segment speeds in Redis become stale, and the routing service uses outdated edge weights. ETAs become inaccurate, and routes may not avoid newly congested areas.

Kafka consumer lag monitoring alerts at 90 seconds. Redis traffic segment timestamps show freshness exceeding 2 minutes. ETA accuracy metric drops below 95% for recently started navigations.
Mitigation
  1. Auto-scale Kafka consumer group: add consumers when lag exceeds 60 seconds
  2. Temporarily increase partition count for the GPS probe topic during lag events
  3. Fall back to historical speed averages for segments whose Redis data is older than 3 minutes
  4. Page the traffic pipeline on-call at 3 minutes of lag (ETA accuracy is degrading visibly)
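The backlog arithmetic and the lag thresholds above can be checked with a short sketch. The constants come from the text; the function names and action labels are hypothetical, and a real deployment would read lag from Kafka consumer metrics rather than take it as an argument.

```python
PROBES_PER_SEC = 4_000_000
SCALE_OUT_LAG_S = 60   # add consumers when lag exceeds this
ALERT_LAG_S = 90       # monitoring alert threshold
PAGE_LAG_S = 180       # page the on-call at 3 minutes
STALE_AFTER_S = 180    # live segment data older than this is not trusted


def backlog(lag_seconds, probes_per_sec=PROBES_PER_SEC):
    """Unprocessed probes implied by a given consumer lag."""
    return lag_seconds * probes_per_sec


def lag_actions(lag_seconds):
    """Escalation steps that apply at a given lag, in order of severity."""
    actions = []
    if lag_seconds > SCALE_OUT_LAG_S:
        actions.append("scale_out_consumers")
    if lag_seconds >= ALERT_LAG_S:
        actions.append("alert")
    if lag_seconds >= PAGE_LAG_S:
        actions.append("page_on_call")
    return actions


def segment_speed(live_speed_mph, live_age_s, historical_speed_mph):
    """Prefer live speed; fall back to the historical average when it is stale or missing."""
    if live_speed_mph is None or live_age_s > STALE_AFTER_S:
        return historical_speed_mph
    return live_speed_mph


print(backlog(120))  # 480000000, the 2-minute backlog from the scenario
```

The fallback in `segment_speed` is applied per segment, so routing degrades to historical averages only where the live data is actually stale.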
MEDIUM:

ETA overshoot during unexpected congestion

No routing error (the route was correct when computed). User sees updated ETA within 90 seconds

A major accident causes traffic to stop instantly on a highway. The 60-second aggregation window means the system takes up to 60 seconds to reflect the sudden speed drop. During navigation, the client re-queries ETA every 30 seconds. In the worst case, a driver sees the correct ETA only after 90 seconds (60-second pipeline delay + 30-second client re-query interval). During those 90 seconds, the displayed ETA could be off by 15 to 30 minutes for a severely congested route.

Sudden speed drop on a segment from 65 mph to 5 mph in a single aggregation window. High variance between consecutive ETA estimates for active navigations on the affected route.
Mitigation
  1. Reduce aggregation window to 30 seconds for highway segments (higher probe density makes shorter windows viable)
  2. Reduce client ETA re-query interval from 30 to 15 seconds during active navigation
  3. Accept this as a fundamental trade-off: 60-second aggregation smooths noise but delays reflecting sudden events
  4. Display a 'traffic conditions changing' indicator when high variance in segment speed is detected on the route
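The worst-case delay is the sum of the aggregation window and the client re-query interval, so the effect of mitigations 1 and 2 can be worked out directly. In the sketch below, `is_sudden_drop` and its 0.5 threshold are assumptions for illustration; the source gives only the 65 mph to 5 mph example, not a trigger value.

```python
def worst_case_eta_staleness_s(aggregation_window_s, client_requery_s):
    """Longest time a driver can see a stale ETA after a sudden slowdown."""
    return aggregation_window_s + client_requery_s


def is_sudden_drop(previous_mph, current_mph, max_remaining_fraction=0.5):
    """Hypothetical detector: speed fell below a fraction of the previous window's speed."""
    if previous_mph <= 0:
        return False
    return current_mph / previous_mph < max_remaining_fraction


print(worst_case_eta_staleness_s(60, 30))  # 90 seconds with current settings
print(worst_case_eta_staleness_s(30, 15))  # 45 seconds with mitigations 1 and 2
```

Applying both mitigations on highway segments halves the worst case from 90 to 45 seconds, at the cost of noisier speed estimates and twice as many client ETA queries.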
LOW:

Geocoding ambiguity returns wrong location

No data loss or system outage. Mitigated by disambiguation UI and proximity ranking

A user searches for 'Springfield' and the geocoder returns Springfield, Missouri instead of Springfield, Illinois (there are 35 cities named Springfield in the US). The user starts navigation to the wrong city 300 miles away. They realize the error after driving for 10 minutes, wasting time and fuel.

High rate of navigation cancellations within 5 minutes of start for searches matching ambiguous place names. User feedback reports for incorrect geocoding results.
Mitigation
  1. Disambiguate using the user's current location: rank the Springfield closest to the user highest
  2. Use search history: if the user recently searched for businesses in Springfield IL, prefer that Springfield
  3. Population-based ranking: Springfield IL (population 115K) ranks above Springfield OR (population 62K) as a tiebreaker
  4. Show disambiguation UI: if confidence is below 80%, display 'Did you mean...' with the top 3 candidates instead of auto-selecting
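Proximity ranking combined with the 80% confidence gate can be sketched as follows. The confidence formula (inverse-distance weights normalized across candidates) is an assumption chosen for illustration, not the geocoder's actual scoring; the `Candidate` type, the `resolve` function, and the approximate coordinates are likewise hypothetical. Search history and population signals are omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def resolve(candidates, user_lat, user_lon, threshold=0.80, top_n=3):
    """Auto-select the nearest match only when its share of the total weight is high."""
    if not candidates:
        return ("no_match", [])
    # Assumed scoring: closer candidates get larger weights.
    weighted = [(c, 1.0 / (1.0 + haversine_km(user_lat, user_lon, c.lat, c.lon)))
                for c in candidates]
    weighted.sort(key=lambda pair: pair[1], reverse=True)
    total = sum(w for _, w in weighted)
    best, best_weight = weighted[0]
    if best_weight / total >= threshold:
        return ("auto_select", [best.name])
    return ("disambiguate", [c.name for c, _ in weighted[:top_n]])


springfields = [
    Candidate("Springfield, IL", 39.80, -89.64),
    Candidate("Springfield, MO", 37.21, -93.29),
    Candidate("Springfield, OR", 44.05, -123.02),
]
# A user a few kilometers from Springfield, IL gets it auto-selected;
# a user in Denver is roughly equidistant in relative terms and gets the picker.
print(resolve(springfields, 39.78, -89.65))
print(resolve(springfields, 39.74, -104.99))
```

The gate matters more than the exact scoring: whenever no candidate clearly dominates, the user picks from the top 3 rather than being routed to a guess.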