Google Maps Failure Modes
What breaks, how to detect it, and how to fix it. Each failure lists detection metrics, mitigations, and a severity rating.
Tile CDN cache miss storm on map update rollout
Navigation becomes impossible without map rendering
Every 2 weeks, tiles are regenerated from updated satellite imagery and road data. If the CDN purges all tiles simultaneously across all POPs, 521K tile requests per second hit the origin, which is provisioned for only 10.4K/sec. The origin tile servers cannot absorb roughly 50x their provisioned capacity (521K / 10.4K ≈ 50), so users see blank tiles or loading spinners instead of maps.
- Staggered rollout: invalidate tiles POP by POP over 24 hours, not globally
- Request coalescing: concurrent requests for the same uncached tile at one POP trigger a single origin fetch
- Keep old tiles as fallback: serve stale tiles with a warning badge rather than blank tiles during cache warm-up
- Pre-warm critical tiles: push zoom levels 0 to 12 (the most commonly viewed) to all POPs before invalidating higher zoom levels
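The request-coalescing mitigation can be sketched as follows. This is a minimal, single-process illustration, not the CDN's actual mechanism: the class name `CoalescingTileCache` and the `fetch_from_origin` callable are hypothetical, and a real POP would do this inside its caching proxy rather than in application code.

```python
import threading
from concurrent.futures import Future


class CoalescingTileCache:
    """Per-POP cache where concurrent misses for one tile share a single origin fetch."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin  # hypothetical callable: tile_key -> bytes
        self._cache = {}                 # tile_key -> tile bytes
        self._in_flight = {}             # tile_key -> Future of the pending origin fetch
        self._lock = threading.Lock()

    def get(self, tile_key):
        with self._lock:
            if tile_key in self._cache:
                return self._cache[tile_key]
            future = self._in_flight.get(tile_key)
            leader = future is None
            if leader:
                # First requester for this tile: register the pending fetch.
                future = Future()
                self._in_flight[tile_key] = future

        if not leader:
            # Followers block until the leader's fetch finishes (or fails).
            return future.result()

        try:
            tile = self._fetch(tile_key)
        except Exception as exc:
            with self._lock:
                del self._in_flight[tile_key]
            future.set_exception(exc)  # propagate the failure to waiting followers
            raise
        with self._lock:
            self._cache[tile_key] = tile
            del self._in_flight[tile_key]
        future.set_result(tile)
        return tile
```

With this shape, a burst of identical misses at one POP costs the origin one render per tile rather than one per request, which is what keeps a purge from multiplying directly into origin load.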
CH graph stale after road closure
Safety risk and significant time waste for affected users
A highway closes for emergency construction. The contraction hierarchies (CH) graph still contains this road with its original travel time weight, so routing sends drivers onto the closed road. Drivers encounter the closure and must U-turn, adding 15 to 30 minutes to their trip. Without a live override, the closure is not reflected in routing for up to 24 hours (until the next graph re-contraction).
- Set the closed road's edge weight to infinity in the Redis overlay within 60 seconds of detection
- Automatic closure detection: if a segment with >100 historical probes/hour drops to zero probes for 10 minutes, flag for review
- Waze integration: crowd-sourced road closure reports trigger automatic edge weight override
- Nightly full re-contraction to incorporate permanent road changes into the CH structure
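The detection rule and the overlay override above can be sketched as below. The thresholds come from the text; everything else is an assumption for illustration: the function names are hypothetical, timestamps are plain epoch seconds, and a dict stands in for the Redis overlay. How shortcut edges that contain the closed segment are handled is not described in the text and is left out here.

```python
import math

HISTORICAL_PROBES_PER_HOUR_MIN = 100  # from the text: only normally busy segments qualify
SILENCE_WINDOW_SECONDS = 10 * 60      # from the text: zero probes for 10 minutes


def is_suspected_closure(historical_probes_per_hour, last_probe_ts, now_ts):
    """Return True when a normally busy segment has been silent for the full window."""
    busy = historical_probes_per_hour > HISTORICAL_PROBES_PER_HOUR_MIN
    silent = (now_ts - last_probe_ts) >= SILENCE_WINDOW_SECONDS
    return busy and silent


def mark_closed(segment_id, overlay):
    """Make the edge unroutable without waiting for the nightly re-contraction."""
    overlay[segment_id] = math.inf


def effective_weight(segment_id, base_weights, overlay):
    """Overlay entries (a dict standing in for the Redis overlay) win over base weights."""
    return overlay.get(segment_id, base_weights[segment_id])
```

Per the text, a positive detection is flagged for review rather than applied automatically, so `mark_closed` would be called only after that review or after a trusted crowd-sourced report.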
GPS probe pipeline lag causes stale traffic
Routes may not avoid new congestion. No safety risk, but user trust degrades
Kafka consumer lag exceeds 2 minutes due to a slow processing node or partition rebalance. At 4M probes/sec, a 2-minute lag means 480 million unprocessed probes. Traffic segment speeds in Redis become stale, and the routing service uses outdated edge weights. ETAs become inaccurate, and routes may not avoid newly congested areas.
- Auto-scale Kafka consumer group: add consumers when lag exceeds 60 seconds
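- Increase the partition count for the GPS probe topic if lag events recur (Kafka partition counts can be raised but not lowered, so treat this as a permanent capacity change rather than a temporary one)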
- Fall back to historical speed averages for segments whose Redis data is older than 3 minutes
- Page the traffic pipeline on-call at 3 minutes lag (ETA accuracy is degrading visibly)
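The backlog arithmetic and the thresholds in this list can be made concrete with a short sketch. The rate and thresholds are taken from the text; the function names, the action labels, and the km/h unit are assumptions for illustration.

```python
PROBE_RATE_PER_SEC = 4_000_000  # from the text
SCALE_OUT_LAG_SECONDS = 60      # add consumers once lag exceeds this
PAGE_LAG_SECONDS = 180          # page the on-call at 3 minutes of lag
STALE_AFTER_SECONDS = 180       # live segment data older than this is ignored


def backlog_size(lag_seconds, probe_rate=PROBE_RATE_PER_SEC):
    """Number of unprocessed probes implied by a given consumer lag."""
    return lag_seconds * probe_rate


def lag_actions(lag_seconds):
    """Map the current consumer lag to the mitigations listed above."""
    actions = []
    if lag_seconds > SCALE_OUT_LAG_SECONDS:
        actions.append("scale_out_consumers")
    if lag_seconds >= PAGE_LAG_SECONDS:
        actions.append("page_on_call")
    return actions


def segment_speed(live_speed_kmh, live_updated_ts, historical_speed_kmh, now_ts):
    """Prefer live speed; fall back to the historical average when live data is stale."""
    if live_speed_kmh is None or (now_ts - live_updated_ts) > STALE_AFTER_SECONDS:
        return historical_speed_kmh, "historical"
    return live_speed_kmh, "live"


print(backlog_size(120))  # 480000000, the 480 million figure from the text
```

Returning the source label alongside the speed lets the routing service know when an ETA rests on historical averages rather than live data.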
ETA overshoot during unexpected congestion
No routing error (the route was correct when computed). The user sees an updated ETA within 90 seconds
A major accident causes traffic to stop instantly on a highway. The 60-second aggregation window means the system takes up to 60 seconds to reflect the sudden speed drop. During navigation, the client re-queries ETA every 30 seconds. In the worst case, a driver sees the correct ETA only after 90 seconds (60-second pipeline delay + 30-second client re-query interval). During those 90 seconds, the displayed ETA could be off by 15 to 30 minutes for a severely congested route.
- Reduce aggregation window to 30 seconds for highway segments (higher probe density makes shorter windows viable)
- Reduce client ETA re-query interval from 30 to 15 seconds during active navigation
- Accept this as a fundamental trade-off: 60-second aggregation smooths noise but delays reflecting sudden events
- Display a 'traffic conditions changing' indicator when high variance in speed is detected on the route
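The worst-case delay is the sum of the two intervals, which makes it easy to see what each mitigation buys. The sketch below is a simplification that ignores network and processing latency; the function name is hypothetical.

```python
def worst_case_eta_delay(aggregation_window_s, client_requery_s):
    """Upper bound on how long a sudden speed drop takes to reach the driver's ETA.

    The event can occur just after a window opens (wait a full window), and the
    client can have polled just before the update lands (wait a full interval).
    """
    return aggregation_window_s + client_requery_s


print(worst_case_eta_delay(60, 30))  # 90: the current worst case from the text
print(worst_case_eta_delay(30, 30))  # 60: shorter window only (highway segments)
print(worst_case_eta_delay(60, 15))  # 75: faster client re-query only
print(worst_case_eta_delay(30, 15))  # 45: both mitigations applied
```

Applying both changes halves the worst case on highway segments, at the cost of noisier speed estimates and twice as many ETA queries from navigating clients.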
Geocoding ambiguity returns wrong location
No data loss or system outage. Mitigated by disambiguation UI and proximity ranking
A user searches for 'Springfield' and the geocoder returns Springfield, Missouri instead of Springfield, Illinois (there are 35 cities named Springfield in the US). The user starts navigation to the wrong city 300 miles away. They realize the error after driving for 10 minutes, wasting time and fuel.
- Disambiguate using the user's current location: rank the Springfield closest to the user highest
- Use search history: if the user recently searched for businesses in Springfield IL, prefer that Springfield
- Population-based ranking: Springfield IL (population 115K) ranks above Springfield OR (population 62K) as a tiebreaker
- Show disambiguation UI: if confidence is below 80%, display 'Did you mean...' with the top 3 candidates instead of auto-selecting
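A minimal sketch of how proximity ranking, the population tiebreaker, and the 80% confidence threshold could fit together is shown below. Several parts are assumptions, not the geocoder's actual design: the 50 km distance bucket (so that population can break ties between similarly distant candidates), the `confidence` value being supplied by the caller, and all names. Coordinates are approximate, and the Missouri population is an approximate figure not taken from the text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    lat: float
    lon: float
    population: int


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def rank_candidates(candidates, user_lat, user_lon, bucket_km=50):
    """Nearest distance bucket first; larger population breaks ties within a bucket."""
    def key(c):
        distance = haversine_km(user_lat, user_lon, c.lat, c.lon)
        return (int(distance // bucket_km), -c.population)
    return sorted(candidates, key=key)


def resolve(candidates, user_lat, user_lon, confidence, threshold=0.80):
    """Auto-select only when confident; otherwise offer the top 3 to the user."""
    ranked = rank_candidates(candidates, user_lat, user_lon)
    if confidence < threshold:
        return {"action": "disambiguate", "options": ranked[:3]}
    return {"action": "navigate", "options": ranked[:1]}


# Illustrative data only (approximate coordinates; MO population is approximate).
SPRINGFIELDS = [
    Candidate("Springfield, MO", 37.21, -93.29, 169_000),
    Candidate("Springfield, IL", 39.80, -89.64, 115_000),
    Candidate("Springfield, OR", 44.05, -123.02, 62_000),
]
```

For a user near Chicago (about 41.88, -87.63), `rank_candidates(SPRINGFIELDS, 41.88, -87.63)` places the Illinois entry first, and `resolve(...)` with a confidence below 0.80 returns all three as 'Did you mean...' options instead of auto-selecting.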