Ride Sharing Failure Modes
What breaks, how to detect it, and how to fix it. Every failure mode includes detection metrics, mitigations, and a severity rating.
Driver location goes stale (ghost drivers)
Ghost drivers directly impact rider experience by adding 15 seconds of wasted wait time per ghost encounter. In dense areas with high driver turnover, this is the most common failure mode.
Constraint: drivers go offline without sending an explicit signal (phone dies, app crashes, dead zone). What breaks: the geo index still shows the driver as AVAILABLE. We dispatch a ride to them, the 15-second acceptance timer expires, and we waste 15 seconds before re-dispatching. If three ghost drivers are in the area, the rider waits 45 seconds. User impact: direct degradation of rider wait time, the most visible quality metric.
- Background job every 5 seconds scans for drivers with last_location_at older than 15 seconds. We mark them INACTIVE and remove them from the geo index. Worst-case ghost exposure in the index: ~20 seconds (the 15-second threshold plus up to one 5-second scan interval).
- On the next location update from a previously stale driver, we automatically re-add them to the geo index and set status back to AVAILABLE.
- Our dispatch algorithm skips drivers whose last update is older than 10 seconds even if they are still in the index. Belt-and-suspenders with the cleanup job.
- Client-side heartbeat: the driver app sends a lightweight ping every 10 seconds even when stationary. Absence of ping for 30 seconds triggers server-side cleanup.
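The sweep-and-reactivate cycle above can be sketched as follows. This is a minimal in-memory model; the dict layout, field names, and function names are assumptions for illustration, not the production service:

```python
STALE_AFTER_S = 15  # cleanup threshold from the background job above

def sweep_stale_drivers(drivers, now):
    """Mark drivers INACTIVE when their last location update is too old.

    `drivers` maps driver_id -> {"status": ..., "last_location_at": epoch_s}.
    Returns the ids removed from the geo index on this pass.
    """
    removed = []
    for driver_id, d in drivers.items():
        if d["status"] == "AVAILABLE" and now - d["last_location_at"] > STALE_AFTER_S:
            d["status"] = "INACTIVE"  # geo-index removal would happen here
            removed.append(driver_id)
    return removed

def on_location_update(drivers, driver_id, now):
    """Re-activate a previously stale driver on their next update."""
    d = drivers[driver_id]
    d["last_location_at"] = now
    if d["status"] == "INACTIVE":
        d["status"] = "AVAILABLE"  # re-add to the geo index
```

The dispatch-side staleness check is a separate guard on the read path; the sweep above only bounds how long a ghost can linger in the index.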
Dispatch race condition double-books a driver
Double-booking erodes rider trust ('driver found' then 'searching again') and wastes driver time. The optimistic locking pattern must be implemented correctly from the outset.
Constraint: at peak, 35 matches/sec in dense areas where multiple riders see the same top-ranked driver. What breaks: without concurrency control, two dispatch requests claim the same driver. The second rider sees 'Driver found' for 2 seconds before we detect the conflict and revert to 'Searching.' User impact: trust erosion from the false-positive 'driver found' message.
- We use optimistic locking via Redis CAS: on accept, WATCH the driver status key, check it is AVAILABLE, then MULTI/EXEC to set MATCHED. If the key changed between WATCH and EXEC, the transaction fails and we re-route within 500ms.
- Pre-reservation: when our dispatch algorithm selects a driver, we immediately set a 5-second soft lock (PENDING status). This prevents other dispatchers from selecting the same driver. If the driver does not accept within 5 seconds, we release the lock.
- We never show 'Driver found' to the rider until the CAS succeeds. The rider sees 'Matching...' until the optimistic lock confirms. This prevents the false-positive UX.
- Idempotent accept: if a driver accidentally taps accept twice, the second call returns the same success response without creating a duplicate match.
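The reserve-then-confirm flow can be sketched with an in-memory stand-in for the Redis status keys. A Python lock emulates the atomicity that the WATCH/MULTI/EXEC transaction provides in production; the class and method names are illustrative, and the 5-second lock expiry is omitted:

```python
import threading

class DriverRegistry:
    """In-memory sketch of the driver status keys (not the real service API)."""

    def __init__(self):
        self._status = {}   # driver_id -> AVAILABLE | PENDING | MATCHED
        self._matches = {}  # driver_id -> ride_id
        self._lock = threading.Lock()

    def set_available(self, driver_id):
        with self._lock:
            self._status[driver_id] = "AVAILABLE"

    def try_reserve(self, driver_id):
        """Soft lock: AVAILABLE -> PENDING, atomically (the pre-reservation)."""
        with self._lock:
            if self._status.get(driver_id) == "AVAILABLE":
                self._status[driver_id] = "PENDING"
                return True
            return False  # lost the race; caller re-routes to the next driver

    def confirm_match(self, driver_id, ride_id):
        """CAS-style confirm; idempotent, so a double-tapped accept returns
        the same success instead of creating a duplicate match."""
        with self._lock:
            if self._matches.get(driver_id) == ride_id:
                return True  # duplicate accept, same response
            if self._status.get(driver_id) != "PENDING":
                return False
            self._status[driver_id] = "MATCHED"
            self._matches[driver_id] = ride_id
            return True
```

Only after `confirm_match` succeeds does the rider-facing state move past 'Matching...', which is what prevents the false-positive 'Driver found' message.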
Redis geo index partition failure
A geo index outage blocks all ride matching in the affected region. The 5-second failover window is the maximum acceptable downtime. Anything longer requires fallback to a secondary data source.
Constraint: a Redis Cluster master node hosting H3 cells for Manhattan goes down. What breaks: the sorted sets for those cells (containing ~50K drivers) are unavailable. All ride requests in Manhattan fail with 'no drivers found' even though 50K drivers are online. User impact: complete ride matching outage in the affected region for up to 5 seconds during failover.
- Redis Cluster automatic failover: replica promotes to master within 5 seconds. During the gap, our dispatch service retries failed queries with exponential backoff (100ms, 200ms, 400ms).
- Cross-zone replication: we maintain a warm standby Redis Cluster in a second availability zone. If the primary cluster is unreachable for more than 10 seconds, we switch reads to the standby. Stale data (up to 3 seconds old) is acceptable for nearby-driver queries.
- Graceful degradation: if Redis is unavailable, we fall back to a direct query against the Cassandra location table with a geo-bounded scan. Latency degrades from 5ms to 200ms, but rides can still be matched.
- Circuit breaker: after 10 consecutive Redis failures within 1 second, we open the circuit and route all queries to the fallback path. We close the circuit after 3 successful Redis pings.
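A minimal sketch of the breaker described above: open after 10 failures inside a 1-second window, close again after 3 consecutive successful pings. Thresholds match the text; the class and method names are assumptions:

```python
class RedisCircuitBreaker:
    """Sketch of a count-within-window circuit breaker for the Redis path."""

    def __init__(self, failure_threshold=10, window_s=1.0, close_after=3):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.close_after = close_after
        self.failures = []  # timestamps of recent failures
        self.successes = 0
        self.open = False

    def record_failure(self, now):
        # Keep only failures inside the sliding window, then add this one.
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        self.successes = 0
        if len(self.failures) >= self.failure_threshold:
            self.open = True  # route all queries to the Cassandra fallback

    def record_success(self, now):
        if self.open:
            self.successes += 1
            if self.successes >= self.close_after:
                self.open = False  # resume normal Redis reads
                self.failures.clear()
        else:
            self.failures.clear()

    def use_fallback(self):
        return self.open
```

While the breaker is open, queries skip Redis entirely instead of paying a timeout per request, which keeps matching latency bounded during the outage.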
Kafka consumer lag causes stale location data
Stale location data causes bad dispatch decisions. Drivers decline rides they are too far from, increasing rider wait time. The 10-second staleness threshold is the maximum before dispatch quality degrades noticeably.
Constraint: a burst of 500K location updates per second (3x normal peak) overwhelms the Kafka consumer group. What breaks: consumers cannot keep up, and lag grows to 30 seconds. The Redis geo index shows driver positions that are 30 seconds old. A highway driver who moved 1 km in those 30 seconds appears at their old location, so we dispatch to a driver who is no longer nearby. User impact: the driver declines because the pickup is now well behind them, increasing rider wait time.
- Auto-scale Kafka consumers: when lag exceeds 5 seconds, we spin up additional consumer instances. Kafka rebalances partitions across the expanded consumer group within 30 seconds.
- Drop stale events: if a location update's timestamp is older than 10 seconds when our consumer processes it, we skip the Redis update. The driver's next update (3 seconds later) will be current.
- Increase Kafka partitions: more partitions allow more parallel consumers. For 500K/sec peak, we use 24 partitions (each handling ~21K events/sec) instead of 6.
- Back-pressure signaling: if consumer lag exceeds 15 seconds, the API gateway starts dropping every other location update from each driver (6-second interval instead of 3). This halves the write load while maintaining acceptable location freshness.
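The stale-event drop and the back-pressure sampling can be sketched as two small guards, one on the consumer side and one on the API gateway. Field names and the per-driver sequence counter are assumptions:

```python
STALENESS_CUTOFF_S = 10  # consumer-side cutoff from the mitigation above

def process_location_event(event, geo_index, now):
    """Skip geo-index writes for events older than the cutoff; the driver's
    next update (3 seconds later) will be current anyway."""
    if now - event["ts"] > STALENESS_CUTOFF_S:
        return False  # dropped: too stale to be useful
    geo_index[event["driver_id"]] = (event["lat"], event["lon"])
    return True

def should_accept_update(driver_seq, lag_s, lag_cutoff_s=15):
    """API-gateway back-pressure: under heavy consumer lag, keep every other
    update per driver (6-second effective interval instead of 3),
    halving the write load."""
    if lag_s <= lag_cutoff_s:
        return True
    return driver_seq % 2 == 0
```

Dropping at the gateway shrinks the queue; dropping at the consumer prevents already-queued stale data from overwriting fresher positions.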
ETA estimation failure shows wildly wrong times
Wrong ETAs hurt rider trust but do not block ride matching. The fare cap policy limits financial impact. Our fallback chain ensures rides are always matchable, with decreasing accuracy at each tier.
Constraint: the ML model serving layer goes down after a bad deployment. What breaks: we fall back to Dijkstra + traffic (11% error). For a specific route through a construction zone not in the map data, the ETA shows 5 minutes but the actual drive takes 20 minutes. User impact: the rider expects a 5-minute pickup and gets a 20-minute wait. The fare estimate was also based on the wrong ETA.
- Three-tier fallback: (1) full pipeline (Dijkstra + traffic + ML), (2) Dijkstra + traffic only (11% error), (3) straight-line distance / average city speed of 25 km/h (40% overestimate). Each tier activates when the layer above is unavailable.
- We show ETA as a range instead of a point estimate: '3-5 min' instead of '4 min.' The range width increases when the ML layer is down, signaling lower confidence to the rider.
- Post-trip fare adjustment: if actual trip duration exceeds estimated duration by more than 50%, we automatically apply a fare cap at the estimated price. The rider does not pay for our ETA error.
- Canary deployments for the ML model: we roll out to 5% of traffic first. If ETA error rate increases, we roll back before affecting all riders.
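The fallback chain and range widening can be sketched as below. The tier-2 and tier-3 values are stand-ins for the real routing outputs, and the range widths are illustrative, not the production tuning:

```python
AVG_CITY_SPEED_KMH = 25  # tier-3 constant from the fallback above

def estimate_eta_minutes(route_km, ml_eta=None, dijkstra_eta=None):
    """Three-tier fallback: prefer the full ML pipeline, then Dijkstra +
    traffic, then straight-line distance over average city speed.
    A tier passes None when its serving layer is down."""
    if ml_eta is not None:
        return ml_eta
    if dijkstra_eta is not None:
        return dijkstra_eta
    return route_km / AVG_CITY_SPEED_KMH * 60  # tier 3: coarse but available

def eta_range(eta_minutes, ml_available):
    """Show a range, not a point estimate; widen it when the ML tier is
    down to signal lower confidence (spread values are assumptions)."""
    spread = 1 if ml_available else 3
    return (max(1, round(eta_minutes - spread)), round(eta_minutes + spread))
```

Because every tier returns something, ride matching never blocks on ETA availability; only accuracy degrades.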
Surge pricing oscillation destabilizes a region
Oscillation frustrates riders and drivers but does not prevent rides from completing. Our EMA smoothing eliminates the oscillation pattern in practice.
Constraint: a concert ends at Madison Square Garden and 20,000 people request rides simultaneously. What breaks: surge jumps to 5.0x. Riders see the high price and stop requesting. In the next 60-second computation window, demand drops to zero. Surge falls to 1.0x. Riders flood back. Surge spikes to 5.0x again. This oscillation repeats every 2-3 minutes. User impact: riders and drivers are frustrated by unpredictable pricing; drivers chase surge zones that disappear before they arrive.
- Exponential Moving Average (EMA) smoothing: surge_smoothed = α × surge_raw + (1 − α) × surge_previous, with a smoothing factor α of roughly 0.3. A sudden 5x demand spike is smoothed to a gradual ramp: 1.0 -> 2.2 -> 3.1 -> 3.7 -> 4.1. Reaching the peak takes 4 windows (4 minutes), preventing oscillation.
- Minimum surge duration: once surge activates, we keep it for at least 5 minutes even if demand drops. This gives drivers time to arrive in the area before the incentive disappears.
- Gradual ramp-down: surge decreases by at most 0.5x per computation window. A 5.0x surge takes at least 8 minutes to return to 1.0x, smoothing the transition.
- Demand prediction: for known events (concerts, sports games), we pre-position surge multipliers 15 minutes before the event ends based on historical patterns. This prevents the initial spike.
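The EMA smoothing and the ramp-down cap combine into one per-window update. This sketch assumes α = 0.3, which produces a ramp close to the one described above (exact values differ slightly from the rounded figures in the text); the minimum-duration rule is omitted:

```python
ALPHA = 0.3               # smoothing factor; ~0.3 matches the ramp shape above
MAX_DROP_PER_WINDOW = 0.5  # gradual ramp-down cap

def next_surge(prev, raw_multiplier):
    """One 60-second window of surge smoothing: EMA toward the raw
    demand-derived multiplier, with decreases capped at 0.5x per window."""
    smoothed = ALPHA * raw_multiplier + (1 - ALPHA) * prev
    if smoothed < prev - MAX_DROP_PER_WINDOW:
        smoothed = prev - MAX_DROP_PER_WINDOW  # cap the fall, not the rise
    return round(smoothed, 2)
```

The asymmetry is deliberate: spikes still ramp up over a few windows, while a collapse in demand unwinds at a fixed 0.5x per window, so drivers en route to a surge zone do not see the incentive vanish instantly.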