
Ride Sharing Failure Modes

What breaks, how to detect it, and how to fix it. Each failure mode includes detection signals, mitigations, and a severity rating.

Failure Modes
Severity: HIGH

Driver location goes stale (ghost drivers)

Ghost drivers directly impact rider experience by adding 15 seconds of wasted wait time per ghost encounter. In dense areas with high driver turnover, this is the most common failure mode.

Constraint: drivers go offline without sending an explicit signal (phone dies, app crashes, dead zone). What breaks: the geo index still shows the driver as AVAILABLE. We dispatch a ride to them, the 15-second acceptance timer expires, and we waste 15 seconds before re-dispatching. If three ghost drivers are in the area, the rider waits 45 seconds. User impact: direct degradation of rider wait time, the most visible quality metric.

Detection
We track last_location_at per driver. Alert when any AVAILABLE driver has no update for more than 15 seconds. We monitor the ratio of dispatch timeouts to total dispatches. A spike above 5% indicates ghost driver contamination in the geo index.
Mitigation
  1. Background job every 5 seconds scans for drivers with last_location_at older than 15 seconds. We mark them INACTIVE and remove them from the geo index. Maximum ghost exposure: ~20 seconds (the 15-second threshold plus up to one 5-second scan interval).
  2. On the next location update from a previously stale driver, we automatically re-add them to the geo index and set status back to AVAILABLE.
  3. Our dispatch algorithm skips drivers whose last update is older than 10 seconds, even if they are still in the index. This is belt-and-suspenders redundancy with the cleanup job.
  4. Client-side heartbeat: the driver app sends a lightweight ping every 10 seconds even when stationary. Absence of ping for 30 seconds triggers server-side cleanup.
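The sweep-and-readmit cycle in mitigations 1 and 2 can be sketched in Python. This is an illustrative in-memory stand-in (a plain set for the geo index, dicts for timestamps and statuses), not the Redis-backed implementation; names are our own:

```python
import time

STALE_AFTER_S = 15  # matches the 15-second staleness threshold above


def sweep_stale_drivers(last_location_at, geo_index, status, now=None):
    """Mark AVAILABLE drivers INACTIVE and drop them from the geo index
    when their last location update is older than STALE_AFTER_S seconds."""
    now = time.time() if now is None else now
    removed = []
    for driver_id, ts in last_location_at.items():
        if status.get(driver_id) == "AVAILABLE" and now - ts > STALE_AFTER_S:
            status[driver_id] = "INACTIVE"
            geo_index.discard(driver_id)  # in Redis: a ZREM on the H3 cell's set
            removed.append(driver_id)
    return removed


def on_location_update(driver_id, ts, last_location_at, geo_index, status):
    """Re-admit a previously stale driver on their next location update."""
    last_location_at[driver_id] = ts
    if status.get(driver_id) == "INACTIVE":
        status[driver_id] = "AVAILABLE"
        geo_index.add(driver_id)
```

In production the sweep would run as the 5-second background job and the geo index would be the Redis sorted sets, but the state transitions are the same.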
Severity: CRITICAL

Dispatch race condition double-books a driver

Double-booking erodes rider trust ('driver found' then 'searching again') and wastes driver time. The optimistic locking pattern must be correct on the first implementation.

Constraint: at peak, 35 matches/sec in dense areas where multiple riders see the same top-ranked driver. What breaks: without concurrency control, two dispatch requests claim the same driver. The second rider sees 'Driver found' for 2 seconds before we detect the conflict and revert to 'Searching.' User impact: trust erosion from the false-positive 'driver found' message.

Detection
We monitor CAS failure rate on driver status transitions. A healthy system sees under 1% CAS failures. Above 5% indicates either too many concurrent dispatches in the same area or a bug in the locking mechanism.
Mitigation
  1. We use optimistic locking via Redis CAS: on accept, WATCH the driver status key, check it is AVAILABLE, then MULTI/EXEC to set MATCHED. If the key changed between WATCH and EXEC, the transaction fails and we re-route within 500ms.
  2. Pre-reservation: when our dispatch algorithm selects a driver, we immediately set a 5-second soft lock (PENDING status). This prevents other dispatchers from selecting the same driver. If the driver does not accept within 5 seconds, we release the lock.
  3. We never show 'Driver found' to the rider until the CAS succeeds. The rider sees 'Matching...' until the optimistic lock confirms. This prevents the false-positive UX.
  4. Idempotent accept: if a driver accidentally taps accept twice, the second call returns the same success response without creating a duplicate match.
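A minimal sketch of the AVAILABLE -> PENDING soft lock from mitigations 1 and 2, using an in-memory compare-and-set in place of Redis WATCH/MULTI/EXEC. The class and method names are illustrative, not our production API:

```python
import threading


class DriverStatusStore:
    """In-memory stand-in for the Redis driver-status keys, exposing a
    compare-and-set so two dispatchers cannot both claim one driver."""

    def __init__(self):
        self._status = {}
        self._lock = threading.Lock()

    def cas(self, driver_id, expected, new):
        # In Redis this is WATCH key / check value / MULTI ... EXEC;
        # a concurrent write makes EXEC fail and the caller re-routes.
        with self._lock:
            if self._status.get(driver_id, "AVAILABLE") != expected:
                return False  # lost the race
            self._status[driver_id] = new
            return True


def dispatch(store, driver_id):
    """Soft-lock the driver (PENDING) before showing anything to the rider;
    only a successful AVAILABLE -> PENDING transition wins the driver."""
    return store.cas(driver_id, "AVAILABLE", "PENDING")
```

Only after a later PENDING -> MATCHED transition succeeds does the rider see 'Driver found', matching mitigation 3.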
Severity: CRITICAL

Redis geo index partition failure

A geo index outage blocks all ride matching in the affected region. The 5-second failover window is the maximum acceptable downtime. Anything longer requires fallback to a secondary data source.

Constraint: a Redis Cluster master node hosting H3 cells for Manhattan goes down. What breaks: the sorted sets for those cells (containing ~50K drivers) are unavailable. All ride requests in Manhattan fail with 'no drivers found' even though 50K drivers are online. User impact: complete ride matching outage in the affected region for up to 5 seconds during failover.

Detection
Redis Cluster node health checks every 1 second. CLUSTER INFO shows a node as PFAIL (possibly failed) within 2 seconds, then FAIL within 5 seconds after quorum agreement. Application-level monitoring: spike in 'no drivers found' responses for a specific geographic region.
Mitigation
  1. Redis Cluster automatic failover: replica promotes to master within 5 seconds. During the gap, our dispatch service retries failed queries with exponential backoff (100ms, 200ms, 400ms).
  2. Multi-region replication: we maintain a warm standby Redis Cluster in a second availability zone. If the primary cluster is unreachable for more than 10 seconds, we switch reads to the standby. Stale data (up to 3 seconds old) is acceptable for nearby-driver queries.
  3. Graceful degradation: if Redis is unavailable, we fall back to a direct query against the Cassandra location table with a geo-bounded scan. Latency degrades from 5ms to 200ms, but rides can still be matched.
  4. Circuit breaker: after 10 consecutive Redis failures within 1 second, we open the circuit and route all queries to the fallback path. We close the circuit after 3 successful Redis pings.
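The circuit breaker in mitigation 4 can be sketched as a small state machine. Thresholds mirror the text (10 consecutive failures to open, 3 successful pings to close); the 1-second failure window and the ping loop are omitted for brevity, and the class name is our own:

```python
class CircuitBreaker:
    """Opens after `threshold` consecutive failures, routing queries to the
    Cassandra fallback; closes after `recovery` consecutive successful pings."""

    def __init__(self, threshold=10, recovery=3):
        self.threshold = threshold
        self.recovery = recovery
        self.failures = 0
        self.successes = 0
        self.open = False

    def record_failure(self):
        self.failures += 1
        self.successes = 0
        if self.failures >= self.threshold:
            self.open = True  # stop hammering Redis; use the fallback path

    def record_success(self):
        self.failures = 0
        if self.open:
            self.successes += 1
            if self.successes >= self.recovery:
                self.open = False  # Redis looks healthy again
                self.successes = 0

    def use_fallback(self):
        return self.open
```

The dispatch service would call `use_fallback()` before each geo query and route to the Cassandra geo-bounded scan while the circuit is open.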
Severity: HIGH

Kafka consumer lag causes stale location data

Stale location data causes bad dispatch decisions. Drivers decline rides they are too far from, increasing rider wait time. The 10-second staleness threshold is the maximum before dispatch quality degrades noticeably.

Constraint: a burst of 500K location updates per second (3x normal peak) overwhelms the Kafka consumer group. What breaks: consumers cannot keep up, lag grows to 30 seconds. The Redis geo index shows driver positions that are 30 seconds old. A driver who moved 2 km in 30 seconds appears at their old location. We dispatch to a driver who is no longer nearby. User impact: the driver declines because the pickup is now 2 km away, increasing rider wait time.

Detection
We monitor Kafka consumer lag per partition. Alert when lag exceeds 5 seconds. We correlate with dispatch decline rate: a spike in declines after lag onset confirms stale data is causing bad matches.
Mitigation
  1. Auto-scale Kafka consumers: when lag exceeds 5 seconds, we spin up additional consumer instances. Kafka rebalances partitions across the expanded consumer group within 30 seconds.
  2. Drop stale events: if a location update's timestamp is older than 10 seconds when our consumer processes it, we skip the Redis update. The driver's next update (3 seconds later) will be current.
  3. Increase Kafka partitions: more partitions allow more parallel consumers. For 500K/sec peak, we use 24 partitions (each handling ~21K events/sec) instead of 6.
  4. Back-pressure signaling: if consumer lag exceeds 15 seconds, the API gateway starts dropping every other location update from each driver (6-second interval instead of 3). This halves the write load while maintaining acceptable location freshness.
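Mitigation 2's stale-event filter is a one-line timestamp check in the consumer. A hedged sketch, using a plain dict in place of the Redis geo index (field names are illustrative):

```python
MAX_STALENESS_S = 10  # matches the 10-second drop threshold above


def apply_location_update(event, now, geo_index):
    """Skip the geo-index write when the event is older than MAX_STALENESS_S;
    under lag, the driver's next 3-second update will carry a fresh position."""
    if now - event["ts"] > MAX_STALENESS_S:
        return False  # dropped: stale under consumer lag
    geo_index[event["driver_id"]] = (event["lat"], event["lng"])
    return True
```

Dropping rather than applying stale events means the index converges to current positions as soon as the consumer catches up, instead of replaying a 30-second-old history.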
Severity: MEDIUM

ETA estimation failure shows wildly wrong times

Wrong ETAs hurt rider trust but do not block ride matching. The fare cap policy limits financial impact. Our fallback chain ensures rides are always matchable, with decreasing accuracy at each tier.

Constraint: the ML model serving layer goes down after a bad deployment. What breaks: we fall back to Dijkstra + traffic (11% error). For a specific route through a construction zone not in the map data, the ETA shows 5 minutes but the actual drive takes 20 minutes. User impact: the rider expects a 5-minute pickup and gets a 20-minute wait. The fare estimate was also based on the wrong ETA.

Detection
We compare predicted ETA against actual pickup time for completed rides. Alert when the average absolute error exceeds 20% over a 5-minute window. We track model serving latency and error rate independently.
Mitigation
  1. Three-tier fallback: (1) full pipeline (Dijkstra + traffic + ML), (2) Dijkstra + traffic only (11% error), (3) straight-line distance / average city speed of 25 km/h (40% overestimate). Each tier activates when the layer above is unavailable.
  2. We show ETA as a range instead of a point estimate: '3-5 min' instead of '4 min.' The range width increases when the ML layer is down, signaling lower confidence to the rider.
  3. Post-trip fare adjustment: if actual trip duration exceeds estimated duration by more than 50%, we automatically apply a fare cap at the estimated price. The rider does not pay for our ETA error.
  4. Canary deployments for the ML model: we roll out to 5% of traffic first. If ETA error rate increases, we roll back before affecting all riders.
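The three-tier fallback in mitigation 1 is essentially an ordered chain of ETA providers. A sketch under the assumption that an unavailable tier signals failure by raising an exception; the function names and tier wiring are illustrative:

```python
AVG_CITY_SPEED_KMH = 25  # tier-3 assumption from the text


def straight_line_eta_min(distance_km, speed_kmh=AVG_CITY_SPEED_KMH):
    """Tier 3: straight-line distance over average city speed, in minutes."""
    return distance_km / speed_kmh * 60


def eta_with_fallback(tiers):
    """Try each ETA tier in order, most to least accurate.

    tiers: list of (name, zero-arg callable). A tier that is unavailable
    raises; we degrade to the next tier instead of failing the request.
    """
    for name, fn in tiers:
        try:
            return name, fn()
        except Exception:
            continue  # layer unavailable; fall through to the next tier
    raise RuntimeError("no ETA tier available")
```

A degraded call might look like `eta_with_fallback([("ml", ml_eta), ("dijkstra", dijkstra_eta), ("straight_line", lambda: straight_line_eta_min(5.0))])`, where the first tier raising simply shifts the answer (and the displayed range width) down a tier.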
Severity: MEDIUM

Surge pricing oscillation destabilizes a region

Oscillation frustrates riders and drivers but does not prevent rides from completing. Our EMA smoothing eliminates the oscillation pattern in practice.

Constraint: a concert ends at Madison Square Garden and 20,000 people request rides simultaneously. What breaks: surge jumps to 5.0x. Riders see the high price and stop requesting. In the next 60-second computation window, demand drops to zero. Surge falls to 1.0x. Riders flood back. Surge spikes to 5.0x again. This oscillation repeats every 2-3 minutes. User impact: riders and drivers are frustrated by unpredictable pricing; drivers chase surge zones that disappear before they arrive.

Detection
We monitor surge multiplier variance per H3 cell over 5-minute windows. Alert when a cell oscillates more than 2x between consecutive computation windows. We track rider request cancellation rate during surge oscillation.
Mitigation
  1. Exponential Moving Average (EMA) smoothing: EMA_t = 0.3 * ratio_t + 0.7 * EMA_(t-1). A sudden 5x demand spike is smoothed to a gradual ramp: 1.0 -> 2.2 -> 3.0 -> 3.6 -> 4.0. Reaching the peak takes four 60-second windows (4 minutes), preventing oscillation.
  2. Minimum surge duration: once surge activates, we keep it for at least 5 minutes even if demand drops. This gives drivers time to arrive in the area before the incentive disappears.
  3. Gradual ramp-down: surge decreases by at most 0.5x per computation window. A 5.0x surge takes at least 8 minutes to return to 1.0x, smoothing the transition.
  4. Demand prediction: for known events (concerts, sports games), we pre-position surge multipliers 15 minutes before the event ends based on historical patterns. This prevents the initial spike.
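The EMA from mitigation 1 is a few lines of code. This sketch applies it to a series of per-window demand/supply ratios and shows how a sudden 5x spike ramps up over several windows instead of jumping (function name is our own):

```python
ALPHA = 0.3  # weight on the current window's demand/supply ratio


def smooth_surge(raw_ratios, ema0=1.0):
    """Apply EMA_t = ALPHA * ratio_t + (1 - ALPHA) * EMA_(t-1) to each
    60-second window's raw ratio, starting from a baseline multiplier."""
    ema, out = ema0, []
    for ratio in raw_ratios:
        ema = ALPHA * ratio + (1 - ALPHA) * ema
        out.append(round(ema, 2))  # rounded for display; ema stays exact
    return out
```

Feeding a sustained 5.0x raw ratio yields `[2.2, 3.04, 3.63, 4.04]`: the multiplier climbs gradually, so riders never see the 1.0x-to-5.0x whipsaw that drives the oscillation.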