Brute-Force Distance Scan
Very Common
Scanning all 500K drivers to find the nearest ones means computing haversine distance 500K times per query. At 58 nearby queries per second average, that is 29 million distance calculations per second. The approach stops scaling past roughly 10K drivers per city.
Why: The naive approach works in a prototype with 100 drivers. Candidates skip the spatial index step because the brute-force query returns correct results. But correct results at prototype scale say nothing about production, where 500K drivers move every few seconds and the system must also absorb 167K location updates per second.
WRONG: For each ride request, scan all 500K drivers, compute haversine distance for each, sort by distance, and return the top 10. At 58 QPS average: 29M distance calculations/sec. Each computation involves trigonometric functions. Latency: 200ms per query. At peak: 500ms+.
RIGHT: We use H3 hexagonal grid at resolution 9 (not QuadTree, not brute-force). Compute the rider's cell ID in O(1), query that cell plus 6 neighbors (k-ring). Each cell contains 5-20 drivers in urban areas. Total candidates scanned: 50-150 instead of 500K. Latency: <10ms per query. Trade-off: we gave up exact nearest-neighbor for approximate nearest-neighbor within one cell radius.
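The cell-plus-neighbors lookup can be sketched with a plain square grid standing in for H3 (real H3 cells are hexagons with 6 neighbors; this square stand-in has 8, and the cell size is an assumed ~500m). All names here are illustrative, not a real index API:

```python
import math
from collections import defaultdict

CELL_DEG = 0.005  # ~500 m per cell side at mid latitudes (stand-in for H3 res 9)

def cell_id(lat: float, lng: float) -> tuple:
    """Cell ID is a pure function of position: O(1), no tree traversal."""
    return (int(lat // CELL_DEG), int(lng // CELL_DEG))

def neighbors(cell: tuple) -> list:
    """The cell plus its 8 surrounding cells (square-grid analogue of an H3 k-ring)."""
    r, c = cell
    return [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

# index: cell -> {driver_id: (lat, lng)}
index = defaultdict(dict)

def add_driver(driver_id, lat, lng):
    index[cell_id(lat, lng)][driver_id] = (lat, lng)

def nearby(lat, lng, k=10):
    """Scan only the rider's cell and its neighbors instead of all 500K drivers."""
    candidates = []
    for cell in neighbors(cell_id(lat, lng)):
        for did, (dlat, dlng) in index.get(cell, {}).items():
            # Planar approximation is fine within ~1 km; no haversine needed here.
            dist = math.hypot(dlat - lat, dlng - lng)
            candidates.append((dist, did))
    return [did for _, did in sorted(candidates)[:k]]
```

Drivers outside the 3x3 neighborhood are never touched, which is the entire point: the candidate set is bounded by cell density, not fleet size.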
Matching by Nearest Distance Only
Very Common
Picking the geographically closest driver ignores traffic, direction of travel, and driver quality. The constraint: distance and ETA diverge in real traffic. A driver 500m away but facing the wrong direction on a one-way street may be 8 minutes away.
Why: Distance is the simplest metric to compute. Candidates default to 'find the nearest driver' without considering that distance and ETA diverge in cities with one-way streets, construction, and rush-hour congestion.
WRONG: Sort candidates by straight-line distance and dispatch to the closest. Driver A is 500m away but stuck in traffic facing the opposite direction (ETA: 8 min). Driver B is 900m away heading toward the rider on a clear road (ETA: 3 min). System picks A. Rider waits 5 minutes longer than necessary.
RIGHT: We rank by weighted score: ETA (50%), direction of travel (20%), driver rating (15%), acceptance rate (15%). We chose this weighting (not equal weights) because ETA has the strongest correlation with rider satisfaction. ETA computed via road graph with real-time traffic. Driver B scores higher despite being farther. Rider pickup in 3 minutes instead of 8. Trade-off: ranking takes 7ms vs 1ms for nearest-only.
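The weighted ranking can be sketched as below. The normalization choices (a 15-minute ETA ceiling, a 5-star rating scale, heading alignment already expressed in [0, 1]) are assumptions for illustration, not the production formula:

```python
# Weights from the text: ETA 50%, direction 20%, rating 15%, acceptance 15%.
WEIGHTS = {"eta": 0.50, "heading": 0.20, "rating": 0.15, "acceptance": 0.15}

def dispatch_score(eta_min, heading_align, rating, acceptance, max_eta=15.0):
    """Each factor is normalized to [0, 1], higher is better.
    ETA is inverted so a lower ETA produces a higher score."""
    eta_score = max(0.0, 1.0 - eta_min / max_eta)
    rating_score = rating / 5.0  # assumes a 5-star scale
    return (WEIGHTS["eta"] * eta_score
            + WEIGHTS["heading"] * heading_align
            + WEIGHTS["rating"] * rating_score
            + WEIGHTS["acceptance"] * acceptance)

# Driver A: 500m away but ETA 8 min, facing away. Driver B: 900m away, ETA 3 min.
a = dispatch_score(eta_min=8, heading_align=0.1, rating=4.8, acceptance=0.9)
b = dispatch_score(eta_min=3, heading_align=0.9, rating=4.6, acceptance=0.9)
assert b > a  # B wins despite being geographically farther
```

Because ETA carries half the weight, a 5-minute ETA gap dominates small differences in rating or acceptance rate, which matches the stated rationale.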
QuadTree for Write-Heavy Location Updates
Common
QuadTree works well for read-heavy spatial queries on static datasets. The constraint: at 167K location updates per second, every update may trigger a node split or merge, causing lock contention and write latency spikes that break the real-time pipeline.
Why: QuadTree is the textbook answer for spatial queries and works well for static datasets like restaurant search. Candidates apply the same pattern to a dynamic dataset where 500K drivers move every 3 seconds without calculating the write throughput impact.
WRONG: Maintain an in-memory QuadTree for 500K drivers. Every 3 seconds each driver reports a new position, and any update that crosses a cell boundary triggers a remove + re-insert that may cause node splits or merges. At 167K updates/sec, the tree is rebalancing thousands of nodes per second. Lock contention on internal nodes causes p99 latency to spike to 50ms+.
RIGHT: We chose H3 hexagonal grid (not QuadTree) because cell ID is a pure function of (lat, lng). No tree, no splits, no merges. On move: ZREM from old cell's sorted set, ZADD to new cell's sorted set. Both are O(log n) in Redis with n < 50 drivers per cell. No lock contention. Trade-off: we gave up adaptive cell density for zero write contention.
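The move operation can be sketched with plain dicts standing in for Redis sorted sets (the ZREM/ZADD equivalents are noted inline; using the update timestamp as the score is an assumption):

```python
from collections import defaultdict

# Stand-in for Redis: one sorted set per cell, member = driver_id, score = last update ts.
cells = defaultdict(dict)
driver_cell = {}  # driver_id -> key of the cell the driver is currently indexed in

def report_location(driver_id, cell_key, ts):
    """On move: remove from the old cell, add to the new one.
    No tree, no splits, no merges — two bounded sorted-set operations."""
    old = driver_cell.get(driver_id)
    if old is not None and old != cell_key:
        cells[old].pop(driver_id, None)   # ZREM from the old cell's sorted set
    cells[cell_key][driver_id] = ts       # ZADD to the new cell's sorted set (upsert)
    driver_cell[driver_id] = cell_key
```

Because the cell ID is computed from (lat, lng) alone, an update never touches any shared structure beyond the two small per-cell sets, which is why write contention disappears.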
No Optimistic Locking on Dispatch
Common
Without locking on driver status, two concurrent dispatches can match the same driver to two different riders. The constraint: during rush hour in dense areas, multiple riders within the same H3 cell request rides simultaneously, making collisions frequent, not rare.
Why: At 11.6 matches per second average, collisions seem rare. Candidates skip the concurrency control because the happy path works. But peak is 35 matches/sec, and in dense cells with 50+ drivers, two riders often pick the same top-ranked driver.
WRONG: Dispatch service queries available drivers, picks the best, and sends a ride request without checking or locking driver status. Two dispatches 200ms apart both see Driver B as available. Both send requests. Driver B accepts the first. The second rider's app shows 'driver found' then immediately switches to 'searching' again. Trust eroded.
RIGHT: We use optimistic locking (not pessimistic) because holding a lock for the full 15-second accept window would block other dispatchers. On accept, CAS (Compare-And-Swap) driver status from AVAILABLE to MATCHED in Redis. If CAS fails, the driver was already claimed. We re-dispatch to the next candidate within 500ms. The rider never sees a false 'driver found' message because we hold the UI update until CAS succeeds. Trade-off: occasional re-dispatch latency (under 1% of matches) for higher concurrent throughput.
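A minimal sketch of the CAS dispatch path, using an in-process lock to stand in for Redis's atomic command execution (in production the compare-and-swap would be a Lua script or WATCH/MULTI; all names here are illustrative):

```python
import threading

class DriverRegistry:
    """In-memory stand-in for the driver status store in Redis."""
    def __init__(self):
        self._status = {}
        self._lock = threading.Lock()  # stands in for Redis's atomic single-command execution

    def set_status(self, driver_id, status):
        with self._lock:
            self._status[driver_id] = status

    def cas(self, driver_id, expected, new):
        """Atomically swap status only if it still equals `expected`."""
        with self._lock:
            if self._status.get(driver_id) != expected:
                return False  # another dispatcher claimed this driver first
            self._status[driver_id] = new
            return True

def dispatch(registry, ranked_candidates):
    """Walk the ranked candidate list; the first successful CAS wins,
    and a losing dispatcher simply falls through to the next driver."""
    for driver_id in ranked_candidates:
        if registry.cas(driver_id, "AVAILABLE", "MATCHED"):
            return driver_id
    return None
```

Two dispatchers racing over the same ranked list each get a distinct driver, which is exactly the double-booking scenario the CAS prevents.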
Polling for Ride Status Updates
Common
The constraint: polling the server every second for ride status generates 100K requests/sec at peak concurrent rides. Over 99% of those polls return unchanged status, wasting bandwidth, battery, and server capacity.
Why: Polling is stateless and straightforward to implement. Candidates use it because WebSocket adds connection management complexity. They do not calculate that for a ride lasting 15 minutes with 3-second location updates, polling even at that 3-second cadence generates 300 requests versus 1 WebSocket connection with 300 pushes.
WRONG: Rider app polls GET /rides/id/status every 1 second. At 100K concurrent rides: 100K requests/sec to the API layer. Each response is 200 bytes. Over 99% of polls return unchanged status. Wasted bandwidth: 20 MB/sec of redundant responses. Battery drain on mobile.
RIGHT: We chose WebSocket for active rides (not HTTP polling) because one persistent connection replaces 300 poll requests per ride. Server pushes location updates every 3 seconds (40 bytes each). Total for 100K rides: 1.3 MB/sec. No redundant requests. Battery-friendly because the radio stays idle between pushes. Trade-off: we accepted stateful connection management (registry in Redis, heartbeat monitoring) for 99% lower request volume. We fall back to polling only if WebSocket fails.
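The bandwidth arithmetic behind these claims, spelled out with the text's figures (the 200-byte and 40-byte payload sizes are the stated estimates, not measurements):

```python
CONCURRENT_RIDES = 100_000

# Polling: 1 request/sec per ride, ~200-byte response, >99% carrying unchanged status.
polling_bytes_per_sec = CONCURRENT_RIDES * 1 * 200       # 20 MB/sec of mostly redundant responses

# WebSocket: one ~40-byte push every 3 seconds per ride, nothing sent in between.
websocket_bytes_per_sec = CONCURRENT_RIDES * 40 / 3      # ~1.33 MB/sec

# Per ride over 15 minutes: 1-second polling vs 3-second pushes.
polls_per_ride = 15 * 60        # 900 requests, each with HTTP overhead
pushes_per_ride = 15 * 60 // 3  # 300 small frames on one persistent connection
```

The roughly 15x byte-volume reduction understates the real win, since each poll also pays per-request connection, TLS, and header overhead that a persistent connection amortizes away.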
Storing Location History in MySQL
Common
The constraint: writing 167K location updates per second to MySQL causes write amplification from B-tree index maintenance, WAL writes, and replication. A single MySQL instance saturates at 20K inserts/sec. We would need 9 instances for location writes alone.
Why: Candidates use MySQL for everything because it is familiar and provides ACID guarantees. For the trip table (11.6 writes/sec), MySQL is fine. For the location stream (167K writes/sec), the write patterns are fundamentally different: append-only, no updates, no transactions needed.
WRONG: INSERT location updates into a MySQL driver_locations table at 167K rows/sec. Each insert updates B-tree indexes, writes to the WAL, and replicates to followers. At this rate, a single MySQL instance saturates at 20K inserts/sec. We would need 9 MySQL instances for location writes alone. Index fragmentation degrades read performance within days.
RIGHT: We chose Cassandra (not MySQL) for location history because Cassandra's LSM-tree storage engine is optimized for append-only writes: sequential disk writes, no B-tree rebalancing. We stream through Kafka first to decouple ingestion from storage. A single Cassandra node handles 50K writes/sec, and with RF=3 every logical write lands on three nodes, so a cluster of roughly a dozen such nodes absorbs the full 167K/sec load with headroom. Trade-off: we gave up SQL joins and ACID transactions (unnecessary for append-only location data) for 2.5x write throughput per node.
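The cluster-sizing arithmetic is worth doing explicitly, because a replication factor of 3 triples the physical write load (the 50K/sec per-node figure is the assumed capacity; real throughput depends on hardware and row size):

```python
import math

WRITES_PER_SEC = 167_000   # location update ingest rate
NODE_CAPACITY = 50_000     # assumed per-node write throughput
RF = 3                     # replication factor: each logical write lands on 3 nodes

# Every logical write becomes RF physical writes spread across the cluster,
# so cluster capacity is (nodes * NODE_CAPACITY) / RF, not nodes * NODE_CAPACITY.
physical_writes = WRITES_PER_SEC * RF                        # 501K node-writes/sec
nodes_needed = math.ceil(physical_writes / NODE_CAPACITY)    # minimum node count
```

This is the sanity check that catches the common mistake of sizing the cluster as if replication were free.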
No Stale Driver Detection
Common
The constraint: without stale driver removal, the geo index fills with ghost drivers whose app crashed or phone died. Each ghost driver wastes 15 seconds of dispatch timeout before the system moves to the next candidate. Three ghosts in a row means 45 seconds of rider wait time.
Why: In the happy path, drivers go offline explicitly and we remove them from the index. Candidates forget the failure cases: app crash, phone battery death, driver enters a dead zone. The driver never sends an explicit offline signal in any of these cases.
WRONG: Trust that drivers explicitly go offline. A driver's phone dies, but the geo index still shows them as AVAILABLE. Three consecutive ride requests dispatch to this ghost driver. Each waits the full 15-second timeout before re-dispatching. The rider waits 45+ seconds before getting a real driver.
RIGHT: We track last_location_at per driver. A background job runs every 5 seconds, scanning for drivers with no update in >15 seconds. We mark them INACTIVE and remove from the geo index. On the next location update, we re-add them. Maximum ghost driver exposure: 15 seconds. We also add a belt-and-suspenders check in the dispatch path: skip any driver whose last update is older than 10 seconds even if they are still in the index. Trade-off: we accepted false positives (marking a driver inactive during a brief dead zone) for guaranteed 15-second ghost cleanup.
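The background sweep plus the dispatch-path freshness check can be sketched as follows (timestamps are passed explicitly to keep the sketch testable; all names are illustrative):

```python
import time

STALE_AFTER = 15  # seconds without an update before a driver is treated as a ghost

last_location_at = {}   # driver_id -> timestamp of last location update
geo_index = set()       # driver_ids currently dispatchable (stand-in for the cell index)

def on_location_update(driver_id, now=None):
    """Every update refreshes the timestamp and re-adds a previously swept driver."""
    now = time.time() if now is None else now
    last_location_at[driver_id] = now
    geo_index.add(driver_id)

def sweep(now=None):
    """Background job (run every ~5 s): evict drivers whose updates stopped."""
    now = time.time() if now is None else now
    for driver_id in list(geo_index):
        if now - last_location_at.get(driver_id, 0) > STALE_AFTER:
            geo_index.discard(driver_id)

def is_dispatchable(driver_id, now=None, max_age=10):
    """Belt-and-suspenders check in the dispatch path itself: reject stale
    entries even if the sweep has not removed them yet."""
    now = time.time() if now is None else now
    return driver_id in geo_index and now - last_location_at.get(driver_id, 0) <= max_age
```

The tighter 10-second dispatch-path threshold means a ghost can stop costing riders time even before the 15-second sweep fires.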
Global Surge Pricing Instead of Per Cell
Occasional
The constraint: computing surge as a single global multiplier punishes riders in low-demand areas. A spike in one neighborhood inflates prices everywhere, driving riders away and misallocating driver supply.
Why: Global surge is simpler to implement: one demand counter, one supply counter, one ratio. Per-cell surge requires tracking supply and demand in each of 500K H3 cells. Candidates skip the geo-specific logic to save complexity without calculating the downstream economic distortion.
WRONG: Compute a single surge multiplier across the entire city. Times Square at midnight has 10x demand with few drivers. Brooklyn at midnight is quiet with ample supply. A global surge of 3.0x overcharges Brooklyn riders and underprices Times Square rides. Drivers flock to Times Square (already oversurged), leaving Brooklyn undersupplied.
RIGHT: We compute supply/demand per H3 cell (not globally) because geographic isolation ensures each neighborhood's pricing reflects its own conditions. Times Square cells get 5.0x surge attracting drivers from adjacent cells. Brooklyn cells stay at 1.0x. We smooth with EMA (α=0.3) per cell to prevent oscillation. Recompute every 60 seconds. Trade-off: we accepted 500K cell computations per minute and a 2-3 minute surge response delay for geographic price accuracy.
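Per-cell surge with EMA smoothing can be sketched as below. The α=0.3 comes from the text; the clamp ceiling of 5.0x and the floor of 1.0 (so quiet cells never discount below base fare) are assumptions:

```python
ALPHA = 0.3        # EMA smoothing factor from the text
MAX_SURGE = 5.0    # assumed cap on the multiplier

surge = {}  # cell_id -> smoothed surge multiplier

def raw_multiplier(demand, supply):
    """Demand/supply ratio, clamped to [1.0, MAX_SURGE]."""
    if supply == 0:
        return MAX_SURGE
    return min(MAX_SURGE, max(1.0, demand / supply))

def recompute_cell(cell_id, demand, supply):
    """Run every 60 s per cell; the EMA damps oscillation between recomputes,
    which is also why surge responds over 2-3 minutes rather than instantly."""
    raw = raw_multiplier(demand, supply)
    prev = surge.get(cell_id, 1.0)
    surge[cell_id] = ALPHA * raw + (1 - ALPHA) * prev
    return surge[cell_id]
```

A Times-Square-like cell with 10x demand climbs toward the cap over a few recomputes instead of jumping there, while a quiet cell stays pinned at 1.0x regardless of what other cells do.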