#14 Uber, Ola, Lyft
Below is a complete, time-boxed 1-hour interview answer you can deliver for designing a ride-hailing system like Uber / Ola / Lyft. Follow the minute markers to pace your response. It covers scope, functional & non-functional requirements, APIs & data model, high-level architecture, deep dives (matching, dispatch, pricing, maps, data), capacity math, ops/security, trade-offs and an evolution path.
0 – 5 min — Problem recap, scope & assumptions (set the stage)
Start by restating and aligning with the interviewer.
Goal: Build a globally available ride-hailing platform that allows riders to request rides, drivers to accept them, and the system to match efficiently while handling pricing, routing, ETA, payments, cancellations, ratings, and availability across regions. Must support real-time location tracking, low-latency matching, surge pricing, and operational resilience.
Scope for this interview (what we’ll cover):
Rider app + driver app + marketplace (matching & dispatch)
Real-time location updates, driver discovery, ETA estimation
Pricing (base fare, distance/time, surge), payments & receipts
Dispatch strategies (nearest, ETA), cancellations, pooling (optional)
Telemetry, fraud detection, regional scaling, and offline fallback
Example assumptions for BoE math:
City with 1M daily active users, 200k rides/day, peak 5k requests/min (~83 req/sec). Global system can have many such cities; design to scale horizontally by city/region.
Driver updates: each active driver sends a location every 5s.
5 – 15 min — Functional & Non-Functional Requirements
Functional Requirements (Must / Should / Nice)
Must
Request ride: rider requests pick-up/drop-off, selects ride type.
Driver discovery & match: find suitable nearby driver, send request, handle accept/decline.
Real-time tracking: rider sees driver location & ETA; driver sees pickup/drop-off route.
ETA & routing: compute ETA for arrival and trip duration.
Pricing & payments: compute fare (base + distance + time + surge), accept payments, receipts.
Trip lifecycle: accept, start trip, end trip, cancellations, disputes.
Ratings & feedback: driver and rider ratings, history.
Account management: profiles, KYC requirements for drivers, balance/summaries for drivers.
Notifications: push, SMS for critical states.
Operational tools: driver onboarding, heatmaps, surge controls.
Should
Pooling (shared rides), scheduled rides, ETA accuracy improvement.
Fraud detection (fake drivers, location spoofing).
Multi-region support & city-level configuration.
Nice-to-have
Multi-modal (bikes, scooters), driver incentives, predictive dispatch.
Non-Functional Requirements (NFR)
Latency: match decision and driver notification < 200–500 ms; ETA updates < 1s for UI fluidity.
Availability: 99.9% for core workflows; degraded read-only behavior if necessary during outages.
Scalability: horizontal scale across cities and geographies.
Consistency: eventual consistency ok for non-critical data, but trip state transitions must be strongly consistent per trip (no double assignments).
Durability: persistent trip logs for audit, billing, and disputes.
Security & Privacy: TLS, token auth, limit PII exposure, GDPR right to delete.
Observability: metrics for request latency, matching rates, driver utilization, cancellations, ETA accuracy.
Cost: minimize unnecessary network/push; control push rates.
Acceptance criteria examples
Average match time < 1s in non-congested markets.
ETA P95 error < X minutes (city config).
Trip state consistency: no two drivers can be confirmed for the same ride (ACID per trip).
15 – 25 min — APIs & Data Model (external contract)
Key APIs (Rider & Driver)
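An illustrative sketch of the external contract follows; the endpoint names and payload fields are assumptions (only /rides/request appears explicitly in the data-flow section below).

```python
# Illustrative external contract; endpoint names and payload fields are assumptions
# (only POST /rides/request is referenced explicitly in the data-flow section).
RIDER_APIS = {
    "POST /rides/request":           "pickup, dest, ride_type -> ride_id + price_estimate",
    "GET  /rides/{ride_id}":         "status, assigned driver, live location & ETA",
    "POST /rides/{ride_id}/cancel":  "cancel with reason; fee per policy",
    "GET  /trips/{trip_id}/receipt": "fare breakdown after completion",
    "POST /trips/{trip_id}/rating":  "rate the driver",
}
DRIVER_APIS = {
    "POST /drivers/location":          "lat, lon, heading, ts (every ~5s while active)",
    "POST /drivers/status":            "go online / offline",
    "POST /offers/{offer_id}/accept":  "accept a match offer",
    "POST /offers/{offer_id}/decline": "decline a match offer",
    "POST /trips/{trip_id}/start":     "driver started the trip",
    "POST /trips/{trip_id}/end":       "driver ended the trip; triggers fare & payment",
}
```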
Core data model (simplified)
RideRequest: {ride_id, rider_id, pickup, dest, type, status, created_at, assigned_driver_id, price_estimate}
Trip: {trip_id, ride_id, driver_id, start_ts, end_ts, route, distance, fare, status}
DriverState: {driver_id, lat, lon, heading, status(active/idle/on_trip), vehicle_info, last_update_ts}
PricingRule: parameters per city/ride_type (base, per_km, per_min, surge_multiplier)
Payment: {payment_id, trip_id, amount, status, method, refund_info}
Telemetry: location updates stream, events for analytics.
Indices / Partitioning
Drivers partitioned by geo-shard (tile/geohash) + city; RideRequests written to per-city queues for matching.
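A minimal sketch of deriving a driver's geo-shard key, assuming fixed-size lat/lon tiles rather than a full geohash library; the tile size and key format are illustrative.

```python
import math

TILE_DEG = 0.01  # assumed tile size; ~1.1 km of latitude, tunable per city

def geo_shard_key(city: str, lat: float, lon: float) -> str:
    """Partition key: city plus a fixed-size lat/lon tile."""
    tile_y = math.floor(lat / TILE_DEG)
    tile_x = math.floor(lon / TILE_DEG)
    return f"{city}:{tile_x}:{tile_y}"

# Two nearby drivers land in the same shard:
print(geo_shard_key("blr", 12.9716, 77.5946))  # blr:7759:1297
print(geo_shard_key("blr", 12.9721, 77.5951))  # blr:7759:1297
```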
25 – 40 min — High-level architecture & data flow
Key components
Location Ingest & Store
Drivers send frequent location updates; system ingests into a low-latency stream. Store the latest location in an in-memory geo-index (Redis GEO or adaptive geospatial index) sharded by city. TTL on updates for staleness.
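A minimal sketch of the ingest and query path, assuming Redis >= 6.2 (GEOSEARCH) with a redis-py 4.x client; key names are illustrative, and because Redis GEO sets have no per-member TTL, staleness is tracked in a companion hash here.

```python
import time
import redis  # assumed client: redis-py >= 4, Redis server >= 6.2

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def ingest_location(city: str, driver_id: str, lat: float, lon: float) -> None:
    """Keep only the latest position per driver, plus a last-seen timestamp."""
    r.geoadd(f"geo:{city}", (lon, lat, driver_id))   # flat (lon, lat, member) triple
    r.hset(f"last_seen:{city}", driver_id, int(time.time()))

def nearby_drivers(city: str, lat: float, lon: float,
                   radius_km: float = 3.0, max_age_s: int = 15, limit: int = 20):
    """Return nearby candidate driver_ids whose last ping is fresh
    (emulates a TTL, since GEO sets have no per-member expiry)."""
    candidates = r.geosearch(f"geo:{city}", longitude=lon, latitude=lat,
                             radius=radius_km, unit="km", sort="ASC", count=limit)
    now = int(time.time())
    return [d for d in candidates
            if now - int(r.hget(f"last_seen:{city}", d) or 0) <= max_age_s]
```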
Matching / Dispatch Service
Receives a ride request, queries the geo-index for nearby candidate drivers (filtered by status, rating, acceptance rate, preferences), scores the candidates (ETA, driver acceptance probability, fairness; see the scoring sketch below), and sends match offers to drivers sequentially or in batches.
Must guarantee atomic assignment semantics: use per-driver optimistic locking or a distributed lock per driver to prevent double-assign. Trip state persisted in Trip Service with strong consistency.
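A minimal sketch of the candidate-scoring step; the feature names and weights are assumptions (in practice they are tuned per market or learned, as covered in the deep dive).

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    driver_id: str
    eta_to_pickup_s: float   # from the routing/ETA service
    acceptance_prob: float   # 0..1, from an ML model or historical rate
    utilization: float       # 0..1, share of recent time spent on trips
    idle_since_s: float      # seconds since last assignment (fairness signal)

# Illustrative weights; tuned per market or learned in practice.
W_ETA, W_ACCEPT, W_UTIL, W_FAIR = 1.0, 300.0, 120.0, 0.05

def score(c: Candidate) -> float:
    """Lower is better: prefer short pickup ETA, likely acceptance, and idle drivers."""
    return (W_ETA * c.eta_to_pickup_s
            - W_ACCEPT * c.acceptance_prob
            + W_UTIL * c.utilization
            - W_FAIR * c.idle_since_s)

def rank(candidates: list[Candidate], top_k: int = 3) -> list[Candidate]:
    """Pick the top-k candidates to receive offers (sequentially or in a batch)."""
    return sorted(candidates, key=score)[:top_k]
```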
Routing & ETA
Uses the map service + real-time traffic to compute pickup ETA & trip ETA; returned to the rider & used by matching. Precompute tile-based travel times for speedups.
Pricing Service
Computes estimate using pricing rules + surge multiplier (dynamic based on demand/supply). Surge decisions from ML/policy service; thresholds configurable by ops.
Trip Service (Authority)
Manages trip state transitions reliably: requested → assigned → driver_accepted → driver_arrived → in_progress → completed → paid. Use transactional DB for trip state and events.
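A minimal sketch of guarding those transitions before the transactional write; the state names come from the lifecycle above, while the cancellation edges are an assumed policy.

```python
# Allowed transitions for the trip lifecycle above; cancellation is modelled as
# reachable from the pre-trip states (an assumption, policy-dependent).
ALLOWED = {
    "requested":       {"assigned", "cancelled"},
    "assigned":        {"driver_accepted", "cancelled"},
    "driver_accepted": {"driver_arrived", "cancelled"},
    "driver_arrived":  {"in_progress", "cancelled"},
    "in_progress":     {"completed"},
    "completed":       {"paid"},
    "paid":            set(),
    "cancelled":       set(),
}

def transition(current: str, target: str) -> str:
    """Reject illegal jumps before persisting the new state inside a DB transaction."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal trip transition {current} -> {target}")
    return target

# transition("requested", "assigned")      # ok
# transition("completed", "in_progress")   # raises ValueError
```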
Notification & Messaging
Push notifications to driver and rider; fallback SMS for critical messages.
Payments
Authorize/charge rider, pay out drivers (daily or instant), support promotions and split fares.
Analytics & ML
Demand prediction, ETA model, acceptance probability model, surge prediction — offline and online features consumed by matching/pricing.
Data flow (request flow)
Rider calls /rides/request. The frontend persists the request and writes an event to the event bus.
Matching service queries geo-index for nearby candidates. Score & rank candidates.
Send match to driver app(s) (push / socket). Driver accepts — Trip Service atomically assigns driver. If driver times out/declines, try next candidate.
On assignment, compute final ETA & price; orchestration triggers driver navigation, rider receives driver tracking.
Trip begins/ends; trip events are persisted. At trip end, calculate the final fare, charge the rider via the Payment Service, and write ledger entries for driver payouts.
40 – 50 min — Deep dive: matching, routing, surge, pooling, failure handling
Matching & Dispatch algorithms
Candidate selection: query geo-index by radius (incremental radius expansion) or by precomputed nearest neighbors (k-NN). Filter by driver availability, vehicle type, driver preferences, compliance (e.g., background checks).
Scoring: multi-factor score = w1 * ETA_to_pickup + w2 * acceptance_prob + w3 * driver_utilization + w4 * fairness (e.g., time since last assignment). Use learned weights via ML or heuristics.
Sequential vs batch offers:
Sequential: offer to top-k drivers one by one with short timeout (fast, proven).
Parallel (auctioning): send to multiple drivers and accept first; lower latency but higher chance of wasted offers; policies choose approach by region.
Guarantees & concurrency: use distributed compare-and-swap on driver state (or a small lock per driver shard). Once a driver accepts, the Trip Service must atomically set assigned_driver_id (DB transaction) and acknowledge; all other offers are invalidated.
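A minimal sketch of the atomic assignment, assuming a relational trip store; the table and column names are illustrative, and the conditional UPDATE acts as the compare-and-swap so only the first accept can win.

```python
# `conn` is assumed to be a DB-API connection (e.g. psycopg2) to the trip store;
# table and column names are illustrative.
ASSIGN_SQL = """
UPDATE rides
SET    assigned_driver_id = %(driver_id)s,
       status = 'driver_accepted'
WHERE  ride_id = %(ride_id)s
  AND  assigned_driver_id IS NULL   -- the compare-and-swap: only the first accept wins
"""

def try_assign(conn, ride_id: str, driver_id: str) -> bool:
    """Return True if this driver won the ride; other outstanding offers are then invalidated."""
    with conn:                        # one transaction: commit on success, rollback on error
        with conn.cursor() as cur:
            cur.execute(ASSIGN_SQL, {"ride_id": ride_id, "driver_id": driver_id})
            return cur.rowcount == 1  # 0 rows means another driver already holds the ride
```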
ETA & routing
ETA model: combination of historical traffic + live traffic. For pickup ETA, compute the route from the driver's location to the pickup considering road speeds. Use precomputed travel-time grids (tiles) for a fast approximation, then refine with exact routing for the final candidate (see the sketch below).
Routing: use external routing engine (GraphHopper/OSRM or commercial maps) with caching and batching. For repeated queries, re-use partial routes.
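A minimal sketch of the tile-grid approximation; the tile size, per-tile speeds, and straight-line distance shortcut are all assumptions, with exact routing reserved for the final candidate.

```python
import math

TILE_DEG = 0.01  # same assumed tile granularity as the geo-shard sketch

# Hypothetical precomputed average speeds (km/h) per (lat_tile, lon_tile),
# refreshed from live traffic.
TILE_SPEED_KMH = {(1297, 7759): 22.0, (1298, 7759): 18.5}
DEFAULT_SPEED_KMH = 20.0

def haversine_km(lat1, lon1, lat2, lon2) -> float:
    """Great-circle distance; a stand-in for road distance in the approximation."""
    rlat1, rlat2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = rlat2 - rlat1, math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(rlat1) * math.cos(rlat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def approx_pickup_eta_s(d_lat, d_lon, p_lat, p_lon) -> float:
    """Approximate driver -> pickup ETA using the origin tile's average speed."""
    tile = (math.floor(d_lat / TILE_DEG), math.floor(d_lon / TILE_DEG))
    speed = TILE_SPEED_KMH.get(tile, DEFAULT_SPEED_KMH)
    return haversine_km(d_lat, d_lon, p_lat, p_lon) / speed * 3600
```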
Pricing & surge
Price calculation: base fare + time_rate * duration + distance_rate * distance + service_fee. Round and apply promotions (see the fare sketch after this list).
Surge model: detect imbalance using demand/supply metrics in geo-shards; compute multiplier. Surge decision pipeline: real-time signals → ML model → ops throttle → apply surge. Communicate to rider before confirm.
Cancellation & refunds: charge cancellation fee based on policy; refunds processed via Payment Service.
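A minimal sketch of the fare formula above; the parameter values would come from PricingRule per city/ride type, and the numbers here (including applying surge only to the base + variable components) are assumptions.

```python
from decimal import Decimal, ROUND_HALF_UP

def compute_fare(distance_km: float, duration_min: float,
                 base: Decimal, per_km: Decimal, per_min: Decimal,
                 service_fee: Decimal, surge: Decimal = Decimal("1.0"),
                 promo_discount: Decimal = Decimal("0")) -> Decimal:
    """base + time + distance components, surged, minus promotions, rounded to 2 decimals.
    Applying surge only to base + variable parts is an assumption."""
    variable = per_km * Decimal(str(distance_km)) + per_min * Decimal(str(duration_min))
    fare = (base + variable) * surge + service_fee - promo_discount
    return max(fare, Decimal("0")).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Illustrative numbers: 8.4 km, 25 min, 1.4x surge.
print(compute_fare(8.4, 25, base=Decimal("2.50"), per_km=Decimal("1.10"),
                   per_min=Decimal("0.20"), service_fee=Decimal("1.00"),
                   surge=Decimal("1.4")))   # -> 24.44
```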
Pooling (shared rides)
Pool matching is a combinatorial problem: match multiple riders with overlapping routes. Use heuristic greedy insertion into the driver's route, maintaining ETA constraints and maximum detour/time windows. Pre-filter candidates by geospatial clustering and constraint satisfaction.
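A minimal sketch of greedy insertion under a detour cap; the travel_time function and the detour limit are assumptions, and a real planner would also enforce per-rider time windows and vehicle capacity.

```python
def route_time(stops, travel_time) -> float:
    """Total travel time along an ordered list of stops, using a pluggable travel_time(a, b)."""
    return sum(travel_time(a, b) for a, b in zip(stops, stops[1:]))

def best_insertion(stops, pickup, dropoff, travel_time, max_detour_s=420):
    """Try every insertion that keeps pickup before dropoff; return the cheapest ordering
    whose added detour stays within max_detour_s, or None if no feasible insertion exists."""
    base = route_time(stops, travel_time)
    best, best_extra = None, max_detour_s
    for i in range(1, len(stops) + 1):        # pickup insertion slot
        for j in range(i, len(stops) + 1):    # dropoff slot, at or after the pickup
            cand = stops[:i] + [pickup] + stops[i:j] + [dropoff] + stops[j:]
            extra = route_time(cand, travel_time) - base
            if extra <= best_extra:
                best, best_extra = cand, extra
    return best

# Toy example: straight-line seconds between (x, y) points as travel_time.
tt = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5 * 60
print(best_insertion([(0, 0), (10, 0)], (2, 1), (6, 1), tt))  # both stops slot in mid-route
```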
Failure handling & edge cases
Driver offline during offer: consider driver status stale if not updated; fallback to next candidate.
Network partition: if Trip Service loses connectivity with driver shard, fall back to cached best-effort (but avoid unsafe assignments). Use leader election & heartbeat.
Double-assign prevention: atomic DB transaction on assignment; use optimistic locking with retry.
Stale location updates: if driver location older than TTL, request fresh ping or exclude.
Fraud: detect spoofed GPS, anomalous cancellations; blacklist or escalate.
50 – 55 min — Back-of-the-envelope capacity & sizing (example city-scale)
Assumptions (city): 1M daily users, 200k rides/day, peak 5k req/min (~83 req/sec), 100k active drivers sending location every 5s.
Location updates
100k drivers × (1 update / 5s) = 20k updates/sec. If each update ~200 bytes → 4 MB/s inbound.
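A minimal helper reproducing the same back-of-the-envelope math from the assumptions above; the per-shard update budget is an assumed figure.

```python
def location_capacity(drivers: int, interval_s: float, bytes_per_update: int,
                      updates_per_shard_s: int = 5_000):
    """Updates/sec, inbound MB/s, and an assumed shard count for the geo-index."""
    ups = drivers / interval_s
    mb_s = ups * bytes_per_update / 1e6
    shards = -(-ups // updates_per_shard_s)   # ceiling division
    return ups, mb_s, int(shards)

print(location_capacity(100_000, 5, 200))   # -> (20000.0, 4.0, 4)
```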
Ride requests
Peak 83 req/sec is trivial for API servers, but the global load is the sum across many such cities.
Matching & geo-index
Geo-index (Redis GEO or custom) must handle 20k updates/sec + query throughput (matching queries) with low latency. Partition by tile/city into many Redis instances/shards.
Routing
Routing scale: if compute-intensive, use caching of route segments and approximate ETA via precomputed travel time grid; do exact routing only after assignment.
Event bus (Kafka)
Ingest rate (events from drivers, rides, trips) of roughly 30k events/sec; provision sufficient partitions & retention.
Serving & DB
Trip service: transactional DB (Postgres / NewSQL) with sharding by city or hashed ride_id; ensure trip writes < 500 req/sec per shard to keep latency low.
Scaling
Scale horizontally by city and by geo-shard. For global scale (many cities), deploy clusters per region with central analytics.
55 – 58 min — Operations, security & monitoring
Observability
Metrics: match latency, offers per ride, acceptance rates, driver utilization, ETA accuracy, cancellations.
Traces: end-to-end trace for request → match → assignment → start → end.
Alerts: high match time, surge anomalies, spike in cancellations, ETA deviation.
Security & privacy
Auth & tokens: short-lived tokens for driver apps; mutual auth for critical services.
PII management: minimal storage of location history, encryption at rest, GDPR compliance (right to delete).
Anti-spoofing: device attestation, anomaly detection for GPS jumps (see the sketch after this list).
Payment & fraud: secure PCI-compliant payment processing, fraud detection pipeline.
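A minimal sketch of the GPS-jump check mentioned above; the plausible-speed ceiling is an assumed value, and production systems combine this with device attestation and richer anomaly models.

```python
import math

MAX_PLAUSIBLE_KMH = 150.0  # assumed ceiling for urban driving; tune per market

def haversine_km(lat1, lon1, lat2, lon2) -> float:
    rlat1, rlat2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = rlat2 - rlat1, math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(rlat1) * math.cos(rlat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def looks_spoofed(prev_ping, curr_ping) -> bool:
    """Flag a pair of pings if the implied speed is physically implausible."""
    (lat1, lon1, t1), (lat2, lon2, t2) = prev_ping, curr_ping
    dt_h = max(t2 - t1, 1) / 3600.0
    return haversine_km(lat1, lon1, lat2, lon2) / dt_h > MAX_PLAUSIBLE_KMH
```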
Operational practices
Dark launches for surge logic, canary deployments, chaos testing for matching failover, regular reconciliation between trip state and driver payouts.
58 – 60 min — Trade-offs, evolution & summary (wrap-up)
Key trade-offs
Latency vs accuracy: fast approximate ETA/navigation vs slower exact routing. Use approximate then refine.
Sequential vs parallel offers: sequential reduces driver spamming & improves acceptance fairness; parallel reduces match time at cost of wasted offers. Choose per-market.
Consistency vs availability: trip state needs strong consistency to avoid double assignment; other data (analytics) can be eventually consistent.
Centralized vs distributed matching: per-city distributed matching scales better and keeps local latencies low; central control needed for global features.
Evolution path
MVP: per-city cluster, simple nearest-driver matching, routing from provider, basic pricing and payment.
Add: ML models for ETA, acceptance probability, demand prediction, surge orchestration, pooling.
Scale: multi-region deployment, cross-city pooling, dynamic routing optimization, driver incentives and real-time marketplace optimization.
One-sentence summary
Build a geo-sharded, event-driven platform: ingest real-time driver locations into an in-memory geo-index, run a low-latency matching/dispatch service that atomically assigns drivers via a strongly-consistent Trip Service, compute ETA/pricing via routing & ML models, and run analytics/fraud pipelines asynchronously — all designed to scale by city and recover deterministically from failures.