# 1. Proximity Service
Below is a complete, time-boxed 1-hour interview answer for Designing a Proximity Service (think “find all drivers/shops/friends within X km of a user”).
It’s organized so you can speak smoothly for ~60 minutes while touching every major dimension—including Functional and Non-Functional Requirements, API design, architecture, scaling, and trade-offs.
## 0–5 min ➜ Problem Understanding & Assumptions

Goal: confirm scope and key constraints before you draw anything.

Use case: clients send their current GPS location; the service returns nearby entities (drivers, stores, friends) within a given radius in real time.

Scale assumptions (for capacity planning):

- 10 M DAU
- 50 M location updates/day (~580 writes/sec average, peak 3–5×)
- Peak 50 k "nearby" queries/sec

Constraints:

- Global coverage
- Query latency target: P95 < 200 ms
- Location accuracy: ~10 m
- 99.9–99.99 % availability
## 5–15 min ➜ Requirements

### Functional Requirements

Core must-haves:

1. Location Update – entities send periodic updates (entity_id, lat, lon, timestamp); handling must be idempotent and tolerate out-of-order events (see the sketch after this list).
2. Nearby Query – given a point and a radius, return entities sorted by distance, with optional filters (type, status).
3. Entity Detail – fetch the latest known location and metadata for a specific entity.
4. Stale Data Handling – mark entities offline if no update arrives within a configurable TTL.
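A minimal sketch of the update rule, assuming redis-py, one hash per entity, and last-write-wins on the client timestamp (key and function names are illustrative):

```python
import redis

r = redis.Redis()  # assumed local instance; configure host/port/auth in practice

def apply_location_update(entity_id: str, lat: float, lon: float, ts: int) -> bool:
    """Upsert one location update; returns False for duplicates/late arrivals."""
    key = f"entity:{entity_id}"
    last_ts = r.hget(key, "ts")
    # Idempotency + out-of-order handling: last-write-wins on the timestamp.
    # (In production, wrap this check-and-set in a Lua script to make it atomic.)
    if last_ts is not None and int(last_ts) >= ts:
        return False
    pipe = r.pipeline()
    pipe.hset(key, mapping={"lat": lat, "lon": lon, "ts": ts})
    # Keep the geo index used by nearby queries in sync (GEOADD takes lon first).
    pipe.geoadd("geo:entities", (lon, lat, entity_id))
    pipe.execute()
    return True
```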
Should / nice-to-have:

- Real-time subscriptions (WebSocket/SSE) for continuous updates.
- Geofencing alerts (enter/leave a region).
- Location history storage with a TTL (e.g., 30 days) for analytics.
- Admin APIs for blacklisting, throttling, and data-retention management.
### Non-Functional Requirements

- Performance & Latency – P95 query < 200 ms, P99 < 500 ms; update propagation visible in queries within ≤ 1 s.
- Scalability – reads: sustain 50 k QPS; writes: 3 k QPS peak; horizontal scale for sudden bursts (e.g., events, concerts).
- Availability / Reliability – 99.9 % SLA; multi-AZ plus multi-region failover; RTO < 15 min, RPO < 1 min.
- Consistency – eventual for nearby queries; strong for individual entity detail if required.
- Security & Privacy – OAuth2/JWT auth, TLS everywhere; GDPR/CCPA compliance, including the "right to be forgotten"; rate limiting and anomaly detection to prevent location spoofing.
- Observability & Ops – metrics (QPS, latencies, cache hit ratio, stale-entity rate), distributed tracing, and structured logs.
- Cost & Maintainability – prefer managed services; target a 90 %+ cache hit ratio to control DB costs.
## 15–25 min ➜ API Design (External Contract)

| Method | Endpoint | Request | Response | Notes |
| --- | --- | --- | --- | --- |
| POST | /location/update | {entity_id, lat, lon, ts} | 200 OK | Idempotent |
| GET | /nearby | lat, lon, radius, type?, limit? | [ {id, lat, lon, distance, meta} ] | Pagination & filters |
| GET | /entity/{id} | — | {id, lat, lon, updated_at, meta} | Latest known location |

- Authentication: OAuth2/JWT.
- Rate limits: e.g., 100 req/min per user.
- Standard error codes and retry guidelines.
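As a concrete illustration, a /nearby call could look like the following (the host is hypothetical, and the radius is assumed to be in meters):

```python
import requests

resp = requests.get(
    "https://api.example.com/nearby",  # hypothetical host
    params={"lat": 37.7749, "lon": -122.4194, "radius": 2000, "type": "driver", "limit": 20},
    headers={"Authorization": "Bearer <jwt>"},  # OAuth2/JWT per the note above
    timeout=2,
)
resp.raise_for_status()
for e in resp.json():
    print(e["id"], f'{e["distance"]:.0f} m away')
```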
## 25–40 min ➜ High-Level Architecture

Key points:

- Write Path – app servers validate and enqueue updates → workers upsert into Redis GEO (hot set) and a durable store (PostGIS or Elasticsearch geo_point).
- Read Path – /nearby hits Redis GEOSEARCH (the replacement for the deprecated GEORADIUS; P95 < 50 ms), falling back to PostGIS for cold data; see the sketch after this list.
- Geo-Sharding – use geohash/H3 cells as partition keys for DB scaling.
- Region Strategy – multi-AZ replication first, eventually multi-region active/active.
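A minimal sketch of that read path, assuming redis-py and psycopg2; the key, table, and column names are illustrative:

```python
import redis
import psycopg2

r = redis.Redis()
pg = psycopg2.connect("dbname=proximity")  # illustrative DSN

def nearby(lat: float, lon: float, radius_m: float, limit: int = 20):
    # Hot path: Redis GEOSEARCH over the in-memory geo set.
    hits = r.geosearch(
        "geo:entities",
        longitude=lon, latitude=lat,
        radius=radius_m, unit="m",
        withcoord=True, withdist=True,
        count=limit, sort="ASC",
    )
    if hits:
        return [(member.decode(), dist) for member, dist, _coord in hits]
    # Cold fallback: PostGIS ST_DWithin on the durable store.
    with pg.cursor() as cur:
        cur.execute(
            """
            SELECT entity_id,
                   ST_Distance(location, ST_MakePoint(%s, %s)::geography) AS dist
            FROM entity_locations
            WHERE ST_DWithin(location, ST_MakePoint(%s, %s)::geography, %s)
            ORDER BY dist
            LIMIT %s
            """,
            (lon, lat, lon, lat, radius_m, limit),
        )
        return cur.fetchall()
```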
## 40–50 min ➜ Key Algorithms & Data Model

### Data Model
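A minimal sketch of the two stores, assuming the hybrid chosen below: a PostGIS table as the durable record and Redis keys as the hot set (all names are illustrative):

```python
# Durable store (PostGIS): one row per entity, latest location wins.
ENTITY_LOCATIONS_DDL = """
CREATE TABLE entity_locations (
    entity_id   TEXT PRIMARY KEY,
    location    GEOGRAPHY(POINT, 4326) NOT NULL,  -- WGS84 lon/lat
    updated_at  TIMESTAMPTZ NOT NULL,
    meta        JSONB
);
CREATE INDEX entity_locations_gix ON entity_locations USING GIST (location);
"""

# Hot set (Redis):
#   geo:entities   -> GEO sorted set, member = entity_id (serves nearby queries)
#   entity:{id}    -> hash {lat, lon, ts, meta} (serves entity detail)
#   geo:cell:{h3}  -> optional per-cell sets when sharding by H3/geohash cell
```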
### Algorithms / Flows

- Geo-indexing: H3 or geohash buckets the earth into ~0.6 km cells; update an entity's cell membership on every location update.
- Distance filtering: Redis GEOSEARCH or PostGIS ST_DWithin, refined with a Haversine distance check (sketched after this list).
- Stale-entity cleanup: a background job removes or flags entries older than the TTL.
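A minimal sketch of the cell bucketing and the Haversine check, assuming Uber's h3 Python bindings (the call shown is the h3 v4 API; it was geo_to_h3 in v3):

```python
import math
import h3  # assumed dependency: Uber's H3 bindings

EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def cell_for(lat: float, lon: float, resolution: int = 8) -> str:
    # H3 resolution 8 hexagons have ~0.46 km edges, close to the ~0.6 km
    # cell size mentioned above; tune the resolution to the target density.
    return h3.latlng_to_cell(lat, lon, resolution)
```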
## 50–55 min ➜ Trade-Offs & Alternatives

| Approach | Pros | Cons |
| --- | --- | --- |
| Redis GEO only | Ultra-low latency | Memory cost, weaker durability |
| PostGIS only | Rich spatial queries | Higher read latency |
| Elasticsearch geo_point | Text + geo combined | Operational complexity |

Chosen hybrid: Redis GEO for hot queries, PostGIS for durability and complex analytics.

Other considerations:

- Update frequency vs. mobile battery/network usage.
- Strong vs. eventual consistency: eventual is acceptable for proximity.
## 55–60 min ➜ Wrap-Up & Future Work

Future enhancements:

- Predictive caching: pre-load likely next cells based on velocity.
- Differential privacy or location "fuzzing" for sensitive users.
- ML-driven ranking (ETA, traffic conditions).

Risks & mitigations:

- Region outage → active/active replication.
- Sudden traffic spikes → auto-scale the app and cache tiers.

Evolution path:

- Phase 1: single-region MVP (Redis + PostGIS).
- Phase 2: multi-region global presence with cross-region replication.
## 🔑 Key Numbers for Quick BoE

- Writes: ~580/sec average (50 M/day); plan for 3 k/sec peak.
- Reads: 50 k/sec peak.
- Average query payload ~2 KB → ~800 Mbps peak outbound.
- Storage: 10 GB/day (200 B per update) → ~300 GB for 30 days (×3 replication ≈ 1 TB).
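These numbers are quick to reproduce; for example:

```python
# Back-of-envelope checks for the figures above.
updates_per_day = 50_000_000
print(updates_per_day / 86_400)            # ~578.7 writes/sec average

peak_qps, payload_bytes = 50_000, 2_000
print(peak_qps * payload_bytes * 8 / 1e6)  # 800.0 Mbps peak outbound

bytes_per_update = 200
daily_gb = updates_per_day * bytes_per_update / 1e9
print(daily_gb, daily_gb * 30, daily_gb * 30 * 3)  # 10 GB/day, 300 GB/30 d, ~0.9 TB replicated
```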
## ✅ Interview Takeaways

- Clear functional and non-functional requirements up front.
- APIs and a high-level architecture that match the scale and latency goals.
- Reasoned trade-offs and an evolution plan.

Delivering this sequence keeps you structured, shows capacity planning, and fills the hour without rushing or omitting critical design details.