#27 Chat services
A time-boxed, 1-hour interview answer for designing a global chat service (like WhatsApp or Facebook Messenger). It covers functional and non-functional requirements, APIs and data model, architecture, a deep dive, and trade-offs.
0 – 5 min — Problem recap, scope & assumptions
Goal: Design a scalable, real-time chat service that allows users to send messages individually or in groups, supports message delivery and persistence, and works globally across devices.
Scope for interview:
One-on-one chats and group chats.
Real-time message delivery.
Message persistence and history.
Presence status (online/offline).
Push notifications.
End-to-end encryption (optional).
Assumptions:
Hundreds of millions of users, billions of messages/day.
Peak concurrent connections ~10–50M.
Message delivery latency <200 ms for online users.
Mobile and web clients.
Message order preserved per conversation.
5 – 15 min — Functional & Non-Functional Requirements
Functional Requirements
Must
Send/receive messages: One-on-one and group chat.
Message persistence: Store messages for later retrieval.
Message ordering: Preserve order per chat.
Read receipts: Track delivered and read messages.
Presence: Show online/offline status.
Notifications: Push notifications for offline users.
Should
Support media messages (images, videos, files).
Support message editing and deletion.
Typing indicators.
Nice-to-have
End-to-end encryption.
Message search.
Multi-device sync.
Reactions to messages.
Non-Functional Requirements
Latency: <200 ms for message delivery.
Availability: 99.99% uptime globally.
Scalability: Handle millions of concurrent users and high message throughput.
Durability: Persist all messages reliably.
Consistency: Strong ordering per conversation; eventual consistency acceptable across devices.
Monitoring: Track message delivery latency, failures, and server health.
15 – 25 min — API / Data Model
APIs
Data Models
User
Chat
Message
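The three entities above are listed without fields, so the following is only a minimal sketch of what they might hold. Every field name, the MessageStatus enum, and the history helper are assumptions made for illustration, not part of the original design. The per-chat seq field is one possible way to meet the "order preserved per conversation" requirement.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class MessageStatus(Enum):
    # Hypothetical states backing delivered/read receipts.
    SENT = "sent"
    DELIVERED = "delivered"
    READ = "read"


@dataclass
class User:
    user_id: str
    display_name: str
    last_seen_ms: Optional[int] = None  # assumed: None while the user is online


@dataclass
class Chat:
    chat_id: str
    member_ids: List[str]
    is_group: bool = False


@dataclass
class Message:
    chat_id: str
    seq: int  # assumed per-chat sequence number; defines order within a chat
    sender_id: str
    body: str
    sent_at_ms: int
    status: MessageStatus = MessageStatus.SENT


def history(messages: List[Message], chat_id: str) -> List[Message]:
    """Return one chat's messages ordered by their sequence number."""
    return sorted((m for m in messages if m.chat_id == chat_id), key=lambda m: m.seq)
```

In a wide-column store such as Cassandra, chat_id would be a natural partition key and seq a clustering column, so that one chat's history is a single ordered range read; that mapping is likewise an assumption.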
25 – 40 min — High-level architecture & data flow
Components
Chat Service: Handles send/receive API calls, message ordering, presence.
Message Queue / Pub-Sub: RabbitMQ/Kafka for message delivery.
Database / Storage: Persistent storage for messages (NoSQL like Cassandra for high write throughput).
Cache Layer: Redis for recent messages, online presence, and typing indicators.
Push Notification Service: Notify offline users via FCM/APNs.
Data Flow
User sends message → Chat Service → Message Queue → Delivery to online users via WebSocket → Persist in DB.
Offline users → Chat Service triggers a push notification (FCM/APNs) → pending messages are delivered when the device reconnects.
Message history requests → Fetch from cache or DB.
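The send path above can be sketched in memory. ChatRouter, its method names, and the Payload type are hypothetical; the callbacks stand in for WebSocket connections, push_outbox for FCM/APNs, and store for the database. The queue hop is omitted to keep the sketch short.

```python
from typing import Callable, Dict, List, Tuple

Payload = Dict[str, str]


class ChatRouter:
    """In-memory stand-in for the Chat Service's routing step (hypothetical)."""

    def __init__(self) -> None:
        # user_id -> callback that writes to that user's open connection.
        self.connections: Dict[str, Callable[[Payload], None]] = {}
        # (user_id, message_id) pairs that would be handed to FCM/APNs.
        self.push_outbox: List[Tuple[str, str]] = []
        # Stand-in for the persistent message store.
        self.store: List[Payload] = []

    def connect(self, user_id: str, send: Callable[[Payload], None]) -> None:
        self.connections[user_id] = send

    def disconnect(self, user_id: str) -> None:
        self.connections.pop(user_id, None)

    def handle_send(self, message: Payload, recipients: List[str]) -> None:
        for user_id in recipients:
            send = self.connections.get(user_id)
            if send is not None:
                send(message)  # online: deliver over the open connection
            else:
                self.push_outbox.append((user_id, message["id"]))  # offline: notify
        self.store.append(message)  # persist, following the order listed above
```

The sketch keeps the deliver-then-persist order of the flow above. A variant that persists before delivering trades a little latency for the guarantee that a crash between the two steps cannot lose a message.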
40 – 50 min — Deep dive — message delivery, scaling, concurrency
Message Delivery
WebSocket connections for real-time delivery.
Fan-out per chat: For group messages, send to all participants.
Delivery acknowledgment: Track delivery per recipient; a message stays pending for a recipient until that recipient's client acknowledges it.
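One way to picture group fan-out with per-recipient acknowledgments is a small tracker that remembers who still owes an ack. DeliveryTracker and its methods are hypothetical names, and a real system would keep this state in the queue or a durable store rather than in process memory.

```python
from typing import Dict, Iterable, Set


class DeliveryTracker:
    """Tracks which recipients have not yet acknowledged each message."""

    def __init__(self) -> None:
        self._pending: Dict[str, Set[str]] = {}

    def fan_out(self, message_id: str, sender_id: str, members: Iterable[str]) -> Set[str]:
        """Register a message and return the recipients it must reach."""
        recipients = {m for m in members if m != sender_id}
        self._pending[message_id] = set(recipients)
        return recipients

    def ack(self, message_id: str, user_id: str) -> bool:
        """Record an ack; return True when no recipient is still outstanding."""
        pending = self._pending.get(message_id)
        if pending is None:
            return True  # unknown or already fully acknowledged message
        pending.discard(user_id)
        if not pending:
            del self._pending[message_id]
            return True
        return False

    def pending(self, message_id: str) -> Set[str]:
        """Return a copy of the recipients that still need delivery."""
        return set(self._pending.get(message_id, set()))
```

Recipients left in pending(message_id) after a timeout are the ones a retry or push notification would target.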
Scaling
Horizontal sharding: Partition users by user_id across multiple chat service instances.
Partitioned queues: Partition the queue by conversation ID so that all messages of one conversation land in the same partition, which preserves their order.
Caching: Recent messages, online presence, and typing indicators in Redis.
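The partitioning idea can be sketched as a stable hash of the conversation ID. The function name and the choice of SHA-256 are assumptions; any hash that is stable across processes would do. Python's built-in hash() is avoided because it is salted per process for strings.

```python
import hashlib


def partition_for(chat_id: str, num_partitions: int) -> int:
    """Map a conversation to one partition so its messages stay in order."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(chat_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a partition.
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Because the mapping depends on num_partitions, changing the partition count moves most conversations to a different partition; consistent hashing is a common way to limit that churn.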
Fault Tolerance
Persistent queues for reliable message delivery.
Database replication for durability.
Retry mechanism for undelivered messages.
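The retry mechanism is not specified further, so the following is a minimal sketch assuming exponential backoff with a cap. The function name, the default limits, and the injectable sleep parameter are all illustrative choices.

```python
import time
from typing import Callable


def deliver_with_retry(
    send: Callable[[], bool],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call send() until it returns True or the attempts are exhausted."""
    for attempt in range(max_attempts):
        if send():
            return True
        if attempt < max_attempts - 1:
            # Double the wait after each failure, capped at max_delay_s.
            sleep(min(max_delay_s, base_delay_s * 2 ** attempt))
    return False
```

A message that still fails after the last attempt would typically stay in the persistent queue or move to a dead-letter queue. Adding random jitter to the delay helps avoid synchronized retries after an outage.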
50 – 55 min — Back-of-the-envelope calculations
Assumptions
100M daily active users, 10B messages/day → ~115k messages/sec.
Each message ~1 KB → ~10 TB/day storage (10B × 1 KB), before replication.
Peak concurrency 10M → WebSocket connections need horizontal scaling.
Caching the 100 most recent messages per chat → manageable in a Redis cluster.
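The arithmetic can be checked in a few lines. The sketch uses the figures above (10B messages/day, ~1 KB each, 10M peak connections). At 1 KB each, 10B messages come to about 10 TB/day. The 100,000 connections-per-server figure and the constant names are assumptions for illustration only, not numbers from this design.

```python
import math

MESSAGES_PER_DAY = 10_000_000_000
AVG_MESSAGE_BYTES = 1_000          # ~1 KB per message
SECONDS_PER_DAY = 86_400
PEAK_CONNECTIONS = 10_000_000
CONNECTIONS_PER_SERVER = 100_000   # assumed capacity of one WebSocket server

# Average write rate; peak traffic will be some multiple of this.
avg_messages_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY

# Raw message storage per day and per year, before replication.
storage_tb_per_day = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES / 1e12
storage_pb_per_year = storage_tb_per_day * 365 / 1_000

# Number of connection servers needed to hold the peak concurrent sockets.
websocket_servers = math.ceil(PEAK_CONNECTIONS / CONNECTIONS_PER_SERVER)

print(f"{avg_messages_per_sec:,.0f} msg/s average")
print(f"{storage_tb_per_day:.0f} TB/day, {storage_pb_per_year:.2f} PB/year")
print(f"{websocket_servers} WebSocket servers")
```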
Storage
NoSQL DB like Cassandra for horizontal scalability.
Cache recent messages and online status in Redis.
Message queues for reliable delivery.
55 – 58 min — Monitoring & ops
Monitoring
Message delivery latency and failure rates.
Queue lag and processing rate.
Online user count and connection health.
Database and cache performance metrics.
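Delivery latency is usually tracked as a percentile rather than an average, because a few slow deliveries are what users notice. A minimal sketch using the nearest-rank method follows; the function name and the sample values are made up for illustration, and the 200 ms threshold comes from the latency requirement stated earlier.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up delivery latencies in milliseconds.
latencies_ms = [120, 80, 250, 95, 110]
p99_ms = percentile(latencies_ms, 99)
slo_breached = p99_ms > 200  # compare against the <200 ms delivery target
```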
Operational concerns
Handling peak traffic (holidays, events).
Device reconnection and message sync.
Multi-region replication for global availability.
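Device reconnection and message sync can be sketched as "send me everything after the last sequence number I saw". ChatLog and its methods are hypothetical, and the sketch assumes each chat assigns increasing per-chat sequence numbers, which the original does not spell out.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple


class ChatLog:
    """In-memory, append-only message log keyed by chat (illustrative only)."""

    def __init__(self) -> None:
        self._seqs: Dict[str, List[int]] = {}
        self._bodies: Dict[str, List[str]] = {}

    def append(self, chat_id: str, body: str) -> int:
        """Store a message and return its per-chat sequence number."""
        seqs = self._seqs.setdefault(chat_id, [])
        bodies = self._bodies.setdefault(chat_id, [])
        seq = len(seqs) + 1
        seqs.append(seq)
        bodies.append(body)
        return seq

    def since(self, chat_id: str, last_seq: int, limit: int = 100) -> List[Tuple[int, str]]:
        """Return up to `limit` messages with a sequence number above last_seq."""
        seqs = self._seqs.get(chat_id, [])
        bodies = self._bodies.get(chat_id, [])
        start = bisect_right(seqs, last_seq)  # first index with seq > last_seq
        end = start + limit
        return list(zip(seqs[start:end], bodies[start:end]))
```

A reconnecting client would call since(chat_id, last_seq) repeatedly, advancing last_seq to the highest number received, until an empty page comes back.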
58 – 60 min — Trade-offs, evolution & summary
Trade-offs
WebSocket vs polling: WebSocket gives real-time delivery over a persistent connection; polling is simpler but adds latency and wasted requests.
NoSQL vs SQL: NoSQL handles high write throughput and scales horizontally; SQL supports complex queries but is typically slower at this write volume.
Push vs pull notifications: Push reduces delay but increases complexity; pull is simpler but adds latency.
Evolution
MVP: One-on-one chat with message persistence, basic delivery.
Phase 2: Group chats, push notifications, online presence.
Phase 3: Multi-device sync, media messages, reactions, encryption, global scaling.
Summary
System enables real-time messaging for one-on-one and group chats.
Uses WebSocket, message queues, NoSQL DB, and cache for scalable, low-latency, reliable delivery.
Horizontally scalable, fault-tolerant, monitorable, and capable of evolving to global usage with multi-device support.