Chat System
Difficulty: Hard Topics: WebSockets, Long Polling, Real-time Delivery, Consistency, Offline Support Time: 60 minutes Companies: Meta (WhatsApp/Messenger), Telegram, Discord, Slack
Problem Statement
Design a one-on-one and group chat application like WhatsApp.
One-on-one chat: User A sends message to User B.
Group chat: User A sends to Group (B, C, D).
Status: Sent, Delivered, Read receipts.
Online/Offline Status.
Scale:
2 Billion Users.
100 Billion messages/day.
Low Latency (Real-time).
Requirements
Functional
1:1 Chat & Group Chat (Max 256 members).
Message Acknowledgment (Sent, Delivered, Read).
Last Seen / Online Status.
Media Support (Images/Video) - (Design separate Asset Service).
Persistent History: Multi-device login support.
Non-Functional
Low Latency: < 100ms delivery.
Consistency: Order of messages must be preserved (e.g., "Hello" before "How are you?").
Availability: High.
Security: End-to-End Encryption (E2EE) (Optional advanced topic).
Connection Management (The "Real-time" Magic)
Protocols
HTTP (REST): Good for Login, Profile Update, History Fetch. Bad for receiving messages (High latency/overhead).
Long Polling: Client holds connection open. Server responds when data arrives. Better but heavy on server resources.
WebSockets (Selected): Bi-directional persistent connection. Server pushes messages instantly. ideal for Chat.
Dealing with 2 Billion Connections
A single server can manage ~50k concurrent WebSockets (Network IO bound).
Need Connection Gateway layer (Horizontally Scaled).
Stateful: Gateway knows "User A is connected to Box 42".
Uses Redis / Zookeeper to map
UserID -> GatewayMachineID.
Architecture
Message Flow (1:1 Chat)
User A sends message via WebSocket to Gateway 1.
Message Service persists message to Cassandra (Status:
SENT).Message Service queries Redis: "Where is User B connected?"
If User B Online (Gateway 2):
Forward message to Gateway 2.
Gateway 2 pushes to User B via WebSocket.
User B sends ACK (
DELIVERED) -> Gateway 2 -> A.
If User B Offline:
Push Notification Service (FCM/APNS) triggers "You have a new message".
Message sits in DB. When B connects, fetch unread messages.
Data Schema (Cassandra/NoSQL)
Why NoSQL?
Write-heavy (100B/day).
Horizontal scaling easier.
Consistency:
local_quorumsufficient.
Message Table
Group Chat Complexity
Scenario: Group of 200 people.
User A sends message.
Server needs to deliver to 199 users.
Approach:
Message Service fetches Group Members from SQL DB.
Fan-out:
For small groups (<100): Loop and push to each member's Gateway.
For large groups/channels (>5k): Write to a Message Queue (Kafka), workers process fan-out in batches.
Optimization: Don't confirm "Delivered" until X% receive, or just show "Sent". "Read by All" is expensive O(N).
Sequencing & Consistency
Problem: User A sends Msg1 then Msg2. B receives Msg2 then Msg1. Solution:
Sequence Numbers: Assign incremental ID (Sequence ID) per Chat ID.
Client Side Re-ordering:
Client receives
Seq 5. Last seenSeq 3.Client knows
Seq 4is missing.Buffer
Seq 5, requestSeq 4, then display in order.
Timestamps: Unreliable due to clock skew. Use Snowflake IDs or Logical Clocks.
Interview Q&A
Q: "How to handle 'Last Seen'?"
A: "Heartbeat mechanism. Client pings server every 30s. Store in Redis
User:LastActive. If >1 min, show 'Last seen at X'. Don't write to Hard DB every ping, only on session close."
Q: "Multi-device Synching?"
A: "Treat each device as a separate 'User' entity in routing.
UserA_Mobile,UserA_Desktop. Fan-out message to ALL active sessions of User A."
Q: "End-to-End Encryption?"
A: "Signal Protocol. Keys generated on client. Server only stores encrypted blob. Server cannot read messages. Only device with Private Key can decrypt."
Last updated