
WhatsApp

Problem Statement

Design a real-time messaging application like WhatsApp that supports one-on-one and group chats, media sharing, end-to-end encryption, message delivery status, and voice/video calls.


Requirements

Functional Requirements

  1. One-on-one messaging (text, images, videos, documents)

  2. Group chats (up to 256 members)

  3. Message delivery status (sent, delivered, read)

  4. End-to-end encryption (E2EE)

  5. Voice and video calls (peer-to-peer)

  6. Last seen and online status

  7. Message persistence (offline delivery)

  8. Push notifications

Non-Functional Requirements

  1. Low latency: < 100ms message delivery

  2. High availability: 99.99% uptime

  3. Scalability: 2 billion users, 100 billion messages/day

  4. Security: E2EE, no server access to message content

  5. Reliability: Messages must not be lost


Capacity Estimation

Traffic Estimates

  • Daily Active Users (DAU): 500 million

  • Messages per user/day: 40

  • Total messages/day: 20 billion

  • Peak QPS: 20B / 86,400 s × 3 (peak factor) ≈ 694,000 messages/sec

  • Media messages: 20% of total = 4 billion/day

Storage Estimates

  • Text message: 100 bytes

  • Media average: 1 MB (photos), 10 MB (videos)

  • Daily storage:

    • Text: 16B (20B total - 4B media) × 100 bytes = 1.6 TB

    • Media: 4B × 2 MB (avg) = 8 PB

    • Total: ≈ 8 PB/day (text is negligible next to media)

  • With 90-day retention (older messages compressed/archived): 8 PB × 90 = 720 PB

Bandwidth Estimates

  • Ingress: 694K msg/sec × 100 bytes ≈ 69 MB/sec (text only)

  • Media ingress: 4B media/day / 86,400 s ≈ 46K/sec × 2 MB ≈ 92 GB/sec

  • Egress: Same as ingress (message delivery)
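The arithmetic above can be reproduced with a short script. This is a minimal sketch that only restates the figures already listed; the variable names are illustrative, the peak factor of 3 is read off the Peak QPS line, and decimal units (1 MB = 10^6 bytes) are assumed.

```python
SECONDS_PER_DAY = 86_400

dau = 500_000_000
messages_per_user_per_day = 40
peak_factor = 3              # assumed peak-to-average ratio
media_percent = 20           # share of messages that carry media
text_bytes = 100
media_bytes = 2_000_000      # 2 MB blended average, decimal units
retention_days = 90

total_messages = dau * messages_per_user_per_day
peak_qps = total_messages * peak_factor / SECONDS_PER_DAY
media_messages = total_messages * media_percent // 100
text_messages = total_messages - media_messages

text_storage_per_day = text_messages * text_bytes        # bytes
media_storage_per_day = media_messages * media_bytes     # bytes
retained_storage = (text_storage_per_day + media_storage_per_day) * retention_days

text_ingress_per_sec = peak_qps * text_bytes                           # bytes/sec at peak
media_ingress_per_sec = media_messages / SECONDS_PER_DAY * media_bytes # bytes/sec average

print(f"total messages/day : {total_messages:,}")
print(f"peak QPS           : {peak_qps:,.0f}")
print(f"text storage/day   : {text_storage_per_day / 1e12:.1f} TB")
print(f"media storage/day  : {media_storage_per_day / 1e15:.1f} PB")
print(f"90-day retention   : {retained_storage / 1e15:.0f} PB")
print(f"text ingress       : {text_ingress_per_sec / 1e6:.0f} MB/sec")
print(f"media ingress      : {media_ingress_per_sec / 1e9:.1f} GB/sec")
```

Changing `dau` or `messages_per_user_per_day` shows how quickly media dominates every other figure.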


High-Level Architecture


Core Components

1. WebSocket Connection Management


Session Manager:

  • Maps userId → gatewayId (which WebSocket server the user is connected to)

  • Redis for fast lookup

  • TTL: 5 minutes (refreshed by heartbeat)
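The session map can be sketched as follows. This is an in-memory stand-in for the Redis lookup, not a Redis client: the class and method names are hypothetical, and the injectable clock exists only to make expiry easy to exercise.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SessionManager:
    """In-memory stand-in for the Redis userId -> gatewayId map (hypothetical API)."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # user_id -> (gateway_id, expiry time on the injected clock)
        self._sessions: Dict[str, Tuple[str, float]] = {}

    def register(self, user_id: str, gateway_id: str) -> None:
        """Called when a WebSocket connection is established."""
        self._sessions[user_id] = (gateway_id, self._clock() + self._ttl)

    def lookup(self, user_id: str) -> Optional[str]:
        """Return the gateway holding the user's connection, or None if offline."""
        entry = self._sessions.get(user_id)
        if entry is None:
            return None
        gateway_id, expires_at = entry
        if expires_at <= self._clock():
            del self._sessions[user_id]  # lazy expiry, similar to a key with a TTL
            return None
        return gateway_id

    def heartbeat(self, user_id: str) -> bool:
        """Refresh the TTL; returns False if the session had already expired."""
        gateway_id = self.lookup(user_id)
        if gateway_id is None:
            return False
        self.register(user_id, gateway_id)
        return True
```

With Redis itself, `register` would roughly correspond to `SET key value EX 300` and `heartbeat` to `EXPIRE key 300`; the key layout is left open here.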

Sticky Sessions:

  • User always connects to same gateway (via load balancer affinity)

  • Reduces session renegotiation overhead

SDE-3 Deep Dive: WebSockets vs SSE vs Long-Polling

  • WebSockets: Bi-directional, persistent connection. Ideal for chat because users both send and receive high volumes of data rapidly with low overhead.

  • Server-Sent Events (SSE): Uni-directional (Server -> Client). Good if the client mostly reads (like a stock ticker), but chat requires sending too.

  • Long-Polling: Client opens request, server holds it open until data is ready. High overhead (HTTP headers per message) and latency. Fallback only.

2. Message Delivery Flow


Message Status:

  1. Sent (one checkmark): Server received

  2. Delivered (two checkmarks): Delivered to recipient's device

  3. Read (two blue checkmarks): Recipient opened chat
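The three statuses form a one-way progression. A minimal sketch, assuming receipts can arrive late or out of order and that a status should therefore never move backwards (the names `Status` and `advance` are illustrative):

```python
from enum import IntEnum


class Status(IntEnum):
    SENT = 1        # server received the message
    DELIVERED = 2   # recipient's device received it
    READ = 3        # recipient opened the chat


def advance(current: Status, incoming: Status) -> Status:
    """Apply a receipt, ignoring any that would downgrade the status."""
    return max(current, incoming)
```

For example, a "delivered" receipt that arrives after a "read" receipt leaves the message marked as read.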

3. Group Messaging


Group Schema (PostgreSQL):
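The original schema is not reproduced here, so the following is only a plausible sketch. Table and column names are assumptions, and `sqlite3` is used purely so the example runs without a database server; the DDL is kept to a subset that should also be accepted by PostgreSQL.

```python
import sqlite3

# Hypothetical schema: one row per group, one row per (group, member) pair.
DDL = """
CREATE TABLE chat_groups (
    group_id    BIGINT PRIMARY KEY,
    name        TEXT NOT NULL,
    created_by  BIGINT NOT NULL,
    created_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE group_members (
    group_id   BIGINT NOT NULL REFERENCES chat_groups (group_id),
    user_id    BIGINT NOT NULL,
    role       TEXT NOT NULL DEFAULT 'member' CHECK (role IN ('admin', 'member')),
    joined_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (group_id, user_id)
);
-- Supports "which groups is this user in?" lookups.
CREATE INDEX idx_group_members_user ON group_members (user_id);
"""

MAX_GROUP_SIZE = 256  # limit from the functional requirements


def add_member(conn: sqlite3.Connection, group_id: int, user_id: int,
               role: str = "member") -> None:
    """Insert a member, enforcing the size limit in application code.

    Count-then-insert is not race-free; a real service would wrap this in a
    transaction with appropriate locking.
    """
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM group_members WHERE group_id = ?", (group_id,)
    ).fetchone()
    if count >= MAX_GROUP_SIZE:
        raise ValueError("group is full")
    conn.execute(
        "INSERT INTO group_members (group_id, user_id, role) VALUES (?, ?, ?)",
        (group_id, user_id, role),
    )


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.execute("INSERT INTO chat_groups (group_id, name, created_by) VALUES (1, 'family', 100)")
for uid in range(100, 100 + MAX_GROUP_SIZE):
    add_member(conn, 1, uid, "admin" if uid == 100 else "member")
```

The composite primary key on `(group_id, user_id)` prevents duplicate memberships, and fan-out reads the member list with a single indexed scan.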

Optimization: Group Message Storage

  • Single copy: Store message once, reference from each member's inbox

  • Denormalize: Each member gets copy (faster reads, more storage)

4. End-to-End Encryption


Signal Protocol:

  • Double Ratchet Algorithm: Forward secrecy

  • Server stores encrypted messages only

  • Keys stored locally on device
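Forward secrecy comes from deriving each message key through a one-way chain, so a key compromised today does not reveal earlier ones. The sketch below shows only the symmetric-key chain of the Double Ratchet, using the HMAC-SHA256 construction with the constants 0x01 and 0x02 that the public Signal specification suggests; the Diffie-Hellman ratchet, the encryption step, and out-of-order handling are all omitted. It is illustrative and not suitable for production use.

```python
import hashlib
import hmac
from typing import Tuple


def kdf_chain_step(chain_key: bytes) -> Tuple[bytes, bytes]:
    """Derive (next_chain_key, message_key) from the current chain key."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


class SymmetricRatchet:
    """One direction of a conversation: each message gets a fresh key."""

    def __init__(self, initial_chain_key: bytes) -> None:
        self._chain_key = initial_chain_key

    def next_message_key(self) -> bytes:
        # The old chain key is overwritten, so earlier message keys cannot be
        # recomputed from the state that remains on the device.
        self._chain_key, message_key = kdf_chain_step(self._chain_key)
        return message_key
```

Sender and receiver start from the same chain key (agreed during session setup) and step their ratchets in lockstep, which yields matching per-message keys without sending any key over the wire.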

Key Storage:
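The original key-storage details are not reproduced here. As a hedged sketch based on the public X3DH description, the server can hold only public key material: a long-term identity key, a signed prekey, and a pool of one-time prekeys, each handed out at most once. Private keys stay on the device. All class and field names below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PublicKeyRecord:
    """Public keys a device uploads at registration (no private material)."""
    identity_key: bytes
    signed_prekey: bytes
    signed_prekey_signature: bytes
    one_time_prekeys: List[bytes] = field(default_factory=list)


@dataclass(frozen=True)
class PreKeyBundle:
    """What a sender downloads to start an encrypted session."""
    identity_key: bytes
    signed_prekey: bytes
    signed_prekey_signature: bytes
    one_time_prekey: Optional[bytes]


class KeyDirectory:
    """Server-side directory of public keys, keyed by user id."""

    def __init__(self) -> None:
        self._records: Dict[str, PublicKeyRecord] = {}

    def publish(self, user_id: str, record: PublicKeyRecord) -> None:
        self._records[user_id] = record

    def fetch_bundle(self, user_id: str) -> PreKeyBundle:
        record = self._records[user_id]
        # Each one-time prekey is consumed on fetch; None once the pool is empty.
        one_time = record.one_time_prekeys.pop(0) if record.one_time_prekeys else None
        return PreKeyBundle(record.identity_key, record.signed_prekey,
                            record.signed_prekey_signature, one_time)
```

A client would normally top up its one-time prekeys when the pool runs low; that upload path is left out of this sketch.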

5. Media Handling


Media Encryption:

  • Generate random AES-256 key for each media file

  • Encrypt media client-side

  • Share decryption key in message (E2EE)

  • Server stores encrypted blob only

6. Voice/Video Calls


WebRTC Flow:

  1. Offer/Answer: SDP exchange via signaling server

  2. ICE: Establish peer-to-peer connection

  3. STUN: Discover public IP/port

  4. TURN: Relay if P2P fails (10-20% of calls)


Database Design

Message Storage (Cassandra)

Schema:
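The original table definition is not reproduced here. One plausible layout, stated as an assumption, partitions rows by conversation and clusters them by a time-sortable `message_id`. The sketch below models such a row and a Snowflake-style ID generator (41+ bits of milliseconds, 10 bits of worker id, 12 bits of sequence); the bit layout and all names are illustrative.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MessageRow:
    conversation_id: str   # partition key (assumed)
    message_id: int        # clustering key, sortable by creation time (assumed)
    sender_id: str
    ciphertext: bytes      # the server only ever sees encrypted content


class MessageIdGenerator:
    """Produces strictly increasing 64-bit-style IDs per worker."""

    def __init__(self, worker_id: int,
                 clock_ms: Callable[[], int] = lambda: int(time.time() * 1000)) -> None:
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self._worker_id = worker_id
        self._clock_ms = clock_ms
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = max(self._clock_ms(), self._last_ms)  # tolerate clock going backwards
            if now == self._last_ms:
                self._sequence += 1
                if self._sequence >= 4096:   # 12-bit sequence exhausted
                    now += 1                 # borrow the next millisecond
                    self._sequence = 0
            else:
                self._sequence = 0
            self._last_ms = now
            return (now << 22) | (self._worker_id << 12) | self._sequence
```

Because the timestamp occupies the high bits, sorting by `message_id` within a partition approximates sorting by send time.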

Why Cassandra?

  • Write-heavy: Billions of messages/day

  • Time-series: Natural ordering by message_id

  • Scalability: Horizontal scaling

Query Patterns:
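The original queries are not reproduced here. Assuming messages are partitioned by conversation and ordered by `message_id`, the typical reads are "latest N", "page backwards from a message", and "everything after the last message I have". The in-memory model below mimics that access pattern; it is not a Cassandra client, and all names are illustrative.

```python
import bisect
from typing import Dict, List, Tuple

Row = Tuple[int, bytes]  # (message_id, ciphertext)


class ConversationStore:
    """One sorted list of message ids per conversation (the 'partition')."""

    def __init__(self) -> None:
        self._ids: Dict[str, List[int]] = {}
        self._bodies: Dict[Tuple[str, int], bytes] = {}

    def insert(self, conversation_id: str, message_id: int, body: bytes) -> None:
        key = (conversation_id, message_id)
        if key not in self._bodies:  # re-inserting the same key just overwrites
            bisect.insort(self._ids.setdefault(conversation_id, []), message_id)
        self._bodies[key] = body

    def _rows(self, conversation_id: str, ids: List[int]) -> List[Row]:
        return [(i, self._bodies[(conversation_id, i)]) for i in ids]

    def latest(self, conversation_id: str, limit: int) -> List[Row]:
        """Newest first; what a client loads when opening a chat."""
        ids = self._ids.get(conversation_id, [])
        return self._rows(conversation_id, ids[max(len(ids) - limit, 0):][::-1])

    def before(self, conversation_id: str, message_id: int, limit: int) -> List[Row]:
        """Newest first, strictly older than message_id; used when scrolling back."""
        ids = self._ids.get(conversation_id, [])
        end = bisect.bisect_left(ids, message_id)
        return self._rows(conversation_id, ids[max(end - limit, 0):end][::-1])

    def after(self, conversation_id: str, message_id: int) -> List[Row]:
        """Oldest first, strictly newer than message_id; used to catch up after reconnect."""
        ids = self._ids.get(conversation_id, [])
        return self._rows(conversation_id, ids[bisect.bisect_right(ids, message_id):])
```

Every query touches a single conversation, which is what makes a partition-per-conversation layout a good fit for this workload.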

Undelivered Messages (Redis)
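The original data layout is not reproduced here. One way to picture it, as an assumption, is a per-user ordered queue of pending messages that is replayed on reconnect and trimmed as the device acknowledges each message. The sketch below is an in-memory stand-in for Redis with hypothetical names.

```python
from collections import OrderedDict
from typing import Dict, List, Tuple


class UndeliveredQueue:
    """Per-user pending messages, kept in arrival order until acknowledged."""

    def __init__(self) -> None:
        self._pending: Dict[str, "OrderedDict[str, bytes]"] = {}

    def enqueue(self, user_id: str, message_id: str, payload: bytes) -> None:
        """Store a message for a user who is currently offline."""
        self._pending.setdefault(user_id, OrderedDict())[message_id] = payload

    def pending(self, user_id: str) -> List[Tuple[str, bytes]]:
        """Messages to replay when the user reconnects, oldest first."""
        return list(self._pending.get(user_id, {}).items())

    def ack(self, user_id: str, message_id: str) -> bool:
        """Drop a message once the device confirms receipt."""
        queue = self._pending.get(user_id)
        if queue is None or message_id not in queue:
            return False
        del queue[message_id]
        if not queue:
            del self._pending[user_id]
        return True
```

Removing entries only on acknowledgement, rather than on send, is what keeps a message from being lost if the connection drops mid-delivery.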


Scalability Strategies

1. Sharding Strategy

User-based sharding:
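A minimal sketch of hash-based shard selection follows. The shard count and function name are hypothetical; a stable hash is used because Python's built-in `hash()` is randomized per process for strings.

```python
import hashlib

NUM_SHARDS = 64  # hypothetical shard count


def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user id to a shard index that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo remaps most users whenever `num_shards` changes; consistent hashing is the usual way to limit that movement.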

Benefit: User's conversations co-located

2. Connection Pooling

Problem: Millions of WebSocket connections

Solution:

  • Per-server limit: 50K connections/server

  • For 500M DAU (worst case, all connected at once): 500M / 50K = 10,000 WebSocket servers

  • Auto-scaling: Based on connection count

3. Read Replicas

  • Read-heavy for old messages (chat history)

  • Cassandra replicas for read scaling


Advanced Features

1. Status Updates (Stories)

  • Similar to Instagram Stories

  • 24-hour TTL, S3 lifecycle policy

  • View count tracking in Redis

2. Message Reactions
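No details are given for this feature, so the following is only one possible model. It assumes each user holds at most one reaction per message, so reacting again replaces the earlier reaction. Names are illustrative.

```python
from collections import Counter
from typing import Dict


class ReactionStore:
    """message_id -> {user_id: reaction}; one reaction per user per message (assumed)."""

    def __init__(self) -> None:
        self._by_message: Dict[str, Dict[str, str]] = {}

    def react(self, message_id: str, user_id: str, reaction: str) -> None:
        """Add a reaction, replacing any earlier one from the same user."""
        self._by_message.setdefault(message_id, {})[user_id] = reaction

    def remove(self, message_id: str, user_id: str) -> None:
        self._by_message.get(message_id, {}).pop(user_id, None)

    def summary(self, message_id: str) -> Counter:
        """Reaction -> count, as shown under a message bubble."""
        return Counter(self._by_message.get(message_id, {}).values())
```

Keying by user makes repeated taps idempotent and keeps the stored state bounded by group size.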

3. Typing Indicators
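No details are given for this feature either. A common approach, assumed here, treats "typing" as ephemeral state that expires on its own unless the client keeps refreshing it, so a dropped connection never leaves a stale indicator. The TTL value and names are hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple


class TypingTracker:
    """Ephemeral (conversation, user) -> expiry map; nothing is persisted."""

    def __init__(self, ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires: Dict[Tuple[str, str], float] = {}

    def typing(self, conversation_id: str, user_id: str) -> None:
        """Client sends this periodically while the user keeps typing."""
        self._expires[(conversation_id, user_id)] = self._clock() + self._ttl

    def stopped(self, conversation_id: str, user_id: str) -> None:
        """Explicit stop, for example when the message is sent."""
        self._expires.pop((conversation_id, user_id), None)

    def who_is_typing(self, conversation_id: str) -> List[str]:
        now = self._clock()
        # Drop expired entries, then report the remaining users in this conversation.
        self._expires = {k: exp for k, exp in self._expires.items() if exp > now}
        return sorted(user for (conv, user) in self._expires if conv == conversation_id)
```

Typing events are best-effort: they can be dropped under load without affecting message delivery.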


Monitoring & Reliability

Metrics

  • Message delivery latency: p50, p99, p999

  • WebSocket connection count: Per server

  • Message delivery success rate: %

  • Push notification delivery rate

Failure Handling

WebSocket disconnect:
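The original handling steps are not reproduced here. A typical client-side piece, assumed for illustration, is reconnecting with exponential backoff and jitter so that a gateway failure does not trigger a synchronized reconnect storm. The parameter values are hypothetical.

```python
import random
from typing import Iterator, Optional


def reconnect_delays(base: float = 1.0, cap: float = 60.0, attempts: int = 8,
                     rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield wait times (seconds) before each reconnect attempt.

    Uses "full jitter": a uniform draw between 0 and the capped exponential
    ceiling, so clients that dropped together spread out their retries.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng.uniform(0, ceiling)
```

After reconnecting, the client would re-register its session and fetch any messages queued while it was offline.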

Message delivery retry:
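The original retry steps are not reproduced here. The sketch below assumes at-least-once delivery: the sender retries until it sees an acknowledgement, and the receiver deduplicates by `message_id` so retries never produce duplicate messages. All names, including the `AckLostOnce` test double, are illustrative, and real code would wait between attempts.

```python
from typing import Callable, List, Set


class Receiver:
    """Deduplicates by message id, so redelivery is harmless."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.inbox: List[bytes] = []

    def deliver(self, message_id: str, payload: bytes) -> bool:
        if message_id not in self._seen:
            self._seen.add(message_id)
            self.inbox.append(payload)
        return True  # acknowledge duplicates too, so the sender stops retrying


def send_with_retry(send: Callable[[str, bytes], bool], message_id: str,
                    payload: bytes, max_attempts: int = 5) -> bool:
    """Retry until acknowledged; False means the caller should queue the message."""
    for _ in range(max_attempts):
        try:
            if send(message_id, payload):
                return True
        except ConnectionError:
            pass  # a lost ack and a lost message look the same to the sender
    return False


class AckLostOnce:
    """Test double: delivers every attempt but loses the first acknowledgement."""

    def __init__(self, receiver: Receiver) -> None:
        self.receiver = receiver
        self.calls = 0

    def __call__(self, message_id: str, payload: bytes) -> bool:
        self.calls += 1
        acked = self.receiver.deliver(message_id, payload)
        if self.calls == 1:
            raise ConnectionError("ack lost in transit")
        return acked
```

With `AckLostOnce`, the message reaches the receiver twice but appears in the inbox once, which is the property that makes blind retries safe.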


Trade-offs

| Aspect | Choice | Trade-off |
| --- | --- | --- |
| Protocol | WebSocket | Persistent connection vs HTTP overhead |
| E2EE | Signal Protocol | Privacy vs searchability |
| Message Storage | Cassandra | Write speed vs complex queries |
| Group Fanout | Async (Kafka) | Eventual delivery vs instant |


Interview Discussion Points

Q: How to ensure message ordering in group chats?

  • Lamport timestamps: Each message tagged with logical clock

  • Server-assigned sequence: Central sequencer per conversation

  • SDE-3 Concept (CRDTs): Conflict-free Replicated Data Types can be used for distributed message ordering and resolving concurrent edits without locking.

  • Trade-off: Strict ordering vs throughput
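The first two options can be sketched side by side. This is a minimal illustration with hypothetical class names: a Lamport clock that each client maintains, and a per-conversation sequencer that a server would own.

```python
from collections import defaultdict
from typing import Dict


class LamportClock:
    """Logical clock: orders events consistently with causality."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Call before sending a message; returns the timestamp to attach."""
        self.time += 1
        return self.time

    def receive(self, remote_time: int) -> int:
        """Call on receipt; jumps ahead of anything already seen."""
        self.time = max(self.time, remote_time) + 1
        return self.time


class ConversationSequencer:
    """Server-assigned, gap-free sequence number per conversation."""

    def __init__(self) -> None:
        self._next: Dict[str, int] = defaultdict(int)

    def assign(self, conversation_id: str) -> int:
        self._next[conversation_id] += 1
        return self._next[conversation_id]
```

Lamport timestamps from different senders can tie, so clients typically sort by `(timestamp, sender_id)`. The sequencer gives a total order without ties, but every message in a conversation must pass through it, which is where the throughput cost comes from.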

Q: Handling message floods (spam)?

  • Rate limiting: Max 100 msg/min per user

  • Bloom filter: Detect duplicate hash (copy-paste spam)

  • ML model: Detect spam patterns
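The rate limit can be sketched as a sliding-window log per user. This is one of several possible algorithms (a token bucket is a common alternative); the class name is illustrative and the state is kept in memory, whereas a real deployment would need it shared across gateways.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` events per user in any `window_seconds` span."""

    def __init__(self, limit: int = 100, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self._clock()
        events = self._events[user_id]
        # Forget sends that have aged out of the window.
        while events and events[0] <= now - self._window:
            events.popleft()
        if len(events) >= self._limit:
            return False
        events.append(now)
        return True
```

Memory per user is bounded by `limit` timestamps, which is acceptable at 100 msg/min but would argue for a counter-based scheme at much higher limits.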

Q: Disaster recovery?

  • Multi-region Cassandra: 3 replicas across regions

  • WAL (Write-Ahead Log): Replay on failure

  • Backup: Daily snapshots to S3
