Notification Service

Difficulty: Medium
Topics: Pub-Sub, Message Queues, Third-party Integration, Rate Limiting
Time: 45 minutes
Companies: Amazon (SNS), Uber, LinkedIn, Airbnb

Problem Statement

Design a scalable notification service that sends notifications across multiple channels (Email, SMS, Push Notifications) to millions of users.

Requirements:

Functional:
- Send email, SMS, and push notifications.
- Support bulk notifications (e.g., "Marketing blast to 1M users").
- Support transactional notifications (e.g., "OTP", "Order Confirmed").
- Honor user preferences (opt-in/opt-out per channel).
Non-Functional:
- High Availability: 99.99%
- Durability: No notification lost.
- Throughput: 1M+ notifications/min.
- Latency: Near real-time for OTP (<5 s), eventual for marketing (<10 min).
- Deduplication: Don't send duplicate alerts.

Architecture

Client Services
(Order Service, Marketing)
       ↓
   Load Balancer
       ↓
┌───────────────────┐
│ Notification API  │ (Rate limiting, Validation)
└────────┬──────────┘
         │
         ▼
┌───────────────────┐        ┌──────────────────┐
│  Service Router   │ ──────▶│ User Preferences │
└────────┬──────────┘        │ DB (MySQL/NoSQL) │
         │                   └──────────────────┘
         ▼
    Kafka / RabbitMQ (Decoupling & Buffering)
    topics: [priority-otp], [email], [sms], [push]
         │
         ▼
┌───────────────────┐
│     Workers       │ (Stateless Consumers)
└────────┬──────────┘
         │
    ┌────┴────┬─────────────┐
    ▼         ▼             ▼
  Email      SMS          Push
  Handler    Handler      Handler 
(SendGrid)  (Twilio)     (FCM/APNS)

Data Schema

User Preferences

CREATE TABLE user_preferences (
    user_id BIGINT PRIMARY KEY,
    email_enabled BOOLEAN DEFAULT TRUE,
    sms_enabled BOOLEAN DEFAULT FALSE,
    push_enabled BOOLEAN DEFAULT TRUE,
    timezone VARCHAR(50),
    dnd_start_time TIME,
    dnd_end_time TIME
);

Notification Log (Cassandra/DynamoDB)

CREATE TABLE notification_logs (
    notification_id UUID PRIMARY KEY,
    user_id BIGINT,
    type VARCHAR(20), -- EMAIL/SMS/PUSH
    status VARCHAR(20), -- SENT/FAILED
    created_at TIMESTAMP,
    third_party_id VARCHAR(100)
);

Key Components

1. API & Validation

Rate Limit: Prevent abusive clients (e.g., 100 req/sec per service).
Validation: Check email format, phone number validity.
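
The per-service rate limit above can be sketched with a token bucket. This is a minimal in-memory illustration, not a production limiter: the class names, the `clock` parameter, and the choice of token bucket over other algorithms are assumptions, and a real deployment would likely keep the counters in a shared store so that all API instances see the same budget.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated_at = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        refill = (now - self.updated_at) * self.rate
        self.tokens = min(self.capacity, self.tokens + refill)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerServiceRateLimiter:
    """Keeps one bucket per calling service (hypothetical helper)."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, service_id: str) -> bool:
        bucket = self.buckets.get(service_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[service_id] = bucket
        return bucket.allow()
```

With `PerServiceRateLimiter(rate=100, capacity=100)`, a service that has exhausted its burst is rejected until tokens refill, while other services are unaffected. A rejected request would typically be answered with HTTP 429.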

2. Priority Handling (Message Queues)

Problem: Marketing blast (1M emails) blocks critical OTPs.
Solution: Separate Queues!
- High_Priority_Queue: OTPs, Security Alerts (Dedicated Workers).
- Low_Priority_Queue: Marketing, Monthly Statements.
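
A small sketch of the queue split, using in-memory deques as stand-ins for the real queues. The `Notification` fields, the `PriorityRouter` class, and the exact set of high-priority kinds are assumptions made for illustration; anything not listed as high priority is treated as low priority here.

```python
from collections import deque
from dataclasses import dataclass

# Assumed classification: the text lists OTPs and security alerts as high priority.
HIGH_PRIORITY_KINDS = {"OTP", "SECURITY_ALERT"}


@dataclass(frozen=True)
class Notification:
    user_id: int
    kind: str   # e.g. "OTP", "SECURITY_ALERT", "MARKETING"
    body: str


class PriorityRouter:
    """Publishes each notification to one of two queues (in-memory stand-ins)."""

    def __init__(self) -> None:
        self.queues = {
            "High_Priority_Queue": deque(),
            "Low_Priority_Queue": deque(),
        }

    def queue_for(self, notification: Notification) -> str:
        if notification.kind in HIGH_PRIORITY_KINDS:
            return "High_Priority_Queue"
        return "Low_Priority_Queue"

    def publish(self, notification: Notification) -> str:
        name = self.queue_for(notification)
        self.queues[name].append(notification)
        return name
```

Because the OTP lands at the head of its own queue, the dedicated workers can pick it up immediately regardless of how deep the marketing backlog is.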

3. Deduplication and Exactly-Once Semantics (SDE-3 Concept)

Problem: Client retries or network blips create duplicate notifications.
Solution: Idempotency Keys.
- The calling service generates a unique idempotency_key (UUID).
- Check a fast distributed cache (Redis) for the key, using an atomic set-if-absent (e.g., SET with NX) so that two concurrent retries cannot both pass the check. TTL = 10 minutes to 24 hours.
- If the key exists -> Return 200 OK (already processed).
- If the key does not exist -> The set succeeds; process the message and return 200 OK.
- Note on Kafka: Kafka provides "at-least-once" delivery by default. To prevent workers from sending duplicates when they crash before committing offsets, we must also check idempotency at the worker level before making the third-party API call.
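
The idempotency check can be sketched as follows. This uses an in-memory dictionary in place of Redis so the example is self-contained; `IdempotencyStore`, `claim`, and `handle_request` are hypothetical names. In Redis the equivalent of `claim` would be a single `SET key value NX EX <ttl>` command, which is atomic.

```python
import time
from typing import Callable, Dict


class IdempotencyStore:
    """In-memory stand-in for a cache with set-if-absent and TTL."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._expires_at: Dict[str, float] = {}

    def claim(self, key: str) -> bool:
        """Return True only for the first caller to claim `key` within the TTL."""
        now = self.clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[key] = now + self.ttl
        return True

    def release(self, key: str) -> None:
        """Forget a key, e.g. when processing failed and a retry should be allowed."""
        self._expires_at.pop(key, None)


def handle_request(store: IdempotencyStore, idempotency_key: str,
                   enqueue: Callable[[str], None]) -> str:
    if not store.claim(idempotency_key):
        return "duplicate"      # the caller still receives 200 OK
    try:
        enqueue(idempotency_key)
    except Exception:
        store.release(idempotency_key)   # let the client retry later
        raise
    return "accepted"
```

Claiming the key before processing closes the race between two simultaneous retries; releasing it on failure is one possible way to avoid silently swallowing a request that never made it onto the queue.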

4. Failure Handling with Dead Letter Queues (DLQ)

Problem: What if an email address is permanently invalid or a message always crashes the worker (poison pill)?
Solution:
- If a message fails processing N times (e.g., 3 retries), move it to a Dead Letter Queue (DLQ).
- Set up alerts on the DLQ size.
- Engineers can inspect the DLQ, fix the bug or data issue, and replay the messages back into the main queue.
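
The retry-then-DLQ flow above might look like the sketch below. It assumes N = 3 total attempts and uses a deque and a list as stand-ins for the main queue and the DLQ; `Message`, `MAX_ATTEMPTS`, and `drain` are illustrative names, and a real worker would usually wait between attempts rather than retrying immediately.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List

MAX_ATTEMPTS = 3  # assumed value of N


@dataclass
class Message:
    body: str
    attempts: int = 0


def drain(main_queue: Deque[Message], dlq: List[Message],
          handler: Callable[[str], None]) -> None:
    """Process every message; park those that keep failing in the DLQ."""
    while main_queue:
        msg = main_queue.popleft()
        try:
            handler(msg.body)
        except Exception:
            msg.attempts += 1
            if msg.attempts >= MAX_ATTEMPTS:
                dlq.append(msg)          # poison pill: stop retrying
            else:
                main_queue.append(msg)   # requeue for another attempt
```

Replaying after a fix amounts to resetting `attempts` and moving the messages from the DLQ back onto the main queue.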

5. Third-Party Integration (The "Hard" Part)

Challenges:
- Vendor Downtime (Twilio down).
- Rate Limits (Vendor blocks IP).
- Latency (Vendor slow).
Mitigation:
- Retry with Backoff: Exponential backoff for 5xx errors.
- Circuit Breaker: Stop calling a vendor once its recent error rate reaches a threshold (e.g., 50%), and fail over to a secondary provider (e.g., SendGrid -> AWS SES).
- Rate Limiter (Outbound): Throttle requests to match Vendor's TPS limit.
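
A minimal sketch of the circuit breaker with failover, plus a helper that generates exponential backoff delays. The window size, cooldown, and all names are assumptions, and the breaker is simplified: it has no half-open state and simply starts a fresh window once the cooldown has elapsed. The `primary` and `secondary` callables stand in for vendor clients.

```python
import time
from collections import deque
from typing import Callable, List, Optional


class CircuitBreaker:
    def __init__(self, window: int = 10, error_threshold: float = 0.5,
                 cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.results = deque(maxlen=window)   # True = success, False = failure
        self.error_threshold = error_threshold
        self.cooldown = cooldown_seconds
        self.clock = clock
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: close the circuit and start a fresh window.
            self.opened_at = None
            self.results.clear()
            return False
        return True

    def record(self, success: bool) -> None:
        self.results.append(success)
        window_full = len(self.results) == self.results.maxlen
        error_rate = self.results.count(False) / len(self.results)
        if window_full and error_rate >= self.error_threshold:
            self.opened_at = self.clock()


def send_with_failover(message: str,
                       primary: Callable[[str], str],
                       secondary: Callable[[str], str],
                       breaker: CircuitBreaker) -> str:
    """Try the primary vendor unless its circuit is open; otherwise fail over."""
    if not breaker.is_open():
        try:
            result = primary(message)
            breaker.record(True)
            return result
        except Exception:
            breaker.record(False)
    return secondary(message)


def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   retries: int = 4) -> List[float]:
    """Delays (in seconds) to wait before each retry of a 5xx response."""
    return [base * factor ** attempt for attempt in range(retries)]
```

While the circuit is open the primary vendor receives no traffic at all, which gives it room to recover and keeps worker threads from piling up on slow calls.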

Scale Estimation

Traffic: 10M notifications/day.
Peak: 1M within 10 mins (Marketing).
Storage:
- Log retention: 1 year.
- 1KB per log * 10M/day * 365 = ~3.65 TB/year.
- Database: Cassandra (Write-heavy, TTL support).
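
The arithmetic behind these figures, written out as a quick calculation. It assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes) and ignores replication and index overhead; the variable names are illustrative.

```python
NOTIFICATIONS_PER_DAY = 10_000_000
PEAK_BURST = 1_000_000            # marketing blast
PEAK_WINDOW_SECONDS = 10 * 60     # delivered within 10 minutes
LOG_SIZE_BYTES = 1_000            # 1 KB per log row (decimal)
RETENTION_DAYS = 365
SECONDS_PER_DAY = 24 * 60 * 60

# Average load if traffic were spread evenly over the day (about 116/sec).
average_per_second = NOTIFICATIONS_PER_DAY / SECONDS_PER_DAY

# Sustained rate needed to finish the blast in time (about 1,667/sec).
peak_per_second = PEAK_BURST / PEAK_WINDOW_SECONDS

# Raw log storage for one year of retention, in terabytes.
storage_tb_per_year = (
    NOTIFICATIONS_PER_DAY * LOG_SIZE_BYTES * RETENTION_DAYS / 1e12
)
```

The peak rate is roughly 14 times the daily average, which is why the queue in front of the workers matters more for sizing than the average load does.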

Interview Tips

Q: "How do you handle users in Do Not Disturb (DND) mode?"

A: "Workers check the User Preferences DB before sending. If DND is active, park the message in a Delayed_Queue with the timestamp at which it should be processed."
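
A sketch of the DND check, assuming `dnd_start_time` and `dnd_end_time` are naive local times that have already been interpreted in the user's timezone (the timezone conversion itself is left out). The function names are hypothetical. The main subtlety is a window that wraps past midnight, such as 22:00 to 07:00.

```python
from datetime import datetime, time, timedelta


def in_dnd(now_local: time, start: time, end: time) -> bool:
    """True if `now_local` falls inside the [start, end) DND window."""
    if start <= end:
        return start <= now_local < end
    # Window wraps past midnight, e.g. 22:00 -> 07:00.
    return now_local >= start or now_local < end


def next_delivery_time(now_local: datetime, start: time, end: time) -> datetime:
    """When the message may be sent: now, or the moment the DND window ends."""
    if not in_dnd(now_local.time(), start, end):
        return now_local
    candidate = datetime.combine(now_local.date(), end)
    if candidate <= now_local:
        candidate += timedelta(days=1)   # window ends tomorrow
    return candidate
```

The value returned by `next_delivery_time` is the timestamp that would be stored with the message in the Delayed_Queue.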

Q: "How do you prevent spamming users?"

A: "Implement a Frequency Cap in Redis (e.g., 'User X received 3 marketing emails today -> Drop 4th'). Critical alerts bypass this."
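
The frequency cap can be sketched with a per-user, per-day counter. This version keeps counts in a dictionary so it runs on its own; `FrequencyCap` and `should_send` are illustrative names. With Redis, the same idea is commonly built from `INCR` on a key such as user plus date, with an `EXPIRE` so old counters clean themselves up.

```python
from collections import defaultdict
from datetime import date
from typing import DefaultDict, Tuple


class FrequencyCap:
    """Limits non-critical notifications per user per day."""

    def __init__(self, daily_limit: int = 3) -> None:
        self.daily_limit = daily_limit
        self._counts: DefaultDict[Tuple[int, date], int] = defaultdict(int)

    def should_send(self, user_id: int, today: date,
                    critical: bool = False) -> bool:
        if critical:
            return True   # critical alerts bypass the cap and are not counted
        key = (user_id, today)
        if self._counts[key] >= self.daily_limit:
            return False  # cap reached: drop this notification
        self._counts[key] += 1
        return True
```

Keying the counter by date means the cap resets naturally at the start of each day without any scheduled cleanup job.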

Q: "What if the worker crashes after sending to Twilio but before updating DB?"

A: "Idempotency! Store third_party_id in DB. If worker restarts, check if we already have a success ID for this notification_id before retrying Twilio. If Twilio's API supports idempotency keys, pass our notification_id to them."
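
A sketch of that recovery path. `FakeVendor` is a made-up stand-in for a provider whose API honors idempotency keys, and `logs` is a dictionary standing in for the notification log table, mapping notification_id to third_party_id; neither reflects a real vendor SDK.

```python
from typing import Callable, Dict


class FakeVendor:
    """Hypothetical provider that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.by_key: Dict[str, str] = {}
        self.deliveries = 0

    def send(self, idempotency_key: str) -> str:
        if idempotency_key not in self.by_key:
            self.deliveries += 1
            self.by_key[idempotency_key] = f"vendor-{self.deliveries}"
        return self.by_key[idempotency_key]


def deliver_once(notification_id: str, logs: Dict[str, str],
                 send: Callable[[str], str]) -> str:
    """Send unless the log already holds a third_party_id for this notification."""
    existing = logs.get(notification_id)
    if existing is not None:
        return existing                       # already sent and recorded
    # Pass our notification_id as the vendor's idempotency key, so a send
    # that succeeded before a crash is not repeated on retry.
    third_party_id = send(notification_id)
    logs[notification_id] = third_party_id
    return third_party_id
```

The log check covers the case where the write succeeded; the vendor-side key covers the window in which the send succeeded but the write did not.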


Last updated 1 month ago
