
Notification Service

Difficulty: Medium
Topics: Pub-Sub, Message Queues, Third-party Integration, Rate Limiting
Time: 45 minutes
Companies: Amazon (SNS), Uber, LinkedIn, Airbnb

Problem Statement

Design a scalable notification service that sends notifications across multiple channels (Email, SMS, Push Notifications) to millions of users.

Requirements:

  • Functional:

    • Send email, SMS, and push notifications.

    • Support bulk notifications (e.g., "Marketing blast to 1M users").

    • Support transactional notifications (e.g., "OTP", "Order Confirmed").

    • User preferences (Opt-out/Opt-in channels).

  • Non-Functional:

    • High Availability: 99.99%

    • Durability: No notification lost.

    • Throughput: 1M+ notifications/min.

    • Latency: Near real-time for OTPs (<5 s), eventual for marketing (<10 min).

    • Deduplication: Don't send duplicate alerts.

Architecture

Data Schema

User Preferences
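
The original schema for this table is not preserved here. As a minimal sketch of what a preferences record might hold, assuming per-channel opt-in flags and an optional Do Not Disturb window (every field name below is hypothetical):

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Dict, Optional


@dataclass
class UserPreferences:
    """Hypothetical shape of one row in the preferences store."""
    user_id: str
    # Channel name -> opted in? Unknown channels are treated as opted out.
    opted_in: Dict[str, bool] = field(
        default_factory=lambda: {"email": True, "sms": True, "push": True}
    )
    # Optional quiet hours in the user's local time.
    dnd_start: Optional[time] = None
    dnd_end: Optional[time] = None

    def allows(self, channel: str) -> bool:
        return self.opted_in.get(channel, False)


prefs = UserPreferences(user_id="u-1")
prefs.opted_in["sms"] = False  # the user opts out of SMS
```

Workers would call something like `prefs.allows(channel)` before handing a message to a vendor.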

Notification Log (Cassandra/DynamoDB)
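
The original schema for this table is not preserved here either. A minimal sketch of a log entry, assuming the identifiers mentioned elsewhere in this page (notification_id, idempotency_key, third_party_id); the remaining field names and the status values are hypothetical:

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class NotificationLogEntry:
    """Hypothetical shape of one row in the notification log."""
    notification_id: str
    user_id: str          # a plausible partition key in a wide-column store
    channel: str          # "email", "sms", or "push"
    status: str           # assumed values: "queued", "sent", "failed"
    idempotency_key: str
    created_at: datetime
    third_party_id: Optional[str] = None  # filled in once the vendor accepts the message


entry = NotificationLogEntry(
    notification_id="n-42",
    user_id="u-1",
    channel="sms",
    status="queued",
    idempotency_key="7d9f0c1e-5a43-4b7e-9a1c-2f6f3b8d1e90",
    created_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)

# After the vendor call succeeds, the worker records the outcome.
entry.status = "sent"
entry.third_party_id = "vendor-1"
row = asdict(entry)  # plain dict, ready to hand to a database driver
```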

Key Components

1. API & Validation

  • Rate Limit: Throttle abusive clients (e.g., 100 req/sec per calling service).

  • Validation: Check email format, phone number validity.
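
The two checks above can be sketched as follows. This is an illustration only: the regular expressions are deliberately loose, the token bucket is one of several possible rate-limiting algorithms, and all names are hypothetical. Time is passed in explicitly so the behavior is easy to reason about.

```python
import re
from dataclasses import dataclass

# Loose syntactic checks; real validity is only proven by a successful delivery.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
# E.164 style: "+" followed by up to 15 digits, the first one non-zero.
PHONE_RE = re.compile(r"^\+[1-9]\d{1,14}$")


def is_valid_recipient(channel: str, address: str) -> bool:
    """Cheap format check performed before a request is enqueued."""
    if channel == "email":
        return EMAIL_RE.match(address) is not None
    if channel == "sms":
        return PHONE_RE.match(address) is not None
    return channel == "push" and bool(address)  # push: any non-empty device token


@dataclass
class TokenBucket:
    """Refills `rate` tokens per second, up to `capacity`."""
    rate: float
    capacity: float
    tokens: float = 0.0
    last_refill: float = 0.0

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per calling service: 100 req/sec with a burst of 100.
bucket = TokenBucket(rate=100, capacity=100, tokens=100)
accepted = sum(bucket.allow(now=0.0) for _ in range(150))  # 150 requests in the same instant
```

In a multi-node API tier the bucket state would have to live in shared storage (for example Redis) rather than in process memory.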

2. Priority Handling (Message Queues)

  • Problem: Marketing blast (1M emails) blocks critical OTPs.

  • Solution: Separate Queues!

    • High_Priority_Queue: OTPs, Security Alerts (Dedicated Workers).

    • Low_Priority_Queue: Marketing, Monthly Statements.
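
A minimal in-memory sketch of the routing idea, assuming a hypothetical `type` field on each notification. With dedicated workers, the high-priority pool would consume only the first queue; the `next_notification` policy below shows the alternative where shared workers always drain high priority first.

```python
from collections import deque

high_priority_queue = deque()  # OTPs, security alerts
low_priority_queue = deque()   # marketing, monthly statements

HIGH_PRIORITY_TYPES = {"otp", "security_alert"}


def enqueue(notification: dict) -> None:
    """Route a notification to a queue based on its type."""
    if notification["type"] in HIGH_PRIORITY_TYPES:
        high_priority_queue.append(notification)
    else:
        low_priority_queue.append(notification)


def next_notification():
    """Shared-worker policy: take from high priority before low priority."""
    if high_priority_queue:
        return high_priority_queue.popleft()
    if low_priority_queue:
        return low_priority_queue.popleft()
    return None


# A marketing blast arrives first, then a single OTP.
for i in range(3):
    enqueue({"id": f"promo-{i}", "type": "marketing"})
enqueue({"id": "otp-1", "type": "otp"})

processing_order = []
while (item := next_notification()) is not None:
    processing_order.append(item["id"])
```

Even though the OTP was enqueued last, it is processed first.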

3. Deduplication and Exactly-Once Semantics (SDE-3 Concept)

  • Problem: Client retries or network blips create duplicate notifications.

  • Solution: Idempotency Keys.

    • The calling service generates a unique idempotency_key (UUID).

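    • Check a fast distributed cache (Redis) for the key, stored with a TTL of 10 minutes to 24 hours.

    • If it exists -> Return 200 OK (already processed).

    • If it does not exist -> Claim the key in Redis, process the message, and return 200 OK. Do the check and the claim as one atomic operation (e.g., Redis SET with NX and a TTL); otherwise two concurrent retries can both see a missing key and both send.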

    • Note on Kafka: Kafka provides "at-least-once" delivery by default. A worker that crashes after sending but before committing its offset will receive the same message again, so workers must also check idempotency before calling the third-party API.
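
The idempotency check can be sketched with an in-memory stand-in for Redis. The class and function names are hypothetical; `set_if_absent` is meant to mimic what Redis `SET key value NX EX ttl` does in a single atomic step.

```python
import time
from typing import Dict, List, Optional


class InMemoryKeyStore:
    """Stand-in for Redis, for illustration only (not safe across processes)."""

    def __init__(self) -> None:
        self._expiry: Dict[str, float] = {}  # key -> expiry timestamp

    def set_if_absent(self, key: str, ttl_seconds: float, now: Optional[float] = None) -> bool:
        """Return True if the key was claimed, False if a live key already exists."""
        now = time.time() if now is None else now
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False
        self._expiry[key] = now + ttl_seconds
        return True


store = InMemoryKeyStore()
sent: List[str] = []  # stands in for "message handed to the queue"
TTL_SECONDS = 24 * 3600


def handle_request(idempotency_key: str, payload: str, now: Optional[float] = None) -> str:
    if not store.set_if_absent(idempotency_key, TTL_SECONDS, now=now):
        return "duplicate"  # the API would still answer 200 OK
    sent.append(payload)
    return "accepted"


first = handle_request("key-1", "Your code is 123456", now=0.0)
retry = handle_request("key-1", "Your code is 123456", now=5.0)
after_ttl = handle_request("key-1", "Your code is 123456", now=TTL_SECONDS + 1.0)
```

The client retry five seconds later is recognized as a duplicate, while the same key is accepted again once the TTL has expired. That is the trade-off behind the TTL choice: a longer window catches later retries at the cost of more cache memory.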

4. Failure Handling with Dead Letter Queues (DLQ)

  • Problem: What if an email address is permanently invalid or a message always crashes the worker (poison pill)?

  • Solution:

    • If a message fails processing N times (e.g., 3 retries), move it to a Dead Letter Queue (DLQ).

    • Set up alerts on the DLQ size.

    • Engineers can inspect the DLQ, fix the bug or data issue, and replay the messages back into the main queue.
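
A minimal sketch of the retry-then-DLQ flow, using in-memory deques in place of a real broker. The message shape (`body`, `attempts`, `last_error`) and the handler are hypothetical.

```python
from collections import deque

MAX_ATTEMPTS = 3


def process_with_dlq(main_queue: deque, dead_letter_queue: deque, handler) -> None:
    """Drain the main queue; move messages that keep failing to the DLQ."""
    while main_queue:
        message = main_queue.popleft()
        try:
            handler(message["body"])
        except Exception as exc:
            message["attempts"] += 1
            message["last_error"] = repr(exc)  # kept for whoever inspects the DLQ
            if message["attempts"] >= MAX_ATTEMPTS:
                dead_letter_queue.append(message)
            else:
                main_queue.append(message)  # re-queue for another attempt


delivered = []


def send_email(address: str) -> None:
    if "@" not in address:
        raise ValueError(f"invalid email address: {address}")
    delivered.append(address)


main_queue = deque([
    {"body": "ok@example.com", "attempts": 0},
    {"body": "not-an-address", "attempts": 0},  # poison pill: fails every time
])
dead_letter_queue = deque()
process_with_dlq(main_queue, dead_letter_queue, send_email)
```

After the run, the valid message has been delivered once and the poison pill sits in the DLQ with its attempt count and last error attached. Replaying is then a matter of resetting `attempts` and appending the message back to the main queue.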

5. Third-Party Integration (The "Hard" Part)

  • Challenges:

    • Vendor Downtime (Twilio down).

    • Rate Limits (Vendor blocks IP).

    • Latency (Vendor slow).

  • Mitigation:

    • Retry with Backoff: Exponential backoff for 5xx errors.

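    • Circuit Breaker: Stop calling a vendor once its error rate crosses a threshold (e.g., 50% of recent calls) and fail over to a secondary provider (e.g., SendGrid -> AWS SES). Probe the primary again after a cooldown.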

    • Rate Limiter (Outbound): Throttle requests to match Vendor's TPS limit.
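
The backoff and circuit-breaker ideas can be sketched as below. This is one simple count-based variant, not a reference implementation: the window size, threshold, cooldown, and provider functions are all assumptions, and time is passed in explicitly to keep the example deterministic.

```python
from collections import deque


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff: 0.5s, 1s, 2s, ... capped at `cap` (add jitter in practice)."""
    return min(cap, base * (2 ** attempt))


class CircuitBreaker:
    """Opens when the failure rate over the last `window` calls reaches `threshold`."""

    def __init__(self, window: int = 10, threshold: float = 0.5, cooldown: float = 30.0):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold
        self.cooldown = cooldown
        self.opened_at = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown:
            # Cooldown over: close the breaker and start a fresh window.
            self.opened_at = None
            self.results.clear()
            return True
        return False

    def record(self, success: bool, now: float) -> None:
        self.results.append(success)
        window_full = len(self.results) == self.results.maxlen
        failure_rate = self.results.count(False) / len(self.results)
        if window_full and failure_rate >= self.threshold:
            self.opened_at = now


def send_with_failover(message, providers, breakers, now):
    """Try providers in order, skipping any whose breaker is open."""
    for name, send in providers:
        breaker = breakers[name]
        if not breaker.allow(now):
            continue
        try:
            send(message)
            breaker.record(True, now)
            return name
        except Exception:
            breaker.record(False, now)
    return None


primary_calls = []


def flaky_primary(message):
    primary_calls.append(message)
    raise ConnectionError("primary vendor unavailable")


def healthy_secondary(message):
    return "ok"


providers = [("primary", flaky_primary), ("secondary", healthy_secondary)]
breakers = {"primary": CircuitBreaker(window=4), "secondary": CircuitBreaker(window=4)}
results = [send_with_failover(f"msg-{i}", providers, breakers, now=float(i)) for i in range(6)]
```

All six messages go out through the secondary provider, and after four consecutive failures the primary is no longer called until its cooldown has passed.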

Scale Estimation

  • Traffic: 10M notifications/day.

  • Peak: 1M within 10 mins (Marketing), i.e. roughly 1,700 notifications/sec.

  • Storage:

    • Log retention: 1 year.

    • 1 KB per log * 10M/day = ~10 GB/day; * 365 days = ~3.65 TB/year (before replication).

    • Database: Cassandra (Write-heavy, TTL support).
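
The arithmetic behind these estimates, written out with decimal units (1 KB = 1,000 bytes) and without replication overhead:

```python
NOTIFICATIONS_PER_DAY = 10_000_000
PEAK_BURST = 1_000_000            # marketing blast
PEAK_WINDOW_SECONDS = 10 * 60
LOG_SIZE_BYTES = 1_000            # 1 KB per log entry
RETENTION_DAYS = 365
SECONDS_PER_DAY = 86_400

average_per_second = NOTIFICATIONS_PER_DAY / SECONDS_PER_DAY        # about 116/sec
peak_per_second = PEAK_BURST / PEAK_WINDOW_SECONDS                  # about 1,667/sec
storage_per_day_gb = NOTIFICATIONS_PER_DAY * LOG_SIZE_BYTES / 1e9   # 10 GB/day
storage_per_year_tb = storage_per_day_gb * RETENTION_DAYS / 1_000   # 3.65 TB/year

print(f"average: {average_per_second:.0f}/sec, peak: {peak_per_second:.0f}/sec")
print(f"storage: {storage_per_day_gb:.0f} GB/day, {storage_per_year_tb:.2f} TB/year")
```

The peak rate is more than ten times the daily average, which is why the queues and the outbound rate limiter are sized for the burst rather than for the average.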

Interview Tips

Q: "How do you handle users in Do Not Disturb (DND) mode?"

  • A: "Workers check the User Preferences DB before sending. If DND is active, park the message in a Delayed_Queue with a timestamp so it is processed after the DND window ends."
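
A minimal sketch of how a worker might compute the timestamp for the delayed queue, assuming the DND window is stored as two local times and that `now` is already in the user's time zone (both are assumptions, as is the function name):

```python
from datetime import datetime, time, timedelta


def next_delivery_time(now: datetime, dnd_start: time, dnd_end: time) -> datetime:
    """Return `now` if outside the DND window, else the moment the window ends."""
    current = now.time()
    if dnd_start <= dnd_end:
        in_dnd = dnd_start <= current < dnd_end
    else:
        # Window wraps past midnight, e.g. 22:00 to 07:00.
        in_dnd = current >= dnd_start or current < dnd_end
    if not in_dnd:
        return now
    end_today = datetime.combine(now.date(), dnd_end)
    return end_today if end_today > now else end_today + timedelta(days=1)


dnd_start, dnd_end = time(22, 0), time(7, 0)
late_evening = next_delivery_time(datetime(2024, 1, 1, 23, 30), dnd_start, dnd_end)
early_morning = next_delivery_time(datetime(2024, 1, 2, 3, 0), dnd_start, dnd_end)
midday = next_delivery_time(datetime(2024, 1, 2, 12, 0), dnd_start, dnd_end)
```

Both the 23:30 and the 03:00 messages are scheduled for 07:00 on January 2, while the midday message is sent immediately.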

Q: "How to prevent spamming users?"

  • A: "Implement a Frequency Cap in Redis (e.g., 'User X received 3 marketing emails today -> Drop 4th'). Critical alerts bypass this."
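
The frequency cap can be sketched with a per-user, per-day counter. A dictionary stands in for Redis here (where the usual pattern is a counter key incremented with INCR and given an expiry); the cap value, category names, and function name are hypothetical.

```python
from collections import defaultdict

DAILY_MARKETING_CAP = 3
# (user_id, "YYYY-MM-DD") -> marketing messages sent that day.
daily_counts = defaultdict(int)


def should_send(user_id: str, category: str, day: str) -> bool:
    """Apply the daily cap to marketing messages; critical alerts bypass it."""
    if category == "critical":
        return True
    key = (user_id, day)
    if daily_counts[key] >= DAILY_MARKETING_CAP:
        return False  # drop: the user has already hit today's cap
    daily_counts[key] += 1
    return True


marketing_decisions = [should_send("user-x", "marketing", "2024-01-01") for _ in range(4)]
critical_decision = should_send("user-x", "critical", "2024-01-01")
next_day_decision = should_send("user-x", "marketing", "2024-01-02")
```

The fourth marketing email of the day is dropped, the critical alert still goes out, and the counter starts fresh the next day.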

Q: "What if the worker crashes after sending to Twilio but before updating DB?"

  • A: "Idempotency! Store third_party_id in DB. If worker restarts, check if we already have a success ID for this notification_id before retrying Twilio. If Twilio's API supports idempotency keys, pass our notification_id to them."
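
A minimal sketch of that recovery path. `FakeVendor` is a hypothetical client that honors an idempotency key; it does not model any real vendor's API. The crash is simulated with a flag.

```python
class FakeVendor:
    """Hypothetical vendor client that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.deliveries = {}   # idempotency key -> vendor message ID
        self.messages_sent = 0

    def send(self, to: str, body: str, idempotency_key: str) -> str:
        if idempotency_key not in self.deliveries:
            self.messages_sent += 1
            self.deliveries[idempotency_key] = f"vendor-{self.messages_sent}"
        return self.deliveries[idempotency_key]


notification_log = {}  # notification_id -> third_party_id (stands in for the DB)


def deliver(vendor, notification_id, to, body, crash_before_db_write=False):
    # 1. Skip the vendor call if a success ID is already recorded.
    if notification_id in notification_log:
        return notification_log[notification_id]
    # 2. Pass our notification_id so the vendor can deduplicate retries.
    third_party_id = vendor.send(to, body, idempotency_key=notification_id)
    if crash_before_db_write:
        raise RuntimeError("worker crashed before recording third_party_id")
    # 3. Record the vendor's ID.
    notification_log[notification_id] = third_party_id
    return third_party_id


vendor = FakeVendor()
try:
    deliver(vendor, "n-42", "+14155550123", "Your code is 123456", crash_before_db_write=True)
except RuntimeError:
    pass  # the queue redelivers the message to another worker

third_party_id = deliver(vendor, "n-42", "+14155550123", "Your code is 123456")
```

The retry reaches the vendor a second time, but because the same key is passed, only one message is actually sent. Without vendor-side idempotency, the DB check alone cannot close this gap, since the crash happens before the DB is updated.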
