Notification Service
Difficulty: Medium
Topics: Pub-Sub, Message Queues, Third-party Integration, Rate Limiting
Time: 45 minutes
Companies: Amazon (SNS), Uber, LinkedIn, Airbnb
Problem Statement
Design a scalable notification service that sends notifications across multiple channels (Email, SMS, Push Notifications) to millions of users.
Requirements:
Functional:
Send email, SMS, and push notifications.
Support bulk notifications (e.g., "Marketing blast to 1M users").
Support transactional notifications (e.g., "OTP", "Order Confirmed").
Honor user preferences (per-channel opt-in/opt-out).
Non-Functional:
High Availability: 99.99%
Durability: No notification lost.
Throughput: 1M+ notifications/min.
Latency: Real-time for OTP (<5 s), eventual for marketing (<10 min).
Deduplication: Don't send duplicate alerts.
Architecture
Data Schema
User Preferences
Notification Log (Cassandra/DynamoDB)
Key Components
1. API & Validation
Rate Limit: Prevent abusive clients (e.g., 100 req/sec per service).
Validation: Check email format, phone number validity.
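The per-service rate limit above could be enforced with a token bucket. The following is a minimal in-process sketch, not a production design: the class name TokenBucket, the helper allow_request, and the burst capacity of 100 are assumptions made for illustration, and a multi-node deployment would need the counters in a shared store.

```python
import time


class TokenBucket:
    """Allows roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = capacity
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per calling service (service IDs here are hypothetical).
buckets: dict[str, TokenBucket] = {}


def allow_request(service_id: str, rate: float = 100, capacity: float = 100) -> bool:
    bucket = buckets.get(service_id)
    if bucket is None:
        bucket = buckets[service_id] = TokenBucket(rate, capacity)
    return bucket.allow()
```

A request for which allow_request returns False would typically be rejected with HTTP 429 so the caller can back off.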
2. Priority Handling (Message Queues)
Problem: Marketing blast (1M emails) blocks critical OTPs.
Solution: Separate Queues!
High_Priority_Queue: OTPs, Security Alerts (Dedicated Workers).
Low_Priority_Queue: Marketing, Monthly Statements.
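The queue split can be sketched as a simple router. This example uses in-memory queues from the Python standard library as stand-ins for real broker topics; the "type" field and the type names are assumptions for illustration.

```python
import queue

# Types treated as high priority, following the split described above.
HIGH_PRIORITY_TYPES = {"otp", "security_alert"}

high_priority_queue: queue.Queue = queue.Queue()
low_priority_queue: queue.Queue = queue.Queue()


def enqueue(notification: dict) -> str:
    """Route a notification by its type and return the name of the queue used."""
    if notification["type"] in HIGH_PRIORITY_TYPES:
        high_priority_queue.put(notification)
        return "high"
    low_priority_queue.put(notification)
    return "low"
```

Because the two queues are drained by separate worker pools, an OTP enqueued after a large marketing batch is still picked up immediately rather than waiting behind it.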
3. Deduplication and Exactly-Once Semantics (SDE-3 Concept)
Problem: Client retries or network blips create duplicate notifications.
Solution: Idempotency Keys.
The calling service generates a unique idempotency_key (UUID).
Check a fast distributed cache (Redis) for the key. TTL = 10 mins to 24 hours.
If exists -> Return 200 OK (already processed).
If not exists -> Process message, add key to Redis, return 200 OK.
Note on Kafka: Kafka provides "At-least-once" delivery by default. To prevent workers from sending duplicates when they crash before committing offsets, we must also check idempotency at the worker level before making the 3rd party API call.
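The idempotency check can be sketched with an in-memory stand-in for the cache. Note one deliberate difference from the steps listed above: this sketch claims the key before processing, which is one way to avoid two concurrent retries both passing a separate "check, then add" sequence. With Redis, a single SET with the NX and EX options gives the same claim-if-absent behavior. The names IdempotencyStore, claim, and handle_request are hypothetical.

```python
import time


class IdempotencyStore:
    """In-memory stand-in for a 'set if absent, with TTL' cache operation."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expiry: dict[str, float] = {}

    def claim(self, key: str) -> bool:
        """Return True only for the first caller to claim `key` within the TTL."""
        now = self.clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False                  # duplicate inside the TTL window
        self._expiry[key] = now + self.ttl
        return True


def handle_request(store: IdempotencyStore, idempotency_key: str, send) -> str:
    if not store.claim(idempotency_key):
        return "duplicate"                # caller still receives 200 OK
    send()
    return "processed"
```

The trade-off of claiming first is that a failed send leaves the key claimed, so a real implementation would release the key or record a status alongside it when processing fails.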
4. Failure Handling with Dead Letter Queues (DLQ)
Problem: What if an email address is permanently invalid or a message always crashes the worker (poison pill)?
Solution:
If a message fails processing N times (e.g., 3 retries), move it to a Dead Letter Queue (DLQ).
Set up alerts on the DLQ size.
Engineers can inspect the DLQ, fix the bug or data issue, and replay the messages back into the main queue.
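The retry-then-park flow above might look like the following. This is a sketch using in-memory deques in place of a real broker; the message fields "payload", "attempts", and "last_error" are assumptions for illustration, and MAX_ATTEMPTS uses the example value of 3.

```python
from collections import deque

MAX_ATTEMPTS = 3   # the "N" above; 3 is the example value

main_queue: deque = deque()
dead_letter_queue: deque = deque()


def process_next(handler) -> None:
    """Pop one message, run the handler, and requeue or dead-letter on failure."""
    message = main_queue.popleft()
    try:
        handler(message["payload"])
    except Exception as exc:
        message["attempts"] = message.get("attempts", 0) + 1
        message["last_error"] = repr(exc)       # kept for later inspection
        if message["attempts"] >= MAX_ATTEMPTS:
            dead_letter_queue.append(message)   # park the poison pill
        else:
            main_queue.append(message)          # retry later


def replay_dead_letters() -> int:
    """Move every DLQ message back to the main queue with a fresh counter."""
    count = len(dead_letter_queue)
    while dead_letter_queue:
        message = dead_letter_queue.popleft()
        message["attempts"] = 0
        main_queue.append(message)
    return count
```

An alert on len(dead_letter_queue), or its equivalent broker metric, is what tells engineers there is something to inspect before calling the replay step.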
5. Third-Party Integration (The "Hard" Part)
Challenges:
Vendor Downtime (Twilio down).
Rate Limits (Vendor blocks IP).
Latency (Vendor slow).
Mitigation:
Retry with Backoff: Exponential backoff for 5xx errors.
Circuit Breaker: Stop calling a provider once its error rate reaches a threshold (e.g., 50%), and fail over to a secondary provider (e.g., SendGrid -> AWS SES).
Rate Limiter (Outbound): Throttle requests to match Vendor's TPS limit.
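The backoff and circuit-breaker mitigations can be combined roughly as follows. This is a simplified sketch: the window size, cooldown, and jitter strategy are assumptions, the breaker closes fully after the cooldown instead of modelling a half-open probe state, and send_fn stands for whatever client wraps a given vendor.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter, bounded by `cap` seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


class CircuitBreaker:
    """Opens when the failure ratio over the last `window` calls reaches `threshold`."""

    def __init__(self, threshold=0.5, window=10, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.window = window
        self.cooldown = cooldown
        self.clock = clock
        self.results: list[bool] = []     # True = success
        self.opened_at = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: close and start again with an empty window.
            self.opened_at = None
            self.results.clear()
            return False
        return True

    def record(self, success: bool) -> None:
        self.results = (self.results + [success])[-self.window:]
        failures = self.results.count(False)
        if len(self.results) == self.window and failures / self.window >= self.threshold:
            self.opened_at = self.clock()


def send_with_failover(message, providers):
    """Try each (breaker, send_fn) pair in order, skipping providers whose breaker is open."""
    for breaker, send_fn in providers:
        if breaker.is_open():
            continue
        try:
            result = send_fn(message)
        except Exception:
            breaker.record(False)
            continue
        breaker.record(True)
        return result
    raise RuntimeError("all providers unavailable")
```

When send_with_failover raises, the worker would sleep for backoff_delay(attempt) before retrying the message, which spreads retries out instead of hammering a recovering vendor.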
Scale Estimation
Traffic: 10M notifications/day.
Peak: 1M within 10 mins (Marketing).
Storage:
Log retention: 1 year.
1KB per log * 10M/day * 365 = ~3.65 TB/year.
Database: Cassandra (Write-heavy, TTL support).
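The estimates above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 KB = 1,000 bytes) and evenly spread daily traffic for the average rate; the storage figure is raw data before any replication or compression.

```python
NOTIFICATIONS_PER_DAY = 10_000_000
LOG_BYTES = 1_000              # 1 KB per log entry, decimal units
RETENTION_DAYS = 365

avg_per_second = NOTIFICATIONS_PER_DAY / 86_400          # seconds per day
peak_per_second = 1_000_000 / (10 * 60)                  # 1M within 10 minutes
storage_tb = NOTIFICATIONS_PER_DAY * LOG_BYTES * RETENTION_DAYS / 1e12

print(f"average: {avg_per_second:.0f}/s")        # average: 116/s
print(f"peak:    {peak_per_second:.0f}/s")       # peak:    1667/s
print(f"storage: {storage_tb:.2f} TB/year")      # storage: 3.65 TB/year
```

The peak rate is about 14 times the daily average, which is why the queues and worker pools are sized for the marketing burst and not for the average.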
Interview Tips
Q: "How do you handle users in Do Not Disturb (DND) mode?"
A: "Workers check User Preferences DB before sending. If DND is active, park message in a Delayed_Queue with a timestamp to process later."
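The DND answer can be sketched as follows. This example assumes the DND window is stored as a start and end time of day in the user's local time, uses naive datetimes (time zones are ignored), and models the delayed queue as an in-memory min-heap; the names dnd_end and send_or_defer are hypothetical.

```python
import heapq
from datetime import datetime, time, timedelta


def dnd_end(now: datetime, start: time, end: time):
    """Return when the DND window ends if `now` falls inside it, else None."""
    t = now.time()
    if start <= end:                       # window within one day, e.g. 13:00-15:00
        if start <= t < end:
            return datetime.combine(now.date(), end)
        return None
    # Window wraps past midnight, e.g. 22:00-07:00.
    if t >= start:
        return datetime.combine(now.date() + timedelta(days=1), end)
    if t < end:
        return datetime.combine(now.date(), end)
    return None


delayed_queue: list = []    # min-heap of (deliver_at, sequence, message)
_sequence = 0


def send_or_defer(message, now: datetime, start: time, end: time, send) -> str:
    global _sequence
    deliver_at = dnd_end(now, start, end)
    if deliver_at is None:
        send(message)
        return "sent"
    _sequence += 1           # tie-breaker so messages themselves are never compared
    heapq.heappush(delayed_queue, (deliver_at, _sequence, message))
    return "deferred"
```

A scheduler would then pop entries whose deliver_at has passed and feed them back into the normal send path.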
Q: "How to prevent spamming users?"
A: "Implement a Frequency Cap in Redis (e.g., 'User X received 3 marketing emails today -> Drop 4th'). Critical alerts bypass this."
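A frequency cap of this kind could be sketched as below. The dictionary stands in for Redis counters (where an increment plus a key expiry would typically be used), and the category names and the cap of 3 are taken from the example or assumed for illustration.

```python
from collections import defaultdict
from datetime import date

DAILY_MARKETING_CAP = 3      # example value from the answer above

# Stand-in for per-user, per-day counters; real keys would carry a TTL.
_sent_today: dict = defaultdict(int)


def should_send(user_id: str, category: str, today: date) -> bool:
    """Apply the daily cap to marketing only; other categories bypass it."""
    if category != "marketing":
        return True
    key = (user_id, today.isoformat())
    if _sent_today[key] >= DAILY_MARKETING_CAP:
        return False
    _sent_today[key] += 1
    return True
```

Keying the counter by calendar day means the cap resets naturally at midnight without any cleanup job, provided old keys are expired.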
Q: "What if the worker crashes after sending to Twilio but before updating DB?"
A: "Idempotency! Store third_party_id in DB. If worker restarts, check if we already have a success ID for this notification_id before retrying Twilio. If Twilio's API supports idempotency keys, pass our notification_id to them."