
Pastebin

Problem Statement

Design a service like Pastebin where users can store plain text and share it via a unique URL. The service should support temporary pastes that expire after a certain time.


Requirements

Functional Requirements

  1. Users can paste text and get a unique shareable URL

  2. Users can retrieve paste content via URL

  3. Pastes can have expiration time (1 hour, 1 day, 1 month, never)

  4. Support for custom short URLs (optional)

  5. Basic analytics (view count)

Non-Functional Requirements

  1. High availability: 99.9% uptime

  2. Low latency: < 100ms for read operations

  3. Scalability: Handle millions of pastes per day

  4. Durability: Pastes should not be lost

  5. Security: Private pastes should be accessible only via URL


Capacity Estimation

Traffic Estimates

  • Writes: 1M new pastes/day = ~12 pastes/sec

  • Reads: 10:1 read-to-write ratio = 120 reads/sec

  • Peak traffic: 5x average = 60 writes/sec, 600 reads/sec

Storage Estimates

  • Average paste size: 10 KB (text)

  • Daily storage: 1M × 10 KB = 10 GB/day

  • 5-year storage (assuming 80% pastes expire):

    • Total generated: 1M × 365 × 5 = 1.825 billion pastes

    • Retained (20%): 365 million pastes

    • Storage needed: 365M × 10 KB = 3.65 TB

Bandwidth Estimates

  • Write bandwidth: 12 pastes/sec × 10 KB = 120 KB/sec

  • Read bandwidth: 120 reads/sec × 10 KB = 1.2 MB/sec

URL Key Size

  • Unique URLs needed: ~2 billion (with buffer)

  • Character set: [a-z, A-Z, 0-9] = 62 characters

  • Key length: 7 characters gives 62^7 ≈ 3.5 trillion unique URLs (more than sufficient)
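The back-of-the-envelope numbers above can be verified with a few lines of arithmetic; this is just the estimates from the text, restated as a runnable sanity check.

```python
# Traffic estimates
writes_per_day = 1_000_000
writes_per_sec = writes_per_day / 86_400        # ~11.6, rounded to 12 in the text
reads_per_sec = writes_per_sec * 10             # 10:1 read-to-write ratio
peak_writes = writes_per_sec * 5                # ~60/sec at 5x peak
peak_reads = reads_per_sec * 5                  # ~600/sec at 5x peak

# Storage estimates
avg_paste_kb = 10
daily_storage_gb = writes_per_day * avg_paste_kb / 1_000_000       # 10 GB/day
total_pastes_5y = writes_per_day * 365 * 5                         # 1.825 billion
retained = int(total_pastes_5y * 0.20)                             # 365 million kept
storage_tb = retained * avg_paste_kb / 1_000_000_000               # 3.65 TB

# Key space: 7 Base62 characters cover ~3.5 trillion keys,
# far above the ~2 billion needed even with a generous buffer.
key_space = 62 ** 7
```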


API Design

1. Create Paste

2. Retrieve Paste

3. Delete Paste (Owner Only)
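The three operations can be sketched as a minimal in-memory service. The endpoint shapes, field names, `owner_key` ownership check, and domain name are illustrative assumptions, not part of the original design; key generation is a placeholder here and is covered properly in the Key Generation Service section.

```python
import secrets
import time

class PasteService:
    """Minimal in-memory sketch of the create/retrieve/delete API."""

    def __init__(self):
        self._pastes = {}  # short_key -> paste record

    def create_paste(self, content, owner_key, ttl_seconds=None):
        short_key = secrets.token_urlsafe(5)[:7]   # placeholder key generation
        self._pastes[short_key] = {
            "content": content,
            "owner_key": owner_key,
            "expires_at": time.time() + ttl_seconds if ttl_seconds else None,
            "views": 0,
        }
        return {"short_key": short_key,
                "url": f"https://paste.example.com/{short_key}"}

    def get_paste(self, short_key):
        paste = self._pastes.get(short_key)
        if paste is None:
            return None                            # 404
        if paste["expires_at"] and paste["expires_at"] < time.time():
            del self._pastes[short_key]            # lazy deletion on read
            return None
        paste["views"] += 1                        # basic view-count analytics
        return paste["content"]

    def delete_paste(self, short_key, owner_key):
        paste = self._pastes.get(short_key)
        if paste and paste["owner_key"] == owner_key:   # owner-only delete
            del self._pastes[short_key]
            return True
        return False
```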


High-Level Design

Architecture Diagram


Detailed Component Design

1. Key Generation Service


Option A: Pre-generated Key Pool

  • Pre-generate 1M keys in batches

  • Store in database with used flag

  • API servers fetch and mark as used

  • Pros: Fast, no collision

  • Cons: Extra storage for key pool
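One way to sketch Option A, with an in-memory counter standing in for the database key table and its used flag; the batch size and names are illustrative assumptions.

```python
import itertools
import string

BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def to_base62(n, width=7):
    """Encode an integer as a fixed-width 7-char Base62 key."""
    chars = []
    for _ in range(width):
        n, r = divmod(n, 62)
        chars.append(BASE62[r])
    return "".join(reversed(chars))

class KeyPool:
    """Pre-generates keys in batches; API servers fetch a batch and mark it used."""

    def __init__(self, batch_size=1000):
        self._counter = itertools.count()   # stands in for a DB sequence
        self._batch_size = batch_size
        self._local = []                    # keys cached on the API server

    def _refill(self):
        # In production this is one transactional "fetch and mark used" query,
        # so two API servers never receive the same key.
        self._local = [to_base62(next(self._counter))
                       for _ in range(self._batch_size)]

    def next_key(self):
        if not self._local:
            self._refill()
        return self._local.pop()
```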

Option B: Hash-based Generation

Hash the paste content (plus a salt such as a timestamp or user ID), Base62-encode the digest, and take the first 7 characters. Check for collisions (rare, but handle via retry with a new salt).
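Option B can be sketched as below. The choice of SHA-256 and the salt-and-retry loop are assumptions; any stable hash works, and the retry handles the rare collision.

```python
import hashlib
import string

BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def short_key(content: str, existing: set, max_retries: int = 10) -> str:
    """Derive a 7-char Base62 key from a content hash; retry with a salt on collision."""
    for salt in range(max_retries):
        digest = hashlib.sha256(f"{salt}:{content}".encode()).digest()
        n = int.from_bytes(digest[:8], "big")
        key = ""
        for _ in range(7):          # take 7 Base62 digits of the hash
            n, r = divmod(n, 62)
            key += BASE62[r]
        if key not in existing:     # 'existing' stands in for a DB uniqueness check
            existing.add(key)
            return key
    raise RuntimeError("could not find a collision-free key")
```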

2. Database Schema

Metadata Table (PostgreSQL)
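A plausible metadata table might look like the following; the column names are illustrative assumptions, and SQLite is used here only so the sketch is executable (the design calls for PostgreSQL). Note the content itself lives in S3, referenced by `s3_path`.

```python
import sqlite3

# Hypothetical metadata schema: blobs live in S3, the row only points at them.
DDL = """
CREATE TABLE pastes (
    short_key   TEXT PRIMARY KEY,      -- 7-char Base62 key
    s3_path     TEXT NOT NULL,         -- pointer to the content blob in S3
    created_at  TIMESTAMP NOT NULL,
    expires_at  TIMESTAMP,             -- NULL = never expires
    view_count  INTEGER DEFAULT 0,
    owner_id    TEXT                   -- NULL for anonymous pastes
);
CREATE INDEX idx_pastes_expires ON pastes (expires_at);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```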

Why S3 for Content, Not DB?

  • Cost: S3 cheaper for large blobs

  • Scalability: DB optimized for metadata, S3 for objects

  • CDN integration: Direct CloudFront integration with S3

3. Caching Strategy

Redis Cache Cluster

Cache Eviction: LRU (Least Recently Used)

SDE-3 Optimization: Consistent Hashing for Redis

As the cache grows, a single Redis instance isn't enough; we need a Redis cluster.

  • Problem: If we add/remove Redis nodes and use hash(key) % N, almost all keys will map to new servers, causing a massive cache miss spike (Cache Thundering Herd) that could bring down the DB.

  • Solution: Use Consistent Hashing (e.g., ring-based routing). When a node is added/removed, only 1/N keys are remapped. Add "virtual nodes" to ensure even distribution across physical nodes of different capacities.
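A minimal sketch of the ring described above, with virtual nodes for even distribution. The vnode count and MD5 as the ring hash are assumptions; production clusters typically use a library or Redis Cluster's own slot mechanism.

```python
import bisect
import hashlib

def _h(s: str) -> int:
    """Hash a string to a point on the ring (MD5 chosen arbitrarily)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Ring-based routing: keys map to the next node clockwise on the ring."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []                          # sorted (hash, node) points
        for node in nodes:
            self.add_node(node, vnodes)

    def add_node(self, node, vnodes=100):
        # Virtual nodes spread each physical node over many ring positions.
        for i in range(vnodes):
            bisect.insort(self._ring, (_h(f"{node}#{i}"), node))

    def get_node(self, key):
        h = _h(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]
```

Unlike `hash(key) % N`, adding a fourth node below remaps only roughly a quarter of the keys instead of nearly all of them.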

4. Expiration & Cleanup

Lazy Deletion

Active Cleanup (Batch Job)

SDE-3 Optimization: Avoiding DB Scans for Cleanup

  • Problem: Running a SELECT ... WHERE expires_at < NOW() on a huge table is very slow, even with an index, and can cause DB CPU spikes.

  • Solution 1 (S3 Lifecycle Policies): Since we store content in S3, use AWS S3 Object Expiration Lifecycle rules. When uploading, tag the object or put it in a folder corresponding to its expiry bucket (e.g., 1h/, 1d/). S3 deletes it automatically.

  • Solution 2 (Time-Series DB / Redis TTL): Metadata can be stored in a DB that natively supports TTL (like DynamoDB TTL or Cassandra) rather than PostgreSQL. If using Postgres, use Table Partitioning by day/week and just DROP PARTITION for old data instead of row-level deletes.
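Solution 1 hinges on routing each blob into a prefix whose S3 lifecycle rule matches its TTL, so no database scan is ever needed. A sketch of that routing; the prefix names and TTL buckets are assumptions.

```python
# Map each supported TTL (seconds) to an S3 key prefix. One lifecycle rule
# per prefix (e.g. expire objects under "1h/" after 1 hour) deletes blobs
# automatically, with no SELECT ... WHERE expires_at < NOW() scan.
EXPIRY_PREFIXES = {
    3_600:      "1h/",
    86_400:     "1d/",
    2_592_000:  "1mo/",
    None:       "never/",   # no lifecycle rule on this prefix
}

def s3_key_for(short_key, ttl_seconds):
    """Route a paste's blob into the expiry bucket matching its TTL."""
    return EXPIRY_PREFIXES.get(ttl_seconds, "never/") + short_key
```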


Read/Write Flow

Write Flow

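The write path (reserve a pre-generated key, upload the blob to S3, record the metadata row, return the shareable URL) as a minimal sketch; dicts stand in for the stores and the domain name is an assumption.

```python
import time

class WritePath:
    """Write flow sketch: fetch a key, upload the blob, insert metadata."""

    def __init__(self, keys, db, s3):
        self.keys = keys            # iterator over pre-generated keys
        self.db, self.s3 = db, s3   # dict stand-ins for Postgres and S3

    def create(self, content, ttl_seconds=None):
        key = next(self.keys)                        # 1. reserve a unique key
        path = f"pastes/{key}"
        self.s3[path] = content                      # 2. store blob in S3
        self.db[key] = {                             # 3. insert metadata row
            "s3_path": path,
            "created_at": time.time(),
            "expires_at": time.time() + ttl_seconds if ttl_seconds else None,
        }
        return f"https://paste.example.com/{key}"    # 4. return shareable URL
```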

Read Flow

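The read path is cache-aside: check Redis first, fall back to the metadata DB and S3 on a miss, then populate the cache. A sketch with dict stand-ins for each store:

```python
class ReadPath:
    """Cache-aside read flow: Redis -> metadata DB -> S3, filling the cache on a miss."""

    def __init__(self, cache, db, s3):
        self.cache, self.db, self.s3 = cache, db, s3   # dict stand-ins

    def get(self, short_key):
        content = self.cache.get(short_key)
        if content is not None:
            return content                     # cache hit: < 100ms path
        meta = self.db.get(short_key)
        if meta is None:
            return None                        # unknown key: 404
        content = self.s3[meta["s3_path"]]     # fetch blob from object store
        self.cache[short_key] = content        # populate cache for next read
        return content
```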

Scalability & Optimization

1. Database Sharding

  • Shard key: hash(short_key) % num_shards

  • Distributes load evenly

2. Read Replicas

  • Master for writes

  • Read replicas for GET requests (view count can be eventually consistent)

3. CDN for Static Content

  • Serve popular pastes from edge locations

  • Cache-Control headers: max-age=3600

4. Rate Limiting

  • Prevent abuse: 10 pastes/hour per IP

  • Use Redis for counters
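With Redis this is typically a fixed window built from INCR plus EXPIRE on a per-IP counter key. A sketch of the same logic, with a dict standing in for Redis; the limit and window match the 10 pastes/hour figure above.

```python
import time

class FixedWindowLimiter:
    """Fixed-window rate limiter; a dict stands in for Redis INCR/EXPIRE."""

    def __init__(self, limit=10, window_seconds=3600):
        self.limit = limit
        self.window = window_seconds
        self._counters = {}   # (ip, window_index) -> request count

    def allow(self, ip, now=None):
        now = time.time() if now is None else now
        bucket = (ip, int(now // self.window))   # key expires with the window
        count = self._counters.get(bucket, 0) + 1
        self._counters[bucket] = count
        return count <= self.limit
```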


Security Considerations

1. Private Pastes

  • Generate cryptographically strong keys (16+ chars)

  • No indexing or listing endpoints
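Such keys are straightforward with the standard library's `secrets` module; the 22-character length is an assumption that comfortably exceeds the 16+ character guideline (~131 bits of entropy at Base62).

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits   # Base62

def private_key(length=22):
    """Cryptographically strong URL key; ~131 bits at 22 Base62 chars."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```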

2. Input Validation

  • Limit paste size: Max 1 MB

  • Sanitize HTML to prevent XSS

3. DDoS Protection

  • CloudFlare or AWS Shield

  • Rate limiting at LB level


Trade-offs & Alternatives

| Aspect | Choice | Alternative | Trade-off |
| --- | --- | --- | --- |
| Content Storage | S3 | Database | Cost vs simplicity |
| Key Generation | Pre-generated pool | Hash-based | Storage vs collision risk |
| Caching | Redis | No cache | Cost vs latency |
| Expiration | Lazy + Batch cleanup | Active TTL in DB | Complexity vs storage waste |


Extensions

1. Syntax Highlighting

  • Detect language, store metadata

  • Client-side rendering with libraries

2. Paste History (User Accounts)

3. Analytics Dashboard

  • Track views over time (TimescaleDB or InfluxDB)

  • Geographic distribution (from CDN logs)


Interview Discussion Points

Q: Why not store everything in database?

  • S3 is cheaper for blob storage ($0.023/GB vs $0.10+/GB for DB)

  • DB performance degrades with large rows

  • S3 integrates with CDN (CloudFront)

Q: How to handle 1M pastes/sec?

  • Horizontal scaling of API servers

  • Database sharding (by short_key hash)

  • Aggressive caching (Redis cluster)

  • CDN for read-heavy workload

Q: What if key generation service goes down?

  • Pre-generated pool: API servers cache locally (1000 keys)

  • Failover: Multiple key generation services (standby)

  • Fallback: Hash-based generation with collision check
