
Pastebin

Problem Statement

Design a service like Pastebin where users can store plain text and share it via a unique URL. The service should support temporary pastes that expire after a certain time.


Requirements

Functional Requirements

  1. Users can paste text and get a unique shareable URL

  2. Users can retrieve paste content via URL

  3. Pastes can have expiration time (1 hour, 1 day, 1 month, never)

  4. Support for custom short URLs (optional)

  5. Basic analytics (view count)

Non-Functional Requirements

  1. High availability: 99.9% uptime

  2. Low latency: < 100ms for read operations

  3. Scalability: Handle millions of pastes per day

  4. Durability: Pastes should not be lost

  5. Security: Private pastes should be accessible only via URL


Capacity Estimation

Traffic Estimates

  • Writes: 1M new pastes/day = ~12 pastes/sec

  • Reads: 10:1 read-to-write ratio = 120 reads/sec

  • Peak traffic: 5x average = 60 writes/sec, 600 reads/sec

Storage Estimates

  • Average paste size: 10 KB (text)

  • Daily storage: 1M × 10 KB = 10 GB/day

  • 5-year storage (assuming 80% pastes expire):

    • Total generated: 1M × 365 × 5 = 1.825 billion pastes

    • Retained (20%): 365 million pastes

    • Storage needed: 365M × 10 KB = 3.65 TB

Bandwidth Estimates

  • Write bandwidth: 12 pastes/sec × 10 KB = 120 KB/sec

  • Read bandwidth: 120 reads/sec × 10 KB = 1.2 MB/sec

URL Key Size

  • Unique URLs needed: ~2 billion (with buffer)

  • Character set: [a-z, A-Z, 0-9] = 62 characters

  • Key length: 7 characters gives 62^7 ≈ 3.5 trillion unique URLs (more than sufficient)
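The back-of-the-envelope numbers above can be verified with a few lines of arithmetic; this is just the estimates from the text, restated as a runnable sanity check.

```python
# Traffic estimates
writes_per_day = 1_000_000
writes_per_sec = writes_per_day / 86_400        # ~11.6, rounded to 12 in the text
reads_per_sec = writes_per_sec * 10             # 10:1 read-to-write ratio
peak_writes = writes_per_sec * 5                # ~60/sec at 5x peak
peak_reads = reads_per_sec * 5                  # ~600/sec at 5x peak

# Storage estimates
avg_paste_kb = 10
daily_storage_gb = writes_per_day * avg_paste_kb / 1_000_000       # 10 GB/day
total_pastes_5y = writes_per_day * 365 * 5                         # 1.825 billion
retained = int(total_pastes_5y * 0.20)                             # 365 million kept
storage_tb = retained * avg_paste_kb / 1_000_000_000               # 3.65 TB

# Key space: 7 Base62 characters cover ~3.5 trillion keys,
# far above the ~2 billion needed even with a generous buffer.
key_space = 62 ** 7
```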


API Design

1. Create Paste

2. Retrieve Paste

3. Delete Paste (Owner Only)
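The three operations can be sketched as a minimal in-memory service. The endpoint shapes, field names, `owner_key` ownership check, and domain name are illustrative assumptions, not part of the original design; key generation is a placeholder here and is covered properly in the Key Generation Service section.

```python
import secrets
import time

class PasteService:
    """Minimal in-memory sketch of the create/retrieve/delete API."""

    def __init__(self):
        self._pastes = {}  # short_key -> paste record

    def create_paste(self, content, owner_key, ttl_seconds=None):
        short_key = secrets.token_urlsafe(5)[:7]   # placeholder key generation
        self._pastes[short_key] = {
            "content": content,
            "owner_key": owner_key,
            "expires_at": time.time() + ttl_seconds if ttl_seconds else None,
            "views": 0,
        }
        return {"short_key": short_key,
                "url": f"https://paste.example.com/{short_key}"}

    def get_paste(self, short_key):
        paste = self._pastes.get(short_key)
        if paste is None:
            return None                            # 404
        if paste["expires_at"] and paste["expires_at"] < time.time():
            del self._pastes[short_key]            # lazy deletion on read
            return None
        paste["views"] += 1                        # basic view-count analytics
        return paste["content"]

    def delete_paste(self, short_key, owner_key):
        paste = self._pastes.get(short_key)
        if paste and paste["owner_key"] == owner_key:   # owner-only delete
            del self._pastes[short_key]
            return True
        return False
```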


High-Level Design

Architecture Diagram


Detailed Component Design

1. Key Generation Service


Option A: Pre-generated Key Pool

  • Pre-generate 1M keys in batches

  • Store in database with used flag

  • API servers fetch and mark as used

  • Pros: Fast, no collision

  • Cons: Extra storage for key pool
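One way to sketch Option A, with an in-memory counter standing in for the database key table and its used flag; the batch size and names are illustrative assumptions.

```python
import itertools
import string

BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def to_base62(n, width=7):
    """Encode an integer as a fixed-width 7-char Base62 key."""
    chars = []
    for _ in range(width):
        n, r = divmod(n, 62)
        chars.append(BASE62[r])
    return "".join(reversed(chars))

class KeyPool:
    """Pre-generates keys in batches; API servers fetch a batch and mark it used."""

    def __init__(self, batch_size=1000):
        self._counter = itertools.count()   # stands in for a DB sequence
        self._batch_size = batch_size
        self._local = []                    # keys cached on the API server

    def _refill(self):
        # In production this is one transactional "fetch and mark used" query,
        # so two API servers never receive the same key.
        self._local = [to_base62(next(self._counter))
                       for _ in range(self._batch_size)]

    def next_key(self):
        if not self._local:
            self._refill()
        return self._local.pop()
```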

Option B: Hash-based Generation

Hash the paste content (plus a salt such as a timestamp or user ID), Base62-encode the digest, and take the first 7 characters. Check for collisions (rare, but handle via retry with a new salt).
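Option B can be sketched as below. The choice of SHA-256 and the salt-and-retry loop are assumptions; any stable hash works, and the retry handles the rare collision.

```python
import hashlib
import string

BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def short_key(content: str, existing: set, max_retries: int = 10) -> str:
    """Derive a 7-char Base62 key from a content hash; retry with a salt on collision."""
    for salt in range(max_retries):
        digest = hashlib.sha256(f"{salt}:{content}".encode()).digest()
        n = int.from_bytes(digest[:8], "big")
        key = ""
        for _ in range(7):          # take 7 Base62 digits of the hash
            n, r = divmod(n, 62)
            key += BASE62[r]
        if key not in existing:     # 'existing' stands in for a DB uniqueness check
            existing.add(key)
            return key
    raise RuntimeError("could not find a collision-free key")
```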

2. Database Schema

Metadata Table (PostgreSQL)
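A plausible metadata table might look like the following; the column names are illustrative assumptions, and SQLite is used here only so the sketch is executable (the design calls for PostgreSQL). Note the content itself lives in S3, referenced by `s3_path`.

```python
import sqlite3

# Hypothetical metadata schema: blobs live in S3, the row only points at them.
DDL = """
CREATE TABLE pastes (
    short_key   TEXT PRIMARY KEY,      -- 7-char Base62 key
    s3_path     TEXT NOT NULL,         -- pointer to the content blob in S3
    created_at  TIMESTAMP NOT NULL,
    expires_at  TIMESTAMP,             -- NULL = never expires
    view_count  INTEGER DEFAULT 0,
    owner_id    TEXT                   -- NULL for anonymous pastes
);
CREATE INDEX idx_pastes_expires ON pastes (expires_at);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```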

Why S3 for Content, Not DB?

  • Cost: S3 cheaper for large blobs

  • Scalability: DB optimized for metadata, S3 for objects

  • CDN integration: Direct CloudFront integration with S3

3. Caching Strategy

Redis Cache Cluster

Cache Eviction: LRU (Least Recently Used)

SDE-3 Optimization: Consistent Hashing for Redis

As the cache grows, a single Redis instance isn't enough; we need a Redis cluster.

  • Problem: If we add/remove Redis nodes and use hash(key) % N, almost all keys will map to new servers, causing a massive cache miss spike (Cache Thundering Herd) that could bring down the DB.

  • Solution: Use Consistent Hashing (e.g., ring-based routing). When a node is added/removed, only 1/N keys are remapped. Add "virtual nodes" to ensure even distribution across physical nodes of different capacities.
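A minimal sketch of the ring described above, with virtual nodes for even distribution. The vnode count and MD5 as the ring hash are assumptions; production clusters typically use a library or Redis Cluster's own slot mechanism.

```python
import bisect
import hashlib

def _h(s: str) -> int:
    """Hash a string to a point on the ring (MD5 chosen arbitrarily)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Ring-based routing: keys map to the next node clockwise on the ring."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []                          # sorted (hash, node) points
        for node in nodes:
            self.add_node(node, vnodes)

    def add_node(self, node, vnodes=100):
        # Virtual nodes spread each physical node over many ring positions.
        for i in range(vnodes):
            bisect.insort(self._ring, (_h(f"{node}#{i}"), node))

    def get_node(self, key):
        h = _h(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]
```

Unlike `hash(key) % N`, adding a fourth node below remaps only roughly a quarter of the keys instead of nearly all of them.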

4. Expiration & Cleanup

Lazy Deletion

Active Cleanup (Batch Job)

SDE-3 Optimization: Avoiding DB Scans for Cleanup

  • Problem: Running a SELECT ... WHERE expires_at < NOW() on a huge table is very slow, even with an index, and can cause DB CPU spikes.

  • Solution 1 (S3 Lifecycle Policies): Since we store content in S3, use AWS S3 Object Expiration Lifecycle rules. When uploading, tag the object or put it in a folder corresponding to its expiry bucket (e.g., 1h/, 1d/). S3 deletes it automatically.

  • Solution 2 (Time-Series DB / Redis TTL): Metadata can be stored in a DB that natively supports TTL (like DynamoDB TTL or Cassandra) rather than PostgreSQL. If using Postgres, use Table Partitioning by day/week and just DROP PARTITION for old data instead of row-level deletes.
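Solution 1 hinges on routing each blob into a prefix whose S3 lifecycle rule matches its TTL, so no database scan is ever needed. A sketch of that routing; the prefix names and TTL buckets are assumptions.

```python
# Map each supported TTL (seconds) to an S3 key prefix. One lifecycle rule
# per prefix (e.g. expire objects under "1h/" after 1 hour) deletes blobs
# automatically, with no SELECT ... WHERE expires_at < NOW() scan.
EXPIRY_PREFIXES = {
    3_600:      "1h/",
    86_400:     "1d/",
    2_592_000:  "1mo/",
    None:       "never/",   # no lifecycle rule on this prefix
}

def s3_key_for(short_key, ttl_seconds):
    """Route a paste's blob into the expiry bucket matching its TTL."""
    return EXPIRY_PREFIXES.get(ttl_seconds, "never/") + short_key
```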


Read/Write Flow

Write Flow

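The write path (reserve a pre-generated key, upload the blob to S3, record the metadata row, return the shareable URL) as a minimal sketch; dicts stand in for the stores and the domain name is an assumption.

```python
import time

class WritePath:
    """Write flow sketch: fetch a key, upload the blob, insert metadata."""

    def __init__(self, keys, db, s3):
        self.keys = keys            # iterator over pre-generated keys
        self.db, self.s3 = db, s3   # dict stand-ins for Postgres and S3

    def create(self, content, ttl_seconds=None):
        key = next(self.keys)                        # 1. reserve a unique key
        path = f"pastes/{key}"
        self.s3[path] = content                      # 2. store blob in S3
        self.db[key] = {                             # 3. insert metadata row
            "s3_path": path,
            "created_at": time.time(),
            "expires_at": time.time() + ttl_seconds if ttl_seconds else None,
        }
        return f"https://paste.example.com/{key}"    # 4. return shareable URL
```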

Read Flow

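The read path is cache-aside: check Redis first, fall back to the metadata DB and S3 on a miss, then populate the cache. A sketch with dict stand-ins for each store:

```python
class ReadPath:
    """Cache-aside read flow: Redis -> metadata DB -> S3, filling the cache on a miss."""

    def __init__(self, cache, db, s3):
        self.cache, self.db, self.s3 = cache, db, s3   # dict stand-ins

    def get(self, short_key):
        content = self.cache.get(short_key)
        if content is not None:
            return content                     # cache hit: < 100ms path
        meta = self.db.get(short_key)
        if meta is None:
            return None                        # unknown key: 404
        content = self.s3[meta["s3_path"]]     # fetch blob from object store
        self.cache[short_key] = content        # populate cache for next read
        return content
```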

Scalability & Optimization

1. Database Sharding

  • Shard key: hash(short_key) % num_shards

  • Distributes load evenly

2. Read Replicas

  • Master for writes

  • Read replicas for GET requests (view count can be eventually consistent)

3. CDN for Static Content

  • Serve popular pastes from edge locations

  • Cache-Control headers: max-age=3600

4. Rate Limiting

  • Prevent abuse: 10 pastes/hour per IP

  • Use Redis for counters
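With Redis this is typically a fixed window built from INCR plus EXPIRE on a per-IP counter key. A sketch of the same logic, with a dict standing in for Redis; the limit and window match the 10 pastes/hour figure above.

```python
import time

class FixedWindowLimiter:
    """Fixed-window rate limiter; a dict stands in for Redis INCR/EXPIRE."""

    def __init__(self, limit=10, window_seconds=3600):
        self.limit = limit
        self.window = window_seconds
        self._counters = {}   # (ip, window_index) -> request count

    def allow(self, ip, now=None):
        now = time.time() if now is None else now
        bucket = (ip, int(now // self.window))   # key expires with the window
        count = self._counters.get(bucket, 0) + 1
        self._counters[bucket] = count
        return count <= self.limit
```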


Security Considerations

1. Private Pastes

  • Generate cryptographically strong keys (16+ chars)

  • No indexing or listing endpoints
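Such keys are straightforward with the standard library's `secrets` module; the 22-character length is an assumption that comfortably exceeds the 16+ character guideline (~131 bits of entropy at Base62).

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits   # Base62

def private_key(length=22):
    """Cryptographically strong URL key; ~131 bits at 22 Base62 chars."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```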

2. Input Validation

  • Limit paste size: Max 1 MB

  • Sanitize HTML to prevent XSS

3. DDoS Protection

  • CloudFlare or AWS Shield

  • Rate limiting at LB level


Trade-offs & Alternatives

| Aspect | Choice | Alternative | Trade-off |
| --- | --- | --- | --- |
| Content Storage | S3 | Database | Cost vs simplicity |
| Key Generation | Pre-generated pool | Hash-based | Storage vs collision risk |
| Caching | Redis | No cache | Cost vs latency |
| Expiration | Lazy + Batch cleanup | Active TTL in DB | Complexity vs storage waste |


Extensions

1. Syntax Highlighting

  • Detect language, store metadata

  • Client-side rendering with libraries

2. Paste History (User Accounts)

3. Analytics Dashboard

  • Track views over time (TimescaleDB or InfluxDB)

  • Geographic distribution (from CDN logs)


Interview Discussion Points

Q: Why not store everything in database?

  • S3 is cheaper for blob storage ($0.023/GB vs $0.10+/GB for DB)

  • DB performance degrades with large rows

  • S3 integrates with CDN (CloudFront)

Q: How to handle 1M pastes/sec?

  • Horizontal scaling of API servers

  • Database sharding (by short_key hash)

  • Aggressive caching (Redis cluster)

  • CDN for read-heavy workload

Q: What if key generation service goes down?

  • Pre-generated pool: API servers cache locally (1000 keys)

  • Failover: Multiple key generation services (standby)

  • Fallback: Hash-based generation with collision check
