# Pastebin ## Problem Statement Design a service like Pastebin where users can store plain text and share it via a unique URL. The service should support temporary pastes that expire after a certain time. *** ## Requirements ### Functional Requirements 1. Users can paste text and get a unique shareable URL 2. Users can retrieve paste content via URL 3. Pastes can have expiration time (1 hour, 1 day, 1 month, never) 4. Support for custom short URLs (optional) 5. Basic analytics (view count) ### Non-Functional Requirements 1. **High availability**: 99.9% uptime 2. **Low latency**: < 100ms for read operations 3. **Scalability**: Handle millions of pastes per day 4. **Durability**: Pastes should not be lost 5. **Security**: Private pastes should be accessible only via URL *** ## Capacity Estimation ### Traffic Estimates * **Writes**: 1M new pastes/day = \~12 pastes/sec * **Reads**: 10:1 read-to-write ratio = 120 reads/sec * **Peak traffic**: 5x average = 60 writes/sec, 600 reads/sec ### Storage Estimates * **Average paste size**: 10 KB (text) * **Daily storage**: 1M × 10 KB = 10 GB/day * **5-year storage** (assuming 80% pastes expire): * Total generated: 1M × 365 × 5 = 1.825 billion pastes * Retained (20%): 365 million pastes * Storage needed: 365M × 10 KB = **3.65 TB** ### Bandwidth Estimates * **Write bandwidth**: 12 pastes/sec × 10 KB = 120 KB/sec * **Read bandwidth**: 120 reads/sec × 10 KB = 1.2 MB/sec ### URL Key Size * **Unique URLs needed**: \~2 billion (with buffer) * **Character set**: \[a-z, A-Z, 0-9] = 62 characters * **Key length**: 62^7 = 3.5 trillion unique URLs (7 characters sufficient) *** ## API Design ### 1. Create Paste ```http POST /api/v1/pastes Content-Type: application/json { "content": "Hello, World!", "expiration": "1d", // 1h, 1d, 1m, never "customUrl": "my-paste", // optional "isPrivate": false } Response: { "shortUrl": "https://paste.in/aB3dE5f", "expiresAt": "2024-02-10T12:00:00Z" } ``` ### 2. Retrieve Paste ```http GET /api/v1/pastes/{shortKey} Response: { "content": "Hello, World!", "createdAt": "2024-02-09T12:00:00Z", "expiresAt": "2024-02-10T12:00:00Z", "viewCount": 42 } ``` ### 3. Delete Paste (Owner Only) ```http DELETE /api/v1/pastes/{shortKey} Authorization: Bearer ``` *** ## High-Level Design ### Architecture Diagram {% @mermaid/diagram content="graph TB subgraph "Client Layer" User\[User Browser] end ``` subgraph "CDN/Load Balancer" CDN[CDN CloudFront] LB[Load Balancer] end subgraph "Application Layer" API1[API Server 1] API2[API Server 2] API3[API Server N] end subgraph "Caching Layer" Redis[(Redis Cache
Hot Pastes)] end subgraph "Storage Layer" DB[(PostgreSQL
Metadata)] S3[(Object Storage
S3/Blob)] end subgraph "Services" KeyGen[Key Generation
Service] Cleanup[Cleanup Service
Expired Pastes] end User -->|HTTPS| CDN CDN -->|Cache Miss| LB LB --> API1 & API2 & API3 API1 & API2 & API3 -->|Check Cache| Redis API1 & API2 & API3 -->|Read/Write Metadata| DB API1 & API2 & API3 -->|Store Content| S3 API1 & API2 & API3 -->|Get Unique Key| KeyGen Cleanup -.->|Delete Expired| DB Cleanup -.->|Delete Objects| S3" %} ``` *** ## Detailed Component Design ### 1. Key Generation Service #### Option A: Pre-generate Keys (Recommended) {% @mermaid/diagram content="graph LR KeyGen\[Key Generator] -->|Generate Batch| KeyDB\[(Key Pool DB)] API\[API Server] -->|Get Unused Key| KeyDB API -->|Mark as Used| KeyDB" %} **Algorithm:** * Pre-generate 1M keys in batches * Store in database with `used` flag * API servers fetch and mark as used * **Pros**: Fast, no collision * **Cons**: Extra storage for key pool #### Option B: Hash-based Generation ``` MD5(content + timestamp) → Base62 encode → Take first 7 chars ``` **Check for collision** (rare, but handle via retry) ### 2. Database Schema #### Metadata Table (PostgreSQL) ```sql CREATE TABLE pastes ( id BIGSERIAL PRIMARY KEY, short_key VARCHAR(10) UNIQUE NOT NULL, s3_object_key VARCHAR(255) NOT NULL, created_at TIMESTAMP DEFAULT NOW(), expires_at TIMESTAMP, view_count BIGINT DEFAULT 0, is_private BOOLEAN DEFAULT false, owner_id BIGINT, INDEX idx_short_key (short_key), INDEX idx_expires_at (expires_at) ); ``` #### Why S3 for Content, Not DB? * **Cost**: S3 cheaper for large blobs * **Scalability**: DB optimized for metadata, S3 for objects * **CDN integration**: Direct CloudFront integration with S3 ### 3. Caching Strategy **Redis Cache Cluster:** ```java // Cache popular pastes (top 20%) public String getPasteContent(String shortKey, Timestamp expiresAt) { String cacheKey = "paste:" + shortKey; long ttl = Math.min( Duration.between(Instant.now(), expiresAt.toInstant()).toSeconds(), 3600 // 1 hour max ); if (cache.exists(cacheKey)) { return cache.get(cacheKey); } else { String content = s3Client.getObject(s3Key); cache.setex(cacheKey, ttl, content); return content; } } ``` **Cache Eviction**: LRU (Least Recently Used) **SDE-3 Optimization: Consistent Hashing for Redis** As the cache grows, a single Redis instance isn't enough. We need a Redis cluster. * **Problem:** If we add/remove Redis nodes and use `hash(key) % N`, almost all keys will map to new servers, causing a massive cache miss spike (Cache Thundering Herd) that could bring down the DB. * **Solution:** Use **Consistent Hashing** (e.g., ring-based routing). When a node is added/removed, only `1/N` keys are remapped. Add "virtual nodes" to ensure even distribution across physical nodes of different capacities. ### 4. Expiration & Cleanup #### Lazy Deletion ```java // On read request if (paste.getExpiresAt().before(new Timestamp(System.currentTimeMillis()))) { throw new NotFoundException("Paste expired"); } ``` #### Active Cleanup (Batch Job) ```java // Cron job runs daily String query = """ SELECT short_key, s3_object_key FROM pastes WHERE expires_at < NOW() - INTERVAL '1 day' LIMIT 10000 """; List expiredPastes = db.executeQuery(query); // Delete from S3 and DB for (Paste paste : expiredPastes) { s3Client.deleteObject(paste.getS3ObjectKey()); db.delete(paste.getId()); } ``` **SDE-3 Optimization: Avoiding DB Scans for Cleanup** * **Problem:** Running a `SELECT ... WHERE expires_at < NOW()` on a huge table is very slow, even with an index, and can cause DB CPU spikes. * **Solution 1 (S3 Lifecycle Policies):** Since we store content in S3, use AWS S3 Object Expiration Lifecycle rules. When uploading, tag the object or put it in a folder corresponding to its expiry bucket (e.g., `1h/`, `1d/`). S3 deletes it automatically. * **Solution 2 (Time-Series DB / Redis TTL):** Metadata can be stored in a DB that natively supports TTL (like DynamoDB TTL or Cassandra) rather than PostgreSQL. If using Postgres, use Table Partitioning by day/week and just `DROP PARTITION` for old data instead of row-level deletes. *** ## Read/Write Flow ### Write Flow {% @mermaid/diagram content="sequenceDiagram participant User participant API participant KeyGen participant DB participant S3 participant Cache ``` User->>API: POST /pastes {content} API->>KeyGen: Get unique key KeyGen-->>API: "aB3dE5f" API->>S3: Store content at key S3-->>API: S3 object URL API->>DB: Insert metadata (short_key, s3_url, expires_at) API-->>User: {shortUrl}" %} ``` ### Read Flow {% @mermaid/diagram content="sequenceDiagram participant User participant CDN participant API participant Cache participant DB participant S3 ``` User->>CDN: GET /aB3dE5f CDN->>API: Cache miss API->>Cache: Check Redis alt Cache Hit Cache-->>API: Content else Cache Miss API->>DB: Get metadata by short_key DB-->>API: {s3_object_key, expires_at} API->>S3: Fetch content S3-->>API: Content API->>Cache: Store in Redis (TTL) end API-->>User: Paste content" %} ``` *** ## Scalability & Optimization ### 1. Database Sharding * **Shard key**: `hash(short_key) % num_shards` * Distributes load evenly ### 2. Read Replicas * Master for writes * Read replicas for GET requests (view count can be eventually consistent) ### 3. CDN for Static Content * Serve popular pastes from edge locations * Cache-Control headers: `max-age=3600` ### 4. Rate Limiting * Prevent abuse: 10 pastes/hour per IP * Use Redis for counters *** ## Security Considerations ### 1. Private Pastes * Generate cryptographically strong keys (16+ chars) * No indexing or listing endpoints ### 2. Input Validation * Limit paste size: Max 1 MB * Sanitize HTML to prevent XSS ### 3. DDoS Protection * CloudFlare or AWS Shield * Rate limiting at LB level *** ## Trade-offs & Alternatives | Aspect | Choice | Alternative | Trade-off | | ------------------- | -------------------- | ---------------- | --------------------------- | | **Content Storage** | S3 | Database | Cost vs simplicity | | **Key Generation** | Pre-generated pool | Hash-based | Storage vs collision risk | | **Caching** | Redis | No cache | Cost vs latency | | **Expiration** | Lazy + Batch cleanup | Active TTL in DB | Complexity vs storage waste | *** ## Extensions ### 1. Syntax Highlighting * Detect language, store metadata * Client-side rendering with libraries ### 2. Paste History (User Accounts) ```sql ALTER TABLE pastes ADD COLUMN owner_id BIGINT; CREATE INDEX idx_owner_id ON pastes(owner_id); ``` ### 3. Analytics Dashboard * Track views over time (TimescaleDB or InfluxDB) * Geographic distribution (from CDN logs) *** ## Interview Discussion Points **Q: Why not store everything in database?** * S3 is **cheaper** for blob storage ($0.023/GB vs $0.10+/GB for DB) * DB performance degrades with large rows * S3 integrates with CDN (CloudFront) **Q: How to handle 1M pastes/sec?** * Horizontal scaling of API servers * Database sharding (by short\_key hash) * Aggressive caching (Redis cluster) * CDN for read-heavy workload **Q: What if key generation service goes down?** * **Pre-generated pool**: API servers cache locally (1000 keys) * **Failover**: Multiple key generation services (standby) * **Fallback**: Hash-based generation with collision check