
YouTube

Problem Statement

Design a video sharing platform like YouTube that allows users to upload, view, search, and comment on videos. The system must handle billions of users watching millions of videos concurrently.


Requirements

Functional Requirements

  1. Upload videos (various formats: MP4, AVI, MOV)

  2. Stream videos with adaptive bitrate

  3. Search videos by title, tags, channel

  4. Like/Dislike and comment on videos

  5. Subscribe to channels

  6. Recommendations based on watch history

  7. View count and analytics

Non-Functional Requirements

  1. High availability: 99.99% uptime

  2. Low latency: < 200ms to start streaming

  3. Scalability: Handle 1 billion DAU, 500 hours of video uploaded/min

  4. Global distribution: CDN for low-latency streaming worldwide

  5. Reliability: No video loss during upload/encoding


Capacity Estimation

Traffic Estimates

  • Daily Active Users (DAU): 1 billion

  • Videos watched per user/day: 5 videos

  • Total video views/day: 5 billion

  • Peak QPS: 5B / 86,400 sec × 3 (peak factor) ≈ 174,000 views/sec

Upload Estimates

  • Video uploads: 500 hours/min = 30,000 hours/hour

  • Average video length: 10 minutes

  • Upload rate: 30,000 hours/hour × 6 videos per hour of footage (10-min average) = 180,000 videos/hour

Storage Estimates

  • Average raw video size: 1 GB for 10-min 1080p video

  • Videos uploaded/day: 180K × 24 = 4.32M videos/day

  • Daily storage (raw): 4.32M × 1 GB = 4.32 PB/day

  • With transcoding (multiple resolutions): 4.32 PB × 3 = 13 PB/day

  • 5-year storage: 13 PB × 365 × 5 = 23.7 exabytes

Bandwidth Estimates

  • Average bitrate: 4 Mbps (1080p)

  • Peak concurrent streams: 174K view starts/sec × ~600 sec average watch time ≈ 100M streams

  • Egress bandwidth: 100M × 4 Mbps ≈ 400 Tbps (served overwhelmingly from CDN edges, not origin)
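The traffic and storage arithmetic above can be sanity-checked in a few lines (a sketch; the 3x peak factor and 3x transcoding multiplier are the assumptions stated above):

```python
# Back-of-envelope check of the capacity estimates above.
DAU = 1_000_000_000          # daily active users
VIEWS_PER_USER = 5
PEAK_FACTOR = 3

views_per_day = DAU * VIEWS_PER_USER                  # 5 billion views/day
peak_qps = views_per_day / 86_400 * PEAK_FACTOR       # ~174K views/sec

upload_hours_per_hour = 500 * 60                      # 500 hours/min
videos_per_day = upload_hours_per_hour * 6 * 24       # 10-min average video

raw_pb_per_day = videos_per_day * 1 / 1_000_000       # 1 GB per video -> PB/day
total_pb_per_day = raw_pb_per_day * 3                 # transcoding multiplier
five_year_eb = total_pb_per_day * 365 * 5 / 1000      # exabytes

print(f"peak QPS ~{peak_qps:,.0f}")
print(f"storage ~{total_pb_per_day:.1f} PB/day, ~{five_year_eb:.1f} EB over 5 years")
```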


API Design

1. Upload Video

2. Stream Video

3. Search Videos

4. Like/Comment
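The four endpoint groups above might look like the following (hypothetical paths and parameters for illustration, not a confirmed API):

```python
# Hypothetical REST endpoint sketch for the API groups above.
# Paths, verbs, and parameter names are assumptions, not a real spec.
API = {
    "upload":  ("POST", "/v1/videos",                     {"title", "tags", "file"}),
    "stream":  ("GET",  "/v1/videos/{video_id}/manifest", {"quality"}),
    "search":  ("GET",  "/v1/search",                     {"query", "page", "filters"}),
    "like":    ("POST", "/v1/videos/{video_id}/like",     set()),
    "comment": ("POST", "/v1/videos/{video_id}/comments", {"text", "parent_id"}),
}

for name, (method, path, params) in API.items():
    print(f"{method:4} {path}  params={sorted(params)}")
```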


High-Level Architecture


Detailed Component Design

1. Video Upload & Processing Pipeline


Video Transcoding

Formats:

  • HLS (HTTP Live Streaming): Apple's protocol, widely supported

  • MPEG-DASH (Dynamic Adaptive Streaming over HTTP): open industry standard

  • Resolutions: 144p, 240p, 360p, 480p, 720p, 1080p, 1440p, 2160p (4K)

  • Codecs: H.264 (compatibility), H.265/HEVC (efficiency), AV1 (future)

Adaptive Bitrate Streaming (ABR):

Client switches quality based on bandwidth
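A minimal sketch of the client-side switching decision (the bitrate ladder values are illustrative; real players such as hls.js also factor in buffer occupancy):

```python
# Throughput-based ABR: pick the highest rendition whose bitrate fits
# within a safety margin of the measured download throughput.
LADDER = [          # (label, bitrate in Mbps) -- illustrative values
    ("144p", 0.2), ("360p", 0.8), ("480p", 1.2),
    ("720p", 2.5), ("1080p", 4.0), ("1440p", 8.0), ("2160p", 16.0),
]

def choose_rendition(measured_mbps: float, safety: float = 0.8) -> str:
    budget = measured_mbps * safety
    best = LADDER[0][0]                 # never drop below the lowest rung
    for label, bitrate in LADDER:
        if bitrate <= budget:
            best = label
    return best

print(choose_rendition(5.0))   # 4.0 Mbps fits within the 5*0.8 budget -> 1080p
print(choose_rendition(0.3))   # only 144p fits -> 144p
```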

Distributed Transcoding (SDE-3 Deep Dive: DAG)

  • Horizontal scaling: 1000+ transcoding nodes

  • Job queue: Kafka partitioned by priority (high for verified creators)

  • Optimization: GPU-accelerated encoding (NVIDIA NVENC)

  • DAG (Directed Acyclic Graph) Workflow: A complex video goes through many processing steps: [Inspect] -> [Watermark] -> [Transcode 1080p, Transcode 720p, Extract Audio] -> [Merge] -> [CDN Push]. Using a DAG scheduler (like Apache Airflow or a custom engine) ensures these steps are executed in the correct dependency order across distributed workers, parallelizing independent tasks.
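The dependency ordering described above is a topological sort; a minimal sketch using the step names from the example (the scheduler loop is assumed, not a real Airflow DAG):

```python
from graphlib import TopologicalSorter

# Transcoding DAG from the example; independent steps can run in parallel.
dag = {
    "watermark":      {"inspect"},
    "transcode_1080": {"watermark"},
    "transcode_720":  {"watermark"},
    "extract_audio":  {"watermark"},
    "merge":          {"transcode_1080", "transcode_720", "extract_audio"},
    "cdn_push":       {"merge"},
}

ts = TopologicalSorter(dag)
ts.prepare()
while ts.is_active():
    batch = tuple(ts.get_ready())   # all steps whose deps are done -> parallelizable
    print("run in parallel:", sorted(batch))
    ts.done(*batch)
```

Note the three transcode/audio steps come back from `get_ready()` together: that batch is exactly what a distributed scheduler fans out across workers.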

2. Video Streaming Architecture


CDN Strategy:

  • Multi-CDN: CloudFront + Akamai (redundancy)

  • Cache hierarchy: Edge PoP → Regional cache → Origin

  • TTL: Hot videos cached for hours, cold videos on-demand

  • Prefetching: First 10 seconds cached aggressively

3. Database Schema

Videos Table (PostgreSQL)

Comments Table (Sharded by video_id)
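The two tables might look like the following (column set is assumed; shown via sqlite3 so the sketch is runnable — production would be PostgreSQL plus a store sharded by video_id):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE videos (
    video_id    TEXT PRIMARY KEY,
    channel_id  TEXT NOT NULL,
    title       TEXT NOT NULL,
    duration_s  INTEGER,
    status      TEXT DEFAULT 'processing',  -- processing | live | blocked
    uploaded_at INTEGER
);
-- In production, sharded by video_id across many nodes.
CREATE TABLE comments (
    comment_id TEXT PRIMARY KEY,
    video_id   TEXT NOT NULL,
    user_id    TEXT NOT NULL,
    body       TEXT NOT NULL,
    parent_id  TEXT,                        -- reply threading
    created_at INTEGER
);
CREATE INDEX idx_comments_video ON comments(video_id, created_at);
""")
conn.execute("INSERT INTO videos (video_id, channel_id, title) VALUES (?,?,?)",
             ("v1", "c1", "demo"))
print(conn.execute("SELECT title FROM videos").fetchone()[0])
```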

View Counts (Redis + Batch Processing)

4. Search Service (Elasticsearch)

Index Schema:
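A hypothetical Elasticsearch mapping for the video index (field names and types are assumptions):

```python
# Sketch of an Elasticsearch index mapping for videos.
video_index_mapping = {
    "mappings": {
        "properties": {
            "title":        {"type": "text", "analyzer": "standard"},
            "tags":         {"type": "keyword"},
            "channel_name": {"type": "text"},
            "view_count":   {"type": "long"},       # used for popularity boost
            "upload_date":  {"type": "date"},       # used for recency decay
            "duration_s":   {"type": "integer"},
        }
    }
}
print(sorted(video_index_mapping["mappings"]["properties"]))
```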

Ranking Factors:

  1. Text relevance (BM25 score)

  2. View count (popularity boost)

  3. Recency (decay function)

  4. User engagement (CTR, watch time)
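One way to combine the four factors is a weighted linear score with an exponential recency decay (the weights and half-life below are hypothetical):

```python
import math

def rank_score(bm25: float, views: int, age_days: float, ctr: float,
               half_life_days: float = 30.0) -> float:
    """Blend text relevance, popularity, recency, and engagement."""
    popularity = math.log10(views + 1)             # dampen huge view counts
    recency = 0.5 ** (age_days / half_life_days)   # halves every 30 days
    return 1.0 * bm25 + 0.5 * popularity + 2.0 * recency + 3.0 * ctr

fresh = rank_score(bm25=8.0, views=10_000, age_days=1, ctr=0.12)
stale = rank_score(bm25=8.0, views=10_000, age_days=365, ctr=0.12)
print(fresh > stale)   # identical except age: the fresh video ranks higher
```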

Autocomplete:
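A toy version of the idea — a prefix index over past queries, ranked by frequency (real systems precompute top-k suggestions per prefix offline and cap the index size):

```python
from collections import defaultdict

class Autocomplete:
    """Minimal prefix lookup over recorded queries, ranked by frequency."""
    def __init__(self, top_k: int = 5):
        self.top_k = top_k
        self.freq: dict[str, int] = defaultdict(int)

    def record(self, query: str) -> None:
        self.freq[query.lower()] += 1

    def suggest(self, prefix: str) -> list[str]:
        prefix = prefix.lower()
        hits = [q for q in self.freq if q.startswith(prefix)]
        return sorted(hits, key=lambda q: -self.freq[q])[: self.top_k]

ac = Autocomplete()
for q in ["cat videos", "cat videos", "cats playing", "car reviews"]:
    ac.record(q)
print(ac.suggest("cat"))   # ['cat videos', 'cats playing']
```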

5. Recommendation Engine


Two-Tower Model:

  • User Tower: Embeddings from watch history, demographics

  • Video Tower: Embeddings from title, tags, engagement

  • Dot product for similarity score
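The retrieval step reduces to ranking candidates by dot product in a shared embedding space; a sketch with toy 3-dimensional vectors (real towers produce hundreds of dimensions, served via approximate nearest-neighbor search):

```python
# Two-tower retrieval: user and video embeddings live in the same space;
# candidates are ranked by dot-product similarity. Vectors are toy values.
def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

user_embedding = [0.9, 0.1, 0.4]            # output of the user tower
videos = {
    "gaming_clip":  [0.8, 0.0, 0.3],
    "cooking_show": [0.1, 0.9, 0.2],
    "tech_review":  [0.7, 0.2, 0.5],
}

ranked = sorted(videos, key=lambda v: dot(user_embedding, videos[v]), reverse=True)
print(ranked)   # tech_review edges out gaming_clip; cooking_show last
```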

Features for Ranking:

  • Video metadata (duration, upload date, category)

  • User engagement (CTR, watch time, likes)

  • Freshness (recent uploads get boost)

  • Channel authority (subscriber count)


Scalability Strategies

1. Database Sharding

Comments Sharding:

Hot Shard Problem:

  • Viral video gets all comments on one shard

  • Solution: Further partition by time buckets
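The time-bucket split amounts to a composite shard key; a sketch (bucket size and shard count are assumed values):

```python
import hashlib

NUM_SHARDS = 64
BUCKET_SECONDS = 3600   # one-hour buckets for a hot video's comments

def comment_shard(video_id: str, created_at: int) -> int:
    """Route by (video_id, time bucket) so one viral video spreads over shards."""
    bucket = created_at // BUCKET_SECONDS
    key = f"{video_id}:{bucket}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_SHARDS

# Same video, different hours -> different buckets, usually different shards.
s1 = comment_shard("viral123", 1_700_000_000)
s2 = comment_shard("viral123", 1_700_003_600)
print(s1, s2)
```

The trade-off: reads for "all comments on video X" now fan out to every bucket touched, so this split is usually applied only once a video crosses a hotness threshold.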

2. Caching Strategy

Multi-layer cache:

  1. CDN: Video chunks (30-day TTL for popular)

  2. Redis: Video metadata (1-hour TTL)

  3. Application cache: Search results (5-min TTL)

3. Write-Heavy Optimization (Likes/Views)

Problem: Updating a counter on every view creates a write hotspot in the database

Solution: Batch Writes
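A sketch of the batching idea: increments accumulate in memory (Redis in practice) and are flushed periodically, trading exact real-time counts for far fewer writes (the in-memory dict below stands in for both Redis and the videos table):

```python
from collections import Counter

class ViewCounter:
    """Buffer view increments and flush them as one write per video."""
    def __init__(self):
        self.buffer = Counter()
        self.db: dict[str, int] = {}          # stands in for the videos table

    def record_view(self, video_id: str) -> None:
        self.buffer[video_id] += 1            # O(1) in-memory, no DB write

    def flush(self) -> int:
        writes = len(self.buffer)
        for video_id, delta in self.buffer.items():
            self.db[video_id] = self.db.get(video_id, 0) + delta
        self.buffer.clear()
        return writes

vc = ViewCounter()
for _ in range(10_000):
    vc.record_view("viral123")
vc.record_view("other456")
print(vc.flush())          # 10,001 increments collapsed into 2 DB writes
print(vc.db["viral123"])
```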


Advanced Features

1. Live Streaming

  • Protocol: RTMP (ingest) → HLS (delivery)

  • Low latency: LL-HLS or WebRTC

  • Chat: WebSocket connections

2. Content Moderation

  • Upload-time: ML model checks for policy violations

  • User reports: Queue for human review

  • Copyright: Content ID system (audio fingerprinting)

3. Monetization

  • Ad insertion: SSAI (Server-Side Ad Insertion) in HLS stream

  • Analytics: Track impressions, clicks (Kafka → BigQuery)


Trade-offs & Optimizations

  • Storage: S3 + CDN. Trade-off: cost ($millions/month) vs latency.

  • Transcoding: multi-resolution encoding. Trade-off: 3x storage vs user experience.

  • View counts: eventual consistency. Trade-off: accuracy vs DB load.

  • Recommendations: batch updates. Trade-off: freshness vs compute cost.


Interview Discussion Points

Q: How to handle viral videos?

  • CDN prefetching: Warm cache proactively

  • Read replicas: Metadata DB scaled horizontally

  • Rate limiting: Prevent single video from overwhelming system

  • SDE-3 Approach: Cache Stampede Prevention: When a viral video expires from cache, millions of requests might hit the origin DB simultaneously. Use Promise Caching or Mutex Locks (Redis SETNX): The first request gets the lock, fetches from DB, and updates cache. Other requests wait or are served stale data.

  • Graceful degradation: Serve lower quality if needed

Q: How to prevent duplicate uploads?

  • Perceptual hashing (pHash or dHash) of video frames

  • Compare hash against existing videos

  • Trade-off: False positives vs compute cost
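A dHash over a single frame can be sketched as follows (pure Python on a toy grayscale grid; a real pipeline would use a library such as Pillow to resize each sampled frame to 9x8 first, yielding a 64-bit hash):

```python
def dhash(gray: list[list[int]]) -> int:
    """Difference hash: one bit per horizontal neighbor comparison."""
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

frame_a = [[10, 20, 30], [90, 50, 40]]   # toy 2x3 "frame"
frame_b = [[12, 22, 31], [88, 52, 41]]   # slightly re-encoded copy of frame_a
frame_c = [[90, 10, 80], [5, 70, 15]]    # frame from an unrelated video

print(hamming(dhash(frame_a), dhash(frame_b)))  # small distance -> duplicate
print(hamming(dhash(frame_a), dhash(frame_c)))  # larger distance -> distinct
```

Hashes within a small Hamming distance are flagged as likely duplicates; the threshold controls the false-positive vs compute trade-off noted above.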

Q: Optimizing for mobile users on slow networks?

  • ABR: Start at 144p, upgrade as bandwidth allows

  • Prefetch next video: Start downloading while watching current

  • Thumbnail previews: Low-res sprite sheets instead of full video scrubbing

Q: Disaster recovery?

  • Multi-region S3 replication: Cross-region backups

  • Database snapshots: Daily automated backups

  • CDN failover: Multi-CDN strategy (CloudFront → Akamai)
