
Google Drive

Problem Statement

Design a cloud storage and file synchronization service like Google Drive that allows users to upload, store, share files and folders, with real-time synchronization across multiple devices.


Requirements

Functional Requirements

  1. Upload/Download files (up to 15 GB per file)

  2. Create folders and organize files

  3. Share files/folders with permissions (view, edit)

  4. Real-time sync across devices

  5. Version history (restore previous versions)

  6. Collaborative editing (Google Docs-style)

  7. Search files by name, content, type

  8. Trash with 30-day retention

Non-Functional Requirements

  1. High availability: 99.9% uptime

  2. Strong consistency: Same view across devices

  3. Scalability: 1 billion users, exabytes of data

  4. Reliability: No data loss (11 nines durability)

  5. Low latency: < 100ms for metadata operations

  6. Bandwidth efficiency: Delta sync, compression


Capacity Estimation

User & Storage Estimates

  • Total users: 1 billion

  • Free tier: 15 GB/user

  • Paid tier (10%): 100 GB/user average

  • Total storage (upper bound, assuming every user fills their quota):

    • Free: 900M × 15 GB = 13.5 EB

    • Paid: 100M × 100 GB = 10 EB

    • Total: 23.5 exabytes

Traffic Estimates

  • DAU: 200 million

  • Files uploaded per user/day: 5 files

  • Average file size: 2 MB

  • Daily uploads: 200M × 5 × 2 MB = 2 PB/day

  • Upload QPS (peak): 200M × 5 / 86,400 s × 3 (peak-to-average factor) ≈ 35K uploads/sec

Metadata Estimates

  • Files per user: 10,000 files

  • Metadata size: 1 KB per file

  • Total metadata: 1B users × 10K files × 1 KB = 10 PB
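
The estimates above can be reproduced with a few lines of arithmetic. This is only a sanity check, assuming decimal units (1 EB = 10^9 GB, 1 PB = 10^9 MB = 10^12 KB) and treating the factor of 3 as a peak-to-average multiplier; the variable names are illustrative.

```python
# Back-of-the-envelope check of the capacity estimates (decimal units).
GB_PER_EB = 10**9
MB_PER_PB = 10**9
KB_PER_PB = 10**12

total_users = 1_000_000_000
paid_users = total_users // 10            # 10% of users are on a paid tier
free_users = total_users - paid_users

free_eb = free_users * 15 / GB_PER_EB     # 15 GB per free user
paid_eb = paid_users * 100 / GB_PER_EB    # 100 GB per paid user on average
total_eb = free_eb + paid_eb

dau = 200_000_000
uploads_per_day = dau * 5                 # 5 files per active user per day
daily_upload_pb = uploads_per_day * 2 / MB_PER_PB   # 2 MB average file
peak_upload_qps = uploads_per_day / 86_400 * 3      # 3x peak factor

metadata_pb = total_users * 10_000 * 1 / KB_PER_PB  # 10K files x 1 KB each

print(f"storage: {total_eb} EB, uploads: {daily_upload_pb} PB/day, "
      f"peak: {peak_upload_qps:,.0f}/s, metadata: {metadata_pb} PB")
```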


High-Level Architecture


Core Components

1. File Upload Flow with Chunking

Chunking Benefits:

  • Resume uploads: Re-upload only failed chunks

  • Bandwidth efficiency: Skip unchanged chunks

  • Deduplication: Same chunk across files stored once

  • Parallelization: Upload chunks in parallel
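
The benefits above follow from addressing each chunk by a hash of its content. A minimal sketch, assuming fixed-size 4 MB chunks and SHA-256 as the content hash; the function names and the `stored_hashes` set (standing in for a server-side lookup) are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB fixed-size chunks (assumed)

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Return a list of (sha256_hex, chunk_bytes) in file order."""
    result = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        result.append((hashlib.sha256(chunk).hexdigest(), chunk))
    return result

def chunks_to_upload(data: bytes, stored_hashes: set, chunk_size: int = CHUNK_SIZE):
    """Return (manifest, missing).

    manifest: ordered list of chunk hashes that reconstructs the file.
    missing:  hash -> bytes for chunks the server does not have yet.
    """
    manifest, missing = [], {}
    for digest, chunk in split_into_chunks(data, chunk_size):
        manifest.append(digest)
        if digest not in stored_hashes:
            missing[digest] = chunk
    return manifest, missing
```

Only the entries in `missing` need to be sent, and they can be sent in parallel or retried individually. A repeated chunk appears twice in the manifest but is stored once.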

Chunk Schema (PostgreSQL):
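
A minimal sketch of what such a schema could look like. All table and column names are assumptions. The DDL is run against in-memory SQLite purely so the example is self-contained; a PostgreSQL version would use its own types and `ON CONFLICT DO NOTHING` in place of `INSERT OR IGNORE`.

```python
import sqlite3

DDL = """
CREATE TABLE chunks (
    chunk_hash  TEXT PRIMARY KEY,            -- SHA-256 of the chunk content
    size_bytes  INTEGER NOT NULL,
    storage_key TEXT NOT NULL,               -- object-store location (hypothetical)
    ref_count   INTEGER NOT NULL DEFAULT 0   -- how many file versions use it
);
CREATE TABLE file_chunks (
    file_id     TEXT NOT NULL,
    version     INTEGER NOT NULL,
    chunk_index INTEGER NOT NULL,            -- position of the chunk in the file
    chunk_hash  TEXT NOT NULL REFERENCES chunks(chunk_hash),
    PRIMARY KEY (file_id, version, chunk_index)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

def add_chunk_ref(conn, file_id, version, index, chunk_hash, size, key):
    """Record one chunk of a file version; the chunk row is stored only once."""
    conn.execute(
        "INSERT OR IGNORE INTO chunks (chunk_hash, size_bytes, storage_key) "
        "VALUES (?, ?, ?)",
        (chunk_hash, size, key),
    )
    conn.execute(
        "UPDATE chunks SET ref_count = ref_count + 1 WHERE chunk_hash = ?",
        (chunk_hash,),
    )
    conn.execute(
        "INSERT INTO file_chunks VALUES (?, ?, ?, ?)",
        (file_id, version, index, chunk_hash),
    )
```

The `ref_count` column is what lets garbage collection find chunks that no file version references any longer.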

2. Metadata Database Schema

3. Real-Time Sync Protocol

WebSocket Events:
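
A minimal sketch of what sync events might look like on the wire, assuming JSON payloads. The event names, fields, and the `should_apply` rule are hypothetical, not a documented protocol.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SyncEvent:
    type: str        # e.g. "file_updated", "file_deleted" (hypothetical names)
    file_id: str
    version: int     # server-assigned version after the change
    device_id: str   # device that originated the change

def encode_event(event: SyncEvent) -> str:
    """Serialize an event for sending over the WebSocket."""
    return json.dumps(asdict(event), sort_keys=True)

def decode_event(raw: str) -> SyncEvent:
    """Parse a received message back into an event."""
    return SyncEvent(**json.loads(raw))

def should_apply(event: SyncEvent, local_device_id: str, local_version: int) -> bool:
    """Skip echoes of our own changes and events we have already applied."""
    return event.device_id != local_device_id and event.version > local_version
```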

Conflict Resolution:
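
One common approach, sketched here as an assumption rather than the only option, is optimistic versioning: each upload states the version it was based on, and an upload based on a stale version is kept as a "conflicted copy" instead of overwriting newer content. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ServerFile:
    name: str
    version: int = 0
    content: bytes = b""
    conflict_copies: dict = field(default_factory=dict)

def commit_upload(server: ServerFile, base_version: int, content: bytes, device: str) -> str:
    """Accept the upload if it was based on the latest version; otherwise keep both."""
    if base_version == server.version:
        server.content = content
        server.version += 1
        return "accepted"
    # Stale base: never overwrite silently, save the upload under a new name.
    copy_name = f"{server.name} (conflicted copy from {device})"
    server.conflict_copies[copy_name] = content
    return "conflict"
```

If two devices both edit version 0, the first commit wins and bumps the version to 1; the second is stored alongside it so no edit is lost.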

4. Delta Sync (Bandwidth Optimization)

rsync-style algorithm:

Can reduce bandwidth by 80% or more for small edits, because only the changed blocks are transferred.
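
A simplified sketch of the rsync idea: the receiver publishes per-block signatures of the old file, and the sender emits "copy block N" for matching data and literal bytes for everything else. rsync proper uses a rolling weak checksum that updates in constant time per byte; this sketch recomputes `zlib.adler32` at each offset for clarity, and all function names are hypothetical.

```python
import hashlib
import zlib

def block_signatures(old: bytes, block: int) -> dict:
    """Map weak checksum -> {strong hash: block index} for each full block of old."""
    sigs = {}
    for idx in range(len(old) // block):
        chunk = old[idx * block:(idx + 1) * block]
        strong = hashlib.sha256(chunk).digest()
        sigs.setdefault(zlib.adler32(chunk), {})[strong] = idx
    return sigs

def compute_delta(new: bytes, sigs: dict, block: int) -> list:
    """Return ops: ("copy", block_index) or ("data", literal_bytes)."""
    ops, literal, i = [], bytearray(), 0
    while i < len(new):
        window = new[i:i + block]
        match = None
        if len(window) == block:
            candidates = sigs.get(zlib.adler32(window))   # cheap check first
            if candidates:
                match = candidates.get(hashlib.sha256(window).digest())
        if match is not None:
            if literal:                                   # flush pending literal bytes
                ops.append(("data", bytes(literal)))
                literal.clear()
            ops.append(("copy", match))
            i += block
        else:
            literal.append(new[i])                        # slide the window by one byte
            i += 1
    if literal:
        ops.append(("data", bytes(literal)))
    return ops

def apply_delta(old: bytes, ops: list, block: int) -> bytes:
    """Rebuild the new file from the old file plus the delta ops."""
    out = bytearray()
    for kind, value in ops:
        out += old[value * block:(value + 1) * block] if kind == "copy" else value
    return bytes(out)
```

Inserting one byte into a 16-byte file with 4-byte blocks yields four copy ops and a single literal byte, so only that byte and the op list cross the network.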

5. File Sharing & Permissions

Share Link Schema:
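
A minimal sketch of a share-link record, assuming an unguessable random token and an optional expiry. The field names and helper functions are hypothetical.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class ShareLink:
    token: str                      # unguessable URL component
    file_id: str
    permission: str                 # "view" or "edit"
    expires_at: Optional[datetime]  # None means the link never expires

def create_share_link(file_id: str, permission: str,
                      ttl: Optional[timedelta] = None,
                      now: Optional[datetime] = None) -> ShareLink:
    """Create a link with a random token and an optional time-to-live."""
    if permission not in ("view", "edit"):
        raise ValueError("permission must be 'view' or 'edit'")
    now = now or datetime.now(timezone.utc)
    expires_at = now + ttl if ttl is not None else None
    return ShareLink(secrets.token_urlsafe(32), file_id, permission, expires_at)

def link_is_valid(link: ShareLink, now: Optional[datetime] = None) -> bool:
    """A link is valid until its expiry time, or forever if it has none."""
    now = now or datetime.now(timezone.utc)
    return link.expires_at is None or now < link.expires_at
```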

Access Control:
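
A minimal sketch of a permission check, assuming grants on a folder are inherited by everything inside it and the strongest grant wins. The in-memory `parents` and `grants` dictionaries are hypothetical stand-ins for metadata tables.

```python
from typing import Optional

LEVELS = {"view": 1, "edit": 2}   # higher number means stronger permission

# Hypothetical sample data: item -> parent folder, and (item, user) -> grant.
parents = {"report.pdf": "projects", "projects": "root", "root": None}
grants = {("projects", "alice"): "edit", ("root", "bob"): "view"}

def effective_permission(item: str, user: str) -> Optional[str]:
    """Walk up the folder tree and return the strongest grant found, if any."""
    best = None
    node = item
    while node is not None:
        level = grants.get((node, user))
        if level and (best is None or LEVELS[level] > LEVELS[best]):
            best = level
        node = parents.get(node)
    return best

def can(user: str, action: str, item: str) -> bool:
    """True if the user's effective permission covers the requested action."""
    level = effective_permission(item, user)
    return level is not None and LEVELS[level] >= LEVELS[action]
```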

6. Version Control

Version Retention:

  • Last 30 days: Keep all versions

  • 30-90 days: Keep weekly versions

  • 90+ days: Keep monthly versions

  • Manual versions: Kept indefinitely

Storage Optimization:

  • Copy-on-write: Only changed chunks stored

  • Garbage collection: Delete unreferenced chunks
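
The retention tiers above can be sketched as a thinning pass over a file's versions. This assumes "weekly" means one version per ISO week and "monthly" means one per calendar month, with the newest version in each bucket kept; the function name and input shape are hypothetical.

```python
from datetime import datetime, timedelta

def versions_to_keep(versions, now):
    """versions: list of (created_at, is_manual). Returns the set of kept timestamps."""
    keep, seen_buckets = set(), set()
    for created_at, is_manual in sorted(versions, reverse=True):   # newest first
        age = now - created_at
        if is_manual or age <= timedelta(days=30):
            keep.add(created_at)            # manual or recent: always kept
            continue
        if age <= timedelta(days=90):
            bucket = ("week",) + tuple(created_at.isocalendar()[:2])
        else:
            bucket = ("month", created_at.year, created_at.month)
        if bucket not in seen_buckets:      # newest version in each bucket wins
            seen_buckets.add(bucket)
            keep.add(created_at)
    return keep
```

Versions that fall out of the kept set release their chunk references, and garbage collection can then delete any chunk that no remaining version points to.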


Advanced Features

1. Collaborative Editing (Google Docs)

Operational Transformation (OT) or CRDT:

  • OT: Transform operations based on concurrent changes

  • CRDT: Conflict-free Replicated Data Type (e.g., the Yjs library)
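
To make the OT idea concrete, here is a minimal sketch that handles only concurrent text insertions, with a site ID as the tie-breaker when two users insert at the same position. Production OT also handles deletes and more; the `Insert` type and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Insert:
    pos: int     # index in the document where text is inserted
    text: str
    site: int    # tie-breaker when two inserts target the same position

def apply_op(doc: str, op: Insert) -> str:
    """Apply an insert to a document string."""
    return doc[:op.pos] + op.text + doc[op.pos:]

def transform(op: Insert, against: Insert) -> Insert:
    """Shift op so it can be applied after `against` has already been applied."""
    if against.pos < op.pos or (against.pos == op.pos and against.site < op.site):
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op

# Two users edit "abc" concurrently; both orders converge to the same text.
a, b = Insert(1, "X", site=1), Insert(1, "Y", site=2)
via_a_first = apply_op(apply_op("abc", a), transform(b, a))
via_b_first = apply_op(apply_op("abc", b), transform(a, b))
```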

2. Full-Text Search (Elasticsearch)

Index Schema:

Content Extraction:

  • PDF: Apache Tika

  • DOCX/XLSX: Apache POI

  • Images: OCR (Tesseract)

3. Trash & Recovery
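
A minimal sketch of soft deletion with the 30-day retention from the requirements: trashing a file only records a timestamp, restore works within the window, and a periodic job purges what has expired. The `Trash` class and its method names are hypothetical.

```python
from datetime import datetime, timedelta

TRASH_RETENTION = timedelta(days=30)

class Trash:
    """Soft-delete bookkeeping: file_id -> time it was moved to the trash."""

    def __init__(self):
        self.deleted_at = {}

    def move_to_trash(self, file_id: str, now: datetime) -> None:
        self.deleted_at[file_id] = now

    def restore(self, file_id: str, now: datetime) -> bool:
        """Restore if the file is in the trash and still within retention."""
        ts = self.deleted_at.get(file_id)
        if ts is None or now - ts > TRASH_RETENTION:
            return False
        del self.deleted_at[file_id]
        return True

    def purge_expired(self, now: datetime) -> list:
        """Return and forget file IDs whose retention window has passed."""
        expired = [f for f, ts in self.deleted_at.items() if now - ts > TRASH_RETENTION]
        for f in expired:
            del self.deleted_at[f]
        return expired
```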


Scalability Strategies

1. Database Sharding

Metadata sharding by user_id:

Benefit: User's files co-located
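
A minimal sketch of shard routing, assuming a stable hash of the user ID taken modulo a fixed shard count. `NUM_SHARDS` and the function name are hypothetical.

```python
import hashlib

NUM_SHARDS = 64  # hypothetical shard count

def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user to a shard using a hash that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Every metadata query for a user goes to `shard_for_user(user_id)`, so listing a folder never fans out across shards. Plain modulo routing makes changing the shard count expensive, which is why consistent hashing or a lookup directory is often used instead.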

2. CDN for Downloads

  • Popular files cached at edge

  • Signed URLs: Time-limited S3 URLs via CloudFront

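  • Cache-Control: max-age=3600, private for user-specific files (`private` limits caching to the browser; files meant to be cached at the edge need `public`)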

3. S3 Storage Classes

  • Hot files (< 30 days): S3 Standard

  • Warm files (30-90 days): S3 Intelligent-Tiering

  • Cold files (90+ days): S3 Glacier


Security & Compliance

1. Encryption

  • At rest: SSE-S3 (AES-256)

  • In transit: TLS 1.3

  • Client-side (enterprise): E2EE before upload

2. Virus Scanning

  • ClamAV on upload

  • Quarantine infected files

3. Audit Logs


Trade-offs

Aspect | Choice | Trade-off
--- | --- | ---
Chunking | 4 MB chunks | Deduplication vs overhead
Consistency | Strong (PostgreSQL) | Latency vs correctness
Sync | WebSocket | Persistent connection vs latency
Deduplication | Hash-based | Privacy vs storage savings


Interview Discussion Points

Q: How to handle large files (100 GB+)?

  • Multipart upload: Parallel upload of chunks

  • Resumable uploads: Store uploaded chunk IDs

  • Bandwidth throttling: Prevent single user from saturating network

Q: Preventing data loss?

  • Replication: S3 (99.999999999% durability)

  • Versioning: Accidental deletion recovery

  • Cross-region: Multi-region S3 replication

Q: Optimizing for mobile devices?

  • Selective sync: Only sync important folders

  • Photo backup: Upload in background over WiFi

  • Thumbnail generation: 200KB thumbnail vs 5MB photo
