Key-Value Store
Problem Statement
Design a distributed key-value store like Redis or DynamoDB that supports basic operations (GET, PUT, DELETE) with high availability, partition tolerance, and tunable consistency.
Requirements
Functional Requirements
PUT(key, value): Store key-value pair
GET(key): Retrieve value for a key
DELETE(key): Remove a key-value pair
Support for configurable TTL (Time-To-Live)
Support for atomic operations (compare-and-swap)
Non-Functional Requirements
High availability: 99.99% uptime
Partition tolerance: Continue operating during network partitions
Low latency: < 10ms for GET, < 50ms for PUT
Scalability: Store petabytes of data, handle millions of QPS
Tunable consistency: Support both strong and eventual consistency
Capacity Estimation
Traffic Estimates
Peak QPS: 10 million requests/sec
Read:Write ratio: 9:1 (90% reads, 10% writes)
Reads: 9M/sec
Writes: 1M/sec
Storage Estimates
Average key size: 20 bytes
Average value size: 200 bytes
Total entries: 1 billion keys
Storage needed: 1B × (20 + 200 bytes) = 220 GB
With replication (3x): 660 GB
With overhead (2x): 1.32 TB
Memory Estimates (Cache)
Cache 20% hot data: 220 GB × 0.2 = 44 GB total
Across 10 cache nodes: ~4.4 GB of cache per node
API Design
1. Basic Operations
2. Atomic Operations
3. Batch Operations
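The basic and batch operations above can be sketched as a minimal in-memory client. This is an illustrative stand-in, not a real SDK: the class and method names (`KVClient`, `multi_get`, `multi_put`) are assumptions, and a production client would issue network calls to a coordinator node instead of touching a local dict.

```python
class KVClient:
    """In-memory stand-in for the KV store API surface (illustrative only)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store a key-value pair."""
        self._data[key] = value

    def get(self, key):
        """Return the value for key, or None if absent."""
        return self._data.get(key)

    def delete(self, key):
        """Remove a key; return True if it existed."""
        return self._data.pop(key, None) is not None

    def multi_get(self, keys):
        """Batch read: map each requested key to its value (or None)."""
        return {k: self.get(k) for k in keys}

    def multi_put(self, items):
        """Batch write: store every pair in the mapping."""
        for k, v in items.items():
            self.put(k, v)
```

Atomic operations (CAS) and TTL are covered separately below; a real API would also carry consistency-level and timeout parameters per request.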
High-Level Architecture
Core Components
1. Consistent Hashing & Virtual Nodes
Why Virtual Nodes?
Each physical node owns multiple virtual nodes (vnodes)
Better load distribution when nodes join/leave
Typical: 256 vnodes per physical node
Consistent Hashing Algorithm:
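A minimal sketch of the ring with virtual nodes, assuming MD5 as the ring hash and the 256-vnodes-per-node figure mentioned above (class and helper names are illustrative):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent-hash ring with virtual nodes (a sketch, not production code)."""

    def __init__(self, nodes=(), vnodes=256):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, physical_node)
        for n in nodes:
            self.add_node(n)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        # Each physical node gets `vnodes` positions on the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        # Walk clockwise: first vnode with hash >= hash(key), wrapping at the end.
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]
```

Removing a node only remaps the keys whose successor vnode belonged to that node; every other key keeps its owner, which is the "minimal data movement" property the section relies on.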
2. Replication Strategy
Replication Factor (RF) = 3
Primary node + 2 replicas
Replicas stored on the next RF−1 nodes in the ring (clockwise)
Replication Types:
Synchronous: Wait for all replicas to ACK (strong consistency)
Asynchronous: Write to primary, replicate in background (eventual consistency)
Quorum: Write to W nodes, read from R nodes, where W + R > RF
3. Quorum Consistency
Tunable Consistency Levels (RF = 3):

| Level | W | R | Guarantee | Use case |
|---|---|---|---|---|
| Strong | 3 | 3 | Linearizable | Banking, critical data |
| Quorum | 2 | 2 | Strong if R + W > RF | Balanced |
| Eventual | 1 | 1 | Eventual | High throughput |
Quorum Formula: With replication factor RF, choose W (write acks required) and R (replicas queried on read) such that W + R > RF; the read and write quorums then overlap in at least one replica, so every read observes the latest acknowledged write.
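The overlap condition from W + R > RF above can be checked mechanically (the function name is illustrative):

```python
def is_strongly_consistent(w: int, r: int, rf: int) -> bool:
    """True if every read quorum overlaps every write quorum in at
    least one replica, i.e. reads are guaranteed to see the latest
    acknowledged write."""
    return w + r > rf

# The three consistency levels from the table:
assert is_strongly_consistent(3, 3, 3)       # Strong
assert is_strongly_consistent(2, 2, 3)       # Quorum
assert not is_strongly_consistent(1, 1, 3)   # Eventual
```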
Detailed Workflows
Write Path (PUT)
Optimization: Hinted Handoff
If replica node down, store hint on another node
Replay when original node recovers
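The write path with hinted handoff can be sketched as follows, assuming the coordinator is handed the key's preference list (the RF ring owners) and a simple node table; all names and the node representation are illustrative:

```python
def coordinator_put(key, value, nodes, preference_list, w):
    """Write to the replicas in `preference_list`; succeed once `w` acks arrive.
    nodes: {name: {"data": {}, "hints": [], "up": bool}} (toy cluster state).
    For each down replica, park the write as a hint on a healthy node
    outside the preference list (hinted handoff)."""
    acks = 0
    healthy = [n for n in nodes if nodes[n]["up"]]
    for replica in preference_list:
        if nodes[replica]["up"]:
            nodes[replica]["data"][key] = value
            acks += 1
        else:
            # Hint = (intended recipient, key, value), replayed on recovery.
            fallback = next(n for n in healthy if n not in preference_list)
            nodes[fallback]["hints"].append((replica, key, value))
    return acks >= w

def replay_hints(nodes):
    """Replay parked writes whose target node has recovered."""
    for n in nodes.values():
        remaining = []
        for target, key, value in n["hints"]:
            if nodes[target]["up"]:
                nodes[target]["data"][key] = value
            else:
                remaining.append((target, key, value))
        n["hints"] = remaining
```

In a real system the hint replay runs as a background task on the hint holder, and hints carry timestamps so stale replays do not overwrite newer data.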
Read Path (GET)
SDE-3 Concept: Read Repair
Problem: Replicas diverge due to dropped updates or network partitions.
Solution (Read Repair): When the coordinator performs a read across replicas and detects a version mismatch, it returns the newest data to the client and simultaneously sends an async update to the replicas holding stale data to synchronize them to the latest version. This repairs data lazily during read operations.
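The read-repair flow above can be sketched with replicas as simple versioned maps (the representation and function name are illustrative; real systems compare vector clocks or timestamps, and the repair write is asynchronous):

```python
def coordinator_get(key, replicas, r):
    """Read `key` from R replicas, return the newest value, and push
    that value back to any replica that returned a stale version.
    Each replica is a dict mapping key -> (version, value)."""
    responses = []
    for rep in replicas[:r]:
        if key in rep:
            responses.append((rep[key], rep))
    if not responses:
        return None
    # Pick the response with the highest version number.
    (newest_version, newest_value), _ = max(responses, key=lambda t: t[0][0])
    # Read repair: synchronize stale replicas to the newest version
    # (done asynchronously in practice, off the client's critical path).
    for (version, _value), rep in responses:
        if version < newest_version:
            rep[key] = (newest_version, newest_value)
    return newest_value
```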
Conflict Resolution
Version Vectors (Dotted Version Vectors)
When concurrent writes occur, track causality:
Resolution Strategies:
Last Write Wins (LWW): Use timestamp (simple, but loses data)
Client-side merge: Return conflicts, let client decide (e.g., Amazon Shopping Cart merges items)
Application-defined: Custom merge logic
Dynamo-style Vector Clocks (SDE-3 Deep Dive): A vector clock is a list of (node, counter) pairs. It tracks the causal history of an object.
Comparison Rule: Clock X is an ancestor of Clock Y (no conflict) if for every node i, X[i] <= Y[i].
Conflict: If X has some elements strictly greater than Y, and Y has some elements strictly greater than X, then X and Y are concurrent and in conflict. The client application is responsible for merging them.
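The comparison rule translates directly into code, representing a vector clock as a node-to-counter map (a sketch; real implementations also prune old entries to bound clock size):

```python
def compare(x: dict, y: dict) -> str:
    """Compare two vector clocks (node -> counter maps).
    Returns 'ancestor' if x causally precedes y, 'descendant' if y
    precedes x, 'equal', or 'concurrent' (a conflict the client must merge).
    A missing node entry counts as 0."""
    nodes = set(x) | set(y)
    x_le_y = all(x.get(n, 0) <= y.get(n, 0) for n in nodes)
    y_le_x = all(y.get(n, 0) <= x.get(n, 0) for n in nodes)
    if x_le_y and y_le_x:
        return "equal"
    if x_le_y:
        return "ancestor"
    if y_le_x:
        return "descendant"
    return "concurrent"
```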
Failure Handling
1. Node Failure
Detection: Gossip protocol (heartbeat every 1s)
Recovery: Hinted handoff replays missed writes
Permanent failure: Anti-entropy repair (Merkle trees)
2. Network Partition
Sloppy Quorum: Accept writes even if primary replicas unreachable
Write to next healthy nodes (temporary replicas)
Eventual reconciliation when partition heals
3. Data Corruption
Checksum verification on reads
Merkle trees for anti-entropy (compare subtrees)
Storage Engine
LSM-Tree (Log-Structured Merge-Tree)
Why LSM-Tree?
Write-optimized: Sequential writes (append-only)
Compaction: Periodically merge SSTables
Used by: RocksDB, LevelDB, Cassandra
Read Path Optimization:
Bloom filters: Skip SSTables unlikely to have key
Index blocks: Binary search within SSTable
Block cache: Cache hot SSTable blocks in memory
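The LSM read path with Bloom-filter skipping can be sketched like this. The `SSTable` here is a toy: a dict plus a set of key hashes standing in for a real probabilistic Bloom filter (which admits false positives but never false negatives), so the membership check must still be confirmed against the data:

```python
class SSTable:
    """Immutable sorted table (modeled as a dict) with a toy
    Bloom-filter stand-in: a set of bucketed key hashes."""

    def __init__(self, data):
        self.data = dict(data)
        self.bloom = {hash(k) % 1024 for k in data}

    def might_contain(self, key) -> bool:
        # May return a false positive (hash collision), never a false negative.
        return hash(key) % 1024 in self.bloom

def lsm_get(key, memtable, sstables):
    """LSM read path: check the in-memory memtable first, then
    SSTables from newest to oldest, skipping any whose Bloom filter
    rules the key out."""
    if key in memtable:
        return memtable[key]
    for sst in sstables:  # ordered newest first
        if sst.might_contain(key) and key in sst.data:
            return sst.data[key]
    return None
```

Checking newest-to-oldest is what makes the most recent write win when a key exists at multiple levels; compaction later merges those duplicates away.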
Advanced Features
1. TTL (Time-To-Live)
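A common TTL design combines lazy expiration on read with a periodic background sweep; here is a minimal single-node sketch of both (class name and internals are illustrative):

```python
import heapq
import time

class TTLStore:
    """TTL sketch: keys expire lazily on read, and sweep() evicts
    expired keys eagerly using a min-heap ordered by expiry time."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)
        self._heap = []  # (expires_at, key)

    def put(self, key, value, ttl=None):
        expires = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expires)
        if expires is not None:
            heapq.heappush(self._heap, (expires, key))

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires = item
        if expires is not None and time.monotonic() >= expires:
            del self._data[key]  # lazy expiration on read
            return None
        return value

    def sweep(self):
        """Evict expired keys; a real store runs this periodically."""
        now = time.monotonic()
        while self._heap and self._heap[0][0] <= now:
            _, key = heapq.heappop(self._heap)
            item = self._data.get(key)
            if item and item[1] is not None and item[1] <= now:
                del self._data[key]
```

The re-check inside `sweep()` guards against a key that was overwritten with a longer TTL after its old heap entry was pushed.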
2. Atomic Compare-And-Swap (CAS)
Use Case: Distributed counters, locks
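A CAS-based counter illustrates the retry-loop pattern: read the current value, attempt the swap, and retry if another writer got there first. This is a single-node sketch (the lock stands in for per-key atomicity that a real store provides server-side; names are illustrative):

```python
import threading

class CASStore:
    """CAS sketch: the swap succeeds only if the stored value still
    matches what the caller last read (optimistic concurrency)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()  # stands in for server-side atomicity

    def get(self, key):
        return self._data.get(key)

    def compare_and_swap(self, key, expected, new):
        with self._lock:
            if self._data.get(key) != expected:
                return False  # someone else changed it; caller must retry
            self._data[key] = new
            return True

def increment(store, key):
    """Counter increment via CAS: retry until the swap lands."""
    while True:
        current = store.get(key)
        new = (current or 0) + 1
        if store.compare_and_swap(key, current, new):
            return new
```

The same loop underpins distributed locks: acquire by CAS-ing the lock key from absent to an owner token, release by CAS-ing it back.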
Scalability
Horizontal Scaling
Auto-rebalancing:
Virtual nodes migrate to new physical node
Minimal data movement (only affected vnodes)
Partitioning Strategies
Hash-based: Consistent hashing (default)
Range-based: Keys sorted, split into ranges (HBase)
Hybrid: Combine both (Bigtable)
Monitoring & Operations
Metrics
Latency: p50, p99, p999 for GET/PUT
Throughput: QPS per node
Storage: Disk usage, compaction lag
Replication lag: Time delta between primary and replicas
Anti-Entropy Repair
SDE-3 Concept: Anti-Entropy with Merkle Trees
What it is: A background process (anti-entropy) that constantly compares replicas to catch any inconsistencies that Hinted Handoff or Read Repair missed.
Why Merkle Trees (Hash Trees): Transferring the entire dataset across the network to compare it is too expensive. A Merkle Tree hashes data hierarchically. If the root hashes match, the entire dataset matches. If they differ, the system can quickly traverse down the branches to pinpoint the exact keys that differ, transferring only the necessary hashes over the network instead of the actual data.
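The hierarchical comparison can be sketched as follows, hashing keys into a fixed number of range buckets and building the tree bottom-up (bucket count, helper names, and the leaf encoding are illustrative simplifications):

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def leaf_hashes(data: dict, buckets=4):
    """Partition keys into fixed buckets and hash each bucket's contents.
    One leaf hash per key range."""
    parts = [[] for _ in range(buckets)]
    for k in sorted(data):
        parts[hash(k) % buckets].append(f"{k}={data[k]}")
    return [_h("|".join(p).encode()) for p in parts]

def merkle_root(leaves):
    """Build the Merkle tree bottom-up from an ordered list of leaf
    hashes; return the root hash."""
    level = list(leaves)
    if not level:
        return _h(b"")
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last hash to pair up
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

Two replicas first exchange roots; only if the roots differ do they walk down, exchanging child hashes until the mismatching leaf buckets (and hence the divergent key ranges) are found.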
Trade-offs
| Decision | Choice | Trade-off |
|---|---|---|
| Consistency | Tunable (Quorum) | Latency vs correctness |
| Replication | Async by default | Speed vs durability |
| Storage Engine | LSM-tree | Write-optimized, read amplification |
| Partitioning | Consistent hashing | Complexity vs load balance |
| Conflict Resolution | Last-Write-Wins | Simplicity vs data loss |
Real-World Examples
Amazon DynamoDB: Multi-master, eventual consistency
Apache Cassandra: Wide-column, tunable consistency
etcd: Raft consensus, strong consistency (CP system)
Redis Cluster: Hash slots, master-slave replication
Interview Discussion Points
Q: CAP Theorem - What does your design prioritize?
AP (Availability + Partition tolerance) in default config
Can configure for CP with R=ALL, W=ALL (sacrifices availability)
Trade-off: During a partition, accept stale reads vs reject requests
Q: How to handle hot keys?
Read hotspots: Cache aggressively, add read replicas
Write hotspots: Shard the hot key (e.g., counter:global → counter:global:shard1, counter:global:shard2)
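The write-hotspot fix can be sketched as a sharded counter: writes pick a random sub-key so they spread over partitions, and reads aggregate across all shards (shard count and names are illustrative):

```python
import random

SHARDS = 8  # illustrative; tune to the observed write rate

def shard_key(base_key: str) -> str:
    """Spread writes for a hot key across SHARDS sub-keys, each
    hashing to a (likely) different partition."""
    return f"{base_key}:shard{random.randrange(SHARDS)}"

def read_counter(store: dict, base_key: str) -> int:
    """Reads pay for the sharding: aggregate across all sub-keys."""
    return sum(store.get(f"{base_key}:shard{i}", 0) for i in range(SHARDS))
```

This trades one hot partition for SHARDS cooler ones at the cost of a fan-out read; it works for commutative updates like counters, not for values needing a single total order.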
Q: Why LSM-tree over B-tree?
Writes: LSM sequential (faster), B-tree random I/O
Reads: B-tree faster (direct lookup), LSM checks multiple levels
Use case: Write-heavy workloads favor LSM