
Key-Value Store

Problem Statement

Design a distributed key-value store like Redis or DynamoDB that supports basic operations (GET, PUT, DELETE) with high availability, partition tolerance, and tunable consistency.


Requirements

Functional Requirements

  1. PUT(key, value): Store key-value pair

  2. GET(key): Retrieve value for a key

  3. DELETE(key): Remove a key-value pair

  4. Support for configurable TTL (Time-To-Live)

  5. Support for atomic operations (compare-and-swap)

Non-Functional Requirements

  1. High availability: 99.99% uptime

  2. Partition tolerance: Continue operating during network partitions

  3. Low latency: < 10ms for GET, < 50ms for PUT

  4. Scalability: Store petabytes of data, handle millions of QPS

  5. Tunable consistency: Support both strong and eventual consistency


Capacity Estimation

Traffic Estimates

  • Peak QPS: 10 million requests/sec

  • Read:Write ratio: 9:1 (90% reads, 10% writes)

  • Reads: 9M/sec

  • Writes: 1M/sec

Storage Estimates

  • Average key size: 20 bytes

  • Average value size: 200 bytes

  • Total entries: 1 billion keys

  • Storage needed: 1B × (20 + 200 bytes) = 220 GB

  • With replication (3x): 660 GB

  • With overhead (2x): 1.32 TB

Memory Estimates (Cache)

  • Cache 20% hot data: 220 GB × 0.2 = 44 GB total

  • Across 10 cache nodes: ≈ 4.4 GB of cache per node


API Design

1. Basic Operations

2. Atomic Operations

3. Batch Operations
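The three API groups above can be sketched as a single client-facing interface. This is an illustrative, single-node sketch (class and method names are assumptions, not the source's API); a real store would route these calls through a coordinator.

```python
import time

class KVStore:
    """Toy in-memory sketch of the client-facing API (names are illustrative)."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    # --- Basic operations ---
    def put(self, key, value, ttl=None):
        expires = time.time() + ttl if ttl else None
        self._data[key] = (value, expires)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and time.time() >= expires:
            del self._data[key]  # lazy expiry on read
            return None
        return value

    def delete(self, key):
        return self._data.pop(key, None) is not None

    # --- Atomic operations ---
    def compare_and_swap(self, key, expected, new_value):
        # Atomic only within one node; a distributed CAS needs coordination.
        if self.get(key) == expected:
            self.put(key, new_value)
            return True
        return False

    # --- Batch operations ---
    def multi_get(self, keys):
        return {k: self.get(k) for k in keys}
```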


High-Level Architecture


Core Components

1. Consistent Hashing & Virtual Nodes


Why Virtual Nodes?

  • Each physical node owns multiple virtual nodes (vnodes)

  • Better load distribution when nodes join/leave

  • Typical: 256 vnodes per physical node

Consistent Hashing Algorithm:
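A minimal sketch of the algorithm, assuming an MD5-hashed ring with virtual nodes (the vnode count and class name are illustrative): each physical node is hashed onto the ring many times, and a key is owned by its clockwise successor.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Sketch: MD5-based hash ring with virtual nodes."""

    def __init__(self, vnodes=256):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, physical_node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Place `vnodes` tokens for this physical node on the ring.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get_node(self, key):
        if not self.ring:
            raise RuntimeError("empty ring")
        h = self._hash(key)
        # First token clockwise from the key's hash (wrap around at the end).
        idx = bisect.bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]
```

Removing a node only remaps the keys that its vnodes owned; all other keys keep their owners, which is the "minimal data movement" property.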

2. Replication Strategy

Replication Factor (RF) = 3

  • Primary node + 2 replicas

  • Replicas stored on the next RF−1 nodes in the ring (clockwise)


Replication Types:

  1. Synchronous: Wait for all replicas to ACK (strong consistency)

  2. Asynchronous: Write to primary, replicate in background (eventual consistency)

  3. Quorum: Write to W nodes, read from R nodes where W + R > RF

3. Quorum Consistency

Tunable Consistency Levels:

| Level    | Reads (R) | Writes (W) | Consistency          | Use Case               |
|----------|-----------|------------|----------------------|------------------------|
| Strong   | 3         | 3          | Linearizable         | Banking, critical data |
| Quorum   | 2         | 2          | Strong if R + W > RF | Balanced               |
| Eventual | 1         | 1          | Eventual             | High throughput        |

Quorum Formula: W + R > RF guarantees that every read quorum overlaps every write quorum on at least one replica, so a read always sees the most recently acknowledged write.
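The overlap condition from the table above can be checked directly (function name is illustrative):

```python
def is_strongly_consistent(r, w, rf=3):
    """W + R > RF means read and write quorums overlap on >= 1 replica."""
    return r + w > rf

# The three levels from the table:
assert is_strongly_consistent(3, 3)        # Strong
assert is_strongly_consistent(2, 2)        # Quorum
assert not is_strongly_consistent(1, 1)    # Eventual: quorums may not overlap
```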


Detailed Workflows

Write Path (PUT)


Optimization: Hinted Handoff

  • If replica node down, store hint on another node

  • Replay when original node recovers
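A minimal sketch of hinted handoff, assuming the coordinator keeps an in-memory hint buffer per down node (class and method names are illustrative; real systems persist hints to disk and cap their age):

```python
from collections import defaultdict

class HintedHandoff:
    """Sketch: buffer writes destined for a down replica, replay on recovery."""

    def __init__(self):
        self.hints = defaultdict(list)  # down_node -> [(key, value), ...]

    def store_hint(self, down_node, key, value):
        # Called when a write to `down_node` times out.
        self.hints[down_node].append((key, value))

    def replay(self, recovered_node, apply_write):
        # Called when gossip reports the node healthy again.
        for key, value in self.hints.pop(recovered_node, []):
            apply_write(key, value)
```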

Read Path (GET)


SDE-3 Concept: Read Repair

  • Problem: Replicas diverge due to dropped updates or network partitions.

  • Solution (Read Repair): When the coordinator performs a read across replicas and detects a version mismatch, it returns the newest data to the client and simultaneously sends an async update to the replicas holding stale data to synchronize them to the latest version. This repairs data lazily during read operations.
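The read-repair flow can be sketched as follows, assuming each replica stores `key -> (version, value)` and that the repair write would be asynchronous in a real coordinator (all names here are illustrative):

```python
def read_with_repair(replicas, key, async_write):
    """Return the newest value across replicas; queue repairs for stale ones."""
    results = [(r, r.get(key)) for r in replicas]
    # Entries are (version, value); max() picks the highest version.
    newest = max((entry for _, entry in results if entry), default=None)
    if newest is None:
        return None
    version, value = newest
    for replica, entry in results:
        if entry is None or entry[0] < version:
            # Fired asynchronously in a real system; synchronous here for clarity.
            async_write(replica, key, (version, value))
    return value
```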


Conflict Resolution

Version Vectors (Dotted Version Vectors)

When concurrent writes occur, track causality:

Resolution Strategies:

  1. Last Write Wins (LWW): Use timestamp (simple, but loses data)

  2. Client-side merge: Return conflicts, let client decide (e.g., Amazon Shopping Cart merges items)

  3. Application-defined: Custom merge logic

Dynamo-style Vector Clocks (SDE-3 Deep Dive): A vector clock is a list of (node, counter) pairs. It tracks the causal history of an object.

  • Comparison Rule: Clock X is an ancestor of Clock Y (no conflict) if for every node i, X[i] <= Y[i].

  • Conflict: If X has some elements strictly greater than Y, and Y has some elements strictly greater than X, then X and Y are concurrent and in conflict. Client application is responsible for merging them.
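The two comparison rules above translate directly into code. This sketch represents a vector clock as a `{node: counter}` dict, with missing entries treated as 0 (a common convention, assumed here):

```python
def compare(vc_x, vc_y):
    """Classify two vector clocks: 'equal', 'ancestor', 'descendant', or 'concurrent'."""
    nodes = set(vc_x) | set(vc_y)
    x_le_y = all(vc_x.get(n, 0) <= vc_y.get(n, 0) for n in nodes)
    y_le_x = all(vc_y.get(n, 0) <= vc_x.get(n, 0) for n in nodes)
    if x_le_y and y_le_x:
        return "equal"
    if x_le_y:
        return "ancestor"     # X happened-before Y; Y safely supersedes X
    if y_le_x:
        return "descendant"
    return "concurrent"       # true conflict: the client must merge
```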


Failure Handling

1. Node Failure

  • Detection: Gossip protocol (heartbeat every 1s)

  • Recovery: Hinted handoff replays missed writes

  • Permanent failure: Anti-entropy repair (Merkle trees)

2. Network Partition

  • Sloppy Quorum: Accept writes even if primary replicas unreachable

  • Write to next healthy nodes (temporary replicas)

  • Eventual reconciliation when partition heals

3. Data Corruption

  • Checksum verification on reads

  • Merkle trees for anti-entropy (compare subtrees)


Storage Engine

LSM-Tree (Log-Structured Merge-Tree)


Why LSM-Tree?

  • Write-optimized: Sequential writes (append-only)

  • Compaction: Periodically merge SSTables

  • Used by: RocksDB, LevelDB, Cassandra

Read Path Optimization:

  • Bloom filters: Skip SSTables unlikely to have key

  • Index blocks: Binary search within SSTable

  • Block cache: Cache hot SSTable blocks in memory
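A minimal Bloom filter sketch for the first optimization, assuming k SHA-256-derived hash functions over an m-bit array (sizes and names are illustrative). A negative answer lets the read path skip an SSTable entirely; positives may be false, negatives never are.

```python
import hashlib

class BloomFilter:
    """Sketch: k hash positions per key in an m-bit array."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0  # the bit array, stored as one big int

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False => key definitely absent; True => probably present.
        return all(self.bits >> pos & 1 for pos in self._positions(key))
```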


Advanced Features

1. TTL (Time-To-Live)
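One common TTL design is lazy expiry on read plus an active background sweep. The sweep side can be sketched with a min-heap of expiry times so the sweeper never scans the whole keyspace (class and parameter names are illustrative):

```python
import heapq
import time

class TTLIndex:
    """Sketch: min-heap of (expiry_time, key) for active expiration."""

    def __init__(self):
        self.heap = []

    def track(self, key, ttl, now=None):
        now = time.time() if now is None else now
        heapq.heappush(self.heap, (now + ttl, key))

    def expired(self, now=None):
        # Pop every key whose expiry time has passed; the sweeper deletes them.
        now = time.time() if now is None else now
        out = []
        while self.heap and self.heap[0][0] <= now:
            out.append(heapq.heappop(self.heap)[1])
        return out
```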

2. Atomic Compare-And-Swap (CAS)

Use Case: Distributed counters, locks
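The counter use case can be sketched with a versioned cell: CAS succeeds only if the caller saw the latest version, and the caller retries on failure (optimistic concurrency; class and function names are illustrative and single-node):

```python
class CASCell:
    """Sketch: one versioned value cell supporting compare-and-swap."""

    def __init__(self, value=0):
        self.value, self.version = value, 0

    def read(self):
        return self.value, self.version

    def compare_and_swap(self, expected_version, new_value):
        if self.version != expected_version:
            return False  # a concurrent write won; caller must retry
        self.value, self.version = new_value, self.version + 1
        return True

def increment(cell):
    # The standard CAS retry loop used for distributed counters.
    while True:
        value, version = cell.read()
        if cell.compare_and_swap(version, value + 1):
            return value + 1
```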


Scalability

Horizontal Scaling


Auto-rebalancing:

  • Virtual nodes migrate to new physical node

  • Minimal data movement (only affected vnodes)

Partitioning Strategies

  1. Hash-based: Consistent hashing (default)

  2. Range-based: Keys kept sorted and split into contiguous ranges (HBase, Bigtable)

  3. Hybrid: Hash a partition key, then sort by a range key within each partition (DynamoDB)


Monitoring & Operations

Metrics

  • Latency: p50, p99, p999 for GET/PUT

  • Throughput: QPS per node

  • Storage: Disk usage, compaction lag

  • Replication lag: Time delta between primary and replicas

Anti-Entropy Repair

SDE-3 Concept: Anti-Entropy with Merkle Trees

  • What it is: A background process (anti-entropy) that constantly compares replicas to catch any inconsistencies that Hinted Handoff or Read Repair missed.

  • Why Merkle Trees (Hash Trees): Transferring the entire dataset across the network to compare it is too expensive. A Merkle Tree hashes data hierarchically. If the root hashes match, the entire dataset matches. If they differ, the system can quickly traverse down the branches to pinpoint the exact keys that differ, transferring only the necessary hashes over the network instead of the actual data.
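A compact sketch of the idea: hash each key-value pair, combine hashes pairwise up to a root, and only compare keys when the roots disagree. (Function names are illustrative; real implementations compare subtree hashes over fixed key ranges to avoid the per-key scan shown here.)

```python
import hashlib

def _h(data):
    return hashlib.sha256(data).hexdigest()

def merkle_root(leaves):
    """Build a hash tree bottom-up over a sorted key range; return the root."""
    level = [_h(f"{k}={v}".encode()) for k, v in sorted(leaves.items())]
    if not level:
        return _h(b"")
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash to pair up
        level = [_h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

def diff_keys(replica_a, replica_b):
    # Matching roots => nothing to transfer at all.
    if merkle_root(replica_a) == merkle_root(replica_b):
        return set()
    keys = set(replica_a) | set(replica_b)
    return {k for k in keys if replica_a.get(k) != replica_b.get(k)}
```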


Trade-offs

| Aspect              | Choice             | Trade-off                           |
|---------------------|--------------------|-------------------------------------|
| Consistency         | Tunable (Quorum)   | Latency vs correctness              |
| Replication         | Async by default   | Speed vs durability                 |
| Storage Engine      | LSM-tree           | Write-optimized, read amplification |
| Partitioning        | Consistent hashing | Complexity vs load balance          |
| Conflict Resolution | Last-Write-Wins    | Simplicity vs data loss             |


Real-World Examples

  • Amazon DynamoDB: Multi-master (global tables); eventually consistent reads by default, strongly consistent reads optional

  • Apache Cassandra: Wide-column, tunable consistency

  • etcd: Raft consensus, strong consistency (CP system)

  • Redis Cluster: 16384 hash slots, primary-replica replication


Interview Discussion Points

Q: CAP Theorem - What does your design prioritize?

  • AP (Availability + Partition tolerance) in default config

  • Can configure for CP with R=ALL, W=ALL (sacrifices availability)

  • Trade-off: During partition, accept stale reads vs reject requests

Q: How to handle hot keys?

  • Read hotspots: Cache aggressively, add read replicas

  • Write hotspots: Shard the hot key into sub-keys (e.g., split counter:global into counter:global:shard1, counter:global:shard2) and aggregate on read

Q: Why LSM-tree over B-tree?

  • Writes: LSM sequential (faster), B-tree random I/O

  • Reads: B-tree faster (direct lookup), LSM checks multiple levels

  • Use case: Write-heavy workloads favor LSM
