Replication
Copying data across multiple nodes for availability, read scaling, and durability.
1. Concept Overview
Replication keeps multiple copies of data (replicas) in sync. It provides:
High availability: If one node fails, others can serve traffic.
Read scaling: Distribute reads across replicas.
Durability: Data survives single-node (or single-datacenter) failure.
Why it exists: A single copy is a single point of failure and a read bottleneck. Replication addresses both.
2. Core Principles
Topologies

| Topology | How it works | Examples |
| --- | --- | --- |
| Leader–follower (primary–replica) | One leader accepts writes; followers replicate from it; reads can go to followers | Most RDBMS, MongoDB |
| Multi-leader | Multiple nodes accept writes and replicate to each other | Multi-datacenter; offline-first apps |
| Leaderless | No single leader; quorum writes and reads (e.g. W=2, R=2, N=3) | Cassandra, DynamoDB |
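The leaderless quorum rule can be sketched as a one-line check: a read quorum (R) and a write quorum (W) are guaranteed to overlap whenever W + R > N, so at least one replica contacted by the read holds the latest acknowledged write. The function name below is illustrative, not a client API.

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True if every W-replica write set must intersect every R-replica read set."""
    return w + r > n

# N=3 with W=2, R=2: overlap guaranteed -> reads see the latest acked write
print(quorum_overlaps(3, 2, 2))  # True
# N=3 with W=1, R=1: a read may miss the only replica that saw the write
print(quorum_overlaps(3, 1, 1))  # False
```

Lowering W or R reduces latency (fewer replicas to wait for) at the cost of losing this overlap guarantee, which is the "tunable consistency" mentioned for Cassandra and DynamoDB.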
Sync vs async replication

| Mode | Behaviour | Pros | Cons |
| --- | --- | --- | --- |
| Synchronous | Leader waits for replica(s) to ack before confirming the write | No data loss on leader failover | Higher latency; availability tied to replica health |
| Asynchronous | Leader acks immediately; replicas catch up in the background | Low latency; leader not blocked by replicas | Replicas can lag; possible data loss if the leader fails before replicating |
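The sync/async distinction above can be shown with a toy model (all class and method names are assumed for illustration; real systems stream a replication log rather than copying values):

```python
class Replica:
    """Minimal replica: just a key-value store that applies writes."""
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Leader:
    def __init__(self, replicas, synchronous: bool):
        self.data = {}
        self.replicas = replicas
        self.synchronous = synchronous
        self.backlog = []  # writes not yet replicated (async mode only)

    def write(self, key, value) -> str:
        self.data[key] = value
        if self.synchronous:
            for r in self.replicas:       # wait for every replica before acking
                r.apply(key, value)
            return "ack (durable on all replicas)"
        self.backlog.append((key, value))  # ack now, replicate later
        return "ack (replicas may lag)"

    def replicate_backlog(self):
        """Background replication step for async mode."""
        for key, value in self.backlog:
            for r in self.replicas:
                r.apply(key, value)
        self.backlog.clear()


follower = Replica()
leader = Leader([follower], synchronous=False)
print(leader.write("user:1", "alice"))   # ack (replicas may lag)
print("user:1" in follower.data)         # False -- follower hasn't caught up
leader.replicate_backlog()
print("user:1" in follower.data)         # True
```

If the async leader fails before `replicate_backlog` runs, the acknowledged write is lost: that is exactly the durability gap the table describes.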
Architecture (leader–follower): clients send writes to the leader, which appends them to its log and streams the log to followers; followers apply the log in order, and reads may be served by the leader or any follower.
3. Real-World Usage
PostgreSQL / MySQL: Primary + read replicas; async or semi-sync; failover via promotion.
MongoDB: Replica set; one primary, secondaries replicate; automatic failover.
Cassandra / DynamoDB: Leaderless; quorum (W, R, N); tunable consistency.
Kafka: Partition replicas; in-sync replicas (ISR); leader handles writes.
4. Trade-offs
| Approach | Pros | Cons |
| --- | --- | --- |
| Sync replication | No loss on failover | Latency; if a replica is down, writes can block or fail |
| Async replication | Low latency; leader not blocked | Replication lag; possible loss on leader failure |
| Reading from replicas | Scales reads | Stale reads (lag); needs consistency handling (e.g. read-your-writes) |
| Multi-leader | Local writes in multiple DCs | Conflict resolution; added complexity |
When to use: you need HA or read scaling and can tolerate eventual consistency for reads from replicas. When not: a single node is acceptable, or you need strong consistency with no lag (then replicate synchronously and read only from the primary).
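One common way to handle the read-your-writes problem from the table above is sticky routing: after a user writes, send that user's reads to the primary for a short window so they never hit a replica that lags behind their own write. The helper names and the window length below are assumptions, not a driver API.

```python
import time

STICKY_SECONDS = 5.0  # assumed window; tune to your observed replication lag
_last_write_at: dict[str, float] = {}

def record_write(user_id: str) -> None:
    """Call after a successful write on behalf of this user."""
    _last_write_at[user_id] = time.monotonic()

def choose_endpoint(user_id: str) -> str:
    """Route recent writers to the primary; everyone else to a replica."""
    wrote_recently = (
        time.monotonic() - _last_write_at.get(user_id, float("-inf"))
        < STICKY_SECONDS
    )
    return "primary" if wrote_recently else "replica"

record_write("alice")
print(choose_endpoint("alice"))  # primary (just wrote; must see her own write)
print(choose_endpoint("bob"))    # replica (no recent write; stale read is fine)
```

This trades a little primary read load for the guarantee that users always see their own writes, without forcing all reads onto the primary.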
5. Failure Scenarios
| Failure | Mitigation |
| --- | --- |
| Leader fails | Promote a replica to leader (manually or automatically); clients reconnect to the new leader |
| Replica lag | Monitor lag; route critical reads to the leader; increase replica capacity or reduce write load |
| Split brain (multi-leader) | Conflict resolution (LWW, vector clocks, CRDTs); or avoid multi-leader |
| Replication loop (multi-leader) | Use a topology that avoids cycles; or use conflict-free structures |
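Of the conflict-resolution options above, last-writer-wins (LWW) is the simplest to sketch: each leader tags a write with a timestamp, and on conflict the highest timestamp wins. The catch, shown below, is that the losing concurrent write is silently dropped, which is why vector clocks or CRDTs are preferred when that loss matters.

```python
def lww_merge(a: tuple[float, str], b: tuple[float, str]) -> tuple[float, str]:
    """Pick the (timestamp, value) pair with the later timestamp."""
    return a if a[0] >= b[0] else b

# Two leaders concurrently wrote different values for the same key:
dc1 = (1700000001.0, "blue")
dc2 = (1700000002.5, "green")
print(lww_merge(dc1, dc2))  # (1700000002.5, 'green') -- dc1's write is lost
```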
6. Performance Considerations
Write latency: Sync replication adds round-trip(s) to replica(s); async does not.
Read scaling: More replicas → more read capacity; balance with replication load and storage cost.
Replication lag: Depends on write volume and replica capacity; can be seconds under load.
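Replication lag is typically measured by comparing the leader's log position with each replica's applied position (for example, PostgreSQL exposes LSNs via `pg_stat_replication`). A minimal sketch, with positions as plain byte offsets and all names assumed:

```python
def lag_bytes(leader_pos: int, replica_pos: int) -> int:
    """Bytes of log the replica has not yet applied."""
    return max(0, leader_pos - replica_pos)

def laggiest(leader_pos: int, replicas: dict[str, int]) -> tuple[str, int]:
    """Return (name, lag) of the replica furthest behind the leader."""
    name = max(replicas, key=lambda n: lag_bytes(leader_pos, replicas[n]))
    return name, lag_bytes(leader_pos, replicas[name])

print(laggiest(10_000, {"replica-a": 9_800, "replica-b": 7_500}))
# ('replica-b', 2500)
```

Alerting on this number (or its time-based equivalent) is what makes "route critical reads to the leader" actionable rather than guesswork.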
7. Implementation Patterns
Single leader + N replicas: Standard for RDBMS; async or semi-sync; read replicas for reporting and read scaling.
Leaderless quorum: W + R > N for strong consistency; tune W, R for latency vs durability.
Cross-datacenter: Async replica in second DC for DR; or multi-leader if needed for local writes.
Quick Revision
Purpose: HA, read scaling, durability.
Leader–follower: One writer; replicas copy; reads can go to replicas (stale possible).
Sync vs async: Sync = no loss, higher latency; async = low latency, possible loss.
Leaderless: Quorum (W, R, N); no single leader; tunable consistency.
Interview: “We use a primary and two async read replicas so writes are fast and we scale reads; we accept replication lag and route read-your-writes to the primary when needed.”