
ZooKeeper

Overview

Apache ZooKeeper is a distributed coordination service used for maintaining configuration, naming, synchronization, and group services in distributed systems.


Core Concepts

Znodes (Data Nodes)

ZooKeeper's data model is a hierarchical namespace, like a file system.

/
├─ /app
│   ├─ /app/config
│   └─ /app/leader
├─ /services
│   ├─ /services/service1
│   └─ /services/service2

Znode Properties:

  • Data: Small payload (typically < 1MB, recommended < 1KB)

  • Version: Optimistic locking (CAS operations)

  • ACLs: Access Control Lists

  • Stat: Metadata (created time, modified time, version)
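The Version property is what makes conditional updates possible: a `setData` call carries the version the client last read, and the server rejects the write if the znode has changed since (ZooKeeper's `BadVersion` error). A toy in-memory sketch of that semantics (illustrative names, not the real client API):

```python
class BadVersionError(Exception):
    """Stands in for ZooKeeper's BadVersion error."""

class Znode:
    def __init__(self, data: bytes):
        self.data = data
        self.version = 0  # incremented on every successful set_data

    def set_data(self, data: bytes, expected_version: int):
        # Optimistic concurrency: the write succeeds only if nobody
        # modified the znode since we read it (-1 means "any version").
        if expected_version != -1 and expected_version != self.version:
            raise BadVersionError(f"expected {expected_version}, actual {self.version}")
        self.data = data
        self.version += 1

node = Znode(b"v1")
node.set_data(b"v2", expected_version=0)      # succeeds, version becomes 1
try:
    node.set_data(b"v3", expected_version=0)  # stale version: rejected
except BadVersionError:
    pass
```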

Znode Types

1. Persistent

  • Created explicitly, deleted explicitly

  • Survive client disconnection

2. Ephemeral

  • Automatically deleted when client session ends

  • Cannot have children

  • Used for presence detection (leader election, service discovery)

3. Sequential (a flag combinable with persistent or ephemeral)

  • Appends monotonically increasing counter

  • Example: /app/lock-0000000001, /app/lock-0000000002

  • Used for distributed locks and queues


Architecture

Ensemble (Cluster)

  • Odd number of servers (3, 5, 7 typical)

  • Quorum-based: Majority must be alive

  • Formula: Quorum = ⌊N / 2⌋ + 1

Example: N = 5 → quorum = 3, so the ensemble stays available with up to 2 servers down.
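The majority rule can be sketched directly (a plain illustration of the formula above):

```python
def quorum(n: int) -> int:
    """Minimum number of servers that must agree: floor(n/2) + 1."""
    return n // 2 + 1

def fault_tolerance(n: int) -> int:
    """How many servers can fail while a quorum still exists."""
    return n - quorum(n)

for n in (3, 5, 7):
    print(f"{n} servers: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failures")
```

Note that an even-sized ensemble buys nothing: 4 servers need a quorum of 3 and tolerate only 1 failure, same as 3 servers, which is why odd sizes are the convention.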

Leader & Followers

Roles:

  • Leader: Processes all write requests

  • Followers: Process read requests, forward writes to leader

Leader Election:

  • Uses ZAB (ZooKeeper Atomic Broadcast) protocol

  • Fast leader election algorithm (< 200ms typical)


ZAB (ZooKeeper Atomic Broadcast)

Purpose

Ensures total order and atomic delivery of transactions across the ensemble.

Two Phases

1. Discovery Phase

  • New leader elected

  • Synchronizes with followers

2. Broadcast Phase

  • Leader processes writes

  • Broadcasts changes to followers

  • Commits when quorum acknowledges

Write Flow

  1. Client sends the write to any server; a follower forwards it to the leader

  2. Leader assigns the transaction a ZXID and broadcasts a proposal to the followers

  3. Followers log the proposal durably and reply with an ACK

  4. Once a quorum has ACKed, the leader broadcasts a commit

  5. All servers apply committed transactions in ZXID order
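The quorum-commit step can be sketched as a toy leader that counts ACKs (a simulation of the idea, not ZooKeeper's actual implementation):

```python
class Leader:
    """Toy ZAB broadcast phase: propose -> collect ACKs -> commit at quorum."""
    def __init__(self, ensemble_size: int):
        self.quorum = ensemble_size // 2 + 1
        self.epoch = 1
        self.counter = 0
        self.acks = {}        # zxid -> ack count (leader's own ack included)
        self.committed = []   # zxids committed, in proposal order

    def propose(self, txn: str) -> int:
        self.counter += 1
        zxid = (self.epoch << 32) | self.counter  # epoch in high 32 bits
        self.acks[zxid] = 1   # the leader acks its own proposal
        return zxid

    def on_ack(self, zxid: int):
        self.acks[zxid] += 1
        if self.acks[zxid] == self.quorum:  # quorum reached exactly once
            self.committed.append(zxid)

leader = Leader(ensemble_size=5)          # quorum = 3
zxid = leader.propose("set /app/config")
leader.on_ack(zxid)                       # follower 1 -> 2 acks, still pending
assert leader.committed == []
leader.on_ack(zxid)                       # follower 2 -> 3 acks = quorum
assert leader.committed == [zxid]
```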

Key Property: Total Order

  • All nodes see transactions in same order

  • Achieved via ZXID (ZooKeeper Transaction ID)

ZXID

  • 64-bit ID: [epoch (32 bits)][counter (32 bits)]

  • Epoch: Changes with each leader election

  • Counter: Monotonically increasing within epoch

Example: epoch = 2, counter = 10 → ZXID = (2 << 32) | 10 = 0x20000000A. Any transaction from epoch 3 orders after every transaction from epoch 2.
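The epoch/counter packing can be sketched with plain bit operations (illustrative helper names):

```python
def make_zxid(epoch: int, counter: int) -> int:
    """Pack a 64-bit ZXID: high 32 bits = epoch, low 32 bits = counter."""
    return (epoch << 32) | (counter & 0xFFFFFFFF)

def split_zxid(zxid: int):
    """Unpack a ZXID back into (epoch, counter)."""
    return zxid >> 32, zxid & 0xFFFFFFFF

z1 = make_zxid(epoch=1, counter=7)
z2 = make_zxid(epoch=2, counter=0)
assert z2 > z1                      # newer epochs always order after older ones
assert split_zxid(z2) == (2, 0)
```

Because the epoch occupies the high bits, plain integer comparison of ZXIDs gives the total order: all transactions of a later leader sort after everything the previous leader committed.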


Session Management

Client-Server Session

  • Client establishes session with one server

  • Session ID + negotiated timeout (typically bounded between 2× and 20× the server's tickTime)

  • Heartbeats (pings) keep session alive

Session States

  • CONNECTING: Client is (re)establishing a connection to a server

  • CONNECTED: Session active; heartbeats flowing

  • CLOSED: Session explicitly closed by the client

  • EXPIRED: Server declared the session dead (no heartbeat within the timeout)

Session Expiry

  • If no heartbeat within timeout → session expires

  • Ephemeral znodes deleted

  • Watches triggered
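The expiry mechanics can be sketched as a toy session tracker (a simulation; the real server's data structures differ):

```python
class SessionTracker:
    """Toy session expiry: a session whose last heartbeat is older than the
    timeout is expired, and its ephemeral znodes are deleted with it."""
    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_heartbeat = {}   # session_id -> timestamp of last ping
        self.ephemerals = {}       # session_id -> set of znode paths it owns

    def heartbeat(self, session_id, now: float):
        self.last_heartbeat[session_id] = now

    def create_ephemeral(self, session_id, path):
        self.ephemerals.setdefault(session_id, set()).add(path)

    def expire_stale(self, now: float):
        """Expire dead sessions; return the ephemeral znodes deleted."""
        deleted = []
        for sid, last in list(self.last_heartbeat.items()):
            if now - last > self.timeout:
                deleted += sorted(self.ephemerals.pop(sid, set()))
                del self.last_heartbeat[sid]
        return deleted

tracker = SessionTracker(timeout=10.0)
tracker.heartbeat("s1", now=0.0)
tracker.create_ephemeral("s1", "/services/service1/instance-1")
assert tracker.expire_stale(now=5.0) == []   # still within the timeout
assert tracker.expire_stale(now=15.0) == ["/services/service1/instance-1"]
```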


Watches (Notifications)

One-Time Triggers

  • Client sets watch on znode

  • ZooKeeper sends one notification when znode changes

  • Must re-set watch for continuous monitoring

Example: A client calls getData("/app/config", watch=true). When another client updates /app/config, the watcher receives a single NodeDataChanged event; to keep monitoring, it must read again with a fresh watch.

Watch Types

  • Data watches: getData(), exists()

  • Child watches: getChildren()

Events:

  • NodeDataChanged

  • NodeChildrenChanged

  • NodeCreated

  • NodeDeleted
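The one-time trigger semantics can be sketched with a toy znode that drops each watcher after it fires (illustrative, not the real client API):

```python
class WatchableZnode:
    """Toy one-time watch: each registered watcher fires at most once,
    on the next change, and must be re-registered to keep watching."""
    def __init__(self, data: bytes):
        self.data = data
        self._watchers = []

    def get_data(self, watcher=None) -> bytes:
        if watcher is not None:
            self._watchers.append(watcher)  # registered for the NEXT change only
        return self.data

    def set_data(self, data: bytes):
        self.data = data
        pending, self._watchers = self._watchers, []  # clear before firing
        for w in pending:
            w("NodeDataChanged")

events = []
node = WatchableZnode(b"old")
node.get_data(watcher=events.append)   # set a data watch
node.set_data(b"new")                  # fires exactly once
node.set_data(b"newer")                # no watch registered: no event
assert events == ["NodeDataChanged"]
```

This is why real watch handlers typically re-read the znode (setting a new watch) inside the callback; any change between the event and the re-read is picked up by that read itself.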


Consistency Model

Sequential Consistency

  • Total order: All clients see updates in same order

  • Client FIFO: Client's requests executed in order sent

Guarantees

  1. Linearizable writes: All writes totally ordered

  2. Reads may be stale: Read from local replica (no quorum read)

  3. Sync before read: Use sync() for latest data

Example: Client A writes x = 5 through the leader. Client B, reading from a lagging follower, may still see the old value; calling sync() first forces the follower to catch up before the read.
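The stale-read-then-sync behavior can be sketched with a toy lagging replica (a simulation under simplified assumptions, not the real replication protocol):

```python
class Follower:
    """Toy replica that lags the leader; sync() catches it up before a read."""
    def __init__(self, leader_log):
        self.leader_log = leader_log  # shared, ordered list of (key, value) txns
        self.applied = 0              # how many leader txns this replica applied
        self.store = {}

    def replicate(self, upto: int):
        for key, value in self.leader_log[self.applied:upto]:
            self.store[key] = value
        self.applied = upto

    def read(self, key):
        return self.store.get(key)    # served locally: may be stale

    def sync(self):
        self.replicate(len(self.leader_log))  # flush everything pending

log = [("/app/config", b"v1"), ("/app/config", b"v2")]
f = Follower(log)
f.replicate(upto=1)                      # replica has only seen the first write
assert f.read("/app/config") == b"v1"    # stale read from the lagging follower
f.sync()
assert f.read("/app/config") == b"v2"    # up to date after sync()
```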


Common Use Cases

1. Leader Election

  • Each candidate creates an ephemeral sequential znode (e.g., /election/node-0000000003)

  • The candidate with the lowest sequence number is the leader

  • Every other candidate watches only its immediate predecessor

Key Point: Watch predecessor only (prevents herd effect)

2. Distributed Lock

  • Clients create ephemeral sequential znodes under the lock path

  • The client with the lowest sequence number holds the lock

  • When it releases (or its session expires), the next waiter's watch fires

Fairness: Guaranteed by sequential znodes (the lock is granted in request order)
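The sequential-znode lock queue, including the watch-the-predecessor rule, can be sketched as a toy simulation (no real ZooKeeper involved; names are illustrative):

```python
class LockQueue:
    """Toy fair lock built from sequential znodes: the lowest sequence holds
    the lock; each waiter watches only its immediate predecessor (no herd)."""
    def __init__(self):
        self.counter = 0
        self.nodes = []   # ordered znode names; the head holds the lock

    def acquire(self):
        name = f"lock-{self.counter:010d}"   # e.g. lock-0000000000
        self.counter += 1
        self.nodes.append(name)
        idx = self.nodes.index(name)
        # Watch the predecessor only; None means we already hold the lock.
        watch = self.nodes[idx - 1] if idx > 0 else None
        return name, watch

    def release(self, name: str):
        self.nodes.remove(name)   # only the watcher of `name` is notified

q = LockQueue()
a, a_watch = q.acquire()
b, b_watch = q.acquire()
c, c_watch = q.acquire()
assert a_watch is None                  # first client holds the lock
assert (b_watch, c_watch) == (a, b)     # each waiter watches one predecessor
q.release(a)                            # only b's watch fires, not c's
assert q.nodes[0] == b                  # lock passes in FIFO order
```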

3. Service Discovery

  • Each service instance registers an ephemeral znode containing its address

  • Clients list the children and set a child watch to track membership changes

Ephemeral znodes: Automatically de-register on crash

4. Configuration Management

  • Store shared configuration in persistent znodes

  • Clients read with a data watch and reload when NodeDataChanged fires


Performance Characteristics

Read Throughput

  • High: Reads served by any follower

  • Scales linearly with ensemble size

  • ~100K reads/sec per server

Write Throughput

  • Lower: All writes go through leader

  • Limited by leader capacity

  • ~10K-40K writes/sec for ensemble

Tuning:

  • More followers = higher read throughput

  • Faster leader = higher write throughput

  • Lower network latency between quorum members = lower write latency


Failure Scenarios

Follower Failure

  • Leader continues with remaining quorum

  • No downtime if quorum alive

Leader Failure

  • New leader elected (~200ms)

  • Brief write unavailability

  • Reads continue on followers

Network Partition

  • Partition with quorum continues

  • Partition without quorum stops serving writes (and, unless read-only mode is enabled, reads as well)

Example: 5-node cluster splits 2-3

  • 3-node partition: Can read and write

  • 2-node partition: Cannot form a quorum → rejects writes (serves reads only if read-only mode is enabled)


Tuning & Best Practices

Session Timeout

Trade-offs:

  • Short timeout: Faster failure detection, more heartbeat overhead

  • Long timeout: Slower failure detection, less overhead

Data Size

  • Keep znode data < 1KB

  • ZooKeeper is coordination service, not storage

  • Large data slows ensemble

Ensemble Size

  • 3 nodes: Tolerates 1 failure

  • 5 nodes: Tolerates 2 failures (recommended for production)

  • 7 nodes: Tolerates 3 failures (rare; higher write latency)

Avoid Watches on Large Child Lists

  • getChildren() with many children is expensive

  • Consider hierarchical structure


Common Pitfalls

❌ Using ZooKeeper as Database

  • Designed for small metadata (< 1MB per znode)

  • Not for large data storage

❌ Not Handling Session Expiry

  • Ephemeral znodes deleted

  • Must recreate on reconnect

❌ Herd Effect

  • All clients watch same znode

  • All wake up on change → thundering herd

  • Solution: Chain watches (watch predecessor)

❌ Blocking in Watch Callback

  • Callbacks run in ZooKeeper event thread

  • Blocking stalls all callbacks

  • Solution: Spawn async task
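One way to keep the callback fast is to hand the real work to a thread pool. A sketch of the pattern (the pool size and sleep are illustrative stand-ins for real work):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Watch callbacks run on the client's single event thread: slow work there
# stalls every other callback. Dispatch to a worker pool instead.
pool = ThreadPoolExecutor(max_workers=4)
results = []

def slow_reaction(event):
    time.sleep(0.05)          # stand-in for slow work (I/O, re-read, recompute)
    results.append(event)

def on_watch_event(event):
    """Fast callback: dispatch only, never block the event thread."""
    pool.submit(slow_reaction, event)

on_watch_event("NodeDataChanged")   # returns immediately
pool.shutdown(wait=True)            # in real code the pool outlives callbacks
assert results == ["NodeDataChanged"]
```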


Comparison: ZooKeeper vs etcd

| Feature    | ZooKeeper                                         | etcd                          |
|------------|---------------------------------------------------|-------------------------------|
| Protocol   | ZAB                                               | Raft                          |
| Language   | Java                                              | Go                            |
| API        | Custom wire protocol (ZooKeeper client libraries) | gRPC (with HTTP/JSON gateway) |
| Data Model | Hierarchical (tree)                               | Flat key-value                |
| Watch      | One-time                                          | Continuous (streaming)        |
| Use Case   | Coordination primitives                           | Kubernetes, service mesh      |


Interview Questions

Q: How does ZooKeeper achieve fault tolerance?

  • Quorum-based replication (majority must agree)

  • 5-node ensemble tolerates 2 failures

  • ZAB protocol ensures consistency across replicas

  • Fast leader election (<200ms) on leader failure

Q: Explain ephemeral znodes and their use case

  • Deleted automatically when client session ends

  • Cannot have children

  • Use case: Leader election (leader znode disappears if leader dies), service discovery (service de-registers on crash)

Q: What is the herd effect and how do you avoid it?

  • All clients wake up when watched znode changes

  • Causes spike in load

  • Solution: Watch only predecessor in distributed lock/queue scenario

  • Only one client wakes up per change

Q: Why use ZooKeeper instead of a database for coordination?

  • Specialized: Built for coordination (leader election, locks, watches)

  • Low latency: In-memory, optimized for small data

  • Ordering guarantees: Sequential consistency, total order

  • Failure detection: Session management, ephemeral znodes
