
Distributed Message Queue (Kafka)

Difficulty: Hard
Topics: Partitioning, Replication, Consumer Groups, Exactly-Once Semantics
Time: 75 minutes
Companies: LinkedIn, Uber, Netflix, Confluent

Problem Statement

Design a distributed message queue like Apache Kafka that:

  • Handles millions of messages per second

  • Guarantees message ordering within a partition

  • Provides at-least-once, at-most-once, and exactly-once delivery semantics

  • Scales horizontally by adding brokers

  • Supports multiple consumer groups

Requirements

Functional

  • Publish: Producers send messages to topics

  • Subscribe: Consumers receive messages from topics

  • Partitioning: Messages distributed across partitions for parallelism

  • Consumer Groups: Multiple consumers share topic load

  • Message Retention: Messages stored for configurable time (e.g., 7 days)

Non-Functional

  • Throughput: 1M messages/sec per broker

  • Latency: P99 < 10ms for writes

  • Durability: No message loss (replicated to 3 brokers)

  • Ordering: Messages in same partition maintain order

  • Availability: 99.99% uptime

Scale Estimation
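A back-of-envelope estimate follows directly from the requirements above (1M messages/sec per broker, 7-day retention, 3 replicas). The 1 KB average message size is an added assumption, not stated in the requirements:

```python
# Back-of-envelope capacity estimate. Only avg_msg_bytes is assumed;
# the other inputs come from the stated requirements.
msgs_per_sec = 1_000_000          # per broker, from the throughput requirement
avg_msg_bytes = 1_000             # assumed average message size (1 KB)
retention_days = 7                # from the retention requirement
replication_factor = 3            # from the durability requirement

bytes_per_day = msgs_per_sec * avg_msg_bytes * 86_400
raw_week = bytes_per_day * retention_days
with_replication = raw_week * replication_factor

print(f"Ingest: {msgs_per_sec * avg_msg_bytes / 1e9:.1f} GB/s per broker")
print(f"7-day raw storage: {raw_week / 1e12:.1f} TB")
print(f"With 3x replication: {with_replication / 1e12:.1f} TB")
```

At these assumptions a single broker ingests about 1 GB/s, so a week of retained data is roughly 605 TB before replication, which is why retention and replication factor dominate cluster sizing.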

Core Concepts

1. Topic & Partition

Why Partitions?

  • Parallelism (multiple consumers read different partitions)

  • Ordering within partition (not across partitions)

  • Scalability (add more partitions)
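The routing rule behind these properties can be sketched as a hash of the message key modulo the partition count. Kafka's default partitioner uses murmur2; `md5` below is a stand-in to keep the sketch dependency-free, and `partition_for` is an illustrative name:

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition deterministically.
    Same key -> same partition, which is what makes per-key ordering possible."""
    digest = hashlib.md5(key).digest()      # stand-in for Kafka's murmur2
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always lands in the same partition...
assert partition_for(b"user-42", 6) == partition_for(b"user-42", 6)
# ...which also means changing num_partitions reshuffles keys, so
# partition counts are usually chosen up front rather than grown casually.
```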

2. Consumer Groups

Key: Each partition assigned to exactly ONE consumer per group.
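That one-consumer-per-partition rule can be sketched as a simple round-robin assignment over the group (Kafka ships several assignor strategies; this is an illustrative minimal one):

```python
def assign_round_robin(partitions, consumers):
    """Assign each partition to exactly one consumer in the group.
    A partition never has two owners; a consumer may own several partitions."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

assignment = assign_round_robin(list(range(6)), ["c0", "c1", "c2"])
# assignment == {"c0": [0, 3], "c1": [1, 4], "c2": [2, 5]}
```

Note the corollary: with more consumers than partitions, the extra consumers sit idle, so partition count caps a group's parallelism.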

3. Replication

Architecture

Data Model

Message Structure

Offset Management

API Design

Producer API

Consumer API
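The shape of both APIs can be sketched with a toy single-partition log: producers get back an offset, consumers pull batches and commit offsets instead of acking individual messages. All names here (`MiniLog`, `produce`, `poll`, `commit`) are illustrative, not Kafka's actual client API:

```python
class MiniLog:
    """Single-partition append-only log with per-group committed offsets."""
    def __init__(self):
        self.records = []
        self.committed = {}   # group_id -> next offset to read

    def produce(self, value) -> int:
        """Append a record and return its offset."""
        self.records.append(value)
        return len(self.records) - 1

    def poll(self, group_id: str, max_records: int = 100):
        """Return up to max_records (offset, value) pairs past the commit point."""
        start = self.committed.get(group_id, 0)
        return list(enumerate(self.records[start:start + max_records], start))

    def commit(self, group_id: str, offset: int):
        """Mark everything up to and including `offset` as processed."""
        self.committed[group_id] = offset + 1

log = MiniLog()
for v in ("a", "b", "c"):
    log.produce(v)
batch = log.poll("analytics")          # [(0, 'a'), (1, 'b'), (2, 'c')]
log.commit("analytics", batch[-1][0])  # next poll starts at offset 3
```

Because consumption is just a cursor into an immutable log, many groups can read the same topic independently, each tracking its own offset.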

Deep Dive Topics

1. Delivery Semantics

At-Most-Once (May Lose Messages)

At-Least-Once (May Duplicate)

Most Common: At-least-once + idempotent processing
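The at-least-once plus idempotent-processing combination can be sketched as deduplication on a message id: redeliveries become no-ops, so duplicates from an uncommitted offset do no harm. The names below are illustrative:

```python
class IdempotentProcessor:
    """Makes at-least-once delivery safe by skipping already-seen message ids."""
    def __init__(self):
        self.seen = set()
        self.total = 0

    def handle(self, msg_id: str, amount: int):
        if msg_id in self.seen:      # duplicate redelivery -> no-op
            return
        self.seen.add(msg_id)
        self.total += amount

p = IdempotentProcessor()
p.handle("m1", 10)
p.handle("m1", 10)   # redelivered after a crash before the offset commit
p.handle("m2", 5)
# p.total == 15, not 25: the duplicate had no effect
```

In production the `seen` set would live in a database keyed by message id (or the operation itself would be naturally idempotent, e.g. an upsert), since an in-memory set does not survive restarts.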

Exactly-Once (Complex)


2. Replication & Leader Election

In-Sync Replicas (ISR):

Leader Election:
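The core rule can be sketched in a few lines: on leader failure, elect the first live replica that is still in the ISR, because only ISR members are guaranteed to hold every acknowledged write. This is a simplified illustration, not the controller's actual algorithm:

```python
def elect_leader(replicas, isr, failed):
    """Pick a new leader: first live replica still in the in-sync replica set.
    Restricting the choice to the ISR avoids losing acknowledged writes."""
    for broker in replicas:
        if broker in isr and broker not in failed:
            return broker
    return None   # no clean candidate; an "unclean" election would risk data loss

replicas = [1, 2, 3]       # replica list for one partition, current leader first
isr = {1, 2}               # broker 3 fell behind and dropped out of the ISR
leader = elect_leader(replicas, isr, failed={1})   # leader 1 crashes
# leader == 2; broker 3 is skipped even though it is alive, because it lags
```

The `None` branch is the real trade-off knob: Kafka's `unclean.leader.election.enable` decides whether an out-of-sync replica may take over (availability) or the partition goes offline (durability).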


3. Consumer Rebalancing

Scenario: Consumer added/removed from group

Rebalance Protocol:

  1. Group coordinator (Kafka broker) detects change

  2. All consumers stop consuming

  3. Partition assignment computed

  4. Consumers resume from last committed offset
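Steps 3 and 4 above can be sketched together: recompute a range assignment when membership changes, then have each consumer resume its newly owned partitions from the committed offsets. `range_assign` is an illustrative simplification of Kafka's range assignor:

```python
def range_assign(partitions, consumers):
    """Range assignment: contiguous chunks of partitions per consumer (sketch)."""
    consumers = sorted(consumers)
    n, k = len(partitions), len(consumers)
    per, extra = divmod(n, k)
    out, start = {}, 0
    for i, c in enumerate(consumers):
        size = per + (1 if i < extra else 0)   # first `extra` consumers get one more
        out[c] = partitions[start:start + size]
        start += size
    return out

committed = {p: 0 for p in range(6)}           # last committed offset per partition
before = range_assign(list(range(6)), ["c0", "c1"])
# "c2" joins the group -> the coordinator triggers a rebalance
after = range_assign(list(range(6)), ["c0", "c1", "c2"])
# Each consumer resumes every newly owned partition at its committed offset,
# so work moved between consumers is never lost, only possibly reprocessed.
resume = {c: {p: committed[p] for p in parts} for c, parts in after.items()}
```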


4. Message Ordering Guarantees

Within Partition: Guaranteed (FIFO)

Across Partitions: NOT guaranteed

Solution for Global Ordering: Use 1 partition (limits throughput)


5. Compaction (Log Compaction)

Use Case: Keep only latest value per key (e.g., user profile updates)

Config: cleanup.policy=compact
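The effect of compaction can be sketched as a last-write-wins pass over the log, with `None` standing in for a tombstone that deletes a key (a simplified illustration; the real cleaner works segment by segment in the background):

```python
def compact(log):
    """Keep only the latest record per key, in order of last write (sketch).
    A None value acts as a tombstone that removes the key entirely."""
    latest = {}
    for key, value in log:          # later records overwrite earlier ones
        latest[key] = value
    return [(k, v) for k, v in latest.items() if v is not None]

events = [("u1", "alice"), ("u2", "bob"), ("u1", "alice2"), ("u2", None)]
compact(events)   # [('u1', 'alice2')] -- u1 keeps its latest value, u2 is deleted
```

This is why a compacted topic can serve as a changelog for rebuilding state (e.g. the latest profile per user) without retaining every intermediate update forever.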


Scaling Strategies

Horizontal Scaling (Add Brokers)

Partition Scaling


Failure Scenarios

Broker Failure

Consumer Failure

Network Partition (Split Brain)


Monitoring

Key Metrics:

Alerts:


Interview Tips

Common Questions:

Q: "How does Kafka achieve high throughput?"

  • A: Sequential disk I/O (append-only log), zero-copy (sendfile), batching, compression

Q: "What happens if consumer crashes before committing offset?"

  • A: On restart, consumer re-reads from last committed offset (at-least-once delivery, may duplicate)

Q: "How do you ensure exactly-once semantics?"

  • A: Idempotent producer + transactional API + consumer reads only committed transactions

Q: "Kafka vs RabbitMQ?"

| Feature | Kafka | RabbitMQ |
| --- | --- | --- |
| Throughput | Very high (1M+/sec) | Medium (10K-100K/sec) |
| Message Ordering | Per partition | Per queue |
| Use Case | Event streaming, logs | Task queues, RPC |
| Persistence | Always (log) | Optional |

Key Trade-offs:

  • Ordering vs Parallelism (1 partition = ordered but slow, many partitions = fast but unordered)

  • Durability vs Latency (acks=all = slow but safe, acks=0 = fast but risky)

  • Commit frequency vs reprocessing (commit often = more overhead and lower throughput, commit rarely = larger replay window after a crash)
