Service Discovery
Mechanism for services to find and communicate with instances of other services in a dynamic environment (e.g. containers, autoscaling).
1. Concept Overview
In a distributed system, service instances come and go (deploys, scaling, failures). Service discovery lets a client or router find the current set of healthy instances (e.g. host:port or DNS names) for a service.
Why it exists: Hard-coded or static config breaks when instances change. Discovery keeps clients and load balancers in sync with the actual topology.
2. Core Principles
Client-side vs server-side
Client-side
How: Client queries a registry (e.g. Consul, etcd), gets the list of instances, and chooses one (e.g. round-robin)
Pro: Fewer hops; client can do smart LB
Con: Client complexity; every client needs discovery logic
Server-side
How: Client talks to a fixed endpoint (e.g. LB or proxy); the LB/proxy uses the registry to find backends
Pro: Simple client
Con: Extra hop; LB can be a bottleneck
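A minimal sketch of the client-side pattern in Python: the registry query is stubbed with a fixed list (the hosts and service name are hypothetical), and the client rotates through the returned instances round-robin.

```python
import itertools

def fetch_instances(service_name):
    """Stub for a registry query (e.g. Consul's HTTP API would go here).
    Returns the current (host, port) pairs for the named service."""
    return [("10.0.0.1", 8080), ("10.0.0.2", 8080), ("10.0.0.3", 8080)]

class RoundRobinClient:
    """Client-side discovery: the client holds the instance list
    and does its own load balancing (here, simple round-robin)."""
    def __init__(self, service_name):
        self._cycle = itertools.cycle(fetch_instances(service_name))

    def pick(self):
        # Each call returns the next instance in rotation.
        return next(self._cycle)
```

In a real client, `fetch_instances` would hit the registry (or a watch stream) and the cached list would be refreshed as instances come and go.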
Registry
Registry holds: service name → list of (host, port, metadata, health).
Registration: Instances register on start and deregister on shutdown; often with TTL and heartbeat.
Discovery: Clients or LBs query the registry (or subscribe to updates) to get the current list.
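The registration/TTL/heartbeat cycle above can be sketched as a toy in-memory registry (not any real registry's API; the `now` parameter is just to make time explicit):

```python
import time

class Registry:
    """Toy service registry: name -> {(host, port): expiry}.
    Heartbeats refresh the TTL; lookups drop instances whose TTL lapsed."""
    def __init__(self, ttl_seconds=10.0):
        self.ttl = ttl_seconds
        self._services = {}  # name -> {(host, port): expiry timestamp}

    def register(self, name, host, port, now=None):
        now = time.monotonic() if now is None else now
        self._services.setdefault(name, {})[(host, port)] = now + self.ttl

    # A heartbeat is just a re-registration that refreshes the TTL.
    heartbeat = register

    def deregister(self, name, host, port):
        # Graceful shutdown: remove the instance immediately.
        self._services.get(name, {}).pop((host, port), None)

    def lookup(self, name, now=None):
        now = time.monotonic() if now is None else now
        # Drop entries whose TTL has lapsed (missed heartbeats).
        live = {k: v for k, v in self._services.get(name, {}).items() if v > now}
        self._services[name] = live
        return sorted(live)
```

Real registries add health checks on top of TTLs, so a hung-but-heartbeating instance can still be removed.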
Architecture: instances register (and heartbeat) with the registry; clients or load balancers query or subscribe to it and route only to healthy instances.
3. Real-World Usage
Consul: Service registry, health checks, DNS interface; used in many on-prem and cloud setups.
etcd: Key-value store used by Kubernetes for cluster state; often used as registry.
Kubernetes: Built-in: Services and DNS (e.g. service-name.namespace.svc.cluster.local); no separate registry app.
AWS: Cloud Map; ECS/EKS integrations.
Eureka: Netflix OSS; client-side discovery; often used with Spring Cloud.
4. Trade-offs
Client-side
Pro: No extra hop; client can do LB and failover
Con: Heavy clients; every language needs an SDK
Server-side
Pro: Thin clients; central control
Con: Extra hop; LB/registry on the critical path
DNS-based
Pro: Universal; simple
Con: TTL lag; less flexible than an API registry
API-based registry
Pro: Real-time updates; rich metadata
Con: Dependency on registry availability
When to use: Microservices or any environment where instance endpoints change (containers, autoscaling).
When not: Single monolith or static, long-lived instances with fixed config.
5. Failure Scenarios
Registry down
Mitigation: Cache the last known list in clients; tolerate staleness; run a multi-node registry (Consul, etcd cluster)
Stale entries
Mitigation: TTLs and heartbeats; health checks; deregister on failure
Thundering herd
Mitigation: Clients back off when the registry is slow; cache and rate-limit discovery calls
Split brain
Mitigation: Use a CP store (etcd, Consul) with quorum; avoid serving stale data
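The "registry down" mitigation (serve the last known list rather than fail) can be sketched as a thin wrapper around any discovery call; `query_fn` is a stand-in for a real registry client:

```python
import random

class ResilientDiscovery:
    """Tolerates a registry outage: on query failure, serve the last
    known instance list instead of erroring (stale beats empty)."""
    def __init__(self, query_fn):
        self._query = query_fn   # callable returning the current instance list
        self._last_known = []

    def instances(self):
        try:
            result = self._query()
            if result:  # only overwrite the cache with a non-empty answer
                self._last_known = list(result)
        except Exception:
            pass  # registry down or slow: fall through to the cached list
        return list(self._last_known)

    def pick(self):
        pool = self.instances()
        if not pool:
            raise RuntimeError("no known instances")
        return random.choice(pool)
```

A production version would also add backoff and jitter on retries so a recovering registry isn't stampeded.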
6. Performance Considerations
Latency: Discovery should be fast; cache results with short TTL or use watch/long-poll for updates.
Scale: Registry must handle many services and instances; scale registry (cluster) and limit update rate per service.
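Both points above come down to caching: hot paths should not hit the registry per request. A minimal TTL cache sketch (the injectable `clock` is just to keep it testable):

```python
import time

class CachedDiscovery:
    """Cache discovery results for a short TTL: hot paths avoid a registry
    round-trip (latency) and the registry sees bounded load (scale)."""
    def __init__(self, query_fn, ttl_seconds=2.0, clock=time.monotonic):
        self._query = query_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached = None
        self._expires = float("-inf")

    def instances(self):
        now = self._clock()
        if self._cached is None or now >= self._expires:
            self._cached = self._query()  # one registry call per TTL window
            self._expires = now + self._ttl
        return self._cached
```

Watch/long-poll APIs go one step further: instead of polling on a TTL, the registry pushes changes and the cache is updated only when the instance list actually changes.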
7. Implementation Patterns
Kubernetes: Use Service + DNS; optional sidecar or client that uses API for more dynamic behavior.
Consul: Agents on each node; services register; clients use DNS or HTTP API; health checks drive removal.
Service mesh: Sidecar proxies often implement discovery and LB; application stays discovery-agnostic.
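For the Kubernetes pattern, a minimal Service manifest shows the whole mechanism: the label selector picks the backing pods, and cluster DNS serves the name (service and namespace names here are placeholders).

```yaml
# Minimal ClusterIP Service: Kubernetes tracks matching pods as endpoints,
# and cluster DNS resolves orders.shop.svc.cluster.local to the Service.
apiVersion: v1
kind: Service
metadata:
  name: orders        # placeholder service name
  namespace: shop     # placeholder namespace
spec:
  selector:
    app: orders       # pods with this label become the backing instances
  ports:
    - port: 80          # port clients connect to
      targetPort: 8080  # port the pod actually listens on
```

Failed pods drop out of the endpoint list automatically via readiness probes, which is the health-check half of discovery handled by the platform.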
Quick Revision
Purpose: Find current, healthy instances of a service in a dynamic environment.
Client-side: Client gets list from registry and chooses instance. Server-side: LB/proxy uses registry.
Registry: Registration (with TTL/heartbeat) and discovery (API or DNS); health checks remove bad instances.
Failure: Registry HA; clients cache list; health checks and TTL avoid stale entries.
Interview: “We use Consul for service discovery: instances register on startup and clients query Consul to get the list of healthy instances so we don’t rely on static IPs in a scaling environment.”