Chaos Engineering
For SDE-3 Interview Preparation
"Stability is not a state, it's a practice."
What is Chaos Engineering?
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production.
Goal: Identify weaknesses before they manifest in customer-facing outages.
It is NOT:
Randomly breaking things
Testing in production without safeguards
"Breaking things" just for fun
It IS:
Controlled experiments
Validating hypotheses ("If the DB fails, the cache should serve stale data")
Minimizing blast radius
Principles
Start with Steady State: Define "normal" behavior (e.g., < 1% error rate).
Hypothesize: "If we kill one replica, latency will increase by < 10%".
Inject Fault: Kill the replica.
Verify: Did latency stay within bounds?
Fix: If it failed, fix the weakness.
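The five steps above can be sketched as a simple verification loop. This is a minimal Python sketch; the 1% error-rate threshold and function names are illustrative, not from any real tool.

```python
# Hypothetical steady-state definition: error rate must stay below 1%.
STEADY_STATE_ERROR_RATE = 0.01

def error_rate(requests: int, errors: int) -> float:
    return errors / requests

def run_experiment(baseline_errors: int, injected_errors: int,
                   requests: int = 10_000) -> bool:
    """Compare steady state before and during fault injection.

    Returns True if the hypothesis held (system stayed within bounds).
    """
    baseline = error_rate(requests, baseline_errors)
    # Abort if the system is not healthy to begin with.
    assert baseline < STEADY_STATE_ERROR_RATE, "system unhealthy; abort experiment"
    # Verify: did the fault keep us within the steady-state bound?
    return error_rate(requests, injected_errors) < STEADY_STATE_ERROR_RATE

run_experiment(baseline_errors=50, injected_errors=80)   # hypothesis held
run_experiment(baseline_errors=50, injected_errors=300)  # weakness found: fix it
```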
Fault Injection Types
1. Resource Exhaustion
CPU Spike: Run stress-ng to hit 100% CPU.
Memory Leak: Consume RAM until the OOM Killer triggers.
Disk Fill: Fill disk to 100%.
2. Network Attacks
Latency: Add 500ms delay to all packets (simulate cross-region lag).
Packet Loss: Drop 5% of packets.
Blackhole: Drop all traffic to specific IP (simulate dependency down).
DNS Failure: Block DNS port 53.
3. Application State
Clock Skew: Change system time (breaking distributed locks).
Process Kill: kill -9 the main process.
Certificate Expiry: Use expired certs.
Resilience Patterns
When Chaos reveals a weakness, use these patterns to fix it.
1. Circuit Breaker
Problem: One slow dependency cascades to entire system.
Solution: If the failure rate exceeds 50%, "open" the circuit and fail fast instead of waiting for timeouts.
Tools: Netflix Hystrix, Resilience4j.
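A minimal Python sketch of the open/fail-fast behavior (not the Hystrix or Resilience4j API; thresholds and names are illustrative, and real implementations also require a minimum call volume before opening):

```python
import time

class CircuitBreaker:
    """Opens when the recent failure rate exceeds a threshold; then fails fast."""

    def __init__(self, failure_threshold=0.5, window=10, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.window = window            # number of recent calls to track
        self.reset_timeout = reset_timeout
        self.results = []               # True = success, False = failure
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None       # half-open: allow a trial call
            self.results.clear()
        try:
            result = fn(*args, **kwargs)
            self._record(True)
            return result
        except Exception:
            self._record(False)
            raise

    def _record(self, ok):
        self.results = (self.results + [ok])[-self.window:]
        failure_rate = self.results.count(False) / len(self.results)
        if failure_rate > self.failure_threshold:
            self.opened_at = time.monotonic()
```

Once open, callers get an immediate error instead of tying up a thread waiting on a dead dependency.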
2. Bulkhead Pattern
Analogy: Ship compartments. If one floods, ship doesn't sink.
Implementation: Separate thread pools for different downstream services.
Benefit: Slow "Image Service" won't exhaust threads for "User Service".
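A minimal sketch of the thread-pool bulkhead using Python's standard library; the service names and pool sizes are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# One pool per downstream dependency: a backed-up Image Service can at
# worst exhaust its own 4 threads, never the User Service's threads.
image_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="image")
user_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="user")

def fetch_image(image_id):
    return f"image:{image_id}"   # stand-in for a slow remote call

def fetch_user(user_id):
    return f"user:{user_id}"     # stand-in for a fast remote call

image_future = image_pool.submit(fetch_image, 42)
user_future = user_pool.submit(fetch_user, 7)
print(user_future.result())      # user lookups proceed regardless of image load

image_pool.shutdown()
user_pool.shutdown()
```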
3. Retry with Exponential Backoff + Jitter
Problem: Thundering Herd. If service recovers, 10,000 clients retry simultaneously -> kills it again.
Solution:
Wait = Base * 2^Attempt + Jitter
Jitter = Random(0, 100ms)
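The formula as a Python sketch; the base, cap, and jitter values are illustrative, and a cap is added so delays don't grow without bound:

```python
import random

def backoff_delay(attempt: int, base=0.1, cap=30.0, jitter_max=0.1) -> float:
    """Exponential backoff with jitter, capped to avoid unbounded waits."""
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, jitter_max)  # jitter spreads out the herd

# Delays grow 0.1s, 0.2s, 0.4s, ... plus up to 100ms of random jitter,
# so 10,000 recovering clients retry at different moments, not all at once.
delays = [backoff_delay(a) for a in range(5)]
```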
4. Fallback Strategies
Graceful Degradation:
Recommendation Service Down? -> Show "Trending Now" (static list).
Price Service Down? -> Show cached price (with staleness warning).
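The recommendation-service example above can be sketched with a small fallback decorator; the function names and the simulated outage are illustrative:

```python
import functools

def with_fallback(fallback_fn):
    """Decorator: on any failure, degrade gracefully to the fallback."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                return fallback_fn(*args, **kwargs)
        return wrapper
    return decorator

def trending_now(user_id):
    return ["trending-1", "trending-2"]   # static list, always available

@with_fallback(trending_now)
def recommendations(user_id):
    raise TimeoutError("recommendation service down")  # simulated outage

recommendations(123)   # user still sees content, not an error page
```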
Game Days
Structured Chaos Event:
Preparation: Pick a date. Select a "Master of Disaster".
Execution: Run experiments (2-3 hours).
Observation: Monitor metrics/logs.
Reporting: Document findings. Create tickets for fixes.
Example Scenario:
Role: SDE-3 Lead.
Experiment: "What happens if the primary Leader of our Redis Cluster dies?"
Expectation: Automatic failover to Replica in < 30s. No data loss.
Reality: Failover took 5 mins because client library didn't refresh topology. Bug Found!
Tools
Chaos Monkey (Netflix)
Random instance termination
AWS EC2 (The original)
Gremlin
SaaS Platform
Safe, controlled, enterprise grade
Chaos Mesh
Kubernetes
YAML based experiments (PodKill, NetworkDelay)
LitmusChaos
Kubernetes
Cloud-native, declarative
Toxiproxy (Shopify)
Network simulation
Great for unit testing network partitions
Interview Tips
Q: "How do you ensure your system is resilient?"
A:
"I design for failure using patterns like Circuit Breakers and Bulkheads."
"We practice Chaos Engineering—running Game Days to validate our assumptions."
"For Distributed Systems, I specifically test Network Partitions (split-brain) and Clock Skew."
"I make operations idempotent so retries are safe."
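The idempotency point is worth being able to sketch on a whiteboard. A minimal version of the idempotency-key pattern, with hypothetical names and an in-memory store standing in for a database:

```python
# Server remembers completed requests by idempotency key, so a client
# that retries after a timeout cannot apply the side effect twice.
processed = {}                  # idempotency_key -> prior result
balance = {"acct-1": 100}

def debit(idempotency_key, account, amount):
    if idempotency_key in processed:
        return processed[idempotency_key]   # replay: return stored result
    balance[account] -= amount              # side effect happens exactly once
    processed[idempotency_key] = balance[account]
    return balance[account]

debit("req-42", "acct-1", 30)   # first attempt
debit("req-42", "acct-1", 30)   # retry after timeout: no double debit
```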