Observability
Production-focused guide for SDE-3: How to ensure your systems are working correctly
What is Observability?
Monitoring asks: "Is the system up?"
Observability asks: "Why is the system behaving this way?"
Observability is the ability to understand the internal state of a system by examining its outputs (logs, metrics, traces).
Key Difference:
Monitoring: Predefined dashboards, known failure modes
Observability: Debug unknown/novel failures, ask arbitrary questions
The Three Pillars
1. Metrics
What: Numerical measurements over time
Examples: request rate, error count, CPU utilization, P99 latency
Types:
Counter (always increasing, e.g. total requests served)
Gauge (goes up/down, e.g. memory usage, in-flight requests)
Histogram (distribution, e.g. request latency)
Tools: Prometheus, Grafana, CloudWatch, Datadog
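The three metric types can be sketched in plain Python. This is illustrative semantics only, not a real client API; in production you would use a library such as prometheus_client, and the class and field names here are assumptions:

```python
class Counter:
    """Monotonically increasing value, e.g. total requests served."""
    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount

class Gauge:
    """Value that can go up or down, e.g. in-flight requests."""
    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value

class Histogram:
    """Distribution of observations bucketed by upper bound, e.g. latency."""
    def __init__(self, buckets=(0.1, 0.5, 1.0, float("inf"))):
        self.buckets = {b: 0 for b in buckets}
        self.count = 0
        self.total = 0.0

    def observe(self, value):
        self.count += 1
        self.total += value
        # Cumulative buckets: every bucket whose bound >= value is incremented.
        for bound in self.buckets:
            if value <= bound:
                self.buckets[bound] += 1

requests_total = Counter()
requests_total.inc()          # one request served
in_flight = Gauge()
in_flight.set(3)              # three requests currently in flight
latency = Histogram()
latency.observe(0.25)         # a 250 ms request
```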
2. Logs
What: Discrete events with context
Structured Logging (JSON): machine-parseable key-value records, e.g. {"level": "ERROR", "trace_id": "abc123", "message": "payment failed"}
Unstructured Logging (Plain Text): free-form strings, e.g. "ERROR payment failed" (hard to query and aggregate at scale)
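A minimal structured-logging sketch using only Python's standard library; the JsonFormatter class and the chosen context fields (trace_id, user_id) are illustrative, not a canonical schema:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Attach extra context (passed via `extra=`) when present.
        for key in ("trace_id", "user_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# trace_id lets you correlate this line with metrics and traces.
logger.info("order placed", extra={"trace_id": "abc123", "user_id": 42})
```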
Log Levels: DEBUG (diagnostic detail), INFO (normal operation), WARN (unexpected but handled), ERROR (operation failed), FATAL (process cannot continue)
Best Practices:
Use structured logging (JSON) for easy parsing
Include trace_id for correlation
Log user actions, not just errors
Don't log sensitive data (passwords, PII)
Don't log too verbosely (log fatigue)
Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, CloudWatch Logs
3. Distributed Tracing
What: Track a single request across multiple services
Example: Process Order
Trace Structure: one trace per request; each operation is a span carrying a shared trace_id, its own span_id, a parent span_id, and timing
Implementation:
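A minimal sketch of span bookkeeping, assuming nothing beyond the standard library; real implementations use OpenTelemetry or a vendor SDK, and the Span class here is illustrative:

```python
import time
import uuid

class Span:
    """One operation within a trace (sketch, not a real tracing API)."""
    def __init__(self, name, trace_id=None, parent_id=None):
        self.name = name
        self.trace_id = trace_id or uuid.uuid4().hex  # shared by the request
        self.span_id = uuid.uuid4().hex[:16]          # unique per operation
        self.parent_id = parent_id
        self.start = time.time()
        self.duration = None

    def child(self, name):
        """Start a child span in the same trace."""
        return Span(name, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self):
        self.duration = time.time() - self.start

# One request flowing through three services:
root = Span("POST /orders")              # api gateway
payment = root.child("charge-card")      # payment service
payment.finish()
inventory = root.child("reserve-stock")  # inventory service
inventory.finish()
root.finish()
```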
Tools: Jaeger, Zipkin, AWS X-Ray, Datadog APM
Benefits:
Identify bottlenecks (which service is slow?)
Debug distributed failures
Understand dependencies
SLI, SLO, SLA
Service Level Indicator (SLI)
What: Quantitative measure of service quality
Examples: request latency, error rate, availability (fraction of successful requests), throughput
How to Measure: count good events over total events, at the load balancer or in the service itself
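As a sketch, an availability SLI is simply good events divided by total events (function name and numbers are illustrative):

```python
def availability_sli(good_requests, total_requests):
    """Fraction of requests that succeeded (the SLI)."""
    if total_requests == 0:
        return 1.0  # no traffic, no failures
    return good_requests / total_requests

sli = availability_sli(good_requests=999_532, total_requests=1_000_000)
print(f"{sli:.4%}")  # 99.9532%
```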
Service Level Objective (SLO)
What: Target value for an SLI (internal goal)
Examples: 99.9% of requests succeed over a rolling 30-day window; P99 latency < 500ms
Error Budget: the allowed unreliability, 1 − SLO; once the budget is spent, pause feature work and invest in reliability
Calculation:
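The calculation converts 1 − SLO into allowed downtime per window. A quick sketch (a 30-day window gives 43.2 minutes for 99.9%; using an average month of 30.42 days gives roughly 43.8 minutes):

```python
def error_budget_minutes(slo, days=30):
    """Allowed downtime per window for a given availability SLO."""
    total_minutes = days * 24 * 60
    return (1 - slo) * total_minutes

print(error_budget_minutes(0.999))   # ~43.2 min per 30-day month
print(error_budget_minutes(0.9999))  # ~4.3 min per 30-day month
```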
Service Level Agreement (SLA)
What: Contractual commitment to customers (legal/financial consequences)
Examples: 99.9% monthly uptime, with service credits owed to customers if missed
Rule: SLA < SLO. Keep the external SLA looser than the internal SLO (e.g. SLO 99.95%, SLA 99.9%) so you breach your own target before the contract.
Key Metrics to Track
Application Metrics
Golden Signals (Google SRE):
1. Latency: time to serve a request; track percentiles (P50/P99) and separate failed from successful requests
2. Traffic: demand on the system, e.g. requests per second
3. Errors: rate of failed requests
4. Saturation: how "full" the service is, e.g. CPU, memory, queue depth
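A toy recorder for the four signals, assuming a single-process service; class and method names are illustrative, not a real library:

```python
class GoldenSignals:
    """Toy per-service recorder for the four golden signals."""
    def __init__(self, max_in_flight=100):
        self.latencies = []        # 1. Latency: per-request durations (s)
        self.requests = 0          # 2. Traffic: total requests seen
        self.errors = 0            # 3. Errors: failed requests
        self.in_flight = 0         # 4. Saturation: current load vs capacity
        self.max_in_flight = max_in_flight

    def record(self, duration_s, ok):
        self.requests += 1
        self.latencies.append(duration_s)
        if not ok:
            self.errors += 1

    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0

    def p99_latency(self):
        ordered = sorted(self.latencies)
        return ordered[int(0.99 * (len(ordered) - 1))]

    def saturation(self):
        return self.in_flight / self.max_in_flight

signals = GoldenSignals()
signals.record(0.120, ok=True)   # fast, successful request
signals.record(0.950, ok=False)  # slow, failed request
```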
Infrastructure Metrics
Compute: CPU utilization, memory usage, disk I/O, network throughput
Database: query latency, connection pool usage, replication lag, slow-query count
Cache: hit rate, evictions, memory usage
Distributed Tracing Deep Dive
Context Propagation
HTTP Headers: trace context travels in headers such as the W3C traceparent (or Zipkin's X-B3-* headers)
Propagation: each service extracts the incoming trace ID, creates its own span, and forwards the headers on every outbound call
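A sketch of the W3C traceparent header format (version-traceid-parentid-flags); the helper names are illustrative:

```python
import uuid

def make_traceparent(trace_id=None, span_id=None, sampled=True):
    """Build a traceparent header: 00-<32 hex trace id>-<16 hex span id>-<flags>."""
    trace_id = trace_id or uuid.uuid4().hex      # 32 hex chars, whole request
    span_id = span_id or uuid.uuid4().hex[:16]   # 16 hex chars, this hop
    flags = "01" if sampled else "00"            # 01 = sampled
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    version, trace_id, span_id, flags = header.split("-")
    return {"trace_id": trace_id, "parent_span_id": span_id,
            "sampled": flags == "01"}

# Service A creates the header; service B extracts it and reuses the
# trace_id while minting its own span_id:
outgoing = make_traceparent()
ctx = parse_traceparent(outgoing)
forwarded = make_traceparent(trace_id=ctx["trace_id"])
```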
Sampling
Problem: Tracing every request is expensive
Strategies:
1. Head-based Sampling (decide at start): cheap and simple, but may discard the rare traces you care about
2. Tail-based Sampling (decide after completion): can keep every error or slow trace, but requires buffering spans until the trace finishes
Alerting Best Practices
Alert Fatigue
Problem: Too many alerts → alerts get ignored.
Solution: Alert only on actionable, high-priority issues.
Good Alert vs Bad Alert
Bad Alert: "CPU usage > 80% on host-42" (cause-based, no stated user impact, often not actionable)
Good Alert: "Checkout error rate > 1% for 5 minutes" (symptom-based, user-facing, actionable, with a linked runbook)
Alert Severity Levels
P0 (Critical): Page on-call immediately
P1 (High): Alert during business hours
P2 (Medium): Create ticket
P3 (Low): Optional notification
On-Call Runbooks
Example Runbook for "High Error Rate"
1. Check the error-rate dashboard: which endpoints and status codes are affected?
2. Check recent deployments; if the spike correlates with a release, roll back.
3. Check downstream dependencies (database, cache, third-party APIs) for outages.
4. If unresolved within 15 minutes, escalate to the service owner.
Tools & Technologies
Metrics
Prometheus: open-source, pull-based metrics with the PromQL query language; commonly paired with Grafana for dashboards
Datadog: hosted SaaS covering metrics, logs, and traces in one platform
CloudWatch: AWS-native metrics, logs, and alarms
Logging
ELK Stack (Elasticsearch + Logstash + Kibana): open-source log ingestion, storage, and search
Splunk: commercial log analytics at enterprise scale
Tracing
Jaeger: open-source distributed tracing (CNCF), works with OpenTelemetry instrumentation
Datadog APM: hosted tracing integrated with Datadog metrics and logs
Real-World Example: E-commerce Checkout
Metrics to Track: checkout success rate, payment latency (P99), orders per second
Distributed Trace: checkout request → order service → payment service (Stripe) → inventory service → notification service
Finding: Payment processing is slow (300ms).
Action: Optimize Stripe API calls or use async processing.
Alert: page on-call if checkout success rate drops below 99% for 5 minutes
Interview Tips
When discussing observability:
Mention the 3 pillars: Metrics, Logs, Traces
Define SLI/SLO/SLA with examples
Golden Signals: Latency, Traffic, Errors, Saturation
Alert on symptoms, not causes: "Error rate high" not "CPU high"
Error budgets: "If SLO is 99.9%, we have 43.8 min downtime/month"
Example Answer:
"For this payment system, I'd track SLIs like transaction success rate (target: 99.99%) and latency (P99 < 500ms). Use Prometheus for metrics, Jaeger for distributed tracing to debug cross-service issues. Implement structured logging with trace IDs for correlation. Set up alerts on SLO violations, like success rate < 99.9% for 5 minutes. Maintain an error budget: if we burn through it, stop new features and focus on reliability."
Senior Engineer Insights
Design trade-offs: More metrics and traces improve debuggability but add cost and noise. Define SLIs that reflect user impact (e.g. success rate, latency), not just internal metrics (e.g. CPU). SLOs should be achievable but strict enough to force investment in reliability.
Cost: High-cardinality metrics and long retention are expensive. Sample traces (e.g. 1% or tail-based for errors/slow); aggregate metrics; set retention policies. Use error budgets to decide when to invest in reliability vs features.
Operational complexity: Centralized logging and tracing require pipelines (agents, collectors, storage). Alert fatigue kills response; alert on symptoms (e.g. error rate) with runbooks; avoid alerting on every possible cause.
Deployment: Feature flags and canaries need observability (latency, errors by version). Correlate deployments with metric changes; use trace IDs across service boundaries for debugging.
Resilience: Observability itself must be resilient: buffers (queues) for log/trace ingestion so backpressure doesn’t kill the app; sampling under load so tracing doesn’t add significant latency.
Quick Revision
Three pillars: Metrics (numbers over time), Logs (discrete events), Traces (request across services). Use all three; correlate with trace_id.
SLI/SLO/SLA: SLI = measurable indicator; SLO = target (internal); SLA = contract (external). SLO stricter than SLA; error budget = 1 − SLO.
Golden signals: Latency, Traffic, Errors, Saturation. Alert on symptoms (e.g. error rate high), not only causes (e.g. CPU high).
Interview talking points: “We track SLIs (success rate, P99 latency), set SLOs and error budgets. We use Prometheus for metrics, structured logs with trace_id, and Jaeger for tracing. We alert on SLO burn rate and have runbooks for common failures.”
Common mistakes: Too many alerts (fatigue); no error budget; tracing everything at 100% (cost); logging PII or secrets.