
Observability

Production-focused guide for SDE-3: How to ensure your systems are working correctly

What is Observability?

Monitoring asks "Is the system up?"; observability asks "Why is the system behaving this way?"

Observability is the ability to understand the internal state of a system by examining its outputs (logs, metrics, traces).

Key Difference:

  • Monitoring: Predefined dashboards, known failure modes

  • Observability: Debug unknown/novel failures, ask arbitrary questions


The Three Pillars

1. Metrics

What: Numerical measurements over time

Examples: requests per second, error count, P99 latency, CPU utilization

Types:

Counter (always increasing)

Gauge (goes up/down)

Histogram (distribution)

Tools: Prometheus, Grafana, CloudWatch, Datadog
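The three metric types can be sketched in plain Python. This is a hedged illustration of their semantics only (the class names are made up, not a real Prometheus client API):

```python
import bisect

class Counter:
    """Monotonically increasing value, e.g. total requests served."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        assert amount >= 0, "counters never decrease"
        self.value += amount

class Gauge:
    """Point-in-time value that can go up or down, e.g. queue depth."""
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value

class Histogram:
    """Distribution of observations, bucketed by upper bound, e.g. latency."""
    def __init__(self, buckets=(0.05, 0.1, 0.25, 0.5, 1.0)):
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last slot = +Inf bucket
        self.total = 0.0
        self.count = 0
    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound >= value,
        # mirroring Prometheus-style "le" (less-or-equal) buckets.
        self.counts[bisect.bisect_left(self.buckets, value)] += 1
        self.total += value
        self.count += 1

requests = Counter(); requests.inc()          # one request served
latency = Histogram(); latency.observe(0.3)   # lands in the 0.5s bucket
```

The key distinction to remember: counters only go up (rates are derived by the query layer), gauges are sampled current values, and histograms preserve the distribution so you can compute percentiles later.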


2. Logs

What: Discrete events with context

Structured Logging (JSON): machine-parseable key-value records, e.g. {"level": "ERROR", "trace_id": "abc123", "message": "payment failed"}

Unstructured Logging (Plain Text): free-form lines, e.g. "ERROR payment failed for user 42"; easy to write, hard to query reliably

Log Levels: DEBUG (diagnostic detail), INFO (normal operation), WARN (unexpected but handled), ERROR (operation failed), FATAL (service cannot continue)

Best Practices:

  • Use structured logging (JSON) for easy parsing

  • Include trace_id for correlation

  • Log user actions, not just errors

  • Don't log sensitive data (passwords, PII)

  • Don't log too verbosely (log fatigue)

Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, CloudWatch Logs
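A minimal structured-logging sketch using only the standard library; the field names (trace_id, user_id) are illustrative, not a fixed schema:

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line for easy parsing."""
    def format(self, record):
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Attach correlation fields passed via `extra=` (e.g. trace_id),
        # so logs can be joined with traces downstream.
        for key in ("trace_id", "user_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed", extra={"trace_id": "abc123", "user_id": "u42"})
```

Note what is deliberately absent: no passwords, no card numbers, no free-form PII — the allowlist of fields in the formatter is one simple way to enforce that.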


3. Distributed Tracing

What: Track a single request across multiple services

Example: Process Order

Trace Structure:

Implementation:

Tools: Jaeger, Zipkin, AWS X-Ray, Datadog APM

Benefits:

  • Identify bottlenecks (which service is slow?)

  • Debug distributed failures

  • Understand dependencies
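The trace/span model behind these benefits can be shown with a toy sketch (real systems use OpenTelemetry SDKs; the class and method names here are invented for illustration):

```python
import time
import uuid

class Span:
    """One timed unit of work; all children share the root's trace_id."""
    def __init__(self, name, trace_id=None, parent_id=None):
        self.name = name
        self.trace_id = trace_id or uuid.uuid4().hex  # one id per request
        self.span_id = uuid.uuid4().hex[:16]
        self.parent_id = parent_id
        self.start = time.monotonic()
        self.duration_ms = None
    def child(self, name):
        # Child spans keep the trace_id and record who spawned them.
        return Span(name, trace_id=self.trace_id, parent_id=self.span_id)
    def finish(self):
        self.duration_ms = (time.monotonic() - self.start) * 1000
        return self

# "Process Order" crossing services, all stitched together by one trace_id
root = Span("checkout")
payment = root.child("payment-service").finish()
inventory = root.child("inventory-service").finish()
root.finish()
```

Because every span carries the same trace_id and its parent's span_id, a tracing backend can reassemble the tree and show exactly where the time went.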


SLI, SLO, SLA

Service Level Indicator (SLI)

What: Quantitative measure of service quality

Examples: request success rate, P99 latency, availability, throughput

How to Measure:
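Counting-based SLIs are typically good events divided by total events; a small sketch (the 500ms threshold is illustrative):

```python
def availability_sli(status_codes):
    """Fraction of requests that did not fail server-side (non-5xx)."""
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)

def latency_sli(latencies_ms, threshold_ms=500):
    """Fraction of requests at or under the latency threshold."""
    fast = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return fast / len(latencies_ms)

codes = [200] * 997 + [500] * 3
print(availability_sli(codes))  # 0.997
```

Framing latency as a ratio ("99% of requests under 300ms") rather than an average keeps the SLI on the same good/total scale as availability, which makes error budgets uniform.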


Service Level Objective (SLO)

What: Target value for an SLI (internal goal)

Examples: 99.9% of requests succeed over a rolling 30 days; 99% of requests complete in under 300ms

Error Budget: the amount of unreliability you are allowed, 1 − SLO. Spend it on risky changes and experiments; when it is exhausted, prioritize reliability work over new features.

Calculation: for a 99.9% SLO, the budget is 0.1% of the period, about 43.8 minutes of downtime per average month (730 hours).
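The error-budget arithmetic in code form (730 hours is an average month, which is where the common 43.8-minute figure comes from):

```python
def error_budget_minutes(slo, period_hours=730):
    """Allowed downtime for a given SLO over a period (default: avg month)."""
    return (1 - slo) * period_hours * 60

for slo in (0.999, 0.9999):
    # Each extra nine divides the budget by ten.
    print(f"SLO {slo:.2%}: {error_budget_minutes(slo):.2f} min/month")
```

For 99.9% this prints about 43.80 minutes per month; for 99.99%, about 4.38 minutes, which is why four-nines targets usually require automated failover rather than human response.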


Service Level Agreement (SLA)

What: Contractual commitment to customers (legal/financial consequences)

Examples: 99.9% monthly uptime, with service credits issued if the commitment is missed

Rule: SLA < SLO. Keep the external commitment looser than the internal target so you breach your internal SLO (and react) before you breach the customer-facing SLA (e.g. internal SLO 99.95%, contractual SLA 99.9%).


Key Metrics to Track

Application Metrics

Golden Signals (Google SRE):

1. Latency: how long requests take (track percentiles such as P99; measure successful and failed requests separately)

2. Traffic: demand on the system (e.g. requests per second)

3. Errors: the rate of failed requests (explicit failures and wrong answers)

4. Saturation: how "full" the system is (utilization of its most constrained resource)
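All four signals can be derived from per-request records; a hedged sketch (the record fields and the capacity parameter are assumptions for illustration):

```python
def golden_signals(requests, window_s, capacity_rps):
    """Summarize latency, traffic, errors, and saturation for one window.

    `requests` is a list of dicts with illustrative fields:
    {"latency_ms": float, "status": int}
    """
    latencies = sorted(r["latency_ms"] for r in requests)
    p99_idx = max(0, int(len(latencies) * 0.99) - 1)
    rps = len(requests) / window_s
    return {
        "latency_p99_ms": latencies[p99_idx],
        "traffic_rps": rps,
        "error_rate": sum(r["status"] >= 500 for r in requests) / len(requests),
        "saturation": rps / capacity_rps,  # fraction of capacity in use
    }

# 99 fast successes plus one slow failure in a 10-second window
reqs = [{"latency_ms": 50 + i, "status": 200} for i in range(99)]
reqs.append({"latency_ms": 900, "status": 500})
signals = golden_signals(reqs, window_s=10, capacity_rps=20)
```

Note how the single 900ms failure barely moves an average but is exactly what P99 and error rate are built to surface.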


Infrastructure Metrics

Compute: CPU utilization, memory usage, disk I/O, network throughput

Database: query latency, active connections, replication lag, slow-query count

Cache: hit rate, evictions, memory usage


Distributed Tracing Deep Dive

Context Propagation

HTTP Headers: W3C Trace Context (traceparent, tracestate) or Zipkin's B3 headers (X-B3-TraceId, X-B3-SpanId)

Propagation:
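Context propagation usually rides on an HTTP header; the W3C Trace Context format is traceparent: 00-&lt;trace-id&gt;-&lt;span-id&gt;-&lt;flags&gt;. A parse-and-inject sketch (the helper names are illustrative):

```python
import re
import secrets

# version 00: 32-hex trace id, 16-hex span id, 2-hex flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def extract_context(headers):
    """Pull trace_id and parent span id from an incoming traceparent header."""
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    if not match:
        return None  # no (valid) context: this service starts a new trace
    trace_id, parent_id, flags = match.groups()
    return {"trace_id": trace_id, "parent_id": parent_id, "sampled": flags == "01"}

def inject_context(trace_id, sampled=True):
    """Build the outgoing header for the next hop, with a fresh span id."""
    span_id = secrets.token_hex(8)  # 16 hex chars
    return {"traceparent": f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"}

incoming = {"traceparent": "00-" + "ab" * 16 + "-" + "cd" * 8 + "-01"}
ctx = extract_context(incoming)
outgoing = inject_context(ctx["trace_id"])  # same trace, new span id
```

The invariant worth stating in an interview: the trace id is copied unchanged hop to hop, while each service mints a fresh span id and records the previous one as its parent.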

Sampling

Problem: Tracing every request is expensive

Strategies:

1. Head-based Sampling (decide at start)

2. Tail-based Sampling (decide after completion)
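Both strategies in miniature (the rates and keep-rules are illustrative):

```python
import random

def head_sample(trace_id_int, rate=0.01):
    """Decide at request start, before the outcome is known.

    Hashing on the trace id keeps the decision deterministic, so every
    service in the request path keeps or drops the same trace.
    """
    return (trace_id_int % 10_000) < rate * 10_000

def tail_sample(trace, slow_ms=1000):
    """Decide after completion: always keep errors and slow traces,
    keep only a small random fraction of the healthy rest."""
    if trace["error"] or trace["duration_ms"] > slow_ms:
        return True
    return random.random() < 0.001
```

The trade-off: head-based sampling is cheap but blind (it may drop the one trace you needed), while tail-based sampling keeps the interesting traces but requires buffering every span until the request finishes.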


Alerting Best Practices

Alert Fatigue

Problem: Too many alerts → ignored

Solution: Alert only on actionable, high-priority issues

Good Alert vs Bad Alert

Bad Alert: "CPU usage > 80%" (a cause, often transient, not tied to user impact, not directly actionable)

Good Alert: "Checkout error rate > 1% for 5 minutes" (a symptom, user-facing, actionable, linked to a runbook)

Alert Severity Levels

P0 (Critical): Page on-call immediately

P1 (High): Alert during business hours

P2 (Medium): Create ticket

P3 (Low): Optional notification
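One way to map symptoms to these severities is an SLO burn-rate check: compare the observed error rate to the rate that would exactly exhaust the error budget. A sketch with illustrative cutoffs (loosely following the burn-rate alerting idea from the Google SRE Workbook):

```python
def burn_rate(error_rate, slo):
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    budget = 1 - slo
    return error_rate / budget

def severity(error_rate, slo=0.999):
    """Map burn rate to the severity levels above (cutoffs are illustrative)."""
    rate = burn_rate(error_rate, slo)
    if rate >= 14:   # monthly budget gone in roughly two days: page now
        return "P0"
    if rate >= 6:
        return "P1"
    if rate >= 1:
        return "P2"
    return None      # within budget: no alert

assert severity(0.02) == "P0"  # 2% errors against a 0.1% budget: burn rate ~20
```

This is symptom-based by construction: the input is the user-visible error rate, and the thresholds are expressed in terms of the SLO rather than any internal cause like CPU.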

On-Call Runbooks

Example Runbook for "High Error Rate"

1. Confirm the alert: check the error-rate dashboard (which endpoints, which status codes?)

2. Check recent deployments; if the spike correlates with a deploy, roll back

3. Check dependency health: database, cache, third-party APIs

4. If not resolved within 15 minutes, escalate to the service owner and open an incident


Tools & Technologies

Metrics

Prometheus: open-source, pull-based metrics with the PromQL query language; typically paired with Grafana for dashboards

Datadog: hosted, full-stack monitoring (metrics, logs, APM)

CloudWatch: AWS-native metrics, logs, and alarms

Logging

ELK Stack (Elasticsearch + Logstash + Kibana): open-source log search, ingestion pipeline, and visualization

Splunk: commercial log search and analytics at scale

Tracing

Jaeger: open-source distributed tracing, OpenTelemetry-compatible

Datadog APM: hosted tracing correlated with Datadog metrics and logs


Real-World Example: E-commerce Checkout

Metrics to Track

Checkout success rate, checkout P99 latency, payment-provider latency, per-step error rate.

Distributed Trace

Finding: Payment processing is slow (300ms).

Action: Optimize Stripe API calls or use async processing.

Alert

Page if the checkout success rate drops below 99% for 5 minutes (symptom-based, tied to the SLO).


Interview Tips

When discussing observability:

  1. Mention the 3 pillars: Metrics, Logs, Traces

  2. Define SLI/SLO/SLA with examples

  3. Golden Signals: Latency, Traffic, Errors, Saturation

  4. Alert on symptoms, not causes: "Error rate high" not "CPU high"

  5. Error budgets: "If SLO is 99.9%, we have 43.8 min downtime/month"

Example Answer:

"For this payment system, I'd track SLIs like transaction success rate (target: 99.99%) and latency (P99 < 500ms). Use Prometheus for metrics, Jaeger for distributed tracing to debug cross-service issues. Implement structured logging with trace IDs for correlation. Set up alerts on SLO violations, like success rate < 99.9% for 5 minutes. Maintain an error budget: if we burn through it, stop new features and focus on reliability."


Senior Engineer Insights

  • Design trade-offs: More metrics and traces improve debuggability but add cost and noise. Define SLIs that reflect user impact (e.g. success rate, latency), not just internal metrics (e.g. CPU). SLOs should be achievable but strict enough to force investment in reliability.

  • Cost: High-cardinality metrics and long retention are expensive. Sample traces (e.g. 1%, or tail-based sampling that keeps errors and slow requests); aggregate metrics; set retention policies. Use error budgets to decide when to invest in reliability vs features.

  • Operational complexity: Centralized logging and tracing require pipelines (agents, collectors, storage). Alert fatigue kills response; alert on symptoms (e.g. error rate) with runbooks; avoid alerting on every possible cause.

  • Deployment: Feature flags and canaries need observability (latency, errors by version). Correlate deployments with metric changes; use trace IDs across service boundaries for debugging.

  • Resilience: Observability itself must be resilient: buffers (queues) for log/trace ingestion so backpressure doesn’t kill the app; sampling under load so tracing doesn’t add significant latency.


Quick Revision

  • Three pillars: Metrics (numbers over time), Logs (discrete events), Traces (request across services). Use all three; correlate with trace_id.

  • SLI/SLO/SLA: SLI = measurable indicator; SLO = target (internal); SLA = contract (external). SLO stricter than SLA; error budget = 1 − SLO.

  • Golden signals: Latency, Traffic, Errors, Saturation. Alert on symptoms (e.g. error rate high), not only causes (e.g. CPU high).

  • Interview talking points: “We track SLIs (success rate, P99 latency), set SLOs and error budgets. We use Prometheus for metrics, structured logs with trace_id, and Jaeger for tracing. We alert on SLO burn rate and have runbooks for common failures.”

  • Common mistakes: Too many alerts (fatigue); no error budget; tracing everything at 100% (cost); logging PII or secrets.
