Phase 3 - SRE And Operations
This phase shifts from "how to deploy" to "how to run production systems safely."
Goal
By the end of this phase, you should be able to detect issues early, troubleshoot methodically, stabilize production systems, and explain the reasoning behind your actions.
Study Order
../../05_Observability_and_Troubleshooting/Monitoring/README.md../../07_Interview_Preparation/devops-interview-playbook.md../../07_Interview_Preparation/general-interview-questions.md../../07_Interview_Preparation/interview-questions-medium.md../../07_Interview_Preparation/interview-questions-hard.md../../04_Infrastructure_as_Code_and_Cloud/Cloud_Services/azure-medium-questions.md../../04_Infrastructure_as_Code_and_Cloud/Cloud_Services/azure-hard-questions.md
What To Master
Observability
metrics, logs, and traces
Prometheus, Grafana, and alerting
SLI, SLO, SLA, and error budgets
high-cardinality risks and noisy alerts
Troubleshooting
using metrics before guessing
events, logs, rollout history, and node inspection
common Kubernetes and Terraform failure states
how to distinguish app issues from infrastructure issues
Reliability
rollback strategy
canary, blue-green, and rolling trade-offs
graceful degradation and load shedding
incident mitigation versus long-term prevention
Security And Operational Hygiene
secrets handling
least privilege
image and dependency scanning
policy as code and runtime controls
Cost And Capacity
overprovisioning versus reliability margin
autoscaling logic
rightsizing
cloud waste and idle resources
Hands-On Tasks
Build a Grafana dashboard and define one actionable alert.
Simulate a failing deployment and practice rollback.
Debug a
CrashLoopBackOfforPendingpod in a lab.Write a short runbook for a high-latency service.
Write a short RCA for a deployment or config incident.
Checkpoint Questions
How do logs, metrics, and traces work together during an outage?
Why is symptom-based alerting better than paging on every CPU spike?
What do you check first in
CrashLoopBackOff?If HPA scales pods but latency stays high, what does that imply?
What is the difference between immediate mitigation and permanent fix?
Exit Criteria
Move to Phase 4 only when you can:
explain SLO-based thinking clearly
investigate a production-style issue in a structured way
name the commands you would use for Kubernetes and host troubleshooting
describe a safe rollback strategy
talk about monitoring, security, and cost as one operating model rather than separate topics
Last updated