07 Interview Preparation
Use this file as the main answer framework for DevOps, SRE, platform, and cloud-operations interviews.
What Interviewers Are Really Testing
1. Fundamentals
You should be able to explain how Linux, networking, cloud, CI/CD, containers, and observability fit together. Interviewers usually care less about memorizing a command and more about whether you know when and why to use it.
2. Systems Thinking
A strong DevOps engineer thinks in dependencies and blast radius:
What changed recently?
Which component is actually failing?
Which downstream system is causing the symptom?
What is the fastest safe mitigation?
3. Troubleshooting Depth
Good candidates do not jump to a favorite root cause. They narrow the problem with evidence:
Metrics to understand impact and timing
Logs and events to find failing components
Commands to validate a hypothesis
A rollback or mitigation if the system is still unhealthy
4. Automation Mindset
Interviewers look for engineers who reduce manual work. Repeated tasks should turn into scripts, pipelines, modules, templates, dashboards, or runbooks.
5. Communication
A senior answer is structured. It explains:
What I would check first
Why I am checking it
What outcome I expect
How I would mitigate the issue
What I would change long term
A Strong Answer Framework
Use this structure for most technical questions and incident scenarios:
Clarify the symptom and impact.
Check whether there was a recent deployment or configuration change.
Start with user-facing signals: latency, errors, availability, saturation.
Narrow the failing layer: app, container, node, network, database, cloud service, or pipeline.
Run the smallest commands that can confirm or reject a hypothesis.
Stabilize first, then optimize, then document the long-term fix.
Example:
I would first confirm scope and timing in Grafana, then check recent deployment history, then inspect pod events and logs. If the service is actively failing, I would prepare a rollback while I validate whether the issue is config, image, dependency, or resource pressure.
What You Should Know By Topic
Git And Collaboration
Branching, pull requests, merge vs rebase, revert vs reset
How to recover from a bad merge or bad release
Protected branches, code review, signed commits, secret scanning
Linux And Shell
Processes, signals, file permissions, systemd, logs, disk and memory inspection
Safe shell scripting with
set -euo pipefail, exit codes, functions, logging, and cron or timersHow to debug a host that is slow, full, swapping, or refusing connections
Networking
TCP vs UDP, DNS, CIDR, routing, NAT, TLS, load balancing
Connection refused vs timeout
Basic packet and socket inspection with
ss,curl,dig,tcpdump,traceroute, ormtr
Cloud
Compute, storage, networking, IAM/RBAC, autoscaling, HA, backups, DR
Public vs private subnets
Managed service trade-offs versus self-hosted platforms
CI/CD
Build, test, scan, package, publish, deploy, verify, rollback
Artifact immutability and promotion
Secrets handling, approval gates, canary or blue-green, smoke tests
Docker And Kubernetes
Dockerfile best practices, image layers, registries, networking, volumes
Pods, Deployments, StatefulSets, Services, Ingress, probes, HPA
Common failures: CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled, DNS issues
Terraform And Ansible
Providers, modules, state, backends, locking, outputs, drift,
for_eachWhy Terraform provisions and Ansible configures
Safe rollout patterns for infrastructure changes
Observability And SRE
Metrics, logs, traces
Four Golden Signals
SLI, SLO, error budget, alert fatigue
How to turn symptoms into dashboards and actionable alerts
Security
Secrets management, least privilege, image scanning, dependency scanning, SBOMs
Policy as code, signed artifacts, protected environments, admission controls
Must-Know Commands
Git
git statusgit log --oneline --graph --decorategit diffgit revert <commit>git fetch --all --prune
Linux
toporhtopps auxjournalctl -u <service>systemctl status <service>df -hfree -mss -tulpnlsof -i
Networking
curl -vdig <name>nslookup <name>ip addrip routetraceroute <host>ormtr <host>tcpdump -i <iface> port <port>
Docker
docker psdocker logs <container>docker exec -it <container> shdocker inspect <container>docker stats
Kubernetes
kubectl get pods -Akubectl describe pod <name>kubectl logs <pod> --previouskubectl get events --sort-by=.lastTimestampkubectl top podkubectl get svc,ingress,endpointskubectl rollout status deployment/<name>kubectl describe node <name>
Terraform And Ansible
terraform initterraform planterraform showterraform state listterraform force-unlock <lock-id>ansible -m ping allansible-playbook site.yml --check
High-Value Scenarios To Practice
CrashLoopBackOff
Mention:
kubectl describe podkubectl logs --previousconfig and secret validation
entrypoint failure
OOM or bad probes
rollout undo if user impact is active
ImagePullBackOff
Mention:
bad image tag
registry auth or
imagePullSecretsprivate registry availability
node egress or DNS problems
Pending Or FailedScheduling
Mention:
requests and limits
taints and tolerations
node selectors and affinity
PVC zone binding or storage class issues
High Latency Even After Scaling
Mention:
CPU throttling
connection pool exhaustion
downstream dependency limits
DNS latency
kernel or node bottlenecks such as conntrack
Terraform Wants To Recreate A Critical Resource
Mention:
stop before
applyinspect
planand changed attributescheck
ForceNewbehavior and driftreview state, module changes, and imports
use
prevent_destroyfor critical resources
Monitoring Shows No Data
Mention:
target discovery
/metricsreachabilityexporter health
service monitor or scrape config
network policy or label mismatch
Design Trade-Offs You Should Be Ready To Explain
Merge vs rebase
Rolling vs blue-green vs canary deployments
Deployment vs StatefulSet
HPA vs VPA
Terraform module reuse vs environment isolation
L4 vs L7 load balancer
Managed Kubernetes vs self-managed cluster
Prometheus pull model vs push model
Mutable servers vs immutable infrastructure
Strong Signals In Senior Answers
You lead with impact, not guesswork.
You talk about rollback and mitigation before perfect root cause.
You call out trade-offs instead of pretending every tool is always the best choice.
You mention validation after the fix.
You explain how to prevent recurrence with automation, guardrails, or observability.
Common Weak Signals
Jumping straight to one tool or one favorite command
Saying "restart everything" as the first step
Confusing symptoms with root causes
Ignoring rollback, change history, or user impact
Treating monitoring as CPU and memory graphs only
Behavioral Interview Preparation
Prepare short stories for these themes:
A production incident you helped mitigate
A repetitive manual task you automated
A pipeline or deployment you improved
A security or compliance control you introduced
A disagreement you resolved with developers, QA, or security
A time you reduced MTTR, cost, failure rate, or deployment time
Use the STAR format, but keep it technical:
Situation: what was failing or slow
Task: what you owned
Action: what you changed, automated, or investigated
Result: measurable improvement
Final Revision Checklist
I can explain the path of a user request from browser to backend and database.
I can describe how I would debug a broken deployment in Kubernetes.
I can explain Git rollback options without confusing
resetandrevert.I can discuss Terraform state, backends, locking, and drift clearly.
I can explain probes, Services, Ingress, HPA, and common pod failure states.
I can describe CI/CD safety controls such as artifact immutability, approvals, and rollback.
I can explain metrics, logs, traces, SLOs, and alert fatigue.
I can talk through at least two real troubleshooting stories from my own work.
Last updated