07 Interview Preparation

Use this file as the main answer framework for DevOps, SRE, platform, and cloud-operations interviews.

What Interviewers Are Really Testing

1. Fundamentals

You should be able to explain how Linux, networking, cloud, CI/CD, containers, and observability fit together. Interviewers usually care less about memorizing a command and more about whether you know when and why to use it.

2. Systems Thinking

A strong DevOps engineer thinks in dependencies and blast radius:

What changed recently?
Which component is actually failing?
Which downstream system is causing the symptom?
What is the fastest safe mitigation?

3. Troubleshooting Depth

Good candidates do not jump to a favorite root cause. They narrow the problem with evidence:

Metrics to understand impact and timing
Logs and events to find failing components
Commands to validate a hypothesis
A rollback or mitigation if the system is still unhealthy

4. Automation Mindset

Interviewers look for engineers who reduce manual work. Repeated tasks should turn into scripts, pipelines, modules, templates, dashboards, or runbooks.

5. Communication

A senior answer is structured. It explains:

What I would check first
Why I am checking it
What outcome I expect
How I would mitigate the issue
What I would change long term

A Strong Answer Framework

Use this structure for most technical questions and incident scenarios:

Clarify the symptom and impact.
Check whether there was a recent deployment or configuration change.
Start with user-facing signals: latency, errors, availability, saturation.
Narrow the failing layer: app, container, node, network, database, cloud service, or pipeline.
Run the smallest commands that can confirm or reject a hypothesis.
Stabilize first, then optimize, then document the long-term fix.

Example:

I would first confirm scope and timing in Grafana, then check recent deployment history, then inspect pod events and logs. If the service is actively failing, I would prepare a rollback while I validate whether the issue is config, image, dependency, or resource pressure.

What You Should Know By Topic

Git And Collaboration

Branching, pull requests, merge vs rebase, revert vs reset
How to recover from a bad merge or bad release
Protected branches, code review, signed commits, secret scanning

Linux And Shell

Processes, signals, file permissions, systemd, logs, disk and memory inspection
Safe shell scripting with set -euo pipefail, exit codes, functions, logging, and cron or timers
How to debug a host that is slow, full, swapping, or refusing connections

Networking

TCP vs UDP, DNS, CIDR, routing, NAT, TLS, load balancing
Connection refused vs timeout
Basic packet and socket inspection with ss, curl, dig, tcpdump, traceroute, or mtr

Cloud

Compute, storage, networking, IAM/RBAC, autoscaling, HA, backups, DR
Public vs private subnets
Managed service trade-offs versus self-hosted platforms

CI/CD

Build, test, scan, package, publish, deploy, verify, rollback
Artifact immutability and promotion
Secrets handling, approval gates, canary or blue-green, smoke tests

Docker And Kubernetes

Dockerfile best practices, image layers, registries, networking, volumes
Pods, Deployments, StatefulSets, Services, Ingress, probes, HPA
Common failures: CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled, DNS issues

Terraform And Ansible

Providers, modules, state, backends, locking, outputs, drift, for_each
Why Terraform provisions and Ansible configures
Safe rollout patterns for infrastructure changes

Observability And SRE

Metrics, logs, traces
Four Golden Signals
SLI, SLO, error budget, alert fatigue
How to turn symptoms into dashboards and actionable alerts

Security

Secrets management, least privilege, image scanning, dependency scanning, SBOMs
Policy as code, signed artifacts, protected environments, admission controls

Must-Know Commands

Git

git status
git log --oneline --graph --decorate
git diff
git revert <commit>
git fetch --all --prune

Linux

top or htop
ps aux
journalctl -u <service>
systemctl status <service>
df -h
free -m
ss -tulpn
lsof -i

Networking

curl -v
dig <name>
nslookup <name>
ip addr
ip route
traceroute <host> or mtr <host>
tcpdump -i <iface> port <port>

Docker

docker ps
docker logs <container>
docker exec -it <container> sh
docker inspect <container>
docker stats

Kubernetes

kubectl get pods -A
kubectl describe pod <name>
kubectl logs <pod> --previous
kubectl get events --sort-by=.lastTimestamp
kubectl top pod
kubectl get svc,ingress,endpoints
kubectl rollout status deployment/<name>
kubectl describe node <name>

Terraform And Ansible

terraform init
terraform plan
terraform show
terraform state list
terraform force-unlock <lock-id>
ansible -m ping all
ansible-playbook site.yml --check

High-Value Scenarios To Practice

CrashLoopBackOff

Mention:

kubectl describe pod
kubectl logs --previous
config and secret validation
entrypoint failure
OOM or bad probes
rollout undo if user impact is active

ImagePullBackOff

Mention:

bad image tag
registry auth or imagePullSecrets
private registry availability
node egress or DNS problems

Pending Or FailedScheduling

Mention:

requests and limits
taints and tolerations
node selectors and affinity
PVC zone binding or storage class issues

High Latency Even After Scaling

Mention:

CPU throttling
connection pool exhaustion
downstream dependency limits
DNS latency
kernel or node bottlenecks such as conntrack

Terraform Wants To Recreate A Critical Resource

Mention:

stop before apply
inspect plan and changed attributes
check ForceNew behavior and drift
review state, module changes, and imports
use prevent_destroy for critical resources

Monitoring Shows No Data

Mention:

target discovery
/metrics reachability
exporter health
service monitor or scrape config
network policy or label mismatch

Design Trade-Offs You Should Be Ready To Explain

Merge vs rebase
Rolling vs blue-green vs canary deployments
Deployment vs StatefulSet
HPA vs VPA
Terraform module reuse vs environment isolation
L4 vs L7 load balancer
Managed Kubernetes vs self-managed cluster
Prometheus pull model vs push model
Mutable servers vs immutable infrastructure

Strong Signals In Senior Answers

You lead with impact, not guesswork.
You talk about rollback and mitigation before perfect root cause.
You call out trade-offs instead of pretending every tool is always the best choice.
You mention validation after the fix.
You explain how to prevent recurrence with automation, guardrails, or observability.

Common Weak Signals

Jumping straight to one tool or one favorite command
Saying "restart everything" as the first step
Confusing symptoms with root causes
Ignoring rollback, change history, or user impact
Treating monitoring as CPU and memory graphs only

Behavioral Interview Preparation

Prepare short stories for these themes:

A production incident you helped mitigate
A repetitive manual task you automated
A pipeline or deployment you improved
A security or compliance control you introduced
A disagreement you resolved with developers, QA, or security
A time you reduced MTTR, cost, failure rate, or deployment time

Use the STAR format, but keep it technical:

Situation: what was failing or slow
Task: what you owned
Action: what you changed, automated, or investigated
Result: measurable improvement

Final Revision Checklist

I can explain the path of a user request from browser to backend and database.
I can describe how I would debug a broken deployment in Kubernetes.
I can explain Git rollback options without confusing reset and revert.
I can discuss Terraform state, backends, locking, and drift clearly.
I can explain probes, Services, Ingress, HPA, and common pod failure states.
I can describe CI/CD safety controls such as artifact immutability, approvals, and rollback.
I can explain metrics, logs, traces, SLOs, and alert fatigue.
I can talk through at least two real troubleshooting stories from my own work.

PreviousReal-World Test, Career, and Community NextMLOps Interview Playbook

Last updated 7 days ago

hashtagWhat Interviewers Are Really Testing

hashtag1. Fundamentals

hashtag2. Systems Thinking

hashtag3. Troubleshooting Depth

hashtag4. Automation Mindset

hashtag5. Communication

hashtagA Strong Answer Framework

hashtagWhat You Should Know By Topic

hashtagGit And Collaboration

hashtagLinux And Shell

hashtagNetworking

hashtagCloud

hashtagCI/CD

hashtagDocker And Kubernetes

hashtagTerraform And Ansible

hashtagObservability And SRE

hashtagSecurity

hashtagMust-Know Commands

hashtagGit

hashtagLinux

hashtagNetworking

hashtagDocker

hashtagKubernetes

hashtagTerraform And Ansible

hashtagHigh-Value Scenarios To Practice

hashtagCrashLoopBackOff

hashtagImagePullBackOff

hashtagPending Or FailedScheduling

hashtagHigh Latency Even After Scaling

hashtagTerraform Wants To Recreate A Critical Resource

hashtagMonitoring Shows No Data

hashtagDesign Trade-Offs You Should Be Ready To Explain

hashtagStrong Signals In Senior Answers

hashtagCommon Weak Signals

hashtagBehavioral Interview Preparation

hashtagFinal Revision Checklist

What Interviewers Are Really Testing

1. Fundamentals

2. Systems Thinking

3. Troubleshooting Depth

4. Automation Mindset

5. Communication

A Strong Answer Framework

What You Should Know By Topic

Git And Collaboration

Linux And Shell

Networking

Cloud

CI/CD

Docker And Kubernetes

Terraform And Ansible

Observability And SRE

Security

Must-Know Commands

Git

Linux

Networking

Docker

Kubernetes

Terraform And Ansible

High-Value Scenarios To Practice

CrashLoopBackOff

ImagePullBackOff

Pending Or FailedScheduling

High Latency Even After Scaling

Terraform Wants To Recreate A Critical Resource

Monitoring Shows No Data

Design Trade-Offs You Should Be Ready To Explain

Strong Signals In Senior Answers

Common Weak Signals

Behavioral Interview Preparation

Final Revision Checklist