4. Cloud Services
The Cloud Services section of the roadmap focuses on moving away from physical hardware to "On-Demand" infrastructure. As a DevOps engineer, your goal is to manage these resources efficiently using major providers like AWS, Azure, or GCP.
This section covers the essential pillars of cloud infrastructure that allow applications to run globally with high availability.
1. Compute Servers
Compute is the "brain" where your application logic runs. You must learn how to manage virtual machines and instances across providers.
Provider Equivalents:
AWS: EC2 (Elastic Compute Cloud)
Azure: Virtual Machines
GCP: Compute Engine
Key Concepts:
Instance Types: Choosing the right CPU/RAM balance for your workload.
Auto-scaling: Automatically adding or removing servers based on traffic.
Serverless Compute: Running code without managing any servers (e.g., AWS Lambda or Google Cloud Functions).
2. Database Servers
DevOps engineers must decide between managing the database themselves or using a provider's service.
Self-hosted: You install a database (like MySQL) on a Compute Instance. You are responsible for backups, patching, and scaling.
Managed Services: The cloud provider handles the "heavy lifting."
Relational (SQL): AWS RDS, Azure SQL, Google Cloud SQL.
NoSQL: AWS DynamoDB, Azure Cosmos DB, Google Cloud Firestore.
Benefits: Managed services offer automated backups, high availability (multi-AZ), and easy vertical/horizontal scaling.
3. VPCs & Networking
A Virtual Private Cloud (VPC) is your own private corner of the cloud. It provides network isolation and security.
Subnets: Dividing your VPC into smaller segments.
Public Subnets: For resources that need to be accessed from the internet (e.g., Web Servers).
Private Subnets: For sensitive data that should never be exposed to the public internet (e.g., Databases).
Gateways:
Internet Gateway (IGW): Allows your public subnet to talk to the world.
NAT Gateway: Allows resources in a private subnet to download updates from the internet without being exposed to incoming attacks.
4. Managed Services
Beyond basic compute and storage, cloud providers offer "Platform as a Service" (PaaS) tools that speed up development.
Storage: Object storage for files and assets (e.g., AWS S3, Azure Blob Storage).
Messaging: Decoupling services using queues (e.g., AWS SQS) or notifications (e.g., AWS SNS).
Container Orchestration: Managed environments for Docker and Kubernetes (e.g., AWS EKS, GCP GKE).
5. IAM / RBAC
Security is the most critical part of the cloud. Identity and Access Management (IAM) ensures that only the right people/services have access to specific resources.
Users & Groups: Managing human identities.
Roles: Temporary permissions granted to services (e.g., giving an EC2 server permission to write to an S3 bucket).
RBAC (Role-Based Access Control): A method where permissions are assigned to "Roles" (like Admin, Developer, or Viewer) rather than individual users.
Least Privilege Principle: The golden rule of DevOps—never give a user more permission than they absolutely need to do their job.
This is Section 4: Cloud Services. For a mid-to-senior SRE/DevOps role, "knowing the cloud" is no longer about knowing which button to click in the console. It is about Cloud-Native Architecture, Identity Engineering, and FinOps (Cost Optimization).
At this level, you are expected to treat the Cloud as a programmable resource, managing its limits, costs, and security at scale.
🔹 1. Improved Notes: Engineering the Cloud
Identity & Access Management (IAM): The Security Perimeter
The Principle of Least Privilege (PoLP): Senior SREs do not use AdministratorAccess. They use IAM Permission Boundaries and Service Control Policies (SCPs) to set a "maximum permission ceiling" for accounts.
IAM Roles for Service Accounts (IRSA): In Kubernetes (EKS), you never give a Worker Node full S3 access. Instead, you map a specific IAM Role to a Kubernetes Service Account. This ensures only the specific pod that needs S3 access gets it.
Temporary Credentials: Move away from Long-lived IAM Users. Use AWS Identity Center (SSO) or OIDC federation to issue short-lived tokens.
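To make IRSA concrete: the wiring is a Kubernetes ServiceAccount annotated with the IAM Role's ARN, which the EKS OIDC provider uses to issue short-lived credentials to pods that use that ServiceAccount. A minimal sketch (the name `s3-reader`, the account ID, and the role name are hypothetical placeholders):

```yaml
# Pods running under this ServiceAccount receive short-lived
# credentials for the annotated IAM Role via the EKS OIDC provider.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader          # hypothetical name
  namespace: default
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/s3-reader-role  # placeholder ARN
```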
Compute: Beyond the Virtual Machine
Auto Scaling Groups (ASG): Understand Predictive Scaling (using ML to scale before the spike) vs. Dynamic Scaling (reacting to metrics like CPU).
Spot Instances: A senior engineer saves the company 70–90% on costs by using Spot Instances for stateless workloads (e.g., CI/CD runners, Batch processing) with Spot Fleet and graceful termination handling.
Serverless (Lambda): Understand the "Cold Start" problem and how Provisioned Concurrency or choosing languages like Go/Rust can mitigate it.
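Provisioned Concurrency can be enabled from the CLI; a sketch, assuming a hypothetical function `checkout-api` with a `prod` alias:

```shell
# Keep 5 execution environments warm to avoid cold starts
aws lambda put-provisioned-concurrency-config \
  --function-name checkout-api \
  --qualifier prod \
  --provisioned-concurrent-executions 5
```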
Storage & Databases: Data Gravity
S3 Strong Consistency: Amazon S3 now offers strong read-after-write consistency for all applications. You no longer need to worry about "eventual consistency" when listing files immediately after a write.
RDS & Aurora: Aurora is "Cloud Native" storage that scales IOPS independently of compute. SREs care about Failover Time and Read Replicas for global performance.
🔹 2. Interview View (Q&A)
Q1: What is the "Shared Responsibility Model" and how does it change with Lambda vs. EC2?
Answer: In EC2, the customer is responsible for OS patching, firewall (SGs), and data encryption. In Lambda (Serverless), the Cloud Provider handles the OS and underlying runtime; the customer is only responsible for the code logic and IAM permissions.
Follow-up: "If a hacker breaches your Lambda code, is that Amazon's fault or yours?" -> Yours (Application layer security).
Q2: How do you differentiate between a Security Group and a NACL?
Answer:
Security Groups: Stateful (if you allow port 80 in, the return traffic is automatically allowed). Operates at the Instance level.
NACLs (Network ACLs): Stateless (you must explicitly allow return traffic). Operates at the Subnet level. Acts as a second layer of defense.
Q3: We have an application that is extremely latency-sensitive. Should we use Multi-AZ or Multi-Region?
Answer: Multi-AZ. Within a region, the latency between zones is sub-millisecond. Multi-Region introduces significant "speed of light" latency (e.g., 60ms+ between US-East and US-West). Multi-region is for Disaster Recovery (DR), not latency.
🔹 3. Architecture & Design: Resilience at Scale
The Trade-off: RTO vs. RPO
RTO (Recovery Time Objective): How quickly can you get back up?
RPO (Recovery Point Objective): How much data are you willing to lose?
SRE Design: For a "Tier 1" service, we aim for an RTO/RPO near zero by using Active-Active Multi-Region deployment, but this significantly increases cost and architectural complexity (e.g., handling global database replication).
Failure Scenario: "The Thundering Herd"
If your entire cluster restarts simultaneously, they all request secrets from Vault or hit the DB at once.
The Fix: Implement Exponential Backoff with Jitter in your application code and use Connection Pooling.
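The fix above can be sketched even at the shell level. A minimal example of retry with exponential backoff and full jitter (the `retry_with_backoff` helper name is my own, not a standard tool):

```shell
#!/usr/bin/env bash
# Retry any command with exponential backoff and full jitter.
# Usage: retry_with_backoff <command> [args...]
retry_with_backoff() {
  local max_attempts=5 attempt=0 base=1
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry_with_backoff: giving up after $attempt attempts" >&2
      return 1
    fi
    # Full jitter: sleep a random duration in [0, base * 2^attempt)
    # so restarting clients spread out instead of stampeding together.
    local cap=$(( base * (2 ** attempt) ))
    sleep $(( RANDOM % cap ))
  done
}
```

The jitter is the important part: plain exponential backoff keeps the herd synchronized; randomizing the sleep spreads the retries out.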
🔹 4. Commands & Configs (The Cloud CLI)
IAM Policy: The "Deny-by-Default" Standard
This policy allows a user to manage S3, but only if they have MFA enabled.
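A minimal sketch of such a policy, gating `s3:*` on the standard `aws:MultiFactorAuthPresent` condition key (IAM is deny-by-default, so anything not matched here is refused):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3OnlyWithMFA",
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "*",
      "Condition": {
        "Bool": { "aws:MultiFactorAuthPresent": "true" }
      }
    }
  ]
}
```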
FinOps: Finding Waste
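A common starting point is hunting for unattached ("zombie") EBS volumes, which keep billing even when no instance uses them. A sketch using the AWS CLI (the `--query` projection is illustrative):

```shell
# List unattached EBS volumes (status "available" = not mounted anywhere)
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[].{ID:VolumeId,SizeGiB:Size,AZ:AvailabilityZone}' \
  --output table
```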
🔹 5. Troubleshooting & Debugging
Scenario: A developer says their Pod can't upload to S3, even though the IAM policy is correct.
Check Trust Relationship: Does the IAM Role trust the EKS OIDC provider?
Check Service Account: Is the Kubernetes Service Account annotated with the Role ARN?
Check VPC Endpoints: If the Pod is in a private subnet, is there an S3 Gateway Endpoint configured? (Without it, the traffic can't reach S3 over the private network).
Check S3 Bucket Policy: Is there a bucket-level policy that explicitly denies access from outside a specific VPC?
🔹 6. Production Best Practices
Infrastructure as Code (IaC): 100% of cloud resources must be in Terraform/CloudFormation. Manual changes (ClickOps) lead to Configuration Drift.
Tagging is Mandatory: Every resource must have Environment, Owner, and CostCenter tags. Without this, you cannot track spending.
Cloud Agnostic vs. Cloud Native: Avoid "Cloud Agnostic" (using only generic features) unless you have a massive budget. Use "Cloud Native" (managed services like RDS/SQS) to reduce your Operational Toil.
Anti-Pattern: Running your own Databases on EC2. Unless you have a specific performance tuning need, use managed services like RDS to offload the "Undifferentiated Heavy Lifting."
🔹 Cheat Sheet / Quick Revision
| Service Category | AWS Equivalent | Key SRE Metric |
| --- | --- | --- |
| Identity | IAM | Least Privilege, IRSA |
| Compute | EC2 / Lambda | CPU Utilization, Cold Start |
| Database | RDS / Aurora | Failover Time, IOPS |
| Content Delivery | CloudFront | Cache Hit Ratio |
| Reliability | Route53 | Health Check Success |
This is Section 4: Cloud Services. For an SRE or Cloud Engineer, the interview isn't about memorizing the AWS/Azure console; it’s about Resource Management, Cost Efficiency (FinOps), and Architecture Resilience.
🟢 Easy: Core Concepts & Service Knowledge
Focus: Understanding what the cloud provides and the shared responsibility model.
Explain the "Shared Responsibility Model" in the cloud.
Context: Who is responsible for patching the OS in an EC2 instance versus a Lambda function?
What is the difference between S3 (Object Storage) and EBS (Block Storage)?
Context: When would you use one over the other? (e.g., Static files vs. a Database root volume).
What is an IAM Role, and how does it differ from an IAM User?
Context: Focus on the security benefit of temporary credentials (roles) versus long-lived credentials (users).
Define "Region" and "Availability Zone (AZ)."
Context: How does a region differ from a single data center?
🟡 Medium: Networking & Scalability
Focus: How resources connect and handle traffic.
What is a VPC, and what are its primary components?
Context: Mention Subnets, Route Tables, Internet Gateways, and Security Groups.
Explain the difference between Vertical Scaling and Horizontal Scaling.
Context: Which one is preferred in a cloud-native environment and why? (Mention Auto Scaling Groups).
How do Security Groups (SGs) and Network ACLs (NACLs) work together?
Context: Discuss Stateful (SGs) vs. Stateless (NACLs) and their layers of defense.
What is a "Managed Database" (like RDS), and what are the benefits over running a DB on an EC2 instance?
Context: Focus on automated backups, patching, and Multi-AZ failover.
🔴 Hard: Architecture, Security & Cost
Focus: Senior-level design decisions and troubleshooting.
Scenario: Your application in a private subnet needs to access an S3 bucket. You want to avoid the traffic going over the public internet. How do you achieve this?
Context: The interviewer is looking for VPC Endpoints (Gateway vs. Interface).
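A sketch of the Gateway Endpoint answer using the AWS CLI (the VPC and route-table IDs below are hypothetical placeholders):

```shell
# Create an S3 Gateway Endpoint so private-subnet traffic to S3
# stays on the AWS network instead of traversing a NAT/IGW.
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc1234 \
  --service-name com.amazonaws.us-east-1.s3 \
  --vpc-endpoint-type Gateway \
  --route-table-ids rtb-0def5678
```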
What are RTO (Recovery Time Objective) and RPO (Recovery Point Objective)? How do they influence your Disaster Recovery (DR) strategy?
Context: Compare "Backup & Restore" vs. "Pilot Light" vs. "Multi-Site Active-Active" strategies.
Explain "Identity Federation" or OIDC in the context of the cloud.
Context: How can you allow GitHub Actions or an on-premise Active Directory to deploy resources without using static access keys?
How do you approach "FinOps" or Cloud Cost Optimization for a large-scale cluster?
Context: Mention Spot Instances, Reserved Instances/Savings Plans, lifecycle policies for S3, and identifying "zombie" resources (unattached EBS volumes, idle LBs).
💡 Pro-Tip for your Interview
When asked about cloud design, always mention "The Well-Architected Framework."
The SRE Answer: "When I design cloud infrastructure, I follow the Well-Architected pillars: Operational Excellence, Security, Reliability, Performance Efficiency, and Cost Optimization (AWS has since added Sustainability as a sixth pillar). For example, to ensure Reliability, I always distribute my compute across at least two Availability Zones."