5. Linux Operating System

5. Linux Operating System

Linux is the "natural habitat" of a DevOps engineer. These notes cover the transition from a casual user to a system power-user.

1. Installation / Setup

Before a server can host an application, it must be properly initialized.

  • Distro Choice: Most DevOps environments use Ubuntu/Debian (popular for developers) or RHEL/CentOS/Rocky Linux (enterprise standard).

  • Initial Config: Includes setting the hostname, configuring the timezone (timedatectl), and setting up basic networking (ip addr, nmcli).

  • Security First: The first step after installation is usually disabling root SSH login and setting up a firewall (like ufw or firewalld).

2. Command Line Interface (CLI)

The CLI is where speed and automation happen.

  • Navigation: pwd (where am I?), ls -la (show all files), cd (change directory).

  • System Monitoring: top or htop to see CPU/RAM usage in real-time, df -h to check disk space, and free -m for memory status.

  • Process Management: Finding a "stuck" process with ps aux | grep app_name and stopping it with kill -9 <PID>.

3. Filesystem / Storage / User Permissions

Linux treats everything as a file, and who can touch those files is strictly controlled.

  • Filesystem Hierarchy (FHS):

    • /etc: The "Nerve Center" where all configuration files live (e.g., Nginx or SSH configs).

    • /var/log: Where applications write their history—essential for troubleshooting.

    • /bin & /sbin: Where the system's executable tools are stored.

  • Permissions (chmod/chown):

    • rwx: Read (4), Write (2), Execute (1).

    • Structure: Permissions are set for the User, Group, and Others. (e.g., chmod 755 means the owner can do everything, others can only read and execute).

  • Image of Linux directory structureShutterstock

4. Package Management

Instead of downloading .exe files, Linux uses package managers to handle software and their "dependencies."

  • APT (Advanced Package Tool): Used in Ubuntu/Debian. Commands: apt update, apt install nginx.

  • YUM/DNF: Used in RHEL/CentOS. Commands: yum install git, dnf upgrade.

  • Why it matters: In DevOps, we use these commands inside Dockerfiles or Ansible Playbooks to automate software setup across thousands of servers.

5. Virtualization

Virtualization allows you to run multiple "virtual" computers on a single physical machine.

  • Hypervisors:

    • Type 1 (Bare Metal): Runs directly on hardware (e.g., VMware ESXi, KVM).

    • Type 2 (Hosted): Runs on top of an OS (e.g., VirtualBox).

  • VMs vs. Containers: While a VM virtualizes the entire hardware (including a full OS), a container only virtualizes the application layer, making it much faster and lighter.

6. Utilities & Tools (The "Power Tools")

These are the text-processing "Big Four" used to parse logs and automate configuration changes.

  • grep: The "Search" tool. Used to find specific strings in massive log files (e.g., grep "ERROR" /var/log/syslog).

  • find: Used to locate files based on attributes like name, size, or modification date.

  • sed (Stream Editor): Used to find and replace text without opening the file. Essential for automation (e.g., sed -i 's/80/8080/g' config.conf).

  • awk: A powerful language for processing data in columns. If you need to "get the second word of every line in this log," you use awk.

This is Section 5: Linux Operating System. At a mid-to-senior SRE level, Linux is not just about "knowing commands"—it is about understanding how the Kernel manages hardware resources and how User Space applications interact with those resources.

In an interview, you are expected to explain what happens "under the hood" when a system is under pressure.


🔹 1. Improved Notes: System Internals

The Linux Kernel & The Shell

The Kernel is the bridge between software and hardware. It manages CPU scheduling, memory allocation, and device I/O.

  • System Calls (Syscalls): This is how an application asks the Kernel to do something (e.g., open(), read(), write(), fork()). If an app is slow, an SRE looks at syscall latency.

  • Virtual File System (VFS): Linux treats everything as a file (Hardware, Sockets, Processes). These are represented in /proc (process info) and /sys (kernel/hardware info).

Process Management: Life & Death

  • Parent/Child Relationship: Every process (except PID 1) is created by a parent using fork().

  • Zombie Processes: A process that has finished (sent SIGCHLD) but its parent hasn't acknowledged it. They take up space in the Process Table.

  • Orphan Processes: A process whose parent died. systemd (PID 1) "adopts" these and reaps them.

Memory & The OOM Killer

When the system runs out of physical RAM and Swap, the Out of Memory (OOM) Killer is triggered. It uses a scoring system (oom_score) to decide which process to kill to save the OS.


🔹 2. Interview View (Q&A)

Q1: What exactly is "Load Average" in the top command?

  • Answer: It is the average number of processes in a "runnable" or "uninterruptible" state.

    • Runnable: Waiting for CPU.

    • Uninterruptible: Waiting for I/O (Disk or Network).

  • Senior Twist: If Load Average is high but CPU usage is low, the system is bottlenecked on Disk I/O.

Q2: You have deleted a large log file, but df -h shows the disk is still full. Why?

  • Answer: A process still has an open File Descriptor to that file. Linux won't actually free the blocks on the disk until the process closes the file or the process is killed.

  • Follow-up: "How do you find which process is holding it?" -> Use lsof | grep deleted.

Q3: Explain the difference between soft link and hard link.

  • Answer: A Hard Link is a direct pointer to the Inode (the actual data on disk). If you delete the original file, the hard link still works. A Soft Link (Symbolic) is a pointer to the filename. If the original file is moved or deleted, the link breaks.


🔹 3. Architecture & Design: The Boot Process

Understanding how a server comes to life is vital for debugging "unreachable" nodes.

  1. BIOS/UEFI: Hardware initialization.

  2. GRUB: The bootloader that loads the Kernel into memory.

  3. Kernel: Initializes drivers and mounts the root filesystem.

  4. Systemd (PID 1): The mother of all processes. It starts services (Targets) in parallel, significantly faster than the old SysVinit.


🔹 4. Commands & Configs (The "SRE Power Tools")

Advanced Debugging

Bash

System Limits (/etc/security/limits.conf)

For high-traffic apps (databases/proxies), you must increase the number of allowed open files.

Plaintext


🔹 5. Troubleshooting & Debugging

Scenario: A server is "sluggish" and unresponsive to SSH.

  1. Isolate the resource: Use top or htop. Look for %wa (I/O Wait). If high, the disk is the bottleneck.

  2. Check Memory: Use free -m. Is Swap being used? If the system is "thrashing" (constantly moving data between RAM and Swap), performance will tank.

  3. Check Dmesg: Run dmesg -T | tail -n 50. Look for "Out of memory", "TCP segments dropped", or "Hardware Error".

  4. Check Zombie PIDs: Use ps aux | grep 'Z'. A massive amount of zombies indicates a buggy application failing to close child processes.


🔹 6. Production Best Practices

  • No SSH in Production: Use Configuration Management (Ansible) or Systems Manager (AWS SSM). Manual changes are "Snowflake Servers."

  • Log Rotation: Always configure logrotate. A common "SRE 1:00 AM incident" is a disk filling up because an app was logging in DEBUG mode.

  • Kernel Hardening: Disable unused services and use SELinux or AppArmor to restrict what processes can do, even if they are breached.

  • Anti-Pattern: Running a process as root. If the app is compromised, the attacker has full control of the OS. Always use a dedicated service user.


🔹 Cheat Sheet / Quick Revision

Command

Purpose

SRE Context

top / htop

Process monitor

Check CPU/Load/Memory.

iostat

Disk I/O stats

Identify disk bottlenecks.

netstat / ss

Network stats

Check for socket exhaustion.

journalctl -u <svc>

View Logs

Standard way to debug systemd services.

lsof

List open files

Find which process is using a port or file.

nice / renice

Process priority

Give critical processes more CPU priority.


This is Section 5: Linux Operating System. For a senior-level role, Linux knowledge isn't about knowing how to list files; it's about understanding how the OS handles resource contention, kernel-level operations, and system performance.


🟢 Easy: Basic Operations & File Systems

Focus: Day-to-day command usage and basic permissions.

  1. What is the difference between chmod and chown?

    • Context: Focus on "Permissions" vs. "Ownership."

  2. How do you find which process is using a specific port (e.g., port 8080)?

    • Context: The interviewer is looking for netstat -tulpn or ss -tulpn.

  3. What is the purpose of the /etc/hosts file?

    • Context: Local DNS resolution before hitting external DNS servers.

  4. Explain the basic Linux file permissions (rwx) and what the numbers 755 or 644 mean.

    • Context: Understanding User, Group, and Others bitmasks.


🟡 Medium: System Internals & Process Management

Focus: How the OS manages tasks and resources.

  1. Explain the three numbers in the "Load Average" output of the uptime command.

    • Context: They represent the 1, 5, and 15-minute averages. Mention that this includes processes waiting for CPU and processes in uninterruptible sleep (I/O wait).

  2. What is a "Zombie" process and how do you "kill" it?

    • Context: A zombie is already dead; you can't kill it. You must kill the parent process so the zombie is "reaped" by init (PID 1).

  3. What is an Inode? What happens if a system runs out of Inodes but still has disk space?

    • Context: An Inode stores metadata about a file. If you run out, you cannot create new files, even if you have 100GB of space left. This happens with many small files (e.g., session files or tiny logs).

  4. Describe the difference between a Hard Link and a Soft (Symbolic) Link.

    • Context: A Hard link points to the Inode; a Soft link points to the file path.


🔴 Hard: SRE-Level Debugging & Kernel Concepts

Focus: High-pressure troubleshooting and kernel-space behavior.

  1. Scenario: You deleted a 10GB log file to free up space, but df -h still shows the disk is 100% full. Why, and how do you fix it without a reboot?

    • Context: A process still has an open File Descriptor to that file. The space isn't freed until the process closes the file. Use lsof | grep deleted to find the PID and restart that service.

  2. What is the OOM (Out of Memory) Killer? How does it decide which process to kill first?

    • Context: Mention the oom_score. Factors include memory usage, process age, and oom_score_adj. SREs often protect critical processes (like sshd or databases) by lowering their score.

  3. How would you use strace to debug a running application that is hanging or "stuck"?

    • Context: strace intercepts System Calls. You can see if the app is stuck on a read() (waiting for network) or an open() (waiting for a file lock).

  4. Explain the difference between User Space and Kernel Space.

    • Context: Focus on security and stability. Applications run in User Space and must use Syscalls to request the Kernel to perform hardware tasks (Disk, Network, RAM).


💡 Pro-Tip for your Interview

When answering Linux questions, mention "Everything is a file."

  • The SRE Answer: "In Linux, everything is a file—from the hardware and network sockets to the processes listed in /proc. If I see a 'Too many open files' error, I know I'm hitting the ulimit or the kernel-wide file descriptor limit, which I can tune in sysctl.conf."


Ready for Section 6 (Containers & Orchestration) questions?

Last updated