Linux & Bash Complete Guide — LearnwithVishnu

Linux & Bash

Beginner · Engineer · Production · Architect

Command line mastery: every DevOps engineer's foundation

🐧 Why Linux for DevOps?

Nearly every server, container, and Kubernetes node you will work with runs Linux. When something breaks at 2am, you debug it in a Linux terminal. When you write automation scripts, you use bash. When you tune performance, you read Linux metrics. Linux command line proficiency is non-negotiable for any DevOps role.

Linux in the DevOps World

Where you encounter Linux	What you need to know
Production servers (AWS EC2, Azure VM)	Files, processes, networking, services
Docker containers	Alpine/Debian/Ubuntu base images, shell debugging
Kubernetes nodes	kubelet runs on Linux; debug over SSH or with kubectl debug node/NODE_NAME
CI/CD pipelines (GitHub Actions, Jenkins)	Pipeline steps run bash on Ubuntu runners
Ansible playbooks	SSH into Linux servers, execute Linux commands
Terraform remote-exec	Run shell scripts on provisioned Linux VMs

Linux navigation and distributions
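
A minimal sketch of the "which distribution am I on?" check, assuming the host ships an os-release file (most modern distributions do). The helper name distro_id is hypothetical and not part of any standard tool.

```shell
#!/usr/bin/env bash
# distro_id [FILE]: print the ID field (ubuntu, debian, alpine, ...) from an
# os-release file. FILE defaults to /etc/os-release. Hypothetical helper name.
distro_id() {
  local file="${1:-/etc/os-release}"
  [ -r "$file" ] || { echo "unknown"; return 0; }
  # os-release is a list of shell-compatible KEY=value lines, so it can be
  # sourced. A subshell keeps its variables out of the current environment.
  ( . "$file" && printf '%s\n' "${ID:-unknown}" )
}

# Basic orientation on an unfamiliar box
pwd          # where am I?
ls -la /     # what is at the top of the filesystem?
distro_id    # which distribution is this?
```

Knowing the distribution tells you which package manager to reach for: apt on Debian/Ubuntu, apk on Alpine, dnf or yum on the Red Hat family.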

⚙️ Processes & Services

What is a Process?

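A process is a running program with its own PID (Process ID), memory space, and file handles. When you start nginx, the OS creates a process. When nginx spawns worker processes, each gets its own PID. Every process has a parent. If the parent exits first, the orphaned child is adopted by init (PID 1, usually systemd). If a child exits and its parent never calls wait() to collect the exit status, the child lingers as a zombie.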

Process States

State	What it means
`R` Running	Actively using CPU or ready to use CPU
`S` Sleeping	Waiting for something (I/O, timer, signal) — normal
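`D` Uninterruptible sleep	Blocked on I/O (usually disk or NFS) and cannot be interrupted by signals; many D-state processes usually point to slow or hung storage
`Z` Zombie	Process has exited but its parent has not yet collected the exit status; harmless in small numbers, a steady build-up means the parent is buggy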
`T` Stopped	Paused (Ctrl+Z in terminal)

Process management commands
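
One way to get a quick census of the states in the table above is to count the first letter of each STAT value. This is a sketch under the assumption that procps ps is installed; count_states is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# count_states: read one STAT value per line on stdin (as printed by
# `ps -eo stat=`) and print "STATE COUNT" per state letter.
# Only the first character is used, because ps appends modifiers such as
# "s" (session leader) or "+" (foreground process group).
count_states() {
  awk 'NF { c[substr($1, 1, 1)]++ } END { for (s in c) print s, c[s] }' | sort
}

# Usage on a live system:
#   ps -eo stat= | count_states
# List zombies with their parent PID, since the parent is what needs fixing:
#   ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
```

A handful of S and R entries is normal. A growing Z count or a cluster of D entries is the cue to look at the parent process or the storage layer.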

📊 Performance Troubleshooting

The USE Method — Systematic Performance Analysis

For every resource (CPU, memory, disk, network): check Utilisation (how busy is it?), Saturation (is there a queue forming?), Errors (are there failures?). Don't randomly check things — follow this framework every time.

Performance troubleshooting — full flow
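
As a hedged illustration of the USE checklist, the sketch below computes memory utilisation from /proc/meminfo and lists one common command per resource in comments. The helper name mem_used_pct is hypothetical; the commands in the comments assume procps, sysstat, and iproute2 are installed.

```shell
#!/usr/bin/env bash
# mem_used_pct: read /proc/meminfo-format text on stdin and print the
# percentage of memory in use, based on MemTotal and MemAvailable.
mem_used_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }'
}

# Usage on a live system:
#   mem_used_pct < /proc/meminfo
#
# One starting point per resource (U = utilisation, S = saturation, E = errors):
#   CPU      U: top              S: vmstat 1 (r column)       E: dmesg
#   Memory   U: free -h          S: vmstat 1 (si/so columns)  E: dmesg (OOM killer)
#   Disk     U: iostat -x 1      S: iostat -x 1 (queue size)  E: dmesg
#   Network  U: ip -s link       S: ss -s                     E: ip -s link (errors, dropped)
```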

📁 Files, Permissions & Text Processing

Understanding Linux Permissions

Every file has three permission sets: owner, group, others. Each set has read (r=4), write (w=2), execute (x=1). The number 755 means: owner=7(rwx), group=5(r-x), others=5(r-x).

Permission	Octal	Use case
rwxr-xr-x	755	Executables, directories with public access
rw-r--r--	644	Regular config files, public readable
rw-------	600	SSH private keys, sensitive credentials
rwx------	700	Directories with sensitive content
rwxrwxrwx	777	NEVER use in production — anyone can modify!
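
The octal arithmetic in the table can be checked mechanically. The sketch below converts a nine-character permission string to its octal form; perm_to_octal is a hypothetical helper written for illustration, not a standard command.

```shell
#!/usr/bin/env bash
# perm_to_octal PERMS: convert a string like rwxr-xr-x to 755.
# Each group of three characters becomes one digit: r=4, w=2, x=1.
perm_to_octal() {
  local perms="$1" octal="" i chunk digit
  for i in 0 3 6; do
    chunk="${perms:i:3}"
    digit=0
    [[ "${chunk:0:1}" == "r" ]] && digit=$((digit + 4))
    [[ "${chunk:1:1}" == "w" ]] && digit=$((digit + 2))
    [[ "${chunk:2:1}" == "x" ]] && digit=$((digit + 1))
    octal+="$digit"
  done
  printf '%s\n' "$octal"
}

perm_to_octal rwxr-xr-x   # prints 755
perm_to_octal rw-------   # prints 600
```

On a real file, stat -c '%a %A' FILE (GNU stat) prints both forms side by side.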

File operations + text processing
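
A typical text-processing pipeline, sketched under the assumption that the log uses the common/combined access-log layout, where the HTTP status code is the ninth whitespace-separated field. The helper name status_counts and the log path in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# status_counts: read access-log lines on stdin and print "STATUS COUNT",
# most frequent status first.
status_counts() {
  awk '{ print $9 }' | sort | uniq -c | sort -rn | awk '{ print $2, $1 }'
}

# Usage (the path is an assumption; adjust to your server):
#   status_counts < /var/log/nginx/access.log
#
# Related file operations:
#   find /var/log -type f -size +100M      # files larger than 100 MB
#   grep -rn "ERROR" /var/log/app/         # search recursively with line numbers
```

The awk, sort, uniq -c, sort -rn chain is worth memorising: it answers "what are the most common values of field N?" for almost any log.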

🌐 Networking Commands

Network Troubleshooting Mindset

Work through the OSI layers from bottom up: Physical → Network (ping, ip route) → Transport (ss, netstat, nc) → Application (curl, wget, nslookup). Most DevOps networking problems are at layers 3-7.

Complete networking commands

🖥️ Bash Scripting — Production Standard

Why Bash Matters

CI/CD pipeline steps are bash. Deployment scripts are bash. Cron jobs are bash. The difference between a good bash script and a dangerous one is error handling. A script that silently continues after an error can delete production data.

Critical rules: Always use set -euo pipefail. Always use logging functions. Always trap cleanup on exit. Never use rm -rf with an unquoted variable.

Production bash script template
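
The four critical rules above can be combined into a small skeleton. This is one possible sketch, not a canonical template: the function names (log, info, error, cleanup, main), the WORK_DIR variable, and the artifact file are all illustrative assumptions.

```shell
#!/usr/bin/env bash
# Rule 1: fail on errors, on unset variables, and on failures inside pipelines.
set -euo pipefail

# Rule 2: logging functions. Timestamps in UTC, written to stderr so that
# stdout stays clean for data.
log()   { printf '%s [%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "${*:2}" >&2; }
info()  { log INFO "$@"; }
error() { log ERROR "$@"; }

# Scratch space that must never outlive the script.
WORK_DIR="$(mktemp -d)"

# Rule 3: cleanup runs on every exit path, success or failure.
cleanup() {
  local rc=$?
  # Rule 4: the variable is quoted, and :? aborts if it is ever empty,
  # so this can never expand to a bare "rm -rf".
  rm -rf -- "${WORK_DIR:?}"
  if (( rc != 0 )); then
    error "exited with status $rc"
  fi
}
trap cleanup EXIT

main() {
  info "working in $WORK_DIR"
  echo "payload" > "$WORK_DIR/artifact.txt"   # placeholder for real work
  info "done"
}

main "$@"
```

Putting the work in main keeps the top of the file readable and makes it obvious where arguments enter the script.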

🔍 Troubleshooting — Scenarios

These scenarios come up often in senior DevOps interviews. Know them cold.

Server troubleshooting — complete playbook

🎯 Interview Questions

LINUX · ENGINEER

Server is at high CPU. Walk through how you find the cause.

Start broad, then narrow. First: uptime to see the load average and compare it to the number of CPUs (nproc). If load is 2× the number of CPUs, something is wrong. Load on Linux also counts processes in uninterruptible D state, so confirm with top that the CPU itself is busy and not just waiting on I/O. Then: ps aux --sort=-%cpu | head to find the top consumer. Note the PID and process name. Check how long it has been running with ps -o pid,etime,cmd -p PID. If it is a known service (nginx, java): check its logs with journalctl -u nginx --since '30 min ago'. If it is a runaway process: check what it is doing with strace -p PID. A tight retry loop shows up as the same syscalls repeating or failing over and over; if strace prints almost nothing, the process is spinning in user space and perf top -p PID is the better tool. A common cause at HPE: a Kafka consumer stuck in a retry loop consuming 100% CPU. Fix: kill the process, find the poison message, add a retry limit with backoff in code.
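
The first step, comparing load to CPU count, can be sketched as a small helper. The 2× threshold comes from the answer above; the 1× "busy" threshold and the name load_verdict are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# load_verdict LOAD NCPU: classify a load average relative to the CPU count.
#   below 1.0 per CPU -> ok
#   from 1.0 per CPU  -> busy        (assumed threshold)
#   from 2.0 per CPU  -> overloaded  (threshold from the text)
load_verdict() {
  awk -v load="$1" -v cpus="$2" 'BEGIN {
    ratio = load / cpus
    if (ratio >= 2)      print "overloaded"
    else if (ratio >= 1) print "busy"
    else                 print "ok"
  }'
}

# Usage on a live system: the first field of /proc/loadavg is the 1-minute load.
#   read -r one_min _ < /proc/loadavg
#   load_verdict "$one_min" "$(nproc)"
# Then find the top consumers:
#   ps aux --sort=-%cpu | head -n 6
```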

LINUX · ENGINEER

What is the difference between a process and a thread in Linux?

A process is an independent program with its own memory space, file descriptors, and PID. A thread is a lightweight execution unit WITHIN a process: threads share the memory space and file descriptors of the process they belong to. Creating a process with fork() is more expensive than creating a thread. Linux does not copy all memory up front (pages are shared copy-on-write), but the kernel still has to duplicate page tables and other per-process state. Creating a thread is cheaper because it reuses the existing address space. In Linux, both are implemented as tasks created via the clone() syscall: processes use clone() without the CLONE_VM flag (separate memory), threads use clone() with CLONE_VM (shared memory). For DevOps: ps aux shows processes. To see threads: ps -eLf or top -H. Important for troubleshooting: if a Java process has 200 threads and CPU is high, use top -H -p PID to see which threads are burning CPU and jstack PID to get a thread dump showing what they are doing.
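
Because threads are tasks that share one address space, the kernel reports them per process. A minimal sketch for reading the thread count, assuming a Linux /proc filesystem; thread_count is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# thread_count PID: print the number of threads in a process, taken from the
# Threads: field of /proc/PID/status. Each thread also appears as a
# directory under /proc/PID/task.
thread_count() {
  awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# The current shell, for a quick sanity check
thread_count $$

# Usage against a service (the PID is whatever pgrep returns on your host):
#   thread_count "$(pgrep -o java)"
```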

LINUX · PRODUCTION

Your disk is 100% full on a production server. Walk through the fix without downtime.

Do NOT just delete random files. Systematic approach: First: df -h to confirm which partition is full (and df -i in case it is inodes, not bytes, that ran out). Second: du -xsh /* 2>/dev/null | sort -rh to find the largest directories. Third: common culprits in order: /var/log (logs grew unbounded), /var/lib/docker (Docker images/containers), /tmp (someone wrote large temp files), /home (developer left large files). If du accounts for far less than df reports, a process is probably holding a deleted file open; lsof +L1 lists such files, and restarting that process releases the space. Safe immediate fixes: journalctl --vacuum-size=500M to trim journal logs. find /var/log -name '*.gz' -mtime +30 -delete to remove old compressed logs. docker system prune -f to remove unused Docker resources (stopped containers, unused networks, dangling images, build cache). For a permanent fix: add a logrotate config and a monitoring alert at 80% disk usage. At HPE: had this on a TeMIP server. /var/log/app filled up because the log level was set to DEBUG in production. Fixed by changing the log level to INFO and adding logrotate.
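
The "find the largest directories" step can be wrapped in a helper that also sees hidden entries, which a plain /* glob skips. This is a sketch assuming GNU or BusyBox find and du; largest_dirs is a hypothetical name.

```shell
#!/usr/bin/env bash
# largest_dirs DIR [N]: print the N biggest direct children of DIR
# (default 10) as "SIZE_KIB<TAB>PATH", largest first.
largest_dirs() {
  local dir="$1" n="${2:-10}"
  # -x keeps du on one filesystem so other mounts do not distort the picture.
  # -k forces 1 KiB units so that sort -n compares like with like.
  find "$dir" -mindepth 1 -maxdepth 1 -exec du -xsk {} + 2>/dev/null \
    | sort -rn | head -n "$n"
}

# Usage: drill down one level at a time.
#   largest_dirs / 5
#   largest_dirs /var 5
#   largest_dirs /var/log 5
```

Repeating the call on the top result each time usually reaches the offending directory in three or four steps.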

LINUX · ARCHITECT

Explain Linux file permissions. How do you secure a private key file?

Every file has three permission sets: owner, group, others. Each set has three bits: read (4), write (2), execute (1). Common values: 755 = owner can rwx, group and others can rx, good for executables. 644 = owner can rw, group and others can read, good for config files. 600 = only owner can rw, nobody else has any access, required for SSH private keys. 700 = only owner can rwx, good for directories with sensitive content. For an SSH private key: chmod 600 ~/.ssh/id_rsa. If the permissions are too open, the ssh client prints an "UNPROTECTED PRIVATE KEY FILE" warning and ignores the key, so the login typically ends in a permission denied error. For production: sensitive config files should be 640 (owner read-write, group read) and owned by the application user. Never 777 on production: that means anyone can modify the file.
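
A check like the one ssh performs can be approximated in a few lines. This is a simplified sketch, not ssh's actual logic: it assumes a stat that supports -c '%a' (GNU coreutils or BusyBox), and key_perms_ok is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# key_perms_ok FILE: succeed only if group and others have no access at all,
# i.e. the octal mode ends in 00 (for example 600 or 400).
key_perms_ok() {
  local mode
  mode="$(stat -c '%a' "$1")" || return 2
  case "$mode" in
    *00) return 0 ;;
    *)   return 1 ;;
  esac
}

# Usage:
#   key_perms_ok ~/.ssh/id_rsa || chmod 600 ~/.ssh/id_rsa
```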

LINUX · PRODUCTION

How do you investigate a memory leak on a Linux server?

Memory leak = application allocates memory and never frees it. Symptoms: free -h shows available memory decreasing over hours/days, and the kernel eventually OOM-kills processes. Investigation: watch the specific process over time with watch -n 60 'ps -o pid,vsz,rss,comm -p PID'. RSS (resident memory) climbing steadily without ever coming back down is the key signal; VSZ (virtual) usually grows with it. Check dmesg and journalctl -k for OOM killer messages: they show which process was killed and how much memory it had. For Java: jmap -histo PID shows object count by class (which class is growing?). For Python: use tracemalloc or memory_profiler. For Go: use pprof. Immediate mitigation: restart the leaking service (a nightly cron restart if the fix takes time). Permanent fix: find the objects that are still referenced and therefore never freed, and fix the code. At HPE: a Python Kafka consumer cached every processed message ID in a dict without expiry. Fixed by bounding the cache: an OrderedDict that evicts its oldest entry once a size limit is reached.
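
To turn "watch it over time" into data you can graph or diff, the RSS can be sampled straight from /proc. A minimal sketch assuming a Linux /proc filesystem; rss_kb and sample_rss are hypothetical helper names.

```shell
#!/usr/bin/env bash
# rss_kb PID: resident set size in kB, from the VmRSS field of /proc/PID/status.
rss_kb() {
  awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# sample_rss PID COUNT INTERVAL: print "EPOCH_SECONDS RSS_KB" COUNT times,
# sleeping INTERVAL seconds between samples.
sample_rss() {
  local pid="$1" count="$2" interval="$3" i
  for (( i = 0; i < count; i++ )); do
    printf '%s %s\n' "$(date +%s)" "$(rss_kb "$pid")"
    if (( i + 1 < count )); then
      sleep "$interval"
    fi
  done
}

# Usage: one sample a minute for an hour, appended to a file for later comparison.
#   sample_rss 1234 60 60 >> /tmp/rss-1234.log
```

A leak shows up as a series that only ever rises; a healthy cache-heavy process rises and then plateaus.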

LINUX · ENGINEER

What is set -euo pipefail and why do you use it in bash scripts?

Three separate options: set -e makes the script exit as soon as a command returns a non-zero exit code (commands tested by if, while, && or || are exempt). Without it, the script carries on after a failed command, which is dangerous in deployment scripts. set -u makes the script exit when you reference an undefined variable. Without it, a typo in a variable name gives an empty string, a silent bug. Example: rm -rf $DIRECOTRY/ (typo) without -u expands to rm -rf /. Modern GNU rm refuses that exact command, but rm -rf $DIRECOTRY/* has no such safety net. set -o pipefail makes a pipeline fail if ANY command in the pipe fails. Without it, ls /nonexistent | sort returns exit code 0 because sort succeeded, so the ls failure is hidden. Together they make bash scripts behave more like proper programming languages: fail loudly on errors rather than silently continuing in a broken state. Every production bash script should start with these.
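
Each option is easy to observe in isolation. The sketch below runs every case in a fresh bash process so the settings do not leak between demos; the misspelled variable name is taken from the example above.

```shell
#!/usr/bin/env bash
# 1. Without pipefail, a pipeline's status is that of its last command (true).
bash -c 'false | true; echo "no pipefail: $?"'                  # prints "no pipefail: 0"

# 2. With pipefail, the failing first command decides the status.
bash -c 'set -o pipefail; false | true; echo "pipefail: $?"'    # prints "pipefail: 1"

# 3. With -u, a misspelled variable aborts the script before the line runs.
bash -c 'set -u; echo "target: ${DIRECOTRY}/"; echo "not reached"' 2>/dev/null \
  || echo "set -u stopped the script"

# 4. With -e, the first failing command ends the script.
bash -c 'set -e; false; echo "not reached"' \
  || echo "set -e stopped the script"
```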

LINUX · PRODUCTION

A service cannot connect to a database. Walk through network troubleshooting.

Layered investigation from the network up to the application. Step 1: can we reach the DB host at all? ping db-server from the app server. A failed ping suggests a routing or firewall issue, but many cloud networks block ICMP, so treat it as a hint and move on to the port check. Step 2: is the DB port open? nc -zv db-server 5432 (PostgreSQL) or nc -zv db-server 3306 (MySQL). If this fails, the DB is not listening, a firewall is blocking, or the host/port is wrong. Step 3: is DNS resolving correctly? nslookup db-server and check that it resolves to the right IP. Step 4: is the DB listening, and is a firewall in the way? On the DB server: ss -tlnp | grep 5432 shows whether PostgreSQL is actually listening and on which address, and sudo iptables -L -n | grep 5432 shows matching firewall rules. On the app server: check if outbound traffic on 5432 is allowed. Step 5: test the actual connection with the DB client: psql -h db-server -U user -d dbname. This confirms credentials and SSL settings too. Step 6: check the application config: wrong host name? wrong port? wrong credentials in the config file?
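
The DNS and port checks can be scripted so the first failing layer is reported. This is a sketch under stated assumptions: getent and timeout are available, and bash was built with /dev/tcp support (the default on mainstream distributions). The names check_tcp and diagnose are hypothetical.

```shell
#!/usr/bin/env bash
# check_tcp HOST PORT [SECONDS]: succeed if a TCP connection can be opened.
# Uses bash's built-in /dev/tcp, so it works on hosts without nc.
check_tcp() {
  local host="$1" port="$2" secs="${3:-3}"
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# diagnose HOST PORT: check name resolution, then the TCP port, and print
# the first layer that fails. Returns 1 for DNS, 2 for TCP, 0 if both pass.
diagnose() {
  local host="$1" port="$2"
  if ! getent hosts "$host" >/dev/null; then
    echo "dns: cannot resolve $host"
    return 1
  fi
  if ! check_tcp "$host" "$port"; then
    echo "tcp: $host:$port unreachable (not listening, firewalled, or wrong port)"
    return 2
  fi
  echo "ok: $host:$port accepts connections; check credentials and TLS next"
}

# Usage (db-server is the placeholder host name from the text):
#   diagnose db-server 5432
```

A passing result only proves the port accepts connections. Authentication and TLS still need the real client, as in step 5.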

LINUX · ARCHITECT

What is the Linux /proc filesystem and how do you use it for troubleshooting?

/proc is a virtual filesystem: it exists only in memory, not on disk. It exposes kernel and process information as readable files. Every process has a directory /proc/PID containing: cmdline (full command), fd (open file descriptors), status (memory, state), net (network info). Key files: /proc/meminfo shows a detailed memory breakdown including cached, buffers, available. /proc/cpuinfo shows CPU details and core count. /proc/loadavg shows the 1/5/15 minute load average. /proc/net/tcp shows all TCP connections in kernel format. For troubleshooting: cat /proc/PID/status shows state, memory usage, and thread count; cat /proc/PID/oom_score shows how likely the OOM killer is to pick this process. ls /proc/PID/fd | wc -l counts open file descriptors; if this is very high, you have a file descriptor leak. cat /proc/PID/net/tcp shows the TCP connections in that process's network namespace (not only its own sockets), which is handy for looking inside a container from the host. You should never edit /proc files except for specific tuning like /proc/sys/net/ipv4/tcp_fin_timeout or /proc/PID/oom_score_adj.
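
Two of these checks are short enough to script. The sketch below counts file descriptors and decodes the kernel-format port numbers in /proc/net/tcp, where addresses are hexadecimal and state 0A means LISTEN. The helper names fd_count and listening_ports are hypothetical.

```shell
#!/usr/bin/env bash
# fd_count PID: number of open file descriptors (one symlink per fd).
fd_count() {
  ls "/proc/$1/fd" | wc -l
}

# listening_ports: read /proc/net/tcp-format text on stdin and print the
# decimal port of every socket in state 0A (LISTEN), sorted and de-duplicated.
listening_ports() {
  local _sl local_addr _remote state _rest
  while read -r _sl local_addr _remote state _rest; do
    # The header line has "st" in this column and is skipped here too.
    [ "$state" = "0A" ] || continue
    # local_addr looks like 0100007F:1538; the part after the colon is the port in hex.
    printf '%d\n' "$((16#${local_addr##*:}))"
  done | sort -n | uniq
}

# Usage on a live system:
#   fd_count 1234
#   listening_ports < /proc/net/tcp
```

In day-to-day work ss -tln gives the same listening-port answer in readable form; reading /proc directly is the fallback for minimal containers where ss is not installed.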

🗺️ Roadmap

Week 1

Navigation

Navigate filesystem without GUI

Understand file permissions

Manage files: cp, mv, rm, find, grep

Week 2

Processes & Services

ps, top, kill — find and manage processes

systemctl — manage services

journalctl — read system logs

Week 3

Networking & Troubleshooting

ss, netstat, nc — port checking

curl, dig — HTTP and DNS testing

iostat, vmstat — performance analysis

Month 2

Bash Scripting

set -euo pipefail in every script

Functions, loops, error handling

Write a deployment script from scratch

Understand cron, at, systemd timers
