Linux Fundamentals for Engineers

Anatomy of a Process

A developer on your team runs docker exec into a running container, expecting to find the nginx master at PID 1 — except the "nginx master" has PID 12, and there are eight worker processes with PIDs 13 through 20. They ask: "Why is the master not PID 1? Which one do I kill if I want to restart nginx gracefully? Why does ps in the container only show processes from this container, but ps on the host shows everything? And when I kill -9 the master, why do the workers all die too — but when I kill -9 a worker, nginx just quietly spawns a new one?"

Every one of those questions has the same answer: there is a specific, well-defined structure to what a process is in Linux. Processes have parents, group leaders, session leaders, and a memory layout you can literally read byte-for-byte from /proc. Once you see that structure clearly, "why did this happen?" becomes a predictable walk through the process tree, not a guessing game.


What a Process Actually Is

A process is the kernel's accounting unit for "a running program." It is a struct in the kernel (called task_struct) that holds:

  • Identity: a PID, a parent PID, owner (UID/GID), session and process group memberships.
  • Memory: a virtual address space — text (code), data, heap, stacks, memory-mapped files.
  • File descriptors: a table of open files, sockets, pipes.
  • Scheduling state: running, sleeping, stopped, zombie; CPU time accumulated; priority.
  • Credentials and capabilities: what it is allowed to do.
  • Namespace membership: which PID namespace, mount namespace, network namespace, etc. (we cover this in Module 5).
  • Signal state: what signals are blocked, what handlers are installed, what is pending.

When you type ls /tmp, the shell calls fork() to make a copy of itself and then execve() to replace the copy's program with /bin/ls. The kernel allocates a new task_struct, gives it a PID, hooks it into its parent's tree, and schedules it to run. When ls finishes, the kernel marks the process a zombie (its exit code needs to be reaped by the parent) and eventually frees the task_struct when the shell calls wait().

That cycle — fork, exec, exit, reap — happens millions of times a second across a busy Linux system.
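
You can poke at the fork/exec distinction from any bash prompt. A quick sketch: a subshell is a fork of bash that never execs, while bash -c forks and then execs a fresh program.

```shell
# $$ is the PID of the shell that started; $BASHPID is the PID of the bash
# process actually running the line, so it changes in a forked subshell.
echo "shell: $$"
( echo "forked subshell: $BASHPID" )            # fork without exec: still bash
bash -c 'echo "fork + exec, fresh bash: $$"'    # fork, then execve of a new program
```
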


The Identity of a Process

Every process has several IDs, and they each serve a different purpose. Confusing them is the source of a lot of bad scripts.

  ID          What it is                                                      Tool to see it
  PID         Process ID — the unique number for this process                 ps -o pid, $$ in shell
  PPID        Parent PID — the PID that created this one                      ps -o ppid
  PGID        Process Group ID — usually the PID of the leader of a pipeline  ps -o pgid
  SID         Session ID — usually the PID of the controlling shell / sshd    ps -o sid
  TID         Thread ID — every thread has one; main thread's TID == the PID  ps -T, top -H
  UID / GID   The user/group the process runs as (real and effective)         ps -o uid,euid

# Your current shell
echo $$
# 12345

# See everything about it
ps -o pid,ppid,pgid,sid,tty,uid,comm -p $$
#    PID   PPID   PGID    SID TT       UID COMMAND
#  12345  12340  12345  12340 pts/0   1000 bash

# Every process on the system in tree form
pstree -p | head -20

# Process tree for a specific PID
pstree -p 1
KEY CONCEPT

PID, PPID, PGID, and SID are four different things and they rarely line up. PID is "which process." PPID is "who made it." PGID is "which pipeline am I in" — ls | grep foo | less puts all three in the same process group, so hitting Ctrl-C delivers SIGINT to all of them at once. SID is "which login session" — it outlives your shell if you nohup something. When you script around signals or cleanup, know which ID you actually need.

Process groups and sessions, explained with one pipeline

# Run a pipeline and inspect its processes from another terminal
sleep 60 | grep xyz | wc -l &
jobs -l
# [1]+ 23456 Running  sleep 60 | grep xyz | wc -l

# Note: ps -g selects by session, not process group, so filter on the PGID column
ps -eo pid,ppid,pgid,sid,comm | awk 'NR == 1 || $3 == 23456'
#    PID   PPID   PGID    SID COMMAND
#  23456  12345  23456  12340 sleep
#  23457  12345  23456  12340 grep
#  23458  12345  23456  12340 wc

Three processes, same PGID (23456), same SID (12340). PGID is the PID of the process group leader (the first process in the pipeline). SID is the PID of the session leader (the login shell). Ctrl-C in the terminal sends SIGINT to the whole process group; logging out sends SIGHUP to the whole session.
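
The PGID is directly usable from a script, too: signalling a negative PID signals the whole group, which is how you take down an entire pipeline in one shot. A sketch — note the set -m, because in non-interactive shells pipelines only get their own process group when job control is on:

```shell
# Pipelines only get their own process group when job control is enabled,
# so in a script you need set -m first.
set -m
sleep 300 | cat &                          # a stand-in long-running pipeline
pgid=$(ps -o pgid= -p $! | tr -d ' ')      # PGID == PID of the group leader (sleep)
kill -TERM -- "-$pgid"                     # a negative PID signals the whole group
```

This is the same mechanism the terminal driver uses when you press Ctrl-C.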


The Lifecycle: Fork, Exec, Exit, Reap

Linux creates processes by copying an existing one and replacing its program. There is no "create from scratch" — every process descends from PID 1, which descends from the kernel itself.

The life of a child process

# Watch a fork/exec/exit happen under strace
strace -f -e trace=clone,execve,exit_group -o /tmp/trace.log bash -c 'ls /tmp > /dev/null'
cat /tmp/trace.log
# [pid 12345] clone(...) = 12346            <- shell forks
# [pid 12346] execve("/bin/ls", [...], ...) = 0   <- child becomes ls
# [pid 12346] exit_group(0) = ?             <- ls exits cleanly
# [pid 12345] --- SIGCHLD {...} ---        <- kernel notifies shell

This is the whole mechanism. Everything else — shells, systemd, Kubernetes, container runtimes — is a program that repeats this cycle in a loop.

PRO TIP

fork() is implemented via the clone() syscall under the hood on Linux. clone() is more general: it lets the caller choose which resources to share (memory, file descriptors, namespaces) with the new task. Threads are created with clone() sharing memory and file tables; containers are created with clone() setting up new namespaces. Same syscall, different flags. When you see clone() in strace output, that is a process or thread being born.
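
If strace and python3 are on the box, you can watch those flags directly; python3 here is just a convenient way to spawn one thread:

```shell
# A fork shows up as a clone() without the sharing flags:
strace -f -e trace=clone,clone3 bash -c 'ls > /dev/null; :' 2>&1 | grep -m1 clone

# A thread carries CLONE_VM, CLONE_FILES, CLONE_SIGHAND, CLONE_THREAD, ...
strace -f -e trace=clone,clone3 python3 -c \
  'import threading; t = threading.Thread(target=lambda: None); t.start(); t.join()' \
  2>&1 | grep -m1 CLONE_THREAD
```
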


Process States

A process is always in one of a handful of states. You can see the current state in /proc/[pid]/status as State: or in ps aux as the STAT column.

  Code  Name                   Meaning
  R     Running / Runnable     On a CPU right now, or in the run queue waiting for a CPU
  S     Interruptible sleep    Blocked on something (usually I/O), but can be woken by a signal
  D     Uninterruptible sleep  Blocked in a syscall that cannot be interrupted (usually disk I/O) — cannot be killed
  T     Stopped                Suspended via SIGSTOP/SIGTSTP (like Ctrl-Z)
  t     Tracing stop           Stopped under a debugger
  Z     Zombie                 Exited, waiting for parent to reap
  X     Dead                   Literally about to disappear; you will rarely see this

# Show every process and state
ps ax -o stat,pid,comm | head -20
# STAT    PID COMMAND
# Ss        1 systemd       <- S=sleep, s=session leader
# S<        2 kthreadd
# I<        3 rcu_gp
# Ss      423 dbus-daemon
# Sl     1203 containerd    <- l=multithreaded
# R+     5678 ps            <- +=in the foreground process group

# Count processes by state
ps ax -o stat --no-headers | awk '{print $1}' | sort | uniq -c
WARNING

A process stuck in D state cannot be killed — not even by SIGKILL. It is waiting inside the kernel for a syscall to return (typically disk I/O that is stuck, NFS hanging, a dying block device). You will see it accumulate with ps but nothing you do at the process level will free it. The only fixes are: the I/O completes, the device driver times out, or the machine reboots. When you see load average climb into the hundreds with almost no CPU usage, it is almost always D-state processes piling up behind a stuck I/O.
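
When you suspect this, ps can also show which kernel function each stuck task is sleeping in, via the wchan column:

```shell
# List D-state tasks and the kernel symbol each one is sleeping in
ps -eo state,pid,wchan:32,comm | awk '$1 ~ /^D/'

# Count stuck tasks per wait channel
ps -eo state,wchan:32 --no-headers | awk '$1 ~ /^D/ {print $2}' | sort | uniq -c
```

Many identical wait channels usually point at one stuck device or mount.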


The Virtual Address Space — What a Process Looks Like in Memory

Every process has its own virtual address space: the full 64-bit range of addresses the CPU can reference from this process (current x86-64 hardware actually decodes 48 or 57 of those bits). The kernel, via page tables, maps slices of that virtual space to real physical RAM, to swap, or to nothing at all (not yet allocated).

A typical userspace layout (simplified, x86-64):

  High address
  ┌──────────────────────────┐  0xffffffffffffffff
  │  Kernel (not mapped for  │
  │  userspace; it is there  │
  │  but inaccessible)       │
  ├──────────────────────────┤  0x00007fffffffffff
  │  Stack (grows down)      │   <- main thread's stack
  │  ...                     │
  ├──────────────────────────┤
  │  Shared libraries, mmap  │   <- libc.so.6, libssl.so, mmapped files
  │  regions                 │
  ├──────────────────────────┤
  │  Heap (grows up)         │   <- malloc lives here (via brk)
  ├──────────────────────────┤
  │  BSS (uninitialized data)│   <- zeroed globals
  ├──────────────────────────┤
  │  Data (initialized data) │   <- global variables with initial values
  ├──────────────────────────┤
  │  Text (executable code)  │   <- the compiled program
  └──────────────────────────┘  0x0000000000400000
  Low address

You can see this for any process in /proc/[pid]/maps:

# Look at a real process
cat /proc/self/maps | head -15
# 55a6b2c00000-55a6b2c24000 r--p ... /usr/bin/cat    <- ELF header + read-only data
# 55a6b2c24000-55a6b2c4c000 r-xp ... /usr/bin/cat    <- text (executable code)
# 55a6b2c4c000-55a6b2c5e000 r--p ... /usr/bin/cat    <- more read-only data
# 55a6b2c5e000-55a6b2c5f000 r--p ... /usr/bin/cat    <- data made read-only after relocation
# 55a6b2c5f000-55a6b2c60000 rw-p ... /usr/bin/cat    <- data, writable
# 55a6d1234000-55a6d1255000 rw-p ... [heap]          <- the heap
# 7f... r--p ... /usr/lib/x86_64-linux-gnu/libc.so.6 <- libc
# 7f... r-xp ... /usr/lib/x86_64-linux-gnu/libc.so.6
# 7f... rw-p ... [stack]                             <- the main stack

# Total virtual size and resident size
cat /proc/self/status | grep -E 'Vm|Rss'
# VmPeak:    10240 kB
# VmSize:    10240 kB
# VmRSS:      2048 kB   <- pages currently in physical RAM
# ...

The three numbers you will see over and over:

  • VSZ (Virtual Size, sometimes written VSS): how much address space the process has mapped. Almost meaningless in isolation — a process can mmap a 100 GB file without using a byte of RAM.
  • RSS (Resident Set Size): how much of that is currently in physical RAM. This is the number top shows you and the one that counts for OOM decisions.
  • PSS (Proportional Set Size): RSS with shared pages divided by the number of processes sharing them. The most honest single number for "how much memory is this process actually using?"
# Memory used by all processes named nginx
ps -C nginx -o pid,comm,rss,vsz
#    PID COMMAND    RSS   VSZ
#   1234 nginx     2048 10240
#   1235 nginx     1536  9216

# Proportional memory (smem is not always installed; this is a quick substitute)
for pid in $(pgrep nginx); do
  awk '/Pss:/ {sum+=$2} END {print "PSS:", sum, "kB"}' /proc/$pid/smaps
done
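
A cheaper variant, assuming a kernel of 4.14 or newer: /proc/[pid]/smaps_rollup holds the same fields as smaps, already summed across all mappings, so the kernel does the adding for you:

```shell
# Pre-summed Pss for a single process (kernel 4.14+)
grep '^Pss:' /proc/self/smaps_rollup

# Total PSS across every nginx process, no per-mapping loop needed
for pid in $(pgrep nginx); do
  grep '^Pss:' /proc/$pid/smaps_rollup
done | awk '{sum += $2} END {print "PSS:", sum, "kB"}'
```
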
WAR STORY

We had a Python service whose workers' RSS summed to 18 GB on a 24 GB node, yet nothing was OOM-killed. It was a multi-process gunicorn setup with 8 workers, each forked from the parent via copy-on-write. top faithfully reported each worker's full resident set, so summing those counted the pages shared between workers eight times over. Actual physical memory in use was around 4 GB. Reading Pss: out of /proc/*/smaps told the real story. After that, we stopped alerting on RSS sums and started alerting on PSS.


Threads Are Processes That Share Memory

Linux has one unified abstraction: every schedulable thing is a task. A "process" in the classical sense is a task with its own memory space and file descriptor table. A "thread" is a task that shares those with another task.

This is why top -H shows threads as separate entries, why /proc/[pid]/task/ exists, and why a multi-threaded Java process can show up in ps as hundreds of entries under the hood.

# Every thread of PID 1234
ls /proc/1234/task/
# 1234  1235  1236  1240  ...  (the main thread's TID == PID)

# Show threads, not just processes
ps -eLf | head -10
#   UID        PID   PPID    LWP  C NLWP STIME TTY    CMD
#   root         1      0      1  0    1 Apr19 ?      /lib/systemd/systemd
#   root      1203      1   1203  0   18 Apr19 ?      /usr/bin/containerd
#   root      1203      1   1204  0   18 Apr19 ?      /usr/bin/containerd
#   root      1203      1   1205  0   18 Apr19 ?      /usr/bin/containerd
#                              ^^^^ LWP = thread ID (TID)

# In htop, press 'H' to toggle thread view

All threads in a process share: memory, file descriptors, signal handlers, current working directory, UID/GID. They have their own: TID, stack, CPU registers, signal mask, scheduling state.

This is why:

  • A memory leak in one thread leaks from the whole process.
  • A thread crashing the process (segfault) takes all threads down with it.
  • kill $PID sends a signal to the process, but exactly one thread will receive it (the kernel picks).
  • pthread_kill(tid, signal) targets a specific thread.
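
You can watch the task model from outside. This sketch uses python3 purely as a convenient way to start a few threads; each one then appears as a directory under /proc/[pid]/task:

```shell
# Spawn a process with four extra threads, then count its tasks from outside
python3 -c '
import threading, time
for _ in range(4):
    threading.Thread(target=time.sleep, args=(3,), daemon=True).start()
time.sleep(3)
' &
pid=$!
sleep 1
ls /proc/$pid/task | wc -l    # main thread plus the four workers
kill $pid
```
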
KEY CONCEPT

Linux does not have separate "process" and "thread" syscalls. Both are created by clone() with different flags. A process is clone() with a new memory space; a thread is clone() sharing the parent's memory space. This is why Linux thread creation is fast — it is the same mechanism as process creation, just with different sharing flags — and why every threading library on Linux ultimately calls clone().


Parent-Child Relationships: Why It Matters Who Dies First

If a child exits while its parent is still alive, the parent is responsible for reaping it (wait() / waitpid()). If the parent exits first, the child is orphaned and gets re-parented to PID 1 (systemd) — which immediately reaps anything that exits as its child.

This is the mechanism that makes PID 1 a big deal. Every zombie in the system eventually becomes systemd's problem.

# See every process's parent
ps -eo pid,ppid,comm | head -20

# Find orphaned-then-reparented-to-PID-1 processes
ps -eo pid,ppid,comm | awk '$2 == 1 {print}'
# These were originally children of something else; the parent died

# When something is wrong, look for zombies
# (grep -w 'Z' also matches stray Z's in command names; match the STAT column)
ps -eo stat,pid,ppid,comm | awk '$1 ~ /^Z/'
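
You can manufacture a zombie to see the mechanism. The sketch below uses python3 because a shell reaps its own children automatically; this parent forks and deliberately never calls wait():

```shell
# The child exits at once; the parent deliberately never calls wait()
python3 -c '
import os, time
if os.fork() == 0:
    os._exit(0)
time.sleep(10)
' &
parent=$!
sleep 1
ps -o pid,stat,comm --ppid $parent    # the child shows up with STAT Z
kill $parent
```

The moment the parent is killed, PID 1 adopts the zombie and reaps it.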

Why containers care about PID 1

A container's PID 1 — the first process in the container's PID namespace — inherits the "reap all orphans" responsibility for its namespace. If your container's entrypoint is a shell script that execs something else, fine. If it runs a long-lived program as a child and does not reap its grandchildren, you will leak zombies inside the container until it dies. This is why the tini init binary exists and why Docker has a --init flag.
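
You can reproduce the container's view without any container runtime, assuming util-linux's unshare and root privileges: give ps a fresh PID namespace and a matching /proc, and it sees only itself.

```shell
# Needs root. --fork makes the inner command the first process in the new
# PID namespace; --mount-proc gives it a /proc that matches that namespace.
sudo unshare --pid --fork --mount-proc ps -o pid,ppid,comm
# ps sees only itself: from inside, the rest of the host does not exist
```
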


The Tools You Use Daily

# Everything in your login session, as a tree (a numeric -g selects by session)
ps --forest -o pid,ppid,stat,comm -g $(ps -o sid= -p $$)

# Every process, full info, scrollable
ps auxf | less

# Interactive — pick your columns with 'f', sort with '<'/'>'
top       # classic
htop      # much better, almost universally installed now

# Find processes by name
pgrep -lf nginx
# 1234 nginx: master process /usr/sbin/nginx
# 1235 nginx: worker process
# 1236 nginx: worker process

# Send a signal by name
pkill -TERM -f myapp

# Just the pid, for scripting
pgrep -f 'myapp --config prod'

# Count threads of a process
ls /proc/$PID/task | wc -l

# CPU usage of a specific process, updated every second
pidstat -p $PID 1

# Memory of a specific process over time
while true; do cat /proc/$PID/status | grep -E 'VmRSS|VmSize'; sleep 2; done
PRO TIP

ps auxf (the f adds the ASCII tree view) is the single most useful invocation for "what is happening on this box?" It tells you at a glance: every process, its user, its parent chain, memory, CPU. When you are first on a machine, ps auxf | less is worth more than any monitoring dashboard — you see the real topology of the workload.


Key Concepts Summary

  • A process is a task_struct in the kernel. Identity, memory, file descriptors, signal state, namespaces — all live in that struct.
  • PID, PPID, PGID, SID, UID are five different IDs. Use the right one: PID for a specific process, PGID for a pipeline, SID for a login session, UID for ownership.
  • Processes are created by fork + exec. No "create from scratch" exists — every process descends from PID 1.
  • States: R, S, D, T, Z, X. D is uninterruptible sleep and cannot be killed; Z is a zombie waiting for its parent to reap it.
  • Virtual address space is per-process. Text, data, heap, stack, mmaps, and shared libraries all live in there. /proc/[pid]/maps shows the layout.
  • RSS vs VSS vs PSS. RSS is in RAM now; VSS is mapped; PSS divides shared pages fairly. Use PSS for "what is this process actually costing?"
  • Threads are processes that share memory. Created with clone(), just like processes. /proc/[pid]/task/ lists them.
  • Orphans reparent to PID 1. This is why your container's entrypoint matters: whoever is PID 1 must reap, or zombies pile up.

Common Mistakes

  • Killing the wrong thing because you used PID when you needed PGID — kill $PID hits one process; kill -- -$PGID hits the whole group.
  • Reading top's VIRT column and panicking. A process can mmap gigabytes without using a single page of RAM. RES (RSS) is the number that matters for OOM.
  • Summing RSS across identical forked workers and double-counting shared pages. Use PSS (/proc/*/smaps Pss: field) when it matters.
  • Assuming kill -9 always works. D-state processes will not respond to any signal, including SIGKILL.
  • Running a container with a shell entrypoint that does not reap children. Either use a real init (tini, --init), exec the main process, or be prepared for zombie leaks.
  • Confusing ps -T (threads of a specific process) with ps -t (processes on a specific terminal) — same letter, very different behavior.
  • Debugging "slow processes" by reading app logs when ps -o stat would show half of them are in D state waiting on a stuck disk.
  • Forgetting that fork() returns twice — in the parent with the child's PID, in the child with 0. Bad error handling around fork is a common source of shell-script bugs that only trigger under load.

KNOWLEDGE CHECK

You run `ps auxf` on a production server and see a Python service with 200 processes named `python myapp` arranged in a flat list under `systemd` — all of them have PPID 1. The service is supposed to fork 8 workers from a single master. What has most likely happened?