Anatomy of a Process
A developer on your team runs docker exec into a running container and expects to see the nginx master at PID 1 — except the "nginx master" has PID 12, and there are eight worker processes with PIDs 13 through 20. They ask: "Why is the master not PID 1? Which one do I kill if I want to restart nginx gracefully? Why does ps in the container only show processes from this container, while ps on the host shows everything? And when I kill -9 the master, why do the workers all die too — but when I kill -9 a worker, nginx just quietly spawns a new one?"
Every one of those questions has the same answer: there is a specific, well-defined structure to what a process is in Linux. Processes have parents, group leaders, session leaders, and a memory layout you can literally read byte-for-byte from /proc. Once you see that structure clearly, "why did this happen?" becomes a predictable walk through the process tree, not a guessing game.
What a Process Actually Is
A process is the kernel's accounting unit for "a running program." It is a struct in the kernel (called task_struct) that holds:
- Identity: a PID, a parent PID, owner (UID/GID), session and process group memberships.
- Memory: a virtual address space — text (code), data, heap, stacks, memory-mapped files.
- File descriptors: a table of open files, sockets, pipes.
- Scheduling state: running, sleeping, stopped, zombie; CPU time accumulated; priority.
- Credentials and capabilities: what it is allowed to do.
- Namespace membership: which PID namespace, mount namespace, network namespace, etc. (we cover this in Module 5).
- Signal state: what signals are blocked, what handlers are installed, what is pending.
When you type ls /tmp, the shell calls fork() to make a copy of itself and then execve() to replace the copy's program with /bin/ls. The kernel allocates a new task_struct, gives it a PID, hooks it into its parent's tree, and schedules it to run. When ls finishes, the kernel marks the process a zombie (its exit code needs to be reaped by the parent) and eventually frees the task_struct when the shell calls wait().
That cycle — fork, exec, exit, reap — happens thousands of times a second on a busy Linux system.
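The whole cycle fits in a few lines of code. Here is a sketch in Python's os module, using /bin/true as a stand-in for the command being run (the shell does exactly this, just in C):

```python
import os

# fork/exec/exit/reap in miniature, mirroring what a shell does for `ls /tmp`.
pid = os.fork()
if pid == 0:
    # Child: replace this process image with a new program.
    # /bin/true stands in for the real command here.
    os.execv("/bin/true", ["true"])
else:
    # Parent: waitpid() reaps the zombie and collects the exit status.
    reaped_pid, status = os.waitpid(pid, 0)
    exitcode = os.waitstatus_to_exitcode(status)
    print(f"reaped child {reaped_pid}, exit code {exitcode}")
```

Until the parent calls waitpid(), the exited child sits in the process table as a zombie — which is exactly the Z state covered below.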
The Identity of a Process
Every process has several IDs, and they each serve a different purpose. Confusing them is the source of a lot of bad scripts.
| ID | What it is | Tool to see it |
|---|---|---|
| PID | Process ID — the unique number for this process | ps -o pid, $$ in shell |
| PPID | Parent PID — the PID that created this one | ps -o ppid |
| PGID | Process Group ID — usually the PID of the leader of a pipeline | ps -o pgid |
| SID | Session ID — usually the PID of the controlling shell / sshd | ps -o sid |
| TID | Thread ID — every thread has one; main thread's TID == the PID | ps -T, top -H |
| UID / GID | The user/group the process runs as (real and effective) | ps -o uid,euid |
# Your current shell
echo $$
# 12345
# See everything about it
ps -o pid,ppid,pgid,sid,tty,uid,comm -p $$
# PID PPID PGID SID TT UID COMMAND
# 12345 12340 12345 12340 pts/0 1000 bash
# Every process on the system in tree form
pstree -p | head -20
# Process tree for a specific PID
pstree -p 1
PID, PPID, PGID, and SID are four different things and they rarely line up. PID is "which process." PPID is "who made it." PGID is "which pipeline am I in" — ls | grep foo | less puts all three in the same process group, so hitting Ctrl-C delivers SIGINT to all of them at once. SID is "which login session" — it outlives your shell if you nohup something. When you script around signals or cleanup, know which ID you actually need.
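From inside a process, each of these IDs is one syscall away. A quick sketch using Python's os module (argument 0 means "the calling process"):

```python
import os

pid  = os.getpid()     # which process am I
ppid = os.getppid()    # who forked me
pgid = os.getpgid(0)   # which pipeline / job am I part of
sid  = os.getsid(0)    # which login session do I belong to
print(f"PID={pid} PPID={ppid} PGID={pgid} SID={sid}")
```

Running this in a script started from an interactive shell typically shows PID == PGID (the script is its own job) but SID equal to the shell's PID — the same pattern as the ps output above.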
Process groups and sessions, explained with one pipeline
# Run a pipeline and inspect its processes from another terminal
sleep 60 | grep xyz | wc -l &
jobs -l
# [1]+ 23456 Running sleep 60 | grep xyz | wc -l
ps -eo pid,ppid,pgid,sid,comm | awk '$3 == 23456'
# PID PPID PGID SID COMMAND
# 23456 12345 23456 12340 sleep
# 23457 12345 23456 12340 grep
# 23458 12345 23456 12340 wc
Three processes, same PGID (23456), same SID (12340). PGID is the PID of the process group leader (the first process in the pipeline). SID is the PID of the session leader (the login shell). Ctrl-C in the terminal sends SIGINT to the whole process group; logging out sends SIGHUP to the whole session.
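The same group-signaling behavior can be reproduced programmatically. A sketch in Python: it puts a child into its own process group (as a shell does for each pipeline) and then signals the entire group with killpg — `sleep 60` is just a placeholder workload:

```python
import os
import signal
import subprocess

# Start a child in its own process group, like a shell starting a pipeline.
child = subprocess.Popen(["sleep", "60"], preexec_fn=os.setpgrp)

# setpgrp made the child its own group leader, so PGID == its PID.
group = os.getpgid(child.pid)

# Signal the whole group at once — this is what Ctrl-C does via the terminal.
os.killpg(group, signal.SIGTERM)

child.wait()
print("child terminated by signal:", -child.returncode)
```

With a real multi-process pipeline in that group, every member would receive the SIGTERM in one call — which is why Ctrl-C kills the whole pipeline, not just one command.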
The Lifecycle: Fork, Exec, Exit, Reap
Linux creates processes by copying an existing one and replacing its program. There is no "create from scratch" — every process descends from PID 1, which descends from the kernel itself.
The life of a child process
# Watch a fork/exec/exit happen under strace
strace -f -e trace=clone,execve,exit_group -o /tmp/trace.log bash -c 'ls /tmp > /dev/null'
cat /tmp/trace.log
# [pid 12345] clone(...) = 12346 <- shell forks
# [pid 12346] execve("/bin/ls", [...], ...) = 0 <- child becomes ls
# [pid 12346] exit_group(0) = ? <- ls exits cleanly
# [pid 12345] --- SIGCHLD {...} --- <- kernel notifies shell
This is the whole mechanism. Everything else — shells, systemd, Kubernetes, container runtimes — is a program that repeats this cycle in a loop.
fork() is implemented via the clone() syscall under the hood on Linux. clone() is more general: it lets the caller choose which resources to share (memory, file descriptors, namespaces) with the new task. Threads are created with clone() sharing memory and file tables; containers are created with clone() setting up new namespaces. Same syscall, different flags. When you see clone() in strace output, that is a process or thread being born.
Process States
A process is always in one of a handful of states. You can see the current state in /proc/[pid]/status as State: or in ps aux as the STAT column.
| Code | Name | Meaning |
|---|---|---|
| R | Running / Runnable | On a CPU right now, or in the run queue waiting for a CPU |
| S | Interruptible sleep | Blocked on something (usually I/O), but can be woken by a signal |
| D | Uninterruptible sleep | Blocked in a syscall that cannot be interrupted (usually disk I/O) — cannot be killed |
| T | Stopped | Suspended via SIGSTOP/SIGTSTP (like Ctrl-Z) |
| t | Tracing stop | Stopped under a debugger |
| Z | Zombie | Exited, waiting for parent to reap |
| X | Dead | Literally about to disappear; you will rarely see this |
# Show every process and state
ps ax -o stat,pid,comm | head -20
# STAT PID COMMAND
# Ss 1 systemd <- S=sleep, s=session leader
# S< 2 kthreadd
# I< 3 rcu_gp
# Ss 423 dbus-daemon
# Sl 1203 containerd <- l=multithreaded
# R+ 5678 ps <- R=on CPU (taking this snapshot), +=foreground
# Count processes by state
ps ax -o stat --no-headers | awk '{print $1}' | sort | uniq -c
A process stuck in D state cannot be killed — not even by SIGKILL. It is waiting inside the kernel for a syscall to return (typically disk I/O that is stuck, NFS hanging, a dying block device). You will see such processes accumulate in ps, but nothing you do at the process level will free them. The only fixes are: the I/O completes, the device driver times out, or the machine reboots. When you see load average climb into the hundreds with almost no CPU usage, it is almost always D-state processes piling up behind a stuck I/O.
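That triage can be scripted by reading /proc directly. A sketch in Python (Linux-only): it scans every PID for state D and reports the kernel function each one is blocked in, via the /proc/[pid]/wchan file:

```python
import os

# Find every task in D (uninterruptible sleep) and show where in the
# kernel it is blocked. Useful when load is high but CPU is idle.
stuck = []
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/stat") as f:
            # stat is "pid (comm) state ...": split after the (comm)
            # field so a comm containing spaces cannot confuse us.
            state = f.read().rsplit(")", 1)[1].split()[0]
        if state == "D":
            comm = open(f"/proc/{pid}/comm").read().strip()
            wchan = open(f"/proc/{pid}/wchan").read() or "?"
            stuck.append((pid, comm, wchan))
    except OSError:
        pass  # process exited while we were scanning

for pid, comm, wchan in stuck:
    print(pid, comm, "blocked in", wchan)
```

On a healthy machine this prints nothing; on a box with a stuck disk or a hung NFS mount it shows you exactly which processes are wedged and in what kernel path.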
The Virtual Address Space — What a Process Looks Like in Memory
Every process has its own virtual address space: the full range of addresses the CPU can reference from within this process — nominally 2⁶⁴ bytes on 64-bit hardware, of which user space typically gets the lower 2⁴⁷ bytes on x86-64. The kernel, via page tables, maps slices of that virtual space to real physical RAM (or swap, or not-yet-allocated).
A typical userspace layout (simplified, x86-64):
High address
┌──────────────────────────┐ 0xffffffffffffffff
│ Kernel (not mapped for │
│ userspace; it is there │
│ but inaccessible) │
├──────────────────────────┤ 0x00007fffffffffff
│ Stack (grows down) │ <- main thread's stack
│ ... │
├──────────────────────────┤
│ Shared libraries, mmap │ <- libc.so.6, libssl.so, mmapped files
│ regions │
├──────────────────────────┤
│ Heap (grows up) │ <- malloc lives here (via brk)
├──────────────────────────┤
│ BSS (uninitialized data)│ <- zeroed globals
├──────────────────────────┤
│ Data (initialized data) │ <- global variables with initial values
├──────────────────────────┤
│ Text (executable code) │ <- the compiled program
└──────────────────────────┘ 0x0000000000400000
Low address
You can see this for any process in /proc/[pid]/maps:
# Look at a real process
cat /proc/self/maps | head -15
# 55a6b2c00000-55a6b2c24000 r--p ... /usr/bin/cat <- ELF headers, read-only data
# 55a6b2c24000-55a6b2c4c000 r-xp ... /usr/bin/cat <- text (the executable code)
# 55a6b2c4c000-55a6b2c5e000 r--p ... /usr/bin/cat <- more read-only data
# 55a6b2c5e000-55a6b2c5f000 r--p ... /usr/bin/cat <- relro (data made read-only)
# 55a6b2c5f000-55a6b2c60000 rw-p ... /usr/bin/cat <- data, writable
# 55a6d1234000-55a6d1255000 rw-p ... [heap] <- the heap
# 7f... r--p ... /usr/lib/x86_64-linux-gnu/libc.so.6 <- libc
# 7f... r-xp ... /usr/lib/x86_64-linux-gnu/libc.so.6
# 7f... rw-p ... [stack] <- the main stack
# Total virtual size and resident size
cat /proc/self/status | grep -E 'Vm|Rss'
# VmPeak: 10240 kB
# VmSize: 10240 kB
# VmRSS: 2048 kB <- pages currently in physical RAM
# ...
The three numbers you will see over and over:
- VSS (Virtual Set Size): how much address space the process has mapped. Almost meaningless in isolation — a process can mmap a 100 GB file without using a byte of RAM.
- RSS (Resident Set Size): how much of that is currently in physical RAM. This is the number top shows you and the one that counts for OOM decisions.
- PSS (Proportional Set Size): RSS with shared pages divided by the number of processes sharing them. The most honest single number for "how much memory is this process actually using?"
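The VSS-vs-RSS distinction is easy to demonstrate: map a large anonymous region and never touch it. A sketch in Python (Linux-only, since it reads /proc/self/status):

```python
import mmap
import re

def vm(field):
    # Read a Vm* counter (in kB) from /proc/self/status.
    with open("/proc/self/status") as f:
        return int(re.search(rf"{field}:\s+(\d+) kB", f.read()).group(1))

size_before, rss_before = vm("VmSize"), vm("VmRSS")

# Map 1 GiB of anonymous memory but never write to it: the address space
# (VmSize) jumps by ~1 GiB while resident memory (VmRSS) barely moves,
# because the kernel allocates pages lazily on first touch.
m = mmap.mmap(-1, 1 << 30)
size_after, rss_after = vm("VmSize"), vm("VmRSS")
print(f"VmSize grew by {(size_after - size_before) // 1024} MB, "
      f"VmRSS grew by {(rss_after - rss_before) // 1024} MB")
m.close()
```

Writing even one byte per page into the mapping would make VmRSS climb to match — which is exactly why VIRT in top tells you so little on its own.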
# Memory used by all processes named nginx
ps -C nginx -o pid,comm,rss,vsz
# PID COMMAND RSS VSZ
# 1234 nginx 2048 10240
# 1235 nginx 1536 9216
# Proportional memory (smem is not always installed; this is a quick substitute)
for pid in $(pgrep nginx); do
awk '/Pss:/ {sum+=$2} END {print "PSS:", sum, "kB"}' /proc/$pid/smaps
done
We had a Python service that top claimed was using 18 GB on a 24 GB node, yet nothing was OOM-killed. It turned out to be a multi-process gunicorn setup with 8 workers, each forked from the parent and sharing pages copy-on-write. Summing the per-worker RSS that top reported gave 18 GB, but most of those pages were shared between workers; actual physical memory in use was around 4 GB. Reading Pss: out of /proc/*/smaps told the real story. After that, we stopped alerting on RSS sums and started alerting on PSS.
Threads Are Processes That Share Memory
Linux has one unified abstraction: every schedulable thing is a task. A "process" in the classical sense is a task with its own memory space and file descriptor table. A "thread" is a task that shares those with another task.
This is why top -H shows threads as separate entries, why /proc/[pid]/task/ exists, and why a multi-threaded Java process can show up in ps -eLf as hundreds of entries.
# Every thread of PID 1234
ls /proc/1234/task/
# 1234 1235 1236 1240 ... (the main thread's TID == PID)
# Show threads, not just processes
ps -eLf | head -10
# UID PID PPID LWP C NLWP STIME TTY CMD
# root 1 0 1 0 1 Apr19 ? /lib/systemd/systemd
# root 1203 1 1203 0 18 Apr19 ? /usr/bin/containerd
# root 1203 1 1204 0 18 Apr19 ? /usr/bin/containerd
# root 1203 1 1205 0 18 Apr19 ? /usr/bin/containerd
# ^^^^ LWP = thread ID (TID)
# In htop, press 'H' to toggle thread view
All threads in a process share: memory, file descriptors, signal handlers, current working directory, UID/GID. They have their own: TID, stack, CPU registers, signal mask, scheduling state.
This is why:
- A memory leak in one thread leaks from the whole process.
- A thread crashing the process (segfault) takes all threads down with it.
- kill $PID sends a signal to the process, but exactly one thread will receive it (the kernel picks one that has not blocked that signal). pthread_kill(tid, signal) targets a specific thread.
Linux does not have separate "process" and "thread" syscalls. Both are created by clone() with different flags. A process is clone() with a new memory space; a thread is clone() sharing the parent's memory space. This is why Linux thread creation is fast — it is the same mechanism as process creation, just with different sharing flags — and why every threading library on Linux ultimately calls clone().
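You can watch this from userspace: starting a thread adds an entry under /proc/self/task without changing the PID. A sketch in Python (Linux-only, since it reads /proc):

```python
import os
import threading
import time

def task_count():
    # Each directory under /proc/self/task is one thread (task) of this process.
    return len(os.listdir("/proc/self/task"))

before = task_count()
t = threading.Thread(target=time.sleep, args=(1,))
t.start()                      # under the hood: clone() sharing our memory
after = task_count()
print(f"tasks before={before}, after starting one thread={after}")
t.join()
```

The new task gets its own TID (a new directory name) but the process's PID is unchanged — one clone() call, sharing flags set for a thread.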
Parent-Child Relationships: Why It Matters Who Dies First
If a child exits while its parent is still alive, the parent is responsible for reaping it (wait() / waitpid()). If the parent exits first, the child is orphaned and gets re-parented to PID 1 (systemd) — which immediately reaps anything that exits as its child.
This is the mechanism that makes PID 1 a big deal. Every zombie in the system eventually becomes systemd's problem.
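You can watch a zombie appear and disappear in a few lines. A sketch in Python: fork a child that exits immediately, inspect its state before reaping, then reap it:

```python
import os
import time

# Fork a child that exits straight away, but delay the reap: until the
# parent calls waitpid(), the child lingers in the process table as a zombie.
pid = os.fork()
if pid == 0:
    os._exit(0)                  # child exits immediately

time.sleep(0.2)                  # give the child time to finish exiting
with open(f"/proc/{pid}/stat") as f:
    # stat is "pid (comm) state ...": the field after (comm) is the state.
    state = f.read().rsplit(")", 1)[1].split()[0]
print(f"child {pid} state before reaping: {state}")   # Z = zombie

os.waitpid(pid, 0)               # reap: the kernel now frees the task_struct
```

If the parent died instead of calling waitpid(), PID 1 would inherit the child and do the reaping — that inheritance is exactly what breaks inside a container whose PID 1 never calls wait().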
# See every process's parent
ps -eo pid,ppid,comm | head -20
# Find orphaned-then-reparented-to-PID-1 processes
ps -eo pid,ppid,comm | awk '$2 == 1 {print}'
# These were originally children of something else; the parent died
# When something is wrong, look for zombies
ps -eo stat,pid,ppid,comm | awk '$1 ~ /^Z/'
Why containers care about PID 1
A container's PID 1 — the first process in the container's PID namespace — inherits the "reap all orphans" responsibility for its namespace. If your container's entrypoint is a shell script that execs something else, fine. If it runs a long-lived program as a child and does not reap its grandchildren, you will leak zombies inside the container until it dies. This is why the tini init binary exists and why Docker has a --init flag.
The Tools You Use Daily
# What is this process and its descendants?
pstree -p $(pgrep -o nginx)
# Every process, full info, scrollable
ps auxf | less
# Interactive — pick your columns with 'f', sort with '<'/'>'
top # classic
htop # much better, almost universally installed now
# Find processes by name
pgrep -lf nginx
# 1234 nginx: master process /usr/sbin/nginx
# 1235 nginx: worker process
# 1236 nginx: worker process
# Send a signal by name
pkill -TERM -f myapp
# Just the pid, for scripting
pgrep -f 'myapp --config prod'
# Count threads of a process
ls /proc/$PID/task | wc -l
# CPU usage of a specific process, updated every second
pidstat -p $PID 1
# Memory of a specific process over time
while true; do cat /proc/$PID/status | grep -E 'VmRSS|VmSize'; sleep 2; done
ps auxf (the f adds the ASCII tree view) is the single most useful invocation for "what is happening on this box?" It tells you at a glance: every process, its user, its parent chain, memory, CPU. When you are first on a machine, ps auxf | less is worth more than any monitoring dashboard — you see the real topology of the workload.
Key Concepts Summary
- A process is a task_struct in the kernel. Identity, memory, file descriptors, signal state, namespaces — all live in that struct.
- PID, PPID, PGID, SID, UID are five different IDs. Use the right one: PID for a specific process, PGID for a pipeline, SID for a login session, UID for ownership.
- Processes are created by fork + exec. No "create from scratch" exists — every process descends from PID 1.
- States: R, S, D, T, Z, X. D is uninterruptible sleep and cannot be killed; Z is a zombie waiting for its parent to reap it.
- Virtual address space is per-process. Text, data, heap, stack, mmaps, and shared libraries all live in there. /proc/[pid]/maps shows the layout.
- RSS vs VSS vs PSS. RSS is in RAM now; VSS is mapped; PSS divides shared pages fairly. Use PSS for "what is this process actually costing?"
- Threads are processes that share memory. Created with clone(), just like processes. /proc/[pid]/task/ lists them.
- Orphans reparent to PID 1. This is why your container's entrypoint matters: whoever is PID 1 must reap, or zombies pile up.
Common Mistakes
- Killing the wrong thing because you used PID when you needed PGID — kill $PID hits one process; kill -- -$PGID hits the whole group.
- Reading top's VIRT column and panicking. A process can mmap gigabytes without using a single page of RAM. RES (RSS) is the number that matters for OOM.
- Summing RSS across identical forked workers and double-counting shared pages. Use PSS (the Pss: field in /proc/*/smaps) when it matters.
- Assuming kill -9 always works. D-state processes will not respond to any signal, including SIGKILL.
- Running a container with a shell entrypoint that does not reap children. Either use a real init (tini, --init), exec the main process, or be prepared for zombie leaks.
- Confusing ps -T (threads of a specific process) with ps -t (processes on a specific terminal) — same letter, very different behavior.
- Debugging "slow processes" by reading app logs when ps -o stat would show half of them are in D state waiting on a stuck disk.
- Forgetting that fork() returns twice — in the parent with the child's PID, in the child with 0. Bad error handling around fork is a common source of shell-script bugs that only trigger under load.
You run `ps auxf` on a production server and see a Python service with 200 processes named `python myapp` arranged in a flat list under `systemd`. Their PPID is all 1. The service is supposed to fork 8 workers from a single master. What has most likely happened?