When Something Is Weird
A batch job that has run cleanly every night for two years suddenly hangs at exactly the same step each run. No error. No crash. No useful log line — it just sits there. The team's first instinct is to read the application logs, then the database logs, then the queue logs. Nothing. Three hours in, someone finally runs
cat /proc/$PID/syscall and sees the process has been inside recvfrom on a specific fd for 40 minutes. ls -l /proc/$PID/fd/FD reveals it is a socket to a server that was decommissioned last week but not removed from config. The process is patiently waiting for a reply that will never come.
The hardest production problems are the ones with no stack trace, no error message, no clear failure — a process is "stuck" or "slow" or "using too much" of something, and logging does not help because the code is not in a path that logs anything. Linux gives you live X-ray vision into running processes through /proc, strace, lsof, perf, dmesg, and a handful of newer eBPF tools. This lesson is the mindset and the toolkit: given a weird process, how do you peel back the layers without restarting, without adding logs, without waiting for a redeploy?
The Mindset: Observe, Don't Restart
The reflex "restart and see if it comes back" is the worst debugging habit in production. It destroys the evidence. Every time a process misbehaves, there is information in its state: which syscall it is in, which files it has open, which threads are doing what, what its memory looks like, what the kernel last said about it. Restarting throws that away and guarantees the same problem will happen again.
The rule: gather evidence first, restart only as a last resort. The tools below do not require changes to the running process, do not require you to install anything exotic, and on most systems do not even require stopping or attaching a debugger.
Every weird-process debug session follows the same script: (1) read /proc/$PID/status for state, (2) read /proc/$PID/wchan and /proc/$PID/syscall for what the kernel thinks it is doing, (3) read /proc/$PID/fd and ls it for what resources it has open, (4) trace with strace or perf for live behavior, (5) check dmesg for kernel complaints about it. That five-step sequence resolves 90% of "something is weird" incidents. The skill is knowing the sequence; the rest is interpretation.
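The five-step sequence can be sketched as a small script. This is an illustrative sketch, not a canonical tool: the filename triage.sh is made up, it defaults to the current shell's PID so it is safe to try, and steps 4 and 5 are left as comments because they need root.

```shell
#!/bin/sh
# triage.sh — sketch of the five-step sequence (steps 1-3 live; 4-5 as comments)
PID="${1:-$$}"   # default to this shell's own PID so the script is safe to run as-is

echo "== 1. identity + state =="
grep -E '^(Name|State|Threads|VmRSS):' "/proc/$PID/status"

echo "== 2. what the kernel thinks it is doing =="
cat "/proc/$PID/wchan" 2>/dev/null; echo
cat "/proc/$PID/syscall" 2>/dev/null || echo "(need same-user or root to read syscall)"

echo "== 3. open file descriptors =="
ls -l "/proc/$PID/fd" 2>/dev/null | head -20

# == 4. live behavior (root) ==   sudo strace -p "$PID" -c    # Ctrl-C after a sample
# == 5. kernel complaints ==      dmesg -T | tail -50
```

Run it first in every incident; the snapshot is cheap and destroys nothing.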
The Toolkit
/proc/[pid]/* — live process X-ray (Module 2 Lesson 3 recap)
Always the first stop.
# Identity + state
grep -E '^(Name|State|Pid|PPid|Threads|VmRSS|SigQ|SigBlk|SigCgt):' /proc/$PID/status
# What syscall is it in right now?
cat /proc/$PID/syscall
# 0 0x3 0x7fff... 0x2000 0 0 0 0x7fff... 0x7ff...
# First number = syscall number. Look it up (ausyscall ships with the audit package):
ausyscall $(awk '{print $1}' /proc/$PID/syscall)
# What kernel function did it block in?
cat /proc/$PID/wchan
# futex_wait_queue_me <- waiting on a mutex/condvar
# sk_wait_data <- waiting on socket data
# do_epoll_wait <- event loop idle
# io_schedule <- waiting on disk I/O
# Every file descriptor
ls -l /proc/$PID/fd | head
These four files — status, syscall, wchan, fd/ — tell you more about a weird process than any amount of code reading.
strace — see every syscall
# Attach to a running process and show every syscall
sudo strace -p $PID
# With timestamps and time-spent-in-each
sudo strace -p $PID -tt -T
# Summary: count and time per syscall
sudo strace -p $PID -c
# (hit Ctrl-C after a representative sample)
# % time seconds usecs/call calls errors syscall
# ------ ----------- ----------- --------- --------- ----------------
# 62.35 3.214567 1024 3139 recvfrom
# 18.90 0.975123 482 2023 sendto
# 12.20 0.628412 312 2013 1823 read
# ...
# Follow child processes too
sudo strace -f -p $PID
# Only specific syscalls
sudo strace -p $PID -e trace=openat,read,write
sudo strace -p $PID -e trace=network # all network syscalls
sudo strace -p $PID -e trace=%file # all file-related
strace slows the target process significantly (often 10–100×) because ptrace stops the tracee at every syscall entry and exit and bounces control through strace in userspace. In production, avoid leaving strace attached to a latency-critical service for more than a minute or two. For long-term observation use perf trace or eBPF (bpftrace -e 'tracepoint:syscalls:sys_enter_* { ... }'), which use lower-overhead in-kernel mechanisms.
lsof — every file, socket, and pipe
# Every open file for a process
sudo lsof -p $PID | head
# COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
# myapp 123 app cwd DIR 259,0 4096 12 /opt/myapp
# myapp 123 app rtd DIR 259,0 4096 2 /
# myapp 123 app txt REG 259,0 8912345 4567 /opt/myapp/bin/server
# myapp 123 app 0u CHR 1,3 0t0 6 /dev/null
# myapp 123 app 1w REG 259,0 0 54321 /var/log/myapp.log
# myapp 123 app 2w REG 259,0 0 54321 /var/log/myapp.log
# myapp 123 app 3u IPv4 98765 0t0 TCP *:8080 (LISTEN)
# myapp 123 app 4u IPv4 99987 0t0 TCP 10.0.1.5:50124->10.0.99.1:5432 (ESTABLISHED)
# Find all processes using a specific file
sudo lsof /var/log/myapp.log
# Which process is using port 8080?
sudo lsof -iTCP:8080 -sTCP:LISTEN
# Find all processes with a file open on a specific mount (useful when unmount fails)
sudo lsof +D /mnt/foo
# Deleted-but-still-open files (common cause of "disk full but df says space is free")
sudo lsof +L1 # files with link count 0, i.e. deleted but held open
# or the slower, more familiar form:
sudo lsof | grep '(deleted)'
perf — CPU profiling and syscall tracing
When the process is burning CPU but you do not know where.
# Snapshot — what are all CPUs doing right now?
sudo perf top
# Samples: 12K of event 'cycles', Event count (approx.): 3.2G
# Overhead Command Shared Object Symbol
# 15.4% myapp myapp [.] compute_hash
# 11.2% [kernel] [k] _raw_spin_lock
# 8.7% myapp libc-2.31.so [.] malloc
# ...
# Record a 30-second profile of one PID, including kernel stacks
sudo perf record -F 99 -p $PID -g -- sleep 30
# View as interactive TUI
sudo perf report
# Or convert to a flame graph
sudo perf script | \
stackcollapse-perf.pl | \
flamegraph.pl > myapp-flame.svg
# Trace syscalls system-wide (lower overhead than strace)
sudo perf trace -p $PID
Flame graphs are transformative — they turn "60 seconds of CPU samples" into a visual you can navigate in a browser, seeing exactly which function in which call stack is hot. Every engineer should be able to generate and read one.
dmesg — the kernel's complaints
# Everything the kernel has said this boot, with timestamps
dmesg -T | less
# Just recent
dmesg -T | tail -50
# Filter by level (err + above)
dmesg -l err,warn
# Watch live
dmesg -Tw
# Common things to grep for
dmesg -T | grep -iE 'oom|segfault|i/o error|link down|thermal|mce|hung_task'
When a process mysteriously died, or a node "went weird," dmesg is the first place evidence lives. It captures: OOM kills, segfaults with fault addresses, disk I/O errors, NIC link flaps, kernel panics, soft lockups, page allocation failures.
dmesg | grep -i hung_task shows any process that was stuck in uninterruptible sleep (D state) for longer than kernel.hung_task_timeout_secs (default 120). This is the kernel flagging that a process is stuck inside a syscall — usually disk I/O on a misbehaving device or network I/O on a dead connection. It is a signal you almost never see unless you grep for it, and it has saved many 3 AM debugging sessions.
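To see whether the detector is even enabled on a given box, and at what threshold, read the sysctl directly; the file is absent when the kernel was built without hung-task detection, which the sketch below handles with a fallback message:

```shell
# Seconds a task may sit in D state before the kernel logs
# "task ... blocked for more than N seconds"
timeout=$(cat /proc/sys/kernel/hung_task_timeout_secs 2>/dev/null \
          || echo "detector not built in")
echo "hung_task_timeout_secs: $timeout"
# Tighten it for a debug session (root):
#   sudo sysctl kernel.hung_task_timeout_secs=60
```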
Core dumps — postmortem of a crash
When a process crashes with SIGSEGV or SIGABRT, the kernel can write a core dump — a snapshot of its memory you can load in a debugger.
# Is core dumping enabled?
ulimit -c
# 0 means disabled
# Enable (for this shell)
ulimit -c unlimited
# Configure where cores go (system-wide)
cat /proc/sys/kernel/core_pattern
# core <- default: file named "core" in CWD
# Or with systemd-coredump:
# |/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
# With systemd-coredump, list captured cores
coredumpctl list
# TIME PID UID GID SIG COREFILE EXE
# Fri 2026-04-19 10:00:12 12345 1000 1000 SIGSEGV present /opt/myapp/bin/server
# Inspect
coredumpctl info 12345
coredumpctl gdb 12345 # launch gdb on the core
In a debugger:
(gdb) bt # backtrace — show the stack at the moment of death
(gdb) info registers # CPU state
(gdb) p some_variable # inspect variables
eBPF tools — the new frontier
Modern Linux comes with eBPF-based tools (BCC, bpftrace) that let you trace syscalls, file opens, network events, kernel functions — all with minimal overhead. Many distros package them as bcc-tools or bpftrace.
# Count syscalls by pid system-wide for 30 seconds
sudo /usr/share/bcc/tools/syscount -p $PID -d 30
# File opens across the system
sudo /usr/share/bcc/tools/opensnoop
# TCP connect attempts
sudo /usr/share/bcc/tools/tcpconnect
# I/O latency
sudo /usr/share/bcc/tools/biolatency
# Every page fault
sudo bpftrace -e 'tracepoint:exceptions:page_fault_user { @[comm] = count(); }'
These tools have near-zero overhead compared to strace and can stay on in production safely.
Diagnosing Common "Weird" Symptoms
The process is "hung"
# Check state
grep '^State:' /proc/$PID/status
# State: S (sleeping)
# What is it blocked on?
cat /proc/$PID/wchan
# futex_wait_queue_me
# Which syscall?
cat /proc/$PID/syscall
ausyscall $(awk '{print $1}' /proc/$PID/syscall)
# futex
# Per-thread state (the multi-thread view — usually the one you need)
for tid in $(ls /proc/$PID/task); do
echo "=== $tid"
cat /proc/$PID/task/$tid/wchan
echo
done
A single thread with wchan = 0 while all the others wait on futex is the classic lock-convoy picture: the running thread is very likely holding the lock everyone else wants. strace -f -p $PID will show you what the busy thread is doing in syscall terms; perf record -p $PID will give you a flame graph.
The process is "leaking memory"
# Over time, is RSS climbing?
while true; do
awk '/VmRSS/{print strftime("%T"), $2, "kB"}' /proc/$PID/status
sleep 5
done
# Real memory cost (PSS — proportional set size, summed over all mappings)
awk '/^Pss:/ {sum+=$2} END {print sum, "kB"}' /proc/$PID/smaps
# Anonymous memory vs file-backed (fields present since kernel 4.5)
grep -E '^(RssAnon|RssFile|RssShmem):' /proc/$PID/status
# Allocation tracing — what is mallocing?
sudo /usr/share/bcc/tools/memleak -p $PID -a 60
The process is "using too much CPU"
# Per-thread CPU
pidstat -t -p $PID 1 10
# htop with threads shown (press H)
htop
# Sample stacks to see what it is doing
sudo perf record -F 99 -p $PID -g -- sleep 30
sudo perf report
# Or syscall counts
sudo strace -p $PID -c
The process is "using too many file descriptors"
# Current count
ls /proc/$PID/fd | wc -l
# Limit
grep 'Max open files' /proc/$PID/limits
# Top type of fd in use
ls -l /proc/$PID/fd | awk 'NR>1 {print $NF}' | sed 's/\[.*\]//' | sort | uniq -c | sort -rn | head
If socket: dominates and grows, you are leaking sockets. If a specific file dominates, your code is opening the same file many times without closing.
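A single high count could be steady state; a leak is a count that only ever climbs. A quick sampler (the 1-second sleep is just to keep the sketch fast — use 10 s or more in practice, and pass the real PID as the first argument):

```shell
# Print a timestamped fd count a few times; a monotonically climbing number is a leak
PID="${1:-$$}"   # defaults to this shell's own PID for a safe dry run
for i in 1 2 3; do
  printf '%s pid=%s fds=%s\n' "$(date +%T)" "$PID" "$(ls "/proc/$PID/fd" | wc -l)"
  sleep 1
done
```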
The process is "being killed" and you do not know by whom
# OOM kill?
dmesg -T | grep -i 'killed process' | tail
# Signal kill? Attach and watch signal delivery only (trace no syscalls)
sudo strace -p $PID -e trace=none
# Exit code analysis (wait is a shell builtin — works only for children of this shell)
wait $PID; echo $?
# 137 = 128+9 (SIGKILL); 143 = 128+15 (SIGTERM); 139 = 128+11 (SIGSEGV); <128 = the program's own exit code
# systemd tracking
systemctl status $SERVICE
journalctl -u $SERVICE -n 50
# In Kubernetes
kubectl describe pod $POD
# Look at State, Last State, Exit Code
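The 128+signal arithmetic is worth internalizing; as a sketch, it can be wrapped in a tiny helper (decode_exit is a name invented for this example):

```shell
# decode_exit: explain a shell exit status (>128 means killed by signal status-128)
decode_exit() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "exited with code $code"
  fi
}
decode_exit 137   # killed by signal 9 — SIGKILL (often the OOM killer)
decode_exit 143   # killed by signal 15 — SIGTERM (a polite shutdown request)
decode_exit 0     # exited with code 0
```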
Performance got 10× worse "for no reason"
Start with dmesg -T | tail -100. Kernel-level events — disk flipping to a read-only mount, NIC dropping to 100 Mbps, CPU thermal throttling, a memory bank going bad — all show up here before they show up in application metrics. Most "inexplicable slowdowns" have a dmesg line pinned to the exact time they started.
A Real Debugging Session
Here is how the techniques stack in practice, using a constructed (but realistic) scenario:
Symptom: a Python service is pegged at one CPU core for 30 minutes.
- Check state:
grep State /proc/$PID/status
# State: R (running)
- Check threads — which one is running?
for tid in $(ls /proc/$PID/task); do
echo "=== $tid $(cat /proc/$PID/task/$tid/wchan)"
done
# === 123 futex_wait_queue_me
# === 124 futex_wait_queue_me
# === 125 0 <- this one is running
- What is TID 125 doing?
sudo strace -p 125
# Shows almost no syscalls — expected if the thread is busy in Python-level hot code
- Sample with perf:
sudo perf record -F 99 -t 125 -g -- sleep 10
sudo perf report
# 95% in the Python interpreter loop, 80% of that under regex compilation
- Conclude: one thread is spinning in regex compilation — probably a hot path that should cache compiled regexes. Fix the code, not the infrastructure.
Total time: under five minutes. No log lines, no restart, no production downtime — just the tools reading live state.
A team had a Python service that "randomly" consumed 100% CPU about once an hour. No pattern visible in logs, no correlation with traffic, no clue in dashboards. Thirty seconds of perf record -p $PID during an incident showed 90% of samples were in a cryptographic function calling os.urandom — which was itself blocking on /dev/urandom because the VM had a broken virtio-rng device and the kernel's entropy pool was struggling. The fix was one rngd installation and a configuration change on the host. Without perf, the team had considered every theory except the right one for weeks.
Things That Will Save You Hours
- Install strace, perf, bcc-tools (or bpftrace), tcpdump, and lsof in every production image. Yes, even minimal ones. The 20 MB you save is not worth the hour you spend shelling into a running container and fighting package installs during an incident.
- Turn on core_pattern=|/lib/systemd/systemd-coredump and set MaxUse= in /etc/systemd/coredump.conf. Cores are automatically collected and rotated; coredumpctl lets you inspect any crash from any service.
- Configure kernel.hung_task_timeout_secs and alert on hung-task warnings. These are the quietest kernel-level signal of stuck processes — and they mean something is badly wrong.
- Enable persistent journald (see Module 4 Lesson 3). You cannot debug what you can no longer read.
- Pre-write a triage script (see Module 6 Lesson 1). When an incident starts, run it first — the 60-second snapshot often tells the whole story.
- Learn to read flame graphs. One afternoon of practice with perf record + FlameGraph, and every future CPU mystery has a visual answer.
Key Concepts Summary
- Gather evidence before you restart. Every restart destroys debugging data.
- /proc/[pid]/status, syscall, wchan, and fd/ form the four-file starter kit. 90% of "weird process" cases are solved by reading them.
- strace shows every syscall. Great for short investigations; too heavy for long production use.
- perf record + flame graph for CPU profiling. The visualization changes how you think about hot code.
- lsof for "what is this process holding open?" and "who has this file/port?"
- dmesg is the kernel's side of the story. Check it on every weird incident; it often pins the exact moment something broke.
- coredumpctl captures crashes postmortem. Enable it; it costs nothing when nothing crashes.
- eBPF tools (bcc, bpftrace) are the low-overhead replacement for strace. Learn one — you will reach for it constantly once you know it.
- Per-thread state matters. /proc/[pid]/task/*/wchan decomposes a multi-threaded process into which thread is doing what.
- The exit code tells the tale. 137 = SIGKILL (likely OOM), 143 = SIGTERM, 139 = SIGSEGV. Not always obvious, but always meaningful.
Common Mistakes
- Restarting before gathering any state. The next incident happens for the same reason and you will still not know why.
- Reading application logs in a loop when the process is stuck — the app is not in a logging path, so the logs cannot help. strace or /proc/[pid]/syscall will.
- Running strace on a production service for hours. It slows the process materially; leave it attached only briefly.
- Ignoring dmesg because "kernel stuff does not apply to us." Almost every infrastructure problem has a dmesg line.
- Assuming a process that "stopped doing work" has crashed. Check State: — it is probably in D or S state, blocked on something concrete.
- Running perf record without -g and then wondering why the profile shows just function names without context. -g gets you stack traces — always use it.
- Treating core dumps as scary. They are snapshots of memory; gdb opens them like a session on the dead process. A 5-minute investment in reading a bt is worth hours of log-chasing.
- Forgetting that thread TIDs are the PIDs of individual threads, readable in /proc/$PID/task/. Multi-threaded code's bugs usually live at the thread level.
- Trying to use strace -p on a PID protected by Yama ptrace restrictions (hardened kernels). Fallback: check sudo sysctl kernel.yama.ptrace_scope — set it to 0 temporarily during a debug session if you accept the risk.
A Python service hangs. State in /proc/$PID/status is `D (uninterruptible sleep)`. wchan shows `io_schedule`. Its syscall is `read` from fd 7, which ls -l /proc/$PID/fd shows is a socket to a remote NFS server. You have confirmed the NFS server is unreachable (the network link went down 10 minutes ago). Why won't kill -9 terminate the process, and what is the only way to recover it?