When Something Is Weird
A batch job that has run cleanly every night for two years suddenly hangs at exactly the same step each run. No error. No crash. No useful log line — it just sits there. The team's first instinct is to read the application logs, then the database logs, then the queue logs. Nothing. Three hours in, someone finally runs
cat /proc/$PID/syscall and sees the process has been inside recvfrom on a specific fd for 40 minutes. ls -l /proc/$PID/fd/FD reveals it is a socket to a server that was decommissioned last week but not removed from config. The process is patiently waiting for a reply that will never come.
The hardest production problems are the ones with no stack trace, no error message, no clear failure — a process is "stuck" or "slow" or "using too much" of something, and logging does not help because the code is not in a path that logs anything. Linux gives you live X-ray vision into running processes through /proc, strace, lsof, perf, dmesg, and a handful of newer eBPF tools. This lesson is the mindset and the toolkit: given a weird process, how do you peel back the layers without restarting, without adding logs, without waiting for a redeploy?
The Mindset: Observe, Don't Restart
The reflex "restart and see if it comes back" is the worst debugging habit in production. It destroys the evidence. Every time a process misbehaves, there is information in its state: which syscall it is in, which files it has open, which threads are doing what, what its memory looks like, what the kernel last said about it. Restarting throws that away and guarantees the same problem will happen again.
The rule: gather evidence first, restart only as a last resort. The tools below do not require changes to the running process, do not require you to install anything exotic, and on most systems do not even require stopping or attaching a debugger.
Every weird-process debug session follows the same script: (1) read /proc/$PID/status for state, (2) read /proc/$PID/wchan and /proc/$PID/syscall for what the kernel thinks it is doing, (3) read /proc/$PID/fd and ls it for what resources it has open, (4) trace with strace or perf for live behavior, (5) check dmesg for kernel complaints about it. That five-step sequence resolves 90% of "something is weird" incidents. The skill is knowing the sequence; the rest is interpretation.
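The five-step sequence can be sketched as a small script. This is an illustrative sketch, not a canonical tool: the filename triage.sh is made up, it defaults to the current shell's PID so it is safe to try, and steps 4 and 5 are left as comments because they need root.

```shell
#!/bin/sh
# triage.sh — sketch of the five-step sequence (steps 1-3 live; 4-5 as comments)
PID="${1:-$$}"   # default to this shell's own PID so the script is safe to run as-is

echo "== 1. identity + state =="
grep -E '^(Name|State|Threads|VmRSS):' "/proc/$PID/status"

echo "== 2. what the kernel thinks it is doing =="
cat "/proc/$PID/wchan" 2>/dev/null; echo
cat "/proc/$PID/syscall" 2>/dev/null || echo "(need same-user or root to read syscall)"

echo "== 3. open file descriptors =="
ls -l "/proc/$PID/fd" 2>/dev/null | head -20

# == 4. live behavior (root) ==   sudo strace -p "$PID" -c    # Ctrl-C after a sample
# == 5. kernel complaints ==      dmesg -T | tail -50
```

Run it first in every incident; the snapshot is cheap and destroys nothing.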
The Toolkit
/proc/[pid]/* — live process X-ray (Module 2 Lesson 3 recap)
Always the first stop.
# Identity + state
grep -E '^(Name|State|Pid|PPid|Threads|VmRSS|SigQ|SigBlk|SigCgt):' /proc/$PID/status
# What syscall is it in right now?
cat /proc/$PID/syscall
# 0 0x3 0x7fff... 0x2000 0 0 0 0x7fff... 0x7ff...
# First number = syscall number. Look it up (ausyscall ships with the audit package):
ausyscall $(awk '{print $1}' /proc/$PID/syscall)
# What kernel function did it block in?
cat /proc/$PID/wchan
# futex_wait_queue_me <- waiting on a mutex/condvar
# sk_wait_data <- waiting on socket data
# do_epoll_wait <- event loop idle
# io_schedule <- waiting on disk I/O
# Every file descriptor
ls -l /proc/$PID/fd | head
These four files — status, syscall, wchan, fd/ — tell you more about a weird process than any amount of code reading.
strace — see every syscall
# Attach to a running process and show every syscall
sudo strace -p $PID
# With timestamps and time-spent-in-each
sudo strace -p $PID -tt -T
# Summary: count and time per syscall
sudo strace -p $PID -c
# (hit Ctrl-C after a representative sample)
# % time seconds usecs/call calls errors syscall
# ------ ----------- ----------- --------- --------- ----------------
# 62.35 3.214567 1024 3139 recvfrom
# 18.90 0.975123 482 2023 sendto
# 12.20 0.628412 312 2013 1823 read
# ...
# Follow child processes too
sudo strace -f -p $PID
# Only specific syscalls
sudo strace -p $PID -e trace=openat,read,write
sudo strace -p $PID -e trace=network # all network syscalls
sudo strace -p $PID -e trace=%file # all file-related
strace slows the target process significantly (often 10–100×) because ptrace stops the tracee at every syscall entry and exit and bounces control through strace in userspace. In production, avoid leaving strace attached to a latency-critical service for more than a minute or two. For long-term observation use perf trace or eBPF (bpftrace -e 'tracepoint:syscalls:sys_enter_* { ... }'), which use lower-overhead in-kernel mechanisms.
lsof — every file, socket, and pipe
# Every open file for a process
sudo lsof -p $PID | head
# COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
# myapp 123 app cwd DIR 259,0 4096 12 /opt/myapp
# myapp 123 app rtd DIR 259,0 4096 2 /
# myapp 123 app txt REG 259,0 8912345 4567 /opt/myapp/bin/server
# myapp 123 app 0u CHR 1,3 0t0 6 /dev/null
# myapp 123 app 1w REG 259,0 0 54321 /var/log/myapp.log
# myapp 123 app 2w REG 259,0 0 54321 /var/log/myapp.log
# myapp 123 app 3u IPv4 98765 0t0 TCP *:8080 (LISTEN)
# myapp 123 app 4u IPv4 99987 0t0 TCP 10.0.1.5:50124->10.0.99.1:5432 (ESTABLISHED)
# Find all processes using a specific file
sudo lsof /var/log/myapp.log
# Which process is using port 8080?
sudo lsof -iTCP:8080 -sTCP:LISTEN
# Find all processes with a file open on a specific mount (useful when unmount fails)
sudo lsof +D /mnt/foo
# Deleted-but-still-open files (common cause of "disk full but df says space is free")
sudo lsof +L1 # files with link count 0, i.e. deleted but held open
# or the slower, more familiar form:
sudo lsof | grep '(deleted)'
perf — CPU profiling and syscall tracing
When the process is burning CPU but you do not know where.
# Snapshot — what are all CPUs doing right now?
sudo perf top
# Samples: 12K of event 'cycles', Event count (approx.): 3.2G
# Overhead Command Shared Object Symbol
# 15.4% myapp myapp [.] compute_hash
# 11.2% [kernel] [k] _raw_spin_lock
# 8.7% myapp libc-2.31.so [.] malloc
# ...
# Record a 30-second profile of one PID, including kernel stacks
sudo perf record -F 99 -p $PID -g -- sleep 30
# View as interactive TUI
sudo perf report
# Or convert to a flame graph
sudo perf script | \
stackcollapse-perf.pl | \
flamegraph.pl > myapp-flame.svg
# Trace syscalls system-wide (lower overhead than strace)
sudo perf trace -p $PID
Flame graphs are transformative — they turn "60 seconds of CPU samples" into a visual you can navigate in a browser, seeing exactly which function in which call stack is hot. Every engineer should be able to generate and read one.
dmesg — the kernel's complaints
# Everything the kernel has said this boot, with timestamps
dmesg -T | less
# Just recent
dmesg -T | tail -50
# Filter by level (err + above)
dmesg -l err,warn
# Watch live
dmesg -Tw
# Common things to grep for
dmesg -T | grep -iE 'oom|segfault|i/o error|link down|thermal|mce|hung_task'
When a process mysteriously died, or a node "went weird," dmesg is the first place evidence lives. It captures: OOM kills, segfaults with fault addresses, disk I/O errors, NIC link flaps, kernel panics, soft lockups, page allocation failures.
dmesg | grep -i hung_task shows any process that was stuck in uninterruptible sleep (D state) for longer than kernel.hung_task_timeout_secs (default 120). This is the kernel flagging that a process is stuck inside a syscall — usually disk I/O on a misbehaving device or network I/O on a dead connection. It is a signal you almost never see unless you grep for it, and it has saved many 3 AM debugging sessions.
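To see whether the detector is even enabled on a given box, and at what threshold, read the sysctl directly; the file is absent when the kernel was built without hung-task detection, which the sketch below handles with a fallback message:

```shell
# Seconds a task may sit in D state before the kernel logs
# "task ... blocked for more than N seconds"
timeout=$(cat /proc/sys/kernel/hung_task_timeout_secs 2>/dev/null \
          || echo "detector not built in")
echo "hung_task_timeout_secs: $timeout"
# Tighten it for a debug session (root):
#   sudo sysctl kernel.hung_task_timeout_secs=60
```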
Core dumps — postmortem of a crash
When a process crashes with SIGSEGV or SIGABRT, the kernel can write a core dump — a snapshot of its memory you can load in a debugger.
# Is core dumping enabled?
ulimit -c
# 0 means disabled
# Enable (for this shell)
ulimit -c unlimited
# Configure where cores go (system-wide)
cat /proc/sys/kernel/core_pattern
# core <- default: file named "core" in CWD
# Or with systemd-coredump:
# |/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
# With systemd-coredump, list captured cores
coredumpctl list
# TIME PID UID GID SIG COREFILE EXE
# Fri 2026-04-19 10:00:12 12345 1000 1000 SIGSEGV present /opt/myapp/bin/server
# Inspect
coredumpctl info 12345
coredumpctl gdb 12345 # launch gdb on the core
In a debugger:
(gdb) bt # backtrace — show the stack at the moment of death
(gdb) info registers # CPU state
(gdb) p some_variable # inspect variables
eBPF tools — the new frontier
Modern Linux comes with eBPF-based tools (BCC, bpftrace) that let you trace syscalls, file opens, network events, kernel functions — all with minimal overhead. Many distros package them as bcc-tools or bpftrace.
# Count syscalls by pid system-wide for 30 seconds
sudo /usr/share/bcc/tools/syscount -p $PID -d 30
# File opens across the system
sudo /usr/share/bcc/tools/opensnoop
# TCP connect attempts
sudo /usr/share/bcc/tools/tcpconnect
# I/O latency
sudo /usr/share/bcc/tools/biolatency
# Every page fault
sudo bpftrace -e 'tracepoint:exceptions:page_fault_user { @[comm] = count(); }'
These tools have near-zero overhead compared to strace and can stay on in production safely.
Diagnosing Common "Weird" Symptoms
The process is "hung"
# Check state
grep '^State:' /proc/$PID/status
# State: S (sleeping)
# What is it blocked on?
cat /proc/$PID/wchan
# futex_wait_queue_me
# Which syscall?
cat /proc/$PID/syscall
ausyscall $(awk '{print $1}' /proc/$PID/syscall)
# futex
# Per-thread state (the multi-thread view — usually the one you need)
for tid in $(ls /proc/$PID/task); do
echo "=== $tid"
cat /proc/$PID/task/$tid/wchan
echo
done
A single thread with wchan = 0 while all the others wait on futex is the classic lock-convoy picture: the running thread is very likely holding the lock everyone else wants. strace -f -p $PID will show you what the busy thread is doing in syscall terms; perf record -p $PID will give you a flame graph.
The process is "leaking memory"
# Over time, is RSS climbing?
while true; do
awk '/VmRSS/{print strftime("%T"), $2, "kB"}' /proc/$PID/status
sleep 5
done
# Real memory cost (PSS — proportional set size, summed over all mappings)
awk '/^Pss:/ {sum+=$2} END {print sum, "kB"}' /proc/$PID/smaps
# Anonymous memory vs file-backed (fields present since kernel 4.5)
grep -E '^(RssAnon|RssFile|RssShmem):' /proc/$PID/status
# Allocation tracing — what is mallocing?
sudo /usr/share/bcc/tools/memleak -p $PID -a 60
The process is "using too much CPU"
# Per-thread CPU
pidstat -t -p $PID 1 10
# htop with threads shown (press H)
htop
# Sample stacks to see what it is doing
sudo perf record -F 99 -p $PID -g -- sleep 30
sudo perf report
# Or syscall counts
sudo strace -p $PID -c
The process is "using too many file descriptors"
# Current count
ls /proc/$PID/fd | wc -l
# Limit
grep 'Max open files' /proc/$PID/limits
# Top type of fd in use
ls -l /proc/$PID/fd | awk 'NR>1 {print $NF}' | sed 's/\[.*\]//' | sort | uniq -c | sort -rn | head
If socket: dominates and grows, you are leaking sockets. If a specific file dominates, your code is opening the same file many times without closing.
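A single high count could be steady state; a leak is a count that only ever climbs. A quick sampler (the 1-second sleep is just to keep the sketch fast — use 10 s or more in practice, and pass the real PID as the first argument):

```shell
# Print a timestamped fd count a few times; a monotonically climbing number is a leak
PID="${1:-$$}"   # defaults to this shell's own PID for a safe dry run
for i in 1 2 3; do
  printf '%s pid=%s fds=%s\n' "$(date +%T)" "$PID" "$(ls "/proc/$PID/fd" | wc -l)"
  sleep 1
done
```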
The process is "being killed" and you do not know by whom
# OOM kill?
dmesg -T | grep -i 'killed process' | tail
# Signal kill? Attach and watch signal delivery only (trace no syscalls)
sudo strace -p $PID -e trace=none
# Exit code analysis (wait is a shell builtin — works only for children of this shell)
wait $PID; echo $?
# 137 = 128+9 (SIGKILL); 143 = 128+15 (SIGTERM); 139 = 128+11 (SIGSEGV); <128 = the program's own exit code
# systemd tracking
systemctl status $SERVICE
journalctl -u $SERVICE -n 50
# In Kubernetes
kubectl describe pod $POD
# Look at State, Last State, Exit Code
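The 128+signal arithmetic is worth internalizing; as a sketch, it can be wrapped in a tiny helper (decode_exit is a name invented for this example):

```shell
# decode_exit: explain a shell exit status (>128 means killed by signal status-128)
decode_exit() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "exited with code $code"
  fi
}
decode_exit 137   # killed by signal 9 — SIGKILL (often the OOM killer)
decode_exit 143   # killed by signal 15 — SIGTERM (a polite shutdown request)
decode_exit 0     # exited with code 0
```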
Performance got 10× worse "for no reason"
Start with dmesg -T | tail -100. Kernel-level events — disk flipping to a read-only mount, NIC dropping to 100 Mbps, CPU thermal throttling, a memory bank going bad — all show up here before they show up in application metrics. Most "inexplicable slowdowns" have a dmesg line pinned to the exact time they started.
A Real Debugging Session
Here is how the techniques stack in practice, using a constructed (but realistic) scenario:
Symptom: a Python service is pegged at one CPU core for 30 minutes.
- Check state:
grep State /proc/$PID/status
# State: R (running)
- Check threads — which one is running?
for tid in $(ls /proc/$PID/task); do
echo "=== $tid $(cat /proc/$PID/task/$tid/wchan)"
done
# === 123 futex_wait_queue_me
# === 124 futex_wait_queue_me
# === 125 0 <- this one is running
- What is TID 125 doing?
sudo strace -p 125
# Shows almost no syscalls — expected if the thread is busy in Python-level hot code
- Sample with perf:
sudo perf record -F 99 -t 125 -g -- sleep 10
sudo perf report
# 95% in the Python interpreter loop, 80% of that under regex compilation
- Conclude: one thread is spinning in regex compilation — probably a hot path that should cache compiled regexes. Fix the code, not the infrastructure.
Total time: under five minutes. No log lines, no restart, no production downtime — just the tools reading live state.
A team had a Python service that "randomly" consumed 100% CPU about once an hour. No pattern visible in logs, no correlation with traffic, no clue in dashboards. Thirty seconds of perf record -p $PID during an incident showed 90% of samples were in a cryptographic function calling os.urandom — which was itself blocking on /dev/urandom because the VM had a broken virtio-rng device and the kernel's entropy pool was struggling. The fix was one rngd installation and a configuration change on the host. Without perf, the team had considered every theory except the right one for weeks.
Things That Will Save You Hours
- Install strace, perf, bcc-tools (or bpftrace), tcpdump, and lsof in every production image. Yes, even minimal ones. The 20 MB you save is not worth the hour you spend shelling into a running container and fighting package installs during an incident.
- Turn on core_pattern=|/lib/systemd/systemd-coredump and set MaxUse= in /etc/systemd/coredump.conf. Cores are automatically collected and rotated; coredumpctl lets you inspect any crash from any service.
- Configure kernel.hung_task_timeout_secs and alert on hung-task warnings. These are the quietest kernel-level signal of stuck processes — and they mean something is badly wrong.
- Enable persistent journald (see Module 4 Lesson 3). You cannot debug what you can no longer read.
- Pre-write a triage script (see Module 6 Lesson 1). When an incident starts, run it first — the 60-second snapshot often tells the whole story.
- Learn to read flame graphs. One afternoon of practice with perf record + FlameGraph, and every future CPU mystery has a visual answer.
Key Concepts Summary
- Gather evidence before you restart. Every restart destroys debugging data.
- /proc/[pid]/status, syscall, wchan, and fd/ form the four-file starter kit. 90% of "weird process" cases are solved by reading them.
- strace shows every syscall. Great for short investigations; too heavy for long production use.
- perf record + flame graph for CPU profiling. The visualization changes how you think about hot code.
- lsof for "what is this process holding open?" and "who has this file/port?"
- dmesg is the kernel's side of the story. Check it on every weird incident; it often pins the exact moment something broke.
- coredumpctl captures crashes postmortem. Enable it; it costs nothing when nothing crashes.
- eBPF tools (bcc, bpftrace) are the low-overhead replacement for strace. Learn one — you will reach for it constantly once you know it.
- Per-thread state matters. /proc/[pid]/task/*/wchan decomposes a multi-threaded process into which thread is doing what.
- The exit code tells the tale. 137 = SIGKILL (likely OOM), 143 = SIGTERM, 139 = SIGSEGV. Not always obvious, but always meaningful.
Common Mistakes
- Restarting before gathering any state. The next incident happens for the same reason and you will still not know why.
- Reading application logs in a loop when the process is stuck — the app is not in a logging path, so the logs cannot help. strace or /proc/[pid]/syscall will.
- Running strace on a production service for hours. It slows the process materially; leave it attached only briefly.
- Ignoring dmesg because "kernel stuff does not apply to us." Almost every infrastructure problem has a dmesg line.
- Assuming a process that "stopped doing work" has crashed. Check State: — it is probably in D or S state, blocked on something concrete.
- Running perf record without -g and then wondering why the profile shows just function names without context. -g gets you stack traces — always use it.
- Treating core dumps as scary. They are snapshots of memory; gdb opens them like a session on the dead process. A 5-minute investment in reading a bt is worth hours of log-chasing.
- Forgetting that thread TIDs are the PIDs of individual threads, readable in /proc/$PID/task/. Multi-threaded code's bugs usually live at the thread level.
- Trying to use strace -p on a PID protected by Yama ptrace restrictions (hardened kernels). Fallback: check sudo sysctl kernel.yama.ptrace_scope — set it to 0 temporarily during a debug session if you accept the risk.
A Python service hangs. State in /proc/$PID/status is `D (uninterruptible sleep)`. wchan shows `io_schedule`. Its syscall is `read` from fd 7, which ls -l /proc/$PID/fd shows is a socket to a remote NFS server. You have confirmed the NFS server is unreachable (the network link went down 10 minutes ago). Why won't kill -9 terminate the process, and what is the only way to recover it?