Performance Triage (The USE Method)
A production API's p99 latency just doubled. Ten engineers are in a war-room chat. Someone shares a CPU graph — "CPU looks fine." Someone else shares a memory graph — "memory looks fine." The APM says some requests are slow, others are fast. Nobody knows where to look next. An hour of staring at dashboards produces no hypothesis. Eventually a senior engineer ssh's into one affected node and runs four commands —
uptime, vmstat 1 5, iostat -x 1 3, and sar -n DEV 1 3 — and says "the disk is saturated, see that 12ms await and 100% util on nvme1n1? That is where to look." Two minutes of terminal work outperformed an hour of dashboards.
Linux gives you better performance diagnostics than any monitoring tool — if you know which command to run for which resource. Brendan Gregg's USE method is the checklist: for every resource, measure Utilization, Saturation, and Errors. Work through the list in under five minutes and you know where the bottleneck is. This lesson is the USE method as a production triage recipe, with the specific commands for each resource on Linux.
What USE Is
USE: Utilization, Saturation, Errors. For every resource in the system, check these three:
- Utilization — what percentage of time the resource was busy doing work.
- Saturation — how much extra demand is queued up for the resource (work waiting to be done).
- Errors — how many errors the resource has reported.
The insight: a resource is a bottleneck when its utilization is high and there is saturation. Utilization alone is not enough — a CPU at 100% but no queue is just doing its job. A CPU at 50% with a queue of waiting threads is a dispatcher problem. Saturation is the signal that work is waiting.
The method is exhaustive by design: you list every resource, tick each box, and move on. It stops you chasing the wrong thing and hands you the shortest path to the right one.
USE is not a tool — it is a checklist. Apply it to every resource your system has: CPUs, memory, disks, network interfaces, interconnects. The engineer who memorizes "utilization and saturation for every resource" and knows the Linux command for each has a 5-minute diagnostic that beats any dashboard on unfamiliar systems.
The Resources on a Linux Box
For a typical production server, the resources you check are:
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | %CPU in top, vmstat us/sy/wa/id | run queue length (vmstat r), load avg | perf stat — rare |
| Memory | used/total in free | swap used, vmstat si/so, PSI memory | OOM kills in dmesg |
| Disks | %util per device (with caveats) | await, aqu-sz | read/write errors in dmesg, SMART |
| Network | %rxbw / %txbw vs NIC max | backlog queues, retransmits | netstat -s, interface errors |
| File descriptors | /proc/sys/fs/file-nr | n/a (FDs do not queue) | EMFILE errors |
| Kernel socket buffers | workload-dependent | netstat -s overruns/pruning | netstat -s drop counters |
Everything below is the specific command per cell.
The 60-Second Triage
Brendan Gregg's own recommendation for the first minute on a slow Linux box:
uptime
dmesg -T | tail
vmstat -SM 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top
Let us walk through the ones that matter most for USE.
CPU
Utilization
# Overall — user/system/iowait/idle per CPU, updated every second
mpstat -P ALL 1 3
# 10:00:01 CPU %usr %nice %sys %iowait %irq %soft %idle
# 10:00:02 all 25.5 0.0 3.5 12.0 0.0 0.5 58.5
# 10:00:02 0 40.0 0.0 5.0 20.0 0.0 1.0 34.0
# 10:00:02 1 10.0 0.0 2.0 5.0 0.0 0.0 83.0
# ...
# One CPU at 90%+ while others idle = single-threaded bottleneck
# Quicker, classic
top # then press '1' to spread out per-CPU
htop # prettier, same info; press 'F2' to customize columns
# Per-process
pidstat -u 1 3
# 10:00:01 UID PID %usr %system %guest %CPU CPU Command
# 10:00:02 1000 1234 25.0 3.0 0.0 28.0 2 myapp
Saturation
CPU saturation = threads waiting to run. Two ways to measure:
# The run queue length from vmstat — column 'r'
vmstat -SM 1 5
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
#  r  b  swpd  free  buff  cache   si   so    bi    bo   in   cs us sy id wa st
#  4  0     0  4096   120  14200    0    0     3    12  100  230 20  5 70  5  0
#  ^ runnable threads
# r > number of CPUs consistently = CPU saturation
# Pressure Stall Information — the modern answer
cat /proc/pressure/cpu
# some avg10=12.34 avg60=8.90 avg300=5.67 total=123456789
# CPU PSI measures the % of time *something* was stalled waiting on CPU
# "some avg10=12.34" = 12% of the last 10s had threads waiting for CPU
PSI (Pressure Stall Information) is the single best saturation signal Linux exposes — it is per-resource, normalized, and available for CPU, memory, and I/O. Monitor it.
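PSI files are plain text, so a few lines of shell turn them into an alert. A minimal sketch, assuming /proc/pressure exists (Linux 4.20+ with PSI enabled); the function names and the 10% threshold are illustrative starting points, not a standard tool or a kernel-defined limit:

```shell
# psi_avg10 FILE — print the "some avg10" percentage from a PSI file.
psi_avg10() {
    awk -F'avg10=' '/^some/ {split($2, a, " "); print a[1]}' "$1"
}

# psi_alert FILE — flag a resource whose 10-second pressure exceeds 10%.
# Shells cannot compare floats, so awk does the comparison.
psi_alert() {
    v=$(psi_avg10 "$1")
    if awk -v v="$v" 'BEGIN {exit !(v + 0 > 10)}'; then
        echo "PRESSURE $1: some avg10=$v"
    else
        echo "ok $1: some avg10=$v"
    fi
}

# Live usage: psi_alert /proc/pressure/cpu
```

The file argument makes the helpers testable against a saved copy of a PSI file as well as the live /proc files.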
Errors
CPU errors are rare but serious:
# Hardware errors
dmesg -T | grep -i 'mce\|machine check\|cpu thermal'
# MCE daemon logs (if installed)
sudo cat /var/log/mcelog 2>/dev/null
If your CPU is reporting MCEs, replace the hardware.
Memory
Utilization
# The easy one
free -h
# total used free shared buff/cache available
# Mem: 32Gi 14Gi 2Gi 1Gi 16Gi 17Gi
# Swap: 4.0Gi 0B 4.0Gi
# "used" excludes reclaimable cache. "available" is the number that matters.
cat /proc/meminfo | head
# MemTotal: 32893400 kB
# MemFree: 1823440 kB
# MemAvailable: 18435212 kB <- this is what you monitor
# Buffers: 120456 kB
# Cached: 14280432 kB
# ...
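Since MemAvailable is the number to watch, it is worth computing as a percentage of total. A minimal sketch; mem_available_pct is an illustrative name, and the optional file argument exists so it can be tested against a saved copy of /proc/meminfo:

```shell
# mem_available_pct [FILE] — MemAvailable as a percentage of MemTotal.
# FILE defaults to the live /proc/meminfo.
mem_available_pct() {
    awk '/^MemTotal:/ {t = $2} /^MemAvailable:/ {a = $2}
         END {printf "%.1f\n", a * 100 / t}' "${1:-/proc/meminfo}"
}

# Usage: alert when less than, say, 10% is available — a starting
# threshold to tune for your workload, not a universal rule:
# [ "$(mem_available_pct | cut -d. -f1)" -lt 10 ] && echo "memory pressure"
```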
Saturation
Memory saturation means the system is straining to free memory — swapping, reclaiming, or OOM-killing.
# Are we swapping? Watch the si/so columns (swap-in / swap-out rates)
vmstat -SM 1 5
# si so
# 0 0 <- no swap activity = good
# 12 48 <- swapping in and out = bad
# Per-process swap usage — more detail
for pid in $(ps -eo pid --no-headers); do
swap=$(awk '/^VmSwap:/{print $2}' /proc/$pid/status 2>/dev/null)
[ -n "$swap" ] && [ "$swap" -gt 0 ] && echo "$swap kB $pid $(ps -o comm= -p $pid)"
done | sort -n | tail
# Memory PSI
cat /proc/pressure/memory
# some avg10=0.12 avg60=0.34 avg300=0.56 total=12345
# full avg10=0.00 avg60=0.01 avg300=0.03 total=123
# "full" > 0 = every process was stalled; very bad sign
Errors
# OOM-killer has been busy?
dmesg -T | grep -i 'killed process\|oom-killer\|out of memory' | tail
# [Fri Apr 19 09:00:12 2026] Out of memory: Kill process 12345 (myapp) score 890 or ...
# [Fri Apr 19 09:00:12 2026] Killed process 12345 (myapp), UID 1000, total-vm:16000000kB, anon-rss:8000000kB
# Per-cgroup OOM count
grep -H '^oom_kill' /sys/fs/cgroup/*/memory.events 2>/dev/null
Disks
Utilization
# The production workhorse
iostat -xz 1 3
# Device r/s w/s rkB/s wkB/s ... r_await w_await aqu-sz %util
# nvme0n1 241.0 38.0 12032 4864 0.28 0.75 0.09 16.4
# nvme1n1 1203.0 421.0 90240 16384 4.21 12.34 5.80 94.7
# ^^^^^^^^^^^^^^^^^ ^^^^^ ^^^^
# saturation queue util
# Watch %util and await:
# - %util high + await high = saturation
# - %util high + await low = busy but fine
# - %util high on NVMe: misleading (see Module 3 Lesson 3)
Saturation
aqu-sz (average queue size; renamed from avgqu-sz in newer sysstat) is the direct saturation metric for disks. await (average request latency) combines service time and queue time, so rising await while r/s + w/s stays flat or falls means requests are queuing rather than doing more work.
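A few lines of awk can pull the queuing devices out of iostat's extended output. A sketch written against the column layout shown earlier (aqu-sz second-to-last, %util last, as in sysstat 12.x) — verify against your iostat version before trusting it, and note the queue-depth cutoff of 1 is a starting point, not a universal rule:

```shell
# disk_saturated [FILE] — print devices whose average queue depth
# (aqu-sz, assumed second-to-last column) exceeds 1. Reads stdin
# when no file is given, so it can sit at the end of a pipeline.
disk_saturated() {
    awk 'NF == 0 {hdr = 0}              # blank line ends a device section
         /^Device/ {hdr = 1; next}      # header row starts one
         hdr && NF > 2 {
             aqu = $(NF - 1)            # second-to-last column: aqu-sz
             if (aqu + 0 > 1) print $1, "aqu-sz=" aqu
         }' "$@"
}

# Usage: iostat -xz 1 2 | disk_saturated
```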
# I/O Pressure Stall Info
cat /proc/pressure/io
# some avg10=8.23 avg60=4.56 avg300=2.34 total=1234567
# full avg10=1.23 avg60=0.56 avg300=0.23 total=12345
# Per-process I/O stats
pidstat -d 1 3
# 10:00:01 UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
# 10:00:02 1000 1234 11200.0 8300.0 0.00 12 postgres
sudo iotop -oPa # interactive
Errors
# I/O errors on block devices
dmesg -T | grep -Ei 'i/o error|blk_update_request|ata.*error|nvme.*error'
# SMART status — predicts drive failure
sudo smartctl -H /dev/nvme0 # SMART overall-health self-assessment test result: PASSED
sudo smartctl -a /dev/nvme0 | head -20
Network
Utilization
# Per-interface bandwidth
sar -n DEV 1 5
# Time IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
# 10:00 eth0 12000 8500 12500 8200 0 0 1 6.40
# 10:00 lo 400 400 120 120 0 0 0 0.00
# Also works: nload, iftop, bmon for interactive
%ifutil is sar computing utilization relative to the link's declared speed (what ethtool eth0 reports). For high-speed NICs, bonded links, or virtualized interfaces the declared speed may be wrong or missing, so double-check with a throughput benchmark.
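When you do not trust the declared speed, compute utilization yourself from the raw byte counters. A sketch; nic_util_pct is an illustrative helper, and eth0 in the usage comment is an assumed interface name:

```shell
# nic_util_pct BYTES_BEFORE BYTES_AFTER SECONDS SPEED_MBPS — percent of
# link capacity used, from two samples of a byte counter.
nic_util_pct() {
    awk -v b0="$1" -v b1="$2" -v s="$3" -v mbps="$4" \
        'BEGIN {printf "%.1f\n", (b1 - b0) * 8 / s / (mbps * 1000000) * 100}'
}

# Live usage (eth0 assumed; /sys/class/net/<if>/speed can itself be
# wrong on virtual NICs — that is why the arithmetic is explicit):
# b0=$(cat /sys/class/net/eth0/statistics/rx_bytes); sleep 5
# b1=$(cat /sys/class/net/eth0/statistics/rx_bytes)
# nic_util_pct "$b0" "$b1" 5 "$(cat /sys/class/net/eth0/speed)"
```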
Saturation
Network saturation shows as backlog queues, retransmits, and dropped packets:
# TCP-level saturation
sar -n TCP,ETCP 1 5
# TCP active/s passive/s iseg/s oseg/s
# 1.0 0.5 120 150
# ETCP atmptf/s estres/s retrans/s isegerr/s orsts/s
# 0.0 0.0 5.0 0.0 0.5
# ^^^ retransmits = network or receiver problem
# Detailed counters
netstat -s | head -30
# Tcp:
# 1234 active connections openings
# 5678 passive connection openings
# 12 failed connection attempts
# 3 connection resets received <- resets = trouble
# 150 connections established
# 42000 segments received
# 38500 segments send out
# 123 segments retransmitted <- retransmits
# 0 bad segments received
# Dropped packets at the NIC
ip -s link show eth0 | head
# 2: eth0: <BROADCAST,...> mtu 1500 ...
# link/ether ... brd ff:ff:ff:ff:ff:ff
# RX: bytes packets errors dropped overrun mcast
# 32409283... 450000 0 12 0 1
# ^^^ dropped packets = saturation or an undersized ring buffer
Errors
# Interface errors
ip -s link show eth0
ethtool -S eth0 | grep -i error
# rx_errors: 0
# tx_errors: 0
# rx_length_errors: 0
# rx_crc_errors: 0
# ...
# Connection-level reset rate
ss -s
# Total: 200
# TCP: 128 (estab 100, closed 10, orphaned 3, timewait 15)
# ...
# Ring buffer / NIC saturation
ethtool -S eth0 | grep -E 'discard|drop|missed'
File Descriptors
Utilization and errors
# System-wide
cat /proc/sys/fs/file-nr
# 12304 0 1048576
# ^ ^ ^
# allocated unused max
# If allocated approaches max, the system will refuse opens
# Per-process
ls /proc/$PID/fd | wc -l
grep 'Max open files' /proc/$PID/limits
# Max open files 1024 4096
# ^soft ^hard
# If count approaches soft, process will fail opens with EMFILE
Errors show up as EMFILE: Too many open files in application logs.
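The two per-process checks above combine into a helper worth keeping handy. A sketch; fd_headroom is an illustrative name, and it needs read access to the target's /proc entries (your own processes, or root):

```shell
# fd_headroom PID — open file descriptors vs the soft "Max open files"
# limit; when open approaches the soft limit, opens fail with EMFILE.
fd_headroom() {
    pid="$1"
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    # In /proc/PID/limits, field 4 of the "Max open files" row is the
    # soft limit, field 5 the hard limit.
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "pid=$pid open=$open soft_limit=$soft"
}

# Usage: fd_headroom $$    # inspect the current shell
```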
The Lesson-Length Triage Script
Here is a single shell script that runs the USE highlights. Save it, test it, commit it to your runbooks.
#!/bin/bash
# /usr/local/bin/quick-triage
set -eu
echo "=== Load, PSI ==="
uptime
echo
echo -n "cpu pressure: "; head -1 /proc/pressure/cpu
echo -n "mem pressure: "; head -1 /proc/pressure/memory
echo -n "io pressure:  "; head -1 /proc/pressure/io
echo
echo "=== CPU (per-cpu, one sample) ==="
mpstat -P ALL 1 1 | tail -n +4
echo
echo "=== Memory ==="
free -h
echo
awk '/^MemAvailable:/ {print "MemAvailable: " $2 " " $3}' /proc/meminfo
dmesg -T | tail -50 | grep -i 'oom\|killed process' | tail -5 || echo "No recent OOMs in dmesg"
echo
echo "=== Disks ==="
iostat -xz 1 2 | awk '/^Device/ {n++} n == 2'   # second report = the live 1s sample, not since-boot averages
echo
echo "=== Network ==="
sar -n DEV 1 1 | grep -v 'Average\|IFACE\|^$' | head
echo
sar -n ETCP 1 1 | grep -v 'Average\|^$' | head
echo
echo "=== Top 5 CPU hogs ==="
ps -eo pid,pcpu,comm --sort=-pcpu | head -6
echo
echo "=== Top 5 RSS consumers ==="
ps -eo pid,rss,comm --sort=-rss | head -6
echo
echo "=== File descriptors (system) ==="
cat /proc/sys/fs/file-nr
Running this gives you every USE signal in one screen.
Put your version of this script on every production server and in every image. When the on-call starts, the first command is always quick-triage. Sixty seconds later you have Utilization, Saturation, and Errors for every resource. No dashboards, no delay, no ambiguity about what you are looking at.
Interpreting the Signals: Which Resource Is Guilty?
| Symptom | Likely culprit | Confirming metric |
|---|---|---|
| High r in vmstat, low disk I/O | CPU saturation | mpstat per-CPU, PSI cpu |
| %iowait high, %util high on a disk | Disk saturation | iostat await + aqu-sz, PSI io |
| si/so > 0 in vmstat | Memory saturation (swapping) | /proc/pressure/memory, MemAvailable |
| "Out of memory" in dmesg | Memory errors (OOM) | cgroup memory.events |
| Retransmits rising in ETCP | Network saturation (peer) or packet loss | netstat -s, ss -s |
| Interface %ifutil > 80% | Network saturation (link) | sar -n DEV |
| EMFILE in application logs | FD exhaustion | /proc/sys/fs/file-nr, /proc/*/limits |
| CPU PSI "full > 0" | System-wide CPU starvation | mpstat -P ALL |
| IO PSI "full > 0" | All processes stuck on I/O | iotop, iostat per device |
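The table above can be approximated in code: rank the three PSI files by their 10-second pressure and start the investigation at the top. A sketch; pressure_ranking is an illustrative helper, and the directory argument defaults to /proc/pressure but is overridable for testing:

```shell
# pressure_ranking [DIR] — rank cpu/memory/io by "some avg10" pressure,
# highest first, so triage starts at the most-stalled resource.
pressure_ranking() {
    dir="${1:-/proc/pressure}"
    for res in cpu memory io; do
        awk -v r="$res" -F'avg10=' \
            '/^some/ {split($2, a, " "); print a[1], r}' \
            "$dir/$res" 2>/dev/null
    done | sort -rn
}

# Usage: pressure_ranking    # e.g. "35.00 cpu" on top -> chase CPU first
```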
Beyond USE: When You Need More Detail
When USE shows CPU saturation but not which code:
# Snapshot — what functions the CPUs are running right now
sudo perf top
# Samples: 12K of event 'cycles'
# Overhead Symbol
# 12.35% [k] finish_task_switch
# 8.90% [k] native_queued_spin_lock_slowpath
# 5.40% [.] malloc
# ...
# 30-second profile + flame graph (requires FlameGraph scripts from brendangregg)
sudo perf record -F 99 -a -g -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
When USE shows memory saturation but not which allocations:
# Per-process RSS/PSS over time
smem -tk -s pss
# Or from /proc
watch "awk '/^Pss:/ {sum += \$2} END {print sum, \"kB\"}' /proc/$PID/smaps"
# With BCC or bpftrace you can trace allocations (requires root and bcc-tools):
# sudo /usr/share/bcc/tools/memleak -p $PID -a
When USE shows I/O saturation but not which file:
# biolatency, biotop, biosnoop from BCC or bpftrace
sudo /usr/share/bcc/tools/biotop 5
# Tracing... Output every 5 secs. Hit Ctrl-C to end
# PID COMM D MAJ MIN DISK I/O Kbytes AVGms
# 1234 postgres W 259 0 nvme0n1 500 64000 3.21
These advanced tools (perf, eBPF via BCC or bpftrace) are the second layer. USE tells you which resource is the bottleneck; these tools tell you which function or file in the hot code path.
Key Concepts Summary
- USE = Utilization, Saturation, Errors. Apply it to every resource: CPU, memory, disk, network, file descriptors.
- Saturation is the key signal. High utilization without saturation is just "doing work." High utilization with saturation is a bottleneck.
- PSI is the modern saturation number. /proc/pressure/{cpu,memory,io} gives you normalized saturation per resource.
- Every resource has specific commands: mpstat/vmstat for CPU, free + /proc/meminfo for memory, iostat -xz for disks, sar -n DEV / sar -n TCP for network, /proc/sys/fs/file-nr for file descriptors.
- dmesg is the universal error log for kernel-level issues: OOM kills, disk errors, NIC errors, hardware faults.
- A one-minute triage runs 6–8 commands and tells you which resource is the bottleneck. Build a script.
- USE gets you to the resource; perf/eBPF get you to the code. Start with USE, escalate to deeper tools.
Common Mistakes
- Looking at CPU graphs alone and declaring "CPU is fine." If saturation (run queue, PSI cpu) is high but per-CPU isn't pegged, the scheduler is the bottleneck.
- Trusting %util on NVMe drives. On multi-queue devices it can read 100% with plenty of headroom. Use await and aqu-sz instead.
- Interpreting %iowait as "disk load." It is CPU time spent idle while waiting on I/O, so it depends on how much else the CPU has to do; it is a very noisy signal.
- Ignoring dmesg. It is where every kernel-level error surfaces: bad disks, bad NICs, bad memory, OOM kills, kernel panics. Make dmesg -T | tail part of every triage.
- Monitoring free memory and alerting when it is low. Linux aggressively uses otherwise-idle memory for the page cache, so "free" looks low on healthy systems; MemAvailable is the real pressure signal.
- Not alerting on PSI metrics. They are the single best "is this machine healthy?" signals available.
- Stopping at "high CPU usage" without per-process or per-thread attribution. pidstat 1 and top -H tell you which process or thread is the consumer.
- Using one data point. USE needs a two-sample minimum (vmstat's first output is averages since boot; use vmstat 1 5 and look at the later rows).
- Forgetting to check file descriptor limits. "Mystery" request failures at moderate load are often EMFILE from a process hitting its soft rlimit.
Application p99 latency just doubled. On an affected host: `vmstat 1 5` shows `r` consistently at 20 (the host has 16 CPUs), CPU is ~80% busy, iostat shows `%util` low and disk `await` < 1ms, network is unsaturated, memory is 40% used with no swap. PSI CPU `some avg10=35`. Which resource is the bottleneck, and what is your evidence?