Block Devices and Disk I/O
A Postgres replica in production has p99 query latency spiking to 8 seconds. CPU is idle, memory is fine, the network is fine.
`iostat -x 1` shows `await` (average I/O wait time) climbing to 120 ms and `%util` pinned at 100% — the disk is saturated. The cloud provider confirms the EBS volume's burst balance is depleted; its baseline IOPS is 3000 but the replica is trying to push 12,000. Restoring the burst credits is not the fix. Matching the workload to provisioned IOPS is.
You cannot solve that problem with application-layer thinking. You need to know what "the disk is saturated" means at the OS level — which layer is saturated, what the kernel is buffering, what costs what, and which numbers to watch. This lesson maps the full Linux I/O stack so the next time a latency chart spikes, you know which knob to turn.
The Two Kinds of Devices
The kernel distinguishes two fundamental device types:
- Character devices stream bytes in order. You read one byte or many, but you cannot seek. Terminals, serial ports, random-number generators, keyboards, `/dev/null`, `/dev/urandom`. Their driver's `read()` returns bytes as they become available.
- Block devices are addressable storage. You read or write fixed-size blocks at specific offsets. Hard drives, SSDs, NVMe, loop devices, LVM volumes. The kernel maintains a block layer that queues, caches, and schedules I/O to these devices.
# Character vs block at a glance
ls -l /dev/null /dev/sda /dev/nvme0n1 /dev/urandom /dev/tty 2>/dev/null
# crw-rw-rw- 1 root root 1, 3 Apr 19 08:00 /dev/null <- c = character
# brw-rw---- 1 root disk 8, 0 Apr 19 08:00 /dev/sda <- b = block
# brw-rw---- 1 root disk 259, 0 Apr 19 08:00 /dev/nvme0n1 <- b
# crw-rw-rw- 1 root root 1, 9 Apr 19 08:00 /dev/urandom <- c
# crw--w---- 1 root tty 4, 0 Apr 19 08:00 /dev/tty <- c
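The same distinction is scriptable without parsing `ls` output — a small check assuming GNU coreutils `stat`:

```shell
# %F prints the file type as a word; handy in scripts that must refuse
# to operate on the wrong kind of device node
stat -c '%n: %F' /dev/null
# /dev/null: character special file
```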
# The entire block-device inventory
lsblk -f
# NAME FSTYPE FSVER LABEL UUID MOUNTPOINTS
# nvme0n1
# ├─nvme0n1p1 vfat FAT32 3f8a-01c9 /boot/efi
# └─nvme0n1p2 ext4 1.0 c112-91bd-4a3e-9d2c-... /
# nvme1n1
# └─nvme1n1p1 ext4 1.0 a87b-02e4-45c1-b0d8-... /home
From here on, this lesson is about block devices. Character devices are simpler and do not have a performance story in the same way.
The Block I/O Stack
When your program calls read(fd, buf, 65536) on a file, the request travels through half a dozen kernel layers before hitting the disk (and another half dozen on the way back). Understanding the stack is how you localize a latency problem.
Application (your code)
│ read() / write() / pread() / pwrite() / mmap()
▼
System call layer
│
▼
Virtual File System (VFS)
│ dispatches to the right filesystem
▼
Filesystem driver (ext4, xfs, overlayfs, btrfs)
│ translates logical file offsets into block numbers
▼
Page cache
│ RAM copy of recently-read/written data
▼ (cache miss — actually go to disk)
Block layer (the 'bio' layer)
│ submits requests to the queue
▼
I/O scheduler (mq-deadline, bfq, none, kyber)
│ merges, reorders, prioritizes requests
▼
Device driver (nvme, virtio_blk, scsi, sd)
│
▼
Hardware (NVMe, SATA SSD, HDD, network block device)
Each layer adds its own latency and observability. A slow read() could be slow in any of them — and the tools to diagnose change at each level.
When someone says "the disk is slow," ask: slow at which layer? Application-reported latency includes the whole stack, all the way down. iostat measures at the block layer. iowait in top measures CPU time stuck waiting on I/O. fio benchmarks raw device performance below the filesystem. Each gives a different answer — and picking the right one tells you where the problem is.
The Page Cache: Why a "Disk Read" Is Usually Not a Disk Read
Linux aggressively caches file data in RAM. When you read a file, the kernel:
- Checks if the needed pages are already in the page cache (RAM).
- If yes, copies them into your buffer and returns immediately. No disk I/O.
- If no, issues a read to the block layer, waits for it, caches the result, then returns.
Writes go the other way: by default they fill the page cache and are marked "dirty," and the kernel flushes them to disk asynchronously (this is called buffered I/O or writeback).
# How much RAM is the page cache using right now?
grep -E '^(Cached|Dirty|Writeback):' /proc/meminfo
# Cached: 14280432 kB <- page cache
# Dirty: 12856 kB <- modified pages not yet flushed
# Writeback: 0 kB <- pages currently being flushed
# Same info, human-friendly
free -h
# total used free shared buff/cache available
# Mem: 32Gi 8Gi 2Gi 1Gi 22Gi 23Gi
# ^^^^^
# available = free + reclaimable cache
# Drop the page cache (for benchmarks — never on a production system)
sync # flush dirty pages first
echo 3 | sudo tee /proc/sys/vm/drop_caches
# 3 = drop pagecache + dentries + inodes
The page cache is why cat bigfile > /dev/null is blazing fast the second time — the file is in RAM.
"Available memory" in free or MemAvailable in /proc/meminfo includes reclaimable page cache. Monitoring systems that alert on "free memory is low!" are often alerting on "the kernel is cleverly caching disk data." Use MemAvailable as the real "memory pressure" signal, not MemFree.
Writeback and why fsync() matters
Writes buffer in the page cache and flush later. This is fast but risky: if the power drops, dirty pages that never made it to disk are lost.
- Buffered write (default): write to page cache, return immediately. Disk I/O is async.
- `fsync(fd)`: force all dirty pages for this file to disk before returning. Slow but durable.
- `O_SYNC` / `O_DSYNC`: open the file so every `write()` is effectively `fsync()`ed. Very slow but safe.
- `O_DIRECT`: bypass the page cache entirely. Reads and writes go straight to the device. Used by databases and high-performance applications that do their own caching.
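When the kernel starts flushing dirty pages on its own is tunable. The thresholds live under `/proc/sys/vm/` and are readable without root (defaults vary by distro):

```shell
# dirty_background_ratio: % of RAM dirty before async flushing starts
# dirty_ratio:            % of RAM dirty before writers are forced to block
echo "background flush starts at: $(cat /proc/sys/vm/dirty_background_ratio)% dirty"
echo "writers forced to block at: $(cat /proc/sys/vm/dirty_ratio)% dirty"
```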
# See writeback in action
dd if=/dev/zero of=/tmp/testfile bs=1M count=100 oflag=direct
# 100+0 records in
# 100+0 records out
# 104857600 bytes (105 MB) copied, 0.987 s, 106 MB/s <- real disk throughput
dd if=/dev/zero of=/tmp/testfile bs=1M count=100
# 100+0 records in
# 100+0 records out
# 104857600 bytes (105 MB) copied, 0.082 s, 1.3 GB/s <- writing to page cache, not disk
# Without fsync, we have no idea if the data actually hit storage yet
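A middle ground between the two runs above: keep buffered writes but flush before `dd` exits, so the reported time includes the real disk cost. (If `/tmp` is a tmpfs on your machine, the flush is nearly free — point the output at a real filesystem to see the difference.)

```shell
# conv=fsync issues one fsync() at the end; the timing now covers
# the flush of everything that was sitting dirty in the page cache
dd if=/dev/zero of=/tmp/testfile bs=1M count=100 conv=fsync
```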
A database team shipped a "faster" build that cut write latency in half. A month later, a power event lost ~2 GB of supposedly-committed data. Root cause: someone had changed the durability setting so the WAL writes no longer called fsync() between commits. Every commit returned fast — to the page cache. Power loss dropped the whole cache. The lesson: fsync is not a nuisance; it is the contract between your program and the durability of its data. If you are not paying for fsync, your data is not actually on disk yet.
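The same contract is visible from the shell. A plain append returns as soon as the page cache has the data; coreutils `sync` (8.24+) can force it down. The `/tmp/wal.log` path here is purely illustrative:

```shell
printf 'commit 42\n' >> /tmp/wal.log  # fast: lands in the page cache only
sync -d /tmp/wal.log                  # fdatasync(): now actually on disk
```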
Benchmarking and Measuring Disk I/O
iostat -x 1 — the production workhorse
iostat -x 1
# Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
# nvme0n1 241.0 38.0 12032 4864 0.0 12.0 0.00 24.0 0.28 0.75 0.09 50.0 128.0 0.5 16.4
# nvme1n1 1203.0 421.0 90240 16384 0.0 0.0 0.00 0.00 4.21 12.34 5.8 75.0 38.9 0.82 94.7
The fields you watch every day:
- `r/s`, `w/s` — reads and writes per second (IOPS). The number your cloud provider's "provisioned IOPS" is about.
- `rkB/s`, `wkB/s` — throughput in KB/s.
- `r_await`, `w_await` — average latency per I/O, in milliseconds. For SSDs, expect sub-1 ms; for spinning disks, 5–15 ms. Anything above that is trouble.
- `aqu-sz` — average queue depth. High queue depth + high await = saturated device.
- `%util` — how much of the time the device was busy. For multi-queue SSDs, this is misleading (it can read as 100% while the device still has headroom), but for old SATA HDDs it is honest.
%util pinned at 100% does not necessarily mean the device is saturated on modern NVMe drives. NVMe supports thousands of parallel commands — it can be "busy" for every sample without being at its IOPS ceiling. Use await and aqu-sz instead to judge saturation. %util lies on modern hardware; await tells the truth.
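The counters `iostat` computes its rates from are plain text in `/proc/diskstats`; a sketch of reading the raw numbers directly (field positions per the kernel's documented layout):

```shell
# field 3 = device name, 4 = reads completed, 7 = ms spent reading,
# 8 = writes completed, 11 = ms spent writing (cumulative since boot)
awk '{printf "%-12s reads=%s read_ms=%s writes=%s write_ms=%s\n", $3, $4, $7, $8, $11}' /proc/diskstats
```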
iotop — which process is using the disk
sudo iotop -oPa
# Total DISK READ: 15.42 M/s | Total DISK WRITE: 8.33 M/s
# PID USER DISK READ DISK WRITE COMMAND
# 1234 postgres 12.34 M/s 6.78 M/s postgres: writer
# 5678 root 2.01 M/s 1.22 M/s jbd2/nvme0n1p2-8 <- kernel thread: ext4 journal
pidstat -d 1 — per-process I/O without needing root
pidstat -d 1
# 10:00:01 AM UID PID kB_rd/s kB_wr/s Command
# 10:00:02 AM 1000 12345 11200.0 8300.0 postgres
fio — real benchmarking
iostat tells you what the current workload is doing. fio tells you what the device can do:
# Random 4K reads — emulates OLTP database read workload
fio --name=randread --filename=/data/testfile --rw=randread \
--bs=4k --size=2G --iodepth=32 --numjobs=1 --runtime=30 --time_based --direct=1 --group_reporting
# Output includes IOPS, average latency, p99 latency
# read: IOPS=98.2k, BW=384MiB/s (402MB/s)(11.3GiB/30002msec)
# clat (usec): min=120, max=8992, avg=324.45, stdev=90.15
# clat percentiles (usec): 99.00th=[ 512], 99.99th=[ 2048]
The numbers you get from fio are your device's ceiling. If your app is hitting those numbers, the disk is the bottleneck. If it is not, the disk has headroom and the bottleneck is somewhere else.
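For repeatable benchmarking, the same workload can live in a job file instead of a long command line — a hypothetical `randread.fio` mirroring the command above:

```ini
# randread.fio — run with: fio randread.fio
[global]
filename=/data/testfile
size=2G
runtime=30
time_based
direct=1
group_reporting

[randread-4k]
rw=randread
bs=4k
iodepth=32
numjobs=1
```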
I/O Schedulers
Every block device on Linux has an I/O scheduler that decides the order in which requests are sent to hardware. Each scheduler has a different philosophy.
| Scheduler | Good for | How it behaves |
|---|---|---|
| `none` / `noop` | NVMe, fast SSDs, VMs | Does nothing; passes requests straight to the device. Lowest CPU overhead. |
| `mq-deadline` | SATA SSDs, HDDs with predictable workloads | Tracks per-request deadlines; writes yield to reads. |
| `bfq` | Desktop / latency-sensitive mixed workloads | Fair-share between processes; good interactive responsiveness. |
| `kyber` | Low-latency + throughput, multi-queue devices | Latency target for reads and writes; rate-limits to stay under it. |
# Which scheduler is my device using?
cat /sys/block/nvme0n1/queue/scheduler
# [none] mq-deadline kyber bfq
# ^ brackets show the active one
# Change it at runtime (not persistent across reboot)
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
# Make it persistent with udev rules
cat > /etc/udev/rules.d/60-ioscheduler.rules << 'EOF'
ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="mq-deadline"
EOF
For NVMe on virtualized cloud instances (EC2, GCE, Azure), set the scheduler to none. The NVMe device has its own queue and out-of-order execution; the kernel scheduler just adds overhead. For older SATA SSDs, mq-deadline is a safe default. For spinning disks, bfq can improve interactive responsiveness at the cost of throughput.
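A quick read-only audit of what every block device on a host is currently using:

```shell
# The bracketed entry in each scheduler file is the active one
for q in /sys/block/*/queue/scheduler; do
  dev=${q#/sys/block/}; dev=${dev%%/*}
  printf '%-12s %s\n' "$dev" "$(cat "$q")"
done
```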
Latency Budgets by Hardware
Rough numbers so you can reason about what a disk "should" do:
| Device | Sequential read | 4K random read | 4K random write | Typical IOPS | Typical latency |
|---|---|---|---|---|---|
| 7200 RPM HDD | 100–150 MB/s | ~100 IOPS | ~100 IOPS | 100–150 | 5–15 ms |
| SATA SSD | 500 MB/s | ~50–100k IOPS | ~50k IOPS | 80k | 50–200 µs |
| NVMe SSD (consumer) | 3–7 GB/s | 300k+ IOPS | 200k+ IOPS | 400k+ | 20–100 µs |
| NVMe SSD (datacenter) | 7–14 GB/s | 1M+ IOPS | 500k+ IOPS | 1M+ | < 100 µs |
| AWS EBS gp3 | Configurable | Configurable | Configurable | 3000 (baseline) / 16k (max) | 1–3 ms |
| AWS EBS io2 Block Express | Configurable | Configurable | Configurable | up to 256k | ~500 µs |
If your p99 read latency is 3 ms on an NVMe SSD that should do 100 µs, you are not waiting on the hardware — you are waiting on the page cache, the I/O scheduler, the filesystem, or the block queue. Start at the layer that matches your observed latency.
iowait: What It Actually Means
Look at top:
%Cpu(s): 5.0 us, 2.0 sy, 0.0 ni, 80.0 id, 13.0 wa, 0.0 hi, 0.0 si, 0.0 st
# ^^^^
# iowait
%wa (iowait) is CPU time spent idle while a process on that CPU was waiting on I/O. It is not "how much I/O is happening." It is "how much time the CPU could have been working but was not because its next step was an I/O that had not come back yet."
Two consequences most engineers miss:
- High iowait on an otherwise idle system is normal idle time relabeled. If nothing else wanted the CPU, iowait is just idle time with a tag.
- Zero iowait does not mean no I/O. If the CPU is also busy with other work, iowait can be 0 even under heavy disk load — the CPU is not waiting, it is doing something else while I/O completes.
Use iowait as a signal that disk is involved, not as a direct measure of disk load. iostat -x is the honest measurement.
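The iowait number itself comes from `/proc/stat` — field 6 of the aggregate `cpu` line, in clock ticks accumulated since boot:

```shell
# cpu  user nice system idle iowait irq softirq ...
awk '/^cpu / {printf "idle=%s ticks  iowait=%s ticks\n", $5, $6}' /proc/stat
```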
Common Failure Modes and How to Spot Them
| Symptom | Likely cause | First diagnostic |
|---|---|---|
| `await` rising, `%util` high | Device saturated (or cloud IOPS cap hit) | `iostat -x 1`, check cloud provider's IOPS dashboard |
| Load average climbs with no CPU usage | Processes in D state waiting on I/O | `ps -eo stat,pid,cmd`, look for `D` in the STAT column |
| Writes "finish" fast but data lost on reboot | Missing `fsync()` | `strace -e fsync,fdatasync -p $PID` |
| `df` shows plenty of space, but "no space left on device" | Inodes exhausted | `df -i` |
| Very high CPU `sy` time during I/O | Too many small syscalls | `strace -c`, look at read/write count and average size |
| Read throughput collapses after a period | Page cache evicted | `/proc/meminfo` `Cached:`, consider `vm.vfs_cache_pressure` |
| NVMe device shows `dmesg` errors | Hardware issue | `smartctl -a /dev/nvme0`, `dmesg -T` |
| Slow metadata ops (`ls`, `stat`) on a big dir | Directory has millions of entries | `time ls -f dir` |
# Quick health check on a block device
sudo smartctl -H /dev/sda # overall SMART status: PASSED or FAILED
sudo smartctl -a /dev/sda | head # full SMART attributes
sudo smartctl -a /dev/nvme0 | grep -iE 'percent|media|error'
# SMART attributes that actually predict failure:
# Reallocated_Sector_Ct (HDD) — anything > 0 and climbing is bad
# Media_Wearout_Indicator / Percentage_Used (SSD/NVMe) — life left
# Power_On_Hours — how old
Key Concepts Summary
- Block devices are addressable storage; character devices are byte streams. Different drivers, different abstractions.
- The Linux I/O stack has many layers. App → syscall → VFS → filesystem → page cache → block layer → scheduler → driver → hardware. Each adds latency and observability.
- The page cache makes most "disk reads" RAM reads. A cold read hits the disk; subsequent reads hit the cache.
- Writeback is the default. Writes go to the page cache and flush async.
`fsync()` is how you guarantee data is really on disk.
- `iostat -x 1` is the production workhorse. `await`, `aqu-sz`, `r/s`, `w/s` are the numbers that matter. `%util` lies on modern multi-queue devices.
- `iowait` is idle CPU time waiting on I/O. It is not a direct measure of disk load and it can be zero even under heavy I/O.
- I/O schedulers can be tuned per-device. `none` for NVMe, `mq-deadline` for SATA SSD, `bfq` for desktop, `kyber` for latency-bounded workloads.
- Know the expected latency for your hardware. NVMe at 5 ms means something is wrong above the device. HDD at 15 ms is normal.
- `fio` measures what the device can do; `iostat` measures what it is currently doing.
Common Mistakes
- Taking throughput numbers from `dd if=/dev/zero of=file` seriously. The writes go to the page cache, not disk. Add `oflag=direct` or call `fsync` after.
- Trusting `%util` on NVMe drives. On multi-queue devices, `%util` can be 100% with plenty of headroom left. Use `await` and `aqu-sz`.
- Reading a filesystem's "no space left on device" and clearing files when the real cause was inode exhaustion. Always run `df -i` alongside `df -h`.
- Running a benchmark without `--direct=1` and comparing numbers — you are really benchmarking the page cache, not the device.
- Ignoring `dmesg` when disk latency mysteriously doubles. A failing drive often throws I/O errors silently to dmesg before SMART flags it.
- Assuming cloud block storage gives you its burst IOPS sustained. Most cloud gp-tier volumes have burst credits that deplete under sustained load.
- Setting the I/O scheduler to `bfq` on a heavily loaded multi-process server and then wondering why throughput dropped. bfq trades throughput for fairness.
- Calling `fsync()` after every write in a tight loop and wondering why throughput is 1% of the device's capacity. Batch writes before fsync, or use `fdatasync` to skip metadata syncs.
- Not monitoring inode usage, SMART wearout, and queue depth — all three have caught silent degradation before user-visible outages in real incidents.
A database server shows `iostat -x` reporting `r_await=25 ms` and `aqu-sz=48` on an NVMe device that is specced at 1M IOPS with sub-100µs latency. CPU is 70% idle. What is the most likely explanation and the first thing to try?