Linux Fundamentals for Engineers

Block Devices and Disk I/O

A Postgres replica in production has p99 query latency spiking to 8 seconds. CPU is idle, memory is fine, the network is fine. iostat -x 1 shows await (average time per I/O) climbing to 120 ms and %util pinned at 100% — the disk is saturated. The cloud provider confirms the EBS volume has depleted its burst balance: its baseline is 3,000 IOPS, but the replica is trying to push 12,000. Restoring the burst credits is not the fix. Matching provisioned IOPS to the workload is.

You cannot solve that problem with application-layer thinking. You need to know what "the disk is saturated" means at the OS level — which layer is saturated, what the kernel is buffering, what costs what, and which numbers to watch. This lesson maps the full Linux I/O stack so the next time a latency chart spikes, you know which knob to turn.


The Two Kinds of Devices

The kernel distinguishes two fundamental device types:

  • Character devices stream bytes in order. You read one byte or many, but you cannot seek. Terminals, serial ports, random-number generators, keyboards, /dev/null, /dev/urandom. Their driver's read() returns bytes as they become available.
  • Block devices are addressable storage. You read or write fixed-size blocks at specific offsets. Hard drives, SSDs, NVMe, loop devices, LVM volumes. The kernel maintains a block layer that queues, caches, and schedules I/O to these devices.

# Character vs block at a glance
ls -l /dev/null /dev/sda /dev/nvme0n1 /dev/urandom /dev/tty 2>/dev/null
# crw-rw-rw- 1 root root     1,   3 Apr 19 08:00 /dev/null        <- c = character
# brw-rw---- 1 root disk     8,   0 Apr 19 08:00 /dev/sda         <- b = block
# brw-rw---- 1 root disk   259,   0 Apr 19 08:00 /dev/nvme0n1     <- b
# crw-rw-rw- 1 root root     1,   9 Apr 19 08:00 /dev/urandom     <- c
# crw--w---- 1 root tty      4,   0 Apr 19 08:00 /dev/tty         <- c

# The entire block-device inventory
lsblk -f
# NAME         FSTYPE FSVER LABEL UUID                                 MOUNTPOINTS
# nvme0n1
# ├─nvme0n1p1  vfat   FAT32       3f8a-01c9                            /boot/efi
# └─nvme0n1p2  ext4   1.0         c112-91bd-4a3e-9d2c-...              /
# nvme1n1
# └─nvme1n1p1  ext4   1.0         a87b-02e4-45c1-b0d8-...              /home
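If a script needs to branch on device type, the shell's POSIX file tests cover both cases. A minimal sketch; /dev/sda below is a placeholder, substitute whatever lsblk shows on your machine:

```shell
# -c tests for a character device, -b for a block device (POSIX test operators)
if [ -c /dev/null ]; then
    echo "/dev/null is a character device"
fi

# /dev/sda is an assumption; yours may be /dev/nvme0n1, /dev/vda, etc.
dev=/dev/sda
if [ -b "$dev" ]; then
    echo "$dev is a block device"
fi
```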

From here on, this lesson is about block devices. Character devices are simpler and do not have a performance story in the same way.


The Block I/O Stack

When your program calls read(fd, buf, 65536) on a file, the request travels through half a dozen kernel layers before hitting the disk (and another half dozen on the way back). Understanding the stack is how you localize a latency problem.

  Application (your code)
      │  read() / write() / pread() / pwrite() / mmap()
      ▼
  System call layer
      │
      ▼
  Virtual File System (VFS)
      │  dispatches to the right filesystem
      ▼
  Filesystem driver (ext4, xfs, overlayfs, btrfs)
      │  translates logical file offsets into block numbers
      ▼
  Page cache
      │  RAM copy of recently-read/written data
      ▼  (cache miss — actually go to disk)
  Block layer (the 'bio' layer)
      │  submits requests to the queue
      ▼
  I/O scheduler  (mq-deadline, bfq, none, kyber)
      │  merges, reorders, prioritizes requests
      ▼
  Device driver (nvme, virtio_blk, scsi, sd)
      │
      ▼
  Hardware (NVMe, SATA SSD, HDD, network block device)

Each layer adds its own latency and has its own observability hooks. A slow read() could be slow in any of them — and the tools you use to diagnose it change at each level.

KEY CONCEPT

When someone says "the disk is slow," ask: slow at which layer? Application-reported latency includes the whole stack, all the way down. iostat measures at the block layer. iowait in top measures CPU time stuck waiting on I/O. fio benchmarks raw device performance below the filesystem. Each gives a different answer — and picking the right one tells you where the problem is.
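One concrete way to see the layer distinction: the kernel keeps per-process counters in /proc/&lt;pid&gt;/io that separate syscall-level bytes from block-layer bytes (this sketch assumes a kernel built with task I/O accounting, which mainstream distros enable):

```shell
# rchar/wchar count bytes moved through read()/write() syscalls, page-cache hits included.
# read_bytes/write_bytes count only bytes that actually reached the block layer.
cat /proc/self/io
# A large rchar with a small read_bytes means the page cache absorbed most reads.
```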


The Page Cache: Why a "Disk Read" Is Usually Not a Disk Read

Linux aggressively caches file data in RAM. When you read a file, the kernel:

  1. Checks if the needed pages are already in the page cache (RAM).
  2. If yes, copies them into your buffer and returns immediately. No disk I/O.
  3. If no, issues a read to the block layer, waits for it, caches the result, then returns.

Writes go the other way: by default they fill the page cache and are marked "dirty," and the kernel flushes them to disk asynchronously (this is called buffered I/O or writeback).

# How much RAM is the page cache using right now?
grep -E '^(Cached|Dirty|Writeback):' /proc/meminfo
# Cached:         14280432 kB     <- page cache
# Dirty:             12856 kB     <- modified pages not yet flushed
# Writeback:             0 kB     <- pages currently being flushed

# Same info, human-friendly
free -h
#                total    used    free  shared  buff/cache   available
# Mem:            32Gi     8Gi     2Gi     1Gi        22Gi        23Gi
#                                                               ^^^^^
#                                      available = free + reclaimable cache

# Drop the page cache (for benchmarks — never on a production system)
sync                                    # flush dirty pages first
echo 3 | sudo tee /proc/sys/vm/drop_caches
# 3 = drop pagecache + dentries + inodes

The page cache is why cat bigfile > /dev/null is blazing fast the second time — the file is in RAM.
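You can watch the effect directly: write a test file, drop the cache, and time a cold read against a warm one. A sketch — the file path and size are arbitrary, and the cache-drop step needs root:

```shell
f=/tmp/cachedemo                       # scratch file; path is an assumption
dd if=/dev/zero of="$f" bs=1M count=256 2>/dev/null
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null   # evict it (root required)
time cat "$f" > /dev/null              # cold read: goes to the device
time cat "$f" > /dev/null              # warm read: served from the page cache
rm -f "$f"
```

On most hardware the warm read finishes an order of magnitude or more faster than the cold one.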

PRO TIP

"Available memory" in free or MemAvailable in /proc/meminfo includes reclaimable page cache. Monitoring systems that alert on "free memory is low!" are often alerting on "the kernel is cleverly caching disk data." Use MemAvailable as the real "memory pressure" signal, not MemFree.

Writeback and why fsync() matters

Writes buffer in the page cache and flush later. This is fast but risky: if the power drops, dirty pages that never made it to disk are lost.

  • Buffered write (default): write to page cache, return immediately. Disk I/O is async.
  • fsync(fd): force all dirty pages for this file to disk before returning. Slow but durable.
  • O_SYNC / O_DSYNC: open the file so every write() is effectively fsync()ed. Very slow but safe.
  • O_DIRECT: bypass the page cache entirely. Reads and writes go straight to the device. Used by databases and high-performance applications that do their own caching.

# See writeback in action
dd if=/dev/zero of=/tmp/testfile bs=1M count=100 oflag=direct
# 100+0 records in
# 100+0 records out
# 104857600 bytes (105 MB) copied, 0.987 s, 106 MB/s    <- real disk throughput

dd if=/dev/zero of=/tmp/testfile bs=1M count=100
# 100+0 records in
# 100+0 records out
# 104857600 bytes (105 MB) copied, 0.082 s, 1.3 GB/s    <- writing to page cache, not disk
# Without fsync, we have no idea if the data actually hit storage yet
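dd also has a useful middle ground: conv=fsync keeps the writes buffered but calls fsync() once at the end, so the reported time includes the flush to the device while still batching efficiently:

```shell
# Buffered writes + one fsync() at the end: honest elapsed time, efficient batching
dd if=/dev/zero of=/tmp/testfile bs=1M count=100 conv=fsync
rm -f /tmp/testfile
```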

WAR STORY

A database team shipped a "faster" build that cut write latency in half. A month later, a power event lost ~2 GB of supposedly-committed data. Root cause: someone had changed the durability setting so the WAL writes no longer called fsync() between commits. Every commit returned fast — to the page cache. Power loss dropped the whole cache. The lesson: fsync is not a nuisance; it is the contract between your program and the durability of its data. If you are not paying for fsync, your data is not actually on disk yet.


Benchmarking and Measuring Disk I/O

iostat -x 1 — the production workhorse

iostat -x 1
# Device            r/s     w/s    rkB/s    wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
# nvme0n1         241.0    38.0   12032    4864    0.0     12.0   0.00    24.0   0.28     0.75     0.09      50.0     128.0    0.5    16.4
# nvme1n1         1203.0   421.0  90240   16384    0.0     0.0    0.00    0.00   4.21    12.34     5.8       75.0      38.9    0.82    94.7

The fields you watch every day:

  • r/s, w/s — reads and writes per second (IOPS). The number your cloud provider's "provisioned IOPS" is about.
  • rkB/s, wkB/s — throughput in KB/s.
  • r_await, w_await — average latency per I/O, in milliseconds. For SSDs, expect sub-1 ms; for spinning disks, 5–15 ms. Anything above that is trouble.
  • aqu-sz — average queue depth. High queue depth + high await = saturated device.
  • %util — how much of the time the device was busy. For multi-queue SSDs, this is misleading (it can read as 100% while the device still has headroom), but for old SATA HDDs it is honest.

WARNING

%util pinned at 100% does not necessarily mean the device is saturated on modern NVMe drives. NVMe supports thousands of parallel commands — it can be "busy" for every sample without being at its IOPS ceiling. Use await and aqu-sz instead to judge saturation. %util lies on modern hardware; await tells the truth.
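If you want queue depth straight from the kernel rather than iostat's average, /proc/diskstats exposes it: the ninth statistics field after the device name (column 12 of the file) is the number of I/Os currently in flight:

```shell
# Column 12 of /proc/diskstats = I/Os currently in progress, per device
awk '{printf "%-12s inflight=%s\n", $3, $12}' /proc/diskstats
```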

iotop — which process is using the disk

sudo iotop -oPa
# Total DISK READ:   15.42 M/s | Total DISK WRITE:   8.33 M/s
#  PID   USER     DISK READ   DISK WRITE   COMMAND
#  1234  postgres 12.34 M/s   6.78 M/s     postgres: writer
#  5678  root     2.01 M/s    1.22 M/s     jbd2/nvme0n1p2-8      <- kernel thread: ext4 journal

pidstat -d 1 — per-process I/O without needing root

pidstat -d 1
# 10:00:01 AM   UID       PID   kB_rd/s   kB_wr/s   Command
# 10:00:02 AM  1000     12345   11200.0    8300.0   postgres

fio — real benchmarking

iostat tells you what the current workload is doing. fio tells you what the device can do:

# Random 4K reads — emulates OLTP database read workload
fio --name=randread --filename=/data/testfile --rw=randread \
    --bs=4k --size=2G --iodepth=32 --numjobs=1 --runtime=30 --time_based --direct=1 --group_reporting

# Output includes IOPS, average latency, p99 latency
#   read: IOPS=98.2k, BW=384MiB/s (402MB/s)(11.3GiB/30002msec)
#     clat (usec): min=120, max=8992, avg=324.45, stdev=90.15
#     clat percentiles (usec): 99.00th=[ 512], 99.99th=[ 2048]

The numbers you get from fio are your device's ceiling. If your app is hitting those numbers, the disk is the bottleneck. If it is not, the disk has headroom and the bottleneck is somewhere else.


I/O Schedulers

Every block device on Linux has an I/O scheduler that decides the order in which requests are sent to hardware. Each scheduler has a different philosophy.

Scheduler     Good for                                        How it behaves
none / noop   NVMe, fast SSDs, VMs                            Does nothing; passes requests straight to the device. Lowest CPU overhead.
mq-deadline   SATA SSDs, HDDs with predictable workloads      Tracks per-request deadlines; writes yield to reads.
bfq           Desktop / latency-sensitive mixed workloads     Fair-share between processes; good interactive responsiveness.
kyber         Low-latency + throughput, multi-queue devices   Latency target for reads and writes; rate-limits to stay under it.

# Which scheduler is my device using?
cat /sys/block/nvme0n1/queue/scheduler
# [none] mq-deadline kyber bfq
#  ^ brackets show the active one

# Change it at runtime (not persistent across reboot)
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

# Make it persistent with udev rules
cat > /etc/udev/rules.d/60-ioscheduler.rules << 'EOF'
ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="mq-deadline"
EOF

PRO TIP

For NVMe on virtualized cloud instances (EC2, GCE, Azure), set the scheduler to none. The NVMe device has its own queue and out-of-order execution; the kernel scheduler just adds overhead. For older SATA SSDs, mq-deadline is a safe default. For spinning disks, bfq can improve interactive responsiveness at the cost of throughput.
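To audit every device at once, a short loop over sysfs does it (a sketch; device names and available schedulers vary per machine):

```shell
# Print the scheduler list for each block device; the bracketed one is active
for f in /sys/block/*/queue/scheduler; do
    [ -e "$f" ] || continue            # skip if the glob matched nothing
    printf '%-24s %s\n' "${f%/queue/scheduler}" "$(cat "$f")"
done
```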


Latency Budgets by Hardware

Rough numbers so you can reason about what a disk "should" do:

Device                     Sequential read   4K random read   4K random write   Typical IOPS                  Typical latency
7200 RPM HDD               100–150 MB/s      ~100 IOPS        ~100 IOPS         100–150                       5–15 ms
SATA SSD                   500 MB/s          ~50–100k IOPS    ~50k IOPS         80k                           50–200 µs
NVMe SSD (consumer)        3–7 GB/s          300k+ IOPS       200k+ IOPS        400k+                         20–100 µs
NVMe SSD (datacenter)      7–14 GB/s         1M+ IOPS         500k+ IOPS        1M+                           < 100 µs
AWS EBS gp3                Configurable      Configurable     Configurable      3000 (baseline) / 16k (max)   1–3 ms
AWS EBS io2 Block Express  Configurable      Configurable     Configurable      up to 256k                    ~500 µs

If your p99 read latency is 3 ms on an NVMe SSD that should do 100 µs, you are not waiting on the hardware — you are waiting on the page cache, the I/O scheduler, the filesystem, or the block queue. Start at the layer that matches your observed latency.


iowait: What It Actually Means

Look at top:

%Cpu(s):  5.0 us, 2.0 sy, 0.0 ni, 80.0 id, 13.0 wa, 0.0 hi, 0.0 si, 0.0 st
#                                         ^^^^
#                                         iowait

%wa (iowait) is CPU time spent idle while a process on that CPU was waiting on I/O. It is not "how much I/O is happening." It is "how much time the CPU could have been working but was not because its next step was an I/O that had not come back yet."

Two consequences most engineers miss:

  1. High iowait is not by itself a problem. If nothing else wanted the CPU, that time would have been idle anyway; iowait is just idle time with a tag.
  2. Zero iowait does not mean no I/O. If the CPU is also busy with other work, iowait can be 0 even under heavy disk load — the CPU is not waiting, it is doing something else while I/O completes.

Use iowait as a signal that disk is involved, not as a direct measure of disk load. iostat -x is the honest measurement.
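To connect iowait back to actual processes, look for tasks in uninterruptible sleep (state D) and the kernel's running count of blocked tasks:

```shell
# Tasks in uninterruptible sleep: almost always stuck in the I/O path
ps -eo state,pid,comm | awk '$1 == "D"'
# Kernel-wide count of blocked tasks (these also inflate load average on Linux)
grep '^procs_blocked' /proc/stat
```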


Common Failure Modes and How to Spot Them

Symptom                                             Likely cause                               First diagnostic
await rising, %util high                            Device saturated (or cloud IOPS cap hit)   iostat -x 1; check cloud provider's IOPS dashboard
Load average climbs with no CPU usage               Processes in D state waiting on I/O        ps -eo stat,pid,cmd | grep '^D'
Writes "finish" fast but data lost on reboot        Missing fsync()                            strace -e fsync,fdatasync -p $PID
df shows space free, but "no space left on device"  Inodes exhausted                           df -i
Very high CPU sy time during I/O                    Too many small syscalls                    strace -c; look at read/write count and average size
Read throughput collapses after a period            Page cache evicted                         /proc/meminfo Cached:; consider vm.vfs_cache_pressure
NVMe device shows dmesg errors                      Hardware issue                             smartctl -a /dev/nvme0; dmesg -T | grep -i nvme
Slow metadata ops (ls, stat) on a big dir           Directory has millions of entries          time ls -f dir | wc -l

# Quick health check on a block device
sudo smartctl -H /dev/sda          # overall SMART status: PASSED or FAILED
sudo smartctl -a /dev/sda | head   # full SMART attributes
sudo smartctl -a /dev/nvme0 | grep -iE 'percent|media|error'
# SMART attributes that actually predict failure:
#   Reallocated_Sector_Ct (HDD) — anything > 0 and climbing is bad
#   Media_Wearout_Indicator / Percentage_Used (SSD/NVMe) — life left
#   Power_On_Hours — how old

Key Concepts Summary

  • Block devices are addressable storage; character devices are byte streams. Different drivers, different abstractions.
  • The Linux I/O stack has many layers. App → syscall → VFS → filesystem → page cache → block layer → scheduler → driver → hardware. Each adds latency and observability.
  • The page cache makes most "disk reads" RAM reads. A cold read hits the disk; subsequent reads hit the cache.
  • Writeback is the default. Writes go to the page cache and flush async. fsync() is how you guarantee data is really on disk.
  • iostat -x 1 is the production workhorse. await, aqu-sz, r/s, w/s are the numbers that matter. %util lies on modern multi-queue devices.
  • iowait is idle CPU time waiting on I/O. It is not a direct measure of disk load and it can be zero even under heavy I/O.
  • I/O schedulers can be tuned per-device. none for NVMe, mq-deadline for SATA SSD, bfq for desktop, kyber for latency-bounded workloads.
  • Know the expected latency for your hardware. NVMe at 5 ms means something is wrong above the device. HDD at 15 ms is normal.
  • fio measures what the device can do; iostat measures what it is currently doing.

Common Mistakes

  • Taking throughput numbers from dd if=/dev/zero of=file seriously. The writes go to the page cache, not disk. Add oflag=direct or call fsync after.
  • Trusting %util on NVMe drives. On multi-queue devices, %util can be 100% with plenty of headroom left. Use await and aqu-sz.
  • Reading a filesystem's "no space left on device" and clearing files when the real cause was inode exhaustion. Always run df -i alongside df -h.
  • Running a benchmark without --direct=1 and comparing numbers — you are really benchmarking the page cache, not the device.
  • Ignoring dmesg when disk latency mysteriously doubles. A failing drive often throws I/O errors silently to dmesg before SMART flags it.
  • Assuming cloud block storage gives you its burst IOPS sustained. Most cloud gp-tier volumes have burst credits that deplete under sustained load.
  • Setting the I/O scheduler to bfq on a heavily loaded multi-process server and then wondering why throughput dropped. bfq trades throughput for fairness.
  • Calling fsync() after every write in a tight loop and wondering why throughput is 1% of the device's capacity. Batch writes before fsync, or use fdatasync to skip metadata syncs.
  • Not monitoring inode usage, SMART wearout, and queue depth — all three have caught silent degradation before user-visible outages in real incidents.

KNOWLEDGE CHECK

A database server shows `iostat -x` reporting `r_await=25 ms` and `aqu-sz=48` on an NVMe device that is specced at 1M IOPS with sub-100µs latency. CPU is 70% idle. What is the most likely explanation and the first thing to try?