The Kernel, Userspace, and System Calls
An engineer on your team runs `strace -c ls /tmp` and stares at the output: thousands of lines, dozens of syscalls, numbers they do not understand. "Why does listing five files take 80 syscalls?" They assumed `ls` was "just" reading a directory. It is not. Every byte read, every file stat, every string printed to the terminal, every memory page allocated: each one crosses a hard boundary between two completely separate worlds. If you do not understand that boundary, you will spend your career guessing at why containers behave strangely, why `strace` output is overwhelming, why some processes can do things others cannot, and why Linux performance debugging feels like black magic.
This lesson is the one mental model that makes every other Linux concept click. Get this right and the rest of the course (processes, filesystems, systemd, cgroups, namespaces) stops being a pile of disconnected features and starts being one coherent system.
Two Worlds on One Machine
Every program running on your Linux box lives in one of two modes. Every single operation — allocating memory, opening a file, sending a network packet, reading the time, forking a process — happens in one of these modes or crosses between them.
- Kernel mode (also called supervisor mode, ring 0 on x86): full hardware access. Can talk directly to disks, network cards, memory controllers, CPU registers. Can modify page tables, handle interrupts, reschedule processes.
- User mode (ring 3): no hardware access. Cannot touch the disk directly. Cannot read another process's memory. Cannot change the system clock. Cannot affect another process at all except by asking the kernel to do it.
The CPU itself enforces this. When you read "the kernel is privileged," it is not a software policy you could bypass with the right flags — it is a hardware-level mode bit the CPU checks on every instruction. If a user-mode program tries to execute a privileged instruction (like hlt to halt the CPU), the CPU traps the attempt and the kernel kills the process with SIGSEGV or SIGILL.
User-mode code is powerless on its own. Every useful operation a program performs (reading a file, printing output, sending a packet, starting another process) requires crossing into kernel mode. That crossing is called a system call, and it is the only legal doorway between the two worlds. Understanding this boundary is the foundation of understanding Linux.
What Lives in Each World
| Kernel mode | User mode |
|---|---|
| Device drivers (disk, NIC, GPU) | Your applications (nginx, Python, Go binaries) |
| Virtual memory manager, page tables | Shells (bash, zsh) |
| Scheduler (picks the next process to run) | Libraries (glibc, OpenSSL) |
| TCP/IP stack, firewall (iptables/nftables) | Container runtimes (containerd, runc) |
| Filesystems (ext4, xfs, overlayfs) | Language runtimes (JVM, Node.js, CPython) |
| Syscall dispatcher | systemd, cron, sshd |
Notice where systemd, nginx, and sshd sit. Even PID 1 — even the program that boots your system — runs in user mode. The only thing in kernel mode is the kernel itself and code loaded into it (modules like nvidia.ko or overlay.ko).
What the Kernel Actually Does
Most engineers talk about "the kernel" as if it were one big program. It is, in a sense — a single binary at /boot/vmlinuz-* that the bootloader loads into memory. But functionally it is a collection of services, all running in kernel mode, that your processes rely on constantly.
The kernel has four jobs:
- Manage the CPU. Decide which process runs next, for how long, on which core. This is the scheduler.
- Manage memory. Hand out pages to processes, enforce isolation between them, swap pages to disk when memory is tight, handle page faults when a process touches a page that is not resident.
- Manage I/O. Drive disks, network cards, keyboards, GPUs. Translate "write these bytes to `/var/log/syslog`" into the SATA or NVMe commands the hardware understands.
- Mediate access. Enforce permissions. Stop process A from reading process B's memory. Stop a non-root user from binding to port 80. Stop unprivileged processes from writing directly to `/dev/sda`.
Everything else Linux does is built on top of these four responsibilities.
When you read kernel source code or strace output, look for which of these four jobs the code is doing. A syscall like read() is job 3 (I/O). fork() is job 1 (CPU). mmap() is job 2 (memory). setuid() is job 4 (mediation). This framing makes the firehose of kernel functionality much easier to navigate.
System Calls — The Only Doorway
A system call is a controlled, hardware-assisted transition from user mode into kernel mode to ask the kernel to do something on your behalf.
Not a function call. Not an API. A full-blown CPU-level mode switch with a dedicated instruction.
On x86-64 Linux, the instruction is literally syscall. The process:
- User-mode code puts a syscall number in the `rax` register and arguments in `rdi`, `rsi`, `rdx`, `r10`, `r8`, `r9`.
- It executes the `syscall` instruction.
- The CPU switches to kernel mode, jumps to a fixed address (`entry_SYSCALL_64` in the kernel), and begins executing kernel code.
- The kernel looks up the syscall number in its table, calls the corresponding function, does the work.
- The kernel puts a return value in `rax`, executes `sysret`, and the CPU switches back to user mode.
That is it. Every file read, every network packet sent, every process created — the same mechanism.
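The steps above can be tried from Python without writing assembly. This is a minimal sketch, assuming Linux with glibc: it calls libc's generic `syscall()` wrapper by number instead of by name. The `GETPID_NR` table and the `raw_getpid` helper are illustrative names, not part of any library, and the table covers only two architectures because syscall numbers differ between them.

```python
import ctypes
import os
import platform

# getpid's number on two architectures, as listed in the kernel's syscall tables.
# This table is an illustration only; other architectures need their own entry.
GETPID_NR = {"x86_64": 39, "aarch64": 172}


def raw_getpid():
    """Invoke getpid by number through libc's syscall() wrapper."""
    nr = GETPID_NR[platform.machine()]        # on x86-64 this value ends up in rax
    libc = ctypes.CDLL(None, use_errno=True)  # handle to the already-loaded libc
    libc.syscall.restype = ctypes.c_long
    return libc.syscall(ctypes.c_long(nr))    # libc executes the syscall instruction


pid_via_number = raw_getpid()
pid_via_wrapper = os.getpid()                 # the usual, named route
print(pid_via_number, pid_via_wrapper)
```

Both values should be the same PID: the named wrapper and the numbered call reach the same kernel function through the same doorway.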
# See the full list of syscalls on your system
grep -c '^#define __NR_' /usr/include/asm-generic/unistd.h
# A few hundred on a modern kernel (the exact count depends on kernel version and architecture)
# Or look at the syscall table by number
ausyscall --dump | head -20
# 0 read
# 1 write
# 2 open
# 3 close
# 4 stat
# 5 fstat
# 6 lstat
# 7 poll
# 8 lseek
# 9 mmap
# ...
The Syscalls You Use Every Day (Whether You Know It or Not)
You do not call these directly in most languages — your standard library does it for you — but they are the ones that show up constantly in strace output:
| Syscall | What it does | Example trigger |
|---|---|---|
| `read` | Read bytes from a file descriptor | `open("/etc/passwd").read()` in Python |
| `write` | Write bytes to a file descriptor | `print("hello")`, `logger.info(...)` |
| `open` / `openat` | Open a file, get a file descriptor | Any file access |
| `close` | Release a file descriptor | End of a `with open(...)` block |
| `stat` / `fstat` | Get file metadata (size, mtime, permissions) | `os.path.exists`, `ls -l` |
| `mmap` | Map a file or anonymous memory into your address space | `malloc` for large allocations, loading a shared library |
| `brk` | Grow the heap | Small `malloc` calls |
| `execve` | Replace the current process with a new program | Running any command |
| `fork` / `clone` | Create a new process (or thread) | `Popen(...)`, `go func() {...}` |
| `wait4` | Wait for a child process to exit | Shells waiting for commands |
| `socket`, `bind`, `accept`, `connect` | Network I/O | Any networked program |
| `epoll_wait` | Wait for events on many file descriptors | nginx, Redis, Node.js event loops |
| `ioctl` | "Anything else that does not fit a normal syscall" | Terminal control, device-specific operations |
When you learn a new Linux feature (cgroups, namespaces, io_uring, eBPF), your first question should be: which syscall exposes this to userspace? Most kernel features are reachable through a small number of dedicated syscalls; the rest (cgroups, for example) are driven through special files, which you still reach with ordinary `open`, `read`, and `write`. Knowing the syscall lets you read the man page, trace calls to it with strace, and build a mental model grounded in what actually happens rather than marketing.
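Three rows of the table (`fork`/`clone`, `execve`, `wait4`) are enough to run a command the way a shell does. The sketch below uses Python's thin `os` wrappers, assuming a Linux system; `run` is a hypothetical helper name, and the comments name the syscall each wrapper typically ends up in.

```python
import os
import sys


def run(argv):
    """Run a command the way a shell does: fork, exec in the child, wait in the parent."""
    pid = os.fork()                    # fork/clone: the kernel duplicates this process
    if pid == 0:
        # Child: replace this process image with the requested program.
        try:
            os.execvp(argv[0], argv)   # execve: only returns (by raising) on failure
        except OSError:
            pass
        os._exit(127)                  # conventional "command not found" status
    _, status = os.waitpid(pid, 0)     # wait4: block until the child exits
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)        # child was killed by a signal


exit_code = run([sys.executable, "-c", "raise SystemExit(3)"])
print(exit_code)
```

Running this script under `strace -f` should show the same three calls from the table, in that order.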
The User → Kernel Round Trip
Let us walk through what happens when a Python program runs open("/etc/hostname").read():
[Interactive diagram: Anatomy of a syscall: read a file]
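The round trip can be approximated by hand with Python's low-level `os` functions, each of which is a thin wrapper over one syscall. This is a sketch of roughly what `open(path).read()` does under the hood, not the exact sequence CPython issues. `read_file_by_hand` is a hypothetical helper, and the temporary file with the made-up hostname `web-01` stands in for `/etc/hostname` so the example does not depend on that file existing.

```python
import os
import tempfile


def read_file_by_hand(path):
    """Read a whole file using one explicit syscall wrapper per step."""
    fd = os.open(path, os.O_RDONLY)        # openat: kernel checks permissions, returns a descriptor
    try:
        size_hint = os.fstat(fd).st_size   # fstat: metadata only, no file data yet
        chunks = []
        while True:
            chunk = os.read(fd, 4096)      # read: kernel copies bytes into our buffer
            if not chunk:                  # empty result means end of file
                break
            chunks.append(chunk)
    finally:
        os.close(fd)                       # close: descriptor released
    return size_hint, b"".join(chunks)


# Stand-in for /etc/hostname: a temporary file with made-up contents.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"web-01\n")
    tmp.flush()
    size_hint, data = read_file_by_hand(tmp.name)

print(size_hint, data)
```

Every commented line is one crossing into kernel mode and back; the loop makes at least two `read` calls, because the final empty read is how the kernel signals end of file.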
Two lessons from this flow:
- Syscalls are not free. A mode switch typically costs from around a hundred to several hundred nanoseconds: registers are saved and restored, caches and TLBs get polluted, and the Spectre/Meltdown mitigations add further overhead. Programs that do millions of tiny reads perform terribly compared to programs that do thousands of bigger ones.
- You are always a guest in your own process. When `read()` is running, the kernel is executing on your process's behalf, using your process's kernel stack, but it can see and touch things you cannot. Understanding which code is running in which mode is the difference between reading `strace` output and guessing at it.
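The first lesson is easy to measure. The sketch below writes the same payload to `/dev/null` twice: once a byte at a time, once in a single call. The function names and the 200,000-byte payload are arbitrary choices for illustration. Python's own per-call overhead is included in the timings, so treat the ratio as a rough indication rather than a measurement of the mode switch alone.

```python
import os
import time


def write_one_byte_at_a_time(fd, payload):
    """One write syscall per byte."""
    calls = 0
    for i in range(len(payload)):
        os.write(fd, payload[i:i + 1])
        calls += 1
    return calls


def write_all_at_once(fd, payload):
    """As few write syscalls as the kernel allows (loops only on short writes)."""
    calls = 0
    view = memoryview(payload)
    while view:
        written = os.write(fd, view)
        view = view[written:]
        calls += 1
    return calls


def timed(fn, fd, payload):
    start = time.perf_counter()
    calls = fn(fd, payload)
    return calls, time.perf_counter() - start


payload = b"x" * 200_000
fd = os.open(os.devnull, os.O_WRONLY)
try:
    tiny_calls, tiny_seconds = timed(write_one_byte_at_a_time, fd, payload)
    big_calls, big_seconds = timed(write_all_at_once, fd, payload)
finally:
    os.close(fd)

print(f"{tiny_calls} calls: {tiny_seconds:.4f}s, {big_calls} call: {big_seconds:.6f}s")
```

The bytes delivered are identical in both runs; only the number of boundary crossings changes.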
strace: See the Boundary With Your Own Eyes
strace attaches to a process and prints every syscall it makes. It is the single most useful tool for understanding what a program is actually doing.
# Trace a simple command
strace -f -o /tmp/ls.trace ls /tmp
wc -l /tmp/ls.trace
# Something like 134 for "ls /tmp"
# Look at a slice of the output
head -30 /tmp/ls.trace
# execve("/usr/bin/ls", ["ls", "/tmp"], 0x7fff...) = 0
# brk(NULL) = 0x55b...
# openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
# fstat(3, {st_mode=S_IFREG|0644, st_size=125842, ...}) = 0
# mmap(NULL, 125842, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f...
# close(3) = 0
# openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
# ...
Every line is one syscall. You can literally see the process loading shared libraries, reading the directory, and printing output. There is no magic — just hundreds of these round trips.
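Because every line starts with the syscall name, a trace file is easy to summarize yourself. The sketch below counts syscalls by name, using the sample lines shown above as input. `count_syscalls` and the regular expression are illustrative; the pattern assumes the plain strace line format plus the optional PID prefix that `-f` adds.

```python
import re
from collections import Counter

# Optional PID prefix ("1234 " in -o files, "[pid 1234] " on a terminal), then "name(".
SYSCALL_LINE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def count_syscalls(lines):
    """Count syscalls by name, skipping signal and exit lines that do not match."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = """\
execve("/usr/bin/ls", ["ls", "/tmp"], 0x7fff...) = 0
brk(NULL) = 0x55b...
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=125842, ...}) = 0
mmap(NULL, 125842, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f...
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
""".splitlines()

counts = count_syscalls(sample)
for name, n in counts.most_common():
    print(f"{n:4d} {name}")
```

To summarize a real trace, pass an open file instead: `count_syscalls(open("/tmp/ls.trace"))`.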
# Summary mode: count syscalls by type
strace -c ls /tmp 2>&1 | tail -20
# % time seconds usecs/call calls errors syscall
# ------ ----------- ----------- --------- --------- ----------------
# 27.45 0.000112 3 37 mmap
# 18.87 0.000077 3 26 close
# 13.48 0.000055 2 27 fstat
# 11.52 0.000047 2 24 openat
# 8.58 0.000035 2 18 read
# ...
# Follow child processes too
strace -f -p $PID
# Only trace specific syscalls
strace -e trace=openat,read,write -p $PID
A team was debugging a Python service that took 40 seconds to start in Kubernetes but 2 seconds locally. No application logs, no errors — it just sat there. strace -f -c on the container showed 180,000 stat calls during startup. The service was importing a library that walked sys.path on every import, and in the container sys.path had 14 entries on an NFS-backed volume. Each stat was a round trip over the network. A one-line PYTHONDONTWRITEBYTECODE change cut startup to 3 seconds. strace found it in under a minute — application logs would never have shown it.
The Virtual File: /proc and the Kernel as a Filesystem
The Linux kernel exposes much of its state through two special filesystems: /proc and /sys. These are not real disks — reading from them triggers kernel code that generates the output on the fly.
# How often has this process been context-switched? Look at /proc
grep -i ctxt /proc/self/status
# voluntary_ctxt_switches: 12
# nonvoluntary_ctxt_switches: 3
# What syscalls does the running kernel even support?
ls /sys/kernel/debug/tracing/events/syscalls/ 2>/dev/null | head
# Needs CAP_SYS_ADMIN or root
# sys_enter_accept
# sys_enter_accept4
# sys_enter_access
# ...
# Which syscall is a process currently blocked on?
cat /proc/$PID/syscall
# 0 0x3 0x7ffc... 0x400 0x0 0x0 0x0 0x7ffc... 0x7ff...
# First number = syscall number. 0 = read. The process is blocked in read().
This is the "everything is a file" philosophy at work — and it is the subject of the next lesson. For now, the point is: the kernel gives you an honest window into itself through these filesystems. When you want to know what Linux is really doing, the answer is almost always in /proc or /sys.
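Decoding `/proc/$PID/syscall` by hand gets tedious, so here is a small sketch of a parser. It assumes the format documented in proc(5): the word `running`, or `-1` plus two registers when the process is blocked outside a syscall, or a syscall number followed by six arguments, the stack pointer, and the program counter. `X86_64_NAMES` is a deliberately tiny, x86-64-only lookup table, and both function names are hypothetical.

```python
# Partial x86-64 lookup table, for illustration only; numbers differ on other architectures.
X86_64_NAMES = {0: "read", 1: "write", 7: "poll", 202: "futex", 232: "epoll_wait"}


def parse_proc_syscall(text):
    """Interpret the contents of /proc/<pid>/syscall."""
    fields = text.split()
    if not fields or fields[0] == "running":
        return {"state": "running"}
    number = int(fields[0])
    if number == -1:
        return {"state": "blocked outside a syscall"}
    return {
        "state": "in syscall",
        "number": number,
        "name": X86_64_NAMES.get(number, "unknown"),
        "args": fields[1:7],   # six argument registers, kept as raw hex strings
    }


def inspect_pid(pid):
    """Read and decode the live file (needs permission to inspect that process)."""
    with open(f"/proc/{pid}/syscall") as f:
        return parse_proc_syscall(f.read())


# The sample line shown above, elisions included.
sample = "0 0x3 0x7ffc... 0x400 0x0 0x0 0x0 0x7ffc... 0x7ff..."
info = parse_proc_syscall(sample)
print(info["name"], info["args"][0])
```

For the sample line this reports `read` on file descriptor `0x3`, matching the annotation in the shell snippet.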
Why Understanding This Matters in Production
This is not theory. Here is what having this model lets you do:
- Read strace output without panic. When a process is "stuck," strace shows you exactly which syscall it is blocked on. `read` on a socket? Waiting for network. `futex`? Waiting for a lock. `epoll_wait`? Idling for events.
- Understand container performance. Containers run in user mode like everything else. A "container overhead" discussion is really a discussion of extra syscalls (for namespaces, seccomp filters, cgroup accounting) layered on top of normal process startup.
- Read CPU time sensibly. `top` shows `%us` (user time) and `%sy` (system time) separately for a reason. High `%sy` means your process is spending a lot of time in syscalls, usually because it is I/O-bound or doing too much fine-grained work.
- Debug permission errors correctly. "Permission denied" does not come from your app; it comes from a syscall returning `-EACCES`. The app is just the messenger. Knowing which syscall failed (`strace`) is the fast path to fixing it.
- Reason about seccomp and security. seccomp filters block specific syscalls. When a container fails mysteriously after hardening, knowing which syscalls the process needs is the whole debug story.
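The "app is just the messenger" point is visible from inside a program too. In this minimal sketch, the kernel's error code arrives in Python as `OSError.errno`, and the familiar message is only a translation of it. `try_open` and the path are hypothetical. The example triggers `ENOENT` rather than `EACCES` because a missing file fails the same way for every user, including root.

```python
import errno
import os


def try_open(path):
    """Return (errno name, message) if the open syscall fails, else (None, None)."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        # exc.errno is the code the syscall returned; the text is derived from it.
        return errno.errorcode.get(exc.errno, str(exc.errno)), os.strerror(exc.errno)
    os.close(fd)
    return None, None


# Hypothetical path that should not exist on any system.
code, message = try_open("/nonexistent-demo-dir/config.yaml")
print(code, message)
```

The same code appears in strace as the return value of the failing `openat` line, which is why the trace pinpoints the failure faster than the application's own error message.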
Install strace on every server you manage and every container image you ship to dev environments. A production debug session that starts with strace -f -p $(pgrep app) finds root causes in minutes that application-level logging would never reveal. It is the difference between "the app is slow" and "the app is making 40,000 open() calls per second against a directory that returns -ENOENT."
User Mode vs Kernel Mode: The One-Page Summary
┌────────────────────────────────────────────────────────┐
│ USER MODE (ring 3) │
│ │
│ Your programs: nginx bash python sshd systemd │
│ Libraries: glibc openssl libstdc++ │
│ │
│ Cannot: touch hardware, read other processes, │
│ change page tables, disable interrupts │
│ │
└────────────────────────┬───────────────────────────────┘
│
│ syscall instruction
│ (the only doorway)
│
┌────────────────────────▼───────────────────────────────┐
│ KERNEL MODE (ring 0) │
│ │
│ Subsystems: scheduler, VM, VFS, net stack, drivers │
│ Modules: ext4.ko overlay.ko nvidia.ko │
│ │
│ Can: anything the hardware allows │
│ │
└────────────────────────────────────────────────────────┘
Key Concepts Summary
- Two modes, enforced by hardware. User mode and kernel mode are CPU-level states, not software policies. A user-mode program literally cannot execute privileged instructions — the CPU traps the attempt.
- Kernel mode has four jobs. Scheduling, memory management, I/O, and mediation. Every kernel feature maps to one of these.
- Syscalls are the only way across. Every file read, process creation, or network send goes through a syscall. There is no other way for user code to do anything useful.
- Syscalls cost real time. A mode switch costs on the order of a hundred nanoseconds or more. Programs that make millions of tiny syscalls are slow for reasons that have nothing to do with their algorithm.
- `strace` shows the boundary. Every line of strace output is one round trip from user to kernel and back. Reading strace output is reading your program's true behavior, not the pretty version in your source code.
- `/proc` and `/sys` are the kernel's window. They are not real files; they are live kernel state exposed through the filesystem API.
- User-mode code is the majority. systemd, sshd, nginx, Python, your app: all of it runs in user mode. The kernel is just the substrate they all stand on.
Common Mistakes
- Treating "the kernel" as an opaque black box instead of a concrete set of subsystems with well-defined syscalls as their API.
- Assuming "fast" and "slow" programs differ in their algorithm when the real difference is syscall frequency: a tight loop doing 2 million `write(1, ..., 1)` calls can be orders of magnitude slower than one `write(1, ..., 2000000)`.
- Reading strace output and giving up because it is noisy. The noise is the program. Learning to skim it is learning to see your program clearly.
- Confusing library calls with syscalls. `printf` is a library function; the syscall under it is `write`. `malloc` is a library function; the syscall under it is `brk` or `mmap`. The library can batch, cache, and optimize, but eventually every operation that touches the outside world is a syscall.
- Believing containers or VMs "bypass" the kernel. They do not. A container is a user-mode process with extra kernel bookkeeping (namespaces, cgroups). A VM is a user-mode process (qemu/kvm) that the kernel lets talk to virtualization hardware. Everything still runs through the same kernel on the host.
- Thinking you need to read kernel source to understand Linux. You need to understand the syscall interface. The kernel source is how it is implemented; the syscall interface is the contract.
A Python service is hung — it accepts new connections but never responds. You run `cat /proc/$(pgrep -f myapp)/syscall` and see a number at the start of the line that corresponds to `futex`. What does that tell you?