Linux Fundamentals for Engineers

The Kernel, Userspace, and System Calls

An engineer on your team runs strace ls /tmp and stares at the output: well over a hundred lines, dozens of distinct syscalls, numbers they do not understand. "Why does listing five files take more than a hundred syscalls?" They assumed ls was "just" reading a directory. It is not. Every byte read, every file stat, every string printed to the terminal, every memory page allocated: each one crosses a hard boundary between two completely separate worlds. If you do not understand that boundary, you will spend your career guessing at why containers behave strangely, why strace output is overwhelming, why some processes can do things others cannot, and why Linux performance debugging feels like black magic.

This lesson is the one mental model that makes every other Linux concept click. Get this right and the rest of the course (processes, filesystems, systemd, cgroups, namespaces) stops being a pile of disconnected features and starts being one coherent system.

Two Worlds on One Machine

All code running on your Linux box executes in one of two modes. Every single operation (allocating memory, opening a file, sending a network packet, reading the time, forking a process) happens in one of these modes or crosses between them.

Kernel mode (also called supervisor mode, ring 0 on x86): full hardware access. Can talk directly to disks, network cards, memory controllers, CPU registers. Can modify page tables, handle interrupts, reschedule processes.
User mode (ring 3): no hardware access. Cannot touch the disk directly. Cannot read another process's memory. Cannot change the system clock. Cannot do anything that would affect another process without asking the kernel to do it.

The CPU itself enforces this. When you read "the kernel is privileged," that is not a software policy you could bypass with the right flags; it is a privilege level that the CPU hardware tracks and checks. If a user-mode program tries to execute a privileged instruction (like hlt to halt the CPU), the CPU raises a fault instead of executing it, and the kernel responds by killing the process with a signal: SIGSEGV for a general-protection fault such as this one on x86, SIGILL for an invalid opcode.
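You can watch the fault-then-signal path from user space. The sketch below is an illustration under assumptions rather than the privileged-instruction case itself: a child Python process reads address 0 through ctypes, which triggers a page fault instead of a general-protection fault, but the kernel answers the same way, by delivering SIGSEGV. The helper name crash_child is hypothetical.

```python
import signal
import subprocess
import sys


def crash_child() -> int:
    """Run a child interpreter that touches an unmapped address.

    Returns the child's return code; subprocess reports death by
    signal N as the negative value -N.
    """
    # ctypes.string_at(0) dereferences address 0, which is not mapped
    # in a normal process, so the CPU faults and the kernel steps in.
    code = "import ctypes; ctypes.string_at(0)"
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


if __name__ == "__main__":
    rc = crash_child()
    if rc < 0:
        print("child killed by", signal.Signals(-rc).name)
    else:
        print("child exited with status", rc)
```

The parent survives because the fault is handled per process: the kernel cleans up the offender and reports what happened through the exit status.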

KEY CONCEPT

User-mode code is powerless on its own. Every useful operation a program performs (reading a file, printing output, sending a packet) requires crossing into kernel mode. That crossing is called a system call, and it is the only legal doorway between the two worlds. (A few read-only calls, such as clock_gettime, are usually answered from the vDSO, a kernel-provided page mapped into every process, without a mode switch; the data still comes from the kernel.) Understanding this boundary is the foundation of understanding Linux.

What Lives in Each World

Kernel mode	User mode
Device drivers (disk, NIC, GPU)	Your applications (nginx, Python, Go binaries)
Virtual memory manager, page tables	Shells (bash, zsh)
Scheduler (picks the next process to run)	Libraries (glibc, OpenSSL)
TCP/IP stack, firewall (netfilter, configured by iptables/nftables)	Container runtimes (containerd, runc)
Filesystems (ext4, xfs, overlayfs)	Language runtimes (JVM, Node.js, CPython)
Syscall dispatcher	systemd, cron, sshd

Notice where systemd, nginx, and sshd sit. Even PID 1, the first process the kernel starts at boot, runs in user mode. The only things in kernel mode are the kernel itself and code loaded into it (modules like nvidia.ko or overlay.ko).

What the Kernel Actually Does

Most engineers talk about "the kernel" as if it were one big program. It is, in a sense, a single binary at /boot/vmlinuz-* that the bootloader loads into memory. But functionally it is a collection of services, all running in kernel mode, that your processes rely on constantly.

The kernel has four jobs:

Manage the CPU. Decide which process runs next, for how long, on which core. This is the scheduler.
Manage memory. Hand out pages to processes, enforce isolation between them, swap pages to disk when memory is tight, handle page faults when a process touches a page that is not resident.
Manage I/O. Drive disks, network cards, keyboards, GPUs. Translate "write these bytes to /var/log/syslog" into the SATA or NVMe commands the hardware understands.
Mediate access. Enforce permissions. Stop process A from reading process B's memory. Stop an unprivileged process from binding to port 80. Stop an unprivileged user from writing directly to /dev/sda.

Everything else Linux does is built on top of these four responsibilities.

PRO TIP

When you read kernel source code or strace output, look for which of these four jobs the code is doing. A syscall like read() is job 3 (I/O). fork() is job 1 (CPU). mmap() is job 2 (memory). setuid() is job 4 (mediation). This framing makes the firehose of kernel functionality much easier to navigate.

System Calls: The Only Doorway

A system call is a controlled, hardware-assisted transition from user mode into kernel mode to ask the kernel to do something on your behalf.

Not a function call. Not an API. A full-blown CPU-level mode switch with a dedicated instruction.

On x86-64 Linux, the instruction is literally syscall. The process:

User-mode code puts a syscall number in the rax register and arguments in rdi, rsi, rdx, r10, r8, r9.
It executes the syscall instruction.
The CPU switches to kernel mode, jumps to a fixed address (entry_SYSCALL_64 in the kernel), and begins executing kernel code.
The kernel looks up the syscall number in its table, calls the corresponding function, does the work.
The kernel puts a return value in rax, executes sysret, and the CPU switches back to user mode.

That is it. Every file read, every network packet sent, every process created uses the same mechanism.
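To make the "number plus arguments" idea concrete, here is a minimal sketch that invokes getpid by its raw number through libc's generic syscall() wrapper, using ctypes. The function name raw_getpid and the lookup table are assumptions made for this example, and only the x86-64 and arm64 numbers are included.

```python
import ctypes
import os
import platform

# Syscall numbers are architecture-specific. These two values for getpid
# come from the x86-64 and arm64 syscall tables; other architectures
# are deliberately left out of this sketch.
GETPID_NUMBERS = {"x86_64": 39, "aarch64": 172}


def raw_getpid() -> int:
    """Invoke getpid by number instead of through a named wrapper."""
    number = GETPID_NUMBERS[platform.machine()]
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # libc's syscall() places the number and arguments in the registers
    # the kernel expects and executes the trap instruction for us.
    return libc.syscall(ctypes.c_long(number))


if __name__ == "__main__":
    print(raw_getpid(), os.getpid())
```

Both values printed should match, because os.getpid() ends up asking the kernel the same question through the named wrapper.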

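# Count the syscalls defined for x86-64 (Debian/Ubuntu path; many other distros use /usr/include/asm/unistd_64.h)
grep -c '^#define __NR_' /usr/include/x86_64-linux-gnu/asm/unistd_64.h
# Several hundred on a modern kernel; the exact count depends on the installed kernel headers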

# Or look at the syscall table by number
ausyscall --dump | head -20
# 0  read
# 1  write
# 2  open
# 3  close
# 4  stat
# 5  fstat
# 6  lstat
# 7  poll
# 8  lseek
# 9  mmap
# ...

The Syscalls You Use Every Day (Whether You Know It or Not)

You do not call these directly in most languages (your standard library does it for you), but they are the ones that show up constantly in strace output:

Syscall	What it does	Example trigger
`read`	Read bytes from a file descriptor	`open("/etc/passwd").read()` in Python
`write`	Write bytes to a file descriptor	`print("hello")`, `logger.info(...)`
`open` / `openat`	Open a file, get a file descriptor	Any file access
`close`	Release a file descriptor	End of a `with open(...)` block
`stat` / `fstat`	Get file metadata (size, mtime, permissions)	`os.path.exists`, `ls -l`
`mmap`	Map a file or anonymous memory into your address space	`malloc` for large allocations, loading a shared library
`brk`	Grow the heap	Small `malloc` calls
`execve`	Replace the current process with a new program	Running any command
`fork` / `clone`	Create a new process (or thread)	`Popen(...)`, `pthread_create(...)`
`wait4`	Wait for a child process to exit	Shells waiting for commands
`socket`, `bind`, `accept`, `connect`	Network I/O	Any networked program
`epoll_wait`	Wait for events on many file descriptors	nginx, Redis, Node.js event loops
`ioctl`	"Anything else that does not fit a normal syscall"	Terminal control, device-specific operations
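The process rows of the table (fork/clone, execve, wait4) are easiest to see together. The following is a minimal sketch of what a shell does for every command, written with Python's thin os wrappers; run_child is a hypothetical helper name, and the exact syscalls glibc picks underneath (for example clone instead of fork) can vary by version.

```python
import os
import sys


def run_child(argv: list[str]) -> int:
    """Spawn argv the way a shell does: fork, exec in the child, wait in the parent."""
    pid = os.fork()  # fork/clone: the kernel duplicates this process
    if pid == 0:
        try:
            os.execvp(argv[0], argv)  # execve: replace the child with a new program
        finally:
            os._exit(127)  # only reached if the exec itself failed
    _, status = os.waitpid(pid, 0)  # wait4: block until the child exits
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    print(run_child([sys.executable, "-c", "raise SystemExit(7)"]))
```

Running this under strace -f should show the same three steps in order, which is a good way to check the table against your own system.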

KEY CONCEPT

When you learn a new Linux feature (cgroups, namespaces, io_uring, eBPF), your first question should be: which syscall exposes this to userspace? Most kernel features are reachable through a small number of dedicated syscalls (unshare and setns for namespaces, bpf for eBPF) or through files that you drive with ordinary open, read, and write (the cgroup filesystem). Knowing the entry point lets you read the man page, trace calls to it with strace, and build a mental model grounded in what actually happens rather than marketing.

The User → Kernel Round Trip

Let us walk through what happens when a Python program runs open("/etc/hostname").read():

[Interactive diagram: Anatomy of a syscall: read a file]
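As a rough stand-in for the walkthrough, the sketch below does the same job as open(path).read() using Python's low-level os functions, so that each line corresponds to one crossing into the kernel. The function name read_whole_file is hypothetical, and the built-in open() issues a few extra calls (such as fstat and lseek) that are omitted here.

```python
import os


def read_whole_file(path: str, chunk_size: int = 4096) -> bytes:
    """Read a file with the thin os-level wrappers, one syscall per call."""
    # openat: the kernel resolves the path, checks permissions,
    # and hands back a small integer, the file descriptor.
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while True:
            # read: the kernel copies up to chunk_size bytes from the
            # file (or its page cache) into a buffer in our process.
            chunk = os.read(fd, chunk_size)
            if not chunk:  # an empty result means end of file
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        # close: the kernel releases the descriptor.
        os.close(fd)


if __name__ == "__main__":
    print(read_whole_file("/etc/hostname"))
```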

Two lessons from this flow:

Syscalls are not free. A mode switch costs on the order of a hundred nanoseconds or more even in the best case: the CPU has to save and restore register state, kernel code pollutes the caches, and Spectre/Meltdown mitigations such as kernel page-table isolation add a page-table switch on entry and exit. Programs that do millions of tiny reads perform terribly compared to programs that do thousands of bigger ones.
You are always a guest in your own process. When read() is running, the kernel is executing on your process's behalf, using your process's kernel stack, but it can see and touch things you cannot. Understanding which code is running in which mode is the difference between reading strace output and guessing at it.
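The first lesson is easy to quantify without a profiler. This sketch counts how many read() round trips it takes to consume the same file with different buffer sizes; count_read_calls is a hypothetical helper, and the 4096-byte test file is an arbitrary choice.

```python
import os
import tempfile


def count_read_calls(path: str, chunk_size: int) -> int:
    """Return how many read() calls it takes to consume the file."""
    calls = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, chunk_size)
            calls += 1          # every os.read is one user -> kernel round trip
            if not chunk:       # the final, empty read signals end of file
                return calls
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 4096)
    try:
        print("1-byte reads:   ", count_read_calls(tmp.name, 1))
        print("4096-byte reads:", count_read_calls(tmp.name, 4096))
    finally:
        os.unlink(tmp.name)
```

The data transferred is identical in both runs; only the number of crossings changes, and that number is what dominates the cost of the small-buffer version.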

strace: See the Boundary With Your Own Eyes

strace attaches to a process and prints every syscall it makes. It is the single most useful tool for understanding what a program is actually doing.

# Trace a simple command
strace -f -o /tmp/ls.trace ls /tmp
wc -l /tmp/ls.trace
# Something like 134 for "ls /tmp"

# Look at a slice of the output
head -30 /tmp/ls.trace
# execve("/usr/bin/ls", ["ls", "/tmp"], 0x7fff...) = 0
# brk(NULL)                               = 0x55b...
# openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
# fstat(3, {st_mode=S_IFREG|0644, st_size=125842, ...}) = 0
# mmap(NULL, 125842, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f...
# close(3)                                = 0
# openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
# ...

Every line is one syscall. You can literally see the process loading shared libraries, reading the directory, and printing output. There is no magic, just a hundred or so of these round trips.
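When a saved trace is too long to skim, a few lines of Python can summarize it. The sketch below counts syscalls by name in a file such as /tmp/ls.trace; the regular expression is an assumption about the usual strace line layout (an optional PID prefix, then name and opening parenthesis), and count_syscalls is a hypothetical helper.

```python
import re
from collections import Counter

# Matches an optional leading PID (strace adds one when following
# children into a single output file), then the syscall name and "(".
SYSCALL_LINE = re.compile(r"^(?:\d+\s+)?([a-z_][a-z0-9_]*)\(")


def count_syscalls(lines) -> Counter:
    """Count syscalls by name in an iterable of strace output lines."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_LINE.match(line)
        if match:  # skips signal notes, exit lines, and "<... resumed>" fragments
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    with open("/tmp/ls.trace") as trace:
        for name, calls in count_syscalls(trace).most_common(5):
            print(f"{calls:6d}  {name}")
```

This is a rough offline equivalent of strace -c, useful when you only have the trace file and not the running process.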

# Summary mode: count syscalls by type
strace -c ls /tmp 2>&1 | tail -20
# % time     seconds  usecs/call     calls    errors syscall
# ------ ----------- ----------- --------- --------- ----------------
#  27.45    0.000112           3        37           mmap
#  18.87    0.000077           3        26           close
#  13.48    0.000055           2        27           fstat
#  11.52    0.000047           2        24           openat
#   8.58    0.000035           2        18           read
#   ...

# Follow child processes too
strace -f -p $PID

# Only trace specific syscalls
strace -e trace=openat,read,write -p $PID

WAR STORY

A team was debugging a Python service that took 40 seconds to start in Kubernetes but 2 seconds locally. No application logs, no errors; it just sat there. strace -f -c on the container showed 180,000 stat calls during startup. The service was importing a library that walked sys.path on every import, and in the container sys.path had 14 entries on an NFS-backed volume. Each stat was a round trip over the network. A one-line PYTHONDONTWRITEBYTECODE change cut startup to 3 seconds. strace found it in under a minute; application logs would never have shown it.

The Virtual File: /proc and the Kernel as a Filesystem

The Linux kernel exposes much of its state through two special filesystems: /proc and /sys. These are not real disks; reading from them triggers kernel code that generates the output on the fly.

# How many context switches has this process had? (This is not a syscall count; /proc/self is whichever process reads it, here grep)
grep ctxt /proc/self/status
# voluntary_ctxt_switches:   12
# nonvoluntary_ctxt_switches: 3

# What syscalls does the running kernel even support?
ls /sys/kernel/debug/tracing/events/syscalls/ 2>/dev/null | head
# Needs CAP_SYS_ADMIN or root
# sys_enter_accept
# sys_enter_accept4
# sys_enter_access
# ...

# Which syscall is a process currently blocked on?
cat /proc/$PID/syscall
# 0 0x3 0x7ffc...  0x400  0x0 0x0 0x0 0x7ffc... 0x7ff...
# First number = syscall number. 0 = read. The process is blocked in read().

This is the "everything is a file" philosophy at work, and it is the subject of the next lesson. For now, the point is: the kernel gives you an honest window into itself through these filesystems. When you want to know what Linux is really doing, the answer is almost always in /proc or /sys.
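Because these files are plain text, a script can decode them as easily as you can. The sketch below parses the contents of /proc/<pid>/syscall following the layout described in the proc(5) man page; parse_syscall_line is a hypothetical helper, and mapping the number back to a name is left to a tool such as ausyscall.

```python
def parse_syscall_line(text: str) -> dict:
    """Decode the contents of /proc/<pid>/syscall.

    The file holds the word "running" when the process is not blocked,
    "-1 <sp> <pc>" when it is blocked outside a syscall, or the syscall
    number followed by six argument registers, the stack pointer, and
    the program counter.
    """
    fields = text.split()
    if not fields or fields[0] == "running":
        return {"state": "running"}
    number = int(fields[0])
    if number == -1:
        return {"state": "blocked outside a syscall"}
    return {
        "state": "in syscall",
        "number": number,
        "args": [int(value, 16) for value in fields[1:7]],
    }


if __name__ == "__main__":
    import sys

    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    with open(f"/proc/{pid}/syscall") as handle:
        print(parse_syscall_line(handle.read()))
```

For the sample line shown earlier, the result would report syscall number 0 with a first argument of 3, meaning the process is blocked in read() on file descriptor 3.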

Why Understanding This Matters in Production

This is not theory. Here is what having this model lets you do:

Read strace output without panic. When a process is "stuck," strace shows you exactly which syscall it is blocked on. read on a socket? Waiting for network. futex? Waiting for a lock. epoll_wait? Idling for events.
Understand container performance. Containers run in user mode like everything else. A "container overhead" discussion is really a discussion of extra syscalls (for namespaces, seccomp filters, cgroup accounting) layered on top of normal process startup.
Read CPU time sensibly. top shows %us (user time) and %sy (system time) separately for a reason. High %sy means the CPU is spending a lot of time executing kernel code on behalf of processes, usually because they are I/O-heavy or doing too much fine-grained work.
Debug permission errors correctly. "Permission denied" does not originate in your app; it comes from a syscall returning -EACCES. The app is just the messenger. Knowing which syscall failed (strace) is the fast path to fixing it.
Reason about seccomp and security. seccomp filters block specific syscalls. When a container fails mysteriously after hardening, knowing which syscalls the process needs is the whole debug story.
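The "app is just the messenger" point is visible from inside a program too. In this minimal sketch, a failed open surfaces as a Python exception whose errno field is the code the kernel returned; describe_failure is a hypothetical helper, and a missing path (ENOENT) is used instead of EACCES so the result does not depend on which user runs it.

```python
import errno
import os


def describe_failure(path: str) -> str:
    """Try to open path read-only and report how the kernel answered."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        # exc.errno is the code the failed syscall returned; Python only
        # wraps it in an exception class such as PermissionError.
        name = errno.errorcode.get(exc.errno, "UNKNOWN")
        return f"open failed: {name} ({exc.strerror})"
    os.close(fd)
    return "open succeeded"


if __name__ == "__main__":
    print(describe_failure("/no/such/directory/file.txt"))
```

The same errno name is what strace prints at the end of the failing line, so the two views can be matched up directly.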

PRO TIP

Install strace on every server you manage and every container image you ship to dev environments. A production debug session that starts with strace -f -p $(pgrep app) finds root causes in minutes that application-level logging would never reveal. It is the difference between "the app is slow" and "the app is making 40,000 open() calls per second against a directory that returns -ENOENT." Keep the session short on busy services, because strace stops the traced process at every syscall and slows it down noticeably.

User Mode vs Kernel Mode: The One-Page Summary

Key Concepts Summary

Two modes, enforced by hardware. User mode and kernel mode are CPU-level states, not software policies. A user-mode program literally cannot execute privileged instructions; the CPU traps the attempt.
Kernel mode has four jobs. Scheduling, memory management, I/O, and mediation. Every kernel feature maps to one of these.
Syscalls are the only way across. Every file read, process creation, or network send goes through a syscall. There is no other way for user code to affect anything outside its own memory.
Syscalls cost real time. A mode switch takes on the order of a hundred nanoseconds or more. Programs that make millions of tiny syscalls are slow for reasons that have nothing to do with their algorithm.
strace shows the boundary. Every line of strace output is one round trip from user to kernel and back. Reading strace output is reading your program's true behavior, not the pretty version in your source code.
/proc and /sys are the kernel's window. They are not real files; they are live kernel state exposed through the filesystem API.
User-mode code is the majority. systemd, sshd, nginx, Python, your app, all of it runs in user mode. The kernel is just the substrate they all stand on.

Common Mistakes

Treating "the kernel" as an opaque black box instead of a concrete set of subsystems with well-defined syscalls as their API.
Assuming "fast" and "slow" programs differ in their algorithm when the real difference is syscall frequency. A tight loop doing 2 million write(1, ..., 1) calls can be orders of magnitude slower than one write(1, ..., 2000000).
Reading strace output and giving up because it is noisy. The noise is the program. Learning to skim it is learning to see your program clearly.
Confusing library calls with syscalls. printf is a library function; the syscall under it is write. malloc is a library function; the syscall under it is brk or mmap. The library can batch, cache, and optimize, but eventually, every operation that touches the outside world is a syscall.
Believing containers or VMs "bypass" the kernel. They do not. A container is a user-mode process with extra kernel bookkeeping (namespaces, cgroups). A VM is a user-mode process (QEMU using KVM) that the kernel lets talk to virtualization hardware; the guest runs its own kernel inside it, but the VM as a whole is still scheduled, given memory, and granted I/O by the kernel on the host.
Thinking you need to read kernel source to understand Linux. You need to understand the syscall interface. The kernel source is how it is implemented; the syscall interface is the contract.
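The library-call versus syscall distinction can be demonstrated without strace. This sketch puts a counting object where the file descriptor would be and writes through Python's io.BufferedWriter; CountingSink and buffered_write_calls are hypothetical names, and on a real file each raw write counted here would correspond to one write() syscall.

```python
import io


class CountingSink(io.RawIOBase):
    """A stand-in for a file descriptor that counts raw write calls."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.total = 0

    def writable(self) -> bool:
        return True

    def write(self, data) -> int:
        self.calls += 1          # on a real file this would be one syscall
        self.total += len(data)
        return len(data)


def buffered_write_calls(pieces: int, buffer_size: int = 8192) -> CountingSink:
    """Write `pieces` single bytes through a buffer and return the sink."""
    sink = CountingSink()
    writer = io.BufferedWriter(sink, buffer_size=buffer_size)
    for _ in range(pieces):
        writer.write(b"x")   # library call: usually just a memory copy
    writer.flush()           # push whatever is still buffered to the sink
    return sink


if __name__ == "__main__":
    sink = buffered_write_calls(10_000)
    print(f"{sink.total} bytes reached the sink in {sink.calls} raw writes")
```

Ten thousand library calls collapse into a handful of raw writes, which is exactly the batching that printf and Python's file objects do for you.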

KNOWLEDGE CHECK

A Python service is hung: it accepts new connections but never responds. You run `cat /proc/$(pgrep -f myapp)/syscall` and see a number at the start of the line that corresponds to `futex`. What does that tell you?
