Linux Fundamentals for Engineers

Signals and Process Control

Your team ships a Go service to Kubernetes. It has a lovely graceful-shutdown handler: drain in-flight HTTP requests, flush buffered writes to the database, close the Kafka producer cleanly. In staging, restarts are smooth. In production, every rollout loses a handful of requests, and the Kafka consumer lag spikes briefly on every pod termination.

Someone blames Kubernetes. Someone else blames the load balancer. The actual culprit is signals. Kubernetes sends your pod SIGTERM and waits up to terminationGracePeriodSeconds (30 seconds by default). If you have not exited by then, it sends SIGKILL. Your graceful shutdown code runs — but one Go library you pulled in installs its own SIGTERM handler, which calls os.Exit(0) before your cleanup finishes. In another service, a bash entrypoint runs as PID 1 in the container; bash does not forward signals to its child, so SIGTERM goes to /bin/bash -c ... and the actual app never sees it.

Signals are the thinnest layer in all of Linux. A single number, a single sentence in the man page — and yet every production incident tied to "it does not shut down cleanly" is a signals problem. This lesson gives you the model that makes those incidents diagnosable in minutes.


What a Signal Is

A signal is a small integer (1 through 64, give or take) that one process sends to another to say "something happened" or "do this." The kernel delivers signals asynchronously: the target process is interrupted — possibly mid-instruction — and either runs a handler function, gets killed, gets stopped, or ignores it entirely, depending on how the signal is configured.

Signals are the Linux analogue of interrupts. They are:

  • Fast. A single kernel operation; no shared memory or sockets needed.
  • Limited. Only a number. Some signals can carry a small siginfo_t payload, but you cannot pass a string.
  • Asynchronous. The target process does not know when the signal will arrive, only that it might.
  • Easy to get wrong. Handlers run in a constrained environment (only async-signal-safe functions); handlers can interrupt each other; signals can be lost if sent faster than they can be handled.
KEY CONCEPT

A signal is not a message. It is an event. The payload is the fact that it happened, not what it contains. Knowing which signals you should listen for and what the default action is for each — that is 80% of getting signals right in production.


The Signals You Actually Use

There are 64 signal slots, but in practice you care about maybe a dozen.

Signal    Num  Default action      What it means
SIGHUP     1   Terminate           "Controlling terminal closed" — also used by daemons as "reload config"
SIGINT     2   Terminate           Ctrl-C from a terminal
SIGQUIT    3   Terminate + core    Ctrl-\ from a terminal — like SIGINT but dumps core
SIGBUS     7   Terminate + core    Bus error (misaligned memory, mmapped file truncated)
SIGKILL    9   Terminate           Cannot be caught, blocked, or ignored — the nuclear option
SIGUSR1   10   Terminate           User-defined — apps do whatever they want with it
SIGSEGV   11   Terminate + core    Segmentation fault — bad memory access
SIGUSR2   12   Terminate           User-defined — same idea
SIGPIPE   13   Terminate           Wrote to a pipe/socket with no readers
SIGALRM   14   Terminate           alarm() timer expired
SIGTERM   15   Terminate           Polite "please exit" — the one you catch for graceful shutdown
SIGCHLD   17   Ignore              A child process exited (reap it!)
SIGCONT   18   Continue            Resume a stopped process
SIGSTOP   19   Stop                Cannot be caught — hard-stops the process
SIGTSTP   20   Stop                Ctrl-Z from a terminal — like SIGSTOP but catchable
# The complete list on your system
kill -l
#  1) SIGHUP   2) SIGINT   3) SIGQUIT  4) SIGILL    5) SIGTRAP
#  6) SIGABRT  7) SIGBUS   8) SIGFPE   9) SIGKILL   10) SIGUSR1
# 11) SIGSEGV 12) SIGUSR2 13) SIGPIPE 14) SIGALRM  15) SIGTERM
# ...

# See what handlers a running process has installed
grep -E '^Sig' /proc/$PID/status
# SigQ:   2/31389
# SigPnd: 0000000000000000
# SigBlk: 0000000000000000
# SigIgn: 0000000000001000
# SigCgt: 0000000180014a07     <- bitmask of caught signals

# Decode the bitmask into signal numbers
awk '/^SigCgt/{print $2}' /proc/$PID/status \
  | python3 -c 'import sys; n = int(sys.stdin.read(), 16); print([i + 1 for i in range(64) if n >> i & 1])'
WARNING

SIGKILL (9) and SIGSTOP (19) are special: no program can catch, block, or ignore them. The kernel enforces this. They are the two signals that always work. When you absolutely need to kill a misbehaving process, kill -9 is the answer. But it gives the process zero chance to clean up — no flushing buffers, no closing connections, no unlocking files. Reach for SIGKILL only after SIGTERM has failed.
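You can watch the kernel enforce this. A small Python sketch — attempting to install a handler for SIGKILL or SIGSTOP fails with EINVAL in any language:

```python
import signal

# Try to install a handler for SIGKILL — the kernel rejects it with EINVAL
try:
    signal.signal(signal.SIGKILL, lambda signum, frame: None)
except OSError as e:
    print("cannot catch SIGKILL:", e)

# Even ignoring SIGSTOP is rejected the same way
try:
    signal.signal(signal.SIGSTOP, signal.SIG_IGN)
except OSError as e:
    print("cannot ignore SIGSTOP:", e)
```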


What Happens When a Signal Arrives

When the kernel decides to deliver a signal to a process, it looks at the process's signal disposition for that signal:

  1. Default action — kill, core dump, stop, continue, or ignore, per the table above.
  2. Ignore — the process has explicitly chosen to ignore this signal (via signal(SIGPIPE, SIG_IGN) or similar).
  3. Catch — the process has installed a handler function; the kernel interrupts the process and runs that function, then (usually) resumes where it left off.

Blocking is separate from disposition: a process can temporarily mask a signal, which defers delivery until the signal is unblocked — at which point the disposition above decides what happens.

The kernel also tracks pending signals that have been sent but not yet delivered. For standard signals (1–31), only one instance of each signal can be pending at a time — if you send SIGUSR1 to a process 1000 times while it is blocked, it gets delivered once when unblocked. Real-time signals (SIGRTMIN through SIGRTMAX, roughly 34–64 under glibc, which reserves the first two slots) queue properly, which is the only reason to ever use them.

# Pending signals for a process
cat /proc/$PID/status | grep SigPnd
# SigPnd: 0000000000000000
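You can watch the coalescing happen. A small Python experiment (Linux, main thread): block SIGUSR1, send it to yourself 1000 times, then unblock and count deliveries:

```python
import os
import signal

count = 0

def handler(signum, frame):
    global count
    count += 1

signal.signal(signal.SIGUSR1, handler)

# Block SIGUSR1 so deliveries pile up as "pending"
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
for _ in range(1000):
    os.kill(os.getpid(), signal.SIGUSR1)   # 1000 sends while blocked

# Unblock: the single pending bit is delivered exactly once
signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGUSR1})
print("handler ran", count, "time(s)")     # 1 — standard signals do not queue
```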

Sending Signals

# By PID
kill -TERM 12345          # same as kill -15 12345
kill -9 12345             # SIGKILL
kill -HUP 12345           # ask daemon to reload

# By name
pkill -TERM nginx         # kill all nginx processes politely
pkill -9 -f 'myapp --prod'  # SIGKILL anything whose command line matches

# To a whole process group (note the dash before the PGID)
kill -- -12345            # signal the whole group

# To every process of a user
pkill -u alice

# To the current shell's children
kill %1                   # job 1 in this shell
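The same operations are available from inside a program. A Python sketch using os.kill against a throwaway child process:

```python
import os
import signal
import subprocess

# Spawn a throwaway child and terminate it politely
child = subprocess.Popen(["sleep", "100"])
os.kill(child.pid, signal.SIGTERM)       # equivalent of: kill -TERM <pid>

# Popen.wait() reports death-by-signal as a negative return code
print(child.wait())                      # -15 = killed by SIGTERM
```

os.killpg(pgid, sig) is the programmatic equivalent of `kill -- -PGID` for signaling a whole process group.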

From inside a program

Every language exposes signals somehow. Here are the patterns:

# Python
import signal, sys, time

def on_term(signum, frame):
    print("Got SIGTERM — shutting down cleanly", flush=True)
    # do cleanup
    sys.exit(0)

signal.signal(signal.SIGTERM, on_term)
signal.signal(signal.SIGHUP,  lambda *_: reload_config())  # reload_config: your app's own reload routine
while True:
    time.sleep(1)
// Go
package main

import (
    "context"
    "os/signal"
    "syscall"
    "log"
    "time"
)

func main() {
    ctx, stop := signal.NotifyContext(context.Background(),
        syscall.SIGTERM, syscall.SIGINT)
    defer stop()

    <-ctx.Done()
    log.Println("draining for up to 25s...")
    shutdownCtx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
    defer cancel()
    server.Shutdown(shutdownCtx) // server: your *http.Server, started elsewhere
}
PRO TIP

If you remember one pattern from this lesson, remember this: catch SIGTERM and SIGINT, run the same cleanup for both, with a deadline. That handles Kubernetes pod termination, Docker stops, systemd systemctl stop, and Ctrl-C in development all at once. The deadline matters because your orchestrator will SIGKILL you if you take too long — so your cleanup should finish inside whatever grace period your platform gives.
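That pattern can be sketched in Python: wait for either signal synchronously, then use alarm() as the hard deadline. The 25-second figure and the cleanup steps are assumptions, not part of any platform's API; the self-sent SIGTERM is only there so the sketch runs standalone:

```python
import os
import signal

# Treat SIGTERM and SIGINT identically: block both, then wait synchronously
sigs = {signal.SIGTERM, signal.SIGINT}
signal.pthread_sigmask(signal.SIG_BLOCK, sigs)

# Demo only: deliver SIGTERM to ourselves (in production the orchestrator
# sends it). Because it is blocked, it just becomes pending.
os.kill(os.getpid(), signal.SIGTERM)

got = signal.sigwait(sigs)                 # returns once a blocked signal is pending
print("got", signal.Signals(got).name, "- starting shutdown")

# Hard deadline: if cleanup hangs, SIGALRM fires and we exit immediately
# instead of waiting for the orchestrator's SIGKILL
signal.signal(signal.SIGALRM, lambda *_: os._exit(1))
signal.alarm(25)                           # stays inside a 30s grace period

# ... drain connections, flush buffers, close producers (hypothetical) ...

signal.alarm(0)                            # finished in time; cancel the alarm
print("clean shutdown")                    # real code would sys.exit(0) here
```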


The Graceful Shutdown Flow

Most modern orchestrators — Kubernetes, systemd, Docker — follow the same pattern:

How an orchestrator stops your process: SIGTERM, then a grace period, then SIGKILL.

The 30-second Kubernetes reality

# Kubernetes pod spec
spec:
  terminationGracePeriodSeconds: 30   # default
  containers:
    - name: app
      image: myapp:v1
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]  # buy time for endpoints to update

The exact sequence when Kubernetes terminates a pod:

  1. Pod is marked Terminating.
  2. The pod is removed from Service Endpoints (traffic stops arriving — eventually).
  3. preStop hook runs (if defined).
  4. SIGTERM is sent to PID 1 in each container.
  5. Kubelet waits up to terminationGracePeriodSeconds.
  6. If still running, SIGKILL to PID 1 in each container.

Every one of those steps has ways to go wrong.

WAR STORY

A team's API pods were dropping ~1% of requests during every rollout despite careful SIGTERM handling. Traces showed the drops happened in the first second after SIGTERM, not at the SIGKILL boundary. Root cause: Service Endpoints take ~1–2 seconds to propagate across the cluster, so traffic was still landing on pods that had already stopped accepting. The fix was a preStop: sleep 5 hook — no code change, just give the endpoint controllers time to catch up before closing the listener. Request drops went to zero. Signals alone cannot fix races that live above the process level.


The PID 1 Signal Problem in Containers

Linux treats PID 1 specially: the kernel delivers a signal to PID 1 only if PID 1 has installed a handler for it; signals left at their default disposition are silently discarded. This is a safety mechanism so that PID 1 (systemd, init) cannot be accidentally killed, and the same rule applies to PID 1 of a container's PID namespace (SIGKILL from outside the namespace — e.g. from the kubelet — still works).

Problem: when you run a container with CMD ["python", "app.py"], the Python process is PID 1 inside the container. If Python has not installed a SIGTERM handler, SIGTERM is silently ignored. Your pod hits the grace period, then SIGKILL, and you lose in-flight work every time.

Worse: many Dockerfiles look like this:

# BAD: shell form — runs /bin/sh -c "python app.py"
CMD python app.py

Now /bin/sh is PID 1. sh does not forward signals to its children. SIGTERM hits sh, gets ignored, your Python process never hears about it.

The fixes, in order of quality:

# GOOD: exec form — python is PID 1 directly
CMD ["python", "app.py"]

# BETTER: use a real init
RUN apt-get update && apt-get install -y tini
ENTRYPOINT ["tini", "--"]
CMD ["python", "app.py"]

# ALSO GOOD: let Docker add tini for you
# $ docker run --init myapp
# $ in k8s, there's no built-in --init; bake tini into the image instead
# In Python, always install SIGTERM when you run as PID 1
import signal, sys
signal.signal(signal.SIGTERM, lambda *_: sys.exit(0))
KEY CONCEPT

Two rules for containers: (1) use exec form (CMD ["python", "app.py"]) in Dockerfiles, not shell form; (2) install a SIGTERM handler in your app, or use tini as the entrypoint. Skipping either one turns graceful shutdown into a coin flip.


Debugging Signal Problems

Did the process even get the signal?

# Trace a running process and watch for signals
sudo strace -p $PID -e trace=signal
# ...
# --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=12345, si_uid=1000} ---
# ... (the signal arrived)

# Or show all signals a process has caught so far
grep -E 'SigCgt|SigIgn|SigBlk' /proc/$PID/status
# SigIgn: 0000000000001000   <- SIGPIPE (bit 13)
# SigCgt: 0000000180014a07   <- a handful of handled signals
# SigBlk: 0000000000000000
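Turning those hex masks into names by hand is tedious; a small Python decoder (a sketch — bit i in the mask corresponds to signal i+1, and signal.Signals covers the standard numbers):

```python
import signal

def decode_mask(hexmask):
    """Turn a /proc/<pid>/status Sig* hex mask into signal names."""
    n = int(hexmask, 16)
    names = []
    for i in range(64):
        if n >> i & 1:                      # bit i set -> signal i+1
            try:
                names.append(signal.Signals(i + 1).name)
            except ValueError:              # real-time signals have no enum name
                names.append("SIG" + str(i + 1))
    return names

print(decode_mask("0000000000001000"))      # ['SIGPIPE']  (bit 12 -> signal 13)
```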

Is the process stuck in a handler?

# What is this process doing right now?
cat /proc/$PID/stack 2>/dev/null | head -10    # kernel stack; root-only
cat /proc/$PID/wchan                            # what syscall is it blocked in

What signal killed this process?

When a process dies from an unhandled signal, the shell reports the exit status as 128 + signal_number:

Exit code   What killed it
130         SIGINT (Ctrl-C)
137         SIGKILL
139         SIGSEGV
143         SIGTERM
# Your last command's exit status
./maybe_crashy
echo $?
# 139  -> segfault

# systemd tells you directly
systemctl status myapp.service
# ... Active: failed (Result: signal) since ...; code=killed, signal=TERM

# Kubernetes tells you in `kubectl describe pod`
# State: Terminated
#   Reason: Error
#   Exit Code: 137      <- SIGKILL, usually OOM-kill by the kubelet
#   Signal: 9
PRO TIP

Exit Code: 137 in a Kubernetes pod almost always means OOMKilled. The kubelet or the kernel OOM killer sent SIGKILL because the container exceeded its memory limit. If you see 137 on a pod that was not OOMKilled, check kubectl get events — sometimes the kubelet escalates to SIGKILL because your SIGTERM handler took too long.


Job Control: Signals at the Shell Level

Every interactive shell is full of signal magic you use without thinking.

# Start a long-running process
sleep 100
# Ctrl-Z      -> shell sends SIGTSTP; sleep stops
# [1]+ Stopped  sleep 100

jobs
# [1]+ Stopped  sleep 100

bg        # send SIGCONT, run in background
fg        # bring back to foreground

# Start in background
sleep 100 &
# [1] 12345

# Survive logout (ignore SIGHUP)
nohup sleep 100 &
disown %1      # shell stops tracking it
  • Ctrl-C → SIGINT → to the foreground process group
  • Ctrl-\ → SIGQUIT → to the foreground process group (core-dumps on crash)
  • Ctrl-Z → SIGTSTP → to the foreground process group
  • Closing the terminal → SIGHUP → to the session leader (your shell), which in turn HUPs its jobs

nohup works by setting SIGHUP's disposition to ignored before exec'ing your command (and redirecting output to nohup.out when stdout is a terminal), so the hang-up from closing the SSH session is safely dropped.
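nohup's core move is just that disposition change; a Python sketch (the self-sent SIGHUP stands in for the terminal hang-up):

```python
import os
import signal

# What nohup does before exec'ing your command: ignore SIGHUP
signal.signal(signal.SIGHUP, signal.SIG_IGN)

# A hang-up (simulated here) is now dropped instead of killing the process
os.kill(os.getpid(), signal.SIGHUP)
print("still running after SIGHUP")
```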


Key Concepts Summary

  • Signals are small integers that the kernel delivers asynchronously. No payload beyond the fact itself.
  • SIGKILL (9) and SIGSTOP (19) cannot be caught. The kernel enforces this.
  • SIGTERM (15) is the polite "please exit." Catch this for graceful shutdown.
  • SIGHUP (1) has two meanings. Historically it meant "the controlling terminal went away"; daemons, having no terminal, repurposed it to mean "reload config."
  • Every orchestrator's graceful stop is SIGTERM → wait → SIGKILL. Kubernetes defaults to 30s. Your handler must finish inside that window.
  • PID 1 is special. The kernel ignores unhandled signals to PID 1. In containers this is usually your app — install a SIGTERM handler or use tini.
  • Shell form CMD python app.py makes /bin/sh PID 1. It does not forward signals. Use exec form CMD ["python", "app.py"].
  • Exit code N ≥ 128 means "killed by signal N-128." 137 = SIGKILL, 143 = SIGTERM, 139 = SIGSEGV.
  • Use strace -e trace=signal to see signals in flight. Use /proc/[pid]/status Sig* fields to see handler state.

Common Mistakes

  • Writing beautiful SIGTERM cleanup code, then shipping with CMD python app.py so /bin/sh is PID 1 and the handler never runs.
  • Relying on SIGKILL for shutdown. SIGKILL bypasses all your cleanup — flushed buffers, connection draining, distributed locks — nothing runs. Reach for it only after SIGTERM fails.
  • Assuming SIGINT and SIGTERM behave the same. They usually should, but a library you depend on might handle one and not the other. Catch both.
  • Putting I/O or complex logic inside a signal handler. Handlers run in a restricted async-signal-safe context; printf, malloc, and most library functions are unsafe. A common pattern: the handler sets a flag, and the main loop checks the flag.
  • Letting the graceful shutdown deadline match the orchestrator's deadline exactly. If Kubernetes gives you 30s, aim to finish in 25. The extra margin is for clock skew, paging delays, and unexpectedly slow I/O.
  • Forgetting that SIGPIPE kills your process by default. A web server writing to a client that hung up will die silently unless it has SIGPIPE handled or ignored.
  • Killing the process group when you meant the process, or vice versa. kill -9 1234 is one process; kill -9 -1234 (note the dash) is the whole group.
  • Reading exit code 137 in logs and blaming "signal 137." It is signal 137-128 = 9 = SIGKILL, almost always from the OOM killer or a grace-period expiry.
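The flag-in-handler pattern from the list above, sketched in Python (where the async-signal-safety rules bite less than in C, but the shape is the same — the self-sent SIGTERM stands in for the orchestrator):

```python
import os
import signal
import time

shutting_down = False

def on_term(signum, frame):
    global shutting_down
    shutting_down = True        # only flip a flag: no I/O, no allocation

signal.signal(signal.SIGTERM, on_term)
os.kill(os.getpid(), signal.SIGTERM)    # simulate the orchestrator's SIGTERM

# The main loop notices the flag at a safe point and drains there
while True:
    if shutting_down:
        print("draining at a safe point in the main loop")
        break
    time.sleep(0.1)
```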

KNOWLEDGE CHECK

Your Kubernetes pod logs show clean shutdown messages sometimes, but not others. kubectl describe pod shows `Exit Code: 143` on clean exits and `Exit Code: 137` on the bad ones. Your code catches SIGTERM and takes about 40 seconds to finish draining connections. What is happening and what should you change?