Signals and Process Control
Your team ships a Go service to Kubernetes. It has a lovely graceful-shutdown handler: drain in-flight HTTP requests, flush buffered writes to the database, close the Kafka producer cleanly. In staging, restarts are smooth. In production, every rollout loses a handful of requests, and the Kafka consumer lag spikes briefly on every pod termination.
Someone blames Kubernetes. Someone else blames the load balancer. The actual culprit is signals. Kubernetes sends your pod SIGTERM and waits 30 seconds (`terminationGracePeriodSeconds`). If you have not exited by then, it sends SIGKILL. Your graceful shutdown code runs — but one Go library you pulled in installs its own SIGTERM handler, which calls `os.Exit(0)` before your cleanup finishes. Another service runs as PID 1 in the container; a bash entrypoint does not forward signals to the child, so SIGTERM goes to `/bin/bash -c ...` and the actual app never sees it.

Signals are the thinnest layer in all of Linux. A single number, a single sentence in the man page — and yet every production incident tied to "it does not shut down cleanly" is a signals problem. This lesson gives you the model that makes those incidents diagnosable in minutes.
What a Signal Is
A signal is a small integer (1 through 64, give or take) that one process sends to another to say "something happened" or "do this." The kernel delivers signals asynchronously: the target process is interrupted — possibly mid-instruction — and either runs a handler function, gets killed, gets stopped, or ignores it entirely, depending on how the signal is configured.
Signals are the Linux analogue of interrupts. They are:
- Fast. A single kernel operation; no shared memory or sockets needed.
- Limited. Only a number. Some signals can carry a small `siginfo_t` payload, but you cannot pass a string.
- Asynchronous. The target process does not know when the signal will arrive, only that it might.
- Easy to get wrong. Handlers run in a constrained environment (only async-signal-safe functions); handlers can interrupt each other; signals can be lost if sent faster than they can be handled.
A signal is not a message. It is an event. The payload is the fact that it happened, not what it contains. Knowing which signals you should listen for and what the default action is for each — that is 80% of getting signals right in production.
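A minimal round trip, sketched in Python: one process installs a handler, another pokes it with `os.kill`. The only thing that travels is the signal number.

```python
import os, signal, time

got = False

def handler(signum, frame):
    global got
    got = True

signal.signal(signal.SIGUSR1, handler)

pid = os.fork()
if pid == 0:
    # Child: signal the parent, then exit. The "message" is just the number.
    os.kill(os.getppid(), signal.SIGUSR1)
    os._exit(0)

os.waitpid(pid, 0)   # reap the child (see SIGCHLD later)
time.sleep(0.1)      # give delivery a moment in case it races waitpid
print(got)           # True: the event arrived, with no payload
```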
The Signals You Actually Use
There are 64 signal slots, but in practice you care about maybe a dozen.
| Signal | Number | Default action | What it means |
|---|---|---|---|
| SIGHUP | 1 | Terminate | "Controlling terminal closed" — also used by daemons as "reload config" |
| SIGINT | 2 | Terminate | Ctrl-C from a terminal |
| SIGQUIT | 3 | Terminate + core dump | Ctrl-\ from a terminal — like SIGINT but dumps core |
| SIGKILL | 9 | Terminate | Cannot be caught, blocked, or ignored — the nuclear option |
| SIGBUS | 7 | Terminate + core | Bus error (misaligned memory, mmapped file truncated) |
| SIGSEGV | 11 | Terminate + core | Segmentation fault — bad memory access |
| SIGPIPE | 13 | Terminate | Wrote to a pipe/socket with no readers |
| SIGALRM | 14 | Terminate | alarm() timer expired |
| SIGTERM | 15 | Terminate | Polite "please exit" — the one you catch for graceful shutdown |
| SIGCHLD | 17 | Ignore | A child process exited (reap it!) |
| SIGCONT | 18 | Continue | Resume a stopped process |
| SIGSTOP | 19 | Stop | Cannot be caught — hard-stops the process |
| SIGTSTP | 20 | Stop | Ctrl-Z from a terminal — like SIGSTOP but catchable |
| SIGUSR1 | 10 | Terminate | User-defined — apps do whatever they want with it |
| SIGUSR2 | 12 | Terminate | User-defined — same idea |
```shell
# The complete list on your system
kill -l
#  1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL       5) SIGTRAP
#  6) SIGABRT      7) SIGBUS       8) SIGFPE       9) SIGKILL     10) SIGUSR1
# 11) SIGSEGV     12) SIGUSR2     13) SIGPIPE     14) SIGALRM     15) SIGTERM
# ...

# See what signal state a running process has
grep -E '^Sig' /proc/$PID/status
# SigQ:   2/31389
# SigPnd: 0000000000000000
# SigBlk: 0000000000000000
# SigIgn: 0000000000001000
# SigCgt: 0000000180014a07   <- bitmask of caught signals

# Decode the SigCgt bitmask into signal numbers
awk '/^SigCgt/{print $2}' /proc/$PID/status \
  | python3 -c 'import sys; n = int(sys.stdin.read(), 16); print([i + 1 for i in range(64) if n >> i & 1])'
```
SIGKILL (9) and SIGSTOP (19) are special: no program can catch, block, or ignore them. The kernel enforces this. They are the two signals that always work. When you absolutely need to kill a misbehaving process, kill -9 is the answer. But it gives the process zero chance to clean up — no flushing buffers, no closing connections, no unlocking files. Reach for SIGKILL only after SIGTERM has failed.
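You can watch the kernel enforce this. A small Python sketch: trying to change the disposition of SIGKILL or SIGSTOP fails with EINVAL.

```python
import signal

# The kernel refuses to install any disposition for SIGKILL and SIGSTOP:
# the underlying sigaction() call fails with EINVAL.
refused = 0
for sig in (signal.SIGKILL, signal.SIGSTOP):
    try:
        signal.signal(sig, signal.SIG_IGN)
        print(f"{sig.name}: changed (should never happen)")
    except OSError as err:
        refused += 1
        print(f"{sig.name}: refused ({err})")
```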
What Happens When a Signal Arrives
When the kernel decides to deliver a signal to a process, it looks at the process's signal disposition for that signal:
- Default action — kill, core dump, stop, continue, or ignore, per the table above.
- Ignore — the process has explicitly chosen to ignore this signal (via `signal(SIGPIPE, SIG_IGN)` or similar).
- Catch — the process has installed a handler function; the kernel interrupts the process and runs that function, then (usually) resumes where it left off.
- Block — strictly speaking a mask rather than a disposition: the process has temporarily blocked this signal, so delivery is deferred until the process unblocks it.
The kernel also tracks pending signals that have been sent but not yet delivered. For standard signals (1–31), only one instance of each signal can be pending at a time — if you send SIGUSR1 to a process 1000 times while it is blocked, it gets delivered once when unblocked. Real-time signals (SIGRTMIN through SIGRTMAX, typically 34 to 64 with glibc) queue properly, which is the only reason to ever use them.
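The coalescing behavior can be watched from Python: block SIGUSR1, send it to yourself three times, and see it delivered exactly once.

```python
import os, signal

# Block SIGUSR1 so deliveries pile up as "pending" instead of running a handler.
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})

# Send the same standard signal to ourselves three times while it is blocked.
for _ in range(3):
    os.kill(os.getpid(), signal.SIGUSR1)

pending = signal.sigpending()
print(signal.SIGUSR1 in pending)   # True: recorded once, not three times

# Install a counting handler, then unblock: it fires exactly once.
count = 0
def on_usr1(signum, frame):
    global count
    count += 1

signal.signal(signal.SIGUSR1, on_usr1)
signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGUSR1})
print(count)                       # 1: the three sends coalesced
```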
```shell
# Pending signals for a process
grep SigPnd /proc/$PID/status
# SigPnd: 0000000000000000
```
Sending Signals
```shell
# By PID
kill -TERM 12345             # same as kill -15 12345
kill -9 12345                # SIGKILL
kill -HUP 12345              # ask daemon to reload

# By name
pkill -TERM nginx            # kill all nginx processes politely
pkill -9 -f 'myapp --prod'   # SIGKILL anything whose command line matches

# To a whole process group (note the dash before the PGID)
kill -- -12345               # signal the whole group

# To every process of a user
pkill -u alice

# To the current shell's children
kill %1                      # job 1 in this shell
```
From inside a program
Every language exposes signals somehow. Here are the patterns:
```python
# Python
import signal, sys, time

def on_term(signum, frame):
    print("Got SIGTERM — shutting down cleanly", flush=True)
    # do cleanup
    sys.exit(0)

signal.signal(signal.SIGTERM, on_term)
signal.signal(signal.SIGHUP, lambda *_: reload_config())  # reload_config() defined elsewhere

while True:
    time.sleep(1)
```
```go
// Go
package main

import (
	"context"
	"log"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Minimal server so the snippet compiles and has something to drain.
	server := &http.Server{Addr: ":8080"}
	go func() {
		if err := server.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	ctx, stop := signal.NotifyContext(context.Background(),
		syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	<-ctx.Done() // block until SIGTERM or SIGINT arrives
	log.Println("draining for up to 25s...")

	shutdownCtx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()
	server.Shutdown(shutdownCtx)
}
```
If you remember one pattern from this lesson, remember this: catch SIGTERM and SIGINT, run the same cleanup for both, with a deadline. That handles Kubernetes pod termination, Docker stops, systemd systemctl stop, and Ctrl-C in development all at once. The deadline matters because your orchestrator will SIGKILL you if you take too long — so your cleanup should finish inside whatever grace period your platform gives.
The Graceful Shutdown Flow
Most modern orchestrators — Kubernetes, systemd, Docker — follow the same pattern:
How an orchestrator stops your process
The 30-second Kubernetes reality
```yaml
# Kubernetes pod spec
spec:
  terminationGracePeriodSeconds: 30   # default
  containers:
  - name: app
    image: myapp:v1
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]   # buy time for endpoints to update
```
The exact sequence when Kubernetes terminates a pod:
1. Pod is marked `Terminating`.
2. The pod is removed from Service `Endpoints` (traffic stops arriving — eventually).
3. `preStop` hook runs (if defined).
4. SIGTERM is sent to PID 1 in each container.
5. Kubelet waits up to `terminationGracePeriodSeconds`.
6. If still running, SIGKILL to PID 1 in each container.
Every one of those steps has ways to go wrong.
A team's API pods were dropping ~1% of requests during every rollout despite careful SIGTERM handling. Traces showed the drops happened in the first second after SIGTERM, not at the SIGKILL boundary. Root cause: Service Endpoints take ~1–2 seconds to propagate across the cluster, so traffic was still landing on pods that had already stopped accepting. The fix was a preStop: sleep 5 hook — no code change, just give the endpoint controllers time to catch up before closing the listener. Request drops went to zero. Signals alone cannot fix races that live above the process level.
The PID 1 Signal Problem in Containers
Linux treats PID 1 specially: it ignores every signal it has not installed a handler for. This is a safety mechanism so that PID 1 (systemd, init) cannot be accidentally killed.
Problem: when you run a container with CMD ["python", "app.py"], the Python process is PID 1 inside the container. If Python has not installed a SIGTERM handler, SIGTERM is silently ignored. Your pod hits the grace period, then SIGKILL, and you lose in-flight work every time.
Worse: many Dockerfiles look like this:
```dockerfile
# BAD: shell form — runs /bin/sh -c "python app.py"
CMD python app.py
```
Now /bin/sh is PID 1. sh does not forward signals to its children. SIGTERM hits sh, gets ignored, your Python process never hears about it.
The fixes, in order of quality:
```dockerfile
# GOOD: exec form — python is PID 1 directly
CMD ["python", "app.py"]

# BETTER: use a real init
RUN apt-get install -y tini
ENTRYPOINT ["tini", "--"]
CMD ["python", "app.py"]

# ALSO GOOD: let Docker add tini for you
#   $ docker run --init myapp
# In k8s, there's no built-in --init; use tini or run systemd
```

```python
# In Python, always install SIGTERM when you run as PID 1
import signal, sys
signal.signal(signal.SIGTERM, lambda *_: sys.exit(0))
```
Two rules for containers: (1) use exec form (CMD ["python", "app.py"]) in Dockerfiles, not shell form; (2) install a SIGTERM handler in your app, or use tini as the entrypoint. Skipping either one turns graceful shutdown into a coin flip.
Debugging Signal Problems
Did the process even get the signal?
```shell
# Trace a running process and watch for signals
sudo strace -p $PID -e trace=signal
# ...
# --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=12345, si_uid=1000} ---
# ... (the signal arrived)

# Or show the current handler state
grep -E 'SigCgt|SigIgn|SigBlk' /proc/$PID/status
# SigIgn: 0000000000001000   <- SIGPIPE (bit for signal 13)
# SigCgt: 0000000180014a07   <- a handful of handled signals
# SigBlk: 0000000000000000
```
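Decoding those hex bitmasks by hand is error-prone. A small helper (hypothetical, not part of any standard tool) turns a `Sig*` mask into signal names:

```python
import signal

def decode_sigmask(hex_mask):
    """Turn a Sig* hex mask from /proc/<pid>/status into signal names."""
    n = int(hex_mask, 16)
    names = []
    for i in range(64):
        if n >> i & 1:
            num = i + 1  # bit 0 corresponds to signal 1
            try:
                names.append(signal.Signals(num).name)
            except ValueError:
                names.append(f"SIG{num}")   # some numbers have no enum name
    return names

print(decode_sigmask("0000000000001000"))   # ['SIGPIPE'], i.e. signal 13
```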
Is the process stuck in a handler?
```shell
# What is this process doing right now?
cat /proc/$PID/stack 2>/dev/null | head -10   # kernel stack; root-only
cat /proc/$PID/wchan; echo                    # kernel function it is blocked in
```
What signal killed this process?
When a process dies from an unhandled signal, the exit status is 128 + signal_number:
| Exit code | What killed it |
|---|---|
| 130 | SIGINT (Ctrl-C) |
| 137 | SIGKILL |
| 139 | SIGSEGV |
| 143 | SIGTERM |
```shell
# Your last command's exit status
./maybe_crashy
echo $?
# 139 -> segfault

# systemd tells you directly
systemctl status myapp.service
# ... Active: failed (Result: signal) since ...; code=killed, signal=TERM

# Kubernetes tells you in `kubectl describe pod`
#     State:       Terminated
#       Reason:    Error
#       Exit Code: 137   <- SIGKILL, usually OOM-kill by the kubelet
#       Signal:    9
```
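The 128 + N rule can be sanity-checked from a script. A sketch using `sleep` as a throwaway victim process:

```python
import signal, subprocess

# Start a throwaway child and kill it with SIGTERM. subprocess reports a
# death-by-signal as a negative returncode.
proc = subprocess.Popen(["sleep", "60"])
proc.send_signal(signal.SIGTERM)
proc.wait()
print(proc.returncode)                        # -15: killed by signal 15

# What a shell or Kubernetes would report for the same death: 128 + N.
print(128 + -proc.returncode)                 # 143
print(signal.Signals(-proc.returncode).name)  # SIGTERM
```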
Exit Code: 137 in a Kubernetes pod almost always means OOMKilled. The kubelet or the kernel OOM killer sent SIGKILL because the container exceeded its memory limit. If you see 137 on a pod that was not OOMKilled, check kubectl get events — sometimes the kubelet escalates to SIGKILL because your SIGTERM handler took too long.
Job Control: Signals at the Shell Level
Every interactive shell is full of signal magic you use without thinking.
```shell
# Start a long-running process
sleep 100
# Ctrl-Z -> shell sends SIGTSTP; sleep stops
# [1]+  Stopped    sleep 100

jobs
# [1]+  Stopped    sleep 100

bg    # send SIGCONT, run in background
fg    # bring back to foreground

# Start in background
sleep 100 &
# [1] 12345

# Survive logout (ignore SIGHUP)
nohup sleep 100 &
disown %1   # shell stops tracking it
```
- Ctrl-C → SIGINT → to the foreground process group
- Ctrl-\ → SIGQUIT → to the foreground process group (terminates with a core dump by default)
- Ctrl-Z → SIGTSTP → to the foreground process group
- Closing the terminal → SIGHUP → to the session leader, which forwards to all children
nohup works by setting SIGHUP to ignored and redirecting output away from the terminal, so the hang-up from closing the SSH session is safely dropped.
Key Concepts Summary
- Signals are small integers that the kernel delivers asynchronously. No payload beyond the fact itself.
- SIGKILL (9) and SIGSTOP (19) cannot be caught. The kernel enforces this.
- SIGTERM (15) is the polite "please exit." Catch this for graceful shutdown.
- SIGHUP (1) has two meanings. For daemons it historically meant "reload config." For interactive processes it means "the terminal went away."
- Every orchestrator's graceful stop is SIGTERM → wait → SIGKILL. Kubernetes defaults to 30s. Your handler must finish inside that window.
- PID 1 is special. The kernel ignores unhandled signals to PID 1. In containers this is usually your app — install a SIGTERM handler or use `tini`.
- Shell form `CMD python app.py` makes `/bin/sh` PID 1. It does not forward signals. Use exec form `CMD ["python", "app.py"]`.
- Exit code N ≥ 128 means "killed by signal N-128." 137 = SIGKILL, 143 = SIGTERM, 139 = SIGSEGV.
- Use `strace -e trace=signal` to see signals in flight. Use the `Sig*` fields in `/proc/[pid]/status` to see handler state.
Common Mistakes
- Writing beautiful SIGTERM cleanup code, then shipping with `CMD python app.py` so `/bin/sh` is PID 1 and the handler never runs.
- Relying on SIGKILL for shutdown. SIGKILL bypasses all your cleanup — flushed buffers, connection draining, distributed locks — nothing runs. Reach for it only after SIGTERM fails.
- Assuming SIGINT and SIGTERM behave the same. They usually should, but a library you depend on might handle one and not the other. Catch both.
- Putting I/O or complex logic inside a signal handler. Handlers run in a restricted async-signal-safe context; `printf`, `malloc`, and most library functions are unsafe. A common pattern: the handler sets a flag, and the main loop checks the flag.
- Letting the graceful shutdown deadline match the orchestrator's deadline exactly. If Kubernetes gives you 30s, aim to finish in 25. The extra margin is for clock skew, paging delays, and unexpectedly slow I/O.
- Forgetting that SIGPIPE kills your process by default. A web server writing to a client that hung up will die silently unless it has SIGPIPE handled or ignored.
- Killing the process group when you meant the process, or vice versa. `kill -9 1234` is one process; `kill -9 -1234` (note the dash) is the whole group.
- Reading exit code 137 in logs and blaming "signal 137." It is signal 137 - 128 = 9 = SIGKILL, almost always from the OOM killer or a grace-period expiry.
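The set-a-flag pattern is worth seeing end to end. A self-contained Python sketch (the timer that delivers SIGTERM to ourselves stands in for an orchestrator):

```python
import os, signal, threading, time

shutdown_requested = False

def on_term(signum, frame):
    # Only minimal work inside the handler: set a flag, nothing else.
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGTERM, on_term)
signal.signal(signal.SIGINT, on_term)   # same cleanup path for Ctrl-C

# Simulate an orchestrator: deliver SIGTERM to ourselves shortly.
threading.Timer(0.3, os.kill, (os.getpid(), signal.SIGTERM)).start()

while not shutdown_requested:
    # ... handle one unit of work per iteration ...
    time.sleep(0.05)

# Real cleanup (I/O, flushing, logging) happens out here, not in the handler.
print("draining and exiting cleanly")
```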
Your Kubernetes pod logs show clean shutdown messages sometimes, but not others. kubectl describe pod shows `Exit Code: 143` on clean exits and `Exit Code: 137` on the bad ones. Your code catches SIGTERM and takes about 40 seconds to finish draining connections. What is happening and what should you change?