Linux Fundamentals for Engineers

Services in Production

Your team runs a Go binary as a systemd service. The unit file is eight lines. It has worked for two years. Then a memory leak appears: after about three days of uptime the process grows to 20 GB and the OOM killer takes it out. Every restart resets the clock — but every restart also takes down the pod for 45 seconds, because the old connections cannot drain before systemd brings the service back up with a hard Restart=always.

Most production service unit files are too simple. They have an ExecStart= and nothing else. When things go wrong — crashes, hangs, memory leaks, graceful shutdown, resource isolation, log verbosity, seccomp sandboxing — the answers all live in unit-file options you did not know existed. This lesson is the production-grade service unit: every option that matters, what it does, and when to reach for it.


A Minimal but Correct Service

Start with a baseline that is safe by default:

# /etc/systemd/system/myapp.service
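[Unit]
Description=My Web App
After=network-online.target
Wants=network-online.target
Documentation=https://docs.example.com/myapp
# Crash-loop protection: give up after 3 starts in 60 seconds
StartLimitIntervalSec=60
StartLimitBurst=3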

[Service]
Type=simple
User=myapp
Group=myapp
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/bin/server
Restart=on-failure
RestartSec=5
TimeoutStartSec=60
TimeoutStopSec=30

# Hardening — safe on almost any app
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/lib/myapp /var/log/myapp

# Resource limits
MemoryMax=2G
TasksMax=256

[Install]
WantedBy=multi-user.target

That is a solid production template. The rest of this lesson walks through each section and when to tune it.


Type=: How systemd Decides the Service Is "Started"

The single most misunderstood option. systemd needs to know when your ExecStart has "finished starting" so it can start units with After= on yours. The Type= tells it how to decide.

Type | When systemd considers the service started | Use for
simple | Immediately after forking ExecStart (no wait) | Modern foreground programs that do not daemonize
exec | After execve() of ExecStart succeeds, so a missing or non-executable binary is reported as a start failure | Same as simple; safer
forking | When the parent ExecStart process exits, leaving a child | Traditional daemons that double-fork (old nginx, old postfix)
oneshot | When ExecStart exits (does not keep running) | Scripts, setup tasks; RemainAfterExit=yes for config-only units
notify | When the service sends READY=1 via sd_notify | Services that report readiness explicitly
notify-reload | Same as notify, also uses sd_notify for reloads | Modern well-behaved daemons
dbus | When the service takes a name on the D-Bus | D-Bus-based services
idle | Like simple, but delays start until other jobs finish (reduces log interleaving at boot) | Services where log output is important at boot

Most of the time you want simple (or exec). If your app forks into the background and writes a PID file, that is the old pattern — you can still make it work with forking, but you will need PIDFile= and the app had better be reliable about it.

KEY CONCEPT

If you are writing a new service and you have the ability to change the app: use Type=notify and have the app call sd_notify(READY=1) when it is ready to accept traffic. systemd waits for that signal to consider the service started — which means After=yourservice.service actually means "after it is ready," not "after the process exists." This is the single biggest win in production unit files.

Type=notify in practice

// Go: import "github.com/coreos/go-systemd/v22/daemon"
// Call once the server is ready to accept traffic
daemon.SdNotify(false, daemon.SdNotifyReady)

// When graceful shutdown begins, tell systemd the service is stopping
daemon.SdNotify(false, daemon.SdNotifyStopping)

// Periodic health beat (if WatchdogSec= set)
daemon.SdNotify(false, daemon.SdNotifyWatchdog)

# Python: no dependencies, just talk to the Unix socket directly
import os, socket
notify = os.environ.get("NOTIFY_SOCKET")
if notify:
    if notify.startswith("@"):
        notify = "\0" + notify[1:]   # abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as s:
        s.connect(notify)
        s.sendall(b"READY=1")

And in the unit file:

[Service]
Type=notify
ExecStart=/opt/myapp/bin/server
# If the service doesn't ping the watchdog within 30s, systemd kills it.
# (Comments need their own line: unit files do not support trailing comments.)
WatchdogSec=30s
Restart=on-failure
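
If you set WatchdogSec=, the app has to send the pings itself. Here is a minimal Python sketch of that side, assuming the service runs under systemd with Type=notify. The helper names sd_notify and watchdog_interval are made up for this example; NOTIFY_SOCKET and WATCHDOG_USEC are the environment variables systemd sets. Pinging at half the configured timeout is a common convention, not a requirement.

```python
import os
import socket
from typing import Optional


def sd_notify(state: str) -> bool:
    """Send one notification datagram to systemd; return False if not under systemd."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        path = "\0" + path[1:]  # abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True


def watchdog_interval() -> Optional[float]:
    """Return seconds between pings (half of WatchdogSec=), or None if disabled."""
    usec = os.environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    return int(usec) / 1_000_000 / 2
```

In the main loop, call sd_notify("WATCHDOG=1") only after a real health check passes. A ping sent from a background timer that keeps running while the main loop is stuck defeats the purpose.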

Restart=: What Happens When the Service Exits

Restart= tells systemd what to do when your process exits.

Value | Meaning
no | Never restart. Default.
on-success | Restart only after a clean exit (exit code 0 or a clean signal such as SIGTERM). Rarely useful.
on-failure | Restart on non-zero exit, signal, timeout, or watchdog. The most common choice.
on-abnormal | Restart on signal, timeout, or watchdog (but not on a plain non-zero exit code).
on-abort | Restart only when the process is killed by an uncaught signal (SIGABRT, SIGSEGV, etc.).
on-watchdog | Restart only when the watchdog fires.
always | Restart no matter why it exited. Useful for "must always be running" services.

Pair Restart= with RestartSec= (how long to wait before restarting) and StartLimitIntervalSec= / StartLimitBurst= (crash loop protection):

[Unit]
StartLimitIntervalSec=60
# Max 3 starts in 60 seconds
StartLimitBurst=3

[Service]
Restart=on-failure
# Wait 5s between restart attempts
RestartSec=5

With the above, systemd allows at most 3 starts within 60 seconds. If the service crashes a third time inside that window, the next restart is refused and the unit is left in the failed state. This prevents runaway crash loops from burning CPU.

WARNING

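Restart=always is tempting but dangerous without a start limit that can actually trip. systemd's default limit is 5 starts in 10 seconds, and a service with RestartSec=5 never starts that often, so a service that crashes on startup (misconfig, missing dependency) will restart in an infinite loop, filling the journal and burning cycles. Always pair Restart= with StartLimitIntervalSec= and StartLimitBurst= values chosen so that StartLimitBurst restarts, spaced RestartSec apart, fit comfortably inside the interval. The default of 5 starts in 10 seconds is a reasonable minimum only when RestartSec is around a second or less.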
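
To see how RestartSec= interacts with the start limit, it helps to simulate it. The sketch below is a simplified model of a counting window, not systemd's actual implementation: it assumes the service crashes the instant it starts and that each restart happens exactly RestartSec later. The function name and the one-hour horizon are arbitrary choices for the example.

```python
from typing import Optional


def starts_until_limit(restart_sec: float, interval: float, burst: int,
                       horizon: float = 3600.0) -> Optional[int]:
    """Count starts of an instantly-crashing service until the limit trips.

    Returns None if the limit never trips within `horizon` seconds.
    """
    window_begin = None
    in_window = 0
    starts = 0
    t = 0.0
    while t <= horizon:
        if window_begin is None or t - window_begin > interval:
            window_begin, in_window = t, 0   # a new counting window opens
        if in_window >= burst:
            return starts                    # this start is refused
        in_window += 1
        starts += 1
        t += restart_sec                     # crash, wait RestartSec, retry
    return None
```

Under this model, starts_until_limit(5, 60, 3) returns 3, matching the unit above, while starts_until_limit(5, 10, 5) returns None: with RestartSec=5 and the default limit, the loop never stops on its own.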


Graceful Shutdown: The TimeoutStop, KillSignal, and KillMode Trio

When you run systemctl stop myapp, here is what happens by default:

  1. systemd sends SIGTERM to every process in the service's control group.
  2. Waits up to TimeoutStopSec= (default 90 seconds).
  3. If any process is still running, sends SIGKILL to whatever is left in the cgroup.

You tune this with:

[Service]
# The "please exit" signal (default)
KillSignal=SIGTERM
# How long to wait before SIGKILL
TimeoutStopSec=30s
# Default; set to no to never escalate (dangerous)
SendSIGKILL=yes
# Default: SIGTERM, then SIGKILL at timeout, both sent to the whole cgroup
KillMode=control-group
# Alternatives: mixed (SIGTERM main only, SIGKILL the rest), process (only main), none (never kill)

The default KillMode=control-group is usually what you want: every process in the cgroup gets a chance to shut down gracefully, and any stragglers are cleaned up with SIGKILL. If helper processes must stay alive until the main process has finished draining, use KillMode=mixed so that only the main process receives SIGTERM.

Make sure your app actually handles SIGTERM (see Module 2, Lesson 2). Otherwise you are just waiting 30 seconds before the inevitable SIGKILL.
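
What "handles SIGTERM" means in practice: record the request in the signal handler, and let the main loop finish its current work and drain. The following is a minimal Python sketch under that assumption. The names serve, handle_one_request, and drain are placeholders for whatever your app's loop and cleanup look like.

```python
import signal
import threading

shutdown_requested = threading.Event()


def _on_sigterm(signum, frame):
    # Only record the request; the main loop does the actual draining.
    shutdown_requested.set()


def install_sigterm_handler() -> None:
    """Register the handler. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, _on_sigterm)


def serve(handle_one_request, drain) -> None:
    """Run until SIGTERM arrives, then drain and return."""
    while not shutdown_requested.is_set():
        handle_one_request()
    drain()
```

Whatever drain() does has to finish inside TimeoutStopSec=, or systemd escalates to SIGKILL partway through.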


Running As a Non-Root User

Never run a service as root unless it genuinely needs root. The User= and Group= options drop privileges before execve():

[Service]
User=myapp
Group=myapp
# or, to create a dynamic transient user just for this service (no /etc/passwd entry needed):
DynamicUser=yes

DynamicUser=yes is a systemd feature that creates a transient UID/GID pair just for this service, scoped to the service's lifetime. Combined with ProtectHome=, ProtectSystem=, PrivateTmp=, it gives you near-container-level isolation without a container runtime.


Environment, Working Directory, and Paths

[Service]
WorkingDirectory=/opt/myapp

# Inline environment
Environment="FOO=bar" "PORT=8080" "LOG_LEVEL=info"

# Environment from a file (the common pattern — put secrets here, mode 0600)
EnvironmentFile=/etc/myapp/env
# Leading '-' = ignore if missing
EnvironmentFile=-/etc/myapp/env.optional

# Pre- and post-start hooks
ExecStartPre=/opt/myapp/bin/migrate-db
ExecStartPost=/opt/myapp/bin/register
ExecStop=/opt/myapp/bin/deregister
ExecStopPost=/opt/myapp/bin/cleanup
ExecReload=/bin/kill -HUP $MAINPID

The EnvironmentFile= pattern is how production services load secrets and config without baking them into the unit. Keep the file 0600 and owned by the service user (or root), and drop it in from your config management.
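
A deploy script can check the 0600 rule before it restarts anything. This is a small Python sketch of such a check; the function name is invented for the example, and it only looks at permission bits, not ownership.

```python
import os
import stat


def env_file_is_private(path: str) -> bool:
    """True if no group or other permission bits are set on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

For example, env_file_is_private("/etc/myapp/env") would be False for a file left at 0644 and True after chmod 600.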


Hardening: The Options Everyone Should Use

The following options take a service from "runs as a user" to "runs in a sandbox." They are free — no code changes, no performance cost on normal workloads — and they block entire classes of attacks.

[Service]
# Filesystem access
# Entire filesystem read-only except /dev, /proc, /sys (and ReadWritePaths= below)
ProtectSystem=strict
# /home, /root, /run/user inaccessible (set to read-only if needed)
ProtectHome=true
# Private /tmp and /var/tmp; the service cannot see host tmp
PrivateTmp=true
# The specific dirs my app needs to write
ReadWritePaths=/var/lib/myapp /var/log/myapp

# Process capabilities
# Disables suid, privilege escalation
NoNewPrivileges=true
# Cannot create suid/sgid files
RestrictSUIDSGID=true
# Only the listed capabilities allowed
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
# And these are granted to the process
AmbientCapabilities=CAP_NET_BIND_SERVICE

# Network
# No network at all: only for sandboxed batch jobs, so commented out here
# PrivateNetwork=true
# Block AF_NETLINK, AF_PACKET, etc.
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
# Deny everything, then allow specific ranges with IPAddressAllow=
IPAddressDeny=any
IPAddressAllow=10.0.0.0/8 127.0.0.0/8

# Kernel
# Cannot write to /proc/sys, /sys
ProtectKernelTunables=true
# Cannot load kernel modules
ProtectKernelModules=true
# Cannot read the kernel log
ProtectKernelLogs=true
# Cannot change the system clock
ProtectClock=true
# Cannot modify cgroups
ProtectControlGroups=true

# Syscalls
# A curated allow-list for typical services
SystemCallFilter=@system-service
# Block obscure architecture emulations
SystemCallArchitectures=native

PRO TIP

systemd-analyze security myapp.service scores your unit's exposure on a 0–10 scale (lower is better), listing every option that could be tightened. Run it on a service you wrote and on a distro-packaged one like sshd and compare: you will learn more from that diff than from any security guide.


Resource Limits: cgroup-Backed Isolation

systemd uses cgroups under the hood. Every service gets its own cgroup, and you can set resource limits on it in the unit file:

[Service]
# CPU
# Max 2 full cores of CPU time
CPUQuota=200%
# Relative priority (default 100; higher = more)
CPUWeight=500
# Pin to specific cores
AllowedCPUs=0-3

# Memory
# Hard limit; OOM-killed if exceeded
MemoryMax=2G
# Soft limit; throttled beyond this
MemoryHigh=1.5G
# No swap for this service
MemorySwapMax=0

# Tasks (processes + threads)
TasksMax=512

# I/O
# Relative I/O priority
IOWeight=500
# 100 MB/s read cap on this device
IOReadBandwidthMax=/dev/nvme0n1 100M

These limits are enforced by the kernel, not by systemd. If your process tries to allocate past MemoryMax=, it gets OOM-killed — not your whole system.

# See a service's current resource usage
systemctl status myapp.service
# ...
#    Memory: 483.2M (max: 2.0G)           <- from MemoryMax
#     Tasks: 17 (limit: 512)
#       CPU: 2m 12.345s

# Or the full cgroup view
systemd-cgtop
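
Because the limits live in the cgroup, you can also read them straight from the filesystem, which is handy for a metrics exporter. The sketch below assumes a cgroup v2 (unified) hierarchy, where a system service's directory would typically be /sys/fs/cgroup/system.slice/myapp.service; that path and the function names are assumptions for this example.

```python
from pathlib import Path
from typing import Optional


def read_cgroup_value(cgroup_dir: str, name: str) -> Optional[int]:
    """Read a cgroup v2 file such as memory.max; "max" means unlimited (None)."""
    text = (Path(cgroup_dir) / name).read_text().strip()
    return None if text == "max" else int(text)


def memory_headroom(cgroup_dir: str) -> Optional[int]:
    """Bytes left before MemoryMax= is hit, or None if no limit is set."""
    limit = read_cgroup_value(cgroup_dir, "memory.max")
    current = read_cgroup_value(cgroup_dir, "memory.current")
    if limit is None or current is None:
        return None
    return limit - current
```

Alerting when the headroom shrinks steadily over days is one way to catch a slow leak like the one in the opening story before the OOM killer does.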

Socket Activation: Services That Only Run When Needed

Let systemd listen on the port and only start your service when a connection comes in.

# /etc/systemd/system/myapp.socket
[Unit]
Description=My App Socket

[Socket]
ListenStream=8080
# Accept=no (the default) hands the listening socket to one long-running service.
# Accept=yes spawns one service instance per connection (rarely what you want).
Accept=no

[Install]
WantedBy=sockets.target

# /etc/systemd/system/myapp.service
[Unit]
Description=My App
Requires=myapp.socket

[Service]
Type=notify
ExecStart=/opt/myapp/bin/server
# No extra option needed: the listening socket is passed in as fd 3,
# with LISTEN_FDS=1 and LISTEN_PID set in the service's environment.

[Install]
Also=myapp.socket

Enable the socket, not the service:

sudo systemctl enable --now myapp.socket
# systemd now listens on 8080. First connection triggers the service.

Why you might care:

  • Faster boot. systemd binds the port early; actual services start lazily.
  • Zero-downtime restarts. Socket stays bound while the service process restarts; connections queue.
  • Resource savings. On a machine with 50 rarely-used services, they do not have to all be running.

Getting socket activation right requires app cooperation: your app must accept a pre-bound listener instead of binding its own. Go's activation.Listeners() (from coreos/go-systemd) and Python's systemd.daemon.listen_fds() (from python-systemd) both support this.
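
If you would rather not pull in a library, the handoff convention is small enough to read by hand: systemd sets LISTEN_PID and LISTEN_FDS, and the inherited descriptors start at fd 3. Here is a dependency-free Python sketch of that check. The function takes the environment and PID as arguments purely so it is easy to test; that signature is a choice made for this example, not an established API.

```python
import os
from typing import List, Mapping

SD_LISTEN_FDS_START = 3  # first inherited descriptor in the systemd convention


def listen_fds(environ: Mapping[str, str], pid: int) -> List[int]:
    """Return inherited listening fds, or [] if not socket-activated."""
    if environ.get("LISTEN_PID") != str(pid):
        return []  # variables are absent or were meant for another process
    try:
        count = int(environ.get("LISTEN_FDS", "0"))
    except ValueError:
        return []
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))
```

In the app you would call listen_fds(os.environ, os.getpid()), wrap each descriptor with socket.socket(fileno=fd), and fall back to binding your own port when the list is empty so the binary still runs outside systemd.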


ExecStart Gotchas

A few things about ExecStart= that catch people:

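  • Commands are not run through a shell. Pipes, redirects, and globs do nothing, so do not write ExecStart=echo hello | tee file and expect a pipeline. If you need shell features, invoke a shell explicitly: ExecStart=/usr/bin/sh -c "echo hello | tee file".
  • Variable expansion is systemd's own, not the shell's. ${FOO} expands to a single argument, while a bare $FOO is split on whitespace into several. The values come from Environment=, EnvironmentFile=, and a few variables systemd sets itself, such as $MAINPID. Write $$ for a literal dollar sign. There is no command substitution or other shell evaluation.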
  • Relative paths do not work. Always absolute paths for the binary.
  • Multiple ExecStart= lines are allowed only for Type=oneshot.
  • ExecStartPre= with a leading - ignores failure. Useful for idempotent setup: ExecStartPre=-/usr/bin/mkdir -p /var/lib/myapp.
  • ExecStopPost= runs even on start failure. Use it for cleanup that must run regardless.

Debugging a Service That Is Not Behaving

# Current state
systemctl status myapp.service

# Last 50 lines of its log, including previous runs
journalctl -u myapp.service -n 50 --no-pager

# Live tail
journalctl -u myapp.service -f

# Only errors
journalctl -u myapp.service -p err..crit

# What does the effective unit look like, after drop-ins?
systemctl cat myapp.service

# Show every single resolved property
systemctl show myapp.service

# Security grade
systemd-analyze security myapp.service

# Dependencies in both directions
systemctl list-dependencies myapp.service
systemctl list-dependencies --reverse myapp.service

# See how long start took
systemd-analyze blame | grep myapp
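
When you need these properties in a script rather than on screen, the KEY=VALUE output of systemctl show is easy to parse. A minimal Python sketch follows; the function name is made up for the example, and it assumes one property per line.

```python
from typing import Dict


def parse_show_output(text: str) -> Dict[str, str]:
    """Turn KEY=VALUE lines from `systemctl show` into a dict."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without '='
            props[key] = value
    return props
```

Feed it the stdout of subprocess.run(["systemctl", "show", "myapp.service"], capture_output=True, text=True) and look up keys such as Type or Restart. Splitting on the first '=' only keeps values like Environment=FOO=bar intact.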

WAR STORY

A team had a Python service that would "work for a day, then freeze." No crashes in the log, the process was still running, just not responding. Adding WatchdogSec=60 and calling sd_notify(WATCHDOG=1) every 30 seconds from the main loop fixed it: when the main loop hung (turned out to be a bad DB connection pool), the watchdog expired, systemd killed and restarted the service. Before, the process lived on in a zombie-like state and needed manual intervention. The total unit-file change was two lines. Every long-running service should have a watchdog.


Key Concepts Summary

  • Type=simple or Type=notify are the modern choices. notify is best if your app can signal readiness.
  • Restart=on-failure with RestartSec and StartLimitBurst is the crash-handling default. Always pair Restart= with a burst limit.
  • KillMode=control-group plus TimeoutStopSec= gives your app a chance to drain before SIGKILL.
  • Never run as root if you can avoid it. User=, Group=, or DynamicUser=yes.
  • The hardening block is nearly free: NoNewPrivileges, ProtectSystem=strict, ProtectHome, PrivateTmp, RestrictSUIDSGID on every service.
  • Resource limits are cgroup-backed. MemoryMax=, CPUQuota=, TasksMax= are enforced by the kernel.
  • Drop-ins for customization. systemctl edit + systemctl cat. Never edit the distro-shipped file.
  • Socket activation decouples listening from running. Useful for zero-downtime restarts and rarely-used services.
  • Watchdog detects hangs. WatchdogSec= + sd_notify(WATCHDOG=1) turns "the service hung again" into "it restarted itself."
  • systemd-analyze security grades your hardening. systemctl show reveals every resolved property.

Common Mistakes

  • Using Type=simple when the service daemonizes. systemd sees the original process exit, treats the service as finished, and kills the daemonized child along with the rest of the cgroup. Use Type=forking with a PIDFile=, or better, remove the daemonization from the app.
  • Setting Restart=always with no burst limit. A crash-on-startup bug becomes an infinite loop.
  • Running as root when a dedicated user would work. Root inside a service is root on the host — ransomware for free.
  • Skipping the hardening block. Every option in the "Hardening" section is essentially free. Not setting them is leaving exploits on the table.
  • Assuming daemon-reload is optional. It is not. Any change to a unit file requires systemctl daemon-reload before the change takes effect.
  • Putting secrets in Environment="KEY=value". Unit files are world-readable, and any user can read the values with systemctl show. Use EnvironmentFile= with 0600 permissions instead.
  • Forgetting that After= alone does not start the dependency. Use Wants= or Requires= alongside.
  • Assuming KillMode= applies to your graceful shutdown. SIGTERM is sent first — if your app does not handle it, the value of KillMode= does not matter.
  • Setting TimeoutStopSec=infinity for safety. You will eventually leave a zombie on a hang that needs a reboot to fix.

KNOWLEDGE CHECK

Your service unit has `Type=simple`, `Restart=always`, `RestartSec=1`. A bug in your latest deploy makes the binary crash on startup immediately. What happens?