Too Many Open Files: The Linux Limit That Crashes Production at 3 AM
Your service runs fine for weeks, then suddenly fails with 'too many open files' under load. Three layers of fd limits, why the wrong one bites first, and how to set them so this stops happening.
Your Go service has been running fine for weeks. Suddenly, every request starts failing with logs like:
http: Accept error: accept tcp 0.0.0.0:8080: accept4: too many open files; retrying in 5ms
Or your Java service:
java.io.IOException: Too many open files
at sun.nio.ch.FileDispatcherImpl.init0(Native Method)
Or your Python service:
OSError: [Errno 24] Too many open files
Three runtimes, one underlying error: EMFILE. The kernel refused to give your process another file descriptor because some limit was hit. The service crashes, the orchestrator restarts it, and the pattern repeats every few minutes under load.
What confuses people: ulimit -n says 1,048,576. fs.file-max is unlimited. lsof | wc -l says 5,000 open. Where is the limit?
Three different limits exist. They overlap in confusing ways, and the one that bites depends on whether you are bare metal, a container, a process inside a container, or a non-root process. This post is the map.
What an "open file" actually is#
A file descriptor is a small integer (0, 1, 2, 3, ...) that the kernel hands back when a process opens something. The "something" covers much more than files on disk:
- An actual file on disk
- A TCP or UDP socket
- A Unix domain socket
- A pipe
- An eventfd, signalfd, timerfd
- An epoll instance
- A directory opened with opendir
- A device handle (/dev/null, /dev/random, etc.)
A typical Go HTTP service holds:
- 1 fd for the listening socket
- 1 fd per accepted connection
- 1 fd per outbound HTTP client connection (often pooled)
- A few fds for log files, stdin/stdout/stderr, epoll instances
A service handling 10,000 concurrent connections plus 1,000 outbound connections plus instrumentation is easily at 12,000+ fds. A service that leaks (forgets to close idle connections) climbs without bound until it hits a limit and dies.
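You can watch that number from inside the process itself. A minimal sketch, assuming Linux: it just counts entries in /proc/self/fd (the extra fd used to read the directory is included in the count).
package main

import (
	"fmt"
	"os"
)

func main() {
	// Each entry in /proc/self/fd is one open file descriptor of this process.
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		panic(err)
	}
	fmt.Printf("open fds: %d\n", len(entries))
}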
The three limits#
There are three distinct fd limits you can hit. The smallest of them wins.
Limit 1: Per-process soft limit (ulimit -n from inside the process)#
The classic "ulimit." Each process has a soft limit and a hard limit. The soft limit is what calls to open() actually check against. The hard limit is the ceiling the process can raise its own soft limit to.
# From inside a process or shell
ulimit -n # current soft limit
ulimit -Hn # hard limit
ulimit -n 65536 # raise soft limit (up to hard limit)
For a process inside a Kubernetes pod, this is what the kernel enforces on open() calls.
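The same numbers are available programmatically via getrlimit(2), which is handy if you want the service to log its own limits at startup. A minimal Go sketch for Linux, using the standard syscall package:
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	// rl.Cur is the soft limit new fds are checked against;
	// rl.Max is the hard ceiling the process may raise it to.
	fmt.Printf("nofile soft=%d hard=%d\n", rl.Cur, rl.Max)
}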
Limit 2: System-wide limit (fs.file-max and fs.nr_open)#
The kernel has a global cap on total open files across all processes (fs.file-max) and a cap on the maximum any single process can have (fs.nr_open).
sysctl fs.file-max # system-wide max (typically very large, e.g., 9_223_372)
sysctl fs.nr_open # per-process hard ceiling (typically 1_048_576)
# What is currently in use system-wide
cat /proc/sys/fs/file-nr
# Output: <allocated> <free> <max> e.g., 4032 0 9223372
You typically only run into these on heavily loaded shared hosts. On a Kubernetes worker node running 50 pods, fs.file-max is far above what you will hit unless something is leaking massively.
fs.nr_open matters more: it caps how high any single process can raise its own soft limit. You cannot ulimit -n 2000000 if fs.nr_open is 1,048,576.
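You can see fs.nr_open acting as the ceiling from inside a process, too. A hedged sketch (the 2,000,000 is illustrative): asking setrlimit(2) for a RLIMIT_NOFILE value above fs.nr_open fails with EPERM, even for root.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Illustrative value above a typical fs.nr_open of 1,048,576.
	want := syscall.Rlimit{Cur: 2_000_000, Max: 2_000_000}
	err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &want)
	// Fails with EPERM when the requested limit exceeds fs.nr_open
	// (and, for unprivileged processes, whenever the hard limit is raised at all).
	fmt.Println("setrlimit:", err)
}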
Limit 3: cgroup pids.max (related, but a different resource)#
Not strictly an fd limit, but commonly confused with one: pids.max caps the number of processes and threads in a cgroup. A workload that leaks OS threads (for example, a Go service whose per-request goroutines block in syscalls or cgo, pinning an OS thread each, and never finish) can hit this, and the failure looks like a fork/clone error rather than EMFILE.
cat /sys/fs/cgroup/pids.max # max in this cgroup
cat /sys/fs/cgroup/pids.current # current count
Not the focus of this post, but worth knowing for adjacent debugging.
How limits interact in Kubernetes#
Here is where it gets tangled. When a pod starts:
- The kubelet asks the container runtime (containerd, CRI-O) to start the container.
- The container runtime does not, by default, set any ulimit -n on the process. The process inherits whatever the runtime itself was started with.
- The runtime's ulimit defaults vary: containerd applies no specific override by default (it uses the host defaults); some Kubernetes distros override this in their runtime or kubelet configuration.
- The host's defaults come from systemd unit limits (LimitNOFILE), which on most modern Linux distros are set to 1,048,576 or higher.
Result: a pod typically starts with ulimit -n between 1024 (rare, old) and 1,048,576 (modern). You usually do not know which until you check.
# Check from inside the pod
kubectl exec -it $POD -- sh -c 'ulimit -n; ulimit -Hn'
# Check the actual limit on the running process
PID=$(kubectl exec -it $POD -- pgrep -f myapp | head -1 | tr -d '\r')
kubectl exec -it $POD -- cat /proc/$PID/limits | grep "Max open files"
The third check, /proc/PID/limits, is the unambiguous truth: the limit the kernel will actually enforce on this specific process.
How to set the right limit in Kubernetes#
You cannot set ulimit declaratively in a pod spec the way you set CPU and memory. There is no resources.limits.openFiles. Three approaches in practice:
Approach 1: container runtime defaults (the right answer)#
Configure the kubelet's container runtime to set sane fd limits for all containers. For containerd:
# /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
# ... other options ...
Rlimits = [
{ type = "RLIMIT_NOFILE", hard = 1048576, soft = 1048576 }
]
For CRI-O:
# /etc/crio/crio.conf
[crio.runtime]
default_ulimits = [
"nofile=1048576:1048576",
]
After restart, all new containers inherit these limits. This is the cleanest answer because every workload gets a sane default without per-pod config.
Approach 2: securityContext.sysctls (limited)#
Kubernetes lets you set certain namespaced sysctls per pod via securityContext.sysctls. People often reach for fs.file-max here, but fs.file-max is not a namespaced sysctl, so a pod spec like the following gets rejected rather than applied:
spec:
  securityContext:
    sysctls:
    - name: fs.file-max
      value: "1048576"
And even where you can raise fs.file-max (on the node itself), it is a system-wide cap, not a per-process ulimit. securityContext.sysctls is useful for the namespaced groups (net.*, kernel.shm*, fs.mqueue.*, and friends) but does not solve the typical EMFILE problem, which is a per-process soft limit issue.
Approach 3: explicit ulimit in the entrypoint#
Have the container raise its own ulimit at startup before exec'ing the real binary:
ENTRYPOINT ["/bin/sh", "-c", "ulimit -n 1048576 && exec /usr/local/bin/myapp"]
The catch: the process can only raise its soft limit up to the hard limit. If the hard limit is 1024, ulimit -n 1048576 fails. To raise the hard limit, you need CAP_SYS_RESOURCE, which your hardened (no-capabilities) container does not have.
This approach works as a band-aid when you cannot change the runtime config. It is not the right long-term answer.
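An in-process variant of the same band-aid: early in main(), raise the soft limit up to whatever the hard limit already is. Going soft-up-to-hard needs no extra capability; only raising the hard limit itself requires CAP_SYS_RESOURCE. A sketch (note that Go 1.19 and later runtimes already raise the soft limit to the hard limit at startup, so this mostly matters for older toolchains or other languages):
package main

import (
	"log"
	"syscall"
)

// raiseNofileSoftLimit bumps the soft RLIMIT_NOFILE up to the current hard limit.
func raiseNofileSoftLimit() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		log.Printf("getrlimit: %v", err)
		return
	}
	rl.Cur = rl.Max
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		log.Printf("setrlimit: %v", err)
	}
}

func main() {
	raiseNofileSoftLimit()
	// ... start the real service ...
}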
Diagnosing EMFILE in production#
When a process starts failing with "too many open files":
Step 1: Confirm the limit and the count.
PID=$(pgrep -f myapp | head -1)
# What's the limit?
cat /proc/$PID/limits | grep "Max open files"
# How many is it actually using?
ls /proc/$PID/fd | wc -l
If the count is at or near the limit, you have either a leak or a capacity problem.
Step 2: Categorize what is open.
# Group fds by type
ls -la /proc/$PID/fd | awk '{print $NF}' | grep -oE 'socket|^/.*' | sort | uniq -c | sort -rn | head
This tells you whether you are leaking file handles, sockets, or both.
Step 3: For sockets, dig deeper.
# Show socket details (state, remote addresses)
ss -tnp | grep "pid=$PID"
# Sockets in CLOSE_WAIT often indicate a leak (peer closed, you didn't)
ss -tnp state close-wait | wc -l
CLOSE_WAIT is a famous indicator of HTTP client leaks. The remote server closed the connection, your client never closed its side, and the socket sits in CLOSE_WAIT until the fd is finally closed, which in a leaking service usually means until the process is killed.
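If you want to reproduce that signature locally, this is roughly what the leaking code path looks like in Go; the URL is hypothetical, and the key detail is the response body that is never read or closed:
package main

import "net/http"

func main() {
	for i := 0; i < 10000; i++ {
		// The connection backing resp is never returned to the pool or closed.
		// When the server eventually closes its side, the socket moves to
		// CLOSE_WAIT on ours and the fd stays allocated.
		resp, err := http.Get("http://example.internal/healthz") // hypothetical URL
		if err != nil {
			continue
		}
		_ = resp // BUG: missing io.Copy(io.Discard, resp.Body) and resp.Body.Close()
	}
}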
Step 4: If file handles, find what's open.
ls -la /proc/$PID/fd | awk '{print $NF}' | grep '^/' | sort | uniq -c | sort -rn | head
Repeated entries for the same path mean a code path that opens but does not close.
The common code-level causes#
Most EMFILE bugs are application code, not infrastructure:
1. HTTP client connections that are never released. Creating a new Transport per request defeats connection pooling, and never closing the response body keeps the connection (and its fd) alive; under load, connections accumulate faster than they close.
// BAD: a fresh Transport per request means no connection reuse,
// and resp.Body is never read or closed, so the fd is never released
resp, _ := (&http.Client{Transport: &http.Transport{}}).Get(url)
_ = resp

// BETTER: one shared client with a tuned Transport, and always
// drain and close resp.Body after each request
var client = &http.Client{
	Transport: &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 10,
		IdleConnTimeout:     90 * time.Second,
	},
}
2. Database connections without pooling. Same issue with SQL drivers. Most modern drivers pool by default but only if you reuse the *sql.DB instance (it is the pool).
3. Files opened in a hot loop without defer f.Close() (Go), with (Python), or try-with-resources (Java). A function that opens a file and returns early on error without closing it leaks one fd per error; see the sketch after this list.
4. Goroutines / threads holding sockets indefinitely. A worker pool that grows without bound under load ends up with one worker per connection; each worker holds its connection open, and fds accumulate.
5. Long-lived processes with internal caches. A "cache of opened files" without eviction keeps adding fds.
For each, the fix is at the application layer. Raising ulimits hides the leak; it does not fix it.
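For cause 3, the shape of the fix is the same in every language: tie the close to the scope that opened the fd, so every return path releases it. A Go sketch (the path is hypothetical):
package main

import (
	"fmt"
	"io"
	"os"
)

// readConfig releases the fd on every return path via defer,
// including the early error returns.
func readConfig(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close() // runs even if ReadAll fails below

	return io.ReadAll(f)
}

func main() {
	data, err := readConfig("/etc/myapp/config.yaml") // hypothetical path
	fmt.Println(len(data), err)
}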
When to actually raise the limit#
Sometimes the limit is genuinely too low for the legitimate workload:
- A reverse proxy (NGINX, Envoy, HAProxy) handling tens of thousands of concurrent connections.
- A WebSocket server with persistent connections.
- A database server with many concurrent client connections.
- A CDN edge node.
For these, set the runtime default to 1,048,576 or higher. Make sure the application's own internal limits (NGINX worker_connections, Envoy concurrency, etc.) are tuned to match.
Quick reference: the EMFILE checklist#
1. Identify the process:
PID=$(pgrep -f myapp | head -1)
2. Check the per-process limit:
cat /proc/$PID/limits | grep "Max open files"
3. Check current usage:
ls /proc/$PID/fd | wc -l
4. If usage is climbing toward limit:
- Categorize: ls -la /proc/$PID/fd | awk '{print $NF}' | sort | uniq -c
- For sockets: ss -tnp | grep "pid=$PID"
- Look for CLOSE_WAIT (peer closed, you leaked)
5. Application fix first, infra fix second:
- Add connection pooling
- Fix close()/defer/with patterns
- Cap worker pools
6. If genuinely need higher limit:
- Set in containerd/CRI-O default ulimits
- Restart the container runtime (containerd/CRI-O); existing pods need to be recreated to pick up the new limit
- Verify with /proc/PID/limits
7. Set up monitoring:
- alert on (open_fds / max_fds) > 0.8
- alert on CLOSE_WAIT count growing unboundedly
What to monitor#
The Prometheus metrics from process_exporter or built-in language metrics give you what you need:
# Per-process fd usage ratio
process_open_fds / process_max_fds > 0.8
# Per-pod from cAdvisor / kubelet
container_file_descriptors / container_ulimits_soft{ulimit="max_open_files"} > 0.8
# CLOSE_WAIT detector (needs a tcp_states exporter)
node_netstat_Tcp_CloseWait > 1000
A dashboard with fd usage per service, plus an alert at 80%, catches leaks days before they bring down production.
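If the service is written in Go, you may not need a separate exporter at all: the standard Prometheus client library registers a process collector by default, which is where process_open_fds and process_max_fds come from. A minimal sketch, assuming the github.com/prometheus/client_golang dependency:
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// The default registry already includes the process collector on Linux,
	// which exports process_open_fds and process_max_fds for this process.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}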
The mental model#
ulimit -n is a soft cap on a per-process resource. It is set by whatever launched the process: systemd unit, containerd config, shell session. Inside Kubernetes, the chain is kubelet -> runtime -> container.
EMFILE in production is almost always one of two things: an application-layer leak (the cap is fine, the app is using fds it should have closed), or a workload that legitimately needs more than the runtime's default (fix the runtime config, not per-pod hacks).
When it happens, the diagnostic chain is mechanical: confirm the limit, count the open fds, categorize them, find the leak. The hardest part is knowing where to look. Now you do.
Linux fundamentals like fd limits, cgroups, namespaces, and the process model are the bread and butter of the Linux Fundamentals course. The Kubernetes-specific debugging patterns (kubectl debug, /proc inspection, runtime tuning) are covered in depth in the Kubernetes Debugging course.