Image Layers and Union Filesystems
A team's Python service image is 2.1 GB. Their application code is 80 MB. Someone asks on Slack: "Why is our image 25x bigger than our code?" The first answer is "Python is heavy." The next is "base images are big." Neither is wrong, but neither is the real story either. The real story is that Docker images are built from layers — one per instruction in the Dockerfile — and each layer is permanent. Installing build tools, then deleting them in a later layer, does not reduce the image size; the deleted files are still sitting in the earlier layer, shipped with every pull. The 2 GB image is the sum of every tool that was ever installed during the build, even if the final
RUN command deleted them. Understanding image layers is the difference between "my image is slow and big, and I do not know why" and "I can cut this image from 1.2 GB to 80 MB with three specific changes." This lesson explains how OverlayFS stacks layers, how Docker uses that to make image pulls fast, and the specific patterns that control image size in practice.
What an Image Actually Is
A container image is not a single file. It is:
- A manifest (JSON) — lists layers, their digests, and their order.
- A config (JSON) — command to run, environment, ports, labels.
- A set of layers — each is a tarball (usually gzipped) containing the files added/changed/deleted by one Dockerfile instruction.
When you docker pull nginx:1.25, Docker fetches the manifest, reads the list of layers, downloads each layer as a gzipped tarball, and unpacks each into /var/lib/docker/overlay2/.
# Inspect an image's layers
docker pull nginx:1.25-alpine
docker inspect nginx:1.25-alpine --format='{{range .RootFS.Layers}}{{println .}}{{end}}'
# sha256:a2abf6c4d29d43a4bf9fbb769f524d0fb36a2edab49819c1bf3e76f409f953ea
# sha256:e1e3d4a7b38fb4b99bfae3b26f27d2c8c33d3d35a16f0f6aa7d3e2f0b7f8e0c2
# sha256:40d9ed1bfe29a7c30ba08b4d5a46f0cf7a32f4f2d5f7a1e0d4ce3e4a1e7d4f5a
# ... (7 layers total)
# Each layer's size
docker history nginx:1.25-alpine --no-trunc
# IMAGE CREATED CREATED BY SIZE
# <missing> 2 weeks ago CMD ["nginx" "-g" "daemon off;"] 0B
# <missing> 2 weeks ago STOPSIGNAL SIGQUIT 0B
# <missing> 2 weeks ago EXPOSE 80 0B
# <missing> 2 weeks ago ENTRYPOINT ["docker-entrypoint..." 0B
# <missing> 2 weeks ago COPY 30-tune-worker-processes.sh 4.62kB
# <missing> 2 weeks ago COPY docker-entrypoint.sh /docker 1.62kB
# <missing> 2 weeks ago RUN /bin/sh -c set -x && ... 13.3MB
# <missing> 2 weeks ago ENV NGINX_VERSION=1.25.3 0B
# <missing> 2 weeks ago /bin/sh -c #(nop) CMD ["/bin/sh"] 0B
# <missing> 2 weeks ago ADD file:1d2e2f0c0... in / 7.38MB ← base alpine
Every line is a layer or a zero-byte metadata change. Each instruction in the Dockerfile produces one layer.
Image layers are immutable and content-addressed. Each layer has a SHA-256 digest — two layers with the same bytes are the same layer, period. That is how docker pull is fast for repeat pulls: if your image shares a base layer with one you have already pulled, Docker skips downloading it. This also means you cannot "modify" a layer; you can only add a new layer on top.
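You can check layer sharing yourself by diffing the digest lists of two images. A sketch, assuming both images are already pulled locally; whether any digests actually match depends on both images having been built from the same base snapshot:

```shell
# List each image's layer digests, sorted, then show the ones they share.
docker inspect nginx:1.25-alpine \
  --format='{{range .RootFS.Layers}}{{println .}}{{end}}' | sort > /tmp/nginx-layers
docker inspect redis:7-alpine \
  --format='{{range .RootFS.Layers}}{{println .}}{{end}}' | sort > /tmp/redis-layers
comm -12 /tmp/nginx-layers /tmp/redis-layers   # digests present in BOTH images
```

Each digest printed by comm -12 is a layer stored once on disk and downloaded at most once, no matter how many images reference it.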
How OverlayFS Stacks Them
When you start a container from an image, Docker does not copy the image. It creates an OverlayFS mount that stacks the image's read-only layers and adds a writable upper layer just for this container.
When the container reads a file, OverlayFS walks the stack top-down and returns the first version it finds. When the container writes a file, OverlayFS copies it up to the writable layer and applies the change there — the lower layers are never touched. This is copy-on-write (CoW).
# Run a container and find its overlay mount
docker run -d --name demo alpine sh -c "while true; do sleep 1; done"
docker inspect --format='{{.GraphDriver.Data.MergedDir}}' demo
# /var/lib/docker/overlay2/<long-hash>/merged
# Check what the container's / looks like from the host
sudo ls /var/lib/docker/overlay2/<long-hash>/merged
# bin etc lib var ...
# The writable upper layer
sudo ls /var/lib/docker/overlay2/<long-hash>/diff
# (empty — nothing written yet)
# Write a file in the container
docker exec demo sh -c 'echo hello > /some-file'
# It lands in the upper layer only
sudo cat /var/lib/docker/overlay2/<long-hash>/diff/some-file
# hello
# And the lower layers are unchanged
docker rm -f demo
Every container gets its own writable upper layer. 100 containers from one image share one copy of the image's read-only layers but have 100 separate upper layers. This is why starting containers is cheap: the image is pulled once, unpacked once, and every container just needs a fresh empty diff/ directory.
Why Your Image Is 2 GB
Now the important part: image size = sum of all layers, permanently. Deleting a file in a later layer does not reduce image size; it just marks the file as deleted in the merged view. The bytes are still in the earlier layer, still shipped on every pull.
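You can verify this with a two-instruction experiment. A sketch; the file size here is arbitrary:

```dockerfile
FROM alpine:3.18
# This layer permanently stores 50 MB of zeros.
RUN dd if=/dev/zero of=/big.bin bs=1M count=50
# This layer only records a deletion marker ("whiteout"); the 50 MB stays.
RUN rm /big.bin
```

Build it and check docker images: the result weighs roughly alpine plus 50 MB, even though /big.bin is absent from the container's filesystem.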
The classic anti-pattern
# DON'T DO THIS
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y build-essential
RUN ./compile-my-app.sh
RUN apt-get remove -y build-essential && apt-get autoremove -y
This Dockerfile looks like it installs build tools, compiles the app, then cleans up. But each RUN is a separate layer:
- Layer 1: base Ubuntu (~80 MB)
- Layer 2: apt install build-essential → adds ~500 MB of tools
- Layer 3: compiles the app → adds ~50 MB of binaries
- Layer 4: apt remove build-essential → marks those 500 MB as deleted but does not reclaim them
Final image: ~700 MB, not ~130 MB as intended. The 500 MB of build tools is still in layer 2, which is part of every pull and every stored copy.
The fix: single RUN, or multi-stage
# OPTION 1: combine into one RUN so intermediate files never land in a layer
FROM ubuntu:22.04
RUN apt-get update && \
    apt-get install -y build-essential && \
    ./compile-my-app.sh && \
    apt-get remove -y build-essential && \
    apt-get autoremove -y && \
    rm -rf /var/lib/apt/lists/*
# Final size: ~130 MB. The build tools appeared and disappeared within a single layer.
# OPTION 2 (better): multi-stage — we cover this in detail in the next lesson
FROM ubuntu:22.04 AS build
RUN apt-get update && apt-get install -y build-essential
COPY . /src
RUN cd /src && ./compile
FROM ubuntu:22.04
COPY --from=build /src/app /usr/local/bin/app
CMD ["/usr/local/bin/app"]
# Final size: ~85 MB. Only the compiled binary and base OS ship.
The "one RUN" trick works because layers are only created between instructions. Everything that happens within a single RUN is in that layer; intermediate files that are created and deleted within one RUN leave no trace.
COPY, ADD, and every RUN create a new layer. ENV, CMD, ENTRYPOINT, EXPOSE, LABEL are metadata-only instructions that create zero-byte layers. This matters for Dockerfile ordering (put frequently-changing instructions last, so the cache stays warm — covered next lesson) and for understanding why some instructions take up space and others do not.
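You can count the two kinds of layers directly from docker history output; a sketch (the awk summary is just illustrative):

```shell
# Size per layer: filesystem layers vs metadata-only entries
docker history nginx:1.25-alpine --format '{{.Size}}\t{{.CreatedBy}}' |
  awk -F'\t' '$1 == "0B" { meta++; next } { fs++ }
              END { printf "filesystem layers: %d, metadata-only: %d\n", fs, meta }'
```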
Layer Caching (The Daily Developer Benefit)
Layers are cacheable by their content. If layer N in your Dockerfile produces the same bytes as it did last build, Docker reuses the cached layer and skips everything up to that point.
# First build
docker build -t myapp .
# Step 1/5 : FROM python:3.11-slim ← pulls base
# Step 2/5 : COPY requirements.txt . ← new layer
# Step 3/5 : RUN pip install -r requirements.txt ← installs, slow
# Step 4/5 : COPY . /app ← new layer
# Step 5/5 : CMD ["python", "/app/main.py"]
# Took 2 minutes.
# Modify app code only
vi app/main.py
# Second build
docker build -t myapp .
# Step 1/5 : FROM python:3.11-slim ← cached
# Step 2/5 : COPY requirements.txt . ← cached (file unchanged)
# Step 3/5 : RUN pip install ... ← cached (previous step cached)
# Step 4/5 : COPY . /app ← rebuilds (source changed)
# Step 5/5 : CMD ...
# Took 3 seconds.
Cache invalidation happens in order: once a layer is invalidated, every layer after it is also rebuilt. This is why Dockerfile ordering matters — we cover it in detail in Lesson 2.2.
Pull Efficiency — Why docker pull Is Fast on a Second Run
The layer-cache property extends to pulls. When you docker pull an image, Docker:
- Fetches the manifest.
- For each layer in the manifest, checks if it already exists locally by SHA-256 digest.
- Downloads only the missing layers.
Two images that share a base layer (e.g., nginx:1.25-alpine and redis:7-alpine both share alpine:3.18) will pull the alpine layer only once.
# Pull one alpine-based image
docker pull nginx:1.25-alpine
# ... pulls 7 layers, ~50 MB
# Pull another — 5 of the 7 alpine layers are already here
docker pull redis:7-alpine
# Already exists ← alpine base layer
# Already exists ← apk setup
# Already exists ← ca-certificates
# ... only redis-specific layers pulled
This is the "shared base layer" optimization. When your CI builds 50 services all based on python:3.11-slim, the base only lives on each runner once.
Inspecting Layer Contents
# Save an image as a tar and poke around
docker save nginx:1.25-alpine -o /tmp/nginx.tar
mkdir /tmp/nginx-img && tar -xf /tmp/nginx.tar -C /tmp/nginx-img
ls /tmp/nginx-img
# <hash1>/ manifest.json
# <hash2>/ repositories
# each <hash>/ is one layer, with a layer.tar inside
# Peek at what's in a specific layer
tar -tf /tmp/nginx-img/<hash>/layer.tar | head
# etc/nginx/conf.d/default.conf
# etc/nginx/nginx.conf
# ...
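The manifest.json at the top of the extracted tar lists those layer tarballs in stacking order, so you can read the order instead of guessing; a sketch, assuming the extraction above:

```shell
# Print the layer tarball paths in order, bottom layer first
python3 -c '
import json
with open("/tmp/nginx-img/manifest.json") as f:
    manifest = json.load(f)
for layer in manifest[0]["Layers"]:
    print(layer)
'
```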
# The dive tool (a community favorite) visualizes layer contents
# brew install dive OR apt install dive
dive nginx:1.25-alpine
# Interactive TUI — see exactly what each layer added and what wasted space there is
dive is the single best tool for auditing image size. It shows you every layer, what each added, and flags wasted space — files that were added in one layer and deleted in another, duplicate files, entire directories you did not mean to include. Run dive on every image before shipping it; you will often find hundreds of megabytes of waste you did not know were there.
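dive can also run non-interactively for CI gating: with a .dive-ci file in place, dive --ci &lt;image&gt; exits non-zero when waste crosses your thresholds. A sketch; the field names follow dive's documented CI config, and the numbers are examples to tune per project:

```yaml
# .dive-ci
rules:
  # fail if less than 90% of image bytes are "useful"
  lowestEfficiency: 0.90
  # fail if more than 50 MB was added then deleted or duplicated
  highestWastedBytes: 50MB
  # fail if more than 10% of user-added bytes are wasted
  highestUserWastedPercent: 0.10
```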
The .dockerignore File: Keep Junk Out of Layers
COPY . /app copies everything in your build context. If your repo has node_modules/, .git/, .venv/, build artifacts, local databases, or test fixtures, all of that lands in a layer. A .dockerignore file at the repo root tells the builder what to skip:
# .dockerignore
.git
.gitignore
node_modules
.venv
__pycache__
*.pyc
.DS_Store
.env
.env.local
coverage/
dist/
build/
*.log
*.md
tests/
This is equivalent to .gitignore but for the Docker build context. Every file it excludes is a file that will not be included in a COPY layer and not transmitted to the daemon during the build.
# Size of your build context
du -sh .
# 450M .
# With .dockerignore in place
docker build --no-cache .
# Sending build context to Docker daemon 12.3MB
# ← down from 450MB to 12MB
A team's images were 3 GB. Their app was 80 MB. Root cause: no .dockerignore, and the project root had a .git directory (900 MB), node_modules/ (1.2 GB), a local SQLite database (200 MB), and accumulated build logs (300 MB). A single COPY . /app instruction pulled all of that into a layer. A 10-line .dockerignore cut the image to 120 MB. Total engineering effort: 15 minutes. Days of build time saved across CI per week: many.
Summary: Where the Bytes Come From
| Source | Size contribution | How to reduce |
|---|---|---|
| Base image | 5–200 MB | Use alpine or distroless when you can |
| Package manager caches (apt, apk, pip) | 50–500 MB | Delete caches in the same RUN: rm -rf /var/lib/apt/lists/* |
| Build tools (gcc, make, etc.) | 300 MB – 1 GB | Multi-stage build — compile in one stage, copy artifact to another |
| Stale node_modules / .git / local files | 100 MB – several GB | .dockerignore |
| Duplicate files (copied into multiple layers) | Varies | Inspect with dive, consolidate |
| Debug symbols and man pages | 10–100 MB | Use distroless / slim images or strip in final RUN |
A well-built Python or Node image for a typical web service is 80–150 MB. A well-built Go image with distroless base is 20–50 MB. Anything over 500 MB deserves a dive audit.
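A quick way to pick audit candidates is to filter docker images output by size; a sketch using the 500 MB rule of thumb above (the awk unit handling is illustrative and ignores kB-sized images):

```shell
# List local images larger than ~500 MB ("Size" prints like "1.2GB" or "850MB")
docker images --format '{{.Repository}}:{{.Tag}} {{.Size}}' |
  awk '{ size = $2
         if (size ~ /GB$/) { sub(/GB$/, "", size); mb = size * 1024 }
         else if (size ~ /MB$/) { sub(/MB$/, "", size); mb = size + 0 }
         else next
         if (mb > 500) print }'
```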
Key Concepts Summary
- Images are collections of layers. Each Dockerfile instruction produces one layer, a tarball of added/changed/deleted files.
- Layers are content-addressed and immutable. Identical layers across images share bytes — the basis of efficient pulls and storage.
- OverlayFS stacks layers at runtime. Read-only image layers + a writable upper layer per container = copy-on-write.
- Image size = sum of all layers. Deleting a file in a later layer does not shrink the image; the deleted file is still in the earlier layer.
- The "single RUN" trick reduces size. Install, use, and clean up within one instruction so intermediate bytes never land in a separate layer.
- Multi-stage builds are the proper answer. Compile in a heavy stage; copy only the artifact into a minimal final stage. Covered in Lesson 2.2.
- Layer caching speeds up builds. Identical layers are reused; invalidation cascades to every later layer.
- .dockerignore keeps junk out. Exclude .git, node_modules, caches, and local state from the build context.
- dive audits image size. The best tool for finding wasted bytes per layer.
Common Mistakes
- RUN apt-get install ... && RUN apt-get remove ... expecting the remove to shrink the image. Wrong — the install is in a previous layer; the remove only affects layers on top of it.
- Skipping .dockerignore and then wondering why COPY . /app is slow or huge.
- Using ADD where COPY would work. ADD has magic behavior (extracts tarballs, fetches URLs) that is rarely what you want. Use COPY by default.
- Making many small RUN instructions to "be modular." Each is a layer; prefer concatenating related commands.
- Forgetting to delete package manager caches. apt-get install ... && rm -rf /var/lib/apt/lists/* in the same RUN. Same for apk, yum, pip.
- Using the latest tag for base images. Every build might get a different base; cache behavior becomes unpredictable. Pin to a specific tag (ideally a digest).
- Using a heavyweight base like ubuntu:22.04 when alpine or distroless would do.
- Ignoring build-context size. Running docker build . from a directory with 10 GB of unrelated files is slow even with caching.
- Assuming two rebuilds of the same Dockerfile produce the same digest. They do not — timestamps, installed package versions, and build-time randomness all vary. Use reproducible-build techniques (pinned versions, TZ=UTC, SOURCE_DATE_EPOCH) only when you need binary-identical images.
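The digest-pinning fix mentioned above looks like this; the digest shown is a placeholder, resolve the real one with docker inspect:

```dockerfile
# Resolve the current digest once:
#   docker pull python:3.11-slim
#   docker inspect --format='{{index .RepoDigests 0}}' python:3.11-slim
# Then pin it so every build resolves to identical base bytes:
FROM python:3.11-slim@sha256:<digest-from-inspect>
```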
You have a Dockerfile that installs build tools, builds a binary, and then removes the build tools — each in its own RUN instruction. The final image is still 800 MB. A colleague claims the `RUN apt-get remove` cleaned everything up. Why is the image still huge, and what is the minimum change to fix it?