Image Registries
At 3 AM, a Kubernetes cluster suddenly starts failing health checks across every service.
`kubectl describe pod` shows: `ErrImagePull: toomanyrequests: You have reached your pull rate limit`. The cluster's node pools were rotating, each new node pulling base images from Docker Hub, and the team's anonymous pull quota was exhausted. Nothing was broken in the code. Nothing was wrong with the images. The entire outage was caused by a registry's unauthenticated rate limit and a team that did not know it existed.

Registries feel like a commodity until they are in the critical path — which they always are, the moment an orchestrator needs a fresh image and cannot get one. This lesson covers the parts of registries that actually matter in production: tags vs. digests, rate limits, private registries and authentication, and the specific failure modes that take down fleets. If you only remember one thing: `nginx:latest` in production is a recipe for unpredictable deployments, and digest pinning is the fix.
What a Registry Is
An image registry is an HTTP server that stores and serves OCI images. The API is defined by the OCI Distribution Spec, which every major registry implements:
- Docker Hub (hub.docker.com) — the default, most popular, free for public images, paid for private and high-throughput.
- GitHub Container Registry (GHCR) — ghcr.io, free for public and private images for GitHub users.
- Amazon ECR — AWS-native, deep IAM integration, tight with ECS / EKS.
- Google Artifact Registry (GAR) — GCP equivalent, replaces the older GCR.
- Azure Container Registry (ACR) — Azure equivalent.
- Quay.io — Red Hat's, popular for open-source projects.
- Harbor — self-hosted, the standard open-source private registry.
- GitLab Container Registry — integrated with GitLab CI.
Because they all implement OCI Distribution, the push/pull protocol is identical. You can docker pull from any of them, docker push to any of them (with auth), and images move between them with docker tag + docker push.
Anatomy of a Pull
docker pull nginx:1.25-alpine
# Using default tag: latest (if you omit the tag)
# 1.25-alpine: Pulling from library/nginx
# a2abf6c4d29d: Pull complete
# e1e3d4a7b38f: Pull complete
# 40d9ed1bfe29: Pull complete
# ... (one line per layer)
# Digest: sha256:eb05700fe7baa6890b74278e39b66b2ed1326831f9ec3ed4bdc6361a4ac2f333
# Status: Downloaded newer image for nginx:1.25-alpine
What happened under the hood:
- Resolve: HTTP GET to `https://registry-1.docker.io/v2/library/nginx/manifests/1.25-alpine`. The registry returns the manifest (JSON).
- Deduplicate: for each layer in the manifest, check the local cache by SHA-256 digest. Skip layers already present.
- Download layers: HTTP GET to `https://.../v2/library/nginx/blobs/sha256:<digest>` for each missing layer. Layer downloads are parallelized (default: 3 concurrent).
- Unpack: each layer's tarball is extracted into `/var/lib/docker/overlay2/<digest>/`.
- Register: the image is tagged locally as `nginx:1.25-alpine`.
A tag (nginx:1.25-alpine) is a mutable reference to an immutable digest. The same tag can point to different images over time — this is how security updates work: the publisher rebuilds and re-tags 1.25-alpine with a new digest, and docker pull nginx:1.25-alpine fetches the new version. The immutable thing is the digest (sha256:eb05...), not the tag.
Tags vs Digests: The Most Important Distinction in Registries
Tags are mutable
# Today
docker pull nginx:1.25
# Pulls the image that 1.25 currently points to.
docker inspect --format='{{index .RepoDigests 0}}' nginx:1.25
# nginx@sha256:abc123...
# A month later
docker pull nginx:1.25
# Same tag, but possibly a different image — the publisher rebuilt with security patches.
docker inspect --format='{{index .RepoDigests 0}}' nginx:1.25
# nginx@sha256:def456... ← different digest
This is by design. When CVEs are patched in the base image, the publisher updates the tag so that pulls of nginx:1.25 get the fix. This is useful for dev but dangerous for reproducible production deploys.
:latest is especially dangerous
latest is not "the newest version." It is "whatever tag the publisher most recently assigned as latest." Which could be:
- A stable release (most publishers)
- A release candidate (some publishers)
- A rolling nightly (some teams)
- An older version that was re-tagged (occasionally)
Worse: using :latest means every docker pull / kubectl apply might bring in a different image. Your "same deploy" gets different behavior across nodes.
Digests are immutable
# Pin to a specific digest
docker pull nginx@sha256:eb05700fe7baa6890b74278e39b66b2ed1326831f9ec3ed4bdc6361a4ac2f333
# Now and forever, this command pulls the SAME image. Period.
A digest is the SHA-256 of the image manifest. Two images with the same digest are the same bytes. Pinning to a digest is the only way to guarantee "the exact image I tested is the exact image that ships."
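Because the digest is literally the SHA-256 of the manifest bytes, you can reproduce the computation yourself. A minimal sketch with a toy manifest — a real manifest is the JSON document the registry returns, listing the config and layer descriptors:

```shell
# Toy manifest (NOT a real nginx manifest — just illustrative bytes):
MANIFEST='{"schemaVersion":2,"mediaType":"application/vnd.oci.image.manifest.v1+json"}'
# The digest a registry would report for these exact bytes:
printf '%s' "$MANIFEST" | sha256sum | awk '{print "sha256:" $1}'
```

Any single-byte change to the manifest produces a completely different hash, which is why a digest reference can never silently point at different content.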
In production Kubernetes manifests, pin to digests (or at minimum, a specific semver tag — never latest). image: myapp:v1.2.3 is OK; image: myapp:v1.2.3@sha256:... is better; image: myapp:latest is a mistake waiting to happen.
Authentication
Most production workflows pull from private registries or authenticated endpoints. How auth works:
Docker Hub
docker login
# Username: you@example.com
# Password: ... (or personal access token, strongly preferred)
# Login Succeeded
# Credentials stored in ~/.docker/config.json
cat ~/.docker/config.json
# {
# "auths": {
# "https://index.docker.io/v1/": {
# "auth": "dXNlcjpwYXNz..." ← base64(user:password)
# }
# }
# }
Use a Personal Access Token (Docker Hub settings → Security), never your password. Passwords get cached; tokens can be scoped and revoked.
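To see how thin this protection is: the stored `auth` value is plain base64, not encryption. Anyone who can read the file can recover the credential:

```shell
# What ~/.docker/config.json stores under "auth" is base64(user:password):
printf 'user:pass' | base64
# dXNlcjpwYXNz
# Decoding it back is trivial:
echo 'dXNlcjpwYXNz' | base64 -d
# user:pass
```

This is why shared build servers should use credential helpers or OIDC-based auth rather than `docker login` with a long-lived secret.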
GHCR
# With a GitHub PAT scoped to read:packages, write:packages
echo $GITHUB_TOKEN | docker login ghcr.io -u USERNAME --password-stdin
# Pull
docker pull ghcr.io/myorg/myapp:v1.2.3
AWS ECR
# ECR's login is a temp token — valid for 12 hours
aws ecr get-login-password --region us-east-1 | \
docker login --username AWS --password-stdin \
123456789012.dkr.ecr.us-east-1.amazonaws.com
# Pull
docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:v1.2.3
ECR's auth is IAM-backed. On EC2 / EKS / ECS, the ambient IAM role is used; aws ecr get-login-password turns the IAM credentials into a docker-login token.
Kubernetes: imagePullSecrets
For a private registry, Kubernetes pods need imagePullSecrets:
apiVersion: v1
kind: Secret
metadata:
name: regcred
type: kubernetes.io/dockerconfigjson
data:
.dockerconfigjson: <base64 of ~/.docker/config.json>
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
spec:
template:
spec:
imagePullSecrets:
- name: regcred
containers:
- name: app
image: ghcr.io/myorg/myapp:v1.2.3
For cloud-native registries (ECR on EKS, GAR on GKE, ACR on AKS), workload identity / IRSA typically removes the need for explicit secrets.
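To demystify the `.dockerconfigjson` value, here is a sketch of building the payload by hand for ghcr.io, with a hypothetical `ci-bot` user and token. In practice, `kubectl create secret docker-registry regcred --docker-server=ghcr.io ...` does this for you:

```shell
GITHUB_TOKEN="example-token"   # hypothetical credential, for illustration only
# Inner value: base64(user:token), same format as ~/.docker/config.json
AUTH=$(printf 'ci-bot:%s' "$GITHUB_TOKEN" | base64 | tr -d '\n')
# Outer value: base64 of the whole config.json-style document —
# this is what goes into the Secret's .dockerconfigjson field
printf '{"auths":{"ghcr.io":{"auth":"%s"}}}' "$AUTH" | base64 | tr -d '\n'
```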
Rate Limits: The Silent Killer
Docker Hub
As of 2024, anonymous pulls are limited to 100 pulls per 6 hours per IP. Authenticated free pulls are 200 per 6 hours. Docker Pro/Team accounts get 5000-50000 depending on plan.
A Kubernetes cluster with 50 nodes, each pulling 10 images on startup, exceeds 100 in a single node-pool rotation. When you hit the limit:
failed to pull image "nginx:1.25": toomanyrequests: You have reached your pull rate limit.
Fixes, in order of effort
- Authenticate pulls. Pull from an account instead of anonymously — 2× the quota on free plans.
- Use a pull-through cache (Harbor, ECR, Artifactory in proxy mode). The cache pulls once from Docker Hub and serves all your nodes.
- Mirror to your own registry. CI pushes base images to your private registry; pods pull from there.
- Use a cloud-native registry (ECR, GAR, ACR) — they impose no external rate limits on their own images, so pulling `123456789012.dkr.ecr.us-east-1.amazonaws.com/my-nginx` is rate-limit-free.
A team deployed 40 new nodes to a Kubernetes cluster during an incident, hoping to scale out. Every new node failed ImagePull on their base nginx image. The cluster that was supposed to scale up to meet load instead stayed at its old capacity because the new nodes could not start. Root cause: Docker Hub rate limits on anonymous pulls from a shared NAT gateway IP. Fix: authenticate pulls + mirror frequently-pulled base images to the company's Harbor registry. Once they did this, ImagePullBackOff incidents dropped from weekly to zero.
Tag Conventions That Save You
Pick a tagging scheme and enforce it. The good ones:
| Tag | Meaning | When it is safe |
|---|---|---|
| `myapp:1.2.3` | Semantic version, pinned | Production (but tag could still be force-repushed) |
| `myapp:1.2.3-abc1234` | Semver + short git SHA | Production, excellent |
| `myapp:abc1234567890` | Full git SHA | CI artifacts, canonical |
| `myapp:sha-YYYYMMDD-HHMMSS-sha` | Timestamp + SHA | Avoid — too long |
| `myapp@sha256:eb05...` | Immutable digest | Production, gold standard |
| `myapp:latest` | Whatever was last pushed | Local dev only |
| `myapp:dev` / `myapp:staging` | Environment rolling tags | Accept that they drift |
The best combo: CI builds and pushes with the full git SHA (`myapp:abc1234567890`); the deploy step then resolves that tag to a digest and writes the digest into the Kubernetes manifest. The deployed manifest references `image: myapp@sha256:...`, while humans still work with tags.
# Resolve a tag to a digest
docker buildx imagetools inspect myorg/myapp:v1.2.3
# Name: docker.io/myorg/myapp:v1.2.3
# MediaType: application/vnd.oci.image.index.v1+json
# Digest: sha256:eb05700fe7baa6890b74278e39b66b2ed1326831f9ec3ed4bdc6361a4ac2f333
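The resolve-and-rewrite step can be sketched in a few lines of shell. The image name is hypothetical, and the digest resolution (commented out, since it needs network access) uses the `imagetools inspect` command shown above:

```shell
# Resolve the tag to a digest first, e.g.:
#   DIGEST=$(docker buildx imagetools inspect myorg/myapp:v1.2.3 \
#              --format '{{.Manifest.Digest}}')
DIGEST="sha256:eb05700fe7baa6890b74278e39b66b2ed1326831f9ec3ed4bdc6361a4ac2f333"
# Rewrite the tag reference in the manifest to an immutable digest reference:
printf 'image: myorg/myapp:v1.2.3\n' |
  sed -E "s|(image: myorg/myapp):v[0-9.]+|\1@${DIGEST}|"
# image: myorg/myapp@sha256:eb05700fe7baa6890b74278e39b66b2ed1326831f9ec3ed4bdc6361a4ac2f333
```

Real pipelines usually do this with `kustomize edit set image` or a templating step rather than `sed`, but the transformation is the same.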
Multi-Arch Images
Modern images support multiple CPU architectures (amd64, arm64, sometimes armv7, ppc64le, s390x). The registry serves an image index (aka manifest list) that points to per-arch manifests; clients pull the one matching their platform.
# See the index
docker buildx imagetools inspect nginx:1.25-alpine
# Name: docker.io/library/nginx:1.25-alpine
# MediaType: application/vnd.oci.image.index.v1+json
# Digest: sha256:...
#
# Manifests:
# Name: docker.io/library/nginx:1.25-alpine@sha256:...
# MediaType: application/vnd.oci.image.manifest.v1+json
# Platform: linux/amd64
#
# Name: docker.io/library/nginx:1.25-alpine@sha256:...
# MediaType: application/vnd.oci.image.manifest.v1+json
# Platform: linux/arm64/v8
# ...
# Build a multi-arch image
docker buildx create --name multi --use
docker buildx build \
--platform linux/amd64,linux/arm64 \
-t myorg/myapp:1.2.3 \
--push .
For teams running on mixed fleets (AWS Graviton + x86, Apple Silicon dev + x86 CI), multi-arch builds are essential. Without them, you get "exec format error" when an arm64 node tries to pull an amd64-only image.
Garbage Collection and Storage Cost
Registries accumulate layers. Each push adds new layer blobs; tag updates only rewrite the manifest. Blobs with no manifest referencing them are "unreferenced" but still stored until garbage collection runs.
- Docker Hub / ECR / GAR — they manage GC for you, subject to their retention policies (often "keep all" on paid plans, "keep 100 latest" or similar on free).
- Self-hosted (Harbor, registry:2) — you run GC manually or via retention policies. Without it, a busy CI can fill TBs of storage with unreferenced layers.
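What garbage collection computes is a simple set difference: blobs on disk minus blobs referenced by any manifest. A toy sketch with hypothetical digest lists (a real registry walks its manifest store to build these):

```shell
# Digests referenced by at least one manifest:
printf 'sha256:aaa\nsha256:bbb\n' | sort > referenced.txt
# Digests of blobs actually stored on disk:
printf 'sha256:aaa\nsha256:bbb\nsha256:ccc\n' | sort > stored.txt
# Lines only in stored.txt = unreferenced blobs, eligible for deletion:
comm -13 referenced.txt stored.txt
# sha256:ccc
rm -f referenced.txt stored.txt
```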
# Harbor example: define a retention policy in the UI
# - Keep the 20 most recent tags
# - Keep tags matching 'v*.*.*'
# - Delete everything else
# - Run weekly
# registry:2 example
docker exec registry bin/registry garbage-collect /etc/docker/registry/config.yml
CI tends to push dozens of images per commit (per-arch, per-stage, preview builds). Without a retention policy, registry storage grows by 50-100 GB per week for even a small team. Set a policy early: keep tagged releases forever, keep the N latest non-semver pushes, delete everything else after 30 days. Your storage bill and registry performance thank you.
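The "keep tagged releases, delete the rest" rule can be sketched as a tag filter. The tag list is a toy example; real retention policies also check push timestamps and keep the N most recent non-release tags:

```shell
# Tags that do NOT look like a semver release are deletion candidates:
printf 'v1.2.3\npr-481\nabc1234\nv2.0.0\n' |
  grep -Ev '^v[0-9]+\.[0-9]+\.[0-9]+$'
# pr-481
# abc1234
```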
Pulling and Pushing: Day-to-Day
# Push an image you built
docker build -t myorg/myapp:v1.2.3 .
docker push myorg/myapp:v1.2.3
# Tag an existing image for a different registry
docker tag myorg/myapp:v1.2.3 ghcr.io/myorg/myapp:v1.2.3
docker push ghcr.io/myorg/myapp:v1.2.3
# Pull a specific architecture
docker pull --platform linux/arm64 nginx:1.25-alpine
# Pull by digest (pin)
docker pull nginx@sha256:eb05700fe7baa6890b74278e39b66b2ed1326831f9ec3ed4bdc6361a4ac2f333
# Inspect without pulling (BuildKit)
docker buildx imagetools inspect myorg/myapp:v1.2.3
Signing and Verifying Images
Content provenance is now standard in serious deployments. Two main options:
- Sigstore / Cosign — sign images with short-lived keys tied to an OIDC identity (no long-term private keys to manage). Growing fastest.
- Notation (CNCF) — certificate-based signing, more traditional PKI approach.
# Sign with cosign (keyless, via OIDC)
cosign sign myorg/myapp:v1.2.3
# Verify
cosign verify --certificate-identity you@example.com \
--certificate-oidc-issuer https://github.com/login/oauth \
myorg/myapp:v1.2.3
Kubernetes admission controllers (Kyverno, OPA Gatekeeper, Connaisseur) can enforce "only deploy images signed by our CI identity" as a policy. This is the baseline for regulated workloads.
Key Concepts Summary
- A registry is an OCI-Distribution-compliant HTTP service. Every major cloud has one; self-hosted options include Harbor.
- Tags are mutable; digests are immutable. Pin to digests for reproducible production deploys.
- `:latest` is never safe for production. Use semver tags or git-SHA tags, ideally resolved to digests in manifests.
- Rate limits exist and will find you. Docker Hub anonymous is 100/6h/IP; solutions are authentication, mirrors, or cloud-native registries.
- Auth models vary. Docker Hub (PAT), GHCR (GitHub PAT), ECR/GAR/ACR (IAM + short-lived tokens), private (user+password or token).
- Multi-arch images ship one tag for many platforms. Use `docker buildx build --platform`.
- Storage grows forever without a retention policy. Configure cleanup rules early.
- Image signing (Sigstore/Cosign) is the current best practice for verifying image provenance.
Common Mistakes
- Deploying `:latest` to production. Every pull might be a different image; deploys become non-deterministic.
- Ignoring rate limits until they cause an outage. Authenticate pulls or use a mirror before the incident.
- Caching credentials in a plain-text `.docker/config.json` on shared build servers. Use credential helpers (docker-credential-helpers) or OIDC-based auth.
- Pushing images with an ops engineer's personal account. Use a CI service account with scoped permissions.
- Forgetting to rebuild for arm64 when the deploy target is AWS Graviton. Symptom: "exec format error" on pod startup.
- Pinning tags but not enabling immutable-tag protection on the registry. A teammate can force-push to `v1.2.3`; enable immutable tags in your registry settings.
- Skipping signature verification. On regulated workloads, unsigned images should be admission-rejected — otherwise you cannot prove what ran.
- Pulling from Docker Hub on hundreds of nodes concurrently during a deploy. Use a pull-through cache or a local mirror; you will save rate-limit headaches and startup latency.
- Assuming ECR / GAR / ACR credentials work forever. They are short-lived (12h on ECR). If a pod starts after the credential expired, the pull fails; kubelet credential providers / workload identity solve this at scale.
Your production Kubernetes cluster starts failing `ImagePullBackOff` with the error `toomanyrequests: You have reached your pull rate limit`. The deploy has not changed in days, but you recently doubled the cluster's node count. What happened, and which of these is the most robust long-term fix?