Compose in Production?
A founding engineer at a growing startup runs every service with a single `docker compose up -d` on a single EC2 instance. Postgres, Redis, API, worker, cron jobs — all in one compose file. It has been running for 18 months, handling 500 req/s, with effectively zero downtime. The team keeps "migrating to Kubernetes" on the roadmap and keeps bumping it down the priority list because — genuinely — they do not need it yet. Eventually the compose stack will hit a ceiling, but the trigger has not come, and the team is shipping features instead of managing a control plane.
"Compose in production" is a contested topic. Some engineers swear it is always wrong. Others run production on it for years. Both are right, depending on the workload. This lesson is the honest tradeoff analysis: what compose gives you in production, what it does not, when the jump to Kubernetes actually pays off, and how the concepts transfer when you do make that jump.
What Compose Actually Is in Production Terms
A single compose file on a single host is a production-grade setup for single-host workloads. It gives you:
- Declarative service definition. Deploys are `git pull && docker compose up -d --build`.
- Restart policies. `restart: unless-stopped` survives host reboots (if the Docker daemon is enabled at boot).
- Resource limits. Cgroup-backed memory/CPU caps per service.
- Network isolation between stacks. Compose projects get their own networks.
- Volume management. Named volumes backed by local disk or network storage drivers.
- Dependency gating. `depends_on` with healthchecks.
- Secrets. File-based secrets (simple but real).
It does not give you:
- High availability. One host = one point of failure.
- Horizontal scaling across hosts. You can
--scale web=3, but all three are on the same machine. - Self-healing across nodes. If the host dies, everything dies.
- Rolling updates with health gates. Compose stops and starts; it does not "drain connections, bring up new, check health, shift traffic."
- Service mesh, ingress, mTLS, policy — the cloud-native control plane.
These are Kubernetes concerns. The question is: do you need them yet?
The single best predictor of "should I use compose or Kubernetes in prod" is whether your workload fits on one machine and how much downtime you can tolerate. If you can tolerate "30 seconds of downtime per deploy" and your traffic fits on a single beefy VM, compose is viable for years. The moment you need "zero-downtime deploys across a fleet," you want an orchestrator. Pretending otherwise just moves pain around.
When Compose Is Good Enough
Signals that compose is the right tool for now:
- One (or two) hosts total. Primary production + a backup/standby.
- Low-to-moderate traffic. Fits comfortably on one beefy VM (say < 2000 req/s at typical p99 targets).
- Downtime tolerance measured in minutes per deploy, rather than a hard zero-downtime requirement.
- Small team. Nobody has time to operate Kubernetes.
- Stateful services work naturally. Postgres + volume + simple compose file; no StatefulSet / CSI headache.
Real-world fits:
- Internal tools. A company's admin dashboard, metabase, internal API — no public-facing SLA pressure.
- Early-stage startups. Pre-product-market-fit, simple deploy model, one compose file per environment.
- Edge / on-prem deployments. A single compose file on a customer's VM; no cloud-native expectations.
- Simple stacks that don't need the fleet story. A wiki, a pricing API, a Slack bot — these do not need 30 replicas and a service mesh.
What a production-leaning compose file looks like
# compose.prod.yaml
name: myapp
services:
api:
image: ghcr.io/myorg/api:${VERSION} # NOT :latest
restart: unless-stopped
environment:
NODE_ENV: production
DATABASE_URL: ${DATABASE_URL}
secrets:
- api_key
deploy:
resources:
limits:
memory: 1G
cpus: "1.0"
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
interval: 15s
timeout: 3s
retries: 3
logging:
driver: json-file
options:
max-size: "10m"
max-file: "5"
depends_on:
db: { condition: service_healthy }
security_opt:
- no-new-privileges:true
read_only: true
tmpfs:
- /tmp
db:
image: postgres:16
restart: unless-stopped
volumes:
- pgdata:/var/lib/postgresql/data
- ./backups:/backups
environment:
POSTGRES_PASSWORD_FILE: /run/secrets/postgres_password
secrets:
- postgres_password
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
proxy:
image: nginx:alpine
restart: unless-stopped
ports:
- "80:80"
- "443:443"
volumes:
- ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
- ./certs:/etc/nginx/certs:ro
depends_on:
api: { condition: service_healthy }
volumes:
pgdata:
secrets:
api_key:
file: ./secrets/api_key
postgres_password:
file: ./secrets/postgres_password
Pair it with:
- Host-level boot enable for Docker: `systemctl enable docker`.
- Volume backups: a cron job that `pg_dump`s the database and uploads the dump to S3.
- Host monitoring: node-exporter + Prometheus somewhere collecting metrics.
- A reverse proxy for TLS: nginx or Caddy as a compose service, or Traefik with Let's Encrypt.
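The backup item above can be sketched as a small cron-driven script (illustrative only: the paths, bucket name, database name `app`, and service name `db` are assumptions based on the compose file above):

```shell
#!/usr/bin/env sh
# /srv/myapp/backup.sh: run nightly from cron, e.g.
#   0 3 * * * /srv/myapp/backup.sh
set -eu

STAMP=$(date +%Y%m%d-%H%M%S)
OUT="/srv/myapp/backups/pg-${STAMP}.sql.gz"

# Dump through the running container so the host needs no postgres client
docker compose -f /srv/myapp/compose.prod.yaml exec -T db \
  pg_dump -U postgres app | gzip > "$OUT"

# Ship it offsite; a dump that only exists on the prod host is not a backup
aws s3 cp "$OUT" "s3://myapp-backups/$(basename "$OUT")"

# Keep only the last 7 local dumps
ls -1t /srv/myapp/backups/pg-*.sql.gz | tail -n +8 | xargs -r rm -f
```

Test a restore periodically; an unrestored backup is a hope, not a backup.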
This stack is "production" in a meaningful sense — for its target tier.
Deploying With Compose
# On the production host (over SSH or via CI)
cd /srv/myapp
git pull
docker compose -f compose.prod.yaml pull # pull new image tags
docker compose -f compose.prod.yaml up -d # recreate changed services
This is the "git-based" deploy: the compose file is in git, the prod host pulls the repo, pulls new images, and runs `up -d`. Simple, auditable, rollback-by-git-revert.
Alternatives:
- CI builds and pushes the image, then SSHes to prod to run `docker compose pull && docker compose up -d`. Cleaner; removes the "prod needs a source checkout" dependency.
- `watchtower` — a companion container that auto-pulls images and restarts services when new tags appear. Works, but makes deploys less deterministic.
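The CI variant can be sketched as a pipeline step (hypothetical: the registry path, the `deploy@prod-host` SSH alias, the `GIT_SHA` variable, and the `/srv/myapp` checkout location are all assumptions):

```shell
# CI deploy step: build and push an immutable tag, then flip prod over SSH.
docker build -t "ghcr.io/myorg/api:${GIT_SHA}" .
docker push "ghcr.io/myorg/api:${GIT_SHA}"

ssh deploy@prod-host "
  cd /srv/myapp &&
  VERSION=${GIT_SHA} docker compose -f compose.prod.yaml pull &&
  VERSION=${GIT_SHA} docker compose -f compose.prod.yaml up -d
"
```

Rollback is rerunning the same step with the previous SHA; the image tags double as an audit trail.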
Zero-downtime (ish) with compose
services:
api:
deploy:
update_config:
order: start-first # start new before stopping old
With `order: start-first`, the intended behavior is: `docker compose up -d` starts the new container, waits for its healthcheck, then stops the old one. Two caveats. First, `update_config` comes from the Swarm-oriented part of the `deploy:` spec, and plain Compose may ignore it and recreate stop-first, so verify the behavior on your Compose version before relying on it. Second, two containers cannot bind the same host port at once, so the pattern relies on the reverse proxy being the only service publishing ports; nginx picks up the new container once it is healthy. That gets you close to zero-downtime, not all the way there.
For real zero-downtime you generally want an orchestrator.
Where Compose Falls Short
No rolling updates
`docker compose up -d` stops all changed containers and starts new ones. For services with multiple replicas on one host (`--scale web=3`), this is close to coordinated recreation, but it is not a rolling update with per-instance health gates.
No horizontal scaling across hosts
`docker compose up -d --scale web=5` runs 5 containers on the same host. If you need to spread across hosts, compose does not help. Docker Swarm was the "compose on a cluster" answer, but it has fallen out of favor — most teams going multi-host jump to Kubernetes instead.
Limited secrets story
`secrets:` in compose loads from files on the host. Fine for small scale; it does not integrate with Vault, AWS Secrets Manager, or GCP Secret Manager the way Kubernetes External Secrets or CSI secret drivers do.
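A common middle-ground workaround is a deploy-time step that materializes cloud secrets into the files compose reads (a sketch; the secret id and target path are assumptions matching the compose file shown earlier):

```shell
# Fetch the secret from AWS Secrets Manager into the file-based secret
# that the compose file mounts at /run/secrets/api_key
aws secretsmanager get-secret-value \
  --secret-id myapp/prod/api_key \
  --query SecretString --output text > ./secrets/api_key
chmod 600 ./secrets/api_key
```

This keeps the source of truth in the cloud secret store, but it is still a point-in-time sync, not the live integration External Secrets gives you on Kubernetes.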
No built-in observability
Kubernetes gives you metrics, events, audit logs, and a declared-state loop for free. Compose gives you `docker logs` and `docker stats`. You bolt on observability externally (Prometheus + cAdvisor + node-exporter + a dashboard).
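Bolting that on usually means two more compose services (a sketch; pin real version tags instead of `latest` in practice, and point an existing Prometheus at both):

```yaml
services:
  node-exporter:
    image: prom/node-exporter:latest   # pin a version; host metrics (CPU, disk, memory)
    restart: unless-stopped
    pid: host
    volumes:
      - /:/host:ro,rslave
    command: ["--path.rootfs=/host"]
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest   # pin a version; per-container metrics
    restart: unless-stopped
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
```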
Host failure = full outage
One host dying takes everything with it. A standby host + some DNS failover helps but it is manual. No node auto-healing.
When to Move to Kubernetes
You are likely ready for Kubernetes when:
- You have multiple hosts to manage. Running the same compose file on 5 hosts and not using an orchestrator is an accident waiting to happen.
- You need zero-downtime deploys at meaningful traffic. Rolling updates across replicas are Kubernetes' job.
- You have more than one team deploying to shared infrastructure. K8s' namespaces, RBAC, and resource quotas are designed for this.
- Your workload needs self-healing across nodes. A host dies → pods reschedule. Compose cannot do this.
- You want a platform. Kubernetes gives your team a common substrate; different services, same deploy patterns.
You are not ready when:
- You have one host and no strong SLA.
- Your team does not have the bandwidth to operate a control plane.
- Most of your workload is stateful and poorly-suited to orchestration (big monolithic DBs).
Middle ground: managed Kubernetes (GKE, EKS, AKS, DigitalOcean K8s) dramatically lowers the operational cost. The control plane is managed; you operate workloads. For a team of 5-10 running 10+ services, a managed K8s cluster is usually worth it. For a team of 2 with 3 services, compose on a VM is probably fine.
How Compose Concepts Map to Kubernetes
When you do make the jump, the mapping is surprisingly direct:
| Compose concept | Kubernetes equivalent | Notes |
|---|---|---|
| `services:` entry | Deployment | Long-running container(s); multiple replicas across nodes |
| `services:` entry for a DB | StatefulSet | Stable network identity, ordered start/stop |
| `networks:` default | Cluster-wide SDN (CNI) | Every pod can reach every pod by default |
| `networks:` with `internal: true` | NetworkPolicy | Explicit allow/deny between pods |
| Named `volumes:` | PersistentVolumeClaim + PersistentVolume | Declarative storage, often backed by CSI |
| `ports: "80:8080"` | Service (type ClusterIP / NodePort / LoadBalancer) | Virtual IP + label selector to pods |
| Public HTTP entry | Ingress + controller (nginx, Traefik, ALB) | Compose's nginx plays the Kubernetes Ingress role |
| `depends_on: db: { condition: service_healthy }` | initContainers + readiness probes | K8s doesn't gate pod start on another pod's health; failing readiness probes drain traffic instead |
| `environment:` | ConfigMap + env refs, Secret + env refs | Separates values from manifests |
| `secrets:` | Secret (base64 in etcd, or external) | K8s Secrets are not encrypted by default; use External Secrets for Vault/AWS SM |
| `deploy.resources` | `resources.requests` + `resources.limits` | Same cgroup underpinnings |
| `healthcheck:` | livenessProbe, readinessProbe, startupProbe | More granular than compose |
| `restart: unless-stopped` | Default pod `restartPolicy: Always` | K8s restarts failed containers automatically |
A near-literal translation
# Compose
services:
api:
image: ghcr.io/myorg/api:v1.2.3
ports:
- "8080:8080"
environment:
DATABASE_URL: postgres://db:5432/app
depends_on:
db: { condition: service_healthy }
db:
image: postgres:16
volumes:
- pgdata:/var/lib/postgresql/data
volumes:
pgdata:
# Kubernetes (abbreviated)
apiVersion: apps/v1
kind: Deployment
metadata: { name: api }
spec:
replicas: 3
selector: { matchLabels: { app: api } }
template:
metadata: { labels: { app: api } }
spec:
containers:
- name: api
image: ghcr.io/myorg/api:v1.2.3
env:
- name: DATABASE_URL
value: postgres://db:5432/app
ports:
- containerPort: 8080
---
apiVersion: v1
kind: Service
metadata: { name: api }
spec:
selector: { app: api }
ports: [{ port: 8080, targetPort: 8080 }]
---
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: db }
spec:
serviceName: db
replicas: 1
selector: { matchLabels: { app: db } }
template:
metadata: { labels: { app: db } }
spec:
containers:
- name: db
image: postgres:16
volumeMounts:
- { name: data, mountPath: /var/lib/postgresql/data }
volumeClaimTemplates:
- metadata: { name: data }
spec:
accessModes: [ReadWriteOnce]
resources: { requests: { storage: 20Gi } }
---
apiVersion: v1
kind: Service
metadata: { name: db }
spec:
selector: { app: db }
clusterIP: None # headless, for StatefulSet DNS
ports: [{ port: 5432 }]
More YAML, but each piece corresponds directly to a compose concept. The skills you built with compose transfer.
If you are moving compose → Kubernetes, `kompose convert -f compose.yaml` is a real tool that generates Kubernetes manifests from your compose file. It is not the right answer for long-term maintenance (generated YAML is ugly) but it is a great way to see the mapping and get started.
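For example (a sketch; the output layout is indicative and varies by kompose version):

```shell
# Generate Kubernetes manifests from the existing compose file
kompose convert -f compose.prod.yaml -o k8s/

# Inspect what each compose service became: roughly one Deployment/Service
# pair per compose service, plus PersistentVolumeClaims for named volumes
ls k8s/
```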
Swarm: The "Compose on a Cluster" Answer Nobody Uses
Docker Swarm lets you run compose-like stacks across multiple nodes. It supports rolling updates, per-service secrets, cluster networking, and works with your existing compose files plus a few deploy: fields.
It was popular ~2016-2018 and has since lost mindshare to Kubernetes. Most production fleets today are Kubernetes; Swarm persists in specific niches (on-prem simplicity, IoT fleets) but the ecosystem has shrunk.
If you are considering Swarm today, you are probably better off with either (a) compose on one host, or (b) Kubernetes (managed). Swarm's middle ground is a lonely place.
Real-World Examples
- Production compose is fine for: WordPress blogs, internal admin panels, small SaaS pilots, side projects with real users, edge deployments, pre-Series-A startups.
- Production compose breaks down for: consumer products with real SLA pressure, multi-tenant platforms, workloads that need horizontal scale, teams where operations and development are separate roles.
A team ran a "temporary" compose stack on a single EC2 instance for two years. It grew to handle 40 million requests per month with one unplanned outage (a full-disk event that took about an hour to fix). They eventually migrated to EKS — not because compose stopped working, but because their team tripled and the new engineers expected a Kubernetes workflow. The migration took 3 weeks. Starting it 18 months earlier, during all the "we should migrate to K8s" conversations, would have been pure opportunity cost. Rule of thumb: migrate when the cost of not migrating exceeds the cost of migrating, not before.
Key Concepts Summary
- Compose is single-host. Production-viable for the right workloads, obviously wrong for multi-host.
- Production compose signals: one host fits, minutes of downtime per deploy is OK, small team, simple architecture.
- Kubernetes signals: multiple hosts, zero-downtime deploys needed, self-healing required, platform for multiple teams.
- The translation is direct: compose services → Deployments, networks → CNI, volumes → PVCs, `depends_on` → probes, `secrets:` → Secrets + External Secrets.
- Swarm is rarely the right answer today. Either compose or Kubernetes.
- Managed Kubernetes (EKS, GKE, AKS) lowers operational cost dramatically; it is the usual destination when compose outgrows its role.
- Migration does not have to be all-at-once. Stateless services move first; stateful services last.
Common Mistakes
- "Compose is never production-ready." It is, for the right scale. The argument becomes real as you scale up.
- Running multi-host without an orchestrator. "Five nodes all with the same compose file" is a recipe for drift and manual fixes.
- Migrating to Kubernetes too early. A team without K8s experience running 3 services on K8s is paying heavy overhead for no benefit.
- Migrating to Kubernetes too late. A team of 30 engineers running 20 services on a single compose host is about to have a bad time when that host dies.
- Using `:latest` tags in production compose files. Same mistake as with Kubernetes; deploys become unpredictable.
- Forgetting the backup story. Compose is easy to deploy; nothing about it backs up your data. Volume backups are your responsibility.
- Running compose on a laptop and claiming "it works in production." Production has SLAs, monitoring, logs, disaster recovery — compose is one piece of that, not the whole thing.
- Pretending Swarm fills the gap between compose and Kubernetes. It does not anymore; the ecosystem moved on.
- Skipping healthchecks in production compose. Without them, `restart: unless-stopped` only catches hard crashes, not hangs.
- Bind-mounting config files that the prod team edits directly. Commit the configs to git and mount from the checkout, or use a proper config service — never "ssh in and edit."
Your team has grown from 3 to 15 engineers. Production is 4 services on a single compose file on one beefy VM; you deploy 3x/week with ~60s of downtime per deploy. Traffic is growing. Several teammates are pushing to migrate to Kubernetes because they think you are too big for compose. What is the right framing to decide?