Git Internals for Engineers

Submodules vs Subtrees vs Monorepos

A platform team maintains 14 microservices in 14 repos. They share a common library, platform-utils. Version management is manual; every change to the library requires a release, followed by 14 coordinated PRs to bump the dependency. They try submodules. After three weeks of broken CIs and "my submodule is out of sync" Slack messages, they switch to subtrees. After a month, they merge everything into a monorepo. Six months later they are happy. Six months after that, the monorepo is 4 GB and CI takes 45 minutes, and they start breaking pieces back out. Each approach is the right answer for some situation and the wrong answer for others. This lesson walks through the trade-offs so you can pick deliberately.

Submodules, subtrees, and monorepos are three answers to "multiple repos that need to share code." None of them is magic. Each has specific benefits and specific pain. This lesson covers what each actually is, the situations where it shines, and the situations where it falls over.

The Three Options in One Sentence

Submodules: a repo contains references to other repos at specific commits; the other repos live as nested Git repos.
Subtrees: a repo contains the content of other repos merged into a subdirectory; no nested Git.
Monorepo: everything in one repo, period.

None is inherently better. The right choice depends on team size, cross-repo dependency patterns, release cadence, and tooling maturity.

KEY CONCEPT

The three approaches solve the same problem ("I want to share code between projects") with different trade-offs on coupling. Submodules keep projects loosely coupled with version pinning. Subtrees keep them copy-merged into one repo. Monorepos collapse everything into one place. The choice is mostly about team dynamics and tooling, not about the tech.

Submodules: Nested Repos by Reference

# Add a submodule
git submodule add https://github.com/org/shared-lib lib/shared

# A new file was created: .gitmodules
cat .gitmodules
# [submodule "lib/shared"]
#     path = lib/shared
#     url = https://github.com/org/shared-lib

# In the parent repo, lib/shared is now a special kind of entry:
git ls-tree HEAD lib/shared
# 160000 commit abc123...    lib/shared
#  ↑ mode 160000 = "gitlink" (reference to a commit in another repo)

# Cloning gets the pointer, not the content:
git clone https://github.com/org/parent
cd parent
ls lib/shared
# (empty)

git submodule update --init --recursive
ls lib/shared
# (now populated)

# Or in one shot:
git clone --recurse-submodules https://github.com/org/parent

The parent repo stores a pointer to a specific commit in the submodule repo. Cloning requires an extra step to fetch the submodule's content.
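
To see the pointer for yourself without touching a real remote, here is a minimal sandbox sketch. The repo names, the temp directory, and the demo identity are assumptions made for illustration; protocol.file.allow is set only because the "remote" here is a local path.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sandbox=$(mktemp -d) && cd "$sandbox"

# Stand-in for the shared library's repository (hypothetical local path).
git init -q shared-lib
git -C shared-lib symbolic-ref HEAD refs/heads/main   # name the first branch "main"
echo "v1" > shared-lib/lib.txt
git -C shared-lib add lib.txt
git -C shared-lib commit -q -m "shared-lib v1"

# Parent repo that pins shared-lib as a submodule.
git init -q parent
cd parent
git commit -q --allow-empty -m "initial commit"
# protocol.file.allow is needed only because the "remote" is a local path.
git -c protocol.file.allow=always submodule --quiet add "$sandbox/shared-lib" lib/shared
git commit -q -m "add shared-lib as a submodule"

# The tree entry is a gitlink (mode 160000) holding a commit ID, not file content.
entry_mode=$(git ls-tree HEAD lib/shared | cut -d' ' -f1)
pinned=$(git rev-parse HEAD:lib/shared)
upstream=$(git -C "$sandbox/shared-lib" rev-parse HEAD)
echo "mode=$entry_mode pinned=$pinned upstream=$upstream"
```

The pinned ID matches shared-lib's HEAD at the moment of the add; new commits in shared-lib do not move it until someone commits a new pointer in the parent.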

Updating a submodule

# Get new changes from the submodule's remote
cd lib/shared
git pull origin main
cd ../..

# The parent sees the submodule's HEAD changed
git status
# modified:   lib/shared (new commits)

# Commit the new pointer in the parent
git add lib/shared
git commit -m "bump shared-lib to latest"
git push

The pain of submodules

Submodules introduce a layer of indirection that causes specific friction:

Clone needs a flag. Forgetting --recurse-submodules gives you an empty submodule dir; first-time users are confused.
Branch switching does not touch submodules. git checkout feature doesn't automatically update submodules to feature's version. You need git submodule update.
Detached HEAD in submodules. git submodule update checks out the commit the parent points at, putting the submodule in detached HEAD. Work inside the submodule requires creating a branch first, which surprises beginners.
Coordinated commits. Changing something in both the parent and a submodule requires two commits in two repos, and the parent's commit must update the submodule pointer. Easy to mess up, easy to forget.
CI complexity. CI must clone recursively, check out the right submodule versions, and often needs credentials for private submodules.
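
The first and third items above can be reproduced in a throwaway sandbox. This is a sketch under assumptions: the repo names, the branch name fix/local-change, and the demo identity are made up, and protocol.file.allow is needed only because the submodule URL is a local path.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sandbox=$(mktemp -d) && cd "$sandbox"

# Upstream library and a parent repo that pins it (hypothetical names).
git init -q shared-lib
git -C shared-lib symbolic-ref HEAD refs/heads/main
echo "v1" > shared-lib/lib.txt
git -C shared-lib add lib.txt
git -C shared-lib commit -q -m "shared-lib v1"
git init -q parent
git -C parent commit -q --allow-empty -m "initial commit"
(cd parent && git -c protocol.file.allow=always submodule --quiet add "$sandbox/shared-lib" lib/shared)
git -C parent commit -q -m "add shared-lib as a submodule"

# Pain 1: a plain clone leaves the submodule directory empty.
git clone -q parent fresh
cd fresh
files_before=$(ls -A lib/shared | wc -l)
git -c protocol.file.allow=always submodule --quiet update --init --recursive
files_after=$(ls -A lib/shared | wc -l)

# Pain 3: after the update, the submodule is on a detached HEAD.
if git -C lib/shared symbolic-ref -q HEAD >/dev/null; then
  head_state=branch
else
  head_state=detached
fi

# Create a branch before committing inside the submodule.
git -C lib/shared checkout -q -b fix/local-change
branch=$(git -C lib/shared symbolic-ref --short HEAD)
echo "before=$files_before after=$files_after head=$head_state branch=$branch"
```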

When submodules are right

Third-party dependencies you do not modify. Pinning an exact commit of an external library via submodule is clean and reproducible.
Completely separate release cadences. The parent and submodule truly are independent projects that occasionally integrate.
Security-audited binary artifacts or libraries. Submodules point at an audited commit; you cannot accidentally pull unaudited code into the build.

Submodule commands cheat sheet

git submodule add <url> <path>          # add a new submodule
git submodule init                       # read .gitmodules into your config
git submodule update                     # check out the recorded commits
git submodule update --init --recursive  # one-shot for fresh clones
git submodule update --remote            # pull submodules to their tracked branch's latest
git submodule foreach <cmd>              # run a command in every submodule
git clone --recurse-submodules <url>     # clone + init + update in one step
git config --global submodule.recurse true   # auto-update submodules on pull/checkout

That last config (submodule.recurse true) is the single biggest quality-of-life improvement for submodule workflows. It makes Git update submodules automatically on pull and checkout. It does not apply to git clone, which still needs --recurse-submodules.
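
A sandbox sketch of the difference submodule.recurse makes when switching branches. The repo names, branch names, and demo identity are assumptions, and the setting is applied per-repo here rather than with --global so the experiment stays contained.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sandbox=$(mktemp -d) && cd "$sandbox"

git init -q shared-lib
git -C shared-lib symbolic-ref HEAD refs/heads/main
echo "v1" > shared-lib/lib.txt
git -C shared-lib add lib.txt
git -C shared-lib commit -q -m "shared-lib v1"

git init -q parent
git -C parent symbolic-ref HEAD refs/heads/main
cd parent
git commit -q --allow-empty -m "initial commit"
# protocol.file.allow is needed only because the "remote" is a local path.
git -c protocol.file.allow=always submodule --quiet add "$sandbox/shared-lib" lib/shared
git commit -q -m "main: pin shared-lib v1"

# Branch "feature" pins a newer shared-lib commit.
git checkout -q -b feature
echo "v2" > lib/shared/lib.txt
git -C lib/shared commit -q -am "shared-lib v2"
git add lib/shared
git commit -q -m "feature: pin shared-lib v2"

# Without recursion, switching branches leaves the submodule on the old commit.
git -c submodule.recurse=false checkout -q main
stale=$(git -C lib/shared rev-parse HEAD)        # still the v2 commit
pinned_main=$(git rev-parse HEAD:lib/shared)     # main pins the v1 commit
git -c protocol.file.allow=always submodule --quiet update   # the manual fix

# With submodule.recurse set, checkout moves the submodule for you.
git config submodule.recurse true
git checkout -q feature
auto=$(git -C lib/shared rev-parse HEAD)
pinned_feature=$(git rev-parse HEAD:lib/shared)
echo "stale=$stale pinned_main=$pinned_main auto=$auto pinned_feature=$pinned_feature"
```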

Subtrees: Content Merged In

Subtrees are the opposite philosophy: instead of a reference, the shared code is copied in as part of your repo's history.

# Add a subtree (the content of shared-lib lives under lib/shared)
git subtree add --prefix=lib/shared https://github.com/org/shared-lib main --squash
# --squash: collapse shared-lib's history into one import commit
# Without --squash, your repo absorbs shared-lib's full history

After this:

lib/shared/ is a normal directory in your repo.
No .gitmodules file.
git clone gets everything, no special flags.
No "detached HEAD" in some nested repo.
Your repo is self-contained.

Updating a subtree

# Pull new changes from shared-lib
git subtree pull --prefix=lib/shared https://github.com/org/shared-lib main --squash
# This creates a merge commit absorbing shared-lib's latest changes

Pushing changes upstream

If you modified lib/shared/ and want to send the changes back to the original shared-lib repo:

git subtree push --prefix=lib/shared https://github.com/org/shared-lib my-branch
# Splits out the commits that touched lib/shared, rewrites them so the
# files sit at the repo root, and pushes the result to my-branch on shared-lib

This is the magic and the headache of subtree: you can modify subtree content in your repo and push it back, or let it drift and maintain a local fork.
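
The round trip (import, edit locally, push back) can be tried end to end against a local bare repo. This is a sketch under assumptions: shared-lib.git, seed, and parent are hypothetical sandbox names, and it relies on git subtree being installed, which some minimal Git packages omit.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sandbox=$(mktemp -d) && cd "$sandbox"

# Upstream shared-lib: a bare repo, seeded with one commit from a scratch clone.
git init -q --bare shared-lib.git
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/main
echo "v1" > seed/lib.txt
git -C seed add lib.txt
git -C seed commit -q -m "shared-lib v1"
git -C seed push -q "$sandbox/shared-lib.git" main

# Parent repo imports shared-lib as a squashed subtree.
git init -q parent
cd parent
git commit -q --allow-empty -m "initial commit"   # subtree add needs an existing commit
git subtree add --prefix=lib/shared "$sandbox/shared-lib.git" main --squash

# Edit the vendored copy in the parent, then send the change upstream.
echo "v2 (edited in parent)" > lib/shared/lib.txt
git commit -q -am "shared: tweak lib.txt from the parent repo"
git subtree push --prefix=lib/shared "$sandbox/shared-lib.git" my-branch

# Upstream now has my-branch with lib.txt at the repo root.
pushed=$(git --git-dir="$sandbox/shared-lib.git" show my-branch:lib.txt)
echo "upstream my-branch lib.txt: $pushed"
```

Pushing to a separate branch rather than main keeps the upstream change reviewable, which matches the my-branch argument in the command above.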

Trade-offs

Pro: self-contained repos, no submodule confusion, full checkout just works.
Pro: no one else needs to know the repo uses subtrees; from their perspective, it is just a normal repo.
Con: your repo size grows with each import (content always, plus the full upstream history unless you pass --squash).
Con: syncing is a manual command (subtree pull); easy to drift.
Con: the history looks noisy if you do not --squash: hundreds of imported commits mixed with yours.
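
To make the history cost concrete, this sketch imports the same three-commit library into two parent repos, once with --squash and once without, and compares commit counts. All repo names are hypothetical sandbox names, and git subtree is assumed to be installed.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sandbox=$(mktemp -d) && cd "$sandbox"

# Upstream library with three commits.
git init -q shared-lib
git -C shared-lib symbolic-ref HEAD refs/heads/main
for n in 1 2 3; do
  echo "v$n" > shared-lib/lib.txt
  git -C shared-lib add lib.txt
  git -C shared-lib commit -q -m "shared-lib v$n"
done

# Two parent repos, each starting from a single empty commit.
for name in squashed full; do
  git init -q "$name"
  git -C "$name" commit -q --allow-empty -m "initial commit"
done

(cd squashed && git subtree add --prefix=lib/shared "$sandbox/shared-lib" main --squash)
(cd full     && git subtree add --prefix=lib/shared "$sandbox/shared-lib" main)

squashed_count=$(git -C squashed rev-list --count HEAD)
full_count=$(git -C full rev-list --count HEAD)
echo "commits with --squash: $squashed_count, without: $full_count"
```

With three upstream commits the gap is small; with thousands, it is the difference between a readable git log and an unreadable one.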

When subtrees are right

You want a self-contained repo but are pulling in a vendored library.
The team does not have the discipline for submodules (no flags to forget, no detached HEAD surprises).
Small number of vendored pieces that change slowly.

Monorepos: Everything In One Place

# One repo, many projects
monorepo/
├── services/
│   ├── api/
│   ├── web/
│   └── worker/
├── libs/
│   ├── auth/
│   ├── database/
│   └── utils/
├── infra/
│   └── terraform/
└── tools/

One clone, one git log, one branch strategy, one PR process. All services and libraries share commits atomically: a single commit can change an API, its server, its client, and its tests in one reviewable unit.

Why teams adopt monorepos

Atomic cross-project changes. Change an API signature and every consumer in the same commit. Impossible with separate repos.
Shared tooling. One build system (Bazel, Nx, Turborepo), one CI config, one dependency lock.
Instant grep. Search all code at once. No "wait, which repo has that config?"
No version negotiation. Internally, everything uses the same version of every library, whatever is at HEAD.
Unified refactoring. Rename a function and a type; one commit covers everything.

Why teams reject them

Repo size. At scale (thousands of engineers, 1000+ projects), monorepos reach terabytes. Clone time, CI time, IDE memory all suffer without specialized tooling.
Tooling demands. A naive monorepo triggers full CI on every commit, which scales terribly. Build systems that understand project boundaries (Bazel, Nx, Turborepo) are mandatory at scale.
Ownership boundaries. Who reviews what? CODEOWNERS files scale but aren't free.
Rogue commits affect everyone. A broken build on main blocks all teams, not just the owner's team. Test coverage and merge queue discipline are critical.
Merge conflicts. More people on one repo = more conflicts on shared files. Especially on lockfiles (package-lock.json, Cargo.lock).
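
The "full CI on every commit" problem is usually attacked by computing what a change touched. The sketch below is a deliberately naive, path-based approximation, assuming a hypothetical two-level layout (services/<name>, libs/<name>); it ignores dependencies between projects, which is exactly what graph-aware build systems add.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sandbox=$(mktemp -d) && cd "$sandbox"

# Hypothetical monorepo with three projects.
git init -q monorepo
git -C monorepo symbolic-ref HEAD refs/heads/main
cd monorepo
mkdir -p services/api services/web libs/utils
echo "api"   > services/api/main.txt
echo "web"   > services/web/main.txt
echo "utils" > libs/utils/util.txt
git add .
git commit -q -m "initial layout"

# A feature branch that only touches services/api.
git checkout -q -b feature
echo "change" >> services/api/main.txt
git commit -q -am "api: tweak"

# Files changed since the branch forked from main, reduced to their project directory.
affected=$(git diff --name-only main...HEAD | cut -d/ -f1-2 | sort -u)
echo "projects to build/test: $affected"
```

A change under libs/utils would list only libs/utils here, even though every service that imports it also needs retesting; that gap is why path filters alone stop scaling.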

What a production monorepo needs

Build system with build graph awareness. Only build/test what changed. Bazel, Buck2, Nx, Turborepo, Pants.
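Sparse checkout and partial clone. Most clones do not need every file. Git's sparse-checkout limits the working tree to the directories you specify; pairing it with a partial clone (git clone --filter=blob:none) also avoids downloading file contents you never check out.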
Git LFS. Binary artifacts use LFS; the main repo stays lean.
Merge queue. PRs merge in FIFO order with rebase-on-merge and pre-merge full-CI. Prevents main breakages from concurrent PRs.
CODEOWNERS. Auto-assign reviews based on paths.
Build caching. A remote build cache (Bazel remote caching, Nx Cloud) reduces duplicate work.
Strong CI discipline. A red main is an everybody-emergency.
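
As one concrete piece of that list, here is a sandbox sketch of a sparse checkout. The monorepo layout and directory names are assumptions; against a real remote you would typically add --filter=blob:none to the clone so that unneeded file contents are not downloaded either.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sandbox=$(mktemp -d) && cd "$sandbox"

# Hypothetical monorepo with two projects.
git init -q monorepo
git -C monorepo symbolic-ref HEAD refs/heads/main
mkdir -p monorepo/services/api monorepo/libs/utils
echo "api"   > monorepo/services/api/main.txt
echo "utils" > monorepo/libs/utils/util.txt
git -C monorepo add .
git -C monorepo commit -q -m "initial layout"

# Clone with an (almost) empty working tree, then opt in to one directory.
git clone -q --sparse "$sandbox/monorepo" sparse-clone
cd sparse-clone
git sparse-checkout set services/api

ls services/api
```

The full history is still present in the clone; only the working tree is narrowed, so git log and git grep against commits keep working across the whole repo.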

PRO TIP

The monorepo-vs-polyrepo debate is really the build-tooling debate in disguise. Teams with excellent build tooling (Bazel / Buck / Nx with remote cache) run monorepos successfully at scale. Teams without it run polyrepos (many small repos) and coordinate loosely. Pick the tool you can actually operate; do not copy Google's monorepo without Google's build tooling.

The Decision Matrix

Factor	Submodules	Subtrees	Monorepo
Repo isolation	Strong (nested repos)	Weak (merged)	None
Version pinning per-project	Explicit	Per-merge	Shared (no pinning)
Clone complexity	Needs recursion flags	Plain clone	Plain clone
Cross-project atomic changes	Impossible	Manual	Natural
Release cadence	Independent	Semi-independent	Unified
Tooling prerequisite	Git	Git	Specialized build system
Team scale	OK for small teams	OK for small teams	Needs dedicated infra at scale
Common pain	Confusing UX, detached HEAD	Drift over time	Requires mature tooling
When it shines	External deps, audited libraries	Vendored libs that rarely change	Fast-moving teams with cross-project changes

Hybrid: Polyrepo With Careful Tooling

Many real teams split the difference:

Services in separate repos (per-team ownership, independent release cadence).
Internal libraries in one library-monorepo (shared utilities, easy refactoring).
Cross-repo dependencies managed via package manager (npm, pip, Maven) with internal package registry.

This is the "polyrepo with internal registry" pattern: each team owns its repo, internal libs are versioned and published to a private registry, and updates are coordinated via normal dependency management (Renovate, Dependabot).

Pros: independent release cadence, clear ownership, no mega-repo. Cons: cross-service changes still require coordinated PRs; version drift between services.

Migration Paths

Polyrepo → Monorepo

Combine multiple repos into one:

# Start fresh
mkdir monorepo && cd monorepo
git init
# git subtree add needs an existing commit to merge into
git commit --allow-empty -m "Initial commit"

# Import each repo as a subtree
git subtree add --prefix=services/api https://github.com/org/api main
git subtree add --prefix=services/web https://github.com/org/web main
git subtree add --prefix=libs/utils https://github.com/org/utils main
# ... etc

# Or use git-filter-repo for more control

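History: without --squash, each import brings the source repo's full history with it (bigger repo, fuller record). With --squash, each import collapses into a single import commit and the original history stays behind in the old repo.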

Monorepo → Polyrepo

Extract one subdirectory into its own repo:

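# In a fresh clone of the monorepo (filter-repo rewrites history and refuses to run elsewhere without --force)
git filter-repo --subdirectory-filter services/api
# Now this clone contains ONLY services/api's history, as if it had always been its own repo
# filter-repo removes the origin remote; add the new repo as a remote and push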
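
If git-filter-repo is not available, git subtree split is an alternative that leaves the monorepo untouched and writes the extracted history to a new branch. This is a swapped-in technique, not the filter-repo flow above; the sandbox layout and the api.git target are hypothetical, and git subtree is assumed to be installed.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sandbox=$(mktemp -d) && cd "$sandbox"

# Hypothetical monorepo: two commits touch services/api, one touches only libs/utils.
git init -q monorepo
cd monorepo
mkdir -p services/api libs/utils
echo "api v1"   > services/api/main.txt
echo "utils v1" > libs/utils/util.txt
git add .
git commit -q -m "initial layout"
echo "utils v2" > libs/utils/util.txt
git commit -q -am "libs: tweak utils"
echo "api v2" > services/api/main.txt
git commit -q -am "api: tweak main"

# Rewrite services/api's history onto a new branch whose root is that directory.
git subtree split --prefix=services/api -b api-only >/dev/null
root_files=$(git ls-tree --name-only api-only)
commit_count=$(git rev-list --count api-only)

# Publish the extracted history to a new (here: local bare) repository.
git init -q --bare "$sandbox/api.git"
git push -q "$sandbox/api.git" api-only:refs/heads/main
published=$(git --git-dir="$sandbox/api.git" rev-parse refs/heads/main)
echo "root files: $root_files, commits: $commit_count"
```

The commit that only touched libs/utils does not appear on api-only, so the new repo's log reads as if services/api had always lived alone.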

When Submodules Finally Work

Despite their pain, submodules are the right answer when:

The nested repo is genuinely external (third-party, open-source, out-of-team).
You want exact commit pinning with audit trail.
The submodule is stable: you update it quarterly, not daily.
Your CI handles submodule init automatically.
The team has set submodule.recurse = true globally and knows the workflow.

Example: a security-sensitive project that pulls in audited cryptography libraries as submodules. Each update is a deliberate bump with review of the specific commit range.

When Subtrees Finally Work

Subtrees shine when:

You want vendoring: copy in a library's code, maintain a local fork, occasionally sync.
The codebase size is manageable: you do not mind absorbing history.
Users should see a self-contained repo without needing submodule knowledge.
You have clear sync discipline: someone owns pulling subtree updates regularly.

Example: a company internal fork of an open-source tool. Subtree keeps it as part of your repo while letting you pull upstream changes as they come.

When Monorepo Finally Works

Monorepos shine when:

The team has one codebase's worth of ownership (not multiple independent teams).
You frequently make cross-project changes that benefit from atomic commits.
You can invest in build and CI tooling to scale.
The org can tolerate "main is broken" incidents affecting many teams.
You have one engineering culture around code style, review, testing.

Example: a product team of 50 engineers across frontend, backend, mobile. Shared components, shared types, shared infra. Bazel or Turborepo lights up what changed; a merge queue keeps main green; a single PR can touch the full stack.

Common Anti-Patterns

"We'll start with submodules and migrate later"

Migration away from submodules is painful (submodule history absorption, rewriting). If you suspect you will not want submodules, do not start with them.

"Monorepo because Google does it"

Google's monorepo works because they have bazillions in tooling (Piper, Blaze). Without that tooling, a "monorepo" is just a big repo that CI chokes on.

Submodules inside submodules

Recursive submodules nested three levels deep. Nothing works right. Avoid.

Mixing approaches in one repo

"This repo has submodules AND subtrees AND vendored copies." Pick one strategy per dependency and stick with it.

Subtree without --squash

Absorbing thousands of commits from every subtree import makes git log unusable. Use --squash unless you specifically want the full history.

WAR STORY

A startup adopted submodules for their "shared utilities" repo across four service repos. Two weeks later, a new hire spent three days debugging "my submodule is weird" issues. Six months later, they kept missing deploys because someone forgot to bump the submodule pointer. They migrated to a monorepo with Turborepo. Three months after that, CI was slow and they introduced Nx with build caching. A year later, everyone is happy, but the journey cost ~6 months of engineering time in distractions. Choose thoughtfully up front; the wrong choice compounds.

Key Concepts Summary

Submodules reference other repos at specific commits; nested Git structure with pinned versions.
Subtrees merge content into a subdirectory; self-contained repo without nested Git.
Monorepos put everything in one repo; enables atomic cross-project changes.
None is universally right. Team size, release cadence, and tooling maturity drive the decision.
Submodule pains: clone flags, detached HEAD, coordinated commits, CI complexity.
Subtree pains: drift over time, history pollution without --squash, manual sync.
Monorepo pains: repo size, specialized tooling required, org-wide CI coordination.
Hybrid (polyrepo with internal registry) is what many teams land on: per-team repos with versioned shared libs.
submodule.recurse true globally fixes many submodule UX issues.

Common Mistakes

Adopting submodules without telling the team; new clones silently miss content without --recurse-submodules.
Using subtrees without --squash; git log becomes unreadable.
Copying Google's monorepo strategy without copying Google's build tooling.
Mixing submodules, subtrees, and vendored copies in the same repo.
Letting subtrees drift because no one owns syncing.
Picking monorepo for a team that cannot maintain a green main.
Nested submodules three levels deep: debugging nightmare.
Forgetting --recurse-submodules on CI clones; the build is green on your machine, red on the build server.
Never investing in build-graph-aware tooling (Nx, Bazel) in a monorepo; CI becomes unusable at scale.
Forcing a migration (polyrepo → monorepo) mid-project without a clear plan for history.
