Git Internals for Engineers

The Three States: Working, Staging, Repository

A developer creates three new files. They run git commit -am "fix bug" and get told "nothing added to commit but untracked files present." They panic. They run git status and see the files in red under "Untracked files." They run git add . and try again; now it commits. They have just rediscovered that git commit -a only stages tracked files, not new ones. They have also used Git for four years without knowing what the index is, because the tool hides it behind polite commands. Today's lesson is the opposite: meet the index directly, understand the staging area as its own first-class concept, and watch every Git command push and pull between the three states.

This is the second fundamental model in Git, after the object database. Every git add, git commit, git checkout, git reset, and git stash moves data between three specific places. Once you can name what moves where, "why is my file still showing as modified?" becomes a five-second diagnosis instead of a trial-and-error session.

The Three Places

Every file in a Git repository exists in up to three states simultaneously:

State	Where it lives	What it represents
Working directory	Your files on disk	What you can edit right now
Staging area (index)	`.git/index`, a binary file	What is queued for the next commit
Repository	`.git/objects/`	Committed snapshots, in the history

Moving between them:

git add <file>: working directory → staging area (hashes the content into a blob and updates the index)
git commit: staging area → repository (creates tree and commit objects from the index)
git checkout -- <file> (classic) or git restore <file>: staging area → working directory (with a commit named, e.g. git checkout HEAD -- <file>, repository → both)
git reset <file>: repository → staging area (keeps the working tree unchanged)

KEY CONCEPT

Git is unusual in treating an explicit staging area as a first-class concept. Many other version control systems (Mercurial, Subversion) go straight from working tree to committed history. The index is your opportunity to craft exactly what goes into the next commit. Every good Git habit (review before commit, commit in small logical units, separate style changes from logic changes) depends on using the index deliberately.

See the Three States

The whole story fits in three arrows: git add moves content from working directory → index; git commit moves index → repository; git restore and git reset undo in the other direction.
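To watch those arrows in action, here is a minimal sketch that walks one file through all three states in a throwaway repository. The file name notes.txt and the identity values are placeholders, not anything from a real project; git status --porcelain is used because its two-column codes make the state easy to read.

```shell
# Throwaway repo; notes.txt and the identity below are placeholders.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com"; git config user.name "Dev"

echo "draft" > notes.txt
s_untracked=$(git status --porcelain)   # "?? notes.txt": working dir only

git add notes.txt
s_staged=$(git status --porcelain)      # "A  notes.txt": now in the index

git commit -q -m "add notes"
s_clean=$(git status --porcelain)       # empty: all three states agree

echo "edit" >> notes.txt
s_modified=$(git status --porcelain)    # " M notes.txt": working dir differs from index

printf '[%s]\n' "$s_untracked" "$s_staged" "$s_clean" "$s_modified"
```

The first column of each code describes the index relative to HEAD, the second describes the working directory relative to the index.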

`git status` Decoded

git status lists files by state. Every line has a specific meaning:

git status
# On branch main
# Changes to be committed:       ← files in the INDEX (will go in next commit)
#   (use "git restore --staged <file>..." to unstage)
#         modified:   app.py
#
# Changes not staged for commit: ← files in WORKING DIR differing from INDEX
#   (use "git add <file>..." to update what will be committed)
#         modified:   README.md
#
# Untracked files:                ← files NOT in index at all
#   (use "git add <file>..." to include in what will be committed)
#         new-file.txt

The same file can appear twice if you edit it, add it, then edit it again:

echo "v1" > x.txt
git add x.txt       # v1 now in index
echo "v2" > x.txt   # v2 in working dir; index still has v1

git status
# Changes to be committed:
#   new file:   x.txt           ← index has v1 (staged for commit)
# Changes not staged for commit:
#   modified:   x.txt           ← working dir has v2 (differs from index)

This is the whole point of the staging area: you can stage a specific version of a file, then keep editing, and only the staged version is committed. The edits sit in the working directory, outside of what the next commit will contain.
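One way to confirm which version is where, sketched in a scratch repository: the :path syntax names the copy of a file held in the index, so it can be printed next to the copy on disk.

```shell
# Scratch repo; no commit is needed to inspect the index.
repo=$(mktemp -d) && cd "$repo" && git init -q

echo "v1" > x.txt
git add x.txt                        # index now holds v1
echo "v2" > x.txt                    # working dir moves on to v2

in_index=$(git cat-file -p :x.txt)   # ":x.txt" = the staged copy
on_disk=$(cat x.txt)
short=$(git status --short)          # "AM": Added in index, Modified on disk

echo "index=$in_index disk=$on_disk status=$short"
```

A commit made at this point would record v1; v2 stays behind as an unstaged modification.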

Inspecting the Index Directly

The index is a real file:

ls -la .git/index
# -rw-r--r-- 1 admin admin 256 Apr 20 10:00 .git/index

# What's in it?
git ls-files --stage
# 100644 a1b2c3d4... 0    app.py
# 100644 e5f6a7b8... 0    README.md
# Each line: mode, blob SHA, stage (0 = normal, >0 = conflict), path

Notice the blob SHA. When you git add, Git does three things:

Reads the file content.
Creates a blob object in .git/objects/, named by the SHA-1 of the content plus a short header.
Updates the index to say "this path is now this blob SHA."

That is why the content is in .git/objects/ immediately, even before you commit. Staging a file already writes the blob. The commit later adds only a tree object and a commit object that point at it.

echo "abc" > new.txt
git add new.txt

# The blob exists NOW, before any commit
git hash-object new.txt      # prints the SHA the staged blob was stored under
# 8baef1b4a...
git cat-file -p 8baef1b4a    # reads the blob back out of .git/objects/
# abc
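To make the "content-addressed" part concrete, the sketch below recomputes a blob's name by hand and compares it with what Git stores. It assumes a repository using the default SHA-1 object format and a system that provides sha1sum (on macOS, shasum would be the stand-in).

```shell
# Assumes SHA-1 object format and the coreutils sha1sum tool.
repo=$(mktemp -d) && cd "$repo" && git init -q
echo "abc" > new.txt

# A blob's name is the SHA-1 of "blob <size>\0" followed by the raw bytes.
size=$(wc -c < new.txt | tr -d ' ')
manual=$( { printf 'blob %s\0' "$size"; cat new.txt; } | sha1sum | cut -d' ' -f1 )
from_git=$(git hash-object new.txt)

git add new.txt
staged=$(git ls-files --stage new.txt | cut -d' ' -f2)   # SHA recorded in the index
kind=$(git cat-file -t "$staged")                        # object type stored under it

echo "manual=$manual"
echo "git=$from_git"
echo "index=$staged ($kind)"
```

All three SHAs should agree: the name depends only on the content, never on the path or on whether a commit exists yet.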

PRO TIP

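This is why git add is not dangerous: once a file is staged, its blob is in your object database. If you rm the file after staging it, git checkout -- <file> (or git restore <file>) restores it by reading the blob back out of the index. Even if you later unstage it, the blob lingers as an unreferenced object that git fsck --lost-found can list until garbage collection prunes it. Staged content is already durable within the repo.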
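A small sketch of both recovery paths, using a hypothetical file work.txt in a scratch repository: first restoring a deleted file from the index, then finding the blob again after it has been unstaged.

```shell
# Scratch repo; work.txt and the identity values are placeholders.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com"; git config user.name "Dev"
git commit -q --allow-empty -m "init"

echo "important" > work.txt
git add work.txt
blob=$(git rev-parse :work.txt)      # SHA of the staged blob

rm work.txt                          # lose the working copy
git checkout -- work.txt             # read it back out of the index
restored=$(cat work.txt)

git reset -q -- work.txt             # unstage: nothing references the blob now
rm work.txt
dangling=$(git fsck --lost-found 2>/dev/null | grep "dangling blob")
rescued=$(git cat-file -p "$blob")   # still readable until gc prunes it

echo "restored=$restored rescued=$rescued"
echo "$dangling"
```

The second path is a safety net, not a backup strategy: unreferenced objects are eventually pruned by garbage collection.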

Why the Staging Area Exists

New Git users sometimes ask: "Why do I have to git add before git commit? Why not just commit the working directory?"

Three real reasons:

1. Commit-by-logical-unit

You may have changed many files in your working directory. Some belong to commit A ("fix bug"), others to commit B ("refactor helper"). The staging area lets you say: "commit only these three files, in this specific version." Without staging, you would have to stash, commit, unstash, stash more, commit more. Clumsy.

2. Partial file staging

You can stage specific hunks of a file, not the whole thing:

git add -p  # interactive: for each chunk, (y/n/s/e/?)

This splits a file's changes into "this chunk goes in the commit, that chunk stays for later." Essential for clean commits when you accidentally mixed concerns.

3. Review before commit

git diff --staged shows exactly what your commit will contain. git diff (no args) shows unstaged changes: what you would add if you ran git add next. Two different views, one workflow:

git diff            # working dir vs index — "what if I add?"
git diff --staged   # index vs HEAD — "what if I commit?"
git diff HEAD       # working dir vs HEAD — "what has changed since last commit?"

Junior engineers skip this and then wonder why their commits contain debug prints, half-finished experiments, and stray console.log lines. Senior engineers review git diff --staged before every single commit. It takes 30 seconds and prevents a great deal of embarrassment.
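To see that the three diff commands really compare different pairs of states, the sketch below puts a distinct version of one file in each place (v1 committed, v2 staged, v3 on disk). The file name f.txt and the helper function changes are made up for the illustration.

```shell
# Scratch repo; f.txt, changes() and the identity values are placeholders.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com"; git config user.name "Dev"

echo "v1" > f.txt; git add f.txt; git commit -q -m "v1"   # repository: v1
echo "v2" > f.txt; git add f.txt                          # index: v2
echo "v3" > f.txt                                         # working dir: v3

# Keep only the changed lines, skipping the ---/+++ file headers.
changes() { grep '^[-+]v' | paste -sd ' ' -; }

unstaged=$(git diff | changes)            # index vs working dir
staged=$(git diff --staged | changes)     # HEAD vs index
since_head=$(git diff HEAD | changes)     # HEAD vs working dir

echo "git diff:          $unstaged"
echo "git diff --staged: $staged"
echo "git diff HEAD:     $since_head"
```

Each command reports a different before/after pair, so reading the wrong one before committing can hide exactly the change you meant to check.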

`git add` Variants

git add <file>         # stage the current working-dir version of one file
git add -A             # stage all changes (new, modified, deleted) in whole repo
git add .              # stage all changes in the CURRENT DIRECTORY
git add -u             # stage modifications and deletions of TRACKED files only
git add -p             # interactive, hunk-by-hunk
git add -i             # interactive menu (rarely used; -p is usually enough)

The . vs -A distinction trips people up:

git add . stages changes in the current directory and below.
git add -A stages everything in the repo, regardless of where you are.
Modern Git (2.0 and later) made git add . behave like -A scoped to the directory, so the old "deleted files don't get staged" gotcha is fixed. On Git 1.x, git add . skips deletions, so -A is the safer choice there.

# Commonly: I'm in a subdir, only want changes from here down
git add .

# I want everything changed anywhere
git add -A

# I want to leave out stray experiment files I haven't tracked yet
git add -u
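The scope differences are easiest to believe when you see them side by side. This sketch assumes Git 2.0 or later and a made-up layout (a.txt and top.txt at the root, b.txt and new.txt under sub/); the helper staged_after is hypothetical. It runs each variant from inside sub/, records what got staged, then unstages again.

```shell
# Scratch repo; file names, staged_after() and identity values are placeholders.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com"; git config user.name "Dev"
mkdir sub
echo a > a.txt; echo b > sub/b.txt
git add -A && git commit -q -m "base"

echo changed >> a.txt        # tracked and modified, at the repo root
echo new > sub/new.txt       # untracked, inside sub/
echo top > top.txt           # untracked, at the repo root

cd sub
staged_after() {             # stage with the given flag, report, then unstage
  git add "$@"
  git diff --staged --name-only | paste -sd ' ' -
  git reset -q
}
dot=$(staged_after .)        # only what lives under sub/
tracked=$(staged_after -u)   # tracked files only, whole repo
all=$(staged_after -A)       # everything, whole repo

echo "add . : $dot"
echo "add -u: $tracked"
echo "add -A: $all"
```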

Removing and Renaming

git rm <file>          # remove from working dir AND stage the removal
git rm --cached <file> # remove from index ONLY; keep file in working dir
                       # (useful when accidentally added to git)
git mv old new         # roughly: mv old new; git rm --cached old; git add new

git rm --cached is the "oops, I accidentally committed this file; I want to keep it on disk but remove it from Git" command:

# I accidentally added a 50 MB binary
git add big.bin
git commit -m "oops"

# Stop tracking it (stage the removal), but keep the file on disk
git rm --cached big.bin
echo "big.bin" >> .gitignore
git add .gitignore
git commit -m "stop tracking big.bin"

(Remember: the blob is still in history. To fully remove from the repo, rewrite history as in Lesson 1's Quiz.)
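A quick check of that caveat, sketched with a small stand-in for the large binary: after the "stop tracking" commit the file is still on disk and no longer in the index, yet the previous commit can still produce its blob.

```shell
# Scratch repo; big.bin here is a tiny stand-in, identity values are placeholders.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com"; git config user.name "Dev"

echo "pretend this is 50 MB" > big.bin
git add big.bin && git commit -q -m "oops"

git rm -q --cached big.bin
echo "big.bin" >> .gitignore
git add .gitignore && git commit -q -m "stop tracking big.bin"

on_disk=$( [ -f big.bin ] && echo yes || echo no )   # file survives on disk
tracked=$(git ls-files big.bin)                      # empty: gone from the index
in_history=$(git cat-file -t HEAD~1:big.bin)         # "blob": still in the old commit

echo "on_disk=$on_disk tracked=[$tracked] in_history=$in_history"
```

Anyone who clones the repository still downloads that blob, which is why removing it for real means rewriting history.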

`.gitignore` Is About the Index, Not the Working Dir

.gitignore tells Git which files to not track by default. It does not ignore already-tracked files.

# .gitignore
*.log
node_modules/
.env

# If you already committed app.log, adding to .gitignore does NOT make Git stop tracking it
# You have to:
git rm --cached app.log       # remove from index
git commit -m "stop tracking app.log"
# NOW .gitignore will keep new matches out

This is a constant source of confusion. Rule: .gitignore affects git add behavior (untracked files matching a pattern are skipped on git add .), but it does not retroactively untrack anything.
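The rule can be checked directly. In this sketch (scratch repository, placeholder identity), a tracked app.log keeps showing up as modified even though *.log is ignored, and only goes quiet after git rm --cached.

```shell
# Scratch repo; identity values are placeholders.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com"; git config user.name "Dev"

echo "line" > app.log
git add app.log && git commit -q -m "track a log by mistake"
echo "*.log" > .gitignore

echo "more" >> app.log
before=$(git status --porcelain app.log)   # " M app.log": pattern has no effect yet

git rm -q --cached app.log
git add .gitignore
git commit -q -m "stop tracking app.log"

echo "even more" >> app.log
after=$(git status --porcelain app.log)    # empty: untracked and ignored
ignored_by=$(git check-ignore app.log)     # echoes the path when a pattern matches

echo "before=[$before] after=[$after] check-ignore=$ignored_by"
```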

The Staging Area and Commits

What happens on git commit?

Git reads the index: the list of blob SHAs + their paths.
Git creates a tree object from the index (or reuses existing one if nothing changed).
Git creates a commit object pointing at that tree, with HEAD's current commit as parent.
Git updates HEAD (or the current branch ref) to point at the new commit.

git commit -m "feat: add login"
# [main a1b2c3d] feat: add login
#  2 files changed, 42 insertions(+), 3 deletions(-)

# What actually happened:
# - The staged changes became a new tree
# - A new commit object referenced that tree + the prior HEAD commit
# - The branch ref (main) was moved to the new commit SHA

Crucially, nothing in the working directory changes on commit. Your editor's files are untouched. The commit is a snapshot of the index; the working directory is just the surface you use to prepare the next index.
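The same steps can be replayed by hand with Git's plumbing commands, which is a reasonable way to convince yourself that git commit does nothing more. This is a sketch, not how you would commit day to day; the file name and identity values are placeholders.

```shell
# Scratch repo; greet.py and the identity values are placeholders.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com"; git config user.name "Dev"

echo "hello" > greet.py
git add greet.py && git commit -q -m "first"
echo "hello world" > greet.py
git add greet.py                       # index now holds the new version

tree=$(git write-tree)                 # index -> tree object
parent=$(git rev-parse HEAD)           # current commit becomes the parent
commit=$(git commit-tree "$tree" -p "$parent" -m "second")   # tree + parent -> commit
git update-ref HEAD "$commit"          # move the current branch to it

head_tree=$(git rev-parse 'HEAD^{tree}')
head_parent=$(git rev-parse 'HEAD~1')
status=$(git status --porcelain)       # empty: index and HEAD agree again
on_disk=$(cat greet.py)                # working dir was never touched

echo "tree ok: $([ "$tree" = "$head_tree" ] && echo yes || echo no)"
```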

WAR STORY

A developer ran git commit -am "big refactor" expecting to commit everything. The commit contained only 2 of the 15 files they had edited. The 13 missing files were all new and untracked. -a means "stage modified and deleted TRACKED files automatically"; it does NOT include untracked ones. Fix: git add -A && git commit -m "...". Better fix: always git status before committing; read what the next commit will contain. The 30 seconds you spend on status saves the 30 minutes you spend explaining a broken commit.
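The war story is easy to reproduce at small scale. In this sketch (hypothetical file names, scratch repository), one tracked file is edited and one new file is created before running git commit -am.

```shell
# Scratch repo; file names and identity values are placeholders.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com"; git config user.name "Dev"

echo "one" > tracked.txt
git add tracked.txt && git commit -q -m "base"

echo "two" >> tracked.txt      # edit to a tracked file
echo "new" > untracked.txt     # brand-new file, never added

git commit -q -am "big refactor"

in_commit=$(git diff-tree --no-commit-id --name-only -r HEAD)   # files in the commit
left_over=$(git status --porcelain)                             # what was left behind

echo "committed: $in_commit"
echo "left over: $left_over"
```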

Stash: A Fourth State (Sort Of)

Git's stash is a temporary shelf for changes you are not ready to commit but need to set aside:

git stash                # saves working dir + staged changes, reverts to HEAD
git stash push -m "msg"  # with a message
git stash list           # see all stashes
git stash pop            # reapply the most recent (and delete it)
git stash apply          # reapply without deleting

# Stash untracked files too
git stash -u

# Stash as usual, but keep staged changes in the index and working dir
git stash --keep-index

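Under the hood, stash is built on commits: git stash creates two commits (one for the index, one for the working tree) and points refs/stash at the working-tree one, whose parents are HEAD and the index commit. Older stashes live in that ref's reflog, which is the "stack" that pop and apply read from. Nothing is magic; it is the same object model.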
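You can inspect a stash with the same tools you would use on any commit. The sketch below (scratch repository, placeholder names) stashes one staged and one unstaged version of a file, then reads each back from the stash commit and its second parent.

```shell
# Scratch repo; f.txt and the identity values are placeholders.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com"; git config user.name "Dev"

echo "base" > f.txt
git add f.txt && git commit -q -m "base"
echo "staged" > f.txt && git add f.txt     # index version
echo "unstaged" > f.txt                    # working dir version
head_before=$(git rev-parse HEAD)

git stash -q

# One line: the stash commit's SHA followed by its parents' SHAs.
parents=$(git rev-list --parents -n 1 refs/stash | wc -w | tr -d ' ')
first_parent=$(git rev-parse 'refs/stash^1')         # the commit you were on
index_version=$(git show 'refs/stash^2:f.txt')       # what was staged
worktree_version=$(git show 'refs/stash:f.txt')      # what was on disk
after_stash=$(cat f.txt)                             # working dir back at HEAD

echo "words=$parents index=$index_version worktree=$worktree_version now=$after_stash"
```

A word count of 3 means one commit plus two parents; stashing with -u adds a third parent for untracked files.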

The Full Picture: A Life of One File

# 1. Create a new file
echo "hello" > greet.py
git status  # greet.py untracked

# 2. Stage it
git add greet.py
git status  # new file: greet.py

# 3. Commit
git commit -m "add greeter"

# 4. Edit again
echo "hello world" > greet.py
git status  # modified: greet.py

# 5. Review the change
git diff    # shows "hello" replaced by "hello world"

# 6. Stage part of it
git add -p greet.py  # interactive; choose hunks

# 7. Commit the staged parts, continue editing
git commit -m "refine greeting"

# Working dir still has any unstaged edits; next round starts

Each transition is deliberate. Each command moves content between specific states. There is never a mystery about where a change "is."

Key Concepts Summary

Three states: working directory (editable files), staging area (index, the next commit draft), repository (committed history in .git/objects/).
git add = working → index (creates a blob, updates .git/index).
git commit = index → repository (creates tree + commit objects).
git restore / git reset undo in the opposite direction.
The same file can appear staged AND modified simultaneously when you edit after staging.
The index is a real file (.git/index), listable with git ls-files --stage.
Staged content is already blob-stored. Recovery is possible even without a commit.
git add . vs -A vs -u differ in scope (dir vs repo) and what they include (all vs tracked-only).
.gitignore affects future git add, not already-tracked files (git rm --cached to stop tracking).
git commit -a only stages tracked files. Untracked files need explicit git add.
Stash is just commits on a refs/stash ref, not magic.

Common Mistakes

git commit -am "..." expecting all new files to be included. -a only stages modified/deleted TRACKED files.
Running git commit without git status or git diff --staged first, then shipping debug prints or half-finished code.
Adding a file to .gitignore and expecting already-tracked files to disappear. Use git rm --cached.
Using git add . in the wrong directory and missing changes elsewhere. git add -A is safer for "commit everything."
Thinking the index is abstract or magical. It is literally .git/index; git ls-files --stage prints its contents.
Mixing git checkout -- <file> (old) and git restore <file> (new). Both work; restore is clearer.
Panicking after git add of a wrong file. You can git restore --staged <file> to unstage (and the blob stays recoverable even if you forget).
Ignoring the power of git add -p. Partial-file staging is the difference between clean commits and messy ones.
Assuming "git add . && git commit" is atomic. It is two separate steps; between them, Git's state is exactly the staged-but-not-yet-committed state.

KNOWLEDGE CHECK

You edit `config.py`, run `git add config.py`, then keep editing `config.py`. You run `git commit -m 'update config'`. Which version of the file ends up in the commit, and how would you verify?
