Git Internals for Engineers

Git Is a Content-Addressable Filesystem

A new engineer on your team asks how git log works. The usual answer is "it shows commit history." The real answer is "it follows parent pointers on a DAG of SHA-1-addressed objects stored in .git/objects/, starting at whatever SHA HEAD currently references." Those two answers sound like different levels of pedantry — but only the second one tells you why git log <sha> works on a detached commit, why you can recover lost work from the reflog, why a rebase creates new commit SHAs, and why git cat-file -p HEAD returns the exact bytes Git hashed to produce that commit. The whole tool collapses into a few simple concepts once you see the storage model; without it, every Git command feels like a wizard's incantation.

This lesson unpacks the core idea: Git is a content-addressable filesystem with a version-control UI pasted on top. Every file you have ever committed, every directory snapshot, every commit — all stored as objects keyed by the SHA-1 of their contents. Once you can cat-file -p any commit and navigate by hand through tree and blob, nothing about Git remains mysterious.


The One-Line Model

Git's storage is three core kinds of objects and one rule (a fourth kind, the tag object, exists only for annotated tags):

  • Blob — the content of a file (bytes only, no filename).
  • Tree — a directory: a list of (mode, type, sha, name) entries pointing at blobs and other trees.
  • Commit — a snapshot pointer: tree <sha> + parent <sha>... + author + committer + message.

And the rule:

Every object is stored under the SHA-1 hash of its content. The SHA is the identifier.

That is the whole filesystem layer. Everything else — branches, tags, merges, reset, rebase, reflog — is refs, pointers, and commands that manipulate them.

KEY CONCEPT

Git is not a "diff database." It does not store patches. It stores complete snapshots of your files, deduplicated by content hash. If two commits contain an identical 100 MB file, the file is stored once, with one SHA, referenced by both. Diffs are computed on the fly by comparing snapshots. This single fact — snapshot storage, not diff storage — explains why Git is so fast at jumping to any historical state.
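
To make "diffs are computed on the fly" concrete, here is a minimal Python sketch. It is not Git's implementation: it models a snapshot as a flat mapping from path to blob SHA, and the helper names (blob_sha, snapshot, diff_snapshots) are invented for illustration. The point is that two snapshots can be compared by SHA alone, without reading file contents.

```python
import hashlib

def blob_sha(content: bytes) -> str:
    """Hash content the way Git hashes a blob: 'blob <size>', a NUL byte, then the bytes."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

def snapshot(files: dict) -> dict:
    """Model a commit's snapshot as {path: blob SHA} (a flattened tree, an assumption)."""
    return {path: blob_sha(data) for path, data in files.items()}

def diff_snapshots(old: dict, new: dict) -> dict:
    """Compare two snapshots by SHA only; no file content is read here."""
    return {
        "added": sorted(p for p in new if p not in old),
        "removed": sorted(p for p in old if p not in new),
        "modified": sorted(p for p in new if p in old and old[p] != new[p]),
    }

v1 = snapshot({"README": b"docs\n", "big.bin": b"\x00" * 1000})
v2 = snapshot({"README": b"docs, updated\n", "big.bin": b"\x00" * 1000, "new.txt": b"hi\n"})

changes = diff_snapshots(v1, v2)
print(changes)
# {'added': ['new.txt'], 'removed': [], 'modified': ['README']}

# big.bin is byte-identical in both snapshots, so both point at one blob SHA.
print(v1["big.bin"] == v2["big.bin"])
# True
```

Because unchanged files keep the same SHA, the comparison skips them immediately; only the paths whose SHAs differ would need a line-level diff.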


See It With Your Own Eyes

Every object lives under .git/objects/. Let us create a tiny repo and inspect it:

mkdir /tmp/gittest && cd /tmp/gittest
git init
# Initialized empty Git repository in /tmp/gittest/.git/

ls .git/objects/
# info  pack

echo "hello world" > hello.txt
git add hello.txt
git commit -m "first commit"

ls .git/objects/
# 3b  84  d9  info  pack
# Three two-char directories, each holding one object file

Every object is stored at .git/objects/<first-2-chars-of-sha>/<remaining-38-chars>. There are now three objects after one commit: one blob (the file), one tree (the directory snapshot), one commit (the snapshot pointer).

Use git cat-file to see them:

# Show the commit
git log --format=%H -n 1
# d9c3e5a6f2...

git cat-file -p d9c3e5a6f2
# tree 84f1a2b3c4...
# author Sharon <sharon@example.com> 1713600000 +0000
# committer Sharon <sharon@example.com> 1713600000 +0000
#
# first commit

# Follow the tree
git cat-file -p 84f1a2b3c4
# 100644 blob 3b18e512db...    hello.txt

# Follow the blob
git cat-file -p 3b18e512db
# hello world

Three objects, three cat-file calls, and you have walked from a commit → its tree → the file's content. This is all Git is.
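
The same walk can be reproduced without Git at all. The sketch below is a rough model under stated assumptions: an in-memory dict stands in for .git/objects/, and put, get, and parse_tree are hypothetical helpers, not Git APIs. It builds a blob, a tree, and a commit in Git's object format, then follows the pointers back down to the file content.

```python
import hashlib

store = {}  # hypothetical in-memory stand-in for .git/objects/

def put(obj_type: str, body: bytes) -> str:
    """Store an object under the SHA-1 of '<type> <size>' + NUL + body."""
    raw = obj_type.encode() + b" %d\0" % len(body) + body
    sha = hashlib.sha1(raw).hexdigest()
    store[sha] = raw
    return sha

def get(sha: str):
    """Return (type, body) for a stored object."""
    header, body = store[sha].split(b"\0", 1)
    obj_type, _size = header.decode().split(" ")
    return obj_type, body

def parse_tree(body: bytes):
    """Yield (mode, name, hex sha) from binary tree entries."""
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        sha = body[nul + 1:nul + 21].hex()  # 20 raw bytes follow the NUL
        yield mode, name, sha
        pos = nul + 21

# Build blob -> tree -> commit.
blob = put("blob", b"hello world\n")
tree = put("tree", b"100644 hello.txt\0" + bytes.fromhex(blob))
commit = put("commit", (
    f"tree {tree}\n"
    "author Sharon <sharon@example.com> 1713600000 +0000\n"
    "committer Sharon <sharon@example.com> 1713600000 +0000\n"
    "\n"
    "first commit\n"
).encode())

# Walk commit -> tree -> blob, like three cat-file calls.
_, commit_body = get(commit)
tree_sha = commit_body.decode().split("\n")[0].split(" ")[1]
_, tree_body = get(tree_sha)
mode, name, blob_sha = next(parse_tree(tree_body))
_, content = get(blob_sha)
print(name, content)
# hello.txt b'hello world\n'
```

Every hop is a dictionary lookup by SHA, which is all "content-addressable" means here.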


The Object Model

Figure: the object graph for a small repository.

  commit  (SHA: d9c3e5a6...)
    tree 84f1a2b3...
    parent 0000000 (root)
    author Sharon ...
    committer ...
      |
      v
  tree  (SHA: 84f1a2b3...)
    100644 blob 3b18e5  hello.txt
    040000 tree 7a1c4d  src/
      |
      +--> blob  (SHA: 3b18e512...)
      |      hello world\n
      |
      +--> tree (src/)  (SHA: 7a1c4d...)
             100644 blob 2f1e...  main.py

Every object is a file under .git/objects/<first-2>/<remaining-38>. Key = SHA of content.

Three kinds of pointers:

  • commit → tree (the commit's snapshot)
  • commit → parent commits (the history chain)
  • tree → blobs and sub-trees (the directory content)

That is the whole object graph. Branches and tags are refs, which are just named pointers at specific commits — we cover those in Lesson 3.


Content Addressing: Identical Files Are Stored Once

Because the SHA is computed from the content, two files with identical bytes produce the same SHA and are stored as the same blob.

# Create two copies of the same file
echo "same content" > a.txt
cp a.txt b.txt
git add .
git commit -m "two files"

# Inspect the tree
git ls-tree HEAD
# 100644 blob 7c5e3cc...    a.txt
# 100644 blob 7c5e3cc...    b.txt
#                       ↑ same SHA for both

# Only ONE blob is stored for both paths
git cat-file --batch-all-objects --batch-check | grep '^7c5e3cc'
# 7c5e3cc... blob 13
# ← one blob object, referenced by two tree entries

This is the dedup. Rename a file? Same content, same blob, no new blob stored: only a new tree object whose entry points to the existing blob. Copy a file? Same story.
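
A toy model shows why copies and renames cost no extra blob storage. This is a sketch, not Git's code: objects is a hypothetical dict keyed by blob SHA, and trees are plain dicts from path to SHA.

```python
import hashlib

objects = {}  # hypothetical blob store: SHA -> content

def write_blob(content: bytes) -> str:
    """Store content under its Git blob SHA; identical content maps to the same key."""
    sha = hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()
    objects.setdefault(sha, content)  # writing the same bytes again is a no-op
    return sha

# Two paths, identical bytes (the a.txt / b.txt case).
tree_v1 = {"a.txt": write_blob(b"same content\n")}
tree_v1["b.txt"] = write_blob(b"same content\n")

# A rename only changes the path in the tree; the SHA it points at is reused.
tree_v2 = {"renamed.txt": tree_v1["a.txt"], "b.txt": tree_v1["b.txt"]}

print(len(objects))
# 1
```

Three paths across two trees, one stored blob.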

PRO TIP

Git does not track renames explicitly. It infers them by comparing trees: if a blob disappears from path A and an identical blob (same SHA) appears at path B, that is an exact rename; if the content changed as well, Git falls back to a similarity score (50% by default). git log --follow <file> relies on this heuristic, and git blame follows whole-file renames on its own. You never explicitly say "I renamed X to Y"; Git figures it out from the content.


How the SHA Is Computed

For a blob:

# Git's exact formula: SHA1("blob " + content.length + "\0" + content)
printf 'hello world\n' | git hash-object --stdin
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

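# Manually, with openssl (the number is the byte length of the content):
content="hello world
"
printf 'blob %d\0%s' $(printf '%s' "$content" | wc -c) "$content" | openssl sha1
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad   (openssl may print a "(stdin)= " style prefix first)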

# ← Same SHA. It's just SHA1 of a specific byte pattern.
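
The same formula as a short Python cross-check, using only hashlib. The function name git_blob_sha is made up for this sketch.

```python
import hashlib

def git_blob_sha(content: bytes) -> str:
    """SHA-1 over 'blob <size>', a NUL byte, then the raw content."""
    header = f"blob {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_sha(b"hello world\n"))
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

# Hashing the raw bytes without the header gives a different digest,
# which is why a plain SHA-1 of a file never matches Git's blob SHA.
print(hashlib.sha1(b"hello world\n").hexdigest() == git_blob_sha(b"hello world\n"))
# False
```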

For a tree: serialize entries as <mode> <name>\0<20-byte-sha>, concatenate, prefix with tree <size>\0, hash with SHA1.
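
A rough Python sketch of the tree serialization just described. Two simplifications are assumptions of this sketch: entries are sorted by plain name (Git's real ordering treats directory names as if they ended in "/"), and modes are passed exactly as stored (directories are stored as 40000, without the leading zero that git ls-tree displays). The function name git_tree_sha is hypothetical.

```python
import hashlib

def git_tree_sha(entries) -> str:
    """entries: list of (mode, name, 40-char hex SHA) tuples."""
    body = b"".join(
        f"{mode} {name}".encode() + b"\0" + bytes.fromhex(sha)  # 20 raw bytes, not hex text
        for mode, name, sha in sorted(entries, key=lambda e: e[1])
    )
    return hashlib.sha1(b"tree %d\0" % len(body) + body).hexdigest()

# A tree with no entries hashes to Git's well-known "empty tree" SHA.
print(git_tree_sha([]))
# 4b825dc642cb6eb9a060e54bf8d69288fbee4904

# A one-entry tree pointing at the hello.txt blob from earlier.
one_file = git_tree_sha([("100644", "hello.txt", "3b18e512dba79e4c8300dd08aeb37f8e728b8dad")])
print(len(one_file))
# 40
```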

For a commit: the textual "tree X\nparent Y\nauthor ...\ncommitter ...\n\nmessage\n" content, prefixed with commit <size>\0, hashed with SHA1.
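
The commit case can be sketched the same way. The function below is illustrative, not a Git API: it assembles the textual body from its parts and hashes it with the commit header. It assumes the message already ends with a newline and ignores optional headers such as signatures.

```python
import hashlib

def git_commit_sha(tree, parents, author, committer, message) -> str:
    """Hash a commit built from its tree, parent list, identities, and message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # zero, one, or several parents
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode()
    return hashlib.sha1(b"commit %d\0" % len(body) + body).hexdigest()

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
who = "Sharon <sharon@example.com> 1713600000 +0000"

first = git_commit_sha(EMPTY_TREE, [], who, who, "first commit\n")
again = git_commit_sha(EMPTY_TREE, [], who, who, "first commit\n")
later = git_commit_sha(EMPTY_TREE, [], who, who.replace("1713600000", "1713600001"), "first commit\n")

print(first == again)  # same inputs, same SHA
# True
print(first == later)  # committer timestamp moved by one second
# False
```

Every field takes part in the hash, so even a one-second change in the committer timestamp yields a different commit SHA.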

You do not need to memorize any of this. The point is: nothing in Git is opaque. Every object has a deterministic, reproducible identity tied to its content.

Why SHA-1?

Git was designed in 2005 when SHA-1 was considered cryptographically strong. Since then, SHA-1 collisions have been demonstrated (the SHAttered attack, published by CWI Amsterdam and Google in 2017). Git's response:

  • SHA-256 mode is available (git init --object-format=sha256) but not yet default because of interop.
  • Meanwhile, Git has added collision detection: its SHA-1 implementation rejects objects that show the telltale pattern of the known collision attacks.
  • For the vast majority of projects, SHA-1 is still safe enough in practice. Migration to SHA-256 is slowly rolling through the ecosystem.

WARNING

If your repository's security model requires collision resistance (e.g., content signed in a commit must be tamper-evident for audit), you want SHA-256 and commit signing together. For ordinary code repos, SHA-1 remains practical — but be aware of the underlying crypto.


Walking History Is Walking a Linked List

Each commit points at its parent(s). Walking backwards through the parent chain is the history. That is what git log does.

# Every commit, full SHA + parent relationship
git log --pretty='%H -> %P' --all
# d9c3e5a... -> 8b2f4c1...       ← this commit's parent
# 8b2f4c1... -> a1b7e0d...
# a1b7e0d... -> (empty)          ← root commit, no parent

# Same thing visually
git log --graph --oneline --all
# * d9c3e5a (HEAD -> main) add README
# * 8b2f4c1 initial structure
# * a1b7e0d first commit

Merges have multiple parents. The DAG (directed acyclic graph) is the commit graph with all parent relationships:

git log --graph --oneline --all
#   *   c5a6e7f (HEAD -> main) Merge branch 'feature'
#   |\
#   | * 9b3d2a1 (feature) add feature X
#   * | 7e1c4b0 fix typo on main
#   |/
#   * 4f8a6d2 baseline
#   * a1b7e0d initial

The commit graph is everything. Every Git operation is "walk this part of the graph, produce these new commits or refs."
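
As a small illustration of "walk this part of the graph", here is a Python sketch that uses the abbreviated SHAs from the merge example above. The parents mapping and the walk function are hypothetical; the sketch shows reachability only and does not reproduce the date-based ordering that git log applies.

```python
# Commit graph from the merge example: sha -> list of parent shas.
parents = {
    "c5a6e7f": ["7e1c4b0", "9b3d2a1"],  # merge commit, two parents
    "9b3d2a1": ["4f8a6d2"],             # feature
    "7e1c4b0": ["4f8a6d2"],             # fix typo on main
    "4f8a6d2": ["a1b7e0d"],             # baseline
    "a1b7e0d": [],                      # root commit, no parent
}

def walk(head: str):
    """Yield every commit reachable from head exactly once."""
    seen, stack = set(), [head]
    while stack:
        sha = stack.pop()
        if sha in seen:
            continue  # shared ancestors (here 4f8a6d2) are visited only once
        seen.add(sha)
        yield sha
        stack.extend(parents[sha])

history = list(walk("c5a6e7f"))
print(len(history))
# 5

# Starting from the feature branch tip, the main-only commit is not reachable.
print(list(walk("9b3d2a1")))
# ['9b3d2a1', '4f8a6d2', 'a1b7e0d']
```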


git fsck: Verify the Integrity

Because every object is hashed, corruption is trivial to detect:

git fsck --full
# Checking object directories: 100% done.
# Checking objects: 100% done.
# (no errors)

# If something were corrupted (illustrative; exact wording varies by Git version):
# error: sha1 mismatch 3b18e512... (expected 3b18e512..., got d7a0b1e...)
# fatal: loose object ... is corrupt

git fsck re-hashes every object and compares the result with the name the object is stored under. Any mismatch is immediate evidence of corruption. The price is time: a full check has to read and hash every object, so it can be slow on large repositories.
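
The core of that check fits in a few lines. This is a toy model, not git fsck itself: store is a hypothetical dict from object name to raw (uncompressed) object bytes, and the real command performs many more checks, such as connectivity and object format.

```python
import hashlib

def object_name(raw: bytes) -> str:
    """An object's name is the SHA-1 of its raw bytes (header included)."""
    return hashlib.sha1(raw).hexdigest()

def fsck(store: dict) -> list:
    """Return the names of objects whose bytes no longer hash to their key."""
    return [name for name, raw in store.items() if object_name(raw) != name]

good = b"blob 12\0hello world\n"
store = {object_name(good): good}
print(fsck(store))
# []

# Simulate bit rot: one byte changes, but the object keeps its old name.
name = next(iter(store))
store[name] = b"blob 12\0hello w0rld\n"
print(fsck(store))
# ['3b18e512dba79e4c8300dd08aeb37f8e728b8dad']
```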


.git/objects/ in Detail

ls -la .git/objects/
# drwxr-xr-x  3 admin  admin   96 Apr 20 10:00 3b
# drwxr-xr-x  3 admin  admin   96 Apr 20 10:00 7c
# drwxr-xr-x  3 admin  admin   96 Apr 20 10:00 84
# drwxr-xr-x  3 admin  admin   96 Apr 20 10:00 d9
# drwxr-xr-x  2 admin  admin   64 Apr 20 10:00 info
# drwxr-xr-x  2 admin  admin   64 Apr 20 10:00 pack

ls .git/objects/3b/
# 18e512dba79e4c8300dd08aeb37f8e728b8dad        ← the blob content

Each object file is zlib-compressed. You can decompress by hand:

# Read a raw object (requires `pigz` or `openssl zlib`):
cat .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad | pigz -d
# blob 12\0hello world
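
If neither tool is installed, Python's zlib module does the same job. The sketch below writes and reads a loose object in a scratch directory rather than a real repository; write_loose and read_loose are hypothetical helper names. The compressed bytes may differ from what Git writes (compression settings vary), but the decompressed content and the path layout follow the format described above.

```python
import hashlib
import os
import tempfile
import zlib

def write_loose(objects_dir: str, obj_type: str, body: bytes) -> str:
    """Compress '<type> <size>' + NUL + body and store it at <first-2>/<remaining-38>."""
    raw = f"{obj_type} {len(body)}".encode() + b"\0" + body
    sha = hashlib.sha1(raw).hexdigest()
    path = os.path.join(objects_dir, sha[:2], sha[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(raw))
    return sha

def read_loose(objects_dir: str, sha: str):
    """Decompress a loose object and split the header from the body."""
    with open(os.path.join(objects_dir, sha[:2], sha[2:]), "rb") as f:
        raw = zlib.decompress(f.read())
    header, body = raw.split(b"\0", 1)
    obj_type, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type, body

objects_dir = tempfile.mkdtemp()  # scratch directory, not a real .git/objects/
sha = write_loose(objects_dir, "blob", b"hello world\n")
print(sha)
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(read_loose(objects_dir, sha))
# ('blob', b'hello world\n')
```

The read_loose half should also work when pointed at a real .git/objects/ directory, as long as the object is still loose and not yet packed.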

For busy repos, Git periodically packs loose objects into packfiles, each one a single compressed file with an index next to it:

git gc
ls .git/objects/
# info  pack
ls .git/objects/pack/
# pack-abc123...idx      ← index (which objects are in the pack, and where)
# pack-abc123...pack     ← packed objects, delta-compressed

Packs use delta compression: similar objects (often successive versions of a file) are stored as deltas against one another, saving enormous space. This is an optimization of the content-addressed model, not a change to it — cat-file still returns the same reconstructed content.


Five Commands That Expose the Object Model

# 1. Hash a blob without storing (just compute the SHA)
echo "hello" | git hash-object --stdin
# ce013625030ba8dba906f756967f9e9ca394464a

# 2. Hash AND store (same SHA, now in .git/objects/)
echo "hello" | git hash-object -w --stdin

# 3. Show any object's content
git cat-file -p <sha>
# Prints blob content, tree listing, or commit metadata

# 4. Show the type of any SHA
git cat-file -t <sha>
# blob / tree / commit / tag

# 5. List everything reachable from a commit
git rev-list --objects <sha>
# <sha> (commit)
# <tree-sha> (tree)
# <blob-sha> <path/to/file.ext>
# ...

git rev-list --objects HEAD lists every object reachable from the current commit: every ancestor commit, every tree, every blob. git push sends the subset of that graph the remote does not have yet.

PRO TIP

Spend 10 minutes running these five commands on a real repo. After you have cat-file-walked from HEAD to a specific file via its tree, the Git model clicks permanently. It is a surprisingly small amount of structure; you just need to see it once.


Why This Matters in Practice

Understanding the object model collapses a lot of Git mystery:

  • Detached HEAD is not scary. It means HEAD points at a commit SHA directly instead of at a branch ref. You can still inspect, build, and even commit; the catch is that no branch ref moves with you, so new commits are reachable only through HEAD (and later the reflog) until you create a branch for them.
  • "git is slow in this repo" often means a very large number of loose objects. git gc packs them into packfiles under .git/objects/pack/.
  • "git push is huge" usually means you added large binary files. git rev-list --objects --all + sorting by object size finds them (git filter-repo or git lfs fixes them).
  • Rebase changes commit SHAs because a commit's SHA includes its parent's SHA. Change the parent (by moving onto a different base) → different SHA for every descendant commit.
  • Two branches with "the same changes" have different commits if one was merged and one was rebased. The trees may look identical at HEAD; the commit graph differs.
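  • Rewriting history never loses anything immediately. The old commits still exist in .git/objects/, reachable via the reflog (Lesson 4.1), until their reflog entries expire (90 days by default, 30 for entries no longer reachable from the branch tip) and garbage collection prunes them.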
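
The rebase point can be checked with a few lines of Python. This is a simplified model with stated assumptions: commit_sha and replay are hypothetical helpers, the author and committer lines are held fixed, and the trees are left unchanged, whereas a real rebase usually produces new trees and a new committer timestamp as well. Holding those constant isolates the effect of the parent pointer.

```python
import hashlib

def commit_sha(tree: str, parent: str, message: str) -> str:
    """Simplified commit hash with fixed author/committer lines (illustration only)."""
    who = "Sharon <sharon@example.com> 1713600000 +0000"
    body = (f"tree {tree}\nparent {parent}\n"
            f"author {who}\ncommitter {who}\n\n{message}\n").encode()
    return hashlib.sha1(b"commit %d\0" % len(body) + body).hexdigest()

def replay(base: str, commits):
    """Re-create (tree, message) commits on top of base; return the new SHAs in order."""
    shas, parent = [], base
    for tree, message in commits:
        parent = commit_sha(tree, parent, message)  # each commit hashes its parent's SHA
        shas.append(parent)
    return shas

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
work = [(EMPTY_TREE, "step 1"), (EMPTY_TREE, "step 2"), (EMPTY_TREE, "step 3")]

before = replay("a" * 40, work)  # original base (placeholder SHA)
after = replay("b" * 40, work)   # identical changes replayed on a different base

print(all(old != new for old, new in zip(before, after)))
# True
```

Only the base changed, yet all three descendants get new SHAs, because each one embeds the SHA of the commit before it.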

Key Concepts Summary

  • Git is a content-addressable filesystem. Every object (blob/tree/commit) is keyed by SHA-1 of its content.
  • Three core object types: blob (file content), tree (directory), commit (snapshot pointer + parents + metadata). Annotated tags add a fourth, the tag object.
  • Identical content = identical SHA = stored once. Renames and copies are free in storage terms.
  • Commits form a DAG. Each commit points at its parent(s); history is parent-chain traversal.
  • .git/objects/ holds every object. Loose (one file each) or packed (delta-compressed).
  • git cat-file -p <sha> shows any object's contents. git hash-object computes SHAs. git fsck verifies integrity.
  • SHA-1 is still default. SHA-256 is available but migration is ongoing.
  • Rebase changes commit SHAs because commit SHA includes parent SHA. Change the parent, change every descendant's identity.
  • Nothing is opaque. Every Git command ultimately reads or writes these objects.

Common Mistakes

  • Thinking Git stores diffs. It stores snapshots; diffs are computed on demand.
  • Being afraid of detached HEAD. It is just HEAD pointing at a commit directly — your work is still there and recoverable.
  • Committing large binaries, then trying to remove them with rm. The blob stays in .git/objects/ forever unless you rewrite history (or use Git LFS from the start).
  • Assuming "the same commit" across branches means the same SHA. A commit's SHA depends on its parent; rebased or cherry-picked commits have different SHAs.
  • Using sha1sum to "compare" files between commits. Git's blob SHA is SHA1("blob <size>\0" + content), not raw content SHA. Always use git hash-object.
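  • Panicking when .git/ grows huge. git gc --aggressive + git repack can shrink it considerably; massive repos need LFS or shallow clones.
  • Manually editing .git/objects/. Do not. Change a single byte and the content no longer matches its name, which git fsck reports as corruption; even file timestamps matter, because git gc uses them to decide which unreachable objects are old enough to prune. Let Git manage it.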
  • Running git fsck only after a problem. Add it to periodic maintenance; it catches silent corruption before it bites.

KNOWLEDGE CHECK

You create a 200 MB file, commit it, realize it was a mistake, `rm` it, and commit the removal. `du -sh .git/` still reports ~200 MB. What is going on, and what is the right way to actually reclaim that space?