Git Internals for Engineers

How Git Computes Diffs

A classmate from a Subversion background watches you git log -p and asks how Git can be so fast when each commit's diff must be "stored somewhere." Your answer: Git does not store diffs. It stores full snapshots, and computes diffs on demand by comparing trees. They do not believe you. You git cat-file -p HEAD and point at the tree <sha> line. You walk the tree to show full files as blobs. Their face changes from skeptical to confused to impressed as the model clicks. Git's storage model is fundamentally different from older VCSes — and more elegant — once you see it.

This lesson covers how Git computes diffs (comparing trees by blob SHA, then unified-diffing differing blobs), why that approach is both fast and accurate, how rename detection works, and why git diff has so many flags. If you have ever wondered why Git can jump to a five-year-old commit almost as quickly as to yesterday's, while a delta-based VCS may first have to reconstruct that revision, this is the explanation.


Snapshot Storage, Diff Display

Git stores complete snapshots — every commit has a tree pointing at blobs that represent the full file content. No deltas, no patches, no "history of changes." Just: "here is what the whole project looked like at commit A."

When you ask for a diff between commit A and commit B, Git:

  1. Looks up the tree of A.
  2. Looks up the tree of B.
  3. Walks both trees. For each file path:
    • If blob SHAs are identical → no change.
    • If blob SHAs differ → load both blobs, run unified diff.
    • If path exists only in A → file deleted.
    • If path exists only in B → file added.
  4. Applies heuristics to detect renames (identical blob SHA, different path → rename).

No diff was stored. The diff is computed by comparing two snapshots, and the comparison is mostly blob-SHA equality checks — extremely fast.
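The walk above can be sketched in a few lines of Python. This is a toy model, not Git's implementation: trees are plain dicts, a blob entry is just its id string, a subtree entry is a hypothetical (tree_id, subtree) pair, and the ids are made-up placeholders. The point to notice is the early continue: when two ids match, the entry is skipped, and for a directory that means the whole subtree is never opened.

```python
def entry_id(entry):
    """Return the object id of a tree entry (None if the path is absent)."""
    if entry is None or isinstance(entry, str):
        return entry
    return entry[0]


def diff_trees(a, b, prefix=""):
    """Yield (status, path) pairs: A=added, D=deleted, M=modified."""
    for name in sorted(set(a) | set(b)):
        old, new = a.get(name), b.get(name)
        if entry_id(old) == entry_id(new):
            continue  # same id: unchanged blob, or an entire unchanged subtree
        path = prefix + name
        old_tree = old[1] if isinstance(old, tuple) else None
        new_tree = new[1] if isinstance(new, tuple) else None
        if old_tree is not None or new_tree is not None:
            # Descend; a side that is missing or is a blob acts as an empty tree.
            yield from diff_trees(old_tree or {}, new_tree or {}, path + "/")
            if isinstance(old, str):
                yield ("D", path)
            if isinstance(new, str):
                yield ("A", path)
        elif old is None:
            yield ("A", path)
        elif new is None:
            yield ("D", path)
        else:
            yield ("M", path)


# Hypothetical snapshots with placeholder ids.
old_root = {
    "src": ("t1", {"app.py": "abc12345"}),
    "package.json": "11111111",
    "old": ("t2", {"legacy.py": "aaaaaaaa"}),
    "vendor": ("t9", {"lib.py": "55555555"}),
}
new_root = {
    "src": ("t3", {"app.py": "def67890"}),
    "package.json": "22222222",
    "docs": ("t4", {"install.md": "98765432"}),
    "vendor": ("t9", {"lib.py": "55555555"}),  # same tree id: never descended into
}

changes = list(diff_trees(old_root, new_root))
for status, path in changes:
    print(status, path)
```

Only after this pass would a real implementation load the two blobs behind each M entry and run a line diff on them.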

KEY CONCEPT

Git's object store grows with the amount of distinct content, not with the number of commits that reference it. Diff cost scales with what changed, not with the size of the project: a subdirectory whose tree SHA is identical in both commits is skipped without being opened. This is why git diff HEAD~100 HEAD is as fast as git diff HEAD~1 HEAD for the same number of changed files. Git does not "walk through" 100 commits; it just compares two trees.


See It in Action

# Show the tree for two commits
git rev-parse HEAD^{tree}
# 84f1a2b3...

git rev-parse HEAD~3^{tree}
# 789abc0d...

# The diff is just a comparison of these two trees
git diff-tree -r 789abc0d 84f1a2b3
# :100644 100644 abc12345... def67890... M    src/app.py
# :100644 100644 11111111... 22222222... M    package.json
# :000000 100644 00000000... 98765432... A    docs/install.md
# :100644 000000 aaaaaaaa... 00000000... D    old/legacy.py

# Format:
#   <src-mode> <dst-mode> <src-sha> <dst-sha> <status>  <path>
# Status: A=added, D=deleted, M=modified, R=renamed, C=copied

git diff-tree is the plumbing command. git diff, git show, and git log -p are porcelain commands that run the same tree comparison and render unified diffs on top.

Watch how dedup helps:

# Files with identical content have identical blob SHAs, so "unchanged" is a single SHA comparison
git ls-tree -r HEAD | awk '{print $3, $4}' | sort
# <sha> path1
# <sha> path2
# Two paths with the same SHA? Identical content, stored once as a single blob.
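To see why identical content always lands on the same blob, here is a minimal sketch of how a blob id is derived, assuming a repository that uses the default SHA-1 object format (SHA-256 repositories hash differently). The helper name blob_id is ours; the header layout ("blob", a space, the byte length, a NUL byte) is what git hash-object hashes.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the SHA-1 object id Git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


same_a = blob_id(b"print('hi')\n")
same_b = blob_id(b"print('hi')\n")
changed = blob_id(b"print('hi!')\n")

print(same_a == same_b)   # True: same bytes, same id, one stored object
print(same_a == changed)  # False: any edit produces a different id
print(blob_id(b""))       # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the empty blob)
```

The path plays no part in the hash, which is why two files with the same bytes share one blob no matter where they live in the tree.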

Rename Detection

Git does not store a "rename" operation, and git mv leaves no special record. When you git mv old new, Git:

  1. Stages the deletion of old.
  2. Stages the addition of new.

The commit records those two changes independently. On diff, Git's rename detection kicks in:

  • If a blob SHA disappears from one path and appears at another → rename.
  • If blob SHAs differ but content similarity ≥ threshold → rename with modifications.

# Without rename detection (--summary prints the condensed one-line form)
git diff --no-renames --summary HEAD~1 HEAD
# create mode 100644 new.py
# delete mode 100644 old.py

# With rename detection (default on modern Git)
git diff --summary HEAD~1 HEAD
# rename old.py => new.py (100%)
# (or with modifications)
# rename old.py => new.py (82%)
# Without --summary, the patch then shows only the 18% that changed.

# Force a specific similarity threshold
git diff -M50 HEAD~1 HEAD        # rename if >= 50% similar
git diff -M90 HEAD~1 HEAD        # rename if >= 90% similar

The default threshold is 50%. To match exact renames only, use -M100%.
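The pairing logic can be sketched as follows. Two loud assumptions: the similarity score here comes from Python's difflib, which is a stand-in and not Git's own similarity metric, and the greedy best-match loop is a simplification of how Git ranks candidates. The function name detect_renames and its inputs (dicts mapping path to file text) are hypothetical.

```python
import difflib


def detect_renames(deleted, added, threshold=0.5):
    """Pair deleted paths with added paths; return (old, new, similarity %) tuples."""
    renames = []
    remaining = dict(added)
    for old_path, old_text in deleted.items():
        best_path, best_score = None, 0.0
        for new_path, new_text in remaining.items():
            if old_text == new_text:
                score = 1.0  # stands in for "same blob SHA at a different path"
            else:
                score = difflib.SequenceMatcher(
                    None, old_text.splitlines(), new_text.splitlines()
                ).ratio()
            if score > best_score:
                best_path, best_score = new_path, score
        if best_path is not None and best_score >= threshold:
            renames.append((old_path, best_path, int(best_score * 100)))
            del remaining[best_path]
    return renames


original = "a\nb\nc\nd\n"
edited = "a\nb\nc\nX\n"

print(detect_renames({"old.py": original}, {"new.py": original}))
print(detect_renames({"old.py": original}, {"new.py": edited}))
print(detect_renames({"old.py": original}, {"new.py": edited}, threshold=0.9))
```

The third call returns an empty list: with a stricter threshold the same change is reported as a delete plus an add, which is exactly what raising -M does.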

PRO TIP

git log --follow <file> uses rename detection to trace a file's history across renames. Without --follow, git log <file> stops at the commit where the current name first appears. With --follow, it continues past renames. Essential for tracing the evolution of a file that has been moved around.


Unified Diff Format — Line by Line

A sample diff:

diff --git a/app.py b/app.py
index 3b18e512..ef5a6b7c 100644
--- a/app.py
+++ b/app.py
@@ -1,5 +1,7 @@
 import os
 import sys
+import logging
+logger = logging.getLogger(__name__)

 def main():
-    print("hello")
+    logger.info("hello")

Breakdown:

  • diff --git a/app.py b/app.py — the headers. a/ is "from", b/ is "to". These are symbolic; no actual a/ or b/ directories exist.
  • index 3b18e512..ef5a6b7c 100644 — blob SHAs of from and to versions, plus file mode.
  • --- a/app.py and +++ b/app.py — the from/to paths.
  • @@ -1,5 +1,7 @@ — hunk header. "From line 1, 5 lines" → "to line 1, 7 lines."
  • Lines with leading space — context (unchanged).
  • Lines with - — removed.
  • Lines with + — added.

This is the same unified format that diff -u produces and patch consumes. Git adds its own extended header lines (diff --git, index, and rename or mode lines when relevant) on top of it.
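You can reproduce the hunk from the sample with Python's standard difflib, which emits the same unified format (without Git's diff --git and index lines). The small parse_hunk_header helper is our own illustration of how to read the @@ line; it is not part of any Git API.

```python
import difflib
import re

old = ["import os\n", "import sys\n", "\n", "def main():\n", '    print("hello")\n']
new = [
    "import os\n",
    "import sys\n",
    "import logging\n",
    "logger = logging.getLogger(__name__)\n",
    "\n",
    "def main():\n",
    '    logger.info("hello")\n',
]

diff_lines = list(difflib.unified_diff(old, new, fromfile="a/app.py", tofile="b/app.py"))

HUNK = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (old_start, old_count, new_start, new_count); an omitted count means 1."""
    match = HUNK.match(line)
    if match is None:
        raise ValueError("not a hunk header: %r" % line)
    old_start, old_count, new_start, new_count = match.groups()
    return int(old_start), int(old_count or 1), int(new_start), int(new_count or 1)


header = next(line for line in diff_lines if line.startswith("@@"))
print("".join(diff_lines), end="")
print(parse_hunk_header(header))  # (1, 5, 1, 7)
```

The counts check out by hand: the old side has four context lines plus one removed line (5), the new side has four context lines plus three added lines (7).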

Flags that matter

git diff                      # working dir vs index
git diff --staged             # index vs HEAD (what's in the next commit)
git diff HEAD                 # working dir vs HEAD (total pending changes)
git diff commitA commitB      # between two commits
git diff commitA..commitB     # same (two-dot)
git diff commitA...commitB    # from their merge base to commitB

git diff --stat               # summary: files and insert/delete counts
git diff --shortstat          # one-line summary
git diff --name-only          # just the changed paths
git diff --name-status        # paths + A/M/D/R indicator
git diff -w                   # ignore all whitespace (short for --ignore-all-space)
git diff --color-words        # word-level diff (useful for prose changes)
git diff -b                   # ignore changes in the amount of whitespace (--ignore-space-change)

The .. vs ... distinction:

  • A..B = "things in B but not A": the commits reachable from B that are not reachable from A.
  • A...B = "the symmetric difference": the commits reachable from either A or B, but not from both.

For diff specifically:

  • git diff A..B = diff from A to B.
  • git diff A...B = diff from (merge-base of A and B) to B. Useful to see "what feature branch has added on top of where it branched from main."
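The set semantics are easier to trust once you compute them on a tiny graph. The sketch below uses a hypothetical five-commit history stored as a parent map; the function names are ours, and the merge-base search is a brute-force version that is fine for a toy graph but not how Git does it at scale.

```python
# Hypothetical history: each commit maps to its parents.
#   base -- m1 -- m2     (main)
#       \-- f1 -- f2     (feature)
PARENTS = {
    "base": [],
    "m1": ["base"],
    "m2": ["m1"],
    "f1": ["base"],
    "f2": ["f1"],
}


def reachable(tip):
    """All commits reachable from tip by following parent links (tip included)."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(PARENTS[commit])
    return seen


def two_dot(a, b):
    """git log a..b: commits reachable from b but not from a."""
    return reachable(b) - reachable(a)


def three_dot(a, b):
    """git log a...b: commits reachable from exactly one of a and b."""
    return reachable(a) ^ reachable(b)


def merge_bases(a, b):
    """Common ancestors that are not proper ancestors of another common ancestor."""
    common = reachable(a) & reachable(b)
    return {c for c in common if not any(c in reachable(o) - {o} for o in common)}


print(sorted(two_dot("m2", "f2")))      # ['f1', 'f2']
print(sorted(three_dot("m2", "f2")))    # ['f1', 'f2', 'm1', 'm2']
print(sorted(merge_bases("m2", "f2")))  # ['base']
```

In this graph, git diff main..feature would compare the snapshots at m2 and f2, while git diff main...feature would compare base and f2, leaving out whatever main did in m1 and m2.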

Why This Is Faster Than Diff-Storage

Delta-based systems (CVS with its RCS files, SVN with its delta chains) store most revisions as differences from another revision. To produce the content at revision 500, they start from a full text and apply a series of deltas. Checkpoints and skip-deltas shorten that series, but the cost of reaching a revision still depends on where it sits in history.

Git's approach: every commit has a full snapshot, deduplicated at the blob level. Getting the content at any revision is O(1) in history depth: look up the commit's tree and walk it. Getting the diff between two revisions is O(differences).

The trade-off: disk space. In naive storage, a 10 MB file that has 1 line changed every commit would be 10 MB × N commits = huge. Git mitigates this with:

  • Blob dedup by SHA. Unchanged files across commits share one blob.
  • Pack files. After git gc, Git stores similar blobs as deltas against one another inside a pack. So two versions of a 10 MB file that differ by one line become one base blob + a small delta: physically stored compactly, but the logical model is still "two full snapshots."

du -sh .git/objects/
# 140M    .git/objects/

git gc
du -sh .git/objects/
# 35M     .git/objects/
# (most loose objects packed + delta-compressed)

Pack files give you snapshot semantics with delta-compression efficiency. Best of both worlds.
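A rough sketch of the idea, under clear assumptions: real pack deltas are byte-oriented copy/insert instructions in a binary format, while this toy version works on lines and borrows difflib to find the shared ranges. The names make_delta and apply_delta are hypothetical. What it does show faithfully is the contract: the delta is tiny, yet applying it gives back the complete second snapshot.

```python
import difflib


def make_delta(base, target):
    """Encode target as ('copy', start, end) ranges of base plus ('insert', lines)."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))
        # 'delete' needs no instruction: those base lines are simply never copied
    return ops


def apply_delta(base, ops):
    """Rebuild the full target snapshot from the base and the delta."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(base[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


v1 = ["line %d\n" % n for n in range(1000)]
v2 = list(v1)
v2[500] = "line 500 (edited)\n"

delta = make_delta(v1, v2)
print(len(delta))                    # 3 instructions describe a 1000-line file
print(apply_delta(v1, delta) == v2)  # True: the full snapshot is recovered
```

Readers of the object (git cat-file, checkout, diff) never see the delta; they ask for a blob by SHA and get full content back.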


Diff Algorithms

Git supports multiple diff algorithms:

git diff --diff-algorithm=myers     # default: fast, good enough
git diff --diff-algorithm=minimal   # spends extra time to produce the smallest possible diff
git diff --diff-algorithm=patience  # anchors on unique lines; often better for structured text (indented code)
git diff --diff-algorithm=histogram # extends patience to handle low-occurrence lines; output is often similar

For most code, myers is fine. For gnarly refactorings where the default diff is confusing, try histogram or patience — they often produce more readable hunks for indented code.

Set globally:

git config --global diff.algorithm histogram

git log -p and git log --stat

Combine commit traversal with diffs:

git log -p                    # commits + their diffs
git log -p -S'function_name'  # commits that changed how many times this string occurs
                              # (the "pickaxe" — covered in Module 6)
git log --stat                # commits + file-change summary
git log --since='2 weeks'     # time-filtered
git log --author='Sharon'     # author-filtered
git log -- src/auth.py        # only commits that touched this file
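The -S rule ("the number of occurrences changed") is worth seeing in miniature, because it explains why some commits that touch the string are not reported. This sketch assumes a linear, single-file history passed in as (commit_id, text) pairs, oldest first; the function name pickaxe and the sample commits are made up.

```python
def pickaxe(history, needle):
    """Return ids of commits where the occurrence count of needle changed."""
    hits = []
    previous_count = 0
    for commit_id, text in history:
        count = text.count(needle)
        if count != previous_count:
            hits.append(commit_id)
        previous_count = count
    return hits


history = [
    ("c1", "def login(user):\n    pass\n"),
    ("c2", "def login(user):\n    pass\nlogin(admin)\n"),
    ("c3", "def login(user, password):\n    pass\nlogin(admin)\n"),
    ("c4", "def sign_in(user, password):\n    pass\n"),
]

print(pickaxe(history, "login"))  # ['c1', 'c2', 'c4']
```

Commit c3 edits a line containing "login" but leaves the count at two, so it is skipped. If you want commits whose diff merely mentions the string, git log -G is the flag to reach for.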

Ranges:

git log A..B                  # commits in B but not A
git log main..feature         # feature's unique commits
git log --all                 # all refs (branches, tags, remotes), not just what HEAD reaches
git log --graph               # ASCII graph of branches and merges

Diff Between Working Directory and Any Commit

git diff <sha> -- path/to/file     # working dir vs that commit
git diff main                      # working dir vs tip of main
git diff main..HEAD                # tip of main vs HEAD
git diff main -- path/file         # scoped to one file

Useful variant: compare a specific file across two commits:

git diff A B -- path/to/file

This shows the diff of just that file between the two snapshots.


git blame Uses the Same Mechanism

git blame walks history and for each line, finds the commit that introduced it. Internally, this is:

  1. Start with the current content of the file.
  2. For each line, walk backwards through commits that touched this file.
  3. At each commit, diff against its parent. Lines that changed here are attributed to this commit.
  4. Continue for unchanged lines until each line finds its introducing commit.

git blame src/auth.py
# a1b2c3d4 (Sharon     2026-04-20 10:00:00 +0000  1) def login(user, password):
# 789abc0d (Alice      2026-01-15 14:22:00 +0000  2)     if not user:
# a1b2c3d4 (Sharon     2026-04-20 10:00:00 +0000  3)         raise ValueError("no user")

git blame -L 10,20 src/auth.py     # only lines 10-20
git blame -w src/auth.py            # ignore whitespace-only changes
git blame --ignore-rev abc123 file  # pretend that commit never happened (useful after a mass reformat)

With --ignore-rev or --ignore-revs-file=.git-blame-ignore-revs, a global reformat commit does not "blame" every line to the reformatter — blame continues past that commit to the real author.
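The four-step walk described at the start of this section can be sketched as below. It is a simplification under stated assumptions: history is linear (no merges), there is one file, line matching is delegated to difflib rather than Git's diff engine, and there is no --ignore-rev handling. The function name blame and the two sample versions are hypothetical, shaped to mirror the output above.

```python
import difflib


def blame(history):
    """history: list of (commit_id, lines), oldest first.
    Returns one commit id per line of the newest version."""
    final_lines = history[-1][1]
    result = [None] * len(final_lines)
    # tracked[i] = where final line i sits in the version being examined
    # (None once that line has been attributed to a commit)
    tracked = list(range(len(final_lines)))
    for idx in range(len(history) - 1, -1, -1):
        commit_id, lines = history[idx]
        parent_lines = history[idx - 1][1] if idx > 0 else []
        # Map each unchanged line in this version to its position in the parent.
        carried = {}
        matcher = difflib.SequenceMatcher(None, parent_lines, lines, autojunk=False)
        for a_start, b_start, size in matcher.get_matching_blocks():
            for k in range(size):
                carried[b_start + k] = a_start + k
        for final_idx, pos in enumerate(tracked):
            if pos is None:
                continue
            if pos in carried:
                tracked[final_idx] = carried[pos]  # unchanged here: keep walking back
            else:
                result[final_idx] = commit_id      # introduced by this commit
                tracked[final_idx] = None
    return result


history = [
    ("789abc0d", ["def login(user):\n", "    if not user:\n", "        return None\n"]),
    ("a1b2c3d4", [
        "def login(user, password):\n",
        "    if not user:\n",
        '        raise ValueError("no user")\n',
    ]),
]

print(blame(history))  # ['a1b2c3d4', '789abc0d', 'a1b2c3d4']
```

Ignoring a revision amounts to treating that commit's changed lines as carried over, so the walk continues into its parent instead of stopping there.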

PRO TIP

For big reformats (prettier, black, rustfmt rollouts), commit the reformat as one isolated commit and add that SHA to .git-blame-ignore-revs. GitHub honors this file automatically in its blame UI, and git blame --ignore-revs-file=.git-blame-ignore-revs does the same locally. Reformat commits stop polluting blame.


Diff Drivers for Binary Files

For binary files (images, PDFs) and machine-generated formats such as Jupyter notebooks (JSON with embedded outputs), a raw unified diff is useless. Git supports diff drivers, custom diff or conversion commands configured per file type:

# .gitattributes
*.png diff=exif
*.ipynb diff=jupyter
*.pdf diff=pdf

# .git/config or ~/.gitconfig
[diff "exif"]
    textconv = exiftool

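[diff "jupyter"]
    # nbdime ships a full diff driver rather than a textconv filter
    command = git-nbdiffdriver diff

[diff "pdf"]
    # textconv must print to stdout; the trailing "-" tells pdftotext to do so
    textconv = sh -c 'pdftotext \"$0\" -'

Now git diff on a .png runs exiftool on both versions and diffs the text output, so metadata changes are visible. PDFs follow the same textconv pattern, as do Word docs given a suitable converter; notebooks go through nbdime's driver instead.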

Without this, a binary diff is just Binary files differ.
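Conceptually, textconv is "convert both sides to text, then run the normal diff." The sketch below imitates that for notebooks with a hypothetical converter, notebook_to_text, that keeps cell sources and drops outputs and metadata. It is not nbdime and not a drop-in textconv program (a real one is an external command that receives a file path and prints text to stdout); it only shows why the converted diff is so much quieter.

```python
import difflib
import json


def notebook_to_text(raw: bytes) -> str:
    """Hypothetical textconv: keep only cell sources, drop outputs and metadata."""
    notebook = json.loads(raw)
    chunks = []
    for number, cell in enumerate(notebook.get("cells", []), start=1):
        source = cell.get("source", "")
        if isinstance(source, list):
            source = "".join(source)
        chunks.append("# cell %d (%s)\n%s\n" % (number, cell.get("cell_type", "?"), source))
    return "".join(chunks)


def textconv_diff(old_raw: bytes, new_raw: bytes, path: str) -> str:
    """Convert both versions to text, then produce an ordinary unified diff."""
    old_text = notebook_to_text(old_raw).splitlines(keepends=True)
    new_text = notebook_to_text(new_raw).splitlines(keepends=True)
    return "".join(difflib.unified_diff(old_text, new_text, "a/" + path, "b/" + path))


def make_notebook(value: int) -> bytes:
    """Build a one-cell notebook whose source and stored output depend on value."""
    cell = {
        "cell_type": "code",
        "source": ["x = %d\n" % value, "print(x)"],
        "outputs": [{"output_type": "stream", "text": [str(value)]}],
    }
    return json.dumps({"cells": [cell], "metadata": {}}).encode()


notebook_diff = textconv_diff(make_notebook(1), make_notebook(2), "analysis.ipynb")
print(notebook_diff, end="")
```

The resulting diff contains the one-line source change and nothing about the stored output, which is the noise that makes raw notebook diffs hard to review.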


Diff Inside a Merge Commit

A merge commit has multiple parents. Diffing it is ambiguous — against which parent? Git's default:

git show <merge-sha>
# Shows a "combined diff" — lines that were modified relative to BOTH parents.
# Useful for spotting conflict resolution edits.

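git show --first-parent <merge-sha>
# Diff against the first parent only (recent Git versions).
# ("What did merging this branch change relative to where we were?")
# On older versions, git diff <merge-sha>^1 <merge-sha> shows the same comparison.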

git show -m <merge-sha>
# Separate diffs against EACH parent.

git log -p --first-parent main
# Walks main ignoring merged-in branches' individual commits.
# Shows each merge as a single diff against the previous first-parent commit.

For PR-based workflows where merges are the unit of review, --first-parent shows the clean high-level history. git log --first-parent --oneline is often the answer to "what has happened on main recently?"


Practical Commands for Daily Work

"What did I change?"

git diff                      # unstaged changes
git diff --staged             # staged changes
git diff HEAD                 # all changes since last commit

"What did this commit change?"

git show <sha>
git show <sha> --stat
git show <sha> -- path/file   # scoped

"What is on the feature branch that is not on main?"

git log main..feature         # commits
git diff main...feature       # aggregate diff

"What files changed in the last 10 commits?"

git log --name-only HEAD~10..HEAD
git diff --name-only HEAD~10 HEAD  # aggregate

"When did this line / function change?"

git log -p -L 50,60:src/app.py        # history of lines 50-60
git log -p -L :function_name:src/app.py  # history of a whole function

The -L option is underrated. It shows the full history of a specific line range or function.


Key Concepts Summary

  • Git stores full snapshots, not diffs. Diffs are computed on demand by comparing trees.
  • Tree comparison is fast: same blob SHA → unchanged; differing SHAs → load and diff.
  • Rename detection pairs a deleted path with an added path when the blob SHAs match or the content similarity meets the threshold (default 50%).
  • Unified diff format — standard output; @@ -a,b +c,d @@ is the hunk header; context, -, + mark lines.
  • .. vs ...: A..B = commits in B not A; A...B for diff = from merge-base to B.
  • Pack files + delta compression give snapshot semantics with delta efficiency. git gc packs.
  • Multiple diff algorithms: myers (default), patience, histogram, minimal. Try patience for indented code.
  • git blame walks history per-line; --ignore-revs-file skips reformat commits.
  • Diff drivers (.gitattributes) enable meaningful diffs for binary files via textconv.
  • Merge commits have multiple parents; git show uses a combined diff by default.
  • git log -L traces history of specific lines or functions — excellent for "who/when/why."

Common Mistakes

  • Thinking Git stores diffs (it stores snapshots; diffs are computed).
  • Confusing A..B and A...B — they mean different things, especially in diff vs log.
  • Forgetting --follow when tracing history across renames. Without it, git log <file> stops at the first commit touching the current filename.
  • Using the default diff algorithm for all codebases; for complex refactors, try --diff-algorithm=patience or histogram.
  • Forgetting to commit mass reformats as isolated commits, polluting blame forever.
  • Running git diff when you meant git diff --staged (or vice versa). The difference is huge for commit review.
  • Relying on rename detection when a file has been moved AND heavily modified in one commit — the similarity drops below threshold and Git reports "delete + add" instead of "rename." Split the change into two commits for cleaner history.
  • Using git log -p on a huge repo without scoping — piping gigabytes of diff output to a pager. Scope with -- path or --since.
  • Skipping .gitattributes diff drivers for binary-ish files. Reviewing a 1000-commit PR with silent "Binary files differ" is miserable; textconv for Jupyter notebooks or PDFs is a 20-minute setup that pays off forever.