Text Diff Guide: How to Compare Text and Find Differences
What Is a Text Diff?
A text diff is the computed difference between two pieces of text. Given an original document and a modified version, a diff algorithm identifies which parts were added, removed, or left unchanged. The output is typically called a diff, a patch, or an edit script — a structured representation of every change needed to transform the original into the modified version.
The concept traces back to the Unix diff utility, developed at Bell Labs by Douglas McIlroy together with James Hunt and first shipped with Unix in 1974. The tool compared two files line by line and printed the differences in a format that another utility, patch, could apply to reproduce the changes. This design (compute the difference, then apply it elsewhere) became the foundation for version control systems, code review workflows, and collaborative editing tools that millions of developers use today.
At its core, every diff algorithm answers one question: what is the minimum set of changes that transforms text A into text B? The answer depends on what you define as a "unit" of text — a line, a word, or a single character — and which algorithm you use to find the shortest edit path between the two versions.
Why Comparing Text Matters
Text comparison is one of those capabilities that seems simple until you realize how many workflows depend on it. Here are the most common contexts where diffing is essential:
- Code review. Every pull request on GitHub, GitLab, or Bitbucket presents a diff. Reviewers read it to understand what changed, catch bugs, verify logic, and approve merges. Without a clear diff view, code review would mean comparing full files side by side — impractical at any scale.
- Document versioning. Legal contracts, technical specifications, and policy documents go through multiple revisions. A diff between draft versions lets editors and stakeholders see exactly what was added, removed, or reworded without reading both versions cover to cover.
- Configuration auditing. When a server configuration changes and something breaks, comparing the current config file against the previous version immediately reveals what changed. This is often the fastest path to diagnosing an outage.
- Debugging. Comparing two log files, two API responses, or two database dumps can isolate the exact point where behavior diverged. If a staging environment works but production does not, diffing the environment variables or config files often surfaces the discrepancy in seconds.
- Content editing. Writers and editors compare drafts to track how copy evolved. Word-level diff is particularly useful here, showing which specific words and phrases were changed without drowning the reader in unchanged context.
How Diff Algorithms Work
Diff algorithms are rooted in a classic computer science problem: finding the Longest Common Subsequence (LCS) of two sequences. The LCS is the longest sequence of elements that appear in both inputs in the same order, though not necessarily contiguously. Once you know the LCS, everything not in it is either an insertion or a deletion.
For example, given the sequences ABCDEF and ABXDEF, the LCS is ABDEF (length 5). The character C was deleted from the original, and X was inserted in the new version. The edit distance — the number of insertions and deletions needed — is 2.
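The ABCDEF / ABXDEF example can be checked with a short sketch. This uses the textbook dynamic-programming LCS, which is quadratic in time and memory and is not what production diff tools run; the function name lcs is only illustrative.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (textbook DP)."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left corner to recover the subsequence.
    out = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1  # skip a[i]: it counts as a deletion
        else:
            j += 1  # skip b[j]: it counts as an insertion
    return "".join(out)


common = lcs("ABCDEF", "ABXDEF")
# Everything outside the LCS is an insertion or a deletion.
edits = len("ABCDEF") + len("ABXDEF") - 2 * len(common)
print(common, edits)  # ABDEF 2
```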
The Myers Diff Algorithm
The most widely used diff algorithm in practice is the Myers diff algorithm, published by Eugene W. Myers in 1986 in the paper "An O(ND) Difference Algorithm and Its Variations." This is the algorithm that powers git diff, and by extension, every pull request diff you have ever read on GitHub.
Myers frames the diff problem as navigating an edit graph. Imagine a grid where the X-axis represents the original text and the Y-axis represents the new text. Moving right means deleting an element from the original. Moving down means inserting an element from the new version. Moving diagonally (down-right) means the elements match — no edit needed. The goal is to find the path from the top-left corner to the bottom-right corner that uses the fewest non-diagonal moves (the fewest edits).
The key insight of the Myers algorithm is that it searches outward from the start point along diagonals using a breadth-first strategy. For each "depth" D (number of edits), it explores all possible paths that use exactly D edits. It stops as soon as a path reaches the endpoint, guaranteeing the shortest edit script. The runtime is O(ND), where N is the total length of both inputs and D is the edit distance. When the two inputs are similar (D is small), this is very fast — which is exactly the common case when diffing code between commits.
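The search described above can be sketched in a few lines. This is a simplified version of the greedy forward pass that returns only the edit count D, not the edit script itself, and it omits the linear-space refinement from the paper. The name myers_distance is illustrative and does not come from git.

```python
def myers_distance(a, b) -> int:
    """Length of the shortest insert/delete edit script from a to b."""
    n, m = len(a), len(b)
    # furthest[k] = largest x reached so far on diagonal k, where k = x - y.
    furthest = {1: 0}
    for d in range(n + m + 1):          # d = number of edits used so far
        for k in range(-d, d + 1, 2):   # diagonals reachable with d edits
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]       # move down: insert from b
            else:
                x = furthest[k - 1] + 1   # move right: delete from a
            y = x - k
            # Follow the diagonal for free while elements match.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            furthest[k] = x
            if x >= n and y >= m:
                return d                  # reached the bottom-right corner
    return n + m


print(myers_distance("ABCDEF", "ABXDEF"))  # 2
```

Because the inputs only need indexing and equality, the same function works on lists of lines as well as on strings.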
Edit Distance
Edit distance quantifies how different two sequences are. A diff with an edit distance of 3 means three insertions or deletions are required. Some algorithms also count substitutions (replacing one element with another) as a single operation rather than a delete-plus-insert; the character-level version of that measure is the Levenshtein distance. The choice depends on the tool: line-level diffs typically count insertions and deletions only, while character-level diffs may count substitutions.
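The two ways of counting give different numbers for the same input. The sketch below compares them using plain dynamic programming; the function names are illustrative, and neither routine is how a particular diff tool is implemented.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        prev = curr
    return prev[-1]


def indel_distance(a: str, b: str) -> int:
    """Edit distance where only insertions and deletions are allowed."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1])                  # match costs nothing
            else:
                curr.append(min(prev[j], curr[j - 1]) + 1)
        prev = curr
    return prev[-1]


print(levenshtein("ABCDEF", "ABXDEF"), indel_distance("ABCDEF", "ABXDEF"))  # 1 2
print(levenshtein("kitten", "sitting"), indel_distance("kitten", "sitting"))  # 3 5
```

Replacing C with X is one substitution under Levenshtein counting but a delete plus an insert under the counting that line-level diffs use.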
Types of Diff Comparison
The "resolution" of a diff — what counts as the smallest unit of comparison — profoundly affects both the readability and usefulness of the output.
Line-Level Diff
Line-level diff treats each line as an atomic unit. If a single character changes on a line, the entire line is marked as removed (in the original) and re-added (in the new version). This is the default behavior of the Unix diff command and git diff.
Line-level comparison is fast and produces compact output for code, where edits typically affect whole lines (adding a statement, changing a variable name, removing a block). It is the standard in code review workflows because developers think in terms of lines of code.
Word-Level Diff
Word-level diff splits each line into individual words and compares them. If you change one word in a 20-word sentence, only that word is highlighted — not the entire line. This is vastly better for comparing prose, legal text, documentation, or any content where small edits happen within long paragraphs.
The trade-off is performance: word-level diffing requires tokenizing text into words before running the diff algorithm, and the number of tokens is much larger than the number of lines, so it takes more time on very large inputs.
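A rough sketch of the tokenize-then-diff idea, using Python's difflib.SequenceMatcher as the matcher. Note that SequenceMatcher is not the Myers algorithm, so the grouping may differ from other tools on harder inputs. The helper name word_diff and the whitespace-preserving tokenizer are assumptions made for this illustration.

```python
import difflib
import re


def word_diff(old: str, new: str) -> list[tuple[str, str]]:
    """Return (marker, text) pairs: ' ' unchanged, '-' removed, '+' added."""
    # Keep whitespace runs as tokens so the original text can be rebuilt.
    a = re.findall(r"\s+|\S+", old)
    b = re.findall(r"\s+|\S+", new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.append((" ", "".join(a[i1:i2])))
            continue
        if i1 < i2:  # tokens present only in the old text
            result.append(("-", "".join(a[i1:i2])))
        if j1 < j2:  # tokens present only in the new text
            result.append(("+", "".join(b[j1:j2])))
    return result


changes = word_diff("The quick brown fox", "The quick red fox")
for marker, text in changes:
    print(marker, repr(text))
```

Only the word brown is reported as removed and red as added; the rest of the sentence comes back as unchanged runs.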
Character-Level Diff
Character-level diff compares individual characters. This is the finest granularity and can identify single-character changes like typos, off-by-one errors in numbers, or subtle punctuation changes. It is most useful when you need forensic precision — for example, comparing two versions of a configuration value to find that a 0 was changed to an O.
Character-level diffs are noisy on large files because they produce very granular output. They are best used on small snippets or as a secondary view after a line-level diff has identified the lines of interest.
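For a short value, a character-level comparison can be sketched with difflib.SequenceMatcher, which accepts strings directly and treats each character as an element. The setting port=8080 and the helper name char_changes are made up for the example.

```python
import difflib


def char_changes(old: str, new: str) -> list[tuple[str, str, str]]:
    """Return (operation, old_text, new_text) for each non-matching span."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# A zero replaced by a capital letter O is easy to miss by eye.
print(char_changes("port=8080", "port=8O80"))  # [('replace', '0', 'O')]
```

On longer inputs, or inputs with many repeated characters, the matcher may report the same change as a separate insert and delete, which is part of why character-level output gets noisy.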
Diff Output Formats
Unified Diff Format
The unified diff is the format you see in git diff output and in GitHub pull requests. It shows changes inline with context: lines prefixed with - were removed, lines prefixed with + were added, and lines prefixed with a single space are unchanged context lines included for orientation.
Unified diff is compact and widely supported. It is the de facto standard for patches, code review, and version control output. Every developer should be able to read it fluently.
Side-by-Side Diff
Side-by-side diff displays the original text in a left column and the modified text in a right column, with matching lines aligned horizontally. Changed lines are highlighted, and insertions or deletions appear as blank space in the opposite column.
This format is visually easier to scan than unified diff because your eyes can track the original and modified versions simultaneously. Most GUI diff tools (VS Code, IntelliJ, Beyond Compare) default to side-by-side view. The downside is that it requires more horizontal space, which can be awkward on narrow screens or in terminal windows.
Context Diff
The older context diff format (produced by diff -c) shows changes in separate blocks for the original and modified files, each surrounded by context lines. It was the standard before unified diff was introduced and is rarely used today. If you encounter it, the ! prefix marks changed lines, - marks deletions, and + marks additions.
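If you want to see the context format without hunting for an old patch, Python's difflib can produce it. The file names and settings below are invented for the illustration.

```python
import difflib

old = ["timeout: 5000\n", "retries: 3\n", "cache: true\n"]
new = ["timeout: 10000\n", "retries: 3\n", "cache: true\n", "retry_delay: 1000\n"]

# context_diff yields the header lines followed by one block per file.
context = list(difflib.context_diff(old, new, fromfile="old.yaml", tofile="new.yaml"))
print("".join(context))
```

In the printed output the changed timeout line appears twice, once in each block and marked with ! both times, while the new retry_delay line appears only in the second block, marked with +.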
Reading a Unified Diff
A unified diff is structured with clear conventions. Here is a complete example:
--- config/settings.json 2026-04-18 10:30:00
+++ config/settings.json 2026-04-20 14:15:00
@@ -8,7 +8,8 @@
   "debug": false,
   "log_level": "info",
   "max_retries": 3,
-  "timeout": 5000,
+  "timeout": 10000,
+  "retry_delay": 1000,
   "cache_enabled": true,
   "cache_ttl": 3600
 }
Here is how to read each part:
- --- identifies the original file and its timestamp.
- +++ identifies the modified file and its timestamp.
- @@ -8,7 +8,8 @@ is the hunk header. It says: this section starts at line 8 in the original (showing 7 lines) and starts at line 8 in the new file (showing 8 lines). The extra line is the new retry_delay property.
- Lines with neither a - nor a + marker ("debug": false, etc.) are context lines: unchanged, included for orientation.
- The line starting with - was removed: the timeout was 5000.
- The lines starting with + were added: the timeout was changed to 10000, and a new retry_delay property was introduced.
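The hunk header is regular enough to parse with one regular expression. The sketch below is a minimal reader for that single line, not a full patch parser; the function name is illustrative. In the unified format a count that is omitted means 1.

```python
import re

# @@ -old_start[,old_count] +new_start[,new_count] @@ optional section text
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line: str) -> tuple[int, int, int, int]:
    """Return (old_start, old_count, new_start, new_count)."""
    match = HUNK_RE.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_count, new_start, new_count = match.groups()
    return (
        int(old_start),
        int(old_count) if old_count is not None else 1,
        int(new_start),
        int(new_count) if new_count is not None else 1,
    )


print(parse_hunk_header("@@ -8,7 +8,8 @@"))  # (8, 7, 8, 8)
```

The difference between the two counts (8 minus 7 here) is the net number of lines the hunk adds.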
Diff in Version Control
Git is the most common context where developers encounter diffs. Here are the essential commands:
- git diff: shows unstaged changes in your working directory relative to the index (staging area).
- git diff --staged (or --cached): shows changes that have been staged for the next commit.
- git diff main..feature-branch: shows all differences between the tips of two branches. A pull request diff is usually the three-dot form, git diff main...feature-branch, which compares the feature branch against the commit where it diverged from main.
- git diff HEAD~3: compares your working tree with the commit three commits before HEAD, so it covers the last three commits plus any uncommitted changes.
- git diff --stat: shows a summary of changed files with insertion/deletion counts, without the full diff content.
Pull request diffs on platforms like GitHub are unified diffs with syntax highlighting, inline commenting, and the ability to toggle between unified and split (side-by-side) views. Understanding how to read a raw unified diff makes these visual tools even more useful, because you understand the underlying structure rather than just the highlighting.
Practical Use Cases
Beyond code review and version control, text diffing solves many practical problems:
- Comparing API responses. When an API starts returning unexpected data, save the expected response and the actual response, then diff them. The differences immediately reveal which fields changed, which were added, and which disappeared.
- Checking configuration changes. Before deploying a config change to production, diff the new config against the current one. This catches accidental deletions, unintended defaults, and typos that automated tests might not cover.
- Reviewing document edits. When a colleague edits a shared document, a word-level diff shows exactly what they changed without requiring both parties to read the entire document. This is especially valuable for long legal documents or technical specifications.
- Debugging by comparing logs. If a process succeeds in staging but fails in production, diff the log output from both environments. The point of divergence is often the root cause or the first clue to finding it.
- Finding copy edits in content. Content teams use diff tools to verify that only intended changes were made to published articles, marketing copy, or UI strings. A single misplaced character in a price or legal disclaimer can have real consequences.
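For the API response case, one workable approach is to serialize both payloads with sorted keys before diffing, so that key order alone never shows up as a change. The payloads and the helper name json_diff below are invented for the illustration.

```python
import difflib
import json


def json_diff(expected: dict, actual: dict) -> str:
    """Unified diff of two JSON-serializable objects, ignoring key order."""
    a = json.dumps(expected, indent=2, sort_keys=True).splitlines()
    b = json.dumps(actual, indent=2, sort_keys=True).splitlines()
    lines = difflib.unified_diff(a, b, fromfile="expected", tofile="actual", lineterm="")
    return "\n".join(lines)


expected = {"id": 7, "status": "active", "roles": ["admin"]}
actual = {"status": "disabled", "id": 7, "roles": ["admin"], "locked": True}

report = json_diff(expected, actual)
print(report)
```

The report shows the added locked field and the changed status value, while id and roles appear only as context even though the keys arrived in a different order.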
Ignoring Whitespace and Case
Not every difference matters equally. After reformatting a file (changing tabs to spaces, adjusting indentation, or normalizing line endings), a standard diff will show every reformatted line as changed, potentially thousands of lines, even though no logic changed. Most diff tools offer options to filter out these noise differences:
- -w or --ignore-all-space: ignores all whitespace differences, including added or removed spaces.
- -b or --ignore-space-change: ignores changes in the amount of whitespace (e.g., 2 spaces vs. 4 spaces) but still catches lines where whitespace was added or removed entirely.
- --ignore-blank-lines: skips differences that consist only of added or removed empty lines.
In git diff, the equivalent flags are git diff -w (ignore all whitespace) and git diff -b (ignore whitespace amount changes).
When not to ignore whitespace: In whitespace-sensitive languages like Python, where indentation determines code blocks, or in YAML files, where indentation defines data structure, whitespace differences are semantic and must not be ignored. Always consider whether whitespace carries meaning in your specific context before filtering it out.
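The idea behind whitespace-insensitive comparison can be sketched by matching on normalized copies of each line while reporting the original text. This only approximates what -w does and is not how diff implements it; the helper name is illustrative.

```python
import difflib


def diff_ignoring_whitespace(old_lines: list[str], new_lines: list[str]) -> list[str]:
    """Report only lines that differ once all whitespace is stripped out."""
    def normalize(line: str) -> str:
        return "".join(line.split())  # drop every whitespace character

    matcher = difflib.SequenceMatcher(
        None,
        [normalize(line) for line in old_lines],
        [normalize(line) for line in new_lines],
        autojunk=False,
    )
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        out += ["-" + line for line in old_lines[i1:i2]]
        out += ["+" + line for line in new_lines[j1:j2]]
    return out


old = ["if x:", "\treturn 1", "print('done')"]
new = ["if x:", "    return 1", "print('finished')"]
print(diff_ignoring_whitespace(old, new))  # ["-print('done')", "+print('finished')"]
```

The tab-to-spaces change on the second line is hidden, which is exactly the behavior that makes this kind of filtering risky for Python or YAML sources.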
Using Diff in Code and the Command Line
Command-Line diff
The standard Unix diff command compares two files:
# Unified diff output (the most common format)
diff -u original.txt modified.txt
# Side-by-side output
diff -y original.txt modified.txt
# Ignore whitespace differences
diff -u -w original.txt modified.txt
# Show only whether files differ, without the full diff
diff -q original.txt modified.txt
JavaScript (Node.js)
The diff npm package provides several comparison functions:
import { diffLines, diffWords, createPatch } from 'diff';
const original = 'The quick brown fox\njumps over the lazy dog';
const modified = 'The quick red fox\nleaps over the lazy dog';
// Line-level diff
const lineChanges = diffLines(original, modified);
lineChanges.forEach(part => {
  const prefix = part.added ? '+' : part.removed ? '-' : ' ';
  console.log(prefix, part.value);
});
// Word-level diff, colored with ANSI escape codes
const ansi = { green: '\x1b[32m', red: '\x1b[31m', grey: '\x1b[90m' };
const wordChanges = diffWords(original, modified);
wordChanges.forEach(part => {
  const color = part.added ? 'green' : part.removed ? 'red' : 'grey';
  // Wrap each part in its color, then reset the terminal style
  process.stderr.write(ansi[color] + part.value + '\x1b[0m');
});
// Generate a unified patch
const patch = createPatch('story.txt', original, modified);
console.log(patch);
Python
Python's standard library includes difflib, which provides both comparison and patch generation:
import difflib
original = ['The quick brown fox\n', 'jumps over the lazy dog\n']
modified = ['The quick red fox\n', 'leaps over the lazy dog\n']
# Unified diff
diff = difflib.unified_diff(original, modified,
                            fromfile='original.txt', tofile='modified.txt')
print(''.join(diff))
# Side-by-side HTML diff
differ = difflib.HtmlDiff()
html = differ.make_file(original, modified)
# Sequence matcher for similarity ratio
ratio = difflib.SequenceMatcher(
    None, 'The quick brown fox', 'The quick red fox').ratio()
print(f'Similarity: {ratio:.1%}')  # 83.3%
Best Practices for Text Comparison
- Diff before committing. Always run git diff --staged before committing to verify that only the intended changes are included. Stray debug statements, accidental whitespace changes, and unrelated edits are caught here, not in code review.
- Save diffs for audit trails. When deploying configuration changes or database migrations, save the diff output as part of the deployment record. If something breaks, the diff provides an instant starting point for investigation.
- Use word-level diff for prose. Line-level diff is designed for code. For documentation, marketing copy, legal text, or any natural-language content, word-level diff produces far more readable output because it highlights the specific words that changed rather than entire paragraphs.
- Keep commits small and focused. A diff that changes 3 files and 20 lines is easy to review. A diff that changes 40 files and 2,000 lines is effectively unreviewable. Small, focused commits produce small, focused diffs, which produce better code review.
- Separate formatting from logic changes. If you need to reformat a file (change indentation, sort imports, adjust line length), do it in a separate commit with no logic changes. This keeps the formatting diff separate from the meaningful diff, making both easier to review.
- Use semantic diff when available. Some tools understand the syntax of specific languages and can produce "semantic" diffs, for example recognizing that a function was moved rather than deleted and re-added. Tools like difftastic parse code into syntax trees before diffing, producing cleaner output than line-level comparison for structural changes.
Frequently Asked Questions
What is the difference between line-level and word-level diff?
Line-level diff treats each line as an atomic unit. If a single character changes on a line, the entire line is marked as removed and re-added. This is fast and ideal for code, where changes typically affect whole lines.
Word-level diff breaks lines into individual words and compares them. If you change one word in a 20-word sentence, only that word is highlighted. This is far better for prose, documentation, and natural-language content where small edits occur within long paragraphs.
How does the Myers diff algorithm work?
The Myers diff algorithm, published in 1986, finds the shortest edit script (minimum insertions and deletions) to transform one sequence into another. It models the problem as an edit graph where moving right is a deletion, moving down is an insertion, and moving diagonally is a match. The algorithm performs a greedy breadth-first search along diagonals, exploring all paths with D edits before trying D+1 edits.
Its runtime is O(ND), where N is total input length and D is edit distance. Since most real-world diffs have small D (the files are mostly similar), it performs very well in practice. This is the algorithm used by git diff.
How do I read a unified diff?
A unified diff has three key elements. First, the header lines: --- marks the original file and +++ marks the modified file. Second, hunk headers like @@ -10,4 +10,5 @@ indicate where the change occurs: line 10 in the original (4 lines shown) and line 10 in the new file (5 lines shown). Third, the diff content: lines starting with - were removed, lines starting with + were added, and lines starting with a space are unchanged context.
Can I diff binary files?
Text diff tools are designed for plain text and will not produce meaningful results for binary files such as images, PDFs, compiled executables, or compressed archives. Most tools will simply report "Binary files differ" without showing specifics. For binary comparison, use specialized tools: image diff tools for visual comparison, hex editors for byte-level inspection, or format-specific utilities like PDF comparison software.
What does @@ mean in a diff?
The @@ notation marks a hunk header in unified diff format. The syntax is @@ -L,S +L,S @@, where the first L,S pair gives the starting line number and line count in the original file, and the second pair gives the same for the new file. For example, @@ -25,7 +25,9 @@ means the hunk starts at line 25 in both files, covers 7 lines in the original, and 9 lines in the modified version, so the hunk is a net 2 lines longer.
Should I ignore whitespace when comparing text?
It depends on context. Ignore whitespace when reviewing changes after reformatting code, changing indentation styles, or normalizing line endings; this prevents hundreds of noise lines from hiding real changes. Use diff -w or git diff -w for this.
Do not ignore whitespace in whitespace-sensitive languages (Python, where indentation defines code blocks), in YAML files (where indentation defines structure), or when auditing formatting standards. Always consider whether whitespace carries semantic meaning before filtering it out.