Text Diff Guide: How to Compare Text and Find Differences

What Is a Text Diff?

A text diff is the computed difference between two pieces of text. Given an original document and a modified version, a diff algorithm identifies which parts were added, removed, or left unchanged. The output is typically called a diff, a patch, or an edit script — a structured representation of every change needed to transform the original into the modified version.

The concept traces back to the Unix diff utility, written by Douglas McIlroy in 1974 at Bell Labs. McIlroy's tool compared two files line by line and printed the differences in a format that another utility, patch, could apply to reproduce the changes. This design (compute the difference, then apply it elsewhere) became the foundation for version control systems, code review workflows, and collaborative editing tools that millions of developers use today.

At its core, every diff algorithm answers one question: what is the minimum set of changes that transforms text A into text B? The answer depends on what you define as a "unit" of text — a line, a word, or a single character — and which algorithm you use to find the shortest edit path between the two versions.

Why Comparing Text Matters

Text comparison is one of those capabilities that seems simple until you realize how many workflows depend on it. Here are the most common contexts where diffing is essential:

How Diff Algorithms Work

Diff algorithms are rooted in a classic computer science problem: finding the Longest Common Subsequence (LCS) of two sequences. The LCS is the longest sequence of elements that appear in both inputs in the same order, though not necessarily contiguously. Once you know the LCS, everything not in it is either an insertion or a deletion.

For example, given the sequences ABCDEF and ABXDEF, the LCS is ABDEF (length 5). The character C was deleted from the original, and X was inserted in the new version. The edit distance — the number of insertions and deletions needed — is 2.
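
The example above can be reproduced with the textbook dynamic-programming approach to LCS. This is a minimal sketch for illustration, not what production diff tools run; the function name lcs is an assumption made for this example.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (classic DP table)."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the subsequence.
    out = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


common = lcs("ABCDEF", "ABXDEF")
# Everything outside the LCS is a deletion (from A) or an insertion (from B).
edits = (len("ABCDEF") - len(common)) + (len("ABXDEF") - len(common))
print(common, edits)  # ABDEF 2
```

The table costs time and memory proportional to the product of the two lengths, which is why practical tools prefer algorithms that exploit the similarity of the inputs.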

The Myers Diff Algorithm

The most widely used diff algorithm in practice is the Myers diff algorithm, published by Eugene W. Myers in 1986 in the paper "An O(ND) Difference Algorithm and Its Variations." It is the default algorithm behind git diff, which also offers alternatives such as patience and histogram through its --diff-algorithm option.

Myers frames the diff problem as navigating an edit graph. Imagine a grid where the X-axis represents the original text and the Y-axis represents the new text. Moving right means deleting an element from the original. Moving down means inserting an element from the new version. Moving diagonally (down-right) means the elements match — no edit needed. The goal is to find the path from the top-left corner to the bottom-right corner that uses the fewest non-diagonal moves (the fewest edits).

The key insight of the Myers algorithm is that it searches outward from the start point along diagonals using a breadth-first strategy. For each "depth" D (number of edits), it explores all possible paths that use exactly D edits. It stops as soon as a path reaches the endpoint, guaranteeing the shortest edit script. The runtime is O(ND), where N is the total length of both inputs and D is the edit distance. When the two inputs are similar (D is small), this is very fast — which is exactly the common case when diffing code between commits.
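
The search described above can be sketched in a few lines. The version below is a simplified illustration that returns only the number of edits D, not the edit script itself, and it omits the linear-space refinement from the paper; the name myers_distance is an assumption made for this example.

```python
def myers_distance(a, b):
    """Length of the shortest edit script (insertions + deletions) from a to b."""
    n, m = len(a), len(b)
    max_d = n + m
    # v[k] is the furthest x reached so far on diagonal k, where k = x - y.
    v = {1: 0}
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            # Arrive on diagonal k either by moving down (an insertion)
            # or by moving right (a deletion), whichever gets further.
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]        # move down: x stays the same
            else:
                x = v[k - 1] + 1    # move right: x advances by one
            y = x - k
            # Follow the diagonal for free while the elements match.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return max_d


print(myers_distance("ABCDEF", "ABXDEF"))  # 2
```

Because the inputs are only indexed and compared, the same function works on lists of lines as well as on strings.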

Edit Distance

Edit distance quantifies how different two strings are: it is the minimum number of edit operations needed to turn one into the other. The best-known variant, Levenshtein distance, counts single-character insertions, deletions, and substitutions, so replacing one element with another costs one operation. Diff tools usually count insertions and deletions only, which makes a replacement cost two operations (a delete plus an insert); a diff with an edit distance of 3 therefore means three insertions or deletions are required. The choice depends on the tool: line-level diffs typically count insertions and deletions only, while character-level comparisons may count substitutions.
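
To make the two conventions concrete, here is a small sketch that computes either distance with the same dynamic-programming loop. The function name edit_distance and the allow_substitution flag are assumptions made for this illustration.

```python
def edit_distance(a: str, b: str, allow_substitution: bool = True) -> int:
    """Minimum number of edits turning a into b.

    With allow_substitution=True this is Levenshtein distance; with False,
    only insertions and deletions are counted, as in a typical diff.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1])          # match: no cost
                continue
            best = min(prev[j], curr[j - 1]) + 1  # delete from a, or insert from b
            if allow_substitution:
                best = min(best, prev[j - 1] + 1)  # replace ca with cb
            curr.append(best)
        prev = curr
    return prev[-1]


print(edit_distance("ABCDEF", "ABXDEF"))                            # 1
print(edit_distance("ABCDEF", "ABXDEF", allow_substitution=False))  # 2
```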

Types of Diff Comparison

The "resolution" of a diff — what counts as the smallest unit of comparison — profoundly affects both the readability and usefulness of the output.

Line-Level Diff

Line-level diff treats each line as an atomic unit. If a single character changes on a line, the entire line is marked as removed (in the original) and re-added (in the new version). This is the default behavior of the Unix diff command and git diff.

Line-level comparison is fast and produces compact output for code, where edits typically affect whole lines (adding a statement, changing a variable name, removing a block). It is the standard in code review workflows because developers think in terms of lines of code.
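
A short illustration of line granularity, using Python's difflib: changing one character is enough to mark the whole line as removed and re-added. The file names and configuration lines are invented for the example.

```python
import difflib

old = ["timeout = 5000", "retries = 3"]
new = ["timeout = 5001", "retries = 3"]

# lineterm="" because these sample lines carry no trailing newline.
line_diff = list(difflib.unified_diff(old, new, "old.cfg", "new.cfg", lineterm=""))
for line in line_diff:
    print(line)
```

The output contains -timeout = 5000 and +timeout = 5001 as a removed/added pair, even though only the final digit differs.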

Word-Level Diff

Word-level diff splits each line into individual words and compares them. If you change one word in a 20-word sentence, only that word is highlighted — not the entire line. This is vastly better for comparing prose, legal text, documentation, or any content where small edits happen within long paragraphs.

The trade-off is performance: word-level diffing requires tokenizing text into words before running the diff algorithm, and the number of tokens is much larger than the number of lines, so it takes more time on very large inputs.
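
The tokenize-then-diff idea can be sketched with difflib.SequenceMatcher, which accepts any sequence of hashable items. The word_diff helper and its [-removed-]{+added+} markers are assumptions made for this illustration; the markers imitate the plain word-diff style some tools print.

```python
import difflib
import re


def word_diff(old: str, new: str) -> str:
    # Split into runs of non-whitespace and runs of whitespace so the
    # original spacing is preserved when the tokens are joined again.
    a = re.findall(r"\S+|\s+", old)
    b = re.findall(r"\S+|\s+", new)
    out = []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append("".join(a[i1:i2]))
        if tag in ("delete", "replace"):
            out.append("[-" + "".join(a[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + "".join(b[j1:j2]) + "+}")
    return "".join(out)


print(word_diff("The quick brown fox", "The quick red fox"))
# The quick [-brown-]{+red+} fox
```

Note that difflib uses its own matching heuristic rather than the Myers algorithm, so its output is not guaranteed to be a minimal edit script.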

Character-Level Diff

Character-level diff compares individual characters. This is the finest granularity and can identify single-character changes like typos, off-by-one errors in numbers, or subtle punctuation changes. It is most useful when you need forensic precision — for example, comparing two versions of a configuration value to find that a 0 was changed to an O.

Character-level diffs are noisy on large files because they produce very granular output. They are best used on small snippets or as a secondary view after a line-level diff has identified the lines of interest.
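
For a short value, a character-level comparison can pinpoint the exact position of a change. The sketch below uses difflib.SequenceMatcher directly on the two strings; the helper name char_changes and the sample values are assumptions made for this example.

```python
import difflib


def char_changes(old: str, new: str):
    """List (operation, position_in_old, old_text, new_text) for each changed span."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [
        (tag, i1, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# A digit zero replaced by a capital letter O is hard to spot by eye.
print(char_changes("port=8080", "port=8O80"))
# [('replace', 6, '0', 'O')]
```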

Diff Output Formats

Unified Diff Format

The unified diff is the format you see in git diff output and in GitHub pull requests. It shows changes inline with context: lines prefixed with - were removed, lines prefixed with + were added, and lines prefixed with a single space are unchanged context lines included for orientation.

Unified diff is compact and widely supported. It is the de facto standard for patches, code review, and version control output. Every developer should be able to read it fluently.

Side-by-Side Diff

Side-by-side diff displays the original text in a left column and the modified text in a right column, with matching lines aligned horizontally. Changed lines are highlighted, and insertions or deletions appear as blank space in the opposite column.

This format is visually easier to scan than unified diff because your eyes can track the original and modified versions simultaneously. Most GUI diff tools (VS Code, IntelliJ, Beyond Compare) default to side-by-side view. The downside is that it requires more horizontal space, which can be awkward on narrow screens or in terminal windows.
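
The column alignment described above can be approximated in a few lines. This is a rough sketch, not how any particular GUI tool works: the side_by_side helper is an assumption, and its gutter markers (| changed, < only in the original, > only in the modified text) are loosely modeled on the output of diff -y.

```python
import difflib
from itertools import zip_longest


def side_by_side(old_lines, new_lines, width=30):
    """Return rows of 'left <marker> right' text, one per aligned line pair."""
    markers = {"equal": " ", "replace": "|", "delete": "<", "insert": ">"}
    rows = []
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Pad the shorter side with empty strings so the columns stay aligned.
        pairs = zip_longest(old_lines[i1:i2], new_lines[j1:j2], fillvalue="")
        for left, right in pairs:
            rows.append(f"{left:<{width}} {markers[tag]} {right}")
    return rows


rows = side_by_side(["a", "b", "c"], ["a", "x", "c", "d"], width=5)
print("\n".join(rows))
```

Lines longer than width are not truncated here, which is one reason real tools need the extra horizontal space mentioned above.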

Context Diff

The older context diff format (produced by diff -c) shows changes in separate blocks for the original and modified files, each surrounded by context lines. It was the standard before unified diff was introduced and is rarely used today. If you encounter it, the ! prefix marks changed lines, - marks deletions, and + marks additions.
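
If you want to see the format without digging up an old patch, Python's difflib can produce it. The sample lines and file names below are invented for the example.

```python
import difflib

old = ["alpha\n", "beta\n", "gamma\n"]
new = ["alpha\n", "BETA\n", "gamma\n", "delta\n"]

context = list(difflib.context_diff(old, new, fromfile="old.txt", tofile="new.txt"))
print("".join(context))
```

In the output, the changed line appears once in the original block as "! beta" and once in the modified block as "! BETA", while the appended line appears only in the modified block as "+ delta".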

Reading a Unified Diff

A unified diff is structured with clear conventions. Here is a complete example:

--- config/settings.json    2026-04-18 10:30:00
+++ config/settings.json    2026-04-20 14:15:00
@@ -8,7 +8,8 @@
     "debug": false,
     "log_level": "info",
     "max_retries": 3,
-    "timeout": 5000,
+    "timeout": 10000,
+    "retry_delay": 1000,
     "cache_enabled": true,
     "cache_ttl": 3600
 }

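Here is how to read each part. The --- and +++ lines name the original and the modified file. The hunk header @@ -8,7 +8,8 @@ says that the hunk covers 7 lines starting at line 8 of the original and 8 lines starting at line 8 of the modified file. Inside the hunk, the line starting with - was removed, the two lines starting with + were added, and the lines starting with a space are unchanged context.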
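
When a script needs the numbers in a hunk header, a regular expression is usually enough. The sketch below is illustrative; the names HUNK_RE and parse_hunk_header are assumptions made for this example.

```python
import re

# @@ -old_start[,old_count] +new_start[,new_count] @@
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line: str):
    """Return (old_start, old_count, new_start, new_count) for a hunk header."""
    match = HUNK_RE.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_count, new_start, new_count = match.groups()
    # When the count is omitted, the hunk covers exactly one line.
    return (
        int(old_start),
        int(old_count or 1),
        int(new_start),
        int(new_count or 1),
    )


print(parse_hunk_header("@@ -8,7 +8,8 @@"))  # (8, 7, 8, 8)
```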

Diff in Version Control

Git is the most common context where developers encounter diffs. Here are the essential commands:

Pull request diffs on platforms like GitHub are unified diffs with syntax highlighting, inline commenting, and the ability to toggle between unified and split (side-by-side) views. Understanding how to read a raw unified diff makes these visual tools even more useful, because you understand the underlying structure rather than just the highlighting.

Practical Use Cases

Beyond code review and version control, text diffing solves many practical problems:

Ignoring Whitespace and Case

Not every difference matters equally. After reformatting a file (changing tabs to spaces, adjusting indentation, or normalizing line endings), a standard diff will show every reformatted line as changed, potentially thousands of lines, even though no logic changed. Most diff tools offer options to filter out these noise differences:

In git diff, the equivalent flags are git diff -w (ignore all whitespace) and git diff -b (ignore whitespace amount changes).

When not to ignore whitespace: In whitespace-sensitive languages like Python, where indentation determines code blocks, or in YAML files, where indentation defines data structure, whitespace differences are semantic and must not be ignored. Always consider whether whitespace carries meaning in your specific context before filtering it out.
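
When you compare text in your own code, the same effect can be approximated by normalizing whitespace before diffing. This sketch collapses runs of whitespace and strips both ends of each line, which is close to, but not identical to, the -b behavior described above; the helper name diff_ignoring_whitespace is an assumption.

```python
import difflib


def diff_ignoring_whitespace(old_lines, new_lines):
    """Unified diff of the lines after collapsing runs of whitespace."""
    def squeeze(line: str) -> str:
        # split() with no arguments drops leading/trailing whitespace and
        # treats any run of spaces, tabs, or newlines as one separator.
        return " ".join(line.split())

    a = [squeeze(line) for line in old_lines]
    b = [squeeze(line) for line in new_lines]
    return list(difflib.unified_diff(a, b, lineterm=""))


result = diff_ignoring_whitespace(
    ["\tx = 1", "y = 2"],
    ["    x  =  1", "y = 3"],
)
print("\n".join(result))
```

Only the y line is reported as changed. The diff shows the normalized lines rather than the originals, so it is suited to review, not to producing a patch.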

Using Diff in Code and the Command Line

Command-Line diff

The standard Unix diff command compares two files:

# Unified diff output (the most common format)
diff -u original.txt modified.txt

# Side-by-side output
diff -y original.txt modified.txt

# Ignore whitespace differences
diff -u -w original.txt modified.txt

# Show only whether files differ, without the full diff
diff -q original.txt modified.txt

JavaScript (Node.js)

The diff npm package provides several comparison functions:

import { diffLines, diffWords, createPatch } from 'diff';

const original = 'The quick brown fox\njumps over the lazy dog';
const modified = 'The quick red fox\nleaps over the lazy dog';

// Line-level diff
const lineChanges = diffLines(original, modified);
lineChanges.forEach(part => {
  const prefix = part.added ? '+' : part.removed ? '-' : ' ';
  console.log(prefix, part.value);
});

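// Word-level diff, colored with ANSI escape codes (green = added, red = removed)
const ansi = { green: '\x1b[32m', red: '\x1b[31m', grey: '\x1b[90m' };
const wordChanges = diffWords(original, modified);
wordChanges.forEach(part => {
  const color = part.added ? 'green' : part.removed ? 'red' : 'grey';
  process.stdout.write(ansi[color] + part.value + '\x1b[0m');
});
process.stdout.write('\n');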

// Generate a unified patch
const patch = createPatch('story.txt', original, modified);
console.log(patch);

Python

Python's standard library includes difflib, which provides both comparison and patch generation:

import difflib

original = ['The quick brown fox\n', 'jumps over the lazy dog\n']
modified = ['The quick red fox\n', 'leaps over the lazy dog\n']

# Unified diff
diff = difflib.unified_diff(original, modified,
    fromfile='original.txt', tofile='modified.txt')
print(''.join(diff))

# Side-by-side HTML diff
differ = difflib.HtmlDiff()
html = differ.make_file(original, modified)

# Sequence matcher for similarity ratio
ratio = difflib.SequenceMatcher(None,
    'The quick brown fox', 'The quick red fox').ratio()
print(f'Similarity: {ratio:.1%}')  # 83.3%

Best Practices for Text Comparison

Frequently Asked Questions

What is the difference between line-level and word-level diff?

Line-level diff treats each line as an atomic unit. If a single character changes on a line, the entire line is marked as removed and re-added. This is fast and ideal for code, where changes typically affect whole lines.

Word-level diff breaks lines into individual words and compares them. If you change one word in a 20-word sentence, only that word is highlighted. This is far better for prose, documentation, and natural-language content where small edits occur within long paragraphs.

How does the Myers diff algorithm work?

The Myers diff algorithm, published in 1986, finds the shortest edit script (minimum insertions and deletions) to transform one sequence into another. It models the problem as an edit graph where moving right is a deletion, moving down is an insertion, and moving diagonally is a match. The algorithm performs a greedy breadth-first search along diagonals, exploring all paths with D edits before trying D+1 edits.

Its runtime is O(ND), where N is total input length and D is edit distance. Since most real-world diffs have small D (the files are mostly similar), it performs very well in practice. This is the algorithm used by git diff.

How do I read a unified diff?

A unified diff has three key elements. First, the header lines: --- marks the original file and +++ marks the modified file. Second, hunk headers like @@ -10,4 +10,5 @@ indicate where the change occurs: 4 lines starting at line 10 in the original, and 5 lines starting at line 10 in the new file. Third, the diff content: lines starting with - were removed, lines starting with + were added, and lines starting with a space are unchanged context.

Can I use a text diff tool on binary files?

Text diff tools are designed for plain text and will not produce meaningful results for binary files such as images, PDFs, compiled executables, or compressed archives. Most tools will simply report "Binary files differ" without showing specifics. For binary comparison, use specialized tools: image diff tools for visual comparison, hex editors for byte-level inspection, or format-specific utilities like PDF comparison software.
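
If your own tooling needs to decide whether to attempt a text diff at all, a common heuristic is to look for NUL bytes near the start of the data. The sketch below is an assumption-laden illustration: the function name looks_binary and the 8000-byte sample size are choices made for this example, not a documented standard.

```python
def looks_binary(data: bytes, sample_size: int = 8000) -> bool:
    """Heuristic: treat data as binary if a NUL byte appears in the first bytes."""
    return b"\x00" in data[:sample_size]


png_header = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"
print(looks_binary(png_header))                 # True
print(looks_binary("plain text\n".encode()))    # False
```

The check is cheap but imperfect: text encoded as UTF-16, for instance, contains NUL bytes and would be misclassified as binary.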

What does the @@ notation mean in a diff?

The @@ notation marks a hunk header in unified diff format. The syntax is @@ -L,S +L,S @@, where the first L,S pair gives the starting line number and line count in the original file, and the second pair gives the same for the new file. For example, @@ -25,7 +25,9 @@ means the hunk starts at line 25 in both files and covers 7 lines in the original and 9 lines in the modified version, a net gain of 2 lines.

Should I ignore whitespace when comparing text?

It depends on context. Ignore whitespace when reviewing changes after reformatting code, changing indentation styles, or normalizing line endings — this prevents hundreds of noise lines from hiding real changes. Use diff -w or git diff -w for this.

Do not ignore whitespace in whitespace-sensitive languages (Python, where indentation defines code blocks), in YAML files (where indentation defines structure), or when auditing formatting standards. Always consider whether whitespace carries semantic meaning before filtering it out.