You open a PDF, copy a paragraph of text, paste it somewhere else, and receive a wall of gibberish: ï¬rst instead of first, ’ instead of an apostrophe, or entire words replaced by a sequence of question marks and boxes. The document looks perfectly fine on screen, yet the text that comes out is unusable. This is not a display bug — it is a text encoding problem baked into the PDF file itself.

This guide explains exactly why PDF text goes wrong, the different failure modes you are likely to encounter, what happens at the byte level inside a PDF when encoding breaks, and how to get clean readable text back out. Whether you are dealing with a downloaded research paper, a vendor invoice, a legal contract, or a scanned form, the solutions covered here apply.

<!-- Section 1 -->

1. Why PDF Text Goes Garbled

PDF text corruption is almost never random. Each type of garbling has a specific technical cause, and understanding the cause is the fastest path to the right fix.

Character Encoding Mismatch: Windows-1252 vs UTF-8

Before Unicode became universal, applications used dozens of competing single-byte encodings — Windows-1252, ISO-8859-1, Mac OS Roman, and many others. Each encoding assigns a different character to byte values in the range 0x80–0xFF. If a PDF was created on a Windows machine using the Windows-1252 encoding but its metadata declares the encoding as something else (or declares nothing at all), any extractor that assumes UTF-8 will misinterpret every non-ASCII byte. The right single quote (U+2019), for example, is stored in UTF-8 as the three bytes 0xE2 0x80 0x99; decoded as Windows-1252, those bytes become ’, the famous substitution that plagues copied web and PDF text alike. The mismatch cuts the other way too: the curly quote stored as byte 0x91 in Windows-1252 maps to an invisible control character in ISO-8859-1 and is not even a valid byte sequence in UTF-8.
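The mismatch is easy to reproduce with nothing but the standard library. The snippet below (a demonstration, not a fix) encodes correct text as UTF-8 and then decodes the bytes under Windows-1252, producing exactly the substitutions described above:

```python
# Encode correct text as UTF-8, then decode the raw bytes under the
# wrong charset (Windows-1252) -- the classic mojibake recipe.
samples = ["é", "’", "—"]
for s in samples:
    garbled = s.encode("utf-8").decode("cp1252")
    print(f"{s!r} -> {garbled!r}")
```

Running this shows é turning into Ã© and the right single quote turning into â€™, which is the transformation every "encoding fix" tool is built to reverse.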

Font Substitution During Printing or Conversion

When a PDF is created by printing to a PDF driver rather than by a proper PDF export pipeline, the printer driver may not have access to the original font metrics. It substitutes a system font — usually Times New Roman or Helvetica — and re-encodes the glyph stream on the fly. The visual result may look acceptable, but the glyph-to-Unicode mapping is now whoever wrote the printer driver's mapping table, which is often wrong for anything beyond basic ASCII.

Copy-Paste Corruption from Web or Word Processors

Microsoft Word, LibreOffice, and many web-based editors silently convert text to their internal encoding as you paste. Smart quotes become straight quotes, em-dashes become double hyphens, bullet points vanish, and special symbols are dropped entirely. When that document is then exported to PDF, the garbled content is already present in the source — the PDF is faithfully encoding the wrong characters.

OCR Artifacts

Optical character recognition software reads pixel patterns and guesses what letter each pattern represents. It has well-known failure modes: the letter l (lowercase L) is frequently confused with the digit 1 or the letter I; the letter O is confused with the digit 0; ligatures like fi and fl are read as a single unknown character. The accuracy of OCR depends heavily on scan resolution, font clarity, and whether the OCR engine was trained on a similar typeface.

PostScript Conversion Errors

Many professional publishing workflows generate PDF by first producing PostScript and then distilling it through a converter such as Ghostscript or Adobe Distiller. PostScript is a programming language — it does not have a concept of Unicode text, only drawing commands. If the distillation step does not correctly reconstruct the text layer from the drawing commands, glyphs appear on screen (they were drawn correctly) but the text layer is absent or garbled.

<!-- Section 2 -->

2. The Three Types of PDF Text Problems

It helps to distinguish three failure modes, because each has a different solution path.

Type 1: Encoding Corruption

The text layer exists and is mostly intact, but the bytes have been interpreted under the wrong encoding scheme. Symptoms: accented characters (é, ñ, ü) appear as two-character sequences like Ã©, Ã±, Ã¼. Smart quotes and em-dashes appear as two or three strange characters. The overall structure of the text — word breaks, line breaks, paragraph order — is preserved; only certain characters are wrong. Fix: re-extract with explicit encoding detection, or apply a UTF-8/Windows-1252 re-encoding pass.

Type 2: OCR Artifacts

The text layer exists but was generated by an OCR engine rather than a native export. Symptoms: random single-character substitutions (rn read as m, cl read as d, li read as h), inconsistent spacing, numbers embedded in words, occasional strings of gibberish where the OCR engine failed to recognize a word entirely. The encoding is usually correct — it is the recognition step that introduced errors. Fix: re-run OCR with a higher-quality engine or at a higher scan resolution; use post-processing spell-check to catch common substitutions.

Type 3: Font Mapping Errors (Missing ToUnicode Table)

This is the most technically subtle problem. The PDF has a text layer, the encoding is nominally correct, but the font used in the document either has no ToUnicode mapping table or has an incorrect one. Symptoms: text extracts as completely wrong characters — not a systematic encoding shift but apparently random substitution, or long runs of identical wrong characters. Ligatures like fi and fl may extract as a single unknown glyph. Fix: use a PDF extraction library that attempts heuristic glyph-to-Unicode mapping, such as pdfminer.six, or re-export the original document with proper font embedding.

<!-- Section 3 -->

3. How PDF Text Encoding Works

To understand why these problems occur, you need a working mental model of how PDF stores text. The PDF specification does not store text as a Unicode string the way an HTML file does. It stores a sequence of glyph IDs — integers that index into a font's glyph table — alongside positioning instructions. The actual characters those glyphs represent are recorded separately in one of two structures.

The ToUnicode CMap

The primary mechanism is the ToUnicode entry in a PDF font dictionary. This is a stream that contains a character map (CMap) — essentially a lookup table that says "glyph ID 0x0041 corresponds to Unicode code point U+0046 (LATIN CAPITAL LETTER F)." When you copy text from a PDF or run a text extractor, the viewer or library reads these glyph IDs, looks up each one in the ToUnicode CMap, and produces the corresponding Unicode character.

If the ToUnicode CMap is absent, the extractor falls back to the font's Encoding dictionary, which maps glyph names (like /fi or /endash) to Unicode. If that is also absent or incomplete, the extractor must guess — and guessing produces the garbled output you see.
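That fallback chain can be pictured as a three-step lookup per glyph. The function below is a conceptual sketch of the logic — the function name, dictionary shapes, and data are illustrative, not any real library's API:

```python
REPLACEMENT = "\ufffd"  # U+FFFD, rendered as the question-mark diamond

def glyph_to_unicode(glyph_id, to_unicode_cmap, glyph_names, encoding_dict):
    """Resolve one glyph ID to text the way a PDF extractor does."""
    # 1. Preferred path: an explicit ToUnicode CMap entry.
    if glyph_id in to_unicode_cmap:
        return to_unicode_cmap[glyph_id]
    # 2. Fallback: a glyph name resolved through the Encoding dictionary.
    name = glyph_names.get(glyph_id)
    if name is not None and name in encoding_dict:
        return encoding_dict[name]
    # 3. No mapping at all: the extractor must guess or give up.
    return REPLACEMENT

# A font with a partial map: glyph 1 has a CMap entry, glyph 2 only has
# a glyph name, glyph 3 has nothing.
cmap = {1: "F"}
names = {2: "endash"}
encoding = {"endash": "\u2013"}
print(glyph_to_unicode(1, cmap, names, encoding))  # F
print(glyph_to_unicode(2, cmap, names, encoding))  # –
print(glyph_to_unicode(3, cmap, names, encoding))  # �
```

Glyph 3 is the garbled-output case: nothing in the file ties the glyph ID back to a character, so the extractor emits a replacement character or a wrong guess.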

Type1, TrueType, and CFF Fonts

PDF supports three main font format families, and each interacts with the encoding machinery differently. Type 1 fonts (the original PostScript font format) use named glyphs and a standard encoding vector; they are usually extractable if the glyph names follow Adobe's standard naming conventions. TrueType fonts use numeric glyph IDs with no inherent naming convention; a ToUnicode CMap is essential for correct extraction. CFF (Compact Font Format, also called Type 1C) is a compact successor to Type 1; embedded inside a Type 0 (composite) font it can address very large, CID-keyed character sets, and such fonts are the most reliable for text extraction when created with a complete ToUnicode CMap.

Ligatures and Glyph Composition

Professional typefaces include ligatures: single glyphs that visually represent two or more characters. The fi ligature is the most common — the dot of the i is merged with the top of the f for visual refinement. In the font's glyph table, this ligature is a single entity with a single glyph ID, typically named /fi or mapped to the Unicode code point U+FB01 (LATIN SMALL LIGATURE FI). If the PDF's ToUnicode CMap maps this ligature glyph to U+FB01, a good extractor will decompose it to the two-character string "fi". If the CMap is missing the entry, the extractor either drops the glyph or outputs the Unicode replacement character U+FFFD (�).
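If extracted text contains the ligature code points themselves (U+FB01, U+FB02, and friends), you can decompose them after the fact: Unicode NFKC normalization maps each compatibility ligature back to its constituent letters. A standard-library sketch (note that NFKC also rewrites other compatibility characters, such as superscripts and fullwidth forms, so apply it deliberately rather than blindly):

```python
import unicodedata

# Extracted text carrying the fi (U+FB01) and fl (U+FB02) ligature glyphs.
extracted = "\ufb01rst \ufb02oor"
clean = unicodedata.normalize("NFKC", extracted)
print(clean)  # first floor
```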

Technical note: You can inspect a PDF's font dictionaries and CMap streams directly with a tool like mutool show document.pdf trailer or by opening the PDF in a hex editor and searching for the ToUnicode stream. The stream content is PostScript-like CMap syntax and is human-readable once decompressed.
<!-- Section 4 — SnapUtils CTA -->

4. How SnapUtils PDF Text Fixer Works

SnapUtils PDF Text Fixer is a browser-based tool that extracts and cleans text from PDFs without sending your file to a server. Everything runs locally in your browser using WebAssembly — your document never leaves your machine.

Here is the step-by-step process:

  1. Open the tool. Navigate to snaputils.tools/pdf-text-fixer in any modern browser. No sign-up, no extension, no download required.
  2. Drop or select your PDF. Drag the file onto the drop zone, or click to open a file picker. Files up to 50 MB are supported. The PDF is loaded into browser memory only — it is not uploaded anywhere.
  3. Choose extraction mode. Select Native text if the PDF was created digitally (exported from Word, LaTeX, InDesign, etc.). Select OCR mode if the PDF is a scanned document with no text layer, or if native extraction produces empty output.
  4. Configure encoding options. If you know the source encoding (e.g., the document came from a legacy Windows application), you can specify it. Otherwise, leave Auto-detect enabled — the tool will probe the byte sequences and apply the most likely encoding correction.
  5. Extract. Click Extract Text. The tool parses the PDF's font dictionaries, applies encoding corrections, decomposes ligatures, and normalizes Unicode. The output appears in a text area on the right.
  6. Download or copy. Use Copy to clipboard for quick use, or Download .txt to save the extracted text as a UTF-8 plain text file.

Fix Your Garbled PDF Now

SnapUtils PDF Text Fixer works entirely in your browser. No upload, no account, no software to install. Paste in your PDF and get clean UTF-8 text in seconds.

Open PDF Text Fixer
<!-- Section 5 -->

5. Common Garbled Character Patterns and What They Mean

Certain garbling patterns appear so frequently that they are almost diagnostic. If you can match the garbled output to a row in the table below, you immediately know the root cause and the correct fix.

Garbled output | Expected character | Root cause | Fix
ﬁ (one char) or � | fi (two chars) | fi ligature (U+FB01) not decomposed | Use extractor with ligature decomposition
ﬂ (one char) or � | fl (two chars) | fl ligature (U+FB02) not decomposed | Use extractor with ligature decomposition
â€¢ | • (bullet) | UTF-8 bullet (U+2022) decoded as Windows-1252 | Re-extract with UTF-8 encoding
â€" | — (em-dash) | U+2014 encoded in UTF-8, decoded as Windows-1252 | Re-extract with UTF-8 or apply encoding fix pass
’ | ’ (right single quote) | U+2019 encoded in UTF-8, decoded as Windows-1252 | Re-extract with UTF-8 or apply encoding fix pass
“ / â€ | “ / ” (smart quotes) | U+201C/U+201D encoded in UTF-8, decoded as Windows-1252 | Re-extract with UTF-8 encoding
Ã© | é | U+00E9 (é) in UTF-8 is bytes 0xC3 0xA9; misread as two Windows-1252 chars | Specify UTF-8 encoding in extractor
? or � runs | Various glyphs | Missing ToUnicode CMap; extractor cannot map glyph IDs | Use heuristic mapper (pdfminer.six) or re-export PDF
Spaces missing between words | Word spaces | PDF uses kerning-based spacing, not explicit space glyphs | Use layout-aware extractor (pdfminer.six LAParams)
Words run together across columns | Column-separated text | Extractor reads in stream order, not visual order | Use columnar layout analysis mode

The Ã© family of errors (where a single accented character explodes into two characters) is caused by a classic double-encoding problem. The UTF-8 encoding of é is the two-byte sequence 0xC3 0xA9. When that byte sequence is passed through a Windows-1252 decoder, 0xC3 maps to Ã and 0xA9 maps to ©, giving you Ã©. This is a deterministic transformation — you can mechanically reverse it, which is exactly what encoding-correction tools do.
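Because the transformation is deterministic, the repair is a round trip in the opposite direction: re-encode the garbled text as Windows-1252 and decode the resulting bytes as UTF-8. A minimal standard-library sketch (the function name is ours; real tools such as ftfy layer detection heuristics on top of this core idea):

```python
def undo_cp1252_mojibake(text: str) -> str:
    """Reverse the UTF-8-bytes-decoded-as-Windows-1252 mistake."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not the expected corruption pattern: leave the text untouched.
        return text

print(undo_cp1252_mojibake("Ã©"))          # é
print(undo_cp1252_mojibake("clean text"))  # clean text (unchanged)
```

Text that was never double-encoded passes through unchanged, which makes this safe to run as a blanket post-processing pass on extracted output.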

<!-- Section 6 -->

6. Fixing PDFs Created from Scanned Documents

A scanned PDF is a fundamentally different problem from a native PDF with encoding issues. A scanner produces an image — a bitmap of pixels. The PDF wraps that image in a container. There is no text layer at all, only picture data. When you try to copy text from a scanned PDF in most viewers, you get nothing, because there is nothing to copy.

How to Tell If Your PDF Is Scanned

Open the PDF and try to select a word with your cursor. If you can select individual characters precisely, the PDF has a native text layer. If clicking anywhere selects the entire page as a single image, or if nothing is selectable at all, the PDF is a scanned image. You can confirm by running pdftotext document.pdf - in a terminal — if the output is empty or only whitespace, there is no extractable text layer.

Adding a Text Layer with OCR

To get text from a scanned PDF you need to run OCR. The process is:

  1. Render each page to an image at sufficient resolution. 300 DPI is the minimum for reasonable accuracy; 600 DPI is better for documents with small fonts.
  2. Run the image through an OCR engine. Tesseract is the most widely used open-source engine. tesseract page.png output -l eng pdf produces a searchable PDF with the recognized text embedded. Google Document AI, Amazon Textract, and Adobe Acrobat's OCR are commercial alternatives with higher accuracy.
  3. Verify and correct OCR errors. No OCR engine is perfect. Check for common substitutions (rnm, l1, O0) and fix them before relying on the extracted text.

OCR accuracy tip: If your scanned document is skewed (the text lines aren't horizontal), deskew it before OCR. Most scanning apps and dedicated tools like unpaper can straighten pages. Deskewing can cut Tesseract's word error rate by 10–20% on badly skewed pages.
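The digit/letter confusions from step 3 can be partially repaired mechanically: in tokens that are mostly digits, lookalike letters are almost always misreads. Here is a heuristic standard-library sketch — the half-digits threshold and the substitution table are our assumptions, not a standard algorithm:

```python
import re

# Letters that OCR engines commonly produce in place of digits.
DIGIT_LOOKALIKES = {"l": "1", "I": "1", "O": "0", "o": "0", "S": "5", "B": "8"}

def fix_numeric_tokens(text: str) -> str:
    """Replace lookalike letters inside mostly-numeric tokens."""
    def repair(match: re.Match) -> str:
        token = match.group(0)
        digits = sum(ch.isdigit() for ch in token)
        if digits >= len(token) / 2:  # mostly digits: treat letters as misreads
            return "".join(DIGIT_LOOKALIKES.get(ch, ch) for ch in token)
        return token
    return re.sub(r"[0-9A-Za-z]+", repair, text)

print(fix_numeric_tokens("Total: 1O4.5O due 2O21-l0-03"))
```

Ordinary words are left alone because they contain few or no digits; only tokens like invoice numbers, dates, and amounts cross the threshold.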

Hybrid PDFs: Scanned Image + Existing (Bad) Text Layer

Some PDFs are created by scanning and then running OCR automatically — but the OCR was low-quality or ran on a low-resolution scan. These documents have a text layer, but it is full of OCR errors. Simply extracting the existing text layer gives you garbage. Your options are to either work with the bad OCR output and post-correct it, or to strip the existing text layer and re-run a better OCR engine from scratch. Tools like ocrmypdf (which wraps Ghostscript and Tesseract) have a --redo-ocr flag specifically for this scenario.

<!-- Section 7 -->

7. When to Use Other Tools

SnapUtils PDF Text Fixer handles the most common scenarios, but for heavy automation, batch processing, or unusual PDF structures, command-line and programmatic tools offer more control.

Adobe Acrobat

Acrobat Pro's Export PDF → Plain Text function uses Adobe's proprietary text extraction engine, which has the best out-of-the-box handling for edge cases in Adobe-generated PDFs. If the garbled PDF was created by an Adobe product, this is worth trying first. Acrobat also has a Fix Text Recognition command (under Tools → Scan & OCR) for PDFs with bad OCR layers. Limitation: Acrobat is expensive, Windows/Mac only, and not scriptable without the JavaScript API.

pdftotext (poppler-utils)

The command-line tool pdftotext, part of the poppler library, is fast, free, and available on every platform. Basic use:

pdftotext -enc UTF-8 -layout document.pdf output.txt

The -layout flag attempts to preserve the visual layout of the text, which helps with multi-column documents. The -enc flag forces UTF-8 output. For documents where the glyph-to-Unicode mapping is broken, pdftotext will still fail — it relies on the PDF's own encoding metadata.

mutool (MuPDF)

MuPDF's mutool command offers multiple extraction modes:

mutool draw -F text document.pdf          # text in reading order
mutool convert -o output.txt document.pdf # full conversion

MuPDF's text extraction is sometimes better than poppler for PDFs with complex layouts or unusual font encodings, because its rendering engine is more tolerant of malformed font dictionaries.

Python: pdfminer.six

For programmatic extraction with full control, pdfminer.six is the most capable Python library. It exposes the full PDF object model, allows you to intercept the encoding stage, and has a layout analysis engine (LAParams) that handles multi-column text, rotated pages, and unusual character spacing:

from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

# Tune layout analysis: line_margin controls how lines are grouped into
# text boxes; char_margin controls how characters are grouped into words.
params = LAParams(line_margin=0.5, char_margin=2.0)
text = extract_text("document.pdf", laparams=params)
print(text)

pdfminer.six also attempts heuristic ToUnicode reconstruction for fonts that are missing their CMap, which recovers text that other tools silently drop.

<!-- Section 8 -->

8. Prevention: How to Export PDFs Without Text Corruption

If you are on the creating end of PDFs, these practices eliminate the most common sources of text corruption before they occur.

Always Embed Fonts

The single most important setting in any PDF export dialog is font embedding. When fonts are not embedded, the PDF viewer must substitute a system font, which breaks the glyph-to-Unicode mapping for any character not in the substitute font's standard encoding. In Word: File → Options → Save → Embed fonts in the file. In InDesign: fonts are always embedded when exporting to PDF. In LaTeX: modern XeLaTeX and LuaLaTeX embed fonts by default; with pdfLaTeX, always include the fontenc package and use a font with a complete encoding vector.

Use PDF/A for Long-Term Archival

PDF/A is an ISO standard (ISO 19005) designed for long-term preservation. PDF/A-2b and PDF/A-3b compliance requires complete font embedding and mandates ToUnicode CMaps for all fonts. These requirements directly prevent the most common text extraction failures. Most professional applications (Word, InDesign, LibreOffice, LaTeX packages like pdfx) can export to PDF/A. If you are distributing documents that need to be machine-readable years from now — contracts, research papers, official records — PDF/A is the correct format.

Avoid Printing to PDF When Possible

Printing to a PDF driver (e.g., Microsoft Print to PDF, CutePDF Writer) goes through the printer abstraction layer, which discards semantic information about text encoding. The driver rasterizes or re-encodes the content for a printer model it is emulating. Always use the application's native Save As PDF or Export PDF function instead. The difference in text extraction quality can be dramatic — the same document exported via Save As PDF is fully extractable; printed to PDF, it may be entirely garbled.

Test Extraction Before Distributing

Before sending a PDF externally, run a quick extraction test: open the PDF, select all text (Ctrl+A), copy, and paste into a plain text editor. If what you see is a faithful copy of the document's content, the text layer is sound. If you see garbling, fix the source — it is far easier to re-export from the original application than to repair a distributed PDF later.

Use Unicode Normalization at the Source

If your document source contains text from multiple systems (for example, a report assembled from database exports, web copy, and scanned tables), normalize all text to NFC Unicode form before building the final PDF. The Python unicodedata.normalize('NFC', text) call collapses decomposed character sequences (e.g., e + combining acute accent → é) into their canonical precomposed form, which PDF fonts handle more reliably.
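A minimal demonstration of that normalization with the standard library:

```python
import unicodedata

decomposed = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 5 4
print(composed == "caf\u00e9")         # True
```

Both strings render identically on screen, but only the NFC form stores é as a single code point that maps cleanly to a single font glyph.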

<!-- Mid-article CTA -->

Already Have a Broken PDF?

Drop it into SnapUtils PDF Text Fixer — browser-only, no upload, no account. Get clean UTF-8 text in a few seconds.

Try PDF Text Fixer Free
<!-- Section 9 — FAQ -->

9. Frequently Asked Questions

Why does my PDF show garbled or scrambled characters when I copy the text?

Garbled text on copy usually means the PDF's font lacks a proper ToUnicode mapping table. The PDF renderer displays the correct glyphs visually, but when you copy text the viewer has no reliable way to map those glyph IDs back to Unicode code points. The result is a stream of wrong characters or question marks. Re-exporting the PDF with font embedding or running it through a text extraction tool with encoding detection can resolve this.

What is the difference between a garbled-text PDF and a scanned PDF?

A garbled-text PDF contains a real text layer — the characters are there, but their Unicode mapping is wrong, so extraction produces nonsense. A scanned PDF is essentially an image: there is no text layer at all, only pixels. Fixing a garbled-text PDF means correcting encoding; fixing a scanned PDF means running OCR first to create a text layer, then optionally correcting OCR errors.

Can I fix PDF text corruption without Adobe Acrobat?

Yes. SnapUtils PDF Text Fixer works entirely in your browser with no software to install. Alternatives include the command-line tools pdftotext (part of poppler-utils) and mutool (part of MuPDF), as well as the Python library pdfminer.six. For scanned documents you also need an OCR engine such as Tesseract.

Why do 'fi', 'fl', and 'ffi' ligatures appear as a single wrong character after extraction?

Many professional typefaces merge letter combinations into a single glyph called a ligature for visual refinement. When the PDF was created, the text "fi" was stored as a single ligature glyph. If the font's ToUnicode map does not include an entry for that ligature glyph, the extractor cannot decompose it back to "fi" and either drops it or substitutes a placeholder. Well-written PDF exporters always include ToUnicode entries for ligatures; poorly written ones do not.

What does PDF/A have to do with text extraction quality?

PDF/A is an ISO archival subset of PDF that requires full font embedding and mandates ToUnicode mapping tables for all fonts. A PDF/A-compliant document is therefore almost always extractable without garbling. If you control the PDF creation step, exporting to PDF/A-2b or PDF/A-3b is the single most reliable way to guarantee that text can be faithfully extracted years later.

My PDF looks fine on screen but the extracted text has wrong letters — why?

PDF viewers render glyphs using the font's glyph outlines, which can be visually correct even when the ToUnicode mapping is absent or wrong. Extraction depends entirely on the ToUnicode (or Encoding) table to translate glyph IDs to characters. The visual rendering and the text extraction pipeline are completely independent inside a PDF viewer, which is why a file can look perfect on screen but produce garbage when copied or extracted.

<!-- Related Articles -->