OCR Explained: How to Make Scanned PDFs Searchable

PDF-Builder Team

Introduction

You scanned a stack of contracts, receipts, or old records into PDFs. Now you need to find a specific clause or dollar amount. You hit Ctrl+F, type your search term, and nothing happens. The PDF looks like a document, but to your computer it's just a collection of page-sized images. There's no text to search.

This is the most common PDF problem people don't realize they have. The fix is OCR, optical character recognition. OCR reads the characters in your scanned images and adds a hidden text layer to the PDF, turning it into a searchable document.

Here's how the technology works, which tools handle it well, and what to do when results come out wrong.


What OCR actually does

A scanned PDF contains photographs of pages. Each page is a single image, like a JPEG. When you "select" text in a scanned PDF, you're selecting the image, not individual words.

OCR software analyzes these images and identifies letters, numbers, and symbols. It then places invisible, selectable text on top of the original image, positioned to match each character's location on the page. The visual appearance of your PDF doesn't change. But now you can search it, copy text from it, and screen readers can read it aloud.

The process works in stages. First, the software preprocesses the image by straightening skewed pages, removing noise, and adjusting contrast. Then it segments the page into blocks of text, lines, and individual characters. Finally, it runs recognition algorithms against each character, comparing shapes to known letter forms across the target language. Modern OCR engines use neural networks trained on millions of document samples, which is why they handle varying fonts and print quality better than the pattern-matching systems from a decade ago.

The output is a "searchable PDF." (PDF/A is something different: an ISO archival standard that many OCR tools can also produce.) The original scan stays intact as the visual layer, and the recognized text is stored as an invisible layer aligned with it. If the OCR misreads a word, the page still looks correct, but the search layer contains the error, so a search for that word will miss it.


How to tell if your PDF needs OCR

Open the PDF and try to select a single word. Click and drag across a line of text.

If individual words highlight and you can copy them into another application, your PDF already has selectable text. It's either a native digital PDF or it's already been through OCR. You don't need to run OCR again.

If clicking selects the entire page as one block, or if nothing highlights at all, you have a scanned PDF that needs OCR.

File size is another clue. A 10-page native PDF might be 200 KB. A 10-page scanned PDF is often 5-30 MB because it's storing full-resolution images of every page. If your PDF seems unusually large for its page count, it's likely scanned. For more on the difference between native and scanned PDFs, see our guide on PDF conversions and formatting.
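The selection test can also be scripted when you have many files to check. The sketch below is one way to do it, assuming the pdftotext utility from Poppler is installed. The function name has_text_layer is our own, not part of any tool.

```shell
# has_text_layer FILE  (hypothetical helper name)
# Succeeds (exit 0) when pdftotext can extract at least one non-whitespace
# character from FILE, which suggests a text layer is already present.
# Assumes pdftotext (Poppler) is on the PATH.
has_text_layer() {
  chars=$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]' | wc -c | tr -d ' ')
  [ "${chars:-0}" -gt 0 ]
}

# Example:
#   if has_text_layer contract.pdf; then
#     echo "already searchable"
#   else
#     echo "needs OCR"
#   fi
```

Treat the result as a hint rather than a verdict. A scanned PDF with a single stamped header or page number will pass this check even though the body is still an image.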


Methods for running OCR

There are four ways to make a scanned PDF searchable. Each trades off between convenience, privacy, and control.

Online tools

The fastest route. Upload your PDF to a website, wait for processing, download the searchable version.

Adobe Acrobat Online, iLovePDF, Smallpdf, and PDF24 all offer free OCR. The process is the same everywhere: upload, select language, click a button, download.

These tools work well for quick, one-off jobs with non-sensitive documents. Most cap free usage (Smallpdf limits you to 2 tasks per day, iLovePDF to 1 per hour) and restrict file sizes.

The tradeoff is privacy. Your document gets uploaded to their servers for processing. Most services say files are deleted within 1-2 hours, but you're trusting that claim. For a restaurant menu or a public notice, that's fine. For medical records, legal contracts, or financial documents, it's worth thinking about. Our PDF tool privacy guide covers what happens to uploaded files in more detail.

Desktop software

Desktop applications run OCR on your machine. Nothing gets uploaded.

Adobe Acrobat Pro has the most polished OCR implementation. Open your scanned PDF, go to All Tools, then Scan & OCR, then Recognize Text. Acrobat handles language detection, skew correction, and text placement automatically. It costs $20/month.

OCRmyPDF is free, open-source, and runs on Windows, Mac, and Linux. It's a command-line tool built on top of the open-source Tesseract OCR engine, whose development Google sponsored for many years. Install it, point it at a PDF, and it produces a searchable version. It also optimizes file size and can produce PDF/A output for archival.

PDFgear is a free desktop editor that includes OCR. It handles the basics well and has a visual interface if command-line tools aren't your preference.

ABBYY FineReader is the option for high-stakes documents where accuracy matters most. It consistently scores highest in accuracy benchmarks and lets you review and correct recognized text before saving. It costs around $200 for a perpetual license.

For a broader comparison of these and other options, see our best free PDF tools guide.

Browser-based local tools

These look like online tools but process files in your browser using WebAssembly. Your PDF never leaves your device.

PDF-Builder works this way for several PDF operations. You get the convenience of a web interface with the privacy of local processing: no installation, no file upload, and no account required.

This approach sits between online tools and desktop software. You get privacy without setup.

Command line

For developers or anyone processing large batches, command-line tools are the most flexible option.

OCRmyPDF is the standard here:

ocrmypdf input.pdf output.pdf

That single command produces a searchable PDF. Add flags for more control:

ocrmypdf --language eng+fra --deskew --clean --optimize 2 input.pdf output.pdf

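This tells it to recognize English and French, straighten skewed pages, clean up the image before recognition (this step needs the unpaper utility installed), and apply stronger, lossy compression to the output. Level 3 is more aggressive still.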

For batch processing an entire folder:

mkdir -p searchable   # the output folder must exist before the loop runs
for f in scans/*.pdf; do
  ocrmypdf "$f" "searchable/$(basename "$f")"
done

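By default, OCRmyPDF stops with an error when it finds pages that already have text, rather than risk damaging them. Add --skip-text to OCR only the pages without text, so running it on a hybrid PDF (part scanned, part native) leaves the existing text layer alone.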
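For larger batches it helps to know which files failed instead of scrolling back through the output. The following is a minimal sketch of the loop above wrapped in a function. The name ocr_folder and the folder names in the example are illustrative. It passes --skip-text so pages that already contain text are left untouched.

```shell
# ocr_folder SRC DEST  (hypothetical helper name)
# Runs ocrmypdf on every PDF in SRC and writes each result to DEST under the
# same file name. Prints a line on stderr for each file that fails and
# returns non-zero if any file failed.
ocr_folder() {
  src=$1
  dest=$2
  failed=0
  mkdir -p "$dest" || return 1
  for f in "$src"/*.pdf; do
    [ -e "$f" ] || continue          # the glob matched nothing
    if ! ocrmypdf --skip-text "$f" "$dest/$(basename "$f")"; then
      echo "failed: $f" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Example: ocr_folder scans searchable
```

Continuing past a failed file is a design choice. For a one-off job you may prefer to stop at the first error instead.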

OCRmyPDF relies on the Tesseract engine for recognition, and Tesseract supports over 100 languages. Each language needs its own data pack, so install the packs for any non-English languages before passing them to --language.
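A quick way to see whether the language packs you need are present is to compare your --language string against the output of tesseract --list-langs. The helper below is a sketch, and the name missing_langs is our own. It assumes tesseract is on the PATH and prints one installed language code per line.

```shell
# missing_langs "eng+fra"  (hypothetical helper name)
# Prints each requested language code that `tesseract --list-langs` does not
# report as installed. Prints nothing when every language is available.
missing_langs() {
  installed=$(tesseract --list-langs 2>&1)
  for code in $(printf '%s' "$1" | tr '+' ' '); do
    # -x matches the whole line, so "eng" does not match a longer code
    printf '%s\n' "$installed" | grep -qx "$code" || printf '%s\n' "$code"
  done
}

# Example: missing_langs eng+fra
```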


Quick comparison

Method | Cost | Privacy | Setup | Best for
Online tools | Free (with limits) | Low, files uploaded to servers | None | Quick one-off OCR of non-sensitive docs
Desktop software | Free to $200 | High, local processing | Install required | Regular use, sensitive documents
Browser-based local tools | Free | High, files stay on your device | None | Privacy + convenience
Command line | Free | High, local processing | Some setup | Developers, batch processing

How to get accurate OCR results

OCR accuracy on clean, printed text from a modern printer typically lands between 97% and 99.9%. That sounds high, but accuracy is usually counted per character: 97% means 30 wrong characters in every 1,000. A 1,000-word document runs to roughly 5,000 characters, so that is about 150 errors. In a contract or financial document, one wrong digit matters.
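The arithmetic is easy to run for your own documents. This sketch assumes accuracy is measured per character and, for the word-based examples, that a word averages about five characters. Both are our assumptions, and expected_errors is a name chosen for illustration.

```shell
# expected_errors CHARS ACCURACY_PERCENT  (hypothetical helper name)
# Prints the expected number of misread characters, rounded to the nearest
# whole character.
expected_errors() {
  awk -v n="$1" -v acc="$2" 'BEGIN { printf "%d\n", n * (100 - acc) / 100 + 0.5 }'
}

expected_errors 1000 97     # 1,000 characters at 97%        -> 30
expected_errors 5000 97     # about 1,000 words at 97%       -> 150
expected_errors 5000 99.9   # about 1,000 words at 99.9%     -> 5
```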

Accuracy depends almost entirely on the quality of the input image. The OCR software can only work with what you give it. Here's what affects the output.

Scan resolution

300 DPI is the target. This gives OCR engines enough detail to distinguish similar-looking characters (the number 0 and the letter O, the number 1 and the lowercase L).

Scanning below 200 DPI blurs fine details and accuracy drops noticeably. Going above 400 DPI produces larger files without meaningful accuracy gains.

If you're scanning specifically for OCR, set your scanner to 300 DPI, grayscale. Color scans produce larger files and don't improve text recognition.
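If you already have a scanned PDF and want to know what resolution it was captured at, Poppler's pdfimages tool can list the embedded images. The sketch below assumes the layout that pdfimages -list prints in current Poppler releases: two header lines, with the horizontal resolution (x-ppi) in the 13th column. Check the output on your system before relying on it. The name low_dpi_pages is ours.

```shell
# low_dpi_pages FILE [MIN_DPI]  (hypothetical helper name)
# Prints the page number of every embedded image whose horizontal resolution
# is below MIN_DPI (default 300).
# Assumption: x-ppi is field 13 of `pdfimages -list`, after two header lines.
low_dpi_pages() {
  pdfimages -list "$1" 2>/dev/null |
    awk -v min="${2:-300}" 'NR > 2 && $13 + 0 < min { print $1 }'
}

# Example: low_dpi_pages scan.pdf 300
```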

Page alignment

Skewed pages reduce accuracy. Even a 2-3 degree tilt can cause the OCR engine to misread characters at line edges or merge lines together.

Use a flatbed scanner and make sure pages sit flat against the glass. If you're using a phone camera, hold it directly above the page rather than at an angle.

Most OCR tools include automatic deskew, but starting with a straight scan produces better results than correcting a crooked one after the fact.

Contrast and brightness

OCR needs clear separation between text and background. Faded ink, yellowed paper, or a gray background all reduce contrast and hurt accuracy.

Set scanner brightness to around 50%. If you're working with old or damaged documents, increase the contrast setting to darken the text relative to the background.

For documents that are already scanned and can't be rescanned, image preprocessing tools can help. OCRmyPDF's --clean flag runs cleanup algorithms before recognition. Adobe Acrobat applies similar preprocessing automatically.

Font and print quality

Clean, standard fonts (Times New Roman, Arial, Courier) produce the best results. Unusual typefaces, very small text (below 8pt), and decorative fonts reduce accuracy.

Documents from dot-matrix printers, old typewriters, or low-quality photocopies are harder to recognize. Multiple generations of photocopying (a copy of a copy) degrade quality fast.

Handwriting is still largely unreliable for OCR. Modern AI-based systems have improved handwriting recognition, but accuracy remains well below what you'd get with printed text. If you need to digitize handwritten documents, expect to do significant manual correction afterward.

Language settings

Always set the correct language before running OCR. The language setting tells the engine which character set and dictionary to use for validation. Running English OCR on a German document will miss umlauts and produce nonsense words.

For multilingual documents, most tools let you select multiple languages. In OCRmyPDF, use --language eng+deu for a document mixing English and German.


Common problems and fixes

"No text found" after OCR

The OCR completed but search still doesn't work. This usually means the scan quality was too low for the engine to recognize anything with confidence. Rescan at 300 DPI if possible. If you can't rescan, try a different OCR tool, as different engines handle low-quality input differently.

Garbled text when copying

You can select text after OCR, but pasting it produces garbage characters. This means the OCR engine misidentified characters. Common with low-resolution scans, unusual fonts, or wrong language settings. Check your language setting first, then try a higher-accuracy tool like ABBYY FineReader.

File size increased dramatically

OCR can increase file size because it adds a text layer on top of the existing images. If the tool also re-renders or duplicates the image layer, file size can balloon.

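OCRmyPDF already applies light, lossless optimization by default (--optimize 1). Use --optimize 2 or --optimize 3 to compress the output harder, at some cost in image quality. Adobe Acrobat's "Reduce File Size" option works after OCR as well. You can also use a PDF compression tool afterward.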

OCR runs but skips pages

Some tools skip pages they think already contain text. If your PDF is a hybrid with some native text pages and some scanned pages, the tool might skip the scanned pages if it detects any text layer at all. In OCRmyPDF, use --force-ocr to process every page regardless. Be aware that this rasterizes every page, so native text pages become images and their existing text layer is replaced with OCR output.
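Before reaching for --force-ocr, it can help to see exactly which pages came out without text. This sketch checks one page at a time, assuming Poppler's pdfinfo and pdftotext are installed. The name pages_without_text is our own.

```shell
# pages_without_text FILE  (hypothetical helper name)
# Prints the number of each page from which pdftotext extracts no
# non-whitespace characters, i.e. pages that still look like bare scans.
pages_without_text() {
  total=$(pdfinfo "$1" 2>/dev/null | awk '/^Pages:/ { print $2 }')
  page=1
  while [ "$page" -le "${total:-0}" ]; do
    # -f and -l restrict pdftotext to a single page
    chars=$(pdftotext -f "$page" -l "$page" "$1" - 2>/dev/null |
      tr -d '[:space:]' | wc -c | tr -d ' ')
    [ "$chars" -gt 0 ] || echo "$page"
    page=$((page + 1))
  done
}

# Example: pages_without_text output.pdf
```

If the list is short, re-running OCR on just those pages may be enough, and it avoids rasterizing the native pages.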

Slow processing on large documents

A 500-page scanned PDF can take a while. OCR is computationally expensive because it runs image analysis on every page. Speed varies widely with page complexity, resolution, and hardware. When processing locally on an older machine, expect a few seconds per page rather than several pages per second.

For large batches, OCRmyPDF supports parallel processing. Use --jobs 4 to process four pages simultaneously on a multi-core machine. Cloud-based tools handle large files faster because they throw more hardware at the problem, but you're uploading the document.


What else OCR enables

Making a scanned PDF searchable is the obvious benefit, but OCR enables other things too.

Accessibility. Screen readers can't interpret images of text. A scanned PDF without OCR is inaccessible to visually impaired users. Adding a text layer is the first step toward making scanned documents accessible. Our PDF accessibility checklist covers the full process, and OCR is the prerequisite for everything else on that list.

Archival compliance. Many industries require documents to be stored in searchable format. Legal discovery, healthcare records management, and government archives often mandate PDF/A with embedded text. OCR converts scanned documents to meet these requirements.

Data extraction. Once text is recognized, you can extract specific information programmatically. Pull invoice amounts, contract dates, or patient names from thousands of scanned documents using text extraction scripts. Without OCR, this data is locked inside images.

Reduced storage. Counterintuitively, OCR can reduce long-term storage needs. Once you have searchable text, you can compress the image layer more aggressively (since the text layer preserves the content) or switch to a lower-resolution image while keeping the document fully searchable.
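As a small illustration of the data-extraction point above, the sketch below pulls dollar amounts out of a PDF's text layer. It assumes pdftotext from Poppler is installed and that amounts are written like $1,250.00. The function name and the pattern are ours, and real invoices will need a pattern tuned to their layout.

```shell
# extract_amounts FILE  (hypothetical helper name)
# Prints every dollar amount found in the PDF's text layer, one per line,
# in the order pdftotext emits them.
extract_amounts() {
  pdftotext -layout "$1" - 2>/dev/null |
    grep -Eo '\$[0-9][0-9,]*(\.[0-9]{2})?'
}

# Example: extract_amounts invoice.pdf
```

Because this reads the OCR text layer, any misread digit carries straight through into the extracted values, so spot-check the results against the page images.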


Summary

Scanned PDFs are images, not text. OCR adds a hidden text layer so you can search, select, copy, and process the content.

For occasional use on non-sensitive documents, online tools are fastest. For sensitive documents or regular use, desktop software or browser-based local tools keep your files private. For batch processing, OCRmyPDF on the command line handles hundreds of files with a single script.

Accuracy depends on scan quality more than software choice. Scan at 300 DPI, keep pages straight, ensure good contrast, and set the correct language. These basics matter more than which OCR engine you pick.

If you're working with documents where every character counts, use ABBYY FineReader or Adobe Acrobat Pro and review the output manually. For everything else, any modern OCR tool running on a clean 300 DPI scan will produce usable results.