Excellent Utilities: OCRmyPDF - add OCR text layer to scanned PDFs - Page 2 of 3

Last Updated on May 22, 2022

In Operation

OCRmyPDF doesn’t offer a graphical front-end. Instead, you run the program from the command line, giving it an input file and an output file.
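As a rough illustration of a basic invocation, the sketch below wraps the command in Python. It assumes the ocrmypdf executable is on your PATH. The file names, the ocr_pdf helper and its dry_run switch are our own hypothetical additions, not part of OCRmyPDF.

```python
import subprocess


def ocr_pdf(src: str, dst: str, *, dry_run: bool = False) -> list[str]:
    """Run `ocrmypdf SRC DST` and return the command that was used.

    With dry_run=True the command is only built, not executed.
    """
    cmd = ["ocrmypdf", src, dst]
    if not dry_run:
        # check=True raises CalledProcessError if ocrmypdf exits with an error.
        subprocess.run(cmd, check=True)
    return cmd


# Preview the command without touching any files (hypothetical file names).
print(ocr_pdf("scan.pdf", "scan-ocr.pdf", dry_run=True))
```

Calling ocr_pdf("scan.pdf", "scan-ocr.pdf") without dry_run would actually run the tool on those files.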

We ran many tests on scanned documents, most of them single pages. The process was fast, and each file was processed successfully with no fuss or bother.

To get a better idea of the time taken to complete the process, we took a 395-page PDF. This PDF already has a text layer, so OCRmyPDF defaults not to apply OCR. But it’s possible to force the process with the --force-ocr flag. This is useful if the file has been OCRed with an earlier version of Tesseract or with other OCR software.
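The flag is spelled with two plain hyphens. As a small sketch, this is how it could be added to a command before the file arguments. The with_forced_ocr helper and the file names are hypothetical; only the --force-ocr flag itself comes from OCRmyPDF.

```python
import shlex


def with_forced_ocr(cmd: list[str]) -> list[str]:
    """Return a copy of an ocrmypdf command with --force-ocr placed after the program name."""
    return [cmd[0], "--force-ocr", *cmd[1:]]


base = ["ocrmypdf", "book.pdf", "book-ocr.pdf"]  # hypothetical file names
print(shlex.join(with_forced_ocr(base)))
```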

There are several stages. The first is a scanning stage. On the 395-page PDF, this proceeded at approximately 1.9 seconds per page on a quad-core Intel i5 processor. The scanning phase is single-threaded. The next stage applies the OCR by calling Tesseract, a popular OCR tool. For this part of the process, multiple copies of Tesseract are launched, making much better use of the multi-core processor. This part of the process took, on average, 2.7 seconds per page. There’s also lossless image optimization, courtesy of Ghostscript. Again, that’s a single-core affair, as is the final stage of the process, which like the first stage runs in the ocrmypdf process. The whole process on that 395-page PDF took a whopping 43 minutes.
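To put those figures in perspective, here is a back-of-the-envelope breakdown. It is only a sketch: it assumes the per-page figures are wall-clock averages over all 395 pages and that the stages run one after another, neither of which we measured precisely. On those assumptions, scanning accounts for about 12.5 minutes and OCR for about 17.8 minutes, leaving roughly 12.7 minutes for optimization and the final stage.

```python
PAGES = 395
SCAN_S_PER_PAGE = 1.9  # single-threaded scanning stage
OCR_S_PER_PAGE = 2.7   # Tesseract stage, averaged per page
TOTAL_MIN = 43         # reported time for the whole run

scan_s = PAGES * SCAN_S_PER_PAGE
ocr_s = PAGES * OCR_S_PER_PAGE
# Whatever is left is attributed to image optimization and the final stage.
remaining_s = TOTAL_MIN * 60 - (scan_s + ocr_s)

for label, seconds in [
    ("Scanning", scan_s),
    ("OCR (Tesseract)", ocr_s),
    ("Optimization + final stage", remaining_s),
]:
    print(f"{label:<28}{seconds / 60:5.1f} min")
```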

Of course, documents you’ll want to OCR will typically be much shorter than 395 pages.

OCRmyPDF doesn’t only apply an OCR layer to PDFs. It can also take an image file as an input. When given an image, the software will try to convert the image to a PDF before processing. This pre-stage uses the Python package img2pdf.

In the video below, we take a sample scanned JPEG file with a size of 2,887,137 bytes. The program recognizes that we have submitted a JPEG image, checks the validity of the image, and then converts it to PDF. After conversion, it calls Tesseract to perform the OCR. It then checks whether there’s any benefit from image optimization.

As the video indicates, adding the OCR layer increases the file size to 2,902,363 bytes. That’s an increase of 15,226 bytes or, put another way, a mere 0.53% of the original size.
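The size comparison is easy to check with a few lines of Python. Measured against the original JPEG, the increase works out at roughly 0.53%.

```python
original_bytes = 2_887_137  # scanned JPEG before processing
ocr_bytes = 2_902_363       # output after the OCR layer is added

added_bytes = ocr_bytes - original_bytes
percent_increase = 100 * added_bytes / original_bytes

print(f"{added_bytes:,} bytes added ({percent_increase:.2f}% of the original size)")
```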

Features of the program:

Generates a searchable PDF/A file from a regular PDF. PDF/A is an ISO-standardized subset of the full PDF specification that is designed for archiving (the ‘A’ stands for Archive). OCRmyPDF generates PDF/A-2b by default.
Places OCR text accurately below the image to ease copy / paste.
Retains the exact resolution of the original embedded images.
When possible, inserts OCR information as a “lossless” operation without disrupting any other content.
Optimizes PDF images.
If requested, the program deskews and/or cleans the image before performing OCR.
Validates input and output files.
Distributes work across all available CPU cores. This only applies to the OCR phase of the process unless you use a program like GNU Parallel.
Scales well to handle files with thousands of pages.
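Several of these features are switched on from the command line. The sketch below assembles such a command using the --deskew, --clean, --jobs and --output-type options. The build_command helper, its parameter names and the file names are hypothetical, so treat this as an illustration rather than a recommended configuration.

```python
import os
import shlex


def build_command(src, dst, *, deskew=False, clean=False, jobs=None, output_type=None):
    """Assemble an ocrmypdf command line from a few optional features."""
    cmd = ["ocrmypdf"]
    if output_type is not None:
        # For example "pdfa-2" or "pdf"; omitted here to keep the program's default.
        cmd += ["--output-type", output_type]
    if deskew:
        cmd.append("--deskew")  # straighten pages before OCR
    if clean:
        cmd.append("--clean")   # clean pages before OCR (relies on the external unpaper tool)
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]  # cap the number of parallel workers
    cmd += [src, dst]
    return cmd


# Hypothetical file names; use every core the machine reports.
print(shlex.join(build_command("scan.pdf", "scan-ocr.pdf",
                               deskew=True, clean=True, jobs=os.cpu_count())))
```

The printed line can be pasted into a terminal, or the list can be handed to subprocess.run.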


