Last Updated on May 22, 2022
In Operation
OCRmyPDF doesn’t offer a graphical front-end. Instead, you run the program from the command line, passing it an input file and an output file.
We ran a lot of tests on scanned documents, most of them single pages. The process was fast, and each file was processed successfully with no fuss or bother.
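As a minimal sketch, the basic invocation is the program name followed by the input and output files. The Python snippet below builds that command line and runs it only if both the tool and the input file are present. The file names `scan.pdf` and `scan_ocr.pdf` and the helper `build_command` are placeholders of ours, not taken from the tests described here.

```python
import os
import shutil
import subprocess


def build_command(input_path, output_path):
    """Return the ocrmypdf argument list for one input/output pair."""
    return ["ocrmypdf", input_path, output_path]


# Placeholder file names, chosen for illustration only.
cmd = build_command("scan.pdf", "scan_ocr.pdf")
print(" ".join(cmd))  # ocrmypdf scan.pdf scan_ocr.pdf

# Run the command only when ocrmypdf is installed and the input exists.
if shutil.which("ocrmypdf") and os.path.exists("scan.pdf"):
    subprocess.run(cmd, check=True)
```

Typing the printed line into a terminal has the same effect as the `subprocess.run` call.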
To get a better idea of the time taken to complete the process, we took a 395-page PDF. This PDF already has a text layer, so by default OCRmyPDF doesn’t apply OCR to it. But it’s possible to force the process with the --force-ocr flag. This is useful if the file has been OCRed with an earlier version of Tesseract or other OCR software.
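For files that already contain text, the choice of flag decides what happens. The Python sketch below maps a mode name to the corresponding option. Only `--force-ocr` comes from the test described above; `--skip-text` and `--redo-ocr` are other OCRmyPDF options for the same situation. The mode names, the `text_layer_flag` helper and the file names are our own, for illustration.

```python
# Hypothetical mode names mapped to ocrmypdf command-line options.
TEXT_LAYER_FLAGS = {
    "force": "--force-ocr",  # rasterize every page and OCR it again
    "skip": "--skip-text",   # leave pages that already have text untouched
    "redo": "--redo-ocr",    # replace an existing OCR layer, keep other text
}


def text_layer_flag(mode):
    """Return the option for a mode, or raise ValueError if unknown."""
    try:
        return TEXT_LAYER_FLAGS[mode]
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None


# Placeholder file names.
command = ["ocrmypdf", text_layer_flag("force"), "book.pdf", "book_ocr.pdf"]
print(" ".join(command))  # ocrmypdf --force-ocr book.pdf book_ocr.pdf
```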
There are various stages. The first is a scanning stage. On the 395-page PDF, this proceeded at approximately 1.9 seconds a page using a quad-core Intel i5 processor. The scanning phase is single-threaded. The next stage applies the OCR. This calls a popular OCR tool, Tesseract. For this part of the process, multiple copies of Tesseract are called, making much better use of the multi-core processor. This part of the process took, on average, 2.7 seconds per page. There’s also lossless image optimization performed, courtesy of Ghostscript. Again, that’s a single-core affair, as is the final stage of the process, which, like the first stage, calls the ocrmypdf process. The whole process on that 395-page PDF took a whopping 43 minutes.
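The per-page figures give a rough breakdown of those 43 minutes. The calculation below is a back-of-the-envelope sketch: it assumes the stages run one after another and that the quoted averages hold for every page, neither of which the test measured directly.

```python
PAGES = 395
SCAN_SECONDS_PER_PAGE = 1.9   # scanning stage, single-threaded
OCR_SECONDS_PER_PAGE = 2.7    # OCR stage, averaged across cores
TOTAL_MINUTES = 43            # whole run as reported

scan_minutes = PAGES * SCAN_SECONDS_PER_PAGE / 60
ocr_minutes = PAGES * OCR_SECONDS_PER_PAGE / 60
# Whatever is left is attributed to optimization and the final stage.
remaining_minutes = TOTAL_MINUTES - scan_minutes - ocr_minutes

print(f"scanning:  {scan_minutes:.1f} min")       # scanning:  12.5 min
print(f"OCR:       {ocr_minutes:.1f} min")        # OCR:       17.8 min
print(f"remainder: {remaining_minutes:.1f} min")  # remainder: 12.7 min
```

On these assumptions, scanning and OCR account for roughly 30 of the 43 minutes, leaving the rest to the optimization and final stages.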
Of course, documents you’ll want to OCR will typically be much shorter than 395 pages.
OCRmyPDF doesn’t only apply an OCR layer to PDFs. It can also take an image file as an input. When given an image, the software will try to convert the image to a PDF before processing. This pre-stage uses the Python package img2pdf.
In the video below, we take a sample scanned JPEG file with a size of 2,887,137 bytes. The program recognizes that we have submitted a JPEG image. It checks the validity of the image, and then converts it to PDF. After conversion, it calls Tesseract to perform the OCR. Finally, it checks whether image optimization offers any benefit.
As the video indicates, adding the OCR layer increases the file size to 2,902,363 bytes. That’s an increase of 15,226 bytes. Put another way, that’s an increase of a mere 0.53%.
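The overhead follows directly from the two byte counts. A quick check in Python, using only the figures quoted above:

```python
original_bytes = 2_887_137   # JPEG input
ocr_bytes = 2_902_363        # PDF output with the OCR layer

increase = ocr_bytes - original_bytes
percent = increase / original_bytes * 100

print(increase)           # 15226
print(f"{percent:.2f}%")  # 0.53%
```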
Features of the program:
- Generates a searchable PDF/A file from a regular PDF. PDF/A is an ISO-standardized subset of the full PDF specification that is designed for archiving (the ‘A’ stands for Archive). OCRmyPDF generates PDF/A-2b by default.
- Places OCR text accurately below the image to ease copy / paste.
- Retains the exact resolution of the original embedded images.
- When possible, inserts OCR information as a “lossless” operation without disrupting any other content.
- Optimizes PDF images.
- If requested, the program deskews and/or cleans the image before performing OCR.
- Validates input and output files.
- Distributes work across all available CPU cores. This only applies to the OCR phase of the process unless you use a program like GNU Parallel.
- Scales well to handle files with thousands of pages.
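Several of these features are switched on from the command line. The sketch below shows one way to turn a set of choices into an argument list in Python. The `OcrOptions` class, the `to_args` helper and the file names are hypothetical names of ours; the options themselves (`--deskew`, `--clean`, `--jobs`, `--optimize`) are OCRmyPDF flags, and `--clean` additionally relies on the external unpaper tool being installed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OcrOptions:
    """Hypothetical container for a few ocrmypdf settings."""
    deskew: bool = False            # straighten skewed pages before OCR
    clean: bool = False             # clean pages before OCR (needs unpaper)
    jobs: Optional[int] = None      # worker count; None keeps the default
    optimize: Optional[int] = None  # optimization level; None keeps the default


def to_args(options: OcrOptions, input_path: str, output_path: str) -> List[str]:
    """Translate the options into an ocrmypdf argument list."""
    args = ["ocrmypdf"]
    if options.deskew:
        args.append("--deskew")
    if options.clean:
        args.append("--clean")
    if options.jobs is not None:
        args += ["--jobs", str(options.jobs)]
    if options.optimize is not None:
        args += ["--optimize", str(options.optimize)]
    return args + [input_path, output_path]


# Placeholder file names.
example = to_args(OcrOptions(deskew=True, jobs=4), "in.pdf", "out.pdf")
print(" ".join(example))  # ocrmypdf --deskew --jobs 4 in.pdf out.pdf
```

Leaving a field at `None` omits the flag entirely, so the program’s own defaults apply.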
Complete list of articles in this series:
Excellent Utilities | |
---|---|
AES Crypt | Encrypt files using the Advanced Encryption Standard |
Ananicy | Shell daemon created to manage processes’ IO and CPU priorities |
broot | Next gen tree explorer and customizable launcher |
Cerebro | Fast application launcher |
cheat.sh | Community driven unified cheat sheet |
CopyQ | Advanced clipboard manager |
croc | Securely transfer files and folders from the command-line |
Deskreen | Live streaming your desktop to a web browser |
duf | Disk usage utility with more polished presentation than the classic df |
eza | A turbo-charged alternative to the venerable ls command |
Extension Manager | Browse, install and manage GNOME Shell Extensions |
fd | Wonderful alternative to the venerable find |
fkill | Kill processes quick and easy |
fontpreview | Quickly search and preview fonts |
horcrux | File splitter with encryption and redundancy |
Kooha | Simple screen recorder |
KOReader | Document viewer for a wide variety of file formats |
Imagine | A simple yet effective image optimization tool |
LanguageTool | Style and grammar checker for 30+ languages |
Liquid Prompt | Adaptive prompt for Bash & Zsh |
lnav | Advanced log file viewer for the small-scale; great for troubleshooting |
lsd | Like exa, lsd is a turbo-charged alternative to ls |
Mark Text | Simple and elegant Markdown editor |
McFly | Navigate through your bash shell history |
mdless | Formatted and highlighted view of Markdown files |
navi | Interactive cheatsheet tool |
noti | Monitors a command or process and triggers a notification |
Nushell | Flexible cross-platform shell with a modern feel |
nvitop | GPU process management for NVIDIA graphics cards |
OCRmyPDF | Add OCR text layer to scanned PDFs |
Oh My Zsh | Framework to manage your Zsh configuration |
Paperwork | Designed to simplify the management of your paperwork |
pastel | Generate, analyze, convert and manipulate colors |
PDF Mix Tool | Perform common editing operations on PDF files |
peco | Simple interactive filtering tool that's remarkably useful |
ripgrep | Recursively search directories for a regex pattern |
Rnote | Sketch and take handwritten notes |
scrcpy | Display and control Android devices |
Sticky | Simulates the traditional “sticky note” style stationery on your desktop |
tldr | Simplified and community-driven man pages |
tmux | A terminal multiplexer that offers a massive boost to your workflow |
Tusk | An unofficial Evernote client with bags of potential |
Ulauncher | Sublime application launcher |
Watson | Track the time spent on projects |
Whoogle Search | Self-hosted and privacy-focused metasearch engine |
Zellij | Terminal workspace with batteries included |