
Machine Learning in Linux: Surya – multilingual document OCR toolkit adds text recognition

Our Machine Learning in Linux series focuses on apps that make it easy to experiment with machine learning.

Surya is billed as a multilingual document OCR toolkit. It’s a CLI-based utility that can be used with a CPU or GPU.

The latest release adds text recognition. There’s also a Streamlit app. Streamlit is software that turns data scripts into shareable web apps.

This is free and open source software.

Installation

To run Surya you’ll need Python 3.9 or higher and PyTorch. PyTorch provides libraries for basic tensor manipulation on CPUs or GPUs, a built-in neural network library, model training utilities, and a multiprocessing library that can work with shared memory.
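If you’re not sure whether a machine meets these requirements, a short script can check. The following is a minimal sketch and not part of Surya: the helper names python_is_new_enough and pytorch_is_installed are our own, and the script only reports what it finds.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 9)  # minimum Python version stated above


def python_is_new_enough(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def pytorch_is_installed():
    """Return True if the 'torch' package can be located (without importing it)."""
    return importlib.util.find_spec("torch") is not None


if __name__ == "__main__":
    status = "OK" if python_is_new_enough() else "too old for Surya"
    print("Python", sys.version.split()[0], "-", status)
    print("PyTorch installed:", pytorch_is_installed())
```

PyTorch will show as missing until the installation steps below have been completed inside the environment.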

We tested Surya with PCs running Ubuntu and Manjaro, including a machine with an NVIDIA GeForce RTX 3060 Ti dedicated graphics card and an Intel NUC 13 Pro, which only has onboard Intel Iris Xe graphics. We’ll go through installing the GPU version of PyTorch. If you want to see how to install the CPU version so that you can run Surya on machines without a dedicated graphics card, see our review of an earlier release of Surya.

There are a variety of ways of installing Surya without polluting our machines. We’ll install Surya in an isolated Python environment. On Ubuntu, first install the venv module and create a working directory:

$ sudo apt install python3-venv -y

$ mkdir pytorch_env
$ cd pytorch_env

Create and activate the environment:

$ python3 -m venv pytorch_env
$ source pytorch_env/bin/activate
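To confirm the environment is active before installing anything into it, we can ask Python itself. This is a small sketch with a helper name of our own choosing (in_virtual_environment). It relies on the standard behaviour that sys.prefix differs from sys.base_prefix inside a venv.

```python
import sys


def in_virtual_environment(prefix=sys.prefix, base_prefix=sys.base_prefix):
    """Return True when the interpreter runs from a venv.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the system-wide Python installation.
    """
    return prefix != base_prefix


if __name__ == "__main__":
    print("Virtual environment active:", in_virtual_environment())
    print("Packages will be installed under:", sys.prefix)
```

If this reports that no environment is active, run the source command again in the current shell before using pip.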

To install PyTorch with GPU support, issue the command:

$ pip install torch torchvision torchaudio
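It’s worth confirming that PyTorch can actually see the graphics card. The sketch below is ours rather than anything shipped with Surya: describe_torch_device is a hypothetical helper name, and it uses PyTorch’s torch.cuda.is_available() and torch.cuda.get_device_name() calls.

```python
def describe_torch_device():
    """Return a short description of the device PyTorch would use."""
    try:
        import torch  # imported here so the script still runs without PyTorch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if torch.cuda.is_available():
        # Index 0 is the first CUDA device visible to PyTorch
        return "CUDA GPU: " + torch.cuda.get_device_name(0)
    return "CPU only (no CUDA device visible to PyTorch)"


print(describe_torch_device())
```

On a machine with only onboard Intel graphics we’d expect the CPU-only message, since CUDA requires an NVIDIA card.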

We can now install Surya with the command:

$ pip install surya-ocr

Installing the software with pip in an isolated environment

The developer also provides a Streamlit application which lets us try Surya on images or PDF files with a web-based interface.

$ pip install streamlit

Here’s the final page of the installation of the web-based app.

Installing Surya GUI with pip

On the first run, the model weights are automatically downloaded.

