OCRmyPDF – adding text layer to image/PDF

OCRmyPDF – installation and use

OCRmyPDF is a Python script that uses the Tesseract OCR library.

It allows for, among other things:

- adding an OCR layer to an existing PDF or image file (after conversion to PDF)
- removing an old OCR layer from a PDF file and replacing it with a new one
- generating an additional text file containing the text detected by OCR
- correcting page orientation, page quality, etc.
- performing OCR only for selected pages
- selection of the language/languages of the document for which OCR is to be performed
- adding metadata to the PDF file

The following 3 components are required for it to work:

- Python 3.x 64-bit
- Tesseract OCR 64-bit
- Ghostscript 64-bit

Installing OCRmyPDF for Windows

(source: https://ocrmypdf.readthedocs.io/en/latest/installation.html)

1. Install Python and the Tesseract OCR library from the command line using the winget command

winget install -e --id Python.Python.3.11 
winget install -e --id UB-Mannheim.TesseractOCR

2. Download Ghostscript (AGPL edition) from the project page and install the application

3. Install OCRmyPDF from the command line using the pip command

py -m pip install ocrmypdf
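
After installation, it can be useful to confirm that the required programs are reachable from the command line. The sketch below is one possible check, not part of OCRmyPDF. The executable names ("py", "tesseract", "gswin64c") are assumptions for a default 64-bit Windows setup, and an installer may not add its folder to PATH automatically.

```python
import shutil

# Assumed executable names for a default 64-bit Windows installation:
# "py" (Python launcher), "tesseract" (Tesseract OCR), "gswin64c" (Ghostscript console).
REQUIRED_TOOLS = ("py", "tesseract", "gswin64c")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of the tools that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


def report(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return a one-line summary of the check."""
    missing = missing_tools(tools, which)
    if missing:
        return "Missing from PATH: " + ", ".join(missing)
    return "All required tools were found on PATH."
```

Calling print(report()) prints the summary. A tool reported as missing may still be installed; in that case its folder has to be added to PATH.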

Launching manually in the command line

1. Basic command

To launch OCRmyPDF from the command line, type in

py -m ocrmypdf [flags] "INPUT_FILE" "OUTPUT_FILE"

example:

py -m ocrmypdf "C:\Users\AStraus\Desktop\IMG_20250416_092300_943.pdf" "C:\Users\AStraus\Desktop\IMG_20250416_092300_943_OCR.pdf"

If the output file does not already exist, it will be created; otherwise, it will be overwritten.
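
The same command can also be started from a Python script, which makes the exit code available (0 means success). This is a minimal sketch, assuming the "py" launcher is on PATH. The helper names build_command and run_ocr are hypothetical and are not part of OCRmyPDF.

```python
import subprocess


def build_command(input_file, output_file, flags=()):
    """Assemble the argument list for: py -m ocrmypdf [flags] INPUT OUTPUT."""
    return ["py", "-m", "ocrmypdf", *flags, str(input_file), str(output_file)]


def run_ocr(input_file, output_file, flags=(), runner=subprocess.run):
    """Run OCRmyPDF and return its exit code (0 means success)."""
    completed = runner(build_command(input_file, output_file, flags))
    return completed.returncode


# Example call (left as a comment so nothing runs on import):
# code = run_ocr(r"C:\Users\AStraus\Desktop\IMG_20250416_092300_943.pdf",
#                r"C:\Users\AStraus\Desktop\IMG_20250416_092300_943_OCR.pdf")
```

Passing the arguments as a list means paths containing spaces do not need to be quoted by hand.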

2. List of all flags and parameters

Setting the -h or --help flag displays a list of all available options and parameters, with a short description of each:

py -m ocrmypdf -h

3. Applying OCR to an image

If the input file is an image (supported formats: jpg, png, tiff, gif, webp, bmp), it is necessary to set an additional --image-dpi [image_resolution] flag. For best results, check the image's Properties to find its print resolution (DPI).

py -m ocrmypdf --image-dpi [DPI] "INPUT_FILE" "OUTPUT_FILE"

example:

py -m ocrmypdf --image-dpi 200 "C:\Users\AStraus\Desktop\testimage.png" "C:\Users\AStraus\Desktop\image_ocr.pdf"

4. Generating a separate text file

To generate an additional text file containing the recognized OCR text, set the --sidecar flag followed by the path to the output text file

py -m ocrmypdf --sidecar "TEXT_FILE" "INPUT_FILE" "OUTPUT_FILE"

example:

py -m ocrmypdf --sidecar "C:\Users\AStraus\Desktop\ocr_text.txt" "C:\Users\AStraus\Desktop\IMG_20250416_092300_943.pdf" "C:\Users\AStraus\Desktop\IMG_20250416_092300_943_OCR.pdf"
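
When many files are processed, the output and sidecar paths can be derived from the input path instead of being typed by hand. The sketch below assumes a naming convention modeled on the example above (an "_OCR" suffix). The convention and the function names are illustrative only.

```python
from pathlib import PureWindowsPath


def derive_outputs(input_file):
    """Return (output_pdf, sidecar_txt) paths placed next to the input file."""
    src = PureWindowsPath(input_file)
    output_pdf = src.with_name(src.stem + "_OCR.pdf")   # assumed naming convention
    sidecar_txt = src.with_name(src.stem + "_OCR.txt")  # assumed naming convention
    return str(output_pdf), str(sidecar_txt)


def sidecar_arguments(input_file):
    """Return the arguments that follow 'py -m ocrmypdf' for a sidecar run."""
    output_pdf, sidecar_txt = derive_outputs(input_file)
    return ["--sidecar", sidecar_txt, str(input_file), output_pdf]
```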

5. Files with preexisting text layer

If the PDF file already contains a text layer, you can try to overwrite, add to it or redo the OCR process by setting one of the following flags:

- -f, --force-ocr (‘flattens’ the file, disregarding preexisting OCR and non-OCR text layers, processes the file as an image)
- -s, --skip-text (does not interfere with the preexisting text layer, attempts to recognize additional text in images)
- --redo-ocr (removes a previously generated OCR layer and redoes it, but any preexisting, non-OCR text in the file is not modified)
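
The three flags above are alternatives, so at most one of them should be passed in a single run. A script that builds the command can enforce this, as in the sketch below. The mode names "force", "skip" and "redo" are made up for this example.

```python
# Hypothetical mode names mapped to the real OCRmyPDF flags listed above.
TEXT_LAYER_FLAGS = {
    "force": "--force-ocr",
    "skip": "--skip-text",
    "redo": "--redo-ocr",
}


def text_layer_flag(mode=None):
    """Return a list holding at most one text-layer flag for the given mode."""
    if mode is None:
        return []  # default behaviour: no extra flag
    if mode not in TEXT_LAYER_FLAGS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(TEXT_LAYER_FLAGS)}")
    return [TEXT_LAYER_FLAGS[mode]]
```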

6. Specifying language

By default, the Tesseract OCR 4.0 library uses an English model/dictionary to detect text on a page. If the input file is in another language, which causes errors, typos or unrecognized special characters, OCR precision can be increased by downloading additional language models (files with the .traineddata extension) and then placing them in the folder Program Files -> Tesseract OCR -> tessdata

Additional languages are specified by setting the -l [language_code] flag. A list of all supported languages can be found here.

It is possible to specify more than one language:

py -m ocrmypdf -l pol -l deu "INPUT_FILE" "OUTPUT_FILE"
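
Before starting OCR, a script can check that the requested language models are actually present in the tessdata folder. The sketch below is an illustration with hypothetical function names. The folder path in the usage comment is an assumption and depends on where Tesseract was installed.

```python
from pathlib import Path


def missing_languages(tessdata_dir, codes):
    """Return the language codes that have no <code>.traineddata file in the folder."""
    folder = Path(tessdata_dir)
    return [code for code in codes if not (folder / f"{code}.traineddata").is_file()]


def language_flags(codes):
    """Return one '-l CODE' pair per language, e.g. ['-l', 'pol', '-l', 'deu']."""
    flags = []
    for code in codes:
        flags += ["-l", code]
    return flags


# Example call (the path is an assumption, adjust it to the local installation):
# missing_languages(r"C:\Program Files\Tesseract-OCR\tessdata", ["pol", "deu"])
```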

7. Choosing pages to OCR

Performing OCR on only selected pages of a document is possible by setting the --pages [page number(s)] flag. Write the page list without spaces, otherwise the command line splits it into separate arguments.

py -m ocrmypdf --pages 1,3-5 "INPUT_FILE" "OUTPUT_FILE"
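
If the page numbers come from another step of an automation, they can be converted into the compact form expected by --pages. The helper below is a sketch with a hypothetical name. It sorts the numbers, removes duplicates, and joins consecutive pages into ranges.

```python
def format_pages(pages):
    """Turn page numbers such as [1, 3, 4, 5] into a string such as '1,3-5'."""
    ranges = []
    for number in sorted(set(pages)):
        if ranges and number == ranges[-1][1] + 1:
            ranges[-1][1] = number            # extend the current range
        else:
            ranges.append([number, number])   # start a new range
    return ",".join(
        str(first) if first == last else f"{first}-{last}"
        for first, last in ranges
    )
```

The result contains no spaces, so it can be passed directly as the value of the --pages flag.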

8. Metadata

Adding metadata (Properties->PDF Information) to output file is possible using the following flags

- --title TITLE
- --author AUTHOR
- --subject SUBJECT
- --keywords KEYWORDS

Additionally, if set, the contents of the 'Title' field will appear at the very beginning of the resulting PDF file.

9. Improving file quality

You can improve file quality before performing OCR on a file by setting the following flags:

- -r, --rotate-pages (detects misoriented pages and rotates them according to the detected page orientation)
- -d, --deskew (‘straightens out’ slightly rotated/skewed pages)

Using OCRmyPDF with Wizlink

At the moment, OCRmyPDF can only be launched with Wizlink in the following ways:

1. by directly interacting with the command line (launching the cmd.exe application with the Run Application activity, and then sending a command to it using the Send Keys activity (command + {ENTER}))

NOTE: Due to the nature of the cmd.exe application, it is not possible to point to its window using the Application Name specified in the Run Application activity. You can work around this problem by additionally using the Find Application activity and giving the cmd.exe window a new name.

When you send several commands in a row to cmd.exe, they are executed in sequence: cmd.exe processes the first command and, once the prompt reappears, continues with the queued ones.

2. by directly running Python with Run Application activity parameters.

The latest version of Python installed on the computer can be called with the Run Application activity by specifying only the keyword "py" or "python" as File Path. If you want to use a specific version of Python, you can give the full path instead.

NOTE: In this method, the command entered in the Arguments parameter must not contain the initial word "py", because Python has already been called by Run Application itself.

-m ocrmypdf [flags] "INPUT_FILE" "OUTPUT_FILE"

This will launch a cmd.exe window, which will close immediately after the script finishes.

NOTE: With neither of these methods does Wizlink get feedback on the completion of the script or on its result (success/fail).

If you plan to use the result file in the same scenario, depending on the typical size of the file, you need to determine by trial and error a long enough Wait, or construct a loop waiting for the result file to appear.
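
The waiting loop described above can be expressed as the following polling logic. It is written in Python only to illustrate the idea. In Wizlink the same steps would be built from scenario activities, and the function name and default values here are assumptions.

```python
import time
from pathlib import Path


def wait_for_file(path, timeout=300.0, interval=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll until 'path' exists and is non-empty; return False once 'timeout' seconds pass."""
    target = Path(path)
    deadline = clock() + timeout
    while True:
        if target.is_file() and target.stat().st_size > 0:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)  # pause before checking again
```

Because an existing output file is overwritten, an old result with the same name should be deleted before OCR starts; otherwise the loop ends immediately on the stale file. A file that has just appeared may also still be in the process of being written, so a short additional wait can be sensible.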

Troubleshooting

- If the script fails with a Ghostscript error for files created by a particular source, you can try downgrading to an earlier version of Ghostscript.