2020. június 3., szerda

Image resolution for OCR and how to check it

The resolution of the documents consisting of scanned pages determines how much information they store about the visual content, so how well or easily those pages are readable. This is usually measured in DPI (Dot Per Inch - how many pixels are covering a 2.54cm part of the page) so that it is independent of the page size. Horizontal and vertical DPI is usually the same, and for normal documents 300 DPI is optimal (higher is only needed for documents with small print). The minimum acceptable DPI for production OCR processing is 200 DPI (as advised by vendors of several OCR engines).
For the usual business documents of A4 and Letter standards the below table shows the amount of information contained and describing the visual content of the complete page at different usual DPIs (as well as horizontal and vertical size in inches, millimeters, pixels and DPI):


It can be seen that a 200 DPI image has less than half the information about the page (practically a 3.74 megapixel photo - worse than what any smartphone can take in 2020) compared to the optimal 300 DPI (approx. 8.5 megapixels). Anything below has even more scarce information about the page: 150 DPI only has one quarter, 100 DPI only has one tenth of the information.
Consequently a modern OCR solution using traditional engines (ABBYY, Nuance, Tesseract, RecoStart whichever company has them acquired at the moment) can be expected to perform adequately if feed with 200-300 DPI documents.
Up scaling scanned images just inserts rows and columns of smartly averaged pixels between the original pixels but this does not add the missing information: what was really there originally on the document. This is automatically performed by the engines anyways, but is not expected to make the results better, just enables processing low resolution documents (still usually producing substandard results).

Effect of JPEG: lossy compression 

JPEG uses lossy compression, this simply means, that even though you input the specific pixel information you have scanned (what color is each location in the page with a specific resolution), it does not store this information as given: it encodes 8x8 blocks converted into frequencies with a quality setting. If that is high, the JPEG closely resembles the image, so it is almost as good for OCR as the original and is larger. Lower quality adds noise and makes OCR results worse.
On comparison TIFF G4, PNG etc. formats store pixel as received, there is no degradation due to compression, this is called loss-less compressions and usually results in larger files.

How to figure out the resolution of your input files

For JPEG, PNG, TIFF etc. open them in e.g. IrfanView https://www.irfanview.com/ and fin.d the resolution in the bottom left in the status bar or press “I” for “I”mage properties. In case of TIFFs switch to the page you one to see the resolution for, they do not necessarily have the same resolution. Resolution may slightly differ from the numbers in the above table.

Finding the resolution of the images in PDFs

You can use the pdfimages command from popplers utils (free, opensource software).

Using WSL on Windows 10

Using msys2 on other Windows versions


Please note, PDFimages may calculate DPI wrong (for example if images are stored rotated) but the resolutions you can always compare to the table above.

How to install pdfimages

Using Windows Subsystem for Linux on Windows 10
  1. Enable Windows Subsystem for Linux
  2. Install e.g. Ubuntu from the Microsoft Store
  3. Start bash
  4. Install the poppler-util package (and its dependencies)
$ sudo apt install -y poppler-utils
Reading package lists... Done
Building dependency tree
Reading state information... Done
poppler-utils is already the newest version (0.62.0-2ubuntu2.10).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

Using msys2

https://www.msys2.org/ is a free and open-source platform for multiple Windows versions (excluding Windows XP and below)
After installing msys2, install the package (the example below is for 64 bit installations):
c:\msys64\usr\bin\bash -lic “pacman -S mingw-w64-x86_64-poppler“

Rendszeres olvasók