For the common business document sizes A4 and Letter, the table below shows how much information describes the visual content of a complete page at commonly used DPI values (as well as the horizontal and vertical size in inches, millimeters and pixels):
It can be seen that a 200 DPI image holds less than half the information about the page compared to the optimal 300 DPI (approx. 8.5 megapixels): it is practically a 3.74 megapixel photo, worse than what any smartphone could take in 2020. Anything below that holds even less information about the page: 150 DPI has only one quarter and 100 DPI only about one ninth of the information.
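The figures above follow from simple arithmetic: the pixel count is the page size in inches multiplied by the DPI in each direction, so the amount of information scales with the square of the DPI. The following is a minimal sketch of that calculation; the function names are made up for illustration, and the rounding may differ by a pixel from what a particular scanner produces.

```python
# Page sizes in inches (width, height); A4 is defined as 210 x 297 mm.
PAGE_SIZES_IN = {
    "A4": (210 / 25.4, 297 / 25.4),
    "Letter": (8.5, 11.0),
}


def page_pixels(page, dpi):
    """Return (width_px, height_px, megapixels) for a page scanned at dpi."""
    width_in, height_in = PAGE_SIZES_IN[page]
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px / 1e6


def relative_information(dpi, reference_dpi=300):
    """Pixel count at dpi as a fraction of the pixel count at reference_dpi."""
    return (dpi / reference_dpi) ** 2


for dpi in (100, 150, 200, 300):
    w, h, mp = page_pixels("Letter", dpi)
    share = relative_information(dpi)
    print(f"{dpi} DPI: {w} x {h} px, {mp:.2f} MP, {share:.0%} of 300 DPI")
```

For a Letter page at 200 DPI this gives 1700 x 2200 pixels, which is the 3.74 megapixels mentioned above.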
Consequently, a modern OCR solution using traditional engines (ABBYY, Nuance, Tesseract, RecoStart, whichever company owns them at the moment) can be expected to perform adequately if fed with 200-300 DPI documents.
Upscaling a scanned image just inserts rows and columns of smartly averaged pixels between the original pixels. This does not add the missing information, that is, what was really there on the original document. The engines perform upscaling automatically anyway, but it is not expected to improve the results; it only makes it possible to process low-resolution documents (still usually with substandard results).
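A small illustration of why upscaling cannot restore detail, using a single row of pixels. This is only a sketch: the averaging and linear interpolation below are simplified stand-ins chosen for the example, not the resampling methods any particular scanner or OCR engine actually uses.

```python
def downsample(row, factor):
    """Average each group of `factor` pixels, simulating a lower-DPI scan."""
    return [sum(row[i:i + factor]) / factor for i in range(0, len(row), factor)]


def upsample_linear(row, factor):
    """Insert linearly interpolated pixels between the existing ones."""
    out = []
    for i, value in enumerate(row):
        nxt = row[i + 1] if i + 1 < len(row) else value
        for step in range(factor):
            out.append(value + (nxt - value) * step / factor)
    return out


# One row of a page: 0 = black ink, 255 = white paper, with thin strokes.
original = [255, 0, 255, 0, 255, 255, 0, 0, 255, 255, 0, 255]
low_res = downsample(original, 2)       # half the resolution
restored = upsample_linear(low_res, 2)  # back to the original pixel count

mean_error = sum(abs(a - b) for a, b in zip(original, restored)) / len(original)
print("original:", original)
print("restored:", [round(p) for p in restored])
print("mean error per pixel:", round(mean_error, 1))
```

The restored row has the same number of pixels as the original, but the one-pixel-wide strokes have been blurred into gray and do not come back.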
Effect of JPEG: lossy compression
JPEG uses lossy compression. This simply means that even though you input the specific pixel information you have scanned (the color of each location on the page at a specific resolution), JPEG does not store this information as given: it converts 8x8 pixel blocks into frequencies and stores those with a precision controlled by a quality setting. If the quality is high, the JPEG closely resembles the original image, so it is almost as good for OCR as the original, but the file is larger. Lower quality adds noise and makes OCR results worse.
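The effect of the quality setting can be sketched with one 8-pixel row. This is a deliberately simplified model: real JPEG applies a two-dimensional transform to 8x8 blocks and uses per-frequency quantization tables, whereas the single `step` parameter below is an assumption made for illustration (a larger step stands for a lower quality).

```python
import math

N = 8  # JPEG works on 8x8 blocks; this sketch uses one 8-pixel row


def dct(block):
    """Orthonormal DCT-II: pixel values -> frequency coefficients."""
    out = []
    for k in range(N):
        scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        out.append(scale * sum(
            block[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N)))
    return out


def idct(coeffs):
    """Inverse of dct(): frequency coefficients -> pixel values."""
    out = []
    for n in range(N):
        total = 0.0
        for k in range(N):
            scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
            total += scale * coeffs[k] * math.cos(math.pi * (n + 0.5) * k / N)
        out.append(total)
    return out


def lossy_roundtrip(block, step):
    """Round coefficients to multiples of `step`, then transform back."""
    quantized = [round(c / step) * step for c in dct(block)]
    return [round(p) for p in idct(quantized)]


row = [255, 255, 0, 0, 255, 0, 255, 255]  # sharp black/white edges, like text
for step in (1, 20, 80):
    restored = lossy_roundtrip(row, step)
    max_error = max(abs(a - b) for a, b in zip(row, restored))
    print(f"step {step:>2}: {restored} (max error {max_error})")
```

With a small step the row comes back almost unchanged; with a large step the black and white values drift toward gray, which is the kind of noise that hurts OCR around the sharp edges of characters.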
In comparison, formats such as TIFF G4 and PNG store the pixels as received, so there is no degradation due to compression. This is called lossless compression and usually results in larger files.
How to figure out the resolution of your input files
For JPEG, PNG, TIFF etc., open the file in e.g. IrfanView (https://www.irfanview.com/)
and find the resolution at the bottom left of the status bar, or press
“I” for “I”mage properties. For multi-page TIFFs, switch to the page whose
resolution you want to see, since the pages do not necessarily have the same
resolution. The resolution may differ slightly from the numbers in the table
above.
Finding the resolution of the images in PDFs
You can use the pdfimages command from poppler-utils (free, open-source software).
Please note that pdfimages may calculate the DPI incorrectly (for example if images are stored rotated), but you can always compare the reported resolutions to the table above.
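Once pdfimages is installed, its -list option prints one row per embedded image, including the pixel size and the calculated x-ppi and y-ppi. The sketch below runs that command and extracts those values. The helper names are hypothetical, and the column positions are an assumption based on the usual layout of the -list output, so check them against the header line printed by your poppler version.

```python
import subprocess


def parse_pdfimages_list(text):
    """Parse `pdfimages -list` output into dicts with page, size and DPI."""
    images = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a page number; the header and dashed rule do not.
        if len(fields) < 14 or not fields[0].isdigit():
            continue
        images.append({
            "page": int(fields[0]),
            "width": int(fields[3]),
            "height": int(fields[4]),
            "x_ppi": int(float(fields[12])),  # assumed column position
            "y_ppi": int(float(fields[13])),  # assumed column position
        })
    return images


def list_pdf_images(pdf_path):
    """Run pdfimages (must be on the PATH) and return the parsed image list."""
    result = subprocess.run(["pdfimages", "-list", pdf_path],
                            capture_output=True, text=True, check=True)
    return parse_pdfimages_list(result.stdout)


# Example usage (the file name is a placeholder):
# for image in list_pdf_images("scan.pdf"):
#     print(image)
```

Because the DPI values may be miscalculated, the width and height in pixels are returned as well, so they can be checked against the table directly.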
How to install pdfimages
Using Windows Subsystem for Linux on Windows 10
- Enable Windows Subsystem for Linux
- Install e.g. Ubuntu from the Microsoft Store
- Start bash
- Install the poppler-utils package (and its dependencies)
$ sudo apt install -y poppler-utils
Reading package lists... Done
Building dependency tree
Reading state information... Done
poppler-utils is already the newest version (0.62.0-2ubuntu2.10).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Using msys2
https://www.msys2.org/ is a free and open-source platform for multiple Windows versions (excluding Windows XP and below)
After installing msys2, install the package (the example below is for 64 bit installations):
c:\msys64\usr\bin\bash -lic "pacman -S mingw-w64-x86_64-poppler"