Optical character recognition (OCR) is the translation of scanned images of typewritten
or printed text into editable text. It is widely used to convert books and documents
into electronic files or ebooks, to archive publications in digital libraries, or to publish
the text on a website.

Images are usually acquired with document scanners or digital cameras. Destructive
scanning involves cutting or debinding the books, whereas non-destructive scanning
employs digital camera setups to quickly and efficiently capture images of book
pages without damaging the original document.

For OCR to be successful, some post-processing of the captured images may be required to
correct artifacts introduced during image capture or defects in the original documents
themselves. Several free or open-source programs are designed specifically for these
post-processing steps. They automate image cropping, rotation, denoising, despeckling,
dewarping, and similar operations.
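
To give a feel for what one of these steps does, here is a minimal despeckling sketch in
Python. It is an illustration only, not the algorithm used by any particular tool: the
function name despeckle, the min_neighbors parameter, and the list-of-lists image
representation (1 = black, 0 = white, as in the PBM convention) are all assumptions made
for this example. A black pixel is treated as a speckle and cleared when too few of its
eight neighbours are black.

```python
def despeckle(image, min_neighbors=1):
    """Return a copy of a bitonal image with isolated black pixels removed.

    image: list of rows, each a list of 0 (white) or 1 (black).
    A black pixel is kept only if at least `min_neighbors` of its eight
    surrounding pixels are also black. (Hypothetical helper for illustration.)
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]  # work on a copy, read from the original

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            # Count black pixels among the (up to) eight neighbours.
            neighbors = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        neighbors += image[ny][nx]
            if neighbors < min_neighbors:
                result[y][x] = 0  # isolated dot: treat as noise
    return result


page = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],  # a lone speck
    [0, 0, 0, 1, 1],  # part of a connected stroke
    [0, 0, 0, 1, 0],
]
cleaned = despeckle(page)
```

Real tools typically work on larger neighbourhoods or connected components so that small
punctuation marks are not mistaken for noise; the threshold here is deliberately crude.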

unpaper: tries to enhance the quality of scanned images by removing dark edges that
appear, through scanning or copying, in areas outside the actual page content (e.g.,
dark areas between the left-hand side and the right-hand side of a double-sided
book-page scan).
The program also tries to deskew a misaligned image. Input and output files can
be in .pbm, .pgm, or .ppm format (collectively, .pnm). Because of this limitation,
images in more popular formats, such as TIFF or PNG, generally have to be converted
to .pnm before the tool can be used.
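
The convert-then-clean workflow described above can be sketched in Python as follows.
This is a minimal sketch under stated assumptions: it assumes ImageMagick's convert
command is installed for the format conversion (newer ImageMagick releases name it
magick), and it calls unpaper in its basic "unpaper input output" form with no
options. The helper names build_commands and clean_scan, the "-clean" suffix, and the
example paths are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(source, workdir="."):
    """Build the two command lines: convert to PNM, then run unpaper.

    Returns (convert_cmd, unpaper_cmd, output_path). Nothing is executed here,
    so the commands can be inspected or logged first.
    """
    source = Path(source)
    pnm_in = Path(workdir) / (source.stem + ".pnm")
    pnm_out = Path(workdir) / (source.stem + "-clean.pnm")
    # Assumption: ImageMagick picks the PNM writer from the .pnm extension.
    convert_cmd = ["convert", str(source), str(pnm_in)]
    # Basic unpaper invocation: input file followed by output file.
    unpaper_cmd = ["unpaper", str(pnm_in), str(pnm_out)]
    return convert_cmd, unpaper_cmd, pnm_out


def clean_scan(source, workdir="."):
    """Convert one scan to PNM and clean it with unpaper; return the result path."""
    convert_cmd, unpaper_cmd, pnm_out = build_commands(source, workdir)
    subprocess.run(convert_cmd, check=True)  # raises CalledProcessError on failure
    subprocess.run(unpaper_cmd, check=True)
    return pnm_out


# Inspect the commands for a hypothetical input file without running them.
convert_cmd, unpaper_cmd, output_path = build_commands("scans/page01.tif", "work")
```

Calling clean_scan("scans/page01.tif", "work") would run both steps, provided the work
directory exists and both programs are on the PATH. The cleaned .pnm file can then be
converted back to TIFF or PNG the same way if the OCR engine requires it.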