High-Level Description

This module implements Decapod's export feature. It takes dewarped, camera-captured document images as input and can convert them into the following types of PDF:

PDF containing only the original images

PDF containing the original images with underlying text

PDF/A containing the OCRed version of the documents including font information

The PDF generation uses OCRopus as its OCR system.

Current State

Currently operational features:

PDF containing only the original images (this is the only supported output format for release 0.3)

PDF containing the original images with underlying text

Tokenized PDF: a proof of concept for the final PDF/A output. It uses token-based compression similar to JBIG2.
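
To illustrate the idea behind token-based compression, here is a minimal sketch, not Decapod's actual implementation. It assumes the page has already been segmented into glyph bitmaps with known positions; the names tokenize_page and render_page and the string-based bitmap representation are hypothetical. Each distinct shape is stored once in a token dictionary, and the page is then described as a list of (token, x, y) placements, which is the same basic principle as the symbol dictionaries in JBIG2. For simplicity the sketch only merges exactly identical bitmaps, whereas a practical encoder would typically also group shapes that are merely similar.

```python
from typing import Dict, List, Tuple

# Hypothetical representations, chosen only for this illustration:
# a bitmap is a tuple of rows, each row a string of '0' and '1' pixels.
Bitmap = Tuple[str, ...]
Glyph = Tuple[Bitmap, int, int]       # (bitmap, x, y) of one segmented shape
Placement = Tuple[int, int, int]      # (token_id, x, y)


def tokenize_page(glyphs: List[Glyph]) -> Tuple[List[Bitmap], List[Placement]]:
    """Store each distinct bitmap once and refer to it by its index."""
    token_ids: Dict[Bitmap, int] = {}
    tokens: List[Bitmap] = []
    placements: List[Placement] = []
    for bitmap, x, y in glyphs:
        token_id = token_ids.get(bitmap)
        if token_id is None:
            # First time this exact shape is seen: add it to the dictionary.
            token_id = len(tokens)
            token_ids[bitmap] = token_id
            tokens.append(bitmap)
        placements.append((token_id, x, y))
    return tokens, placements


def render_page(tokens: List[Bitmap], placements: List[Placement],
                width: int, height: int) -> List[str]:
    """Rebuild the page image by stamping each token at its position."""
    canvas = [["0"] * width for _ in range(height)]
    for token_id, x, y in placements:
        for dy, row in enumerate(tokens[token_id]):
            for dx, pixel in enumerate(row):
                if pixel == "1":
                    canvas[y + dy][x + dx] = "1"
    return ["".join(row) for row in canvas]


if __name__ == "__main__":
    shape_a = ("010", "111", "101")
    shape_b = ("110", "101", "110")
    page = [(shape_a, 0, 0), (shape_b, 4, 0), (shape_a, 8, 0)]
    tokens, placements = tokenize_page(page)
    print(len(page), "glyphs stored as", len(tokens), "tokens")
    for line in render_page(tokens, placements, width=11, height=3):
        print(line)
```

The compression gain comes from the fact that a typical text page repeats a small set of character shapes many times, so the dictionary stays small while the placement list needs only a few numbers per glyph.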

Future Work

Tokenized PDF generation will be enhanced to incorporate automatically generated font information, in order to create valid PDF/A output and achieve a good compression ratio while maintaining a faithful representation of the captured documents.