How to create a pdf to whatever conversion

The pdf format is not designed for editing or for simple text extraction. With some pdf files it is impossible to extract a word/line/column representation. Despite these limitations, most pdf files are "sane", so we can extract text and build words from letters, lines from words and finally columns from lines. The text output design in pdfedit makes it easy to add arbitrary output formats.

First, the conversion instantiates the PageTextSource class with the template parameters and passes it as a functor to the StateUpdater::updatePdfOperators function. This means that the functor is called after each operator update. It stores the formatting operators in a PageSimpleFragment and, when a text operator is encountered, creates a new PageSimpleFragment.
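The functor mechanism described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not pdfedit's code: Operator, Fragment, FragmentCollector and updateOperators are hypothetical stand-ins for the real operator type, PageSimpleFragment, PageTextSource and StateUpdater::updatePdfOperators, whose actual signatures are not shown in this document. The sketch also assumes that a text operator closes the current fragment.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a pdf content stream operator.
struct Operator {
    std::string name;     // e.g. "Tf" (set font) or "Tj" (show text)
    std::string operand;  // simplified: one textual operand
    bool isText() const { return name == "Tj"; }
};

// Hypothetical stand-in for PageSimpleFragment.
struct Fragment {
    std::vector<std::string> formatting;  // formatting operators collected so far
    std::string text;                     // text shown by the text operator
};

// Hypothetical stand-in for PageTextSource used as a functor.
class FragmentCollector {
public:
    // Called once per operator by the updater below.
    void operator()(const Operator& op) {
        if (op.isText()) {
            // Assumption: a text operator completes the current fragment
            // and a new, empty fragment is started.
            current_.text = op.operand;
            fragments_.push_back(current_);
            current_ = Fragment();
        } else {
            current_.formatting.push_back(op.name);
        }
    }

    const std::vector<Fragment>& fragments() const { return fragments_; }

private:
    Fragment current_;
    std::vector<Fragment> fragments_;
};

// Hypothetical stand-in for StateUpdater::updatePdfOperators:
// walks the operators and calls the functor after each one.
template <typename Functor>
void updateOperators(const std::vector<Operator>& ops, Functor& functor) {
    for (const Operator& op : ops) {
        // A real updater would update the graphical state here.
        functor(op);
    }
}
```

Passing the collector by reference lets the caller read the collected fragments after the walk has finished.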

Then it calls the format method of PageTextSource, which performs the transformation from letters to columns.
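To illustrate the first step of such a transformation, here is a minimal sketch that groups the letters of one line into words. The criteria pdfedit actually uses are not described in this document, so the Letter structure, the buildWords function and the gap threshold are all assumptions made for illustration: letters are sorted by horizontal position and a new word starts whenever the gap to the previous letter exceeds maxGap.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical letter with its horizontal position and width on the page.
struct Letter {
    double x;      // left edge
    double width;  // advance width
    char ch;       // the character itself
};

// Groups letters that lie on one line into words (illustrative only).
// A new word starts when the horizontal gap between the end of the
// previous letter and the start of the next one exceeds maxGap.
std::vector<std::string> buildWords(std::vector<Letter> letters, double maxGap) {
    assert(maxGap >= 0.0);

    // Letters may arrive in any order, so sort them from left to right.
    std::sort(letters.begin(), letters.end(),
              [](const Letter& a, const Letter& b) { return a.x < b.x; });

    std::vector<std::string> words;
    double previousEnd = 0.0;
    for (const Letter& letter : letters) {
        if (words.empty() || letter.x - previousEnd > maxGap) {
            words.push_back(std::string());
        }
        words.back().push_back(letter.ch);
        previousEnd = letter.x + letter.width;
    }
    return words;
}
```

Lines could be built from words, and columns from lines, in a similar way by comparing vertical and horizontal distances.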

Finally, when the page is parsed into reasonable structures, the output method is called, which tries to build the output format either from all words or from all columns (which contain lines, lines contain words, ...).

The output structure can decide whether to build the output from one or both possibilities. XmlOutputBuilder builds xml from columns, iterating through their lines, then words, and then letters.

New formats

There are two things to be done to enable your new format.

1) Implement a class derived from OutputBuilder, which means implementing one or both build functions. XmlOutputBuilder is an example of such a class.
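The real OutputBuilder interface and the XmlOutputBuilder declaration are not reproduced in this document, so the following is only a hypothetical sketch of what step 1 might look like. Word, Line, Column, OutputBuilderSketch and PlainTextBuilder are invented names, and the two build overloads merely mirror the "from words" and "from columns" possibilities described earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified page structures.
struct Word   { std::string text; };
struct Line   { std::vector<Word> words; };
struct Column { std::vector<Line> lines; };

// Hypothetical stand-in for OutputBuilder; the real interface may differ.
// Both build functions default to doing nothing, so a derived class can
// implement one or both of them.
class OutputBuilderSketch {
public:
    virtual ~OutputBuilderSketch() {}
    virtual void build(const std::vector<Word>&) {}
    virtual void build(const std::vector<Column>&) {}
};

// Example format: plain text, one output line per page line and an
// empty line after each column.
class PlainTextBuilder : public OutputBuilderSketch {
public:
    using OutputBuilderSketch::build;  // keep the word overload visible

    void build(const std::vector<Column>& columns) override {
        for (const Column& column : columns) {
            for (const Line& line : column.lines) {
                for (std::size_t i = 0; i < line.words.size(); ++i) {
                    if (i > 0) out_ << ' ';
                    out_ << line.words[i].text;
                }
                out_ << '\n';
            }
            out_ << '\n';
        }
    }

    std::string str() const { return out_.str(); }

private:
    std::ostringstream out_;
};
```

Only the column-based build function is overridden here; a format that does not need layout information could override the word-based one instead.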

The gui menu item (Tools->Pdf to xml) is implemented in the following files in src/gui:

scripts/pdftoxml.qs - called when the Pdf to xml menu item is selected

base.h/cc - pdftoxml method (where CPage::convert function is called)

basegui.h/cc - fileSaveDialogXml method (save xml file dialog) and selectPagesDialog method (select pages dialog)

selectpagesdialog.h/cc - select pages dialog

Implementation notes

Note: The transformations (letters to words, words to lines, lines to columns) are not heavily tested and are rather simple (some sorting functions are missing).

Note 2: The biggest limitation is fonts. The font specification is embedded in the pdf file, but neither pdfedit nor any other tool I am aware of can extract these fonts. This is partly because not every character has to be present in the specification, so the embedded font may be incomplete and therefore unusable.