~ Babbling of a code monkey

Introduction

The Document Liberation Project was officially announced at LGM in Leipzig on April 2, 2014, a year ago. We (the founding members) gave a talk about the project later the same day.

The project was planned as an umbrella over autonomous projects that handle various file formats and that use the same framework (I do not like this term, but I do not have a better one). This makes it very easy to integrate a new import library into an application, because it uses the same interface as other already integrated import libraries. At the same time, it allows the libraries to exist as independent projects, with different maintainers, release schedules, licenses etc. Let me repeat this: we have never wanted to exercise any strict control. We want people to work with us, not for us.

On the occasion of the project’s first birthday, I think it is time for a little reminiscence. I also want to share an outlook for the future.

The past year

We did have high hopes for the future a year ago; however, not all of them have been fulfilled (or not completely).

One of the main highlights was the release of a new framework library called librevenge in May. librevenge contains all the document interfaces and helper types that used to be spread over several of the import libraries, thus simplifying the dependency chain. A part of this release was a switch of all existing import and export libraries to librevenge.

We started a new library for import of Adobe PageMaker documents–libpagemaker. It was written by Anurag Kanungo as part of GSoC 2014 and it supports the format of PageMaker 6.x.

We have also extended existing libraries. Some have gained support for more formats: for example, Laurent Alonso has added support for Microsoft Works Spreadsheet and Database to libwps and is currently extending that to Lotus 1-2-3. He has also added support for more than twenty legacy Mac formats to libmwaw. There have been various improvements to most of the other import libraries.

We have created two export libraries: for EPUB and AbiWord. The libraries are called libepubgen and librvngabw, respectively.

Unfortunately, we have mostly failed to attract new developers (or contributors in general). We did receive an occasional patch for one library or another, but only one substantial new feature: Miklós Vajna has implemented reading of metadata from Visio and Publisher documents. We also hoped that other people would start new libraries, but that has not happened yet (or we do not know about it).

Another valuable way to contribute is to provide sample documents. We have not attracted many people in this area either. Let me at least mention Steven Zakulec and Derek Kalinosky, who contributed, respectively, Microsoft Publisher and CorelDRAW documents for regression testing.

The bright(?) future

There are some interesting developments coming this year. We should see some progress on the Adobe FreeHand and Apple Pages import filters (in libfreehand and libetonyek, respectively). We will also be doing at least one new import filter as part of GSoC 2015–we have received proposals for import of Apple Numbers, Xara Xtreme and Zoner Draw formats. I also hope to finally move EPUB import in libe-book forward a bit.

Why this post?

I am writing this as a direct response to Miklos’s blog post on the same theme. Miklos argues against the current setup for regression testing that all our import libraries use. I do not believe his approach would be substantially better than the current one. I will try to summarize my thoughts about it in the following text. I, however, admit that the current setup is not quite perfect and I can envision some improvements…

How the current regression test suites work

For every import library, there is a separate repository that contains the regression test suite. It consists of sample documents and pre-generated output files in several formats, produced by the command line conversion tools that every library provides. The most important of these is the so-called “raw” format: it is simply a serialization of the librevenge API calls. Additional output formats include ODF and, for graphic libraries, SVG.

The test suite is driven by two Perl scripts: regression.pl checks that the current output matches the saved output and writes a diff file for any difference; regenerate_raw.pl updates the saved output files. These scripts are copied from test suite to test suite and adapted for each use (e.g., which formats are checked, the location of the test directories, etc.)

Better way? Or maybe not…

This section discusses the pros and cons of Miklos’s approach in the context of DLP import libraries. It quotes from Miklos’s blog post.

Better focused checks

Being automatically generated, you have no control over what part of the output is important and what part is not — both parts are recorded and when some part changes, you have to carefully evaluate on a case by case basis if the change is OK or not.

This is not as big a deal as it might seem, especially if the changes are checked regularly and the test repository is kept up to date. Usually the changes are quite localized and easy to verify.

Single-point failure

… from time to time you just end up regenerating your reference testsuite and till the maintainer doesn’t do that, everyone can only ignore the test results — so it doesn’t really scale.

The test suite is in gerrit, next to the main repository. If someone submits a fix for review, they can submit an update to the test suite too. I admit that the two changes would not be linked in any way, but we do not get enough contributions for that to be a problem. And it would be possible to make the test suite a submodule of the main repository, which would fix this.

No way to forget to run the tests

Provided that make distcheck is ran before committing, you can’t forget to clone and run the tests.

As the de facto release engineer for the majority of DLP’s libraries, I have a checklist of things to do before a new release. Running the regression tests is just one item on that list.

Less prone to unrelated changes

On the other hand, it is extra work to write them. And, more importantly, to keep them in sync with the code, so that they cover everything that is necessary. With the current approach, any change in the output is immediately visible.

Possible to commit code change + test in a single commit

Having testcase + code change in the same commit is one step closer to the dream …

To be fair, the current approach makes it rather difficult to run the test suite against an older checkout, because there is no association with a particular commit in the test repository. But I do not think I have ever needed this, so I do not see it as a problem.

Big increase in size of the main repository

LibreOffice’s code is huge–20 MB of test files would be about 1% of its size. This is not true for the libraries we are talking about. Their size is several MB at most, so the addition of a number of data files immediately shows up in the repository size. It also shows up in the release tarball’s size, which is an even more important point.

Let me show an anecdotal example: the current size of the unpacked libetonyek tarball is 3 MB. The cumulative size of the test documents in its test repository is 24 MB. And these documents only cover the Keynote 5 format…

Testing of multiple versions of a format induces copy-paste

We typically have tests for multiple versions of the same file format. These also often have approximately the same content across all versions. I assume that, when adding a new test file that is based on a similar file produced by a different version of the application, the test case would most probably be copied from the test case for that other file. That means that if a change is needed later (e.g., to add a new check), it has to be duplicated in several places. This increases the risk that some of the test cases will not be updated.

Possible improvements

Diff is not always good enough

If there is a change in the output, regression.pl generates a diff. This, however, is not always the best way to show the changes. In some cases, word diff (e.g., generated by dwdiff) would be much better.

Dependency on other libraries

All the output generators are implemented in external libraries. This is not a problem for the “raw” output, as this is not expected to change. But ODF output is often affected by changes in libodfgen. Unfortunately, this also means that the tests only work with a specific version of libodfgen–typically the current master. This is a problem and I think that our decision to test ODF conversion in the libraries’ test suites was wrong and counter-productive. IMHO the output generators should be tested in the libraries that implement them.

This is already partly done for libodfgen, as we have test code that generates various ODF documents programmatically. But the output is just saved to files that must be examined manually–there is no automated check of the output. IMHO Miklos’s approach would be really beneficial here.

Conclusion

While the current regression testing setup is not perfect, there is no need to radically change it, as the proposed alternative does not really add many benefits. The biggest concern is a considerable increase in size of the release tarballs. However, we should limit the tests to the use of the raw format and move tests of output generators to the libraries that implement them. It makes sense to use Miklos’s approach to test these.

Since we announced the Document Liberation Project and its accompanying framework library, librevenge, there have been several requests for an EPUB generator. (librevenge itself contains generators for CSV, HTML, SVG and plain text. There is also a separate library for generating ODF, called libodfgen.) I had an idle moment two weeks ago and did not feel like working on any of my existing projects, so I decided to look at this. I started a new library, libepubgen (predictable, eh?). The core of it was the HTML generation code from librevenge, modified to write its output to an abstract output interface instead of into librevenge::RVNGString, and to create XHTML 1.0 instead of HTML 4.01.

Since then, in more idle moments, I added support for images and simple splitting of the HTML output to multiple files. I also integrated the new library into writerperfect, so there are now command line tools to convert various text formats supported by writerperfect to EPUB (for every foo2odt, there is now foo2epub as well). What is still missing is handling of foreign binary objects, the drawing parts of librevenge::RVNGTextInterface (I think I will convert these to SVG images), smarter splitting of HTML files (provided there is enough information from the input) and probably more things I cannot think of right now. It would also be nice to allow conversion of presentations (in other words, to implement librevenge::RVNGPresentationInterface), as they tend to contain lots of text.

The interface is probably not going to change much at this point (except for adding functions for registering binary object handlers and possibly some configuration to libepubgen::EPUBTextGenerator). One thing to highlight is that, like libodfgen, libepubgen does not create the output file directly. Instead, it provides an interface (libepubgen::EPUBPackage) which the caller has to implement. All internal files are then generated using this interface. The only exception is mimetype, which must always be created by the caller (only if it creates a Zip container, of course; mimetype is not needed for a filesystem container).

I will continue to work (on and off) on this, but I have no estimate when the first release is going to happen. If anyone is interested and wants to help, patches are welcome. Bug reports too, but remember this is a new project, so bugs are expected.

The story

I had wanted to make the switch to librevenge-based libraries in LibreOffice 4.3 ever since we first started to work on librevenge in November 2013, so I immediately began working on that. I had already prepared the ground for the switch some time ago, when I extended the WPXSvInputStream class with the functionality it would need to implement the librevenge::RVNGInputStream interface. I had also added import tests for many of the supported formats, in anticipation of breakage 🙂 So, all that remained was to bundle librevenge, update all the other libraries and then do some WPX-to-RVNG substitutions in the writerperfect module. Sounds easy, right?
As it turned out, it was not. The first problem appeared when I tried to build librevenge as a shared library (all the other libraries are built statically for historical reasons; most of them are only needed by a single shared library anyway). Unfortunately, it turned out that our autotools wrappers for the MSVC compiler cannot handle that, so this needed special handling for MS Windows: building the code just as if it were an internal library. Which should have been easy, except that the linker failed to create the DLL, without any error message. I did not have a Windows machine, so I tested the build by scheduling builds through gerrit, which made debugging the problem a bit cumbersome. After some fiddling with the code and makefiles, and several hours later, I figured out that it was because the code did not export any symbols, and I defined the macros needed to enable that. [Censored: several choice expletives addressed at the authors of link.exe at Microsoft.] After that, I only needed to repeat the same recipe to change libodfgen from a static to a shared library.
So, about a day later than I originally expected, I finally started to rebase the libraries. This went a bit more smoothly: I only needed to patch two of the libraries–one to avoid an unwanted dependency and the other to work around a bug in MSVC–and make a few adjustments to gbuild to make it build on Windows. And a few more fixes to get it to build with system libraries. The whole thing was pushed to master (and a bit later to the libreoffice-4-3 branch) on Monday, 3 days after the release of librevenge.
There are still outstanding items: the most important of them is that the import tests for 3 of the file types handled by these libraries fail. I suspect it is an integration problem, but I have not checked and have just disabled the tests for now. After I fix these, I also want to enable the formats newly supported by libmwaw and libwps.

The conclusion

Now it is time to return to the topic sentence and say together with ./configure:

checking which librevenge to use... external
checking for REVENGE... yes

This short series of blog posts attempts to show how to use librevenge to facilitate writing of import filters for office document formats. It is focused on writing libraries, as I would like to encourage sharing of import filters between Open Source projects.
There has been no release of librevenge yet, but I do not expect any significant changes at this stage, so what I say here should remain true later too. Despite that, librevenge is already used by 10 libraries and there are more that are works in progress.

What is librevenge?

librevenge is a library that simplifies the writing of import filters by providing interfaces for typical office file formats: text documents, presentations, vector drawings and spreadsheets. Using it frees import library developers from the necessity of inventing their own interface and allows them to focus on writing the actual import code. It also simplifies integration of new import libraries into applications: to integrate a new library of a kind the application already handles, one only needs to write some boilerplate code that registers the importer in the application’s filter framework (and that code can typically be copied from a previous occurrence with just a slight modification).
The library replaces the existing import interfaces of libwpd (text documents), libwpg (vector drawings) and libetonyek (presentations). Support for spreadsheets is new–we had not had that in any library previously. librevenge is split into three parts:

librevenge, containing the import interfaces and types used there (and for historical reasons also an SVG generator for vector drawing interface);

librevenge-stream, containing the stream interface that is used for input data;

librevenge-generators, containing implementations of the import interfaces that produce some useful output document formats (more about that later).

How does it work?

The import interfaces are event-based. In other words, they define callbacks that should be called at various stages of import of the source document, like startDocument(), closeParagraph() or insertText(). The caller must provide an implementation of the chosen interface that actually does something meaningful, e.g., builds an internal document model. Alternatively, the caller can use one of the prepared implementations–we call them generators–that are available in the librevenge-generators library. These are plain text and HTML for text documents; SVG for presentations and drawings; and CSV for spreadsheets. There are also ODF generators for all document kinds in libodfgen. Note that once one has implemented a generator, it can be used with all import libraries for the same kind of document.
The callback functions have (we hope) self-explanatory names. Most of them are paired; in that case the pairs are named either start/endFoo() or open/closeFoo(). Standalone callbacks are named insertFoo() or defineFoo(). All opening callbacks and most of the standalone callbacks take a single argument which is a property list. The closing callbacks never take an argument.
The following sections contain more details about the helper types. All are defined in namespace librevenge, but that namespace is omitted for the sake of brevity.

Strings

RVNGString is used for passing strings through the interfaces. It always uses UTF-8 encoding.

Properties

RVNGProperty is an interface for specific properties handled by RVNGPropertyList. It contains convenience functions to extract objects of types supported by librevenge (except RVNGPropertyList itself).

Property lists

RVNGPropertyList maps string keys to RVNGProperty objects. It has factory functions for constructing and inserting properties of various data types: int, double, RVNGString, RVNGBinaryData,… Note that while the insertion functions for numbers allow specifying a unit, it is not advisable to use anything but RVNG_UNIT_INCH (which is the default), as legacy generators do not really handle units, so the results might be incorrect.

Property list vectors

RVNGPropertyListVector is a sequence of RVNGPropertyList objects. This type is also used to implement nesting of property lists: an RVNGPropertyList cannot contain another RVNGPropertyList directly, but it can contain an RVNGPropertyListVector.

Binary data

RVNGBinaryData serves to store, well, arbitrary binary data 🙂 As that is a quite common operation, it is possible to convert to/from base64.

String lists

RVNGStringList is just that: a list of RVNGString objects. The predefined drawing and presentation generators use an RVNGStringList for output, inserting each page/slide separately.

Streams

The RVNGInputStream interface, defined in the librevenge-stream library, serves to pass around the input data. It has the usual functions of a read-only stream: readNBytes(), seek(), tell(), isEnd(), etc. It also has functions for handling internal structure, as that is useful for many formats (which use Zip, OLE2, etc. internally).
There are two implementations of RVNGInputStream available in librevenge-stream: RVNGFileStream, which also transparently handles Zip and OLE2, and RVNGDirectoryStream, which is useful for handling directory-based document formats. An application typically needs to implement its own stream type that wraps whatever internal stream type it uses.

Creating an import library

So you have decided that you like librevenge and that the library for parsing format XYZ that you are contemplating writing will use it. Because we want to make it as easy as possible to start, we have written a tool that creates a skeleton of a new project. It is called project-generator and it can be found in this repository.
It has several options, but only a few of them are really needed:

-p sets the project name;

-d sets one-sentence description (used in .pc file and in README);

-a sets the main author and -e their e-mail; both are used in .rc files and in CREDITS;

-D, -P, -S, -T select the kind of document the library handles. -D is for vector drawings, -P for presentations, -S for spreadsheets and -T for text documents (this is the default).

Anatomy of a project

So we have created a new project named, let’s say, libfoo. Let’s take a peek at what is inside…

Build system

The project uses autoconf and automake for the build, as we believe that they are the least bad of all the existing build systems. In addition, there are project files for several versions of Microsoft Visual Studio in build/win32.

Headers

The public headers are in inc. The public interface is really minimal: a class (by default named FooDocument) that has two static functions:

isSupported() takes an RVNGInputStream and tests whether the input has the right format;

parse() takes an RVNGInputStream and an RVNGXYZInterface (which one depends on the document kind the library imports), reads the input and produces the document by calling RVNGXYZInterface’s callbacks.

That means that it is the caller’s responsibility to supply the input stream and the generator–by providing suitable implementations for them or using one of the existing ones from librevenge-generators and librevenge-stream. The library only uses the two interfaces.

Library

Since the library only works with streams, it has no notion of the path to the source document (as it cannot even know if the input stream is based on an actual file or just a memory buffer). Therefore, if a library needs to read other files (especially with paths relative to the input), it has to delegate that task to the caller in some way.
The code of the library is in src/lib. The project-generator produces FooDocument.cpp, containing empty implementations of the two public functions, and libfoo_utils.h and libfoo_utils.cpp, with some functions and types that we generally find helpful but do not want to put into librevenge for various reasons. Many of the functions handle reading numbers and strings from an RVNGInputStream (e.g., readU32() or readCString()), but there are other things too. (I am being deliberately vague here, as more functions can be added–or existing ones removed–in future versions of project-generator.)

Unit tests

Unit tests–if any–should go into src/test and use CppUnit. When implementing a new test class, do not forget to use CPPUNIT_TEST_SUITE_REGISTRATION(TestClassName) at the end (at namespace scope). That macro registers the class in the default test suite with the test manager, so the tests from this class will actually be run. (This could be done manually, of course, but why bother?)

Command-line tools

Last but not least, project-generator creates several command-line conversion tools into formats suitable for the document kind: HTML and plain text for text documents; SVG for vector drawings; SVG and plain text for presentations; and CSV for spreadsheets. The sources for these tools are in subdirectories of src/conv, named by the output type.
There is also another tool, converting to the so-called “raw” format. The raw generator prints all callbacks and their arguments. It also allows checking the proper nesting of paired callbacks. This is particularly useful during development, but we use the output for regression tests as well.
All these converters use the generators provided by librevenge-generators.

Stay tuned!

In the next part I will present a complete parser for an invented text document format.