May 20, 2007

Review: RiverDocs Converter

Disclosure: This is a paid review. RiverDocs Limited have had no
influence on the tone or content of this review.

Summary

An essential tool for any organisation which publishes Microsoft Word or PDF files online, RiverDocs Converter is vastly superior to any other conversion software currently available. There's now no reason for publishers not to offer accessible, high quality HTML versions of documents previously published only in proprietary formats. The parser even compensates for poorly authored source documents, previously a significant barrier to producing accessible, semantic HTML versions of Word and PDF documents.

It's not a magic bullet though - every conversion requires human-checking, and documents with any degree of complexity require a degree of input from an experienced web editor - but despite a slightly weak editor it's still well worth the price and will only get better in future versions given the publisher's focus on research and development.

Introduction

RiverDocs Converter is a software package for the Microsoft
Windows operating system which claims to convert documents
designed for print into structured, accessible HTML documents
for online delivery. In short this means it'll take PDFs and
Microsoft Word files and attempt to convert them into a format
more suitable for delivery and consumption over the web.

PDF and MS Word are beloved of government and corporations
who often need to publish large documents quickly, but these
formats are primarily designed for printing, not for delivery
online, and have serious accessibility issues associated with
them. So the potential benefits from effective conversion
software are enormous - being able to offer HTML versions of
these documents cost effectively is something that hasn't been
possible before.

Installation

Installation was straightforward, taking a couple of minutes
on my workhorse desktop PC.

The software does require the latest
version of the Microsoft .NET 2.0 Framework. If this isn't
already installed and available you will be prompted to
download and install it.

Getting started

Starting the software for the first time you are presented
with a quick guide to converting your first document, and the
clean, functional RiverDocs interface.

Test 1 - my first conversion

To test the software for the first time I used a PDF
document regarding chimney stack removal I found on Cambridge
City Council's website at:

It's a 4 page document containing a cover sheet, and a mix
of different levels of heading, bullets and images. The PDF
document was not tagged.

Opening the file displays it in the main RiverDocs
window:

Clicking the Convert button started the conversion, which
took less than a second using the default settings. The
interface changes to a split-screen affair, with the original
document in the left pane, and the converted document in the
right pane:

To give an idea of the quality of
conversion and mark-up the software can produce
automatically, I wanted to save the document immediately.
Admittedly this is not the intended real-world usage of the
product, but it does provide an idea of the quality of the baseline
conversion prior to manual editing.

Big River had provided me with a one page crib-sheet
covering the major interface elements, so I knew that the Save
function was for saving a RiverDocs project, and the Publish
function was for saving the converted document as HTML, CSS and
images.

Clicking the Publish button presents the Publish dialogue
box:

In addition to publishing as HTML, the
software also supports output in CHM (Microsoft Compiled
HTML Help) format.

To keep things tidy I wanted to publish this version into a
new folder, but this is not a standard Windows file dialogue
box, and doesn't provide the facility to create a new folder,
so I had to switch out to Windows Explorer to do this before
publishing the document in RiverDocs.

But it turns out the file name entered into this dialogue
box is actually used as a folder name, which will be created
for you and into which the document is published. These sorts
of interface issues are symptomatic of the software's relative
youth, and will no doubt be ironed out as the product
matures.

Publishing this document took less than a second.
Here are the results:

The default settings produce HTML documents with an XHTML
1.0 Transitional doctype, generating a separate HTML file for
each page of the source PDF, an index HTML document containing
a generated table of contents, a single CSS file and an images
folder containing converted images. The CSS is valid, and
attempts to mimic the style of the original document as closely
as possible.

As a comparison I ran the same file through Abbyy's PDF
Transformer, another PDF to HTML conversion tool. The results
were vastly less impressive:

The Abbyy software makes no attempt to
produce structured HTML, instead presenting every single
line in the document as a paragraph and styling them to
appear as closely as possible to the original PDF.

In general the quality of the default output from RiverDocs
is extremely impressive. In this case there were just two
validation problems: an unclosed list item in the generated
table of contents, and missing alt attributes for the images on
the final page. Since the default output is "section based" the
parser moved the words "GUIDANCE NOTES" onto a page by itself
despite displaying it as part of the title page in the preview
pane, which was the only deviation from the page layout of the
original.
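
As a rough illustration of the second validation problem, a check for images with no alt attribute can be sketched in a few lines of Python using only the standard library. This is not how RiverDocs or HTML Tidy performs the check; the class and function names are made up for the example.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> tag that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also arrive here, because the
        # default handle_startendtag calls handle_starttag.
        attributes = dict(attrs)
        if tag == "img" and "alt" not in attributes:
            self.missing.append(attributes.get("src", "(no src)"))


def find_images_without_alt(html):
    """Return the src values of images in `html` that lack an alt attribute."""
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.missing


if __name__ == "__main__":
    sample = '<p><img src="images/a.png" alt="Wall"/><img src="images/b.png"/></p>'
    print(find_images_without_alt(sample))
```

An empty alt attribute (alt="") counts as present here, since that is the accepted way to mark a purely decorative image.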

But this isn't a fair test of the software, which wasn't
designed to be operated in this manner. While the results are
good, they aren't good enough to publish without manual
editing, so let's try again, only this time using some good old
human judgement.

Test 2 - getting serious

For the second test I wanted to take the same document but
publish it to a single HTML document of the highest quality as
close to the original format as possible. The process is the
same - open the file to be converted, and click Convert.

Metadata

Before getting stuck into the document itself I wanted to
specify some metadata for it. Fortunately RiverDocs makes this
very easy to do (just click the Metadata button), and provides
a default set of Dublin Core elements for completion:

It appears that additional user-defined elements can be
created, so publishers in UK government for example can easily
add eGMS metadata to converted documents:

Unfortunately these additional
elements didn't make it to my published document, a bug I've
reported to Big River.

Options

RiverDocs offers a number of options to customise the output
of the converted document. The most important are:

Publish mode: Can be single file,
section based (default) or page based. Section based
splits the document into sections based on a heading level
specified by the user.

HTML Tidy configuration: RiverDocs uses
the HTML Tidy library to identify and report issues with
converted documents. This can be set to A, AA or AA
(Strict). It's not clear from the documentation what the
difference between AA and AA (Strict) is.

CSS Options: Although the software
makes a fair attempt to reproduce the style of the original
document, it's likely that most publishers will want to use
established in-house styles for publishing to the web.
RiverDocs has full support for external stylesheets,
allowing the specification of a local file (which will
allow you to preview and edit the document with your styles
applied) and a relative path to be used when the document
is published. The option to use a remote stylesheet would
be a welcome addition.

HTML Navigation: Finally, the
navigation elements which allow the user to move from page
to page or section to section of the published document can
be renamed, or disabled.

The editor

For many users the area of the application where most time will be spent is the HTML editor, where the converted output can be modified and fine-tuned. In most cases this will be to either match the original document or to conform to a house web publishing style.

The editor always presents the output document in a
page-by-page format, regardless of the publish mode that's
currently set. It would be nice to be able to preview the
single page and section-based options.

The editor can be used visually in preview mode, or in
source mode which provides a simple text editor view of the
document page you're working on. As I wanted a single file
output and had set the options accordingly there was something
of a disconnection between working on a separate HTML file for
each page, and the intended output. As far as I can see there
is no way to preview the single file output prior to
publishing.

The toolbar provides the standard editing tools you'd expect to
find on a simple HTML editor. These generally work as
expected, although there are some quirks - for example undo
will only remember changes you've made until you switch to
source mode: so if you make a change, switch to source mode and
back to visual mode, you'll need to correct any errors manually
in source mode.

Once you've got used to the way the editor functions it's a
reasonably comfortable working environment, but don't expect it
to have the functionality of DreamWeaver. I can foresee many users
doing the initial conversion in RiverDocs and taking the
published output into the editor of their choice to complete
the process: indeed if I was using RiverDocs on a daily basis
to convert a large number of files this is the way I'd work -
the software's value lies in its conversion capabilities, not
its editing capabilities.

One of the most common problems that will arise from
automatic conversion is that of images and appropriate alt
attributes. Editing images is easy - select the image in the
editor, and click the image icon:

The id is a temporary value used by
the software during conversion and editing, and is removed
on publishing.

Screen capture

One very nice feature of RiverDocs is the screen capture
tool. On the final page of the original PDF is a diagram
showing a cross-section of a wall, with some labels indicating
particular features of the diagram. Since the PDF was generated
from Adobe PageMaker, the diagram consists of an image object
and a series of text objects for the labels. In the automatic
conversion RiverDocs quite rightly converted these separately,
which can be seen on the last page of the output
of test 1.

In my final version I want the image and labels as a single
image, and this is where the screen capture tool comes in:

It operates like any screen capture tool you've used before
- highlight the area to be captured and click an icon. In
RiverDocs the highlighted area will be inserted into your HTML
document as an image.

You've got issues

The software
provides assistance to help you identify and correct
potential issues with the converted document. The Issues
icon gives a quick idea of the number of issues identified
by the software at any stage after automatic conversion.
Clicking the icon opens a third pane with details of the
issues:

The potential issues highlighted
include missing alt attributes on images. I was disappointed
to note that alt text from objects in tagged PDFs wasn't
carried across to the converted HTML document. Otherwise the
guidance provided by the issues is sound, based as it is on
HTML Tidy - those of you familiar with the Tidy extension
for Firefox will know what to expect.

For non-expert users this provides an extremely useful indication of where there are potential problems in the converted document, and the separation of current page issues and whole document issues guides such users through the document with ease. Personally I was more comfortable editing the document first before using the issues tool - picking up the issues I could see, modifying structure, adding or correcting alt attributes, generally tidying the document up - but that's probably no more than a reflection of my workflow habits.

Test 2 results

It took 10 minutes from opening the original PDF to
publishing this version - very impressive results in such a
short space of time.

Test 3 - getting more complex

To really test the software we need something a little more
complex than a single-column, text and images document. On the
Clackmannanshire Council website I found a 24 page consultation
document laid out in 2 columns, which included multiple levels
of headings and a data table:

It took me about 30 minutes to tidy the document up in
RiverDocs, but I was still left with a lot of redundant classes
with names like "font19" and all those named anchors generated
for the table of contents. Cleaning up the mark-up in RiverDocs
proved to be a bit of a chore, so I tried again, this time
dumping the output immediately into DreamWeaver.

This was a 12 page Word document, and conversion took
noticeably longer than PDF conversion, at about 20 seconds. The
only real issues with the conversion were the failure to
convert Word bullets to HTML lists and the failure to pick up
alternative text on images. Other than this the structure was
accurately represented and the images correctly positioned.

The converter doesn't appear to parse the styles used in
Word documents. I converted a test document which was styled
throughout as paragraphs, but with headings made bold with
larger font sizes, and the headings were still recognised.
RiverDocs therefore accommodates poorly authored, unstructured
source documents by analysing the font size and weight and
assigning heading levels accordingly. This
is a great feature given the preponderance of incorrectly
produced Word documents in many organisations.
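
To make the idea concrete, here is a minimal Python sketch of how heading levels might be guessed from font size and weight alone. It is an assumption about the general technique, not a description of the RiverDocs parser, whose algorithm isn't documented here. The Run type and the infer_tags function are hypothetical names invented for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Run:
    """A block of text with the only two properties the heuristic looks at."""
    text: str
    font_size: float
    bold: bool


def infer_tags(runs):
    """Pair each run's text with a guessed tag: 'p' or 'h1'..'h6'."""
    if not runs:
        return []
    # Assume the most frequently used font size is the body text size.
    body_size = Counter(r.font_size for r in runs).most_common(1)[0][0]

    def is_heading(run):
        # A heading candidate is bold and larger than the body text.
        return run.bold and run.font_size > body_size

    # The largest heading size becomes h1, the next h2, and so on (capped at h6).
    sizes = sorted({r.font_size for r in runs if is_heading(r)}, reverse=True)
    level_for_size = {size: min(i + 1, 6) for i, size in enumerate(sizes)}

    return [
        (f"h{level_for_size[r.font_size]}" if is_heading(r) else "p", r.text)
        for r in runs
    ]


if __name__ == "__main__":
    document = [
        Run("Guidance notes", 18, True),
        Run("Introduction", 14, True),
        Run("Some body text.", 10, False),
        Run("More body text.", 10, False),
    ]
    for tag, text in infer_tags(document):
        print(tag, text)
```

A real converter would need far more context than this (position on the page, numbering, line length), which is why every conversion still needs checking by a human.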

Annoyances

Given the immaturity of the package there are some
inevitable annoyances with the interface and output:

Table of contents: There appears to be
no way to disable the table of contents for a document you
wish to publish as a single page. This means superfluous
named anchors are scattered throughout the output HTML, and
removing them within RiverDocs is only possible in source mode. In a long
document this can quickly become tedious. It would also be
an improvement if the TOC used ids rather than named
anchors.

Keystrokes in source mode: Some of the
standard Windows keystrokes have been hijacked in source
mode - for example Ctrl+A should highlight the entire text,
but instead pops up an "insert anchor" dialogue - worse
still, cancelling that dialogue inserts a single "a"
character into the source.

Give me vanilla output: I'd love to see
the option to output plain, vanilla HTML with no ids, named
anchors, classes or other generated content. In many
organisations the HTML output will be dropped into a
template where headings, paragraphs and other elements are
already styled by a surrounding div or the document body
itself. (Note: I did manage to suppress the proliferation
of classes like "font19" by specifying a blank CSS file in
the output options.)

Mark-up issues: The mark-up is
sometimes sub-optimal, for example:

<p class="font9"><span
class="font9"><strong>NOTE: Some chimneys act as a
buttress and provide support to long
walls.</strong></span> <strong>Please check
with Building Control or a structural
engineer</strong><span
class="font9"><strong>,
before</strong></span> <span
class="font9"><strong>proceeding, to determine if this
is the case.</strong></span></p>

None of these are major problems though, and I would expect
the interface to improve as the software is developed further.
The key feature of the product is the conversion algorithm,
which is extremely impressive.
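
For anyone cleaning up published output outside RiverDocs, the redundant spans shown in the mark-up example above can be stripped with a short script. The following Python sketch uses only the standard library and is not a RiverDocs feature. It assumes the generated classes always look like "font" followed by digits, as in the examples seen in this review, and the names FontSpanStripper and strip_font_spans are invented for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: generated classes are exactly "font" plus digits, e.g. "font9".
FONT_CLASS = re.compile(r"font\d+")


class FontSpanStripper(HTMLParser):
    """Re-emit HTML unchanged, except for <span class="fontN"> wrappers."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.span_stack = []  # one entry per open span: True if it was dropped

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            dropped = bool(FONT_CLASS.fullmatch(dict(attrs).get("class") or ""))
            self.span_stack.append(dropped)
            if dropped:
                return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only swallow the </span> that closes a span we dropped earlier.
        if tag == "span" and self.span_stack and self.span_stack.pop():
            return
        self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_font_spans(html):
    """Return `html` with every <span class="fontN"> wrapper removed."""
    stripper = FontSpanStripper()
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.out)
```

The sketch leaves class attributes on other elements (such as the paragraph itself) untouched and does not merge the adjacent strong elements that remain, so the result is cleaner but still not hand-written quality.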

Conclusions

RiverDocs is an impressive product and an essential tool for
any organisation which has a need to publish more than a small
number of PDF and Word documents online. Simple documents take
no time at all to convert and tidy using the RiverDocs editor,
while I found more complex documents are best converted in
RiverDocs and then edited in a more powerful and functional
dedicated HTML editor such as DreamWeaver.

Future versions of RiverDocs are very likely to offer significant improvements, both in terms of quality of conversion and the application interface. The publisher is a single-product company, concentrating solely on the development of the RiverDocs Converter, and also funds applied research at Queen's University Belfast as well as other universities engaged in the fields of accessibility, artificial intelligence and character recognition.

About the reviewer

Dan Champion has worked in the web industry since 1995 through his company Champion Internet Solutions Limited, with clients in the private and public sectors. Between 1999 and 2007 he was responsible for Clackmannanshire Council's multi-award winning websites.

He is a regular speaker on the subjects of web accessibility, web standards and web strategy at conferences and workshops throughout the UK, has written on the subjects of e-government and web accessibility for the Guardian, and featured on national BBC Radio in various guises.