Writing Your Own RTF Converter

Introduction

In 1987, Microsoft introduced the Rich Text Format (RTF) for specifying simple formatted text with
embedded graphics. Initially intended to transfer such data between different applications on different operating systems (MS-DOS, Windows, OS/2, and Apple Macintosh),
today this format is commonly used in Windows for enhanced editing capabilities (RichTextBox).

However, as soon as you have to work with data in RTF format, additional requirements quickly emerge:

Simple extraction of text without consideration of any format information

Extraction and conversion of embedded image information (scaled or unscaled)

Conversion of the RTF layout and/or data into another format such as XML or HTML

Transferring RTF data into a custom data model

The component introduced in this article has been designed with a deliberately limited scope, so please keep the following in mind:

The component offers no high-level functionality to create RTF content

The present RTF interpreter is restricted to content data and basic formatting options

There is no special support for the following RTF layout elements:

Tables

Lists

Automatic numbering

All features which require knowledge of how Microsoft Word might mean it ...

In general, this should not pose a big problem for many areas of use. A conforming RTF writer should always write content with readers in mind that
do not know about tags and features introduced later in the standard's history. As a consequence, a lot of the content in an RTF document is stored
several times (at least if the writer cares about other applications). The interpreter takes advantage of this and simply focuses on the visual content.
Some writers in common use, however, support this alternate representation improperly, which leads to differences in the resulting output.
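One concrete instance of this redundancy is the \uN Unicode tag: the RTF specification has writers follow it with fallback characters for older readers, and \ucN states how many such characters to skip (the default is 1). The following Python sketch illustrates that skipping rule only. It is not the component's code, the name decode_unicode_runs is hypothetical, and it assumes a fragment containing nothing but \uN, \ucN, \'hh escapes, and plain characters (no groups, no other tags).

```python
import re

# One token per match: a \ucN tag, a \uN tag, a \'hh hex escape, or any single
# character. A tag may be followed by one delimiter space that belongs to it.
TOKEN = re.compile(r"\\uc(\d+) ?|\\u(-?\d+) ?|\\'([0-9a-fA-F]{2})|(.)", re.DOTALL)


def decode_unicode_runs(rtf_fragment: str, codepage: str = "cp1252") -> str:
    """Decode a simplified RTF fragment, skipping Unicode fallback characters."""
    out = []
    fallback_count = 1  # default of \uc according to the RTF specification
    to_skip = 0         # fallback characters still to be ignored
    for match in TOKEN.finditer(rtf_fragment):
        uc, u, hex_byte, char = match.groups()
        if uc is not None:
            fallback_count = int(uc)
        elif u is not None:
            code = int(u)
            if code < 0:
                code += 65536  # \uN is written as a signed 16-bit number
            out.append(chr(code))
            to_skip = fallback_count
        elif to_skip > 0:
            # A hex escape counts as ONE fallback character, just like a plain one.
            to_skip -= 1
        elif hex_byte is not None:
            out.append(bytes.fromhex(hex_byte).decode(codepage))
        else:
            out.append(char)
    return "".join(out)


plain_fallback = decode_unicode_runs(r"Price: 5 \u8364?")
hex_fallback = decode_unicode_runs(r"\u8364\'80 ok")
```

Both calls yield the euro sign exactly once: in the first case the fallback is the plain character "?", in the second it is the hex-encoded byte \'80.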

Thanks to its open architecture, the RTF parser is a solid base for development of an RTF converter which focuses on layout.

RTF Parser

The actual parsing of the data is done by the class RtfParser. Apart from the tag recognition, it also handles (a first level of) character
encoding and Unicode support. The RTF parser classifies the RTF data into the following basic elements:

RTF Group: A group of RTF elements

RTF Tag: The name and value of an RTF tag

RTF Text: Arbitrary text content (not necessarily visible!)
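To make the three element kinds more tangible, here is a minimal Python sketch of a tokenizer that sorts raw RTF into groups, tags, and text. It is an illustration under simplifying assumptions, not the RtfParser implementation: the names Group, Tag, Text, and parse_rtf are hypothetical, and hex escapes, Unicode handling, and binary data are left out.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Tag:
    name: str
    value: Optional[int] = None  # None if the tag carries no numeric value


@dataclass
class Text:
    content: str


@dataclass
class Group:
    children: List[object] = field(default_factory=list)


# A control word: backslash, letters, optional signed number, optional delimiter space.
_TAG = re.compile(r"\\([a-zA-Z]+)(-?\d+)? ?")


def parse_rtf(rtf: str) -> Group:
    """Classify RTF input into nested Group / Tag / Text elements."""
    root = Group()
    stack = [root]
    pending = []  # characters of the text run currently being collected
    pos = 0

    def flush():
        if pending:
            stack[-1].children.append(Text("".join(pending)))
            pending.clear()

    while pos < len(rtf):
        ch = rtf[pos]
        if ch == "{":
            flush()
            group = Group()
            stack[-1].children.append(group)
            stack.append(group)
            pos += 1
        elif ch == "}":
            flush()
            if len(stack) == 1:
                raise ValueError("unbalanced closing brace")
            stack.pop()
            pos += 1
        elif ch == "\\":
            match = _TAG.match(rtf, pos)
            if match:
                flush()
                value = int(match.group(2)) if match.group(2) is not None else None
                stack[-1].children.append(Tag(match.group(1), value))
                pos = match.end()
            else:
                # Control symbol: keep escaped braces/backslashes, ignore the rest.
                symbol = rtf[pos + 1:pos + 2]
                if symbol in ("{", "}", "\\"):
                    pending.append(symbol)
                pos += 2
        elif ch in "\r\n":
            pos += 1  # raw line breaks carry no meaning in RTF
        else:
            pending.append(ch)
            pos += 1
    flush()
    if len(stack) != 1:
        raise ValueError("unbalanced opening brace")
    return root


root = parse_rtf(r"{\rtf1\ansi{\b Hello} World}")
```

The resulting tree has one outer group containing the tags rtf (value 1) and ansi, a nested group with the tag b and the text "Hello", and finally the text " World".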

The actual parsing process can be monitored by ParserListeners (Observer Pattern),
which offer an opportunity to react to specific events and perform corresponding actions.

The integrated parser listener RtfParserListenerFileLogger can be used to write the structure of the RTF elements into a log file (mainly intended
for use during development). The produced output can be customized using its RtfParserLoggerSettings. The additional RtfParserListenerLogger parser
listener can be used to log the parsing process to any ILogger implementation (see System functions).

The parser listener RtfParserListenerStructureBuilder generates the Structure Model from the RTF elements encountered during parsing.
That model represents the basic elements as instances of IRtfGroup, IRtfTag, and IRtfText. Access to the hierarchical structure
can be gained through the RTF group available in RtfParserListenerStructureBuilder.StructureRoot.
Based on the Visitor Pattern, it is easily possible to examine the structure model via any
IRtfElementVisitor implementation:
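As a language-neutral illustration of the idea, the following Python sketch walks a small hand-built structure tree with a visitor that collects all text. The classes and method names (ElementVisitor, accept, TextCollector, and so on) are hypothetical stand-ins and do not reproduce the library's C# interfaces.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class ElementVisitor:
    """Visitor base with one method per element kind (all no-ops by default)."""

    def visit_tag(self, tag): pass
    def visit_group(self, group): pass
    def visit_text(self, text): pass


@dataclass
class Tag:
    name: str
    value: Optional[int] = None

    def accept(self, visitor):
        visitor.visit_tag(self)


@dataclass
class Text:
    content: str

    def accept(self, visitor):
        visitor.visit_text(self)


@dataclass
class Group:
    children: List[object] = field(default_factory=list)

    def accept(self, visitor):
        visitor.visit_group(self)


class TextCollector(ElementVisitor):
    """Collects every text element in document order; tags are ignored."""

    def __init__(self):
        self.parts = []

    def visit_group(self, group):
        for child in group.children:
            child.accept(self)  # descend into nested elements

    def visit_text(self, text):
        self.parts.append(text.content)


root = Group([Tag("rtf", 1), Group([Tag("b"), Text("Hello")]), Text(" World")])
collector = TextCollector()
root.accept(collector)
print("".join(collector.parts))
```

Keep in mind that such a naive collector gathers all text elements, including ones that are not visible content, so it is only a starting point.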

Note, however, that the same result for such simple functionality could be achieved by writing a custom IRtfParserListener (see below).
This can, in some cases, be useful to avoid the overhead of creating the structure model in memory.

The utility class RtfParserTool accepts RTF data from a variety of sources, such as string,
TextReader, and Stream. Through its IRtfSource interface, all of these (and even other) scenarios can be handled in a uniform way.

The interface IRtfParserListener, with its base utility implementation RtfParserListenerBase, offers a way to react in custom ways
to specific events during the parsing process:

Note that the base class already provides (empty) implementations for all the interface methods, so only the ones required for a specific purpose need to be overridden.
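The pattern of a listener base class with empty implementations can be sketched in a few lines of Python. Everything here is hypothetical and only mirrors the idea: the method names, the TagCounter listener, and the notify helper that replays a hand-written event list in place of a real parser.

```python
class ParserListenerBase:
    """Base listener: every event handler exists, but does nothing."""

    def parse_begin(self): pass
    def group_begin(self): pass
    def tag_found(self, name, value): pass
    def text_found(self, text): pass
    def group_end(self): pass
    def parse_end(self): pass


class TagCounter(ParserListenerBase):
    """Overrides only the one event it cares about."""

    def __init__(self):
        self.counts = {}

    def tag_found(self, name, value):
        self.counts[name] = self.counts.get(name, 0) + 1


def notify(events, listeners):
    """Stand-in for a parser: forwards each (event, *args) tuple to all listeners."""
    for event, *args in events:
        for listener in listeners:
            getattr(listener, event)(*args)


events = [
    ("parse_begin",),
    ("group_begin",),
    ("tag_found", "rtf", 1),
    ("tag_found", "b", None),
    ("text_found", "Hello"),
    ("tag_found", "b", 0),
    ("group_end",),
    ("parse_end",),
]
counter = TagCounter()
notify(events, [counter])
print(counter.counts)
```

Because the listener reacts to events as they occur, no structure model has to be built in memory for this kind of task.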

RTF Interpreter

Once an RTF document has been parsed into a structure model, it can be interpreted by the RTF interpreter. One obvious way to interpret the structure
is to build a Document Model which provides high-level access to the meaning of the document's contents. A very simple document model is part of this component,
and consists of the following building blocks:

The various Visuals represent the recognized visible RTF elements, and can be examined with any IRtfVisualVisitor implementation.

Analogous to the possibilities of the RTF parser, the provided RtfInterpreter supports monitoring the interpretation process with InterpreterListeners
for specific purposes.

Analyzing documents might be simplified by using the RtfInterpreterListenerFileLogger interpreter listener, which writes the recognized RTF elements into a log file.
Its output can be customized through its RtfInterpreterLoggerSettings. The additional RtfInterpreterListenerLogger interpreter listener can be used
to log the interpretation process to any ILogger implementation (see System functions).

Construction of the document model is also achieved through such an interpreter listener (RtfInterpreterListenerDocumentBuilder) which, in the end,
delivers an instance of an IRtfDocument.

The following example shows how to make use of the high-level API of the document model:

As with the parser, the class RtfInterpreterTool offers convenience functionality for easy interpretation of RTF data and creation of a corresponding
IRtfDocument. In case no IRtfGroup is yet available, it also provides for passing any source to the RtfParserTool for automatic on-the-fly parsing.

The interface IRtfInterpreterListener, with its base utility implementation RtfInterpreterListenerBase, offers the necessary foundation
for a custom interpreter listener:

The IRtfInterpreterContext passed to all of these methods contains the document information which is available at the very moment (colors, fonts, formats, etc.)
as well as information about the state of the interpretation.

RTF Base Converters

As a foundation for the development of more complex converters, there are four base converters available for text, images, XML,
and HTML. They are designed to be extended by inheritance.

Text Converter

The RtfTextConverter can be used to extract plain text from an RTF document. Its RtfTextConvertSettings determines how
special characters, tabs, white space, and breaks (line, page, etc.) are represented in the output.
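The role of such settings can be illustrated with a small Python sketch in which every non-text element is replaced by a configurable string. The class TextConvertSettings, its fields, and the tuple-based list of visuals are assumptions made for this example and do not reflect the actual properties of RtfTextConvertSettings.

```python
from dataclasses import dataclass


@dataclass
class TextConvertSettings:
    """Replacement strings for non-text elements (hypothetical field names)."""
    tab: str = "\t"
    line_break: str = "\n"
    page_break: str = "\n\n"
    paragraph: str = "\n"
    non_breaking_space: str = " "


def to_plain_text(visuals, settings=None):
    """Join (kind, value) pairs; any kind other than 'text' is looked up in the settings."""
    settings = settings or TextConvertSettings()
    parts = []
    for kind, value in visuals:
        if kind == "text":
            parts.append(value)
        else:
            parts.append(getattr(settings, kind))
    return "".join(parts)


visuals = [
    ("text", "Name"),
    ("tab", None),
    ("text", "Value"),
    ("line_break", None),
    ("text", "End"),
]
default_output = to_plain_text(visuals)
custom_output = to_plain_text(visuals, TextConvertSettings(tab=" | ", line_break="\r\n"))
```

With the defaults the result contains a real tab and line feed; the customized settings produce "Name | Value" followed by a Windows line break and "End".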

Image Converter

The RtfImageConverter offers a way to extract images from an RTF document. Images can be kept at their original (unscaled) size or scaled to the size at which they appear in the RTF document.
Optionally, the format of the image can be converted to another ImageFormat. File name, type, and size can be controlled by an IRtfVisualImageAdapter.
The RtfImageConvertSettings determines the storage location as well as any scaling.
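For orientation, the size at which a picture appears in an RTF document is typically derived from its desired size in twips (\picwgoal, \pichgoal) and its scaling percentage (\picscalex, \picscaley), with 1440 twips per inch. The Python sketch below shows that arithmetic. The function names and the 96 dpi default are assumptions for illustration and are not taken from the RtfImageConverter.

```python
TWIPS_PER_INCH = 1440


def twips_to_pixels(twips, dpi=96):
    """Convert a length in twips to whole pixels at the given resolution."""
    return round(twips * dpi / TWIPS_PER_INCH)


def display_size(width_goal_twips, height_goal_twips, scale_x=100, scale_y=100, dpi=96):
    """Pixel size of a picture after applying the scaling percentages."""
    width = twips_to_pixels(width_goal_twips * scale_x / 100, dpi)
    height = twips_to_pixels(height_goal_twips * scale_y / 100, dpi)
    return width, height


unscaled = display_size(2880, 1440)                       # 2 in x 1 in
scaled = display_size(2880, 1440, scale_x=50, scale_y=50)  # half size
```

At 96 dpi, a picture of 2880 x 1440 twips comes out at 192 x 96 pixels unscaled and 96 x 48 pixels at 50 percent.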

HTML Converter

The RtfHtmlConverter converts the recognized RTF visuals into an HTML document. File names, type, and size of any encountered images can be controlled
through an IRtfVisualImageAdapter, while the RtfHtmlConvertSettings determines storage location, stylesheets, and other HTML document information.

RTF Converter Applications

The console applications Rtf2Raw, Rtf2Xml, and Rtf2Html demonstrate the range of functionality of the corresponding base converters, and offer a starting point
for the development of your own RTF converter.

Rtf2Raw

The command line application Rtf2Raw converts an RTF document into plain text and images:

RtfParser: Fixed to properly handle skipping of Unicode alternative representation in case these are written in hex-encoded form

RtfHtmlConverter: New property DocumentImages which provides information about the converted images using IRtfConvertedImageInfo

Added ChangeHistory.txt

15th October, 2008

Added support for tags

\sub: Changes font size to 2/3 of the current font size and moves 'down' by half the current font size

\super: Changes font size to 2/3 of the current font size and moves 'up' by half the current font size

\nosupersub: Resets the 'up'/'down' baseline alignment to zero. Note: this leaves the font size unchanged, because the current implementation
does not know what the 'previous' font size was; hence, depending on the RTF writer, this might lead to content that is displayed with a smaller font size than intended

\v*: Toggles the new IsHidden property of IRtfTextFormat; \v and \v1 turn it on while \v0 turns
it off (according to the behavior of 'boolean tags')

\viewkind: Triggers the transition from interpreter state InHeader to InDocument (but only if the font table is already defined);
this supports documents without color table and prevents formatting or content at the beginning from being ignored
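The arithmetic behind \sub, \super, and \nosupersub as described above can be written down as a short Python sketch. It assumes that font sizes are stored in half-points (as with \fsN), that the offset is computed from the size before it is reduced, and that integer division is used; the TextFormat class and the function names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TextFormat:
    font_size: int            # in half-points, e.g. 24 means 12pt
    baseline_offset: int = 0  # positive moves up, negative moves down


def apply_super_sub(fmt, is_super):
    """\\super / \\sub: shrink the font to 2/3 and shift by half the ORIGINAL size."""
    offset = max(1, fmt.font_size // 2)  # assumption: never shift by less than 1
    return replace(
        fmt,
        font_size=max(1, fmt.font_size * 2 // 3),
        baseline_offset=offset if is_super else -offset,
    )


def apply_nosupersub(fmt):
    """\\nosupersub: reset the offset only; the reduced font size is kept."""
    return replace(fmt, baseline_offset=0)


base = TextFormat(font_size=24)
superscript = apply_super_sub(base, is_super=True)
subscript = apply_super_sub(base, is_super=False)
reset = apply_nosupersub(superscript)
```

Starting from 24 half-points, both variants end up at 16 half-points with an offset of +12 or -12. After the reset the offset is 0 but the size stays at 16, which is exactly the caveat mentioned for \nosupersub.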

Extended/fixed support for tags

\dn and \up: will use the specified default value of '6' if none is given in the RTF (instead of resetting to zero)

RtfTextConverterSettings/ RtfXmlConverterSettings/ RtfHtmlConverterSettings: have a new flag IsShowHiddenText
which defaults to false

RtfTextConverter/ RtfXmlConverter/ RtfHtmlConverter: will only append found text to the plain text buffer if it is not marked
hidden in its text format or if the new setting IsShowHiddenText is explicitly set to true
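The interplay of the boolean \v tag and the IsShowHiddenText setting can be sketched as follows in Python. The function names and the representation of text runs as (text, is_hidden) pairs are assumptions made for this example, not the converters' actual code.

```python
def boolean_tag_value(value):
    """Boolean tags: no value (\\v) or any non-zero value (\\v1) mean on, 0 (\\v0) means off."""
    return value is None or value != 0


def plain_text(runs, show_hidden=False):
    """Append a run only if it is not hidden, or if hidden text is explicitly requested."""
    return "".join(text for text, is_hidden in runs if show_hidden or not is_hidden)


runs = [
    ("Visible ", boolean_tag_value(0)),     # after \v0
    ("secret ", boolean_tag_value(None)),   # after \v
    ("text", boolean_tag_value(0)),         # after \v0
]
default_output = plain_text(runs)
full_output = plain_text(runs, show_hidden=True)
```

By default the hidden run is dropped ("Visible text"); with show_hidden set to True all three runs are emitted.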

Comments and Discussions

Really appreciate the quick response. Took me a little while to find the right place (being the 'expert' I am), but the value of width was definitely negative at that point. I made the changes you suggested and the initial test with the example WinForms app seems to work. The HTML produced referred to an image size of 1900 x 1200 - which is my monitor size and was erroring prior to the change.

My VB program calls your compiled DLLs (added as references) using code like below:

If I recompile with the change you suggested, should that be all that is needed to fix my problem? I have to convert my whole thing to VS 2010 as it was written in 2008 and I don't have that installed any more.

Just FYI, the RTF provider is Windows 7 (and now 8) with the standard Visual Studio 2008/2010 RTF control. I'm not skilled enough to do anything too fancy. The tests I've done for this question used your test program RtfWinForms2010.sln. I'm surprised more haven't reported it!

First of all, thanks a lot for the magnificent work and effort in the RTFtoHTML project. I've been using it and it has helped me so much. Recently, however, I faced a problem when converting an RTF file that contains tables. RTFtoHTML renders the output HTML file successfully except for the "Tables" part. I've read the note you wrote in the introduction:

Quote:

There is no special support for the following RTF layout elements:

Tables
Lists
Automatic numbering
All features which require knowledge of how Microsoft Word might mean it ...

Actually, this is a very important part of the project I'm working on now. So, is there any way you can help me continue with this "Tables" part?

When converting from RTF to HTML, the converter doesn't generate a <sup></sup> tag or vertical-align:super; in the style attribute for superscript. Subscript doesn't work either. I ran the RtfWinForms demo application from VS2010 Express.
Why?
Thanks.

You're right - the current implementation (see constructor of class RtfText) doesn't allow the text to be null. I've included your extension to the library and it will be available within the next (unscheduled) update.

Hi! Thanks for the great code!
I would need to get rtf converted to html but without font types etc.
I have been using command line app for converting.

Now I'm getting this which is perfect but I would not want the font styles:

<p><span style="background-color:#FFFFFF;font-family:Times New Roman;font-size:11pt;">times new Roman</span></p><p><span style="background-color:#FFFFFF;font-family:Arial;font-size:11pt;">Arial</span></p><p><span style="background-color:#FFFFFF;font-family:Arial;font-size:11pt;">asd</span></p><p><span style="background-color:#FFFFFF;font-family:Arial;font-size:20pt;">Arial 20px</span></p>

There is no built-in functionality to convert an OLE object. However, you can build your own IRtfElementVisitor (Visitor pattern). Within the VisitTag method you can react to the objemb and objclass tags, and extract the binary data.

Usually the RTF document contains an image representing the embedded OLE object.