Convert HTML to Text

Introduction

Recently, I wrote an article that presented code to convert plain text to HTML. So it occurred to me that it might be useful to publish some code that does the opposite: convert HTML to plain text.

Any application that extracts information from web pages will need to deal with HTML. For these applications, a conversion is required if you want to produce plain text. This conversion includes removing HTML tags, stripping tag content that isn't readable text (from tags such as <script>), and removing excess whitespace.

For the most part, this conversion is pretty simple. However, I'll discuss a few issues I ran into that made it a little more complicated.

The HtmlToText Class

Listing 1 shows my HtmlToText class. This class contains a single public method, Convert(), which converts HTML to plain text. This method loops through each character in the input text and performs the translation as it goes.

When it encounters an HTML tag, the code first parses the tag from the input. The <body> and <pre> tags get special handling. The method works fine with HTML that does not include a <body> tag, but if it does encounter one, it discards everything gathered before that point. In my testing, nothing outside the <body> element was worth extracting as text.
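To make the tag-parsing step and the <body> rule a little more concrete, here is a minimal sketch. It is not the code from Listing 1: the names TagParser, ReadTagName and ApplyBodyRule are made up for illustration, and the sketch assumes tags contain no '>' inside attribute values and does not treat comments specially.

```csharp
using System;
using System.Text;

public static class TagParser
{
    // Hypothetical helper: reads the tag that starts at html[pos] == '<'
    // and moves pos just past its closing '>'. Returns the lower-cased
    // tag name, with a leading '/' for closing tags (e.g. "p" or "/p").
    public static string ReadTagName(string html, ref int pos)
    {
        int end = html.IndexOf('>', pos);
        if (end < 0)
            end = html.Length;              // unterminated tag: consume the rest

        string inner = html.Substring(pos + 1, end - pos - 1).Trim();
        pos = Math.Min(end + 1, html.Length);

        bool closing = inner.StartsWith("/", StringComparison.Ordinal);
        if (closing)
            inner = inner.Substring(1).TrimStart();

        // The name ends at the first whitespace character or '/'.
        int nameEnd = inner.IndexOfAny(new[] { ' ', '\t', '\r', '\n', '/' });
        string name = nameEnd < 0 ? inner : inner.Substring(0, nameEnd);

        return (closing ? "/" : "") + name.ToLowerInvariant();
    }

    // Assumed rule from the text: anything collected before <body> is dropped.
    public static void ApplyBodyRule(string tagName, StringBuilder output)
    {
        if (tagName == "body")
            output.Clear();
    }
}
```

Calling ReadTagName on "<BODY class=\"x\">Hi" with pos set to 0 yields "body" and leaves pos pointing at "Hi".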

Normally, the method will discard excess whitespace characters just as Web browsers do. However, if the code encounters a <pre> (preformatted) tag, then it changes mode and no whitespace is removed while in preformatted mode.
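The two whitespace modes can be sketched as below. This is only an illustration under my own assumptions, not the code from Listing 1: Whitespace.Collapse is a hypothetical name, and it works on a whole string rather than character by character as the real method does. Note that it tests for specific characters instead of calling char.IsWhiteSpace(), because that method also reports the non-breaking space (U+00A0) as whitespace.

```csharp
using System.Text;

public static class Whitespace
{
    // Hypothetical helper: collapses each run of spaces, tabs and line breaks
    // into a single space, unless we are in preformatted (<pre>) mode.
    public static string Collapse(string text, bool preformatted)
    {
        if (preformatted)
            return text;                    // <pre> content is kept verbatim

        var sb = new StringBuilder(text.Length);
        bool lastWasSpace = false;

        foreach (char c in text)
        {
            // Explicit checks: U+00A0 (from &nbsp;) is deliberately not matched.
            bool isSpace = c == ' ' || c == '\t' || c == '\r' || c == '\n';
            if (isSpace)
            {
                if (!lastWasSpace)
                    sb.Append(' ');
                lastWasSpace = true;
            }
            else
            {
                sb.Append(c);
                lastWasSpace = false;
            }
        }
        return sb.ToString();
    }
}
```

For example, Collapse("a \n\t b", false) returns "a b", while the same call with preformatted set to true returns the input unchanged.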

Next, the code looks up the tag in the _tags dictionary, which maps tags to the text that should replace them. For example, since <p> (paragraph) tags separate the enclosed text from the surrounding text, the replacement text for both <p> and </p> is a newline. (Note that the parser is happy with just "\n" as a newline.) Keeping this information in a table makes it very easy to customize the translations without changing the code.
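A table-driven lookup along these lines might look like the sketch below. The entries shown are examples I picked for illustration, not the actual contents of _tags, and the TagReplacements class and its Lookup method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class TagReplacements
{
    // Illustrative table: tag name (closing tags keep their slash) mapped
    // to the text written in its place. The entries are assumptions.
    private static readonly Dictionary<string, string> Tags =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "p", "\n" },
            { "/p", "\n" },
            { "br", "\n" },
            { "li", "\n" },
            { "td", "\t" }
        };

    // Returns the replacement text for a tag, or an empty string when the
    // tag has no entry and should simply disappear from the output.
    public static string Lookup(string tagName)
    {
        string replacement;
        return Tags.TryGetValue(tagName, out replacement)
            ? replacement
            : string.Empty;
    }
}
```

Changing how a tag is translated then only means editing the table; for instance, Lookup("P") returns "\n" and Lookup("span") returns an empty string.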

Finally, the code attempts to look up the tag in the _ignoreTags list. If the tag is found, its contents should not be written to the output. For example, the inner text from a <script> tag should not be part of the resulting plain text. In this case, the EatInnerContent() method is called to consume the text inside this tag. Because tags can be nested, this method calls itself recursively to process any tags it finds within the inner text.

Characters that are not part of any tag are written to the output string, with special handling for whitespace, which I'll discuss next.
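The idea of consuming an ignored tag's content can be sketched as follows. To keep the example short, this is a simplified variant rather than EatInnerContent() itself: instead of recursing into every nested tag, it only counts the nesting depth of tags with the same name. IgnoredContent and SkipInner are hypothetical names.

```csharp
using System;

public static class IgnoredContent
{
    private static readonly char[] NameEnd = { ' ', '\t', '\r', '\n', '/' };

    // Given pos just past an opening <tagName ...> tag, returns the index
    // just past the matching closing tag, or html.Length if none is found.
    public static int SkipInner(string html, int pos, string tagName)
    {
        int depth = 1;                      // we are inside one open tag
        while (pos < html.Length)
        {
            int open = html.IndexOf('<', pos);
            if (open < 0) break;
            int close = html.IndexOf('>', open + 1);
            if (close < 0) break;

            string tag = html.Substring(open + 1, close - open - 1).Trim();
            pos = close + 1;

            bool closing = tag.StartsWith("/", StringComparison.Ordinal);
            if (closing)
                tag = tag.Substring(1).TrimStart();

            int end = tag.IndexOfAny(NameEnd);
            string name = end < 0 ? tag : tag.Substring(0, end);

            // Tags with other names are skipped along with the text.
            if (!string.Equals(name, tagName, StringComparison.OrdinalIgnoreCase))
                continue;

            if (!closing)
                depth++;                    // nested tag of the same name
            else if (--depth == 0)
                return pos;                 // found the matching close
        }
        return html.Length;
    }
}
```

With html set to "<script>var a = 1;</script>Hello" and pos set to 8 (just past the opening tag), the returned index points at "Hello".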

Handling Whitespace

When writing this code, I was able to get this far in pretty short order. Then things became a little more complex: the output I was getting contained a lot of extra whitespace.

As mentioned previously, I wanted the code to replace any sequence of whitespace with a single space character, just as browsers do. But there are exceptions. For example, all whitespace is retained when in preformatted mode. And I don't want to discard whitespace specified using &nbsp;. Also, I don't really want spaces at the beginning or end of a line, so they should be discarded too.

In addition, I had to deal with the fact that the replacement text in the _tags dictionary included a lot of newlines. My initial results included places with many empty lines between text. Sometimes I want a single newline, other times I want a double newline, but I really don't want more than two newlines together.
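The line-level rules (no spaces at the start or end of a line, and never more than two newlines in a row) can be illustrated with the sketch below. It is a simplified stand-in under my own assumptions: it cleans up a finished string in one pass rather than filtering text as it is added, it also drops blank lines at the very start and end, and LineCleaner.Normalize is a hypothetical name.

```csharp
using System.Text;

public static class LineCleaner
{
    // Trims spaces and tabs from both ends of every line and allows at most
    // one empty line (two consecutive newlines) between lines of text.
    public static string Normalize(string text)
    {
        var sb = new StringBuilder(text.Length);
        bool wroteLine = false;     // has any non-empty line been written yet?
        int pendingBlank = 0;       // empty lines seen since the last text line

        foreach (string raw in text.Split('\n'))
        {
            // Only these characters are trimmed, so U+00A0 (&nbsp;) survives.
            string line = raw.Trim(' ', '\t', '\r');
            if (line.Length == 0)
            {
                pendingBlank++;
                continue;
            }

            if (wroteLine)
            {
                sb.Append('\n');
                if (pendingBlank > 0)
                    sb.Append('\n');        // collapse any gap to one empty line
            }

            sb.Append(line);
            wroteLine = true;
            pendingBlank = 0;
        }
        return sb.ToString();
    }
}
```

For example, Normalize("  a  \n\n\n\n b \nc") returns "a\n\nb\nc".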

As you can see, this can start getting a little convoluted. I ended up adding a protected helper class, TextBuilder, which contains the logic to handle the conditions I've described above. The main class passes the text to be added to the output to the TextBuilder class, and the TextBuilder class takes care of removing extra whitespace.

Note on the HttpUtility.HtmlDecode() Method

I should point out that my code uses the HttpUtility.HtmlDecode() method to decode HTML-encoded text. This method is defined in System.Web. By default, a reference to System.Web is added to ASP.NET applications but not to desktop applications.

If you want to use this code from a desktop application, you'll need to go into your project's properties and set the target framework to ".NET Framework 4" instead of ".NET Framework 4 Client Profile", add a reference to System.Web, and add using System.Web; to your source file.
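As a quick illustration of what the decoding step does, here is a small sketch. The DecodeDemo wrapper is a made-up name; HttpUtility.HtmlDecode() is the real framework method, and the example assumes the System.Web reference described above is in place.

```csharp
using System.Web;

public static class DecodeDemo
{
    // Thin wrapper so the call is easy to try out in isolation.
    public static string Decode(string encoded)
    {
        // Turns character entities such as &amp; and &lt; back into characters.
        return HttpUtility.HtmlDecode(encoded);
    }
}
```

Calling Decode("Tom &amp; Jerry &lt;3") returns "Tom & Jerry <3". An &nbsp; entity decodes to the non-breaking space character (U+00A0) rather than an ordinary space, which is one reason it needs separate treatment when collapsing whitespace.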