Background

Using the Code

The code uses the System.Text.RegularExpressions namespace and consists of a single function, StripHTML().

First, source formatting such as tabs used for indentation and repeated whitespace is removed. As a result, the input HTML is "flattened" into one continuous string. This serves two purposes:

To remove the formatting ignored by browsers

To make the regexes work reliably (they seem to get confused by escape characters)
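The flattening step can be sketched as follows. This is a minimal illustration rather than the article's actual StripHTML() code; the class and method names (FlattenSketch, Flatten) are hypothetical, and the choice to turn line breaks into spaces while dropping tabs entirely is an assumption.

```csharp
using System.Text.RegularExpressions;

static class FlattenSketch
{
    // Hypothetical helper: removes source formatting that a browser
    // would ignore when rendering the page.
    public static string Flatten(string html)
    {
        // Assumption: line breaks become spaces so adjacent words do not
        // merge, while indentation tabs are dropped outright.
        string result = html.Replace("\r", " ")
                            .Replace("\n", " ")
                            .Replace("\t", string.Empty);

        // Collapse runs of two or more spaces into a single space.
        return Regex.Replace(result, @" {2,}", " ");
    }
}
```

Calling FlattenSketch.Flatten on an indented, multi-line fragment yields a single line with single spaces, which is the form the later regexes operate on.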

Next, the header is removed by deleting everything between the <head> and </head> tags.

Then, all scripts are removed by cutting out everything between the <script> and </script> tags, inclusive. Style blocks between <style> and </style> are removed the same way.
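One possible way to express the removal of the header, script and style blocks is a single lazy pattern applied once per tag name. The names below (BlockRemovalSketch, RemoveBlocks) are hypothetical, and the exact pattern is an assumption, not necessarily the one used in StripHTML().

```csharp
using System.Text.RegularExpressions;

static class BlockRemovalSketch
{
    // Hypothetical helper: deletes <head>, <script> and <style> elements
    // together with everything they contain.
    public static string RemoveBlocks(string html)
    {
        // Singleline lets '.' match line breaks in case the input
        // has not been flattened beforehand.
        const RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        string result = html;
        foreach (string tag in new[] { "head", "script", "style" })
        {
            // Lazy match from the opening tag (with any attributes)
            // up to the first matching closing tag.
            string pattern = "<" + tag + @"\b[^>]*>.*?</" + tag + @"\s*>";
            result = Regex.Replace(result, pattern, string.Empty, options);
        }
        return result;
    }
}
```

The \b after the tag name keeps the pattern from matching longer names such as <header>.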

Then the basic formatting tags, such as <BR> and <DIV>, are replaced with \r or \r\r. <TR> tags are also replaced by line breaks, and <TD> tags by tabs.
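A rough sketch of this tag-to-whitespace mapping is shown below. Which tag gets one \r and which gets two is an assumption here (one for <BR> and <TR>, two for <DIV>), and the names LayoutTagSketch and ReplaceLayoutTags are hypothetical.

```csharp
using System.Text.RegularExpressions;

static class LayoutTagSketch
{
    // Hypothetical helper: turns opening layout tags into plain-text whitespace.
    public static string ReplaceLayoutTags(string html)
    {
        const RegexOptions options = RegexOptions.IgnoreCase;

        // Assumed mapping: <br> gives one line break, <div> gives two.
        string result = Regex.Replace(html, @"<br\b[^>]*>", "\r", options);
        result = Regex.Replace(result, @"<div\b[^>]*>", "\r\r", options);

        // Each table row starts on a new line; each cell is preceded by a tab.
        result = Regex.Replace(result, @"<tr\b[^>]*>", "\r", options);
        result = Regex.Replace(result, @"<td\b[^>]*>", "\t", options);
        return result;
    }
}
```

Only opening tags are handled in this sketch; closing tags such as </div> and </td> are left for the final tag-stripping step.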

<LI> tags are replaced by asterisks (*), and special characters such as &nbsp; are replaced with their corresponding values.
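The list-item and special-character step might look like the sketch below. The entity table is a small illustrative subset chosen for this example (with &nbsp; mapped to an ordinary space as an assumption), the trailing space after the asterisk is a readability choice, and the names EntitySketch and ReplaceListItemsAndEntities are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class EntitySketch
{
    // Illustrative subset of named entities; a real table would be longer.
    static readonly Dictionary<string, string> Entities = new Dictionary<string, string>
    {
        { "nbsp", " " }, { "amp", "&" }, { "lt", "<" }, { "gt", ">" }, { "quot", "\"" }
    };

    // Hypothetical helper: marks list items with "* " and decodes known entities.
    public static string ReplaceListItemsAndEntities(string html)
    {
        string result = Regex.Replace(html, @"<li\b[^>]*>", "* ", RegexOptions.IgnoreCase);

        // A single pass over all "&name;" tokens avoids double decoding
        // (for example "&amp;lt;" becomes "&lt;", not "<").
        return Regex.Replace(result, @"&([a-z]+);", m =>
        {
            string value;
            string name = m.Groups[1].Value.ToLowerInvariant();
            // Unknown entities are left exactly as they were.
            return Entities.TryGetValue(name, out value) ? value : m.Value;
        }, RegexOptions.IgnoreCase);
    }
}
```

Decoding &lt; and &gt; before the remaining tags are stripped can produce text that looks like a tag, so a cautious implementation might decode those two entities last.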

Finally, all remaining tags are replaced with empty strings.
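Stripping whatever tags are left can be done with one generic pattern, sketched here under the assumption that every tag is a "<" followed by non-">" characters and a closing ">". The names TagStripSketch and StripRemainingTags are hypothetical.

```csharp
using System.Text.RegularExpressions;

static class TagStripSketch
{
    // Hypothetical helper: removes anything that looks like an HTML tag.
    public static string StripRemainingTags(string html)
    {
        // Matches "<", then any run of characters other than ">", then ">".
        return Regex.Replace(html, @"<[^>]*>", string.Empty);
    }
}
```

This simple pattern will also cut at a ">" that appears inside a quoted attribute value, which is usually acceptable when the goal is readable plain text rather than exact parsing.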

By this stage, there are likely to be many redundant, repeated line breaks and tabs. Any sequence of more than two line breaks is replaced by two line breaks. Similarly, any sequence of more than four tabs is replaced by four tabs.
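The final clean-up can be sketched with two quantified patterns. This assumes line breaks are represented as \r, as in the earlier steps; the names CollapseSketch and Collapse are hypothetical.

```csharp
using System.Text.RegularExpressions;

static class CollapseSketch
{
    // Hypothetical helper: caps runs of line breaks at two and runs of tabs at four.
    public static string Collapse(string text)
    {
        // Three or more consecutive line breaks become exactly two.
        string result = Regex.Replace(text, @"\r{3,}", "\r\r");

        // Five or more consecutive tabs become exactly four.
        return Regex.Replace(result, @"\t{5,}", "\t\t\t\t");
    }
}
```

Runs that are already within the limits (up to two line breaks, up to four tabs) pass through unchanged.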