Remove HTML tags. A string contains HTML tags, and we want to remove them. This is useful for displaying HTML as plain text and for stripping formatting such as bold and italics. Only the tags are removed: the text between them is kept.

Caution: A Regex cannot handle all HTML documents. An iterative solution, with a for-loop, may be best in many cases: always test methods.

Example. The example is a static class that provides three ways of removing HTML tags, meaning the angle brackets and everything between them. Each method receives a string argument, processes it, and returns a new string that has no HTML tags.

The example is a public static class that saves no state. You can call into the class using code such as HtmlRemoval.StripTagsRegex. Normally, you can put this class in a separate file named HtmlRemoval.cs. It is useful for many programs.

StripTagsRegex uses a static call to Regex.Replace without RegexOptions.Compiled, so the expression is interpreted rather than compiled. This method could be optimized by pulling the Regex out of the method, as the second method does.

Regex: The pattern matches a <, then as few characters as possible, then a >. Every such match is removed.

StripTagsRegexCompiled. This method does the same thing as the previous method. Its regular expression is pulled out of the method call. The regular expression (Regex) object is stored in the static class.

Tip: I recommend this method for most programs, as it is very simple to inspect and considerably faster than the first method.
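
The two Regex methods described above could be sketched as follows. This is a minimal sketch, not the original listing: the pattern "<.*?>" is an assumption inferred from the description (a <, the minimal number of characters, then a >), and the field name _htmlRegex is hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class HtmlRemoval
{
    // Static Regex.Replace call: the pattern string is passed on every call.
    // Assumed pattern: '<', as few characters as possible, then '>'.
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // The Regex object is created once and stored in the static class.
    // The field name is hypothetical.
    static readonly Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    // Same replacement as above, but reusing the stored, compiled Regex.
    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }
}
```

With this pattern, the . does not match a newline by default, so a tag that spans several lines is left in place unless RegexOptions.Singleline is added.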

StripTagsCharArray. This method is a heavily-optimized version of an approach that could instead use StringBuilder. In most benchmarks, this method is faster and is appropriate for when you need to strip lots of HTML files.

And: A detailed description of the method's body is available below. It was designed for performance.

Tests. We run these methods through a simple test. The three methods work identically on valid HTML. They differ on an unclosed tag: the char array method strips everything that follows a < until the next >, so a missing > removes the rest of the string, while the Regex methods require a > before they strip the tag.

C# program that tests HTML removal
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string html = "<p>There was a <b>.NET</b> programmer " +
            "and he stripped the <i>HTML</i> tags.</p>";

        Console.WriteLine(HtmlRemoval.StripTagsRegex(html));
        Console.WriteLine(HtmlRemoval.StripTagsRegexCompiled(html));
        Console.WriteLine(HtmlRemoval.StripTagsCharArray(html));
    }
}

Output
There was a .NET programmer and he stripped the HTML tags.
There was a .NET programmer and he stripped the HTML tags.
There was a .NET programmer and he stripped the HTML tags.

Benchmarks. First, regular expressions are usually not the fastest way to process text. I wrote an algorithm that uses a combination of char arrays and the string constructor (new string) to strip HTML tags, meeting the requirement and often performing better.

The benchmark ran each method 10000 times in tight loops on an HTML file of around 8000 characters. The file was read in with File.ReadAllText. The result was that the char array method was considerably faster.
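
A timing harness in the spirit of that benchmark might look like the sketch below. It is an illustration under stated assumptions, not the original benchmark code: the names StripBenchmark, BuildSampleHtml and Time are hypothetical, and a generated string stands in for the file that was read with File.ReadAllText. No timings are shown because they depend on the machine and runtime.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

static class StripBenchmark
{
    // Builds a synthetic HTML string of at least minLength characters.
    // Assumption: stands in for a real file read with File.ReadAllText.
    public static string BuildSampleHtml(int minLength)
    {
        var sb = new StringBuilder();
        while (sb.Length < minLength)
        {
            sb.Append("<p>Some <b>bold</b> and <i>italic</i> text.<br /></p>");
        }
        return sb.ToString();
    }

    // Calls strip(html) the given number of times and returns elapsed milliseconds.
    public static long Time(Func<string, string> strip, string html, int iterations)
    {
        strip(html); // warm-up call, so one-time setup costs are not timed

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            strip(html);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }
}

class Program
{
    static void Main()
    {
        string html = StripBenchmark.BuildSampleHtml(8000);

        // Any method with the shape string -> string can be passed in here.
        long ms = StripBenchmark.Time(
            s => Regex.Replace(s, "<.*?>", string.Empty), html, 10000);

        Console.WriteLine("Regex.Replace: " + ms + " ms");
    }
}
```

Passing the method as a Func<string, string> keeps the loop identical for every method being compared, so only the stripping code differs between runs.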

Char arrays. One method here uses char arrays. It is much faster than the other two methods. It uses a neat algorithm for parsing the HTML. It iterates through all characters, toggling a Boolean flag depending on whether it is inside a tag block.

It only adds a character to the array buffer if that character is not inside a tag. For performance, it uses a char array and the string constructor that accepts a char array, a start index, and a length. This is faster than using StringBuilder.
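
The flag-based loop described above could be sketched like this. It is a minimal sketch consistent with the description, not necessarily the original listing. Local variable names are hypothetical, and the method is shown in its own class only so that the block stands alone.

```csharp
public static class HtmlRemoval
{
    public static string StripTagsCharArray(string source)
    {
        // The output can never be longer than the input.
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false; // true while between '<' and '>'

        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<')
            {
                inside = true;  // a tag starts: stop copying
                continue;
            }
            if (let == '>')
            {
                inside = false; // the tag ends: resume copying
                continue;
            }
            if (!inside)
            {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }

        // new string(char[], startIndex, length) copies only the filled range.
        return new string(array, 0, arrayIndex);
    }
}
```

Because the flag is reset only by a >, an input such as "Price <b unclosed" comes back from this sketch as "Price " with the remainder dropped. This is the behavior on unclosed tags noted in the Tests section.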

Compiled. Using RegexOptions.Compiled and a separate Regex results in better performance than using the Regex static method. But RegexOptions.Compiled has some drawbacks. It can increase startup time by ten times in some cases.

Tip: More material is available pertaining to making Regexes simpler and faster to run.

Self-closing. In XHTML, certain elements such as BR and IMG have no separate closing tag, and instead use "/>" at the end of the first tag. The test file noted includes these self-closing tags, and the methods handle them correctly.

Next: Here are some HTML tags supported. Invalid tags may not work in the Regex methods.

Comments. The methods in this article may have problems with removing some comments. Sometimes, comments contain invalid markup. This may result in comments being incompletely removed. It might be necessary to scan for incorrect markup.
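
One case where a comment is removed incompletely is a comment whose text contains a > character. The sketch below illustrates it under the assumption that the Regex methods use the pattern "<.*?>". The class name CommentDemo is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommentDemo
{
    // Assumed pattern: '<', as few characters as possible, then '>'.
    public static string Strip(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }
}

class Program
{
    static void Main()
    {
        // The '>' inside the comment ends the match early,
        // so the tail of the comment survives as text.
        const string html = "Total<!-- shown if a > b --> due";
        Console.WriteLine(CommentDemo.Strip(html)); // prints: Total b --> due
    }
}
```

A loop that toggles a flag on < and > behaves the same way on this input, because it also treats the first > as the end of the tag.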

Caution: The methods shown cannot handle all HTML documents. Please be careful when using them.

Validate. There are several ways to validate XHTML using methods similar to the iterative method here. One way you can validate HTML is simply counting the number of < and > characters and making sure the counts match.

Also: You can run the Regex methods and then look for < or > characters that are still present.
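
Both checks can be sketched in a few lines. The names HtmlChecks, BracketCountsMatch and HasLeftoverBrackets are hypothetical, and the pattern "<.*?>" is assumed for the Regex step.

```csharp
using System;
using System.Text.RegularExpressions;

static class HtmlChecks
{
    // True when the string has the same number of '<' and '>' characters.
    public static bool BracketCountsMatch(string html)
    {
        int open = 0;
        int close = 0;
        foreach (char c in html)
        {
            if (c == '<') open++;
            else if (c == '>') close++;
        }
        return open == close;
    }

    // True when already-stripped text still contains '<' or '>'.
    public static bool HasLeftoverBrackets(string stripped)
    {
        return stripped.IndexOfAny(new[] { '<', '>' }) >= 0;
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(HtmlChecks.BracketCountsMatch("<p>ok</p>"));    // True
        Console.WriteLine(HtmlChecks.BracketCountsMatch("<p>broken</p")); // False

        // The unclosed "</p" is not matched, so a '<' remains after stripping.
        string stripped = Regex.Replace("<p>broken</p", "<.*?>", string.Empty);
        Console.WriteLine(HtmlChecks.HasLeftoverBrackets(stripped));      // True
    }
}
```

Matching counts are only a rough check: a string such as "><" passes the count test even though it contains no valid tag.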