Encoding and Strings

A Web page is an HTML document that might contain plain text as well as layout information and script code. Script code and layout information are separated from text by tags, which are delimited by special characters, the angle brackets. If the text itself contains angle brackets or other special characters, the browser may interpret them as markup and render the document incorrectly.

HTML encoding replaces these critical characters with escape sequences that the browser renders as the intended character. For example, when the opening angle bracket symbol (<) is used as plain text, HTML encoding transforms it into &lt;.

The need for encoding HTML came about with the advent of dynamic pages, which allow text to be read from databases and injected into the page. HTML encoding is also important from a security viewpoint: it protects against script exploits by neutralizing unwanted script tags that might be silently injected in your pages.

Since its first version, ASP.NET has provided tools for encoding (and decoding) HTML text. In particular, you'll find a pair of methods for encoding and decoding—HtmlEncode and HtmlDecode—on the ASP.NET Server object. In ASP.NET 4, using these methods is quicker, and to some extent smoother, due to a new syntax and a new subsystem.

A Quick Syntax in ASP.NET 4

ASP.NET 4's new subsystem for auto-encoding HTML text saves you from the burden of always wrapping any piece of text in a call to HtmlEncode. The new syntax is a special version of the classic code block. When you have a code block, you simply use the colon symbol (:) to instruct the runtime to HTML-encode any text being displayed. Here’s an example:

<%: "<script>alert('Hello ASP.NET 4');</script>" %>

If you try this expression in an ASP.NET 4 sample page, you’ll obtain what’s depicted in Figure 1. The script command is output as plain text and doesn’t execute. Replace the : symbol with an = sign and the script code just executes. So far so good; now look at the following code snippet:

<% var text = "<script>alert('Hello ASP.NET 4');</script>"; %>

<%: text %>

This snippet produces exactly the same output you might see in Figure 1. However, the structure of the code opens up new possibilities. What if you emit text in the code block from an existing utility that already provides sanitized HTML? Imagine that the variable text receives its value from a method you don’t control and that already returns encoded markup. As a result, HTML encoding will be performed twice. What happens in this case? You might end up in a situation like the one illustrated below.

The result is shown in Figure 2. As you can see in the figure, the original text will be encoded twice—once because of the explicit call to HtmlEncode and once because of the : symbol in the code block. Clearly, this is not desirable.
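The double-encoding scenario can be reproduced outside a page with a small sketch. It assumes a project that references System.Web on .NET Framework 4. The class and method names are hypothetical, and the second call to HttpUtility.HtmlEncode is only a stand-in for what the : code block is assumed to do when it receives a plain string.

```csharp
using System.Web;

public static class DoubleEncodingSketch
{
    // Hypothetical utility you don't control: it already returns encoded markup.
    public static string GetSanitizedText(string raw)
    {
        return HttpUtility.HtmlEncode(raw);
    }

    // Stand-in for <%: text %> applied to a plain string: the text is encoded again.
    public static string RenderWithAutoEncoding(string text)
    {
        return HttpUtility.HtmlEncode(text);
    }
}
```

With the input "<b>", GetSanitizedText returns "&lt;b&gt;", and passing that value to RenderWithAutoEncoding returns "&amp;lt;b&amp;gt;". The browser then shows the escape sequences literally instead of the original characters.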

Preventing Double Encoding

To prevent this nasty situation, the auto-encoding subsystem of ASP.NET 4 has been designed to recognize special strings that don’t have to be encoded further. It's interesting to note that the auto-encoding subsystem doesn’t really check whether the string is already encoded; it simply checks whether the object it receives implements a special interface, IHtmlString.

Put another way, the auto-encoding subsystem is not really implemented to be idempotent; more simply, it just knows when it has to stop. A function is said to be idempotent if applying it several times produces the same result as applying it once. The IHtmlString interface is defined as follows:

public interface IHtmlString
{
    // Methods
    string ToHtmlString();
}

You should also be aware that an object that implements the IHtmlString interface is properly handled by ASP.NET 4's new auto-encoding subsystem and is also fully supported by HttpUtility.HtmlEncode, another method in ASP.NET for encoding HTML text. However, the HtmlEncode method on the Server object has no support for IHtmlString objects. Let’s find out more.

The New Type HtmlString

In ASP.NET 4, a new type is introduced that natively implements the IHtmlString interface. The new type, HtmlString, is defined in the System.Web namespace. Here’s a simple way to obtain an HtmlString:
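As a minimal sketch of one such way, assuming a project that references System.Web on .NET Framework 4, you can encode the raw text once and wrap the result. The helper name ToEncodedHtmlString is hypothetical; HtmlString and HttpUtility.HtmlEncode are the framework types discussed in this article.

```csharp
using System.Web;

public static class HtmlStringSketch
{
    // Encodes the raw text once, then wraps the result so that
    // IHtmlString-aware callers know not to encode it again.
    public static HtmlString ToEncodedHtmlString(string raw)
    {
        return new HtmlString(HttpUtility.HtmlEncode(raw));
    }
}
```

Note that the wrapper does no encoding of its own. Whatever string you hand to the constructor is what ToHtmlString returns.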

As mentioned, you can pass an HtmlString to the ASP.NET 4 auto-encoder and you can also use it with HttpUtility.HtmlEncode. In ASP.NET 4, a new overload of HtmlEncode has been added for this purpose; it is defined as shown in Figure 3.
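To give a rough idea of what an IHtmlString-aware encoder has to do, here is a sketch under stated assumptions. It is not the framework's source code; the class name and the EncodeObject method are hypothetical and only mirror the behavior described in this article: return the content of an IHtmlString untouched, and encode anything else.

```csharp
using System;
using System.Globalization;
using System.Web;

public static class ObjectEncodingSketch
{
    // Hypothetical helper that mimics an IHtmlString-aware encoder.
    public static string EncodeObject(object value)
    {
        if (value == null)
        {
            return null;
        }

        // Already-safe markup: hand back its content without encoding.
        var htmlString = value as IHtmlString;
        if (htmlString != null)
        {
            return htmlString.ToHtmlString();
        }

        // Anything else is converted to a string and encoded as usual.
        return HttpUtility.HtmlEncode(Convert.ToString(value, CultureInfo.CurrentCulture));
    }
}
```

The key design point is that the check is on the type of the object, not on the text it carries, which is why a plain string that happens to be encoded already still gets encoded again.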

The same overload doesn’t exist for Server.HtmlEncode, which accepts only plain strings. If you try to pass an HtmlString to Server.HtmlEncode, your code simply won’t compile because no overload of Server.HtmlEncode accepts an HtmlString object. The code in Figure 4 would work as expected and would avoid double encoding. However, the code won’t work if you replace HttpUtility.HtmlEncode with Server.HtmlEncode.

In Figure 4, the innermost call to Server.HtmlEncode encodes the text as usual. Next, the encoded text is placed in an HtmlString container and passed to HttpUtility.HtmlEncode, which detects the IHtmlString interface and skips the encoding pass.
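The sequence just described can be sketched in plain C#, assuming a project that references System.Web on .NET Framework 4. The class and method names are hypothetical. Because Server.HtmlEncode needs a live request, the sketch uses HttpUtility.HtmlEncode for the innermost call as well, which is an assumption made only so the code can run outside a page.

```csharp
using System.Web;

public static class SkipEncodingSketch
{
    public static string EncodeOnce(string raw)
    {
        // Innermost call: encode the raw text as usual.
        string encoded = HttpUtility.HtmlEncode(raw);

        // Place the encoded text in an HtmlString container.
        var wrapped = new HtmlString(encoded);

        // The object-based overload detects IHtmlString and skips the encoding pass.
        return HttpUtility.HtmlEncode(wrapped);
    }
}
```

For the input "<b>", EncodeOnce returns "&lt;b&gt;": the text is encoded exactly once even though HtmlEncode is called twice.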

Note that HttpUtility.HtmlEncode returns a plain string, not an HtmlString object. This means that if you try to encode it further, encoding occurs as expected. Likewise, note that you can’t nest HtmlString objects. Try the code in Figure 5, where the auto-encoder is invoked at the root of the code block, and see what happens. The result is the doubly encoded string shown below. The HttpUtility.HtmlEncode method gets an HtmlString object but returns a plain string. Subsequently, the auto-encoder receives a plain string and encodes it again.

&lt;script&gt;alert(&#39;Hello ASP.NET 4&#39;);&lt;/script&gt;
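The same effect can be sketched outside a page. The code below assumes System.Web on .NET Framework 4 and uses hypothetical names; the last call, made on a plain string, is a stand-in for the auto-encoder at the root of the code block.

```csharp
using System.Web;

public static class PlainStringResultSketch
{
    public static string Render(string raw)
    {
        var wrapped = new HtmlString(HttpUtility.HtmlEncode(raw));

        // An HtmlString goes in, but a plain string comes out.
        string plain = HttpUtility.HtmlEncode(wrapped);

        // Stand-in for <%: ... %>: a plain string is encoded again.
        return HttpUtility.HtmlEncode(plain);
    }
}
```

For the input "<b>", Render returns "&amp;lt;b&amp;gt;". Wrapping the intermediate result in a new HtmlString before the last call would preserve the single encoding.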

What about ASP.NET MVC?

Because HtmlString and the auto-encoder belong to the ASP.NET 4 platform, they should be available to ASP.NET MVC as well. But, as announced in Phil Haack’s blog a few months ago (see http://haacked.com), ASP.NET MVC 2 is not compiled for each .NET platform. In other words, ASP.NET MVC 2 is not natively compiled for ASP.NET 4.

Instead, the System.Web.Mvc assembly is built only for ASP.NET 3.5 Service Pack 1 and is then included with both Visual Studio 2008 SP1 and Visual Studio 2010 with product-specific tooling. ASP.NET MVC 2 exists as just one assembly built for one particular platform and works on any newer platform by taking advantage of .NET backward compatibility.

Maybe this is shocking news; maybe it's not. But how does the news relate to HTML encoding and the new HtmlString type? No HTML helpers available in ASP.NET MVC 2 return a plain string that contains HTML; instead, all HTML helpers return a new MvcHtmlString object. This new type is designed to be an ASP.NET MVC-specific version of the HtmlString type you have in ASP.NET 4.

Inside the MvcHtmlString Type

Hey, wait a moment. How can ASP.NET MVC 2, which is compiled against .NET 3.5 Service Pack 1, take advantage of the IHtmlString interface and the auto-encoding feature, which are defined in ASP.NET 4 and require the .NET 4 platform? To answer that question, let’s look at the source code of the MvcHtmlString class in ASP.NET MVC 2.

The documentation describes the class as one that represents an HTML-encoded string that should not be encoded again. An excerpt of the source code is shown in Figure 6. The first consideration is that the class doesn’t implement IHtmlString. The reason is fairly obvious: no such interface exists in .NET 3.5 Service Pack 1. Consequently, instances of the MvcHtmlString class are created via a factory, the MvcHtmlString.Create method.

The factory checks the availability of the IHtmlString interface. If that interface is available, the factory proceeds with the dynamic generation of a type that implements the interface; otherwise, it returns a plain MvcHtmlString object. To check whether the IHtmlString type is available in the runtime environment, the code grabs information about the assembly where HttpContext is defined and checks whether that assembly also hosts the type IHtmlString.
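The availability check can be sketched with a few lines of reflection. This is not the ASP.NET MVC 2 source code; the class and method names are hypothetical, and the dynamic type generation that follows the check is deliberately left out. The sketch assumes a project that references System.Web.

```csharp
using System;
using System.Web;

public static class PlatformProbeSketch
{
    // Looks for System.Web.IHtmlString in the assembly that defines HttpContext.
    // Assembly.GetType returns null when the assembly has no such type.
    public static Type FindIHtmlStringType()
    {
        return typeof(HttpContext).Assembly.GetType("System.Web.IHtmlString");
    }

    // True when the running System.Web assembly hosts IHtmlString.
    public static bool IsAutoEncodingPlatform()
    {
        return FindIHtmlStringType() != null;
    }
}
```

Looking the type up by name at run time is what lets an assembly compiled against .NET 3.5 Service Pack 1 discover an interface that it could not reference at compile time.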

Auto-Encoder and MvcHtmlString

If you write your ASP.NET MVC 2 application for the .NET 4 platform, you can use the auto-encoding subsystem of ASP.NET 4. The implementation of the MvcHtmlString class in fact detects the platform and, if it is running on .NET 4, emits a proxy that implements IHtmlString.

Finally, let me clarify a key statement, because it might be a source of misunderstanding: neither HtmlString in ASP.NET 4 nor MvcHtmlString in ASP.NET MVC 2 performs internal HTML encoding. They are simple string wrappers that, by exposing an interface, tell the ASP.NET 4 auto-encoding infrastructure not to further encode their content.

HTML encoding capabilities belong exclusively to Server.HtmlEncode and HttpUtility.HtmlEncode. In addition, in ASP.NET 4 you can use the : symbol to trigger an automatic encoding feature. Curiously, Server.HtmlEncode has always been implemented to call into HttpUtility.HtmlEncode and this is true also with ASP.NET 4. However, of the two classes, only HttpUtility is designed to support the HtmlString type.

A Simplified Syntax

HTML encoding is an important feature of Web applications—and a golden rule of Web development states that any text programmatically emitted to the output stream must be encoded. ASP.NET 4 offers a greatly simplified syntax (a variation of the popular code block syntax) for silently HTML encoding any text being output. For years, we debated whether Microsoft had to make HTML encoding default in any access to the response output stream. With the auto-encoder of ASP.NET 4 we don’t have free and automatic HTML encoding, but we do have a much simplified syntax.

If you use the ASP.NET 4 platform, keep two things in mind: use the auto-encoder syntax and make sure that any code that produces markup (for example, HTML helpers in ASP.NET MVC) does so through HtmlString-based types.