Introduction

If you've ever used the File | Save As... menu in Internet Explorer, you might have noticed a few interesting options IE provides under the Save As Type drop-down box:

The options provided are:

Web Page, complete

Web Archive, single file

Web Page, HTML only

Text File

Most of these are self-explanatory, with the exception of the Web Archive (MHTML) format. What's neat about this format is that it bundles the web page and all of its references into a single compact .MHT file. It's a lot easier to distribute a single self-contained file than it is to distribute an HTML file with a subfolder full of the image/CSS/Flash/XML files it references. In our case, we were generating HTML reports and we needed to check these reports into a document management system which expects a single file. The MHTML (*.mht) format solves this problem beautifully!

This project contains the MhtBuilder class, a 100% .NET managed code solution which can auto-generate an MHT file from a target URL in one line of code. As a bonus, it will also generate all the other formats listed above. And it's completely free, unlike some commercial solutions you might find out there.

Background

I know people assume the worst of Microsoft, but the MHTML format is actually based on RFC 2557: an MHTML web archive is an RFC 2557 compliant multipart MIME message. So it's an actual Internet standard! Web Archive, a.k.a. MHTML, is a remarkably simple plain text format which looks a lot like (and is in fact almost identical to) an email. Here's the header of the MHT file you are viewing at the top of the page:

To generate an MHTML file, we simply merge together all of the files referenced in the HTML. The red line marks the first content block; there will be one content block for each file. We need to follow a few rules, though:

Use Quoted-Printable encoding for the text formats.

Use Base64 encoding for the binary formats.

Make sure the Content-Location has the correct absolute URL for each reference.
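The three rules above can be sketched in a few lines. This is an illustrative Python sketch built on the standard `email` package, not MhtBuilder's own code; the function name `build_mhtml` and the shape of the `resources` argument are assumptions made for the example, and for brevity every downloaded resource is treated as binary.

```python
from email import charset, encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(page_url, html, resources):
    """Bundle a page and its resources into one MHTML string.

    resources is a list of (absolute_url, mime_type, data_bytes) tuples
    (a hypothetical shape chosen for this sketch).
    """
    archive = MIMEMultipart("related", type="text/html")
    archive["Subject"] = page_url

    # Rule 1: text formats use Quoted-Printable encoding.
    qp_utf8 = charset.Charset("utf-8")
    qp_utf8.body_encoding = charset.QP
    page = MIMEText(html, "html", qp_utf8)
    # Rule 3: Content-Location carries the absolute URL of the part.
    page["Content-Location"] = page_url
    archive.attach(page)

    for url, mime_type, data in resources:
        maintype, subtype = mime_type.split("/", 1)
        part = MIMEBase(maintype, subtype)
        part.set_payload(data)
        # Rule 2: binary formats use Base64 encoding.
        encoders.encode_base64(part)
        part["Content-Location"] = url
        archive.attach(part)

    # The generator writes one content block per attached part and
    # closes the archive with the final boundary delimiter.
    return archive.as_string()
```

Writing the returned string to a file with an .mht extension gives one content block for the page plus one per resource, each with its own Content-Location header.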

Not all websites will tolerate being packaged into an MHTML file. This version of Mht.Builder supports frames and IFrames, but watch out for pages that include lots of complicated JavaScript. You'll want to use the .StripScripts option on sites like that.

Using Mht.Builder

MhtBuilder comes with a complete demo app:

Try it out on your favorite website. By default, the files are generated in the \bin folder of the solution. Just click the View button to launch them. Bear in mind that for the Web Archive and complete tabs, all the content from the target web page must be downloaded to the \bin folder, so it might take a little while! Although I don't provide any feedback events yet, I do emit a lot of progress feedback via Debug.Write, so switch to the debug output tab to see what's happening in real time.

There are four tabs here, just like the four options IE provides in its Save As Type options. In MhtBuilder, these are the four methods being called, in the order they appear on the tabs:

As of Windows XP Service Pack 2, HTML files opened from disk result in security blocks. In order to avoid this, we need to add the "Mark of the Web" to the file so IE knows what URL it came from, and can thus assign an appropriate security zone to the HTML. That's what the blnAddMark parameter is for; it causes the HTML file to be tagged with this single line at the top:

<!-- saved from url=(0027)http://www.codeproject.com/ -->
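The number in parentheses is the character count of the URL, zero-padded to four digits. As a rough Python illustration (the function names here are hypothetical, not part of MhtBuilder), generating and prepending the mark could look like this:

```python
def mark_of_the_web(url):
    """Return the Mark of the Web comment for the given source URL."""
    # The parenthesized value is the length of the URL, padded to 4 digits.
    return f"<!-- saved from url=({len(url):04d}){url} -->"


def add_mark(html, url):
    """Prepend the mark as the first line of the saved HTML."""
    return mark_of_the_web(url) + "\r\n" + html
```

For `http://www.codeproject.com/`, which is 27 characters long, `mark_of_the_web` produces exactly the line shown above.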

The other thing we need to do when saving these files is fix up the URLs. Any relative URLs such as:

<img src="/images/standard/logo225x72.gif">

must be converted to absolute URLs like so:

<img src="http://www.codeproject.com/images/standard/logo225x72.gif">

We do this using regular expressions, which gets us a NameValueCollection of all the references we need to fix. We loop through each reference and perform the fixup on the HTML string.
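To make the fix-up concrete, here is a minimal Python sketch of the same idea. It is not the MhtBuilder code: it rewrites each reference in a single regex pass instead of building a collection first, it only looks at quoted `src` and `href` attributes, and the names `REFERENCE` and `absolutize` are made up for the example.

```python
import re
from urllib.parse import urljoin

# Matches src="..." or href="..." with either quote style.
REFERENCE = re.compile(
    r"""(?P<attr>\b(?:src|href)\s*=\s*)(?P<quote>["'])(?P<url>.*?)(?P=quote)""",
    re.IGNORECASE | re.DOTALL,
)


def absolutize(html, base_url):
    """Rewrite relative src/href references as absolute URLs."""

    def fix(match):
        absolute = urljoin(base_url, match.group("url").strip())
        return f"{match.group('attr')}{match.group('quote')}{absolute}{match.group('quote')}"

    return REFERENCE.sub(fix, html)
```

References that are already absolute pass through unchanged, because `urljoin` leaves a full URL as it is.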

We use a similar technique to get a list of all the files we need to download, which are then downloaded via my WebClientEx class. Why use that instead of the built-in Net.WebClient? Good question! Because Net.WebClient doesn't support HTTP compression. My class, on the other hand, does:

HTTP compression is a no-brainer: it increases your effective bandwidth by 75 percent by using standard GZIP compression, courtesy of the SharpZipLib library.
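The mechanics are simple: the client advertises gzip support in the Accept-Encoding request header, and decompresses the body if the server answers with Content-Encoding: gzip. The following Python sketch shows that handshake using only the standard library; it is an illustration of the technique, not a port of WebClientEx, and the function names are assumptions.

```python
import gzip
import urllib.request


def decode_body(body, content_encoding):
    """Decompress a response body if the server gzip-compressed it."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(body)
    return body


def fetch(url):
    """Download a URL, asking the server for a gzip-compressed response."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        return decode_body(response.read(), response.headers.get("Content-Encoding"))
```

Servers that don't support compression simply omit the Content-Encoding header, in which case the body is returned untouched.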

Conclusion

Creating MHTML files isn't hard, but there are lots of little gotchas when dealing with HTML, regular expressions, and HTTP downloads. I tried to document all the difficult bits in the source code. I've also tested MhtBuilder on dozens of different websites so far with excellent results.

There are many more details and comments in the source code provided at the top of the article, so check it out. Please don't hesitate to provide feedback, good or bad! I hope you enjoyed this article. If you did, you may also like my other articles.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

About the Author

My name is Jeff Atwood. I live in Berkeley, CA with my wife, two cats, and far more computers than I care to mention. My first computer was the Texas Instruments TI-99/4a. I've been a Microsoft Windows developer since 1992; primarily in VB. I am particularly interested in best practices and human factors in software development, as represented in my recommended developer reading list. I also have a coding and human factors related blog at www.codinghorror.com.

I know what this is now. Word is looking for a trailing "--" at the final boundary. So at the very end of the file change this..

------=_NextPart_000_00

to this..

------=_NextPart_000_00--
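In MIME terms, every part starts with "--" plus the boundary, and the last delimiter needs an extra trailing "--" to close the message. A small Python sketch of the repair (the helper name `close_final_boundary` is hypothetical, and this is not the library's actual fix):

```python
def close_final_boundary(mht_text, boundary):
    """Ensure the last boundary delimiter ends with the closing '--'."""
    delimiter = "--" + boundary
    stripped = mht_text.rstrip()
    if stripped.endswith(delimiter + "--"):
        return mht_text  # already closed correctly
    if stripped.endswith(delimiter):
        return stripped + "--\r\n"
    return mht_text  # no trailing delimiter found; leave the text alone
```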

2) carriage return / linefeed in <TITLE> and other places can cause regexes to fail

In other words, this is still valid HTML:

<a hr
ef="http://
blah.com"
>
blah
<
/a>

.. but it makes the regexes, er.. hella harder!
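One way to cope, sketched here in Python as an illustration and not taken from the MhtBuilder source, is to let the pattern accept any whitespace (including line breaks) wherever HTML permits it, and to let "." match newlines. This assumes the breaks fall between tokens or inside the quoted URL; a break in the middle of an attribute name is not handled. The names `ANCHOR` and `extract_links` are made up for the example.

```python
import re

# \s matches CR and LF, and DOTALL lets "." span line breaks too.
ANCHOR = re.compile(
    r"""<a\s+[^>]*?href\s*=\s*(["'])(.*?)\1[^>]*>(.*?)<\s*/\s*a\s*>""",
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return (url, text) pairs for anchors, tolerating embedded line breaks."""
    links = []
    for match in ANCHOR.finditer(html):
        url = re.sub(r"\s+", "", match.group(2))   # drop breaks inside the URL
        text = " ".join(match.group(3).split())    # collapse whitespace in the text
        links.append((url, text))
    return links
```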

3) There are some pages on the internet that cannot be saved as MHT

I never promised you a rose garden.. IE itself barfs on a lot of pages, some that I can actually save. The more javascript and dynamic AJAX-y stuff on the page, the less likely it is that MHT will work.

However, if there are pages that you feel should work (eg they are not CRAZY html), I will definitely look at them, so feel free to post 'em.

Using your class to make an mht out of this URL http://www.mediatrack.com, it doesn't properly resolve the relative paths of the CSS files. Copy of the output below. This is because the above URL actually goes to http://www.mediatrack.com/main/default.aspx. If you request the actual page rather than just http://www.mediatrack.com then it works just fine.

This isn't a problem for the app I am working on as I'll always be downloading specific pages but I thought that you would want to know.
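The underlying issue is that relative references must be resolved against the URL the server finally answered from, not the URL that was typed in. A short Python sketch of the difference follows; the stylesheet path `styles/site.css` is a made-up example, and `effective_base` is a hypothetical helper rather than anything in MhtBuilder.

```python
import urllib.request
from urllib.parse import urljoin


def effective_base(url):
    """Return the URL actually served, after any redirects were followed."""
    with urllib.request.urlopen(url) as response:
        return response.geturl()


def resolve(base_url, reference):
    """Resolve a relative reference against the given base URL."""
    return urljoin(base_url, reference)


# Resolving against the requested URL points at the site root:
requested = resolve("http://www.mediatrack.com", "styles/site.css")
# Resolving against the page that was really served points under /main/:
served = resolve("http://www.mediatrack.com/main/default.aspx", "styles/site.css")
```

Using the result of `effective_base` as the base for every fix-up would cover the case described above, where the site root hands the request on to /main/default.aspx.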

I noticed that in addition to the trailing "--" for the final BoundaryTag, MS Word also requires that the boundary written in AppendMhtHeader not have a trailing space so I needed to change boundary="----=_NextPart_000_00 " to boundary="----=_NextPart_000_00". I set Private Const _MimeBoundaryTag As String = "----=_NextPart_000_00".

The only sites I am not able to save/download as .mht are www.lowes.com and www.homedepot.com.

When I tried to save a page from www.heavydutystore.com, it gave an error message saying invalid file/folder format. This is because a linefeed character is coming along with the URL. I have fixed this error.

But can you help me save .mht files from www.lowes.com and www.homedepot.com?

Wondering... wrote: Re: how can I use it to convert a local HTML file?
I removed this functionality from the current version, because most people wanted to save webpages via HTTP, not convert webpages on the local disk..

Hi all, I have a problem when saving a URL as an .mht file. My web page consists of multiple pages, so prev and next links are provided. When I am in the middle of the paging and save my URL as .mht, I find that only the first page is saved, not the intended (middle) third page I was viewing.

Also tried running the app from SocksCap (Version 2). Details of SocksCap are given below.

Please let me know what URL should I put to run the application behind the proxy server.

The application works fine if I give any local url.

SocksCap
------------
SocksCap automatically enables 32-bit Windows-based TCP and UDP
WinSock client applications to traverse a SOCKS firewall. SocksCap
intercepts the networking calls from a client application and
redirects them through the SOCKS server without any modification to
the original application or to the operating system software or
drivers.

SocksCap centralizes SOCKS client configuration, eliminating the need
to individually configure each network application's firewall or proxy
setup. SocksCap can be configured to connect directly to some
addresses (for example an Intranet) while redirecting all other
connections through the SOCKS server. Direct connections based on
application name can also be configured.

I tried to load an MHT file, stored as text in a database, into the axbrowsercontrol using VB.NET. When I load the file with Navigate it's fine, but when I load it with doc.body.innerHTML = htmlstring it doesn't work; it shows only the plain HTML text. With plain HTML it does work, just not with MHT. Do you know a way to tell the control to render MHTML?