Harvesting Web Content into MHTML Archive

Capture and archive web resources by saving them in RFC-2557 (MHTML) compliant format. This library includes a framework for augmenting the capture process and allowing programmatic control, from downloading through saving. A replacement for CDOSYS/CDONTS.

Introduction

Most people who use the web are familiar with pages written in HTML, which reference other files such as images and stylesheets. As email became more prevalent, so did the desire to email HTML content along with all of its related resources. In 1999, RFC 2557 was published, describing how to aggregate HTML and its related resources into a single file. The format soon became known as MHTML.

MHTML defines a standard format for encoding web-related resources so that they can be stored or transmitted as a single document.

Motivation

A couple of years ago, I was working on a project for a client: an application that would periodically harvest content from specified sites. Once downloaded, the content was saved to a database so the user could later view a snapshot of the site at any time.

This application was pretty cool; however, it suffered from a major deficiency: it used CDOSYS as its primary engine. CDOSYS and CDONTS have been around for a while and are the common way for most developers to create MHTML documents. In fact, CDOSYS exposes the CreateMHTMLBody() method, so it was no surprise that it was chosen.

I was dissatisfied with the interface CDOSYS exposed. I found it cumbersome and difficult to use in the context of this application, and I began thinking about how it might be improved. The result is this library. Written entirely in C#, it exposes an object model for the downloaded content, makes integration into an application easy, and provides an extensibility model.

Extensibility model

A design goal for this library was to support as much customization and extensibility as possible while providing a common, powerful system. Through the use of a plugin-like architecture, many of the framework features can be extended beyond the original design.

Extending the content handling

One of the challenges in designing an extensible framework is that there are many different types of content, with more being developed as the web evolves. How could a single framework understand all past, present, and future types of content?

To solve this, the concept of an artifact was developed. An artifact is an encapsulation of a single resource or file. An artifact by itself knows nothing about any particular content; rather, each type of content is given its own class, derived from ArtifactBase, that knows how to handle that kind of data, as shown in the image below. For instance, HTML and images are handled by the HtmlArtifact and ImageArtifact classes, respectively. By implementing new artifacts derived from ArtifactBase, new content types can be handled:

Artifacts have two primary classifications: TextArtifact and BinaryArtifact. These define general handling for text and binary content, respectively. Although the design could have made these classes abstract, they in fact serve as catch-alls for content that is not recognized. For instance, suppose a video clip is encountered and no concrete class exists for such content; it would then be loaded into a BinaryArtifact object. This way, all of a site's content can be represented, even content that is not specifically known.

Artifacts encapsulate content, but nothing has been said about who is responsible for creating them. How does the engine know which artifact to create when a page is downloaded? Content handlers provide the bridge between the raw downloaded content and the artifacts. Similar to the class hierarchy for artifacts, handlers derive from the ContentHandlerBase class. Each handler acts as a factory, taking the downloaded content and creating the necessary artifacts:

So far, though, nothing has been said about how downloaded content is routed to a particular ContentHandlerBase object. It is one thing for a handler to understand the content it is given; it is another to invoke the correct handler in the first place.

All routing information is maintained in the configuration file. This allows new handlers to be easily plugged into the framework without modifying any code. The following XML shows the structure of the configuration file that maps content handlers with the downloaded content:
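The original listing is not shown here, but a hypothetical sketch of the mapping section, reconstructed from the attributes and handler names discussed in this section, might look like this (the element names and the exact multi-pattern syntax are assumptions; only the type and mimePattern attributes and the handler class names come from the text):

```xml
<!-- Hypothetical sketch: element names are assumptions; only the
     type / mimePattern attributes and the handler class names are
     described in the article. Mappings are interrogated in order. -->
<contentHandlers>
  <handler mimePattern="audio/*" type="AudioContentHandler" />
  <handler mimePattern="text/css" type="StylesheetContentHandler" />
  <!-- two patterns for one handler; the real separator syntax is a guess -->
  <handler mimePattern="application/x-javascript;text/javascript"
           type="ScriptContentHandler" />
  <handler mimePattern="text/*" type="TextContentHandler" />
  <handler mimePattern="*/*" type="BinaryContentHandler" />
</contentHandlers>
```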

The type and mimePattern attributes define the mapping. All downloaded content has a Content-Type value provided by the server (e.g., image/jpeg). This value lets the receiving agent know how to process the data; in the case of this framework, it determines which class can create the necessary artifact.

The content type pattern can be a specific value or contain wildcards. Notice the first handler above: it uses the pattern audio/*, meaning the AudioContentHandler class will handle all audio content. Now look at the pattern for CSS files. The parent type for CSS is text, which is shared with other content; therefore, a specific pattern must be used to map CSS to its handler, StylesheetContentHandler. The configuration also allows more than one pattern to be defined for a handler. Take a look at the handler for JavaScript: two patterns, application/x-javascript and text/javascript, allow both to be handled by ScriptContentHandler when encountered.

Handler mappings are interrogated in the order they are defined in the configuration file. Observe the last two mapping definitions: any text content that is not otherwise recognized will be handled by TextContentHandler, while all other unknown content falls through to BinaryContentHandler.
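This first-match-wins routing can be sketched in a few lines; note that nothing here is the library's actual API, only an illustration of matching wildcard patterns in configuration order:

```csharp
// Hypothetical sketch of first-match routing: patterns are tried in
// configuration order, and a pattern such as "audio/*" matches any
// subtype. None of these names come from the library.
using System;
using System.Collections.Generic;

static class ContentRouting
{
    // True when a configured pattern such as "audio/*", "text/css",
    // or "*/*" matches the server-supplied Content-Type value.
    static bool Matches(string pattern, string contentType)
    {
        if (pattern == "*/*")
            return true;
        if (pattern.EndsWith("/*"))
            return contentType.StartsWith(
                pattern.Substring(0, pattern.Length - 1),
                StringComparison.OrdinalIgnoreCase);
        return string.Equals(pattern, contentType,
                             StringComparison.OrdinalIgnoreCase);
    }

    // Returns the handler name mapped to the first matching pattern,
    // or null when no mapping applies.
    public static string Route(
        IList<KeyValuePair<string, string>> mappings, string contentType)
    {
        foreach (var map in mappings)
            if (Matches(map.Key, contentType))
                return map.Value;
        return null;
    }
}
```

With the mappings listed in the order described above, a Content-Type of text/css would route to StylesheetContentHandler, while text/plain would fall through to the text/* entry.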

To specifically handle a content type such as video, all that is required is a class derived from ContentHandlerBase, a class derived from ArtifactBase, and the appropriate configuration entry. The core will then invoke the new handler when that content is downloaded. Likewise, if certain handlers are not needed, simply remove them from the configuration file.
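A minimal sketch of such an extension follows. Only the names ArtifactBase and ContentHandlerBase come from the article; their members, the stand-in stubs, and everything else shown are assumptions about the shape of the API, not the library's actual code:

```csharp
// Hypothetical sketch: minimal stand-ins for the library's base classes,
// plus a new artifact/handler pair for video content.
using System;

public abstract class ArtifactBase            // assumed shape
{
    protected ArtifactBase(Uri location) { Location = location; }
    public Uri Location { get; private set; }
}

public abstract class ContentHandlerBase      // assumed shape
{
    public abstract ArtifactBase CreateArtifact(Uri location, byte[] content);
}

// A new content type: raw video bytes, stored as-is.
public class VideoArtifact : ArtifactBase
{
    private readonly byte[] data;

    public VideoArtifact(Uri location, byte[] data) : base(location)
    {
        this.data = data;
    }

    public byte[] Data { get { return data; } }
}

// The matching factory; the configuration would map a pattern such as
// video/* to this class so the core invokes it for video downloads.
public class VideoContentHandler : ContentHandlerBase
{
    public override ArtifactBase CreateArtifact(Uri location, byte[] content)
    {
        return new VideoArtifact(location, content);
    }
}
```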

Extending the link resolution

One of the powerful features of this framework is its ability to traverse a page and identify the other resources it references, so that all related content can be identified, downloaded, and encapsulated within ArtifactBase objects.

All link identification is handled by classes derived from LinkResolverBase. This set of classes can extract most of the common external references that appear within HTML and CSS documents. These classes do not care about the type of content or whether a link points to a valid location; their only goal is to provide the core with a list of links extracted from the processed document:

Similar to content handlers, link resolvers also have a configuration section indicating which classes can locate links. Unlike the content handler configuration, the section shown below does not map any information. All HTML and CSS content is passed through each resolver, and each returns the subset of links it found on the page:
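The original listing is not reproduced here; a hypothetical sketch of the section, based on what this section describes, might look like this (only ImageLinkResolver, the context element, and the includeBackImg / evalCss attributes come from the text; the other resolver names and element names are placeholders):

```xml
<!-- Hypothetical sketch: element names and most resolver names are
     assumptions; only ImageLinkResolver, context, includeBackImg, and
     evalCss are described in the article. -->
<linkResolvers>
  <resolver type="HtmlLinkResolver" />
  <resolver type="StylesheetLinkResolver" />
  <resolver type="ImageLinkResolver">
    <context includeBackImg="true" evalCss="true" />
  </resolver>
</linkResolvers>
```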

Link resolvers can also be configured through the configuration file. Observe the ImageLinkResolver setting above: it has a context element used to hold configuration settings, in this case attributes such as includeBackImg and evalCss. Any link resolver can implement the IExtendedConfiguration interface, which informs the core that custom configuration can be provided; the resolver then initializes itself from the supplied settings.

Unlike the content handler configuration, there is no guaranteed order of processing for link resolvers: the order in which they appear in the configuration is not necessarily the order in which they will be invoked. No resolver should assume that particular links have already been located by the time it is invoked.

Regardless of what links are provided by the resolver, there are some caveats:

Links located within script blocks will be ignored. Dynamic content cannot be guaranteed, so this rule holds for all links in script blocks, even those that are complete and valid.

Links do not have to point to resources that exist. The core will ignore resources it cannot access and continue processing the other links.

Although the provided resolvers can scan most common areas of HTML and CSS, they may not locate every link, especially as new tags are developed in the future. Implementing a resolver has the simple requirement that it inherit from LinkResolverBase. Once it is added to the configuration file, it will be invoked by the core. Likewise, if there is no need to scan for certain resources, removing the setting from the configuration prevents the core from loading that resolver, and those links will simply not be located.
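As a sketch, a resolver for a tag the built-in set does not know about might look like this. Only the name LinkResolverBase comes from the article; the stub and the override's shape are assumptions about the API:

```csharp
// Hypothetical sketch: extracts src attributes from <video> tags, a
// resource the built-in resolvers would not know about. LinkResolverBase
// is the only name taken from the article; its members are assumed.
using System.Collections.Generic;
using System.Text.RegularExpressions;

public abstract class LinkResolverBase        // minimal stand-in
{
    public abstract IList<string> ResolveLinks(string document);
}

public class VideoTagLinkResolver : LinkResolverBase
{
    private static readonly Regex SrcPattern = new Regex(
        "<video[^>]+src\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase);

    public override IList<string> ResolveLinks(string document)
    {
        // Return every src value found; the core decides whether the
        // target is reachable and worth downloading.
        var links = new List<string>();
        foreach (Match m in SrcPattern.Matches(document))
            links.Add(m.Groups[1].Value);
        return links;
    }
}
```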

Loading and saving content

The ability to load and save content has not been implemented yet.

Bugs and feature requests

Since this library is pre-alpha, no guarantees of functionality are implied. Many features still need to be implemented, and those that are implemented may still have bugs or be incomplete. I am providing it for you to review and to comment on its direction.

In addition to the forum below, you can submit bugs and feature requests at sourceforge.net. I am particularly interested in hearing about URLs that this library cannot process. I appreciate all comments and am working diligently to add more functionality.

Confession

Now that I have gotten your hopes up, I must extinguish them just as fast. Currently, the only part of the library implemented is the downloading of the content. This can be experimented with by using the demo application.

When integrating this framework into your projects, only the code called, either directly or indirectly, from the method MhtmlDocument.LoadFromUrl(...) is implemented.
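A usage sketch of that entry point follows; the exact signature and the members of MhtmlDocument are assumptions, since only the method name is given above:

```csharp
// Hypothetical sketch: MhtmlDocument.LoadFromUrl(...) is the only
// implemented entry point named in the article; its exact signature
// is an assumption.
MhtmlDocument doc = MhtmlDocument.LoadFromUrl("http://www.example.com/");

// At this point the page and its related resources have been downloaded
// and wrapped in artifact objects; saving is not yet implemented.
```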

This library might be sufficient for your particular needs, or you may just want to experiment. Be advised, though, that it is still pre-alpha, with much functionality still lacking. Among the features not yet supported are:

Saving content: Through the use of adapters, the content can be saved to any location, whether it is a database or a single file.

Loading content: Like saving, loading will happen through the use of adapters. The approach is that downloading the content should be independent of storage.

Managing link replacement: Links presented on a page must be transformed so that they point to the archived resources.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

One thing I'm looking for is a library that will allow me to build an HTML body with full URL references to stylesheets, images, and such, and have the library generate the MHTML. I do not often have a deployed web page that I am trying to turn into an MHTML... it's usually just HTML with linked resources that I want to send as an email.

Will you be augmenting your library to support this type of operation?

The main core of the library provides an object model over HTML and related content. The core has no concept of where the content came from or where it will be stored; it is a complete, disconnected snapshot of some content.

The library will allow you to do what you are looking for through the use of adapters. It is too early to tell whether that functionality will be built-in, but through the extensibility model, there will be classes and interfaces that can be extended to allow content to be loaded from any location and any context.

Right now the architecture relies on an HttpWebRequest object, but as the code matures, I am hoping to make the source more generic.

I can't say when it will be ready for such tasks. It may be some time since I am more focused on getting the primary functionality implemented.