Building XML Parsers for Microsoft's IE4

Jean Paoli, David Schach, Chris Lovett, Andrew Layman, Istvan Cseri

Abstract

Microsoft cofounded the XML working group at the W3C in July 1996 and actively participated in the definition of the standard. This article describes why Microsoft implemented its first XML application and how it led to the development of two XML parsers shipping in Internet Explorer 4.0, one written in C++ and the other in Java. We describe the importance of designing an object model API and our vision of XML as a universal, open data format for the Internet.

Motivation

Our First Application: Active Channels for Internet Explorer 4.0

Conventional Web use waits for a user to request a page before sending it. That is known as the "pull" mode. A powerful alternative exists, however, called "push" or "webcasting," in which pages are sent to a user in advance, based on automatic matching of pages to the user's interests. Webcasting provides each user with automatic delivery and offline access to the information and Web sites that he uses most often.

To bring this idea to reality, in February 1997 the Internet Explorer team needed a standard way of describing sites and pages. The first broadly popular form of Web "metadata" (so called because it describes data about other data) is the Channel Definition Format, or CDF [1]. This allows a Web site to post a description of itself in a standard form. Having done so, it is no longer just a site; it is also an "Active Channel."

A channel is a set of related Web pages. Channel Definition Format files include the following characteristics:

A minimal CDF file contains a list of URLs pointing to the pages that make up the content of the channel.

A more advanced CDF file can include title and abstract information describing individual items, a schedule for updates, and a hierarchical organization of the channel's offerings.

A CDF file must be easy to create and not require changes to existing HTML pages.

In looking for a suitable technology on which to build channels, the Internet Explorer team found that XML and Active Channels are a perfect fit. XML is excellent for metadata, since many of its rules are similar to the widely known HTML language rules; yet it has more facilities for structure and extensibility. This gave the IE team the assurance that parsers would be easy to implement and the format would be broadly usable.

CDF is an application of XML that deals with the particulars of Web metadata. CDF consists of a vocabulary of terms that are related to Web sites and their Active Channel content. Technically, the terms are used as "Elements" and "attributes," and CDF defines how they can be used together to expand a Web site into a webcasting channel (see Example 1).

A Universal, Open Data Format for the Internet

At the same time as the metadata CDF work was proceeding, members of the Internet Explorer team and others in Microsoft started to understand the broad need for a universal, open data format for the Internet. The opportunities are very exciting.

The Web has created an opportunity to communicate with anyone, anywhere. Fully realizing this potential depends on widespread use of standards--as with the telephone, this communication depends on numerous layers of interoperating technology. One such important layer is visual display and user interface, exemplified by standards such as HTML, GIF, and ECMAScript (previously JavaScript). These standards allow a page to be created once, yet displayed at different times by many receivers.

Although visual and user interface standards are a necessary layer, they are insufficient for representing and managing data. Today's Internet is primarily an access medium to text and pictures. There are no standards for intelligent search, data exchange, adaptive presentation, and personalization. The Internet must go beyond setting an information access and display standard; it must set an information understanding standard--a standard way of representing data so that software can better search, move, display, and otherwise manipulate information currently hidden in contextual obscurity. HTML cannot fulfill these needs because it is a format that describes how a Web page should look, rather than one that represents data. For example:

HTML does not provide a standard way for a doctor to send a prescription to a pharmacist.

HTML does not enable a medical laboratory to publish statistical information in a format that any receiver can analyze.

HTML does not describe an electronic payment in a form that any recipient can decode and process.

HTML does not provide a standard way to search legal libraries to find, for example, all litigation documents about a certain topic.

HTML does not specify how information in a company catalog can be transmitted, such that a salesman can work offline, show the catalog to clients, take orders, then upload those orders in a standard format.

In short, while HTML provides rich facilities for display, it does not provide any standards-based way to manage data.

A standard for data representation will expand the Internet in much the same way that the HTML standard did for display a few years ago. The data standard will be the vehicle for business transactions, publication of personal preference profiles, automated collaboration, and database sharing. Payments, medical histories, pharmaceutical research data, semi-conductor part sheets, and purchase orders will all be written in this format. It will open up a wide variety of new uses, all based on a standard representation for moving structured data around the Web as easily as we move HTML pages today. That data standard is XML.

XML: A Standard Format for Data

XML provides a data standard that can encode the content, semantics, and schemata for a range of cases, from simple to complex. XML can encode the representation for the following:

An ordinary document

A structured record, such as an appointment record or purchase order

An object with data and methods (for example, the persistent form of a Java object or ActiveX control)

A data record, such as the result set of a query

Meta-content about a Web site (such as CDF)

Graphical presentation (such as an application's user interface)

Standard schema entities and types

All the links between information and people on the Web

Benefits of XML

As a universal standard for the expression of data, XML offers many advantages to organizations, software developers, Web sites, and ultimately to end-users.

For software developers building Web applications and line-of-business Intranet software, XML provides a powerful, flexible format for expressing data--whether as a wire format for sending data between client and server, a transfer format for sharing data between applications, or a persistent storage format on disk. Because structured data in XML can include a self-describing schema, XML promises interoperability between applications that manipulate structured data independent of the underlying semantics.

For example, because XML enables publishers to supplement their Web sites with metadata such as CDF, users can receive "pushed" content as structured channels. XML can also provide a means for embedding arbitrary data and annotations within HTML, extending the possibilities for Web-based applications based on HTML and scripts.

For end-users, XML promises to provide a much richer set of Web applications for browsing, communication, and collaboration. The growing use of XML will improve Web-browsing applications for viewing, filtering, and manipulating information on the Internet.

As collaboration on the Web spreads to more businesses, customer services will eventually migrate from phone lines and storefronts to Web sites. The majority of these Intranet and Internet business applications will involve manipulation or transfer of data and database records, such as purchase orders, invoices, customer information, appointments, maps, and so forth. XML promises a revolution in the richness of end-user possibilities on the Web because it enables such a wide array of business applications to be implemented on the Internet.

Microsoft XML Parsers

Our long-term goal for XML is that it function as a data format that anyone can use to build a range of Web applications. To achieve this goal, we decided to write an XML parser and make it freely available. The result of these efforts was two XML parsers--one in C++ and the other in Java--both of which are included as part of Microsoft Internet Explorer 4.0. The parsers were written in parallel, but with somewhat different design goals.

The Microsoft XML parser in C++ (MSXML in C++) was written to perform as an integral part of Internet Explorer 4.0. Consequently, its design was oriented toward the following:

Fast parsing speed

Low memory usage

Asynchronous parsing during download

Strong international support

In other words, this is a performance parser. Much effort was spent on wringing the most efficiency from the code, and all non-essential features were eliminated. For example, MSXML in C++ is a non-validating parser.

In contrast to the XML parser in C++, the goals of the Microsoft XML parser in Java (MSXML in Java) included the following:

To be a reference implementation

To be a full validating parser

To be cross-platform

To promote widespread acceptance of the XML standard

To experiment with leading edge XML standards efforts, like DOM and namespaces

For these reasons, the Java parser is fully validating, implements the latest proposed features (such as namespaces), and has freely available source code.

With some minor exceptions (such as no current support for conditional sections), Microsoft's XML parsers completely implement the W3C Working Draft of the XML specification dated June 30, 1997.[1]

Object Model

Once parsed, an XML document is manipulated through an object model (or API). To really help make XML the standard format for data over the Web, we felt that a standard object model was crucial: one that was simple, scriptable, minimal, and consistent with the work of the Document Object Model (DOM) Working Group.[2] We are currently working with the W3C to standardize the XML object model. The object model is language neutral, which means it is equally accessible from all programming languages. To keep the object model independent of the parsers, it was designed prior to implementing them. The idea was to completely separate the parser implementation from the XML data structures. Having the parser use the object model ensured that problems with the object model would be flushed out during development.

Document object

The object model is very simple. It models the XML document as a tree structure using only three classes of objects:

A Document

An Element

A Collection

The Document object represents an entire XML document. This object holds the Element tree and document information such as the document type, version and character encoding. The Element object is used for representing the nodes in the tree, and the Collection object is used to represent the child Elements of a given node.

Element object

All XML data is stored in a tree of Element objects. Container Elements are non-leaf nodes. Empty Elements, text, as well as comments and processing instructions are stored as leaf nodes in the tree. An Element's type is revealed by the type property. Currently, the following types are returned:

ELEMENT

For container and empty XML Elements

TEXT

For PCDATA and CDATA

COMMENT

For comments

OTHER

For processing instructions

We considered using a different object in the object model for each of these types rather than a single object with a type property, but decided that multiple objects complicated the object model. This was particularly the case when navigating the Element tree and for untyped languages like JavaScript and VBScript.

The other important properties of the Element object are:

tagName

The name (or GI) for objects of type ELEMENT (otherwise an empty string)

parent

The parent Element of this object in the tree.

text

The text for objects of type TEXT or COMMENT (otherwise an empty string)

children

A collection of the objects contained by this object. This collection is empty for all types other than ELEMENT.

Finally, the Element class provides a basic set of methods for setting, getting, and removing attribute values as well as adding and removing child Elements.
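The properties and methods just listed can be pictured with a minimal sketch in JavaScript. This is an illustration under stated assumptions, not the MSXML implementation: the class name SketchElement, the method names (setAttribute, getAttribute, removeAttribute, addChild, removeChild), and the use of a plain array for children are all hypothetical; only type, tagName, parent, text, and children come from the description above.

```javascript
// Hypothetical sketch of a single Element class with a type property.
// Only type, tagName, parent, text, and children are named in the article;
// every other name here is an assumption made for illustration.
const ELEMENT = "ELEMENT", TEXT = "TEXT", COMMENT = "COMMENT", OTHER = "OTHER";

class SketchElement {
  constructor(type, { tagName = "", text = "" } = {}) {
    this.type = type;
    // tagName is meaningful only for ELEMENT nodes, otherwise an empty string.
    this.tagName = type === ELEMENT ? tagName : "";
    // text is meaningful only for TEXT and COMMENT nodes, otherwise empty.
    this.text = type === TEXT || type === COMMENT ? text : "";
    this.parent = null;
    this.children = [];
    this.attributes = new Map();
  }
  setAttribute(name, value) { this.attributes.set(name, value); }
  getAttribute(name) {
    return this.attributes.has(name) ? this.attributes.get(name) : null;
  }
  removeAttribute(name) { this.attributes.delete(name); }
  addChild(child) {
    if (this.type !== ELEMENT) throw new Error("only ELEMENT nodes can have children");
    child.parent = this;
    this.children.push(child);
    return child;
  }
  removeChild(child) {
    const i = this.children.indexOf(child);
    if (i >= 0) {
      this.children.splice(i, 1);
      child.parent = null;
    }
  }
}

// Usage: build a two-level tree by hand.
const channel = new SketchElement(ELEMENT, { tagName: "CHANNEL" });
channel.setAttribute("HREF", "http://example.com/");
const title = channel.addChild(new SketchElement(ELEMENT, { tagName: "TITLE" }));
title.addChild(new SketchElement(TEXT, { text: "Example channel" }));
```

Because every node is the same kind of object, a script can walk the tree with one loop and branch on type only where it matters.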

Element collections

Element collections are used to walk the XML tree. An Element collection has one property, the length, which is the number of Elements in the collection. Child Elements are fetched via the item method, which returns an Element either by index or by name. When more than one Element has the same name, the item method returns a new collection containing all of the child Elements with that name.
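The behavior of the item method can be sketched as follows. This is a minimal illustration, assuming a hypothetical SketchCollection class backed by an array of objects that carry a tagName; returning null when nothing matches is also an assumption, since the text does not say what happens in that case.

```javascript
// Hypothetical sketch of a collection with a length property and an
// item method that accepts either an index or a name.
class SketchCollection {
  constructor(elements) { this.elements = elements; }
  get length() { return this.elements.length; }
  item(key) {
    if (typeof key === "number") {
      // Lookup by position; null for an out-of-range index is an assumption.
      return key >= 0 && key < this.elements.length ? this.elements[key] : null;
    }
    const matches = this.elements.filter((e) => e.tagName === key);
    if (matches.length === 0) return null;       // assumption: no match yields null
    if (matches.length === 1) return matches[0]; // a single match is returned directly
    return new SketchCollection(matches);        // several matches: a new collection
  }
}

// Usage with plain objects standing in for Elements.
const kids = new SketchCollection([
  { tagName: "TITLE" },
  { tagName: "ITEM" },
  { tagName: "ITEM" },
]);
const items = kids.item("ITEM"); // a collection of the two ITEM children
```

A caller therefore has to be prepared for either an Element or a collection when fetching by name, which is the price of keeping the interface to a single method.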

The object model for the C++ parser is written using Microsoft's component object model architecture (COM). As a result, it is language neutral and equally accessible from JavaScript and VBScript as well as C++ and Java. For example, once a Document object is created, loading a document involves setting the document's URL. The following JavaScript code fragment shows how to load an XML document from an HTML page using the C++ parser:

While the object model is minimal, it is functionally complete. We expect that it will evolve over time.

For more information about Microsoft's XML object model see [2] and [3].

Technical Details

Simplicity of design

The Microsoft XML parsers are simple. This is by design. They are implemented as hand-coded, recursive-descent parsers. This has a couple of benefits:

First, the minimal syntax of XML makes a parser generator unnecessary: a hand-coded parser works just fine.

Second, recursive-descent parsers are both easy to write and easy to understand.

This latter point is especially important since the source code for MSXML in Java is available to the public on the Microsoft Web site. We want it to be a reference implementation that can be understood by any Java programmer. (Another reason parser generating tools are not used is that the language has many lexical Elements that are unlimited in length; we do not want to test a parser generator's buffer size limits.)
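To show why a hand-coded recursive-descent parser is a natural fit, here is a minimal sketch for a small subset of XML: elements, double-quoted attributes, and text. It is not the MSXML code. The function name parseTinyXml and the shape of the returned nodes are assumptions, and the sketch ignores DTDs, entities, comments, processing instructions, and character-class rules for names.

```javascript
// Minimal recursive-descent sketch for a tiny XML subset (illustration only).
function parseTinyXml(src) {
  let pos = 0;

  function fail(msg) { throw new Error(msg + " at offset " + pos); }
  function skipSpace() { while (pos < src.length && /\s/.test(src[pos])) pos++; }
  function expect(s) {
    if (!src.startsWith(s, pos)) fail("expected '" + s + "'");
    pos += s.length;
  }
  function parseName() {
    const start = pos;
    while (pos < src.length && /[A-Za-z0-9_:.-]/.test(src[pos])) pos++;
    if (pos === start) fail("expected a name");
    return src.slice(start, pos);
  }
  function parseAttributes() {
    const attrs = {};
    for (;;) {
      skipSpace();
      if (src[pos] === ">" || src[pos] === "/" || pos >= src.length) return attrs;
      const name = parseName();
      skipSpace(); expect("="); skipSpace(); expect('"');
      const end = src.indexOf('"', pos);
      if (end < 0) fail("unterminated attribute value");
      attrs[name] = src.slice(pos, end);
      pos = end + 1;
    }
  }
  // element := '<' Name attributes ( '/>' | '>' content '</' Name '>' )
  function parseElement() {
    expect("<");
    const node = { type: "ELEMENT", tagName: parseName(), attributes: parseAttributes(), children: [] };
    if (src.startsWith("/>", pos)) { pos += 2; return node; }
    expect(">");
    parseContent(node);
    expect("</");
    const close = parseName();
    if (close !== node.tagName) fail("mismatched end tag " + close);
    skipSpace(); expect(">");
    return node;
  }
  // content := ( text | element )*   (the recursion happens here)
  function parseContent(parent) {
    while (pos < src.length && !src.startsWith("</", pos)) {
      if (src[pos] === "<") {
        parent.children.push(parseElement());
      } else {
        let end = src.indexOf("<", pos);
        if (end < 0) end = src.length;
        parent.children.push({ type: "TEXT", text: src.slice(pos, end) });
        pos = end;
      }
    }
  }

  skipSpace();
  const root = parseElement();
  skipSpace();
  if (pos !== src.length) fail("unexpected content after root element");
  return root;
}
```

Each grammar rule maps onto one function, and nesting in the document maps onto the call stack, which is what makes this style easy to read alongside the specification.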

Character encodings

Although XML parsers are required only to read UTF-8 and UCS-2 encodings, Microsoft's XML parsers handle many more encodings, such as shift-jis, euc-jp, and big5. In fact, the C++ parser supports the same set of character encodings as IE 4.0, and the Java parser supports all the encodings supported by the Java VM. The recursive-descent parsers are isolated from these different encodings by input readers that convert everything to Unicode. While this increases memory usage for European languages, it simplifies string processing overall.
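The idea of an input reader that hides the byte encoding from the parser can be sketched in a few lines. This is an illustration only: the function name readAsUnicode is hypothetical, it handles just UTF-8 and little-endian UTF-16 (chosen by byte-order mark), and it uses the standard TextDecoder API rather than anything from the Microsoft parsers.

```javascript
// Sketch of an "input reader" layer: bytes in, Unicode (a JavaScript string) out.
// Assumption: only UTF-8 and UTF-16LE are handled, selected by the byte-order mark.
function readAsUnicode(bytes) {
  const isUtf16le = bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe;
  const decoder = new TextDecoder(isUtf16le ? "utf-16le" : "utf-8");
  // With default options, TextDecoder drops a leading byte-order mark.
  return decoder.decode(bytes);
}

// Usage: the same markup arrives in two encodings and comes out identical.
const fromUtf8 = readAsUnicode(new TextEncoder().encode("<a>caf\u00e9</a>"));
```

Everything downstream of this function sees one string representation, so the parsing code never needs to know which encoding the document arrived in.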

Storage of Element and Attribute names

Because Element and Attribute names tend to repeat, they are stored as atoms so that only one copy of each string is stored. This also speeds up string comparisons because atom objects can be compared for equality very quickly, without comparing the characters in the strings. This technique amortizes some of the cost of checking for NameChar characters and converting Unicode characters to uppercase.
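The atom technique can be illustrated with a small interning table. This is a sketch under assumptions, not the parsers' actual data structure: the class name AtomTable, the intern method, and the stored upper field are hypothetical.

```javascript
// Sketch of name atoms: each distinct name is stored once, and later
// occurrences reuse the same object, so equality is an identity check.
class AtomTable {
  constructor() { this.atoms = new Map(); }
  intern(name) {
    let atom = this.atoms.get(name);
    if (atom === undefined) {
      // Per-name work, such as upper-casing, is done once per distinct name.
      atom = Object.freeze({ name, upper: name.toUpperCase() });
      this.atoms.set(name, atom);
    }
    return atom;
  }
}

// Usage: two occurrences of the same tag name share one atom.
const table = new AtomTable();
const first = table.intern("Item");
const second = table.intern("Item");
const sameAtom = first === second; // identity comparison, no character scan
```

The lookup still hashes the string once on entry, but every later comparison between two names already in the tree is a single reference comparison.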

Object model implementation

The Java parser builds the Element tree using the object model. When it creates new Elements it uses an Element class factory that is passed in by the creator of the parser. The parsers come with a default object model implementation that is fully functional; however, clients with special needs can write their own class factory that creates custom objects. This makes it easy for programs that want to use XML but still need to process legacy data structures.
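The class factory idea can be sketched as follows. The real parser in question is written in Java; this JavaScript version is only an illustration, and every name in it (defaultFactory, legacyFactory, LegacyRecord, createElement, buildTree, and the pre-tokenized event list) is hypothetical.

```javascript
// Default factory: creates plain tree nodes.
const defaultFactory = {
  createElement(tagName) { return { tagName, children: [] }; },
};

// Custom factory: maps elements onto an application's own record type.
class LegacyRecord {
  constructor(kind) { this.kind = kind; this.children = []; }
}
const legacyFactory = {
  createElement(tagName) { return new LegacyRecord(tagName.toLowerCase()); },
};

// Stand-in for the parser's tree builder. It never calls a constructor
// directly; all nodes come from the factory passed in by the caller.
// `events` is a hypothetical token stream of ["open", name] and ["close"].
function buildTree(events, factory = defaultFactory) {
  let root = null;
  const stack = [];
  for (const [kind, name] of events) {
    if (kind === "open") {
      const node = factory.createElement(name);
      if (stack.length > 0) stack[stack.length - 1].children.push(node);
      else root = node;
      stack.push(node);
    } else {
      stack.pop();
    }
  }
  return root;
}

// Usage: the same input produces different object types.
const events = [["open", "CHANNEL"], ["open", "ITEM"], ["close"], ["close"]];
const plainTree = buildTree(events);
const legacyTree = buildTree(events, legacyFactory);
```

The builder depends only on the factory's interface, so an application can substitute its own node types without touching the parsing logic.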

Although the Java parser does not parse asynchronously, it can be run on a separate thread. The C++ parser parses asynchronously by running on a fiber. The object model was designed so that asynchronous parsing can be implemented transparently to the programmer. Because all properties and methods are function calls, the object model can block the caller when attempting to access a node in the tree that isn't completely downloaded.

Entities and other language features

The Java parser also implements DTD validation, full Entity handling, and the namespace proposal. We found that DTD validation was relatively easy. The XML spec was clear and pointers to algorithms for implementing validation were helpful, but we found that supporting validation does seem to impact the overall performance of the parser.

Correct entity handling was actually quite subtle--especially when we were trying to figure out how to expose entity references in the Object Model. The problem is that some clients of the Object Model (like JavaScript code) prefer the entities to be fully expanded and thereby essentially invisible to their scripts. Other clients of the Object Model (like an authoring tool), on the other hand, want to know where the entities are, how to edit them, and so on. We decided that entity references should be simple leaf nodes in the tree of type ENTITYREF that point to the full entity definition in the DTD, and we also decided to provide helper functions like getText() for those clients who just want the fully expanded text. Parameter entities in the DTD are more difficult. Currently parameter entities are expanded by the parser and not represented in the Object Model. It is not clear whether we can ever represent parameter entities in the Object Model, or whether we would even want to.
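The two views of entity references described above can be sketched like this. It is an illustration under assumptions: the entityDefinitions table stands in for definitions held in the DTD, the node shapes are hypothetical, and the sketch does not handle entities whose replacement text itself contains further references.

```javascript
// Hypothetical entity table, standing in for the definitions held in the DTD.
const entityDefinitions = new Map([["company", "Example Corp."]]);

function textNode(text) { return { type: "TEXT", text, children: [] }; }

function entityRef(name) {
  // The leaf keeps the entity name and points at its definition, so an
  // authoring tool can still see where the reference occurs.
  return { type: "ENTITYREF", name, definition: entityDefinitions.get(name), children: [] };
}

// Helper in the spirit of getText(): returns fully expanded text for a subtree,
// so a script never has to look at ENTITYREF nodes.
function getText(node) {
  if (node.type === "TEXT") return node.text;
  if (node.type === "ENTITYREF") return node.definition === undefined ? "" : node.definition;
  return node.children.map(getText).join("");
}

// Usage: the tree keeps the reference, the helper hides it.
const titleNode = {
  type: "ELEMENT",
  tagName: "TITLE",
  children: [textNode("News from "), entityRef("company")],
};
const expanded = getText(titleNode);
```

Keeping the reference in the tree and doing the expansion in a helper lets both kinds of client share one structure.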

Namespaces were relatively simple since we already had an atomized Name object in the Java parser to represent all tag and attribute names in the document. We simply added a namespace field to these Name objects and support for parsing the namespace declarations, and we were done.
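Adding a namespace field to an atomized Name object can be sketched as below. This is only an illustration: the function names internName and resolveName are hypothetical, the "prefix:local" form and the prefix table are assumptions made for the example, and the declaration syntax of the 1997 namespace proposal is not modeled.

```javascript
const nameAtoms = new Map();

// Returns the single shared Name object for a (namespace, local name) pair.
function internName(namespace, localName) {
  const key = namespace + "\u0000" + localName;
  let name = nameAtoms.get(key);
  if (!name) {
    name = Object.freeze({ namespace, localName });
    nameAtoms.set(key, name);
  }
  return name;
}

// Resolves "prefix:local" against a hypothetical prefix-to-namespace table.
function resolveName(qualified, prefixes) {
  const i = qualified.indexOf(":");
  if (i < 0) return internName("", qualified);
  const prefix = qualified.slice(0, i);
  if (!prefixes.has(prefix)) throw new Error("undeclared prefix " + prefix);
  return internName(prefixes.get(prefix), qualified.slice(i + 1));
}

// Usage: two prefixes bound to the same namespace yield the same Name atom.
const prefixes = new Map([["a", "urn:example:cdf"], ["b", "urn:example:cdf"]]);
const viaA = resolveName("a:ITEM", prefixes);
const viaB = resolveName("b:ITEM", prefixes);
```

Because the namespace is part of the atom's key, names that differ only by prefix still compare equal by identity, and names in different namespaces never do.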

The parsers are small and fast. MSXML in C++ with full international character support is less than 100K and the MSXML Java Parser is 127K.

Using the Object Model to Process XML Data

To illustrate how the Object Model can be used to do interesting things we will show you a small example based on the CDF data we saw earlier in Example 1. Example 2 shows how to walk the XML Object Model to find out the INTERVALTIME of the scheduled event.

Notice that the GetInterval() method uses a small fixed set of objects and methods to manipulate the XML data that is independent of display-oriented things like HTML. As long as the CDF DTD (or schema) stays relatively fixed, this script code will work on any CDF file. In other words, this is robust enough to build Web-based business applications.
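The kind of tree walk discussed here can be sketched in plain JavaScript. This is not the article's GetInterval() script and does not use the MSXML object model: the tree is built from ordinary objects, and the CHANNEL/SCHEDULE/INTERVALTIME nesting, the DAY attribute, and the function names findElement and getIntervalDays are assumptions made for illustration.

```javascript
// Hypothetical stand-in for a parsed CDF tree; the nesting and the DAY
// attribute are assumptions, not a copy of the article's Example 1.
const cdfRoot = {
  tagName: "CHANNEL", attributes: {}, children: [
    { tagName: "TITLE", attributes: {}, children: [] },
    { tagName: "SCHEDULE", attributes: {}, children: [
      { tagName: "INTERVALTIME", attributes: { DAY: "1" }, children: [] },
    ] },
  ],
};

// Depth-first search for the first element with the given tag name.
function findElement(node, tagName) {
  if (node.tagName === tagName) return node;
  for (const child of node.children) {
    const found = findElement(child, tagName);
    if (found) return found;
  }
  return null;
}

// Returns the interval in days, or null when no INTERVALTIME is present.
function getIntervalDays(root) {
  const interval = findElement(root, "INTERVALTIME");
  return interval ? Number(interval.attributes.DAY) : null;
}

const days = getIntervalDays(cdfRoot);
```

The walk touches only tag names, attributes, and children, so it depends on the structure of the data and on nothing related to display.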

Conclusion

When we chose XML to encode CDF files, we were a little bit anxious. XML had just been created--even though Microsoft co-created the W3C XML Working Group in July 1996, it was as new to us as to anyone else. In addition, launching "channels"--the first broad, public application of metadata--by using an untried standard was risky. A few months later (as of this writing in August 1997), we know that we have made the right choice.

The flexibility and ease of use of a text format for representing and exchanging structured information have been demonstrated. CDF is now widely used by the industry's leading content providers, Web and Java authoring tool vendors, and "push" developers (such as PointCast, AirMedia, and BackWeb). Multiple tools have been developed to produce CDF files. Because it is a simple text-based format, tools are easily developed to generate and process it. XML helped make CDF successful.

Now a set of XML-enabling technologies, including C++ and Java parsers with their Object Models, is shipping in Internet Explorer 4.0. Because IE 4.0 will be integrated into Windows 98, there will be an XML parser on each desktop--another step toward the vision of making structured data an integral part of the Web.

At Microsoft, we strongly believe that XML is the standard, extensible, universal data format for the Internet. It is simple and easily authored. It is based on international standards that have been tested for many years. It is enormously extensible. It is flexible enough to allow representation of an incredibly wide range of information, and it also allows this information to be self-describing, so that structured data expressed in XML may be manipulated by software that doesn't have previous knowledge of the underlying meaning behind the data. XML provides a file format for representing data and can be extended to contain a description of its own structure. It is a means of formatting data and also a mechanism for extending and annotating standard HTML.

With its powerful expressiveness and flexibility, XML promises to add structure to data on the Internet, bringing the Web one step closer to realizing the potential for universal communication with anyone, anywhere.

About the Authors

Jean Paoli is a Product Manager on the Internet Explorer 4.0 team, where he manages the XML and databinding effort. Prior to joining Microsoft in May 1996, he was the technical director of GRIF S.A., a leader in the creation of SGML authoring tools. Jean has a strong background in SGML and designed many systems for major corporations in which SGML's approach to structuring and storing information ensured the long life and easy exchangeability of the data. Jean is a co-editor of the XML standard and co-created, with Jon Bosak and others, the W3C XML working group in July 1996.

Andrew Layman is a Senior Program Manager at Microsoft where he works on Internet and database technologies. Prior to joining Microsoft in 1992, he was a Vice President of Symantec Corporation and original author of the Time Line project management program.

Istvan Cseri is the technical architect of the XML project at Microsoft. Istvan designed the Java XML parser and is one of the co-authors of the Proposal for Extensible Style Language (XSL), which was recently submitted to the W3C. Istvan has a strong background in object-oriented frameworks and user interfaces. Prior to joining Microsoft, Istvan was at Borland, where he was one of the designers and developers of Quattro Pro for Windows.

Chris Lovett is one of the developer leads on the XML project at Microsoft. He has been working mostly on the Java XML parser reference release. He joined Microsoft in May of this year from a Silicon Valley startup company, where he was working on CD-ROM-quality multimedia delivery over the Web. Chris has a strong background in networking, communications, and user interface work from his time at Taligent and IBM's Santa Teresa Labs.

David Schach is a developer lead on XML in the Internet Explorer Group. He collaborated on the XML Object Model design and wrote the Microsoft XML Parser in C++. Currently, he is working on using XML as a style sheet language and is a co-author of the Proposal for Extensible Style Language (XSL), which was recently submitted to the W3C. He has a master's degree in computer science from the University of Pennsylvania and joined Microsoft in 1994.