Design considerations for a Gnome DOM

Raph Levien, 28 March 1999

Here are some thoughts regarding defining and implementing DOM
interfaces for Gnome, as part of the world
DOMination project.

Basically, the problem we are trying to solve is to make the DOM
an attractive interface for Gnome applications to store their state.
The primary cost is forcing applications to modify and access their
state through a fairly narrow interface, as well as bloated memory
requirements for the state itself. The primary
advantage is that providing a unified framework should allow mixing of
components within an application, as well as providing generic
services for such things as remote access to state, loading and saving
of state, undo, and many important parts of a plugin mechanism.

Currently, there is a gnome-dom
codebase in Gnome CVS. I am concerned that this codebase and the
interfaces it defines may be inadequate for the uses I am considering.
The primary shortcomings are:

- Memory representation on the order of 10x the size of the original
  XML document.

- No mechanism (currently) for notifying the application of changes
  within the tree. It should be noted that the W3C has a working draft
  of a Level 2 DOM specification that has support for adding listeners
  to events such as mutations of the tree.

Implementing a lightweight, yet fully standards compliant DOM
seems to be challenging. Here I will outline some of my ideas about
how it might be done.

Nodes and node handles

The DOM spec
is largely organized around a Node data structure, with a number of
traversal methods defined (parent, first child, next sibling, etc.).
One principle of the DOM is that these Nodes are persistent across
modifications of the tree. For example, a Node pointing to the first
child of the second child of the root will, after the first child of
the root is deleted, point to the first child of the (new) first
child.

When implementing a DOM that is accessible over CORBA IIOP, remote
references to Nodes will always be through a "node handle" data
structure. There is one specific challenge here: once it hands out a
handle, the server does not know if or when the client "forgets" it.
Thus, node handles can't be represented as simply as "1st child of 2nd
child of root", as they would be invalidated by tree changes. How
might node handles be implemented?

The first approach, and the one seemingly chosen by gnome-dom, is
that node handles correspond to pointers to allocated memory. By
storing links to parents, children, and siblings, all of the traversal
methods can be trivially implemented. Further, these node handles
easily survive changes to the tree.

However, this type of node handle is not adequate to fully
implement the DOM. For one, the DOM provides a "getElementsByTagName"
method that returns a NodeList of all descendant elements with the
given tag name. This NodeList also tracks changes to the tree, i.e. if
a node is deleted, the size of the NodeList decreases by one. It is
not practical to use a pointer to allocated memory as a node handle
for this NodeList. For one, the server would never know when this
memory might be safely deallocated.

Thus, a second style of node handle comes into focus: a string
description of a _procedure_ that might retrieve the value. For
example, el.getElementsByTagName("img") might return a node handle
that is the concatenation of el's node handle and the string "img".
Then, when this node handle is queried for its length, it can traverse
the tree starting at el and count the number of <img> descendants.
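As a sketch of the work such a procedure-style handle implies, here is
the traversal a NodeList handle would redo on each length query. The
node structure and names are mine, purely illustrative, not taken from
gnome-dom:

```c
#include <string.h>

/* Hypothetical minimal node structure; names are illustrative only. */
typedef struct DemoNode DemoNode;
struct DemoNode {
    const char *tag;          /* element tag name, e.g. "img" */
    DemoNode *first_child;
    DemoNode *next_sibling;
};

/* Count descendants of `el` with the given tag name -- the traversal a
   procedure-style getElementsByTagName handle performs when queried. */
static int count_by_tag(const DemoNode *el, const char *name)
{
    int n = 0;
    for (const DemoNode *c = el ? el->first_child : 0; c; c = c->next_sibling) {
        if (c->tag && strcmp(c->tag, name) == 0)
            n++;
        n += count_by_tag(c, name);
    }
    return n;
}
```

Because nothing is stored, the count is always correct after a
mutation; the cost is simply repeating the traversal.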

My approach here is that this type of node handle can be extended
to vanilla tree traversal as well, opening up the possibility of a
very compact representation.

Here is one concrete proposal for a compact data type. I do not
claim it is in any way optimal.

A top-level document consists of a number of chunks. A chunk
contains some number of nodes. Further, chunks can hold references to
other chunks. The overall structure is quite reminiscent of a B-tree.
Further, tag and attribute names are interned in a global table and
can be referred to by an index. (i.e. once a name is interned, it can
never be deleted from the document representation).

References to other chunks and character data nodes are handled
similarly. There are also a few spare bits we can use (using the Lisp
trick of 4-byte alignment and using the bottom bits as tag info).
These bits will distinguish between the various node types and also
indicate whether an element is the last child of its parent.
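The alignment trick can be sketched in a few lines of C. The
particular bit assignments below are illustrative guesses, not a
commitment:

```c
#include <assert.h>
#include <stdint.h>

/* The "Lisp trick": chunk data is 4-byte aligned, so the bottom two
   bits of a reference word are free to carry tag information.  These
   bit assignments are illustrative only. */
#define NODE_IS_TEXT    1u  /* bit 0: character data vs. element */
#define NODE_LAST_CHILD 2u  /* bit 1: last child of its parent */

static uintptr_t tag_ref(void *p, unsigned bits)
{
    assert(((uintptr_t)p & 3) == 0);   /* relies on 4-byte alignment */
    return (uintptr_t)p | (bits & 3);
}

static void *ref_ptr(uintptr_t ref)
{
    return (void *)(ref & ~(uintptr_t)3);
}

static unsigned ref_bits(uintptr_t ref)
{
    return (unsigned)(ref & 3);
}
```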

Conceptually, DOM traversal operations can be implemented by
searching the entire document for the node id. This is, of course,
hideously inefficient in the general case. Thus, I propose that the
DOM implementation contains a cache that maps node id's to
physical locations. Only when there is a miss in the cache does the
tree need to be searched to find an id. The first child and next
sibling operations can be done in constant time given a physical
location - the result of these operations is also placed in the cache.
Thus, a preorder traversal of the tree is likely to always hit in the
cache, even if the cache is quite small.
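A direct-mapped cache is probably the simplest realization of this
idea. The following sketch (names and sizes are mine) maps node ids to
opaque physical locations, returning a miss when the caller must fall
back to searching the tree:

```c
#include <stdint.h>

/* A tiny direct-mapped cache from node ids to physical locations
   (opaque pointers here).  Size and layout are illustrative. */
#define CACHE_SIZE 64   /* must be a power of two */

typedef struct {
    int32_t id;         /* -1 marks an empty slot */
    void *loc;
} CacheEntry;

typedef struct {
    CacheEntry slot[CACHE_SIZE];
} NodeCache;

static void cache_init(NodeCache *c)
{
    for (int i = 0; i < CACHE_SIZE; i++)
        c->slot[i].id = -1;
}

static void cache_put(NodeCache *c, int32_t id, void *loc)
{
    CacheEntry *e = &c->slot[id & (CACHE_SIZE - 1)];
    e->id = id;         /* evict whatever occupied the slot */
    e->loc = loc;
}

/* Returns the cached location, or NULL on a miss (the caller then
   searches the document for the id). */
static void *cache_get(NodeCache *c, int32_t id)
{
    CacheEntry *e = &c->slot[id & (CACHE_SIZE - 1)];
    return e->id == id ? e->loc : 0;
}

/* On tree mutation, affected ids are invalidated. */
static void cache_invalidate(NodeCache *c, int32_t id)
{
    CacheEntry *e = &c->slot[id & (CACHE_SIZE - 1)];
    if (e->id == id)
        e->id = -1;
}
```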

Mutations of the tree also invalidate entries in the cache, as
appropriate.

Additionally, there are quite a few "nodes" in the DOM interface
that don't correspond directly to a node in the physical
representation. For example, you can request an Attr node of an
element. The node handle for this node would be a concatenation of the
node id of the element and the attrName id of the attribute. This node
has a child which is (usually) a Text node containing the attribute
value. The id of this node is the same as the node handle for the
attribute, plus a flag indicating that it's the child.

I believe this representation satisfies the requirements of being
reasonably compact, while also allowing a 100% standards-compliant DOM
interface on the back-end, and can be implemented without memory
leaks.

Thus, I suggest that the interface be designed to handle a
representation of node handles that includes both the traditional
node-pointer approach and these id-based node handles.

For the C language implementation of the DOM interface, a node
handle should include both a pointer to the enclosing Document
structure and also the various nodes, ids, and flags required. Memory
for the node handle should be allocated on the client side. Thus,
methods returning a node handle should have a pointer to the node
handle structure passed in as a "result" argument.

The B-tree representation requires the following layout for node handles:

    Document *   the enclosing document
    int32        node_id: the node id of the node itself
    int32        attrName_id: the attribute name for Attr nodes of elements
    boolean      to distinguish between the Attr node and its Text child

A node representation can fit Nodes into the following structure:

    Document *   the enclosing document
    Node *       a pointer to the node

However, NodeLists (the result of a getElementsByTagName invocation) require this layout:

    Document *   the enclosing document
    Node *       a pointer to the node
    int32        tagName_id: the id of the tagName (or maybe a pointer)

Thus, it seems to me that three pointers should be adequate to
represent node handles (the extra bit for the boolean can be snuck in a
number of different ways).
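One way of fitting all three layouts into the three-pointer budget is
a tagged union. This is only a sketch of the idea; the field names are
mine, and the boolean is folded in as suggested above:

```c
#include <stdint.h>

/* Opaque here; stand-ins for the real document and node structures. */
typedef struct Document Document;
typedef struct Node Node;

/* One possible C layout covering all three node-handle variants.
   Names are illustrative. */
typedef struct {
    Document *doc;                 /* the enclosing document */
    union {
        struct {                   /* id-based (B-tree) handles */
            int32_t node_id;
            int32_t attr_name_id;  /* the extra boolean could be snuck
                                      into a bit of this field */
        } id;
        struct {                   /* pointer-based handles */
            Node *node;
            int32_t tag_name_id;   /* only used for NodeList handles */
        } ptr;
    } u;
} NodeHandle;
```

A caller allocates a NodeHandle itself and passes its address as the
"result" argument, as described above.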

Prototype

I have a small bit of working code implementing some of these
ideas. This code was written strictly for the purposes of exploring
the XML DOM implementation. This code uses a simple node-based
representation, but attempts to use interfaces that should allow for alternate representations.

Custom representations

One topic that's come up (thanks Havoc!) is the possibility of
custom representations for parts of the tree. For example, if the data
structure is an array of doubles, it makes sense to implement it as an
array of doubles. It may be possible for people to write their
own custom implementations of the DOM interface that internally use
much more compact, streamlined representations of the data. Further,
applications may use their own specialized access methods to get at
this data (i.e. directly indexing the array to get a double, rather
than using DOM traversal methods followed by an ASCII-to-double
conversion of the stored text).
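To make the contrast concrete, here is a hypothetical custom
representation (all names are mine) with both a direct access path and
a DOM-style textual one:

```c
#include <stdio.h>

/* Hypothetical custom representation: a node that is internally just
   an array of doubles. */
typedef struct {
    int len;
    double *val;
} DoubleArrayNode;

/* Direct, application-specific access: index straight into the array. */
static double darray_get(const DoubleArrayNode *a, int i)
{
    return a->val[i];
}

/* The generic DOM path would instead see the value as stored text,
   roughly the inverse of the ASCII-to-double conversion mentioned
   above. */
static int darray_item_text(const DoubleArrayNode *a, int i,
                            char *buf, int buflen)
{
    return snprintf(buf, buflen, "%g", a->val[i]);
}
```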

Because of the need for node handles to be persistent across
mutation of the tree and other considerations, making these
implementations conform 100% to the DOM interface is probably not
realistic. The way I'd suggest handling this is to allow the interface
between the custom interface and the main DOM engine to implement a
subset of the DOM interface. When a DOM operation occurs that is not
supported by the custom implementation, the DOM engine will transfer
the data to its own generic implementation (preserving the node
handles), and use the generic DOM operations from there on out.

It may be worth looking at "compromise" approaches to such custom
representations. For example, if the array of doubles additionally
stored a node id (a 50% increase in memory requirements), then the
same cache technique mentioned above could be used, and insertions and
deletions of the array would be possible even after node handles had
been given out to clients.

I'm not sure to what extent these custom representations will be
useful. However, it's probably worth not ruling them out by poor
choices of interface now.

Non-standard approaches

Much of the complexity and difficulty of the DOM approach is a
consequence of trying to be standards compliant. It may be worth
exploring a nonstandard approach, if for no other reason than to
decide that the benefits of the standards outweigh their costs. In
this section, I briefly outline what might be done, and the technical
advantages of these approaches over the standards.

First, XML can be quite significantly simplified. James Clark has
already defined a Canonical XML
specification that shows what
might be done. Canonical XML is probably a bit too extreme in its
quest for simplicity, but I do think it would be possible to eliminate
entities (other than the built-in entities needed to escape
metacharacters), parameter entities, default attribute values, DTDs,
notations, processing instructions, CDATA sections, and encodings
other than UTF-8 without losing any significant functionality for
applications. I
believe a full-fledged and high performance parser can be written for
this subset of XML in about 1000 lines of C. In fact, many "XML
parsers" (including python's xmllib)
implement just this subset or one very similar. There is no need to
make the syntax incompatible with standard XML, so interoperability
would still be preserved in most cases.

The key difficulty of the DOM, I think, is the requirement that
node handles be persistent across tree mutation. An approach allowing
much more lightweight DOM implementations would be to represent
handles statelessly, probably as a sequence of navigation instructions
from the root. Further, the standard DOM operations are probably too
fine-grained in some cases (there is no need to have separate nodes
for attributes and their values, imho).

Stateless node handles would allow for easy implementation of
custom representations such as arrays of doubles. I don't think there
are any serious technical disadvantages. Yes, node handles would
potentially be a bit bigger (esp. for very deep trees), and yes, most
DOM query operations would be proportional to depth rather than being
implementable in constant time, but I think realistically most trees
are shallow, so these are not serious concerns.
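A stateless handle of this kind might simply record the child index at
each level. The following sketch (names, and the fixed maximum depth,
are my own simplifications) resolves such a handle by walking from the
root, at cost proportional to depth:

```c
/* Minimal node structure for the sketch; names are illustrative. */
typedef struct PNode PNode;
struct PNode {
    PNode *first_child;
    PNode *next_sibling;
};

/* A stateless node handle: the child index taken at each level on the
   way down from the root.  Fixed-size for brevity. */
typedef struct {
    int depth;
    int step[8];
} StatelessHandle;

/* Follow the recorded path from the root; returns NULL if the path no
   longer exists (e.g. after a tree mutation). */
static PNode *resolve(PNode *root, const StatelessHandle *h)
{
    PNode *n = root;
    for (int i = 0; i < h->depth && n; i++) {
        n = n->first_child;
        for (int j = 0; j < h->step[i] && n; j++)
            n = n->next_sibling;
    }
    return n;
}
```

Note that no server-side state is held per handle: a mutation can make
a handle dangle (resolve returns NULL), but it can never leak memory.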

Apps would need to be organized a little differently to deal with
stateless node handles. But I think this is a question of "different",
not necessarily better or worse. An app that truly required a stateful
node handle could fake it by using event listeners to keep track of
tree mutations, and maintain the needed state on the client side of
the CORBA interface.

One additional limitation of the DOM is that in a CORBA
implementation, it would tend to be quite latency limited for the
traversal of large trees: each round trip would yield only
one node of the tree. A workaround may be to use threads to issue
multiple parallel DOM requests, but even this workaround is clearly
not going to yield optimal performance.

Thus, a perfectly reasonable approach is to implement methods for
retrieving and updating using simple flattened representations of big
chunks of the tree at a time. It's worth noting that this could also
be implemented as an extension to the standards-compliant DOM, i.e. we
don't have to use a nonstandard base to get this advantage.

A decision to roll our own rather than use existing standards is
not to be taken lightly. It does seem to me that many design decisions
of the DOM were made in a totally different context (i.e. easy
scripting of Web pages) than the one we're trying to apply it in
(i.e. a framework for high performance applications written in C or
one of its variants), so the extra costs are also not to be taken
lightly. I don't think we'll truly be able to decide these issues
without a working prototype.