HTML 5

A vocabulary and associated APIs
for HTML and XHTML

8.4 Serializing HTML fragments

The following steps form the HTML fragment
serialization algorithm. The algorithm takes as input a DOM
Element or Document, referred to as the node, and either returns a string or raises an
exception.

This algorithm serializes the children of the node
being serialized, not the node itself.

Let s be a string, and initialise it to the empty
string.

For each child node of the node, in tree order, run the following steps:

Let current node be the child node being
processed.

Append the appropriate string from the following list to s:

If current node is an Element

Append a U+003C LESS-THAN SIGN (<)
character, followed by the element's tag name. (For nodes created by
the HTML parser, Document.createElement(), or Document.renameNode(), the tag name will be
lowercase.)

For each attribute that the element has, append a U+0020 SPACE
character, the attribute's name (which, for attributes set by the HTML parser or by Element.setAttributeNode() or Element.setAttribute(), will be lowercase), a U+003D
EQUALS SIGN (=) character, a U+0022 QUOTATION
MARK (") character, the attribute's
value, escaped
as described below in attribute mode, and a second U+0022
QUOTATION MARK (") character.

While the exact order of attributes is UA-defined, and may depend
on factors such as the order that the attributes were given in the
original markup, the sort order must be stable, such that
consecutive invocations of this algorithm serialize an element's
attributes in the same order.

Append a U+003E GREATER-THAN SIGN (>)
character.

If current node is an area, base,
basefont, bgsound, br, col,
embed, frame,
hr, img, input, link, meta, param, spacer, or
wbr element, then continue on to the next child node at
this point.

If current node is a pre, textarea, or
listing element, append a U+000A LINE FEED (LF)
character.

Append the value of running the HTML
fragment serialization algorithm on the current
node element (thus recursing into this algorithm for that
element), followed by a U+003C LESS-THAN SIGN (<) character, a U+002F SOLIDUS (/) character, the element's tag name again, and
finally a U+003E GREATER-THAN SIGN (>)
character.

If current node is a Text or CDATASection node

If one of the ancestors of current node is a
style, script, xmp, iframe, noembed,
noframes, noscript, or plaintext
element, then append the value of current node's
data DOM attribute literally.

Otherwise, append the value of current node's
data DOM attribute, escaped as described below.

If current node is a ProcessingInstruction

Append the literal string <? (U+003C LESS-THAN
SIGN, U+003F QUESTION MARK), followed by the value of current node's target DOM
attribute, followed by a single U+0020 SPACE character, followed by
the value of current node's data DOM attribute, followed by a single U+003E
GREATER-THAN SIGN character ('>').

Other node types (e.g. Attr) cannot occur as
children of elements. If, despite this, they somehow do occur, this
algorithm must raise an INVALID_STATE_ERR exception.

The result of the algorithm is the string s.
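The steps above can be sketched in Python as follows. This is a minimal illustration, not a conforming implementation: the Element, Text and ProcessingInstruction classes, the escape helper and the serialize_children function are hypothetical names introduced here, attributes are kept as a list of pairs so that their order is stable, and text outside the listed raw-text elements is escaped in non-attribute mode. Node types whose handling is not described in this section are treated as "other" nodes, and the INVALID_STATE_ERR exception is modelled with ValueError.

```python
# Tag names taken from the lists in the algorithm text.
VOID = {"area", "base", "basefont", "bgsound", "br", "col", "embed", "frame",
        "hr", "img", "input", "link", "meta", "param", "spacer", "wbr"}
RAW_TEXT = {"style", "script", "xmp", "iframe", "noembed",
            "noframes", "noscript", "plaintext"}
LEADING_LF = {"pre", "textarea", "listing"}


class Element:  # hypothetical stand-in for a DOM Element
    def __init__(self, tag, attrs=None, children=()):
        self.tag = tag
        self.attrs = list(attrs or [])  # (name, value) pairs in a stable order
        self.parent = None
        self.children = []
        for child in children:
            child.parent = self
            self.children.append(child)


class Text:  # hypothetical stand-in for a DOM Text node
    def __init__(self, data):
        self.data, self.parent = data, None


class ProcessingInstruction:  # hypothetical stand-in
    def __init__(self, target, data):
        self.target, self.data, self.parent = target, data, None


def escape(text, attribute_mode=False):
    # "&" is replaced first so the entities added afterwards are not re-escaped.
    text = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    text = text.replace("\u00a0", "&nbsp;")
    if attribute_mode:
        text = text.replace('"', "&quot;")
    return text


def has_raw_text_ancestor(node):
    ancestor = node.parent
    while ancestor is not None:
        if ancestor.tag in RAW_TEXT:
            return True
        ancestor = ancestor.parent
    return False


def serialize_children(node):
    """Serialize the children of node, not node itself."""
    s = ""
    for current in node.children:
        if isinstance(current, Element):
            s += "<" + current.tag
            for name, value in current.attrs:
                s += " " + name + '="' + escape(value, attribute_mode=True) + '"'
            s += ">"
            if current.tag in VOID:
                continue  # no content and no end tag
            if current.tag in LEADING_LF:
                s += "\n"
            s += serialize_children(current) + "</" + current.tag + ">"
        elif isinstance(current, Text):
            raw = has_raw_text_ancestor(current)
            s += current.data if raw else escape(current.data)
        elif isinstance(current, ProcessingInstruction):
            s += "<?" + current.target + " " + current.data + ">"
        else:
            # Stand-in for raising INVALID_STATE_ERR.
            raise ValueError("INVALID_STATE_ERR: unexpected node type")
    return s
```

Calling serialize_children on a div that holds a p with a title attribute and a br child yields the p start tag, its escaped content, a bare br start tag, and the p end tag, without any markup for the div itself.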

Escaping a string (for the purposes of the
algorithm above) consists of replacing any occurrences of the "&" character by the string "&amp;", any occurrences of the "<" character by the string "&lt;", any occurrences of the ">" character by the string "&gt;", any occurrences of the U+00A0 NO-BREAK SPACE
character by the string "&nbsp;", and, if the
algorithm was invoked in the attribute mode, any occurrences of the
U+0022 QUOTATION MARK (") character by the string "&quot;".
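As an illustration, the escaping rule can be written as a single-pass character translation, which sidesteps the question of replacement order (a chained replace would have to handle "&" first). The function name escape_string and the attribute_mode parameter are assumptions made for this sketch.

```python
# Characters replaced in every mode.
_TEXT_TABLE = str.maketrans({
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    "\u00a0": "&nbsp;",  # U+00A0 NO-BREAK SPACE
})

# Attribute mode additionally replaces U+0022 QUOTATION MARK.
_ATTR_TABLE = str.maketrans({
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    "\u00a0": "&nbsp;",
    '"': "&quot;",
})


def escape_string(text: str, attribute_mode: bool = False) -> str:
    """Escape text for serialization; each input character is visited once."""
    table = _ATTR_TABLE if attribute_mode else _TEXT_TABLE
    return text.translate(table)
```

Because each input character is mapped exactly once, a string that already looks like an entity reference, such as "&lt;", is escaped again rather than passed through.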

Entity reference nodes are assumed to be expanded by the user agent,
and are therefore not covered in the algorithm above.

It is possible that the output of this algorithm, if parsed
with an HTML parser, will not return the original
tree structure. For instance, if a textarea element to which
a Comment node has been appended is serialized and
the output is then reparsed, the comment will end up being displayed in
the text field. Similarly, if, as a result of DOM manipulation, an element
contains a comment that contains the literal string "-->", then when the result of serializing the element
is parsed, the comment will be truncated at that point and the rest of the
comment will be interpreted as markup. Other examples include a
script element that contains a text node
with the text string "</script>", or a p element that contains a ul element (as the ul
element's start tag would imply the
end tag for the p).
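The comment case can be tried out with a rough sketch like the one below. It assumes that a comment is written out as "<!--", its data, then "-->", and it uses Python's html.parser module, which is not a conforming HTML parser, purely to show where the comment ends on reparsing. The Collector class and the reparse_comment function are hypothetical names.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Records the comments and character data seen while parsing."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.comments = []
        self.text = []

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_data(self, data):
        self.text.append(data)


def reparse_comment(comment_data):
    # Assumed serialized form of a comment node: <!-- data -->
    serialized = "<!--" + comment_data + "-->"
    parser = Collector()
    parser.feed(serialized)
    parser.close()
    return parser.comments, "".join(parser.text)
```

With comment data of "a-->b", the reparsed comment holds only "a", and the remainder is no longer part of any comment.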

8.5 Parsing HTML fragments

The following steps form the HTML fragment
parsing algorithm. The algorithm takes as input a DOM
Element, referred to as the context
element, which gives the context for the parser, as well as input, a string to parse, and returns a list of zero or
more nodes.

Parts marked fragment case in
algorithms in the parser section are parts that only occur if the parser
was created for the purposes of this algorithm. The algorithms have been
annotated with such markings for informational purposes only; such
markings have no normative weight. If it is possible for a condition
described as a fragment case to occur even when
the parser wasn't created for the purposes of handling this algorithm,
then that is an error in the specification.

The parser will reference the context
element as part of that algorithm.

Set the parser's form element
pointer to the nearest node to the context
element that is a form element (going straight up the
ancestor chain, and including the element itself, if it is a
form element), or, if there is no such form
element, to null.
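The ancestor walk in this step could look roughly like the following. The Element class here is a hypothetical stand-in with only a tag name and a parent reference, nearest_form is a name invented for this sketch, and None plays the role of null.

```python
class Element:  # hypothetical stand-in for a DOM Element
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent


def nearest_form(context_element):
    """Return the closest form element, starting at the context element
    itself and going straight up the ancestor chain, or None."""
    node = context_element
    while node is not None:
        if node.tag == "form":
            return node
        node = node.parent
    return None
```

For a context element that is an input nested inside a div inside a form, the walk returns that form; for a context element with no form ancestor it returns None.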