8 The HTML syntax

This section only describes the rules for resources labeled with an HTML
MIME type. Rules for XML resources are discussed in the section below entitled "The
XHTML syntax".

8.1 Writing HTML documents

This section only applies to documents, authoring tools, and markup generators. In
particular, it does not apply to conformance checkers; conformance checkers must use the
requirements given in the next section ("parsing HTML documents").

Space characters before the root html element, and space characters at the start
of the html element and before the head element, will be dropped when
the document is parsed; space characters after the root html element will
be parsed as if they were at the end of the body element. Thus, space characters
around the root element do not round-trip.

It is suggested that newlines be inserted after the DOCTYPE, after any comments that are
before the root element, after the html element's start tag (if it is not omitted), and after any comments that are inside the
html element but before the head element.

Many strings in the HTML syntax (e.g. the names of elements and their attributes) are
case-insensitive, but only for uppercase ASCII letters and lowercase ASCII
letters. For convenience, in this section this is just referred to as
"case-insensitive".

8.1.1 The DOCTYPE

A DOCTYPE is a
required preamble.

DOCTYPEs are required for legacy reasons. When omitted, browsers tend to use a
different rendering mode that is incompatible with some specifications. Including the DOCTYPE in a
document ensures that the browser makes a best-effort attempt at following the relevant
specifications.

For the purposes of HTML generators that cannot output HTML markup with the short DOCTYPE
"<!DOCTYPE html>", a DOCTYPE legacy string may be inserted
into the DOCTYPE (in the position defined above). This string must consist of:

The contents of the element must be placed after the start tag
(which might be implied, in certain cases) and before the end tag
(which, again, might be implied in certain cases).
The exact allowed contents of each individual element depend on
the content model of that element, as described earlier in
this specification. Elements must not contain content that their content model disallows. In
addition to the restrictions placed on the contents by those content models, however, the five
types of elements have additional syntactic requirements.

Void elements can't have any contents (since there's no end tag, no content can be
put between the start tag and the end tag).

The innermost element, cdr:license, is actually in the SVG namespace, as
the "xmlns:cdr" attribute has no effect (unlike in XML). In fact, as the
comment in the fragment above says, the fragment is actually non-conforming. This is because the
SVG specification does not define any elements called "cdr:license" in the
SVG namespace.

Tags contain a tag name, giving the element's name. HTML
elements all have names that only use alphanumeric ASCII characters. In the HTML
syntax, tag names, even those for foreign elements, may be written with any mix of
lower- and uppercase letters that, when converted to all-lowercase, matches the element's tag
name; tag names are case-insensitive.
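The matching rule above can be sketched in Python. This is only an illustration: the helper names ascii_lower and ascii_case_insensitive_match are hypothetical and not defined by this specification. The point it shows is that only the letters A to Z are folded, so characters outside ASCII never compare equal to an ASCII letter, even where a general Unicode lowercase mapping would make them so.

```python
import string

# Translation table that lowercases only A-Z and leaves every other
# character (including non-ASCII letters) untouched.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(s: str) -> str:
    """Convert uppercase ASCII letters to lowercase; hypothetical helper."""
    return s.translate(_ASCII_LOWER)


def ascii_case_insensitive_match(a: str, b: str) -> bool:
    """Compare two strings, ignoring case for ASCII letters only."""
    return ascii_lower(a) == ascii_lower(b)
```

For instance, "DIV", "Div", and "div" all match one another under this comparison, while U+212A KELVIN SIGN does not match "k", although Python's general str.lower() would map it to "k".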

8.1.2.1 Start tags

Start tags must have the following format:

The first character of a start tag must be a "<" (U+003C) character.

The next few characters of a start tag must be the element's tag name.

If there are to be any attributes in the next step, there must first be one or more space characters.

Then, the start tag may have a number of attributes, the syntax for which is described below. Attributes must be
separated from each other by one or more space
characters.

8.1.2.3 Attributes

Attributes for an element are expressed inside the
element's start tag.

Attributes have a name and a value. Attribute names
must consist of one or more characters other than the space
characters, U+0000 NULL, U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), ">" (U+003E), "/" (U+002F), and "=" (U+003D) characters, the control
characters, and any characters that are not defined by Unicode. In the HTML syntax, attribute
names, even those for foreign elements, may be written with any mix of lower- and
uppercase letters that are an ASCII case-insensitive match for the attribute's
name.

In the following example, the disabled attribute is
given with the empty attribute syntax:

<input disabled>

If an attribute using the empty attribute syntax is to be followed by another attribute, then
there must be a space character separating the two.

Unquoted attribute value syntax

The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN
character, followed by zero or more space characters,
followed by the attribute value, which, in addition
to the requirements given above for attribute values, must not contain any literal space characters, any U+0022 QUOTATION MARK characters ("),
U+0027 APOSTROPHE characters ('), "=" (U+003D) characters, "<" (U+003C) characters, ">" (U+003E) characters, or U+0060 GRAVE ACCENT characters
(`), and must not be the empty string.

In the following example, the value attribute is given
with the unquoted attribute value syntax:

<input value=yes>

If an attribute using the unquoted attribute syntax is to be followed by another attribute or
by the optional "/" (U+002F) character allowed in step 6 of the start tag syntax above, then there must be a space
character separating the two.

Single-quoted attribute value syntax

The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN
character, followed by zero or more space characters,
followed by a single "'" (U+0027) character, followed by the attribute value, which, in addition to the requirements
given above for attribute values, must not contain any literal "'" (U+0027) characters,
and finally followed by a second single "'" (U+0027) character.

In the following example, the type attribute is given
with the single-quoted attribute value syntax:

<input type='checkbox'>

If an attribute using the single-quoted attribute syntax is to be followed by another
attribute, then there must be a space character separating the two.

Double-quoted attribute value syntax

The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN
character, followed by zero or more space characters,
followed by a single """ (U+0022) character, followed by the attribute value, which, in addition to the requirements
given above for attribute values, must not contain any literal U+0022 QUOTATION MARK characters
("), and finally followed by a second single """ (U+0022) character.

In the following example, the name attribute is given with
the double-quoted attribute value syntax:

<input name="be evil">

If an attribute using the double-quoted attribute syntax is to be followed by another
attribute, then there must be a space character separating the two.
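As a rough illustration of how a markup generator might choose among the four syntaxes above, the following Python sketch picks the least verbose syntax whose restrictions a given value satisfies. It rests on several assumptions: the function name pick_attribute_syntax and its return labels are hypothetical; the space characters are taken to be U+0009, U+000A, U+000C, U+000D, and U+0020; the empty attribute syntax is taken to represent the empty string; and the general requirements on attribute values (such as those concerning ampersands) are not checked here.

```python
# Assumed set of "space characters": tab, LF, FF, CR, and space.
SPACE_CHARACTERS = "\t\n\f\r "

# Characters that an unquoted attribute value must not contain.
FORBIDDEN_UNQUOTED = set(SPACE_CHARACTERS + "\"'=<>`")


def pick_attribute_syntax(value: str) -> str:
    """Return a label for the simplest syntax that can carry `value`.

    Hypothetical helper; only the per-syntax restrictions listed in
    this section are checked.
    """
    if value == "":
        return "empty"
    if not any(ch in FORBIDDEN_UNQUOTED for ch in value):
        return "unquoted"
    if '"' not in value:
        return "double-quoted"
    if "'" not in value:
        return "single-quoted"
    # Both quote characters occur, so at least one of them would have
    # to be written as a character reference.
    return "needs-character-references"
```

With this sketch, "yes" can be written unquoted, "be evil" requires quotes because of the space, and a value containing both kinds of quotation mark cannot be written literally in any of the quoted syntaxes.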

There must never be two or more attributes on the same start tag whose names are an ASCII
case-insensitive match for each other.
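A checker for the uniqueness requirement above could look roughly like the following Python sketch. The function name find_duplicate_attribute_names is hypothetical, and the input is assumed to be the list of attribute names exactly as written in one start tag.

```python
import string
from typing import List

# Lowercase only A-Z, as required for an ASCII case-insensitive comparison.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def find_duplicate_attribute_names(names: List[str]) -> List[str]:
    """Return the names that repeat an earlier name on the same start tag.

    Hypothetical helper: names are compared ASCII case-insensitively and
    the offending later occurrences are returned in document order.
    """
    seen = set()
    duplicates = []
    for name in names:
        key = name.translate(_ASCII_LOWER)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates
```

A start tag written with both "class" and "CLASS" would therefore be reported, since the two names differ only in ASCII case.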

When a foreign element has one of the namespaced
attributes given by the local name and namespace of the first and second cells of a row from the
following table, it must be written using the name given by the third cell from the same row.

Whether the attributes in the table above are conforming or not is defined by
other specifications (e.g. the SVG and MathML specifications); this section only describes the
syntax rules if the attributes are serialized using the HTML syntax.

8.1.2.4 Optional tags

Certain tags can be omitted.

Omitting an element's start tag in the
situations described below does not mean the element is not present; it is implied, but it is
still there. For example, an HTML document always has a root html element, even if
the string <html> doesn't appear anywhere in the markup.

An rb element's end tag may be omitted if the
rb element is immediately followed by an rb, rt,
rtc or rp element, or if there is no more content in the parent
element.

An rt element's end tag may be omitted if the
rt element is immediately followed by an rb, rt, rtc, or
rp element, or if there is no more content in the parent element.

An rtc element's end tag may be omitted if
the rtc element is immediately followed by an rb,
rtc or rp element, or if there is no more content in the parent
element.

An rp element's end tag may be omitted if the
rp element is immediately followed by an rb, rt,
rtc or rp element, or if there is no more content in the parent
element.

An optgroup element's end tag may be omitted
if the optgroup element is
immediately followed by another optgroup element, or if there is no more content in
the parent element.

An option element's end tag may be omitted if
the option element is immediately followed by another option element, or
if it is immediately followed by an optgroup element, or if there is no more content
in the parent element.

A colgroup element's start tag may be
omitted if the first thing inside the colgroup element is a col element,
and if the element is not immediately preceded by another colgroup element whose
end tag has been omitted. (It can't be omitted if the element
is empty.)

A tbody element's start tag may be omitted
if the first thing inside the tbody element is a tr element, and if the
element is not immediately preceded by a tbody, thead, or
tfoot element whose end tag has been omitted. (It
can't be omitted if the element is empty.)

A tbody element's end tag may be omitted if
the tbody element is immediately followed by a tbody or
tfoot element, or if there is no more content in the parent element.

A tfoot element's end tag may be omitted if
the tfoot element is immediately followed by a tbody element, or if
there is no more content in the parent element.

A tr element's end tag may be omitted if the
tr element is immediately followed by another tr element, or if there is
no more content in the parent element.

A td element's end tag may be omitted if the
td element is immediately followed by a td or th element,
or if there is no more content in the parent element.

A th element's end tag may be omitted if the
th element is immediately followed by a td or th element,
or if there is no more content in the parent element.

8.1.2.5 Restrictions on content models

For historical reasons, certain elements have extra restrictions beyond even the restrictions
given by their content model.

A table element must not contain tr elements, even though these
elements are technically allowed inside table elements according to the content
models described in this specification. (If a tr element is put inside a
table in the markup, it will in fact imply a tbody start tag before
it.)

A single newline may be placed immediately after the start tag of pre and textarea elements.
If the element's contents are intended to start with a newline,
two consecutive newlines thus need to be included by the author.

8.1.2.6 Restrictions on the contents of raw text and escapable raw text elements

The text in raw text and escapable raw text
elements must not contain any occurrences of the string "</"
(U+003C LESS-THAN SIGN, U+002F SOLIDUS) followed by characters that case-insensitively match the
tag name of the element followed by one of "tab" (U+0009), "LF" (U+000A), "FF" (U+000C), "CR" (U+000D), U+0020 SPACE, ">" (U+003E), or "/" (U+002F).
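The restriction above lends itself to a regular expression. The following Python sketch is one possible check, under the assumption that the tag name consists of ASCII characters only; the function name violates_raw_text_restriction is hypothetical.

```python
import re


def violates_raw_text_restriction(text: str, tag_name: str) -> bool:
    """Report whether `text` contains a forbidden "</tagname" sequence.

    Hypothetical helper: looks for "</", then the tag name (ASCII
    case-insensitively), then one of tab, LF, FF, CR, space, ">" or "/".
    """
    pattern = re.compile(
        r"</" + re.escape(tag_name) + r"[\t\n\f\r />]",
        # re.ASCII limits case-insensitive matching to ASCII letters.
        re.IGNORECASE | re.ASCII,
    )
    return pattern.search(text) is not None
```

For example, script text containing the literal string "</script>" inside a JavaScript string is flagged, whereas "</scripts>" is not, because the character following the tag name is not one of the listed terminators.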

8.1.3 Text

Text is allowed inside elements, attribute values, and comments.
Extra constraints are placed on what is and what is not allowed in text based on where the text is
to be put, as described in the other sections.

8.1.3.1 Newlines

Newlines in HTML may be represented either as "CR" (U+000D) characters, "LF" (U+000A) characters, or pairs of "CR" (U+000D), "LF" (U+000A) characters in that order.

Where character references are allowed, a character
reference of a "LF" (U+000A) character (but not a "CR" (U+000D) character)
also represents a newline.

8.1.4 Character references

In certain cases described in other sections, text may be
mixed with character references. These can be used to escape
characters that couldn't otherwise legally be included in text.

Character references must start with a U+0026 AMPERSAND character (&). Following this,
there are three possible kinds of character references:

Named character references

The ampersand must be followed by one of the names given in the named character
references section, using the same case. The name must be one that is
terminated by a ";" (U+003B) character.

Decimal numeric character reference

The ampersand must be followed by a "#" (U+0023) character, followed by one or more
ASCII digits, representing a base-ten integer that corresponds to a Unicode code
point that is allowed according to the definition below. The digits must then be followed by a
";" (U+003B) character.

Hexadecimal numeric character reference

The ampersand must be followed by a "#" (U+0023) character, which must be followed
by either a "x" (U+0078) character or a "X" (U+0058) character, which must then be followed by one or more ASCII hex digits,
representing a base-sixteen integer that corresponds to a Unicode code point that is allowed
according to the definition below. The digits must then be followed by a ";" (U+003B) character.

The numeric character reference forms described above are allowed to reference any Unicode code
point other than U+0000, U+000D, permanently undefined Unicode characters (noncharacters),
surrogates (U+D800–U+DFFF), and control characters other than space characters.
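The set of code points permitted in numeric character references can be expressed as a predicate. The Python sketch below makes some assumptions that are not spelled out in this section: control characters are taken to be U+0000 to U+001F and U+007F to U+009F, space characters are taken to be U+0009, U+000A, U+000C, U+000D, and U+0020, and noncharacters are taken to be U+FDD0 to U+FDEF plus every code point whose low 16 bits are FFFE or FFFF. The function names are hypothetical.

```python
# Assumed set of "space characters" as code points.
SPACE_CODE_POINTS = {0x09, 0x0A, 0x0C, 0x0D, 0x20}


def is_control(cp: int) -> bool:
    """Assumed definition: the C0 and C1 control ranges."""
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, or any code point ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_allowed_in_numeric_reference(cp: int) -> bool:
    """Hypothetical check for the rule on numeric character references."""
    if not 0 <= cp <= 0x10FFFF:
        return False  # not a Unicode code point at all
    if cp in (0x0000, 0x000D):
        return False  # explicitly excluded
    if is_noncharacter(cp) or 0xD800 <= cp <= 0xDFFF:
        return False  # noncharacters and surrogates
    if is_control(cp) and cp not in SPACE_CODE_POINTS:
        return False  # controls other than space characters
    return True
```

Under these assumptions a reference to U+000A is allowed, while references to U+000D, U+0001, U+0080, U+D800, and U+FFFE are not.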

8.1.5 CDATA sections

CDATA sections must consist of the following components, in
this order:

The string "<![CDATA[".

Optionally, text, with the additional restriction that the
text must not contain the string "]]>".

The string "]]>".

CDATA sections can only be used in foreign content (MathML or SVG). In this example, a CDATA
section is used to escape the contents of an ms element:

<p>You can add a string to a number, but this stringifies the number:</p>
<math>
<ms><![CDATA[x<y]]></ms>
<mo>+</mo>
<mn>3</mn>
<mo>=</mo>
<ms><![CDATA[x<y3]]></ms>
</math>

8.1.6 Comments

Comments must start with the four character sequence U+003C
LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS (<!--). Following this sequence, the comment may have text, with the additional restriction that the text must not start with
a single ">" (U+003E) character, nor start with a U+002D HYPHEN-MINUS character
(-) followed by a ">" (U+003E) character, nor contain two consecutive U+002D
HYPHEN-MINUS characters (--), nor end with a U+002D HYPHEN-MINUS character
(-). Finally, the comment must be ended by the three character sequence U+002D HYPHEN-MINUS,
U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN (-->).
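The restrictions on comment text can be checked mechanically before a comment is written out. The following Python sketch shows one way a markup generator might do so; the function names is_valid_comment_text and serialize_comment are hypothetical.

```python
def is_valid_comment_text(text: str) -> bool:
    """Check the restrictions on comment text listed above."""
    if text.startswith(">") or text.startswith("->"):
        return False  # must not start with ">" or "->"
    if "--" in text:
        return False  # must not contain two consecutive hyphens
    if text.endswith("-"):
        return False  # must not end with a hyphen
    return True


def serialize_comment(text: str) -> str:
    """Wrap valid comment text in "<!--" and "-->"; hypothetical helper."""
    if not is_valid_comment_text(text):
        raise ValueError("text is not allowed inside a comment")
    return "<!--" + text + "-->"
```

An empty comment is permitted by these rules, so serialize_comment("") produces "<!---->", whereas text such as "a -- b" is rejected.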

8.2 Parsing HTML documents

This section only applies to user agents, data mining tools, and conformance
checkers.

The rules for parsing XML documents into DOM trees are covered by the next
section, entitled "The XHTML syntax".

User agents must use the parsing rules described in this section to generate the DOM trees from
text/html resources. Together, these rules define what is referred to as the
HTML parser.

While the HTML syntax described in this specification bears a close resemblance to SGML and
XML, it is a separate language with its own parsing rules.

Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used
SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for
HTML documents; the only user agents to strictly handle HTML as an SGML application have
historically been validators. The resulting confusion — with validators claiming documents
to have one representation while widely deployed Web browsers interoperably implemented a
different representation — has wasted decades of productivity. This version of HTML thus
returns to a non-SGML basis.

Authors interested in using SGML tools in their authoring pipeline are encouraged to use XML
tools and the XML serialization of HTML.

This specification defines the parsing rules for HTML documents, whether they are syntactically
correct or not. Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined (that's the
processing rules described throughout this specification), but user agents, while parsing an HTML
document, may abort the parser at the first parse
error that they encounter for which they do not wish to apply the rules described in this
specification.

Conformance checkers must report at least one parse error condition to the user if one or more
parse error conditions exist in the document and must not report parse error conditions if none
exist in the document. Conformance checkers may report more than one parse error condition if more
than one parse error condition exists in the document.

Parse errors are only errors with the syntax of HTML. In addition to
checking for parse errors, conformance checkers will also verify that the document obeys all the
other conformance requirements described in this specification.

As stated in the terminology
section, references to element types that do not
explicitly specify a namespace always refer to elements in the HTML namespace. For
example, if the spec talks about "a div element", then that is an element with
the local name "div", the namespace "http://www.w3.org/1999/xhtml", and the interface HTMLDivElement.
Where possible, references to such elements are hyperlinked to their definition.

There is only one set of states for the tokenizer stage and the tree
construction stage, but the tree construction stage is reentrant, meaning that while the tree
construction stage is handling one token, the tokenizer might be resumed, causing further tokens
to be emitted and processed before the first token's processing is complete.

In the following example, the tree construction stage will be called upon to handle a "p"
start tag token while handling the "script" end tag token:

...
<script>
document.write('<p>');
</script>
...

To handle these cases, parsers have a script nesting level, which must be initially
set to zero, and a parser pause flag, which must be initially set to false.

8.2.2 The input byte stream

The stream of Unicode code points that comprises the input to the tokenization stage will be
initially seen by the user agent as a stream of bytes (typically coming over the network or from
the local file system). The bytes encode the actual characters according to a particular
character encoding, which the user agent uses to decode the bytes into characters.

For XML documents, the algorithm user agents must use to determine the character
encoding is given by the XML specification. This section does not apply to XML documents. [XML]

Given a character encoding, the bytes in the input byte stream must be converted
to Unicode code points for the tokenizer's input stream, as described by the rules
for that encoding's decoder.

Bytes or sequences of bytes in the original byte stream that did not conform to
the encoding specification (e.g. invalid UTF-8 byte sequences in a UTF-8 input byte stream) are
errors that conformance checkers are expected to report.

Leading Byte Order Marks (BOMs) are not stripped by the decoder algorithms; they
are stripped by the algorithm below.

The decoder algorithms describe how to handle invalid input; for security
reasons, it is imperative that those rules be followed precisely. Differences in how invalid byte
sequences are handled can result in, amongst other problems, script injection vulnerabilities
("XSS").

When the HTML parser is decoding an input byte stream, it uses a character encoding and a
confidence. The confidence is either tentative, certain, or irrelevant. The encoding used, and
whether the confidence in that encoding is tentative or certain, are used
during the parsing to determine whether to change the encoding. If no encoding is
necessary, e.g. because the parser is operating on a Unicode stream and doesn't have to use a
character encoding at all, then the confidence is
irrelevant.

8.2.2.1 Parsing with a known character encoding

When the HTML parser is to operate on an input byte stream that has a known definite
encoding, then the character encoding is that encoding and the confidence is certain.

8.2.2.2 Determining the character encoding

In some cases, it might be impractical to unambiguously determine the encoding before parsing
the document. Because of this, this specification provides for a two-pass mechanism with an
optional pre-scan. Implementations are allowed, as described below, to apply a simplified parsing
algorithm to whatever bytes they have available before beginning to parse the document. Then, the
real parser is started, using a tentative encoding derived from this pre-parse and other
out-of-band metadata. If, while the document is being loaded, the user agent discovers a character
encoding declaration that conflicts with this information, then the parser can get reinvoked to
perform a parse of the document with the real encoding.

User agents must use the following algorithm, called the encoding
sniffing algorithm, to determine the character encoding to use when decoding a document in
the first pass. This algorithm takes as input any out-of-band metadata available to the user agent
(e.g. the Content-Type metadata of the document) and all the
bytes available so far, and returns a character encoding and a confidence that is either tentative or
certain.

If the user has explicitly instructed the user agent to override the document's character
encoding with a specific encoding, optionally return that encoding with the confidence certain and abort these steps.

Typically, user agents remember such user requests across sessions, and in some
cases apply them to documents in iframes as well.

The user agent may wait for more bytes of the resource to be available, either in this step
or at any later step in this algorithm. For instance, a user agent might wait 500 ms or 1024
bytes, whichever comes first. In general, preparsing the source to find the encoding improves
performance, as it reduces the need to throw away the data structures used when parsing upon
finding the encoding information. However, if the user agent delays too long to obtain data to
determine the encoding, then the cost of the delay could outweigh any performance improvements
from the preparse.

The authoring conformance requirements for character encoding declarations limit
them to only appearing in the first 1024 bytes. User agents are
therefore encouraged to use the prescan algorithm below (as invoked by these steps) on the first
1024 bytes, but not to stall beyond that.

For each of the rows in the following table, starting with the first one and going down, if
there are as many or more bytes available than the number of bytes in the first column, and the
first bytes of the file match the bytes given in the first column, then return the encoding
given in the cell in the second column of that row, with the confidence certain, and abort these steps:

Bytes in Hexadecimal    Encoding
FE FF                   Big-endian UTF-16
FF FE                   Little-endian UTF-16
EF BB BF                UTF-8

This step looks for Unicode Byte Order Marks (BOMs).
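The BOM check described by the table above can be sketched in Python as follows. The function name sniff_bom is hypothetical, and the labels "UTF-16BE", "UTF-16LE", and "UTF-8" are assumed stand-ins for the encodings named in the table.

```python
from typing import Optional

# Rows of the table above, in order: (leading bytes, assumed encoding label).
BOM_TABLE = (
    (b"\xFE\xFF", "UTF-16BE"),      # Big-endian UTF-16
    (b"\xFF\xFE", "UTF-16LE"),      # Little-endian UTF-16
    (b"\xEF\xBB\xBF", "UTF-8"),
)


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding indicated by a leading BOM, or None.

    bytes.startswith() is False when fewer bytes are available than the
    BOM is long, which covers the "as many or more bytes" condition.
    """
    for bom, encoding in BOM_TABLE:
        if data.startswith(bom):
            return encoding
    return None
```

A caller would treat a non-None result as an encoding with the confidence certain, and otherwise continue with the remaining steps.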

That this step happens before the next one honoring the HTTP
Content-Type header is a willful violation of the HTTP specification,
motivated by a desire to be maximally compatible with legacy content. [HTTP]

If the transport layer specifies a character encoding, and it is supported, return that
encoding with the confidence certain, and
abort these steps.

Optionally prescan the byte
stream to determine its encoding. The end condition is that the user
agent decides that scanning further bytes would not be efficient. User agents are encouraged to
only prescan the first 1024 bytes. User agents may decide that scanning any bytes is
not efficient, in which case these substeps are entirely skipped.

The aforementioned algorithm either aborts unsuccessfully or returns a character encoding. If
it returns a character encoding, then this algorithm must be aborted, returning the same
encoding, with confidence tentative.

Otherwise, if the user agent has information on the likely encoding for this page, e.g.
based on the encoding of the page when it was last visited, then return that encoding, with the
confidence tentative, and abort these
steps.

The user agent may attempt to autodetect the character encoding by applying frequency
analysis or other algorithms to the data stream. Such algorithms may use information about the
resource other than the resource's contents, including the address of the resource. If
autodetection succeeds in determining a character encoding, and that encoding is a supported
encoding, then return that encoding, with the confidence tentative, and abort these steps.
[UNIVCHARDET]

The UTF-8 encoding has a highly detectable bit pattern. Documents that contain
bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8,
while documents with byte sequences that do not match it are very likely not. User agents are
therefore encouraged to search for this common encoding. [PPUTF8][UTF8DET]

Otherwise, return an implementation-defined or user-specified default character encoding,
with the confidence tentative.

In controlled environments or in environments where the encoding of documents can be
prescribed (for example, for user agents intended for dedicated use in new networks), the
comprehensive UTF-8 encoding is suggested.

In other environments, the default encoding is typically dependent on the user's locale (an
approximation of the languages, and thus often encodings, of the pages that the user is likely
to frequent). The following table gives suggested defaults based on the user's locale, for
compatibility with legacy content. Locales are identified by BCP 47 language tags. [BCP47][ENCODING]

Locale  Language                              Suggested default encoding
ar      Arabic                                windows-1256
ba      Bashkir                               windows-1251
be      Belarusian                            windows-1251
bg      Bulgarian                             windows-1251
cs      Czech                                 windows-1250
el      Greek                                 ISO-8859-7
et      Estonian                              windows-1257
fa      Persian                               windows-1256
he      Hebrew                                windows-1255
hr      Croatian                              windows-1250
hu      Hungarian                             ISO-8859-2
ja      Japanese                              Shift_JIS
kk      Kazakh                                windows-1251
ko      Korean                                euc-kr
ku      Kurdish                               windows-1254
ky      Kyrgyz                                windows-1251
lt      Lithuanian                            windows-1257
lv      Latvian                               windows-1257
mk      Macedonian                            windows-1251
pl      Polish                                ISO-8859-2
ru      Russian                               windows-1251
sah     Yakut                                 windows-1251
sk      Slovak                                windows-1250
sl      Slovenian                             ISO-8859-2
sr      Serbian                               windows-1251
tg      Tajik                                 windows-1251
th      Thai                                  windows-874
tr      Turkish                               windows-1254
tt      Tatar                                 windows-1251
uk      Ukrainian                             windows-1251
vi      Vietnamese                            windows-1258
zh-CN   Chinese (People's Republic of China)  GB18030
zh-TW   Chinese (Taiwan)                      Big5
All other locales                             windows-1252

The contents of this table are derived from the intersection of
Windows, Chrome, and Firefox defaults.
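A lookup over the table above might be sketched in Python as follows. The function name suggested_default_encoding is hypothetical, and the matching strategy is an assumption: the locale tag is compared case-insensitively, first in full and then with trailing subtags removed one at a time, before falling back to the "all other locales" row. This specification does not prescribe how a locale is matched against the table.

```python
# Suggested defaults from the table above, keyed by lowercased locale tag.
LOCALE_DEFAULTS = {
    "ar": "windows-1256", "ba": "windows-1251", "be": "windows-1251",
    "bg": "windows-1251", "cs": "windows-1250", "el": "ISO-8859-7",
    "et": "windows-1257", "fa": "windows-1256", "he": "windows-1255",
    "hr": "windows-1250", "hu": "ISO-8859-2", "ja": "Shift_JIS",
    "kk": "windows-1251", "ko": "euc-kr", "ku": "windows-1254",
    "ky": "windows-1251", "lt": "windows-1257", "lv": "windows-1257",
    "mk": "windows-1251", "pl": "ISO-8859-2", "ru": "windows-1251",
    "sah": "windows-1251", "sk": "windows-1250", "sl": "ISO-8859-2",
    "sr": "windows-1251", "tg": "windows-1251", "th": "windows-874",
    "tr": "windows-1254", "tt": "windows-1251", "uk": "windows-1251",
    "vi": "windows-1258", "zh-cn": "GB18030", "zh-tw": "Big5",
}

FALLBACK_ENCODING = "windows-1252"  # the "All other locales" row


def suggested_default_encoding(locale: str) -> str:
    """Look up a locale tag, dropping trailing subtags until a row matches."""
    subtags = locale.lower().split("-")
    while subtags:
        encoding = LOCALE_DEFAULTS.get("-".join(subtags))
        if encoding is not None:
            return encoding
        subtags.pop()  # assumed fallback: try a shorter tag
    return FALLBACK_ENCODING
```

Under this assumed strategy, "ru-RU" resolves to windows-1251 through its primary language subtag, while a tag with no matching row, such as "en-US", falls through to windows-1252.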

The document's character encoding must immediately be set to the value returned
from this algorithm, at the same time as the user agent uses the returned value to select the
decoder to use for the input byte stream.
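The overall order of the encoding sniffing steps can be summarized in a short Python sketch. Everything about the interface is an assumption made for illustration: the function name sniff_encoding, its parameters, the callables standing in for the prescan and autodetection steps, and the encoding labels. It omits waiting for more bytes and always honors a user override, although that step is optional.

```python
from typing import Callable, Optional, Tuple

BOMS = (
    (b"\xFE\xFF", "UTF-16BE"),
    (b"\xFF\xFE", "UTF-16LE"),
    (b"\xEF\xBB\xBF", "UTF-8"),
)

Detector = Callable[[bytes], Optional[str]]


def sniff_encoding(
    data: bytes,
    user_override: Optional[str] = None,
    transport_encoding: Optional[str] = None,
    prescan: Optional[Detector] = None,
    likely_encoding: Optional[str] = None,
    autodetect: Optional[Detector] = None,
    default_encoding: str = "windows-1252",
    is_supported: Callable[[str], bool] = lambda name: True,
) -> Tuple[str, str]:
    """Return (encoding, confidence), trying each source in order."""
    if user_override is not None:
        return user_override, "certain"
    for bom, name in BOMS:  # BOMs are checked before transport metadata
        if data.startswith(bom):
            return name, "certain"
    if transport_encoding is not None and is_supported(transport_encoding):
        return transport_encoding, "certain"
    if prescan is not None:
        found = prescan(data[:1024])  # only the first 1024 bytes
        if found is not None:
            return found, "tentative"
    if likely_encoding is not None:
        return likely_encoding, "tentative"
    if autodetect is not None:
        guess = autodetect(data)
        if guess is not None and is_supported(guess):
            return guess, "tentative"
    return default_encoding, "tentative"
```

Note how a BOM wins over an encoding supplied by the transport layer, and how only the first three sources yield the confidence certain.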

When an algorithm requires a user agent to prescan a byte stream to determine its
encoding, given some defined end condition, then it must run the
following steps. These steps either abort unsuccessfully or return a character encoding. If at any
point during these steps (including during instances of the get an attribute algorithm invoked by this
one) the user agent either runs out of bytes (meaning the position pointer
created in the first step below goes beyond the end of the byte stream obtained so far) or reaches
its end condition, then abort the prescan a byte stream to determine its
encoding algorithm unsuccessfully.

Let position be a pointer to a byte in the input byte stream, initially
pointing at the first byte.

Loop: If position points to:

A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (ASCII '<!--')

Advance the position pointer so that it points at the first 0x3E byte
which is preceded by two 0x2D bytes (i.e. at the end of an ASCII '-->' sequence) and comes
after the 0x3C byte that was found. (The two 0x2D bytes can be the same as those in the
'<!--' sequence.)

A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '<meta' followed by a space or slash)

Advance the position pointer so that it points at the next 0x09,
0x0A, 0x0C, 0x0D, 0x20, or 0x2F byte (the one in the sequence of characters matched
above).

Let attribute list be an empty list of strings.

Let got pragma be false.

Let need pragma be null.

Let charset be the null value (which, for the purposes of this
algorithm, is distinct from an unrecognised encoding or the empty string).

Attributes: Get an
attribute and its value. If no attribute was sniffed, then jump to the
processing step below.

If the attribute's name is already in attribute list, then return
to the step labeled attributes.

Add the attribute's name to attribute list.

Run the appropriate step from the following list, if one applies:

If the attribute's name is "http-equiv"

If the attribute's value is "content-type", then set got pragma to true.

Abort the get an attribute
algorithm. The attribute's name is the value of attribute name, its value
is the empty string.

If it is in the range 0x41 (ASCII A) to 0x5A (ASCII Z)

Append the Unicode character with code point b+0x20 to attribute name (where b
is the value of the byte at position). (This converts the input to
lowercase.)

Anything else

Append the Unicode character with the same code point as the value of the byte at position to attribute name. (It doesn't actually matter how
bytes outside the ASCII range are handled here, since only ASCII characters can contribute to
the detection of a character encoding.)

Advance position to the next byte and return to the previous
step.

Spaces: If the byte at position is one of 0x09 (ASCII TAB),
0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space), then advance position to the next byte and repeat this step.

If the byte at position is not 0x3D (ASCII =), abort the
get an attribute algorithm. The
attribute's name is the value of attribute name, its value is the empty
string.

Advance position past the 0x3D (ASCII =) byte.

Value: If the byte at position is one of 0x09 (ASCII TAB), 0x0A
(ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space), then advance position to the next byte and repeat this step.

Process the byte at position as follows:

If it is 0x22 (ASCII ") or 0x27 (ASCII ')

Let b be the value of the byte at position.

Quote loop: Advance position to the next byte.

If the value of the byte at position is the value of b, then advance position to the next byte and abort the
"get an attribute" algorithm. The attribute's name is the value of attribute
name, and its value is the value of attribute value.

Otherwise, if the value of the byte at position is in the range 0x41
(ASCII A) to 0x5A (ASCII Z), then append a Unicode character to attribute
value whose code point is 0x20 more than the value of the byte at position.

Otherwise, append a Unicode character to attribute value whose code
point is the same as the value of the byte at position.

Return to the step above labeled quote loop.

If it is 0x3E (ASCII >)

Abort the get an attribute
algorithm. The attribute's name is the value of attribute name, its value
is the empty string.

If it is in the range 0x41 (ASCII A) to 0x5A (ASCII Z)

Append the Unicode character with code point b+0x20 to attribute value (where b is the value of the byte at position). Advance position to the next byte.

Anything else

Append the Unicode character with the same code point as the value of the byte at position to attribute value. Advance position to the next byte.

Abort the get an attribute
algorithm. The attribute's name is the value of attribute name and its
value is the value of attribute value.

If it is in the range 0x41 (ASCII A) to 0x5A (ASCII Z)

Append the Unicode character with code point b+0x20 to attribute value (where b is the value of the byte at position).

Anything else

Append the Unicode character with the same code point as the value of the byte at position to attribute value.

Advance position to the next byte and return to the previous
step.
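The value-handling steps above can be sketched as follows. This is a minimal illustration, not a normative implementation: the function names read_attribute_value and lower_ascii are hypothetical, the sketch starts just after the attribute name has been read, it assumes that an unquoted value ends at a space character or at 0x3E (ASCII >), and it assumes that running out of bytes simply ends the value.

```python
WHITESPACE = {0x09, 0x0A, 0x0C, 0x0D, 0x20}


def lower_ascii(byte: int) -> str:
    """Map 0x41-0x5A (A-Z) to lowercase; pass every other byte through."""
    return chr(byte + 0x20) if 0x41 <= byte <= 0x5A else chr(byte)


def read_attribute_value(data: bytes, position: int) -> tuple[str, int]:
    """Return (value, new position), starting just after the attribute name."""
    n = len(data)
    # "Spaces" step: skip whitespace between the name and a possible "=".
    while position < n and data[position] in WHITESPACE:
        position += 1
    # No "=" here means the attribute's value is the empty string.
    if position >= n or data[position] != 0x3D:
        return "", position
    position += 1  # advance past the "=" byte
    # "Value" step: skip whitespace after "=".
    while position < n and data[position] in WHITESPACE:
        position += 1
    if position >= n:
        return "", position
    value = []
    first = data[position]
    if first in (0x22, 0x27):  # quoted value; remember which quote opened it
        position += 1
        while position < n and data[position] != first:
            value.append(lower_ascii(data[position]))
            position += 1
        return "".join(value), min(position + 1, n)  # step past closing quote
    if first == 0x3E:  # ">" directly after "=": empty value
        return "", position
    # Unquoted value (assumption: it ends at whitespace or ">").
    while position < n and data[position] not in WHITESPACE and data[position] != 0x3E:
        value.append(lower_ascii(data[position]))
        position += 1
    return "".join(value), position
```

Only bytes in the range 0x41 to 0x5A are lowercased, which matches the note above that bytes outside the ASCII range cannot contribute to encoding detection.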

For the sake of interoperability, user agents should not use a pre-scan algorithm that returns
different results than the one described above. (But, if you do, please at least let us know, so
that we can improve this algorithm and benefit everyone...)

8.2.2.3 Character encodings

User agents must support the encodings defined in the Encoding standard. User agents
should not support other encodings.

Support for encodings based on EBCDIC is especially discouraged. This encoding is rarely used
for publicly-facing Web content. Support for UTF-32 is also especially discouraged. This encoding
is rarely used, and frequently implemented incorrectly.

This specification does not make any attempt to support EBCDIC-based encodings and
UTF-32 in its algorithms; support and use of these encodings can thus lead to unexpected behavior
in implementations of this specification.

8.2.2.4 Changing the encoding while parsing

When the parser requires the user agent to change the encoding, it must run the
following steps. This might happen if the encoding sniffing algorithm described above
failed to find a character encoding, or if it found a character encoding that was not the actual
encoding of the file.

If the encoding that is already being used to interpret the input stream is a UTF-16
encoding, then set the confidence to
certain and abort these steps. The new encoding is ignored; if it was anything but the
same encoding, then it would be clearly incorrect.

If the new encoding is identical or equivalent to the encoding that is already being used to
interpret the input stream, then set the confidence to certain and abort these steps.
This happens when the encoding information found in the file matches what the encoding
sniffing algorithm determined to be the encoding, and in the second pass through the
parser if the first pass found that the encoding sniffing algorithm described in the earlier
section failed to find the right encoding.

If all the bytes up to the last byte converted by the current decoder have the same Unicode
interpretations in both the current encoding and the new encoding, and if the user agent supports
changing the converter on the fly, then the user agent may change to the new converter for the
encoding on the fly. Set the document's character encoding and the encoding used to
convert the input stream to the new encoding, set the confidence to certain, and abort these
steps.

Otherwise, navigate to the document again, with
replacement enabled, and using the same source browsing context, but
this time skip the encoding sniffing algorithm and instead just set the encoding to
the new encoding and the confidence to
certain. Whenever possible, this should be done without actually contacting the network
layer (the bytes should be re-parsed from memory), even if, e.g., the document is marked as not
being cacheable. If this is not possible and contacting the network layer would involve repeating
a request that uses a method other than HTTP GET (or
equivalent for non-HTTP URLs), then instead set the confidence to certain and ignore the new
encoding. The resource will be misinterpreted. User agents may notify the user of the situation,
to aid in application development.

8.2.2.5 Preprocessing the input stream

The input stream consists of the characters pushed into it as the input byte
stream is decoded or from the various APIs that directly manipulate the input stream.

One leading U+FEFF BYTE ORDER MARK character must be ignored if any are present in the
input stream.

The requirement to strip a U+FEFF BYTE ORDER MARK character regardless of whether
that character was used to determine the byte order is a willful violation of
Unicode, motivated by a desire to increase the resilience of user agents in the face of naïve
transcoders.

"CR" (U+000D) characters and "LF" (U+000A) characters are treated
specially. All CR characters must be converted to LF characters, and any LF characters that
immediately follow a CR character must be ignored. Thus, newlines in HTML DOMs are represented by
LF characters, and there are never any CR characters in the input to the tokenization
stage.
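As a minimal sketch of the two preprocessing rules above (byte order mark and newline handling), assuming the whole input is available as one already-decoded string; the function name preprocess is hypothetical, and a real user agent would apply the same rules incrementally as the stream arrives:

```python
def preprocess(text: str) -> str:
    # Ignore one leading U+FEFF BYTE ORDER MARK, if present.
    if text.startswith("\ufeff"):
        text = text[1:]
    # A CR LF pair collapses to a single LF; any remaining lone CR becomes LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

Only the first U+FEFF is dropped; a second one immediately after it stays in the stream.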

The next input character is the first character in the input stream
that has not yet been consumed or explicitly ignored by the requirements in this
section. Initially, the next input character is the first character in the input. The
current input character is the last character to have been consumed.

The insertion point is the position (just before a character or just before the end
of the input stream) where content inserted using document.write() is actually inserted. The insertion point is
relative to the position of the character immediately after it; it is not an absolute offset into
the input stream. Initially, the insertion point is undefined.

The "EOF" character in the tables below is a conceptual character representing the end of the
input stream. If the parser is a script-created parser, then the end of
the input stream is reached when an explicit "EOF" character (inserted by
the document.close() method) is consumed. Otherwise, the
"EOF" character is not a real character in the stream, but rather the lack of any further
characters.

The handling of U+0000 NULL characters varies based on where the characters are
found. In general, they are ignored except where doing so could plausibly introduce an attack
vector. This handling is, by necessity, spread across both the tokenization stage and the tree
construction stage.

8.2.3 Parse state

8.2.3.1 The insertion mode

The insertion mode is a state variable that controls the primary operation of the
tree construction stage.

Several of these modes, namely "in head", "in body", "in
table", and "in select", are special, in
that the other modes defer to them at various times. When the algorithm below says that the user
agent is to do something "using the rules for the m insertion
mode", where m is one of these modes, the user agent must use the rules
described under the m insertion mode's section, but must leave
the insertion mode unchanged unless the rules in m themselves
switch the insertion mode to a new value.

When the insertion mode is switched to "text" or
"in table text", the original insertion
mode is also set. This is the insertion mode to which the tree construction stage will
return.

Similarly, to parse nested template elements, a stack of template insertion
modes is used. It is initially empty. The current template insertion mode is the
insertion mode that was most recently added to the stack of template insertion modes.
The algorithms in the sections below will push insertion modes onto this stack, meaning
that the specified insertion mode is to be added to the stack, and pop insertion modes from
the stack, which means that the most recently added insertion mode must be removed from the
stack.

When the steps below require the UA to reset the insertion mode appropriately, it
means the UA must follow these steps:

8.2.3.2 The stack of open elements

Initially, the stack of open elements is empty. The stack grows downwards; the
topmost node on the stack is the first one added to the stack, and the bottommost node of the
stack is the most recently added node in the stack (notwithstanding when the stack is manipulated
in a random access fashion as part of the handling for misnested
tags).

The stack of open elements is said to have an element target node in a specific scope consisting of a
list of element types list when the following algorithm terminates in a match
state:

Initialize node to be the current node (the bottommost
node of the stack).

If node is the target node, terminate in a match state.

Otherwise, if node is one of the element types in list, terminate in a failure state.

Otherwise, set node to the previous entry in the stack of open
elements and return to step 2. (This will never fail, since the loop will always terminate
in the previous step if the top of the stack — an html element — is
reached.)
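The scope check above can be sketched as a walk from the current node towards the top of the stack. This is an illustration under simplifying assumptions: elements are modeled as tag-name strings rather than nodes, the stack is a Python list whose last item is the bottommost (current) node, and the names has_element_in_scope and scope_types are hypothetical.

```python
def has_element_in_scope(stack: list[str], target: str, scope_types: set[str]) -> bool:
    # The current node is the bottommost entry, i.e. the end of the Python list.
    for node in reversed(stack):
        if node == target:
            return True   # terminate in a match state
        if node in scope_types:
            return False  # terminate in a failure state
    # Not reached when "html" is on the stack and listed in scope_types.
    return False
```

For example, with a stack of html, body, table, tbody, tr, td, p and a scope list containing html, table, and td, a p element is in scope but body is not, because the walk stops at td first.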

Nothing happens if at any time any of the elements in the stack of open elements
are moved to a new location in, or removed from, the Document tree. In particular,
the stack is not changed in this situation. This can cause, amongst other strange effects, content
to be appended to nodes that are no longer in the DOM.

8.2.3.3 The list of active formatting elements

Initially, the list of active formatting elements is empty. It is used to handle
mis-nested formatting element tags.

The list contains elements in the formatting category, and scope markers. The
scope markers are inserted when entering applet elements, buttons,
object elements, marquees, table cells, and table captions, and are used to prevent
formatting from "leaking" into applet elements, buttons, object
elements, marquees, and tables.

The scope markers are unrelated to the concept of an element being in scope.

In addition, each element in the list of active formatting elements is associated
with the token for which it was created, so that further elements can be created for that token if
necessary.

When the steps below require the UA to push onto the list of active formatting
elements an element element, the UA must perform the following
steps:

If there are already three elements in the list of active formatting elements
after the last list marker, if any, or anywhere in the list if there are no list markers, that
have the same tag name, namespace, and attributes as element, then remove the
earliest such element from the list of active formatting elements. For these
purposes, the attributes must be compared as they were when the elements were created by the
parser; two elements have the same attributes if all their parsed attributes can be paired such
that the two attributes in each pair have identical names, namespaces, and values (the order of
the attributes does not matter).

This is the Noah's Ark clause. But with three per family instead of two.
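A minimal sketch of this clause, under stated assumptions: each element is modeled as a tuple of tag name, namespace, and a frozenset of (name, namespace, value) attribute triples, so that attribute order does not matter; MARKER is a hypothetical sentinel standing in for a marker; and the final append reflects the fact that the element is, after all, being pushed onto the list.

```python
MARKER = object()  # hypothetical sentinel standing in for a marker entry


def push_formatting_element(active: list, element: tuple) -> None:
    """element is modeled as (tag name, namespace, frozenset of attributes)."""
    # Only entries after the last marker count (the whole list if there is none).
    start = 0
    for i, entry in enumerate(active):
        if entry is MARKER:
            start = i + 1
    matches = [i for i in range(start, len(active)) if active[i] == element]
    if len(matches) >= 3:
        del active[matches[0]]  # remove the earliest matching element
    active.append(element)
```

Pushing the same b element four times in a row therefore leaves three copies in the list, not four.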

This has the effect of reopening all the formatting elements that were opened in the current
body, cell, or caption (whichever is youngest) that haven't been explicitly closed.

The way this specification is written, the list of active formatting
elements always consists of elements in chronological order with the least recently added
element first and the most recently added element last (except for while steps 8 to 11 of the
above algorithm are being executed, of course).

When the steps below require the UA to clear the list of active formatting elements up to
the last marker, the UA must perform the following steps:

If entry was a marker, then stop the algorithm at this point. The list
has been cleared up to the last marker.

Go to step 1.

8.2.3.4 The element pointers

Initially, the head element pointer and the form element pointer are both null.

Once a head element has been parsed (whether implicitly or explicitly) the
head element pointer gets set to point to this node.

The form element pointer points to the last
form element that was opened and whose end tag has not yet been seen. It is used to
make form controls associate with forms in the face of dramatically bad markup, for historical
reasons. It is ignored inside template elements.

8.2.3.5 Other parsing state flags

The scripting flag is set to "enabled" if scripting
was enabled for the Document with which the parser is associated when the
parser was created, and "disabled" otherwise.

The frameset-ok flag is set to "ok" when the parser is created. It is set to "not
ok" after certain tokens are seen.

8.2.4 Tokenization

Implementations must act as if they used the following state machine to tokenize HTML. The
state machine must start in the data state. Most states consume a single character,
which may have various side-effects, and then either switch the state machine to a new state to
reconsume the same character, switch it to a new state to consume the next character,
or stay in the same state to consume the next character. Some states have more complicated
behavior and can consume several characters before switching to another state. In some cases, the
tokenizer state is also changed by the tree construction stage.

The output of the tokenization step is a series of zero or more of the following tokens:
DOCTYPE, start tag, end tag, comment, character, end-of-file. DOCTYPE tokens have a name, a public
identifier, a system identifier, and a force-quirks flag. When a DOCTYPE token is created,
its name, public identifier, and system identifier must be marked as missing (which is a distinct
state from the empty string), and the force-quirks flag must be set to off (its
other state is on). Start and end tag tokens have a tag name, a self-closing flag,
and a list of attributes, each of which has a name and a value. When a start or end tag token is
created, its self-closing flag must be unset (its other state is that it be set), and its
attributes list must be empty. Comment and character tokens have data.
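The initial state of newly created tokens can be illustrated with a pair of data classes. This is only a sketch: the class and field names are hypothetical, None is used to model "missing" (which keeps it distinct from the empty string), and attributes are modeled as a list of (name, value) pairs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DoctypeToken:
    # None models "missing", which is distinct from the empty string.
    name: Optional[str] = None
    public_identifier: Optional[str] = None
    system_identifier: Optional[str] = None
    force_quirks: bool = False  # the flag starts in its "off" state


@dataclass
class TagToken:
    kind: str  # "start" or "end"
    tag_name: str = ""
    self_closing: bool = False  # the flag starts unset
    # Each token gets its own, initially empty, attribute list.
    attributes: list[tuple[str, str]] = field(default_factory=list)
```

Using default_factory gives every tag token a fresh list, so attributes appended to one token never appear on another.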

When a token is emitted, it must immediately be handled by the tree construction
stage. The tree construction stage can affect the state of the tokenization stage, and can insert
additional characters into the stream. (For example, the script element can result in
scripts executing and using the dynamic markup insertion APIs to insert characters
into the stream being tokenized.)

Creating a token and emitting it are distinct actions. It is possible for a token
to be created but implicitly abandoned (never emitted), e.g. if the file ends unexpectedly while
processing the characters that are being parsed into a start tag token.

When a start tag token is emitted with its self-closing flag set, if the flag is not
acknowledged when it is processed by the tree
construction stage, that is a parse error.

When an end tag token is emitted with attributes, that is a parse error.

When an end tag token is emitted with its self-closing flag set, that is a parse
error.

An appropriate end tag token is an end tag token whose tag name matches the tag name
of the last start tag to have been emitted from this tokenizer, if any. If no start tag has been
emitted from this tokenizer, then no end tag token is appropriate.

Before each step of the tokenizer, the user agent must first check the parser pause
flag. If it is true, then the tokenizer must abort the processing of any nested invocations
of the tokenizer, yielding control back to the caller.

The tokenizer state machine consists of the states defined in the following subsections.

8.2.4.8 Tag open state

Create a new start tag token, set its tag name to the lowercase version of the current
input character (add 0x0020 to the character's code point), then switch to the tag
name state. (Don't emit the token yet; further details will be filled in before it is
emitted.)

8.2.4.9 End tag open state

Create a new end tag token, set its tag name to the lowercase version of the current
input character (add 0x0020 to the character's code point), then switch to the tag
name state. (Don't emit the token yet; further details will be filled in before it is
emitted.)

Switch to the RCDATA state. Emit a U+003C LESS-THAN SIGN character token, a
U+002F SOLIDUS character token, and a character token for each of the characters in the
temporary buffer (in the order they were added to the buffer). Reconsume the
current input character.

Switch to the RAWTEXT state. Emit a U+003C LESS-THAN SIGN character token, a
U+002F SOLIDUS character token, and a character token for each of the characters in the
temporary buffer (in the order they were added to the buffer). Reconsume the
current input character.

Start a new attribute in the current tag token. Set that attribute's name to the lowercase
version of the current input character (add 0x0020 to the character's code point),
and its value to the empty string. Switch to the attribute name state.

U+0000 NULL

Parse error. Start a new attribute in the current tag token. Set that
attribute's name to a U+FFFD REPLACEMENT CHARACTER character, and its value to the empty string.
Switch to the attribute name state.

When the user agent leaves the attribute name state (and before emitting the tag token, if
appropriate), the complete attribute's name must be compared to the other attributes on the same
token; if there is already an attribute on the token with the exact same name, then this is a
parse error and the new attribute must be removed from the token.
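The duplicate check can be sketched as follows, assuming that attributes are kept as a list of [name, value] pairs and that the attribute whose name was just completed is the last entry; the function name leave_attribute_name_state is hypothetical.

```python
def leave_attribute_name_state(attributes: list[list[str]]) -> bool:
    """attributes[-1] is the attribute whose name was just completed.

    Returns True when it duplicated an earlier name (a parse error) and
    was therefore removed from the token.
    """
    current = attributes[-1]
    if any(earlier[0] == current[0] for earlier in attributes[:-1]):
        attributes.pop()  # drop the new attribute; the first one wins
        return True
    return False
```

A tokenizer built this way would keep its own reference to the removed pair, so that any value characters that follow are written to an object that is no longer on the token.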

If an attribute is so removed from a token, it, along with the value that gets
associated with it, if any, are never subsequently used by the parser, and are therefore
effectively discarded. Removing the attribute in this way does not change its status as the
"current attribute" for the purposes of the tokenizer, however.

Start a new attribute in the current tag token. Set that attribute's name to the lowercase
version of the current input character (add 0x0020 to the character's code point),
and its value to the empty string. Switch to the attribute name state.

U+0000 NULL

Parse error. Start a new attribute in the current tag token. Set that
attribute's name to a U+FFFD REPLACEMENT CHARACTER character, and its value to the empty string.
Switch to the attribute name state.

8.2.4.44 Bogus comment state

Consume every character up to and including the first ">" (U+003E) character
or the end of the file (EOF), whichever comes first. Emit a comment token whose data is the
concatenation of all the characters starting from and including the character that caused the
state machine to switch into the bogus comment state, up to and including the character
immediately before the last consumed character (i.e. up to the character just before the U+003E or
EOF character), but with any U+0000 NULL characters replaced by U+FFFD REPLACEMENT CHARACTER
characters. (If the comment was started by the end of the file (EOF), the token is empty.
Similarly, the token is empty if it was generated by the string "<!>".)
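A minimal sketch of the bogus comment state, assuming the remaining input is available as a single string and that start is the index of the character that caused the switch into this state; the function name bogus_comment is hypothetical.

```python
def bogus_comment(text: str, start: int) -> tuple[str, int]:
    """Return (comment data, index of the first unconsumed character)."""
    end = text.find(">", start)
    if end == -1:
        data, next_index = text[start:], len(text)  # ran into EOF
    else:
        data, next_index = text[start:end], end + 1  # ">" is consumed, not kept
    # U+0000 NULL characters are replaced by U+FFFD REPLACEMENT CHARACTER.
    return data.replace("\x00", "\ufffd"), next_index
```

For the input "<!>", the state is entered at the ">" itself, so the comment data is the empty string.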

Otherwise, if there is an adjusted current node and it is not an element in the
HTML namespace and the next seven characters are a case-sensitive match
for the string "[CDATA[" (the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET
character before and after), then consume those characters and switch to the CDATA section
state.

Otherwise, this is a parse error. Switch to the bogus comment state.
The next character that is consumed, if any, is the first character that will be in the
comment.

8.2.4.68 CDATA section state

Consume every character up to the next occurrence of the three character sequence U+005D RIGHT
SQUARE BRACKET U+005D RIGHT SQUARE BRACKET U+003E GREATER-THAN SIGN (]]>),
or the end of the file (EOF), whichever comes first. Emit a series of character tokens consisting
of all the characters consumed except the matching three character sequence at the end (if one was
found before the end of the file).

If the end of the file was reached, reconsume the EOF character.
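The CDATA section state can be sketched in the same style, again assuming the remaining input is a single string; the function name cdata_section is hypothetical, and the returned text stands for the series of character tokens that would be emitted.

```python
def cdata_section(text: str, start: int) -> tuple[str, int]:
    """Return (characters to emit, index of the first unconsumed character)."""
    end = text.find("]]>", start)
    if end == -1:
        # No terminator before EOF: everything is emitted, and the caller
        # then reconsumes the EOF character.
        return text[start:], len(text)
    return text[start:end], end + 3  # the "]]>" itself is not emitted
```

Note that a lone "]]" without a following ">" is ordinary character data.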

8.2.4.69 Tokenizing character references

This section defines how to consume a character reference, optionally with an
additional allowed character, which, if specified where the algorithm is invoked, adds
a character to the list of characters that cause there to not be a character reference.

If no characters match the range, then don't consume any characters (and unconsume the U+0023
NUMBER SIGN character and, if appropriate, the X character). This is a parse error;
nothing is returned.

Otherwise, if the next character is a U+003B SEMICOLON, consume that too. If it isn't, there
is a parse error.

If one or more characters match the range, then take them all and interpret the string of
characters as a number (either hexadecimal or decimal as appropriate).

If that number is one of the numbers in the first column of the following table, then this is
a parse error. Find the row with that number in the first column, and return a
character token for the Unicode character given in the second column of that row.

Number  Unicode character
0x00    U+FFFD  REPLACEMENT CHARACTER
0x80    U+20AC  EURO SIGN (€)
0x82    U+201A  SINGLE LOW-9 QUOTATION MARK (‚)
0x83    U+0192  LATIN SMALL LETTER F WITH HOOK (ƒ)
0x84    U+201E  DOUBLE LOW-9 QUOTATION MARK („)
0x85    U+2026  HORIZONTAL ELLIPSIS (…)
0x86    U+2020  DAGGER (†)
0x87    U+2021  DOUBLE DAGGER (‡)
0x88    U+02C6  MODIFIER LETTER CIRCUMFLEX ACCENT (ˆ)
0x89    U+2030  PER MILLE SIGN (‰)
0x8A    U+0160  LATIN CAPITAL LETTER S WITH CARON (Š)
0x8B    U+2039  SINGLE LEFT-POINTING ANGLE QUOTATION MARK (‹)
0x8C    U+0152  LATIN CAPITAL LIGATURE OE (Œ)
0x8E    U+017D  LATIN CAPITAL LETTER Z WITH CARON (Ž)
0x91    U+2018  LEFT SINGLE QUOTATION MARK (‘)
0x92    U+2019  RIGHT SINGLE QUOTATION MARK (’)
0x93    U+201C  LEFT DOUBLE QUOTATION MARK (“)
0x94    U+201D  RIGHT DOUBLE QUOTATION MARK (”)
0x95    U+2022  BULLET (•)
0x96    U+2013  EN DASH (–)
0x97    U+2014  EM DASH (—)
0x98    U+02DC  SMALL TILDE (˜)
0x99    U+2122  TRADE MARK SIGN (™)
0x9A    U+0161  LATIN SMALL LETTER S WITH CARON (š)
0x9B    U+203A  SINGLE RIGHT-POINTING ANGLE QUOTATION MARK (›)
0x9C    U+0153  LATIN SMALL LIGATURE OE (œ)
0x9E    U+017E  LATIN SMALL LETTER Z WITH CARON (ž)
0x9F    U+0178  LATIN CAPITAL LETTER Y WITH DIAERESIS (Ÿ)

Otherwise, if the number is in the range 0xD800 to 0xDFFF or is greater
than 0x10FFFF, then this is a parse error. Return a U+FFFD REPLACEMENT CHARACTER
character token.

Otherwise, return a character token for the Unicode character whose code point is that
number.
Additionally, if the number is in the range 0x0001 to 0x0008, 0x000D to 0x001F, 0x007F to 0x009F, 0xFDD0 to 0xFDEF, or is
one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE,
0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE,
0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE,
0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF, then this is a parse error.
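The rules above for mapping the parsed number to a character can be sketched as follows. The function name resolve_numeric_reference and the (character, parse error) return shape are hypothetical; the replacement table is the one given above. The final test relies on the observation that every listed value from 0xFFFE to 0x10FFFF has low sixteen bits equal to 0xFFFE or 0xFFFF.

```python
REPLACEMENTS = {
    0x00: 0xFFFD, 0x80: 0x20AC, 0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E,
    0x85: 0x2026, 0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6, 0x89: 0x2030,
    0x8A: 0x0160, 0x8B: 0x2039, 0x8C: 0x0152, 0x8E: 0x017D, 0x91: 0x2018,
    0x92: 0x2019, 0x93: 0x201C, 0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013,
    0x97: 0x2014, 0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A,
    0x9C: 0x0153, 0x9E: 0x017E, 0x9F: 0x0178,
}


def resolve_numeric_reference(number: int) -> tuple[str, bool]:
    """Return (character, parse_error) for an already-parsed number."""
    if number in REPLACEMENTS:
        return chr(REPLACEMENTS[number]), True
    if 0xD800 <= number <= 0xDFFF or number > 0x10FFFF:
        return "\ufffd", True
    # The character is returned as-is, but some ranges are still parse errors.
    error = (
        0x0001 <= number <= 0x0008
        or 0x000D <= number <= 0x001F
        or 0x007F <= number <= 0x009F
        or 0xFDD0 <= number <= 0xFDEF
        or number == 0x000B
        or (number & 0xFFFE) == 0xFFFE  # 0xFFFE, 0xFFFF, 0x1FFFE, ... 0x10FFFF
    )
    return chr(number), error
```

For instance, the number 0x80 yields U+20AC EURO SIGN with a parse error, while 0x81, which has no row in the table, yields U+0081 itself, also with a parse error.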

Anything else

Consume the maximum number of characters possible, with the consumed characters matching one
of the identifiers in the first column of the named character references table (in
a case-sensitive manner).

If no match can be made, then no characters are consumed, and nothing is returned. In this
case, if the characters after the U+0026 AMPERSAND character (&) consist of a sequence of
one or more alphanumeric ASCII characters followed by a U+003B SEMICOLON character
(;), then this is a parse error.

If the character reference is being consumed as part of an attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or
an alphanumeric ASCII character, then, for
historical reasons, all the characters that were matched after the U+0026 AMPERSAND character
(&) must be unconsumed, and nothing is returned.
However, if this next character is in fact a "=" (U+003D) character, then this is a
parse error, because some legacy user agents will
misinterpret the markup in those cases.

Otherwise, a character reference is parsed. If the last character matched is not a ";" (U+003B) character, there is a parse error.

Return one or two character tokens for the character(s) corresponding to the character
reference name (as given by the second column of the named character references
table).

If the markup contains (not in an attribute) the string I'm &notit; I
tell you, the character reference is parsed as "not", as in, I'm ¬it;
I tell you (and this is a parse error). But if the markup was I'm
&notin; I tell you, the character reference would be parsed as "notin;", resulting
in I'm ∉ I tell you (and no parse error).
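The longest-match behavior in this example can be sketched with Python's standard html.entities.html5 dictionary, which lists HTML5 named character references with and without the trailing semicolon where both forms are accepted. Treating that dictionary as equivalent to the named character references table is an assumption of this sketch, and the function name consume_named_reference is hypothetical. Parse errors are not reported here.

```python
from html.entities import html5  # keys such as "not", "not;", "notin;"

MAX_NAME_LENGTH = max(len(name) for name in html5)


def consume_named_reference(after_ampersand: str, in_attribute: bool = False):
    """Return (matched name, replacement text), or None if nothing is consumed."""
    # Try the longest candidate first; matching is case-sensitive.
    for length in range(min(MAX_NAME_LENGTH, len(after_ampersand)), 0, -1):
        name = after_ampersand[:length]
        if name in html5:
            break
    else:
        return None  # no identifier matched
    next_char = after_ampersand[length:length + 1]
    if in_attribute and not name.endswith(";") and (
        next_char == "=" or (next_char.isascii() and next_char.isalnum())
    ):
        return None  # historical rule: the matched characters are unconsumed
    return name, html5[name]
```

With the text after the ampersand being "notit; I tell you", the match is "not"; with "notin; I tell you", it is "notin;". Inside an attribute, the first case consumes nothing, because the matched name lacks a semicolon and is followed by an alphanumeric character.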

8.2.5 Tree construction

The input to the tree construction stage is a sequence of tokens from the
tokenization stage. The tree construction stage is associated with a DOM
Document object when a parser is created. The "output" of this stage consists of
dynamically modifying or extending that document's DOM tree.

This specification does not define when an interactive user agent has to render the
Document so that it is available to the user, or when it has to begin accepting user
input.

As each token is emitted from the tokenizer, the user agent must follow the appropriate steps
from the following list, known as the tree construction dispatcher:

Not all of the tag names mentioned below are conformant tag names in this
specification; many are included to handle legacy content. They still form part of the algorithm
that implementations are required to implement to claim conformance.

The algorithm described below places no limit on the depth of the DOM tree
generated, or on the length of tag names, attribute names, attribute values, Text
nodes, etc. While implementors are encouraged to avoid arbitrary limits, it is recognized that practical concerns will likely force user agents to impose nesting
depth constraints.

8.2.5.1 Creating and inserting nodes

While the parser is processing a token, it can enable or disable foster parenting. This affects the following algorithm.

The appropriate place for inserting a node, optionally using a particular
override target, is the position in an element returned by running the following steps:

If there was an override target specified, then let target be the
override target.

If there is a last template and either there is no last table, or there is one, but last template is lower
(more recently added) than last table in the stack of open
elements, then: let adjusted insertion location be inside last template's template contents, after its last child (if any),
and abort these substeps.

If there is no last table, then let adjusted insertion
location be inside the first element in the stack of open elements (the
html element), after its last child (if any), and abort these substeps.
(fragment case)

If last table has a parent element, then let adjusted insertion location be inside last table's parent
element, immediately before last table, and abort these
substeps.

These steps are involved in part because it's possible for elements, the
table element in this case in particular, to have been moved by a script around
in the DOM, or indeed removed from the DOM entirely, after the element was inserted by the
parser.

Otherwise

Let adjusted insertion location be inside target,
after its last child (if any).

When the steps below require the UA to create an
element for a token in a particular given namespace and with a
particular intended parent, the UA must run the following steps:

Create a node implementing the interface appropriate for the element type corresponding to
the tag name of the token in given namespace (as given in the specification
that defines that element, e.g. for an a element in the HTML
namespace, this specification defines it to be the HTMLAnchorElement
interface), with the tag name being the name of that element, with the node being in the given
namespace, and with the attributes on the node being those given in the given token.

When the steps below require the user agent to adjust MathML attributes for a token,
then, if the token has an attribute named definitionurl, change its name to
definitionURL (note the case difference).

When the steps below require the user agent to adjust SVG attributes for a token,
then, for each attribute on the token whose attribute name is one of the ones in the first column
of the following table, change the attribute's name to the name given in the corresponding cell in
the second column. (This fixes the case of SVG attributes that are not all lowercase.)

Attribute name on token    Attribute name on element
attributename              attributeName
attributetype              attributeType
basefrequency              baseFrequency
baseprofile                baseProfile
calcmode                   calcMode
clippathunits              clipPathUnits
contentscripttype          contentScriptType
contentstyletype           contentStyleType
diffuseconstant            diffuseConstant
edgemode                   edgeMode
externalresourcesrequired  externalResourcesRequired
filterres                  filterRes
filterunits                filterUnits
glyphref                   glyphRef
gradienttransform          gradientTransform
gradientunits              gradientUnits
kernelmatrix               kernelMatrix
kernelunitlength           kernelUnitLength
keypoints                  keyPoints
keysplines                 keySplines
keytimes                   keyTimes
lengthadjust               lengthAdjust
limitingconeangle          limitingConeAngle
markerheight               markerHeight
markerunits                markerUnits
markerwidth                markerWidth
maskcontentunits           maskContentUnits
maskunits                  maskUnits
numoctaves                 numOctaves
pathlength                 pathLength
patterncontentunits        patternContentUnits
patterntransform           patternTransform
patternunits               patternUnits
pointsatx                  pointsAtX
pointsaty                  pointsAtY
pointsatz                  pointsAtZ
preservealpha              preserveAlpha
preserveaspectratio        preserveAspectRatio
primitiveunits             primitiveUnits
refx                       refX
refy                       refY
repeatcount                repeatCount
repeatdur                  repeatDur
requiredextensions         requiredExtensions
requiredfeatures           requiredFeatures
specularconstant           specularConstant
specularexponent           specularExponent
spreadmethod               spreadMethod
startoffset                startOffset
stddeviation               stdDeviation
stitchtiles                stitchTiles
surfacescale               surfaceScale
systemlanguage             systemLanguage
tablevalues                tableValues
targetx                    targetX
targety                    targetY
textlength                 textLength
viewbox                    viewBox
viewtarget                 viewTarget
xchannelselector           xChannelSelector
ychannelselector           yChannelSelector
zoomandpan                 zoomAndPan
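Adjusting SVG attributes amounts to a table lookup on each attribute name. The sketch below uses only a handful of rows from the table above for brevity; the names SVG_ATTRIBUTE_CASE and adjust_svg_attributes are hypothetical, and attributes are modeled as a plain dictionary of name to value.

```python
# A subset of the table above; a full implementation would list every row.
SVG_ATTRIBUTE_CASE = {
    "attributename": "attributeName",
    "gradientunits": "gradientUnits",
    "preserveaspectratio": "preserveAspectRatio",
    "stddeviation": "stdDeviation",
    "viewbox": "viewBox",
}


def adjust_svg_attributes(attributes: dict[str, str]) -> dict[str, str]:
    # Names found in the table are replaced; all others pass through unchanged.
    return {SVG_ATTRIBUTE_CASE.get(name, name): value
            for name, value in attributes.items()}
```

Because the tokenizer lowercases attribute names, the lookup keys are always all lowercase, and names that are not in the table (such as fill) are left as they are.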

When the steps below require the user agent to adjust foreign attributes for a
token, then, if any of the attributes on the token match the strings given in the first column of
the following table, let the attribute be a namespaced attribute, with the prefix being the string
given in the corresponding cell in the second column, the local name being the string given in the
corresponding cell in the third column, and the namespace being the namespace given in the
corresponding cell in the fourth column. (This fixes the use of namespaced attributes, in
particular lang attributes in the XML
namespace.)

If the adjusted insertion location is in a Document node,
then abort these steps.

The DOM will not let Document nodes have Text node
children, so they are dropped on the floor.

If there is a Text node immediately before the adjusted insertion
location, then append data to that Text node's data.

Otherwise, create a new Text node whose data is data and
whose ownerDocument is the same as that of the
element in which the adjusted insertion location finds itself, and insert
the newly created node at the adjusted insertion location.

Here are some sample inputs to the parser and the corresponding number of Text
nodes that they result in, assuming a user agent that executes scripts.

One Text node before the table, containing "A BC" (A-space-B-C), and one Text node inside the table (as a child of a tbody) with a single space character. (Space characters separated from non-space characters by non-character tokens are not affected by foster parenting, even if those other tokens then get ignored.)

When the steps below require the user agent to insert a comment while processing a
comment token, optionally with an explicit insertion position position, the
user agent must run the following steps:

If the DOCTYPE token's name is not a case-sensitive match for the string "html", or the token's public identifier is not missing, or the token's system
identifier is neither missing nor a case-sensitive match for the string
"about:legacy-compat", and none of the sets of conditions in the following list are
matched, then there is a parse error.

The DOCTYPE token's name is a case-sensitive match for the string "html", the token's public identifier is the case-sensitive string
"-//W3C//DTD HTML 4.0//EN", and the token's system identifier
is either missing or the case-sensitive string "http://www.w3.org/TR/REC-html40/strict.dtd".

The DOCTYPE token's name is a case-sensitive match for the string "html", the token's public identifier is the case-sensitive string
"-//W3C//DTD HTML 4.01//EN", and the token's system identifier
is either missing or the case-sensitive string "http://www.w3.org/TR/html4/strict.dtd".

The DOCTYPE token's name is a case-sensitive match for the string "html", the token's public identifier is the case-sensitive string
"-//W3C//DTD XHTML 1.0 Strict//EN", and the token's system
identifier is the case-sensitive string "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd".

The DOCTYPE token's name is a case-sensitive match for the string "html", the token's public identifier is the case-sensitive string
"-//W3C//DTD XHTML 1.1//EN", and the token's system identifier
is the case-sensitive string "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd".
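
The DOCTYPE check above, together with its four permitted legacy combinations, could be expressed roughly as below. This is a sketch under stated assumptions: the function name and the use of None to represent a missing identifier are illustrative choices, and only the condition sets listed above are covered.

```python
from typing import Optional

# (public identifier, permitted system identifiers); None means "missing".
PERMITTED_LEGACY = [
    ("-//W3C//DTD HTML 4.0//EN",
     {None, "http://www.w3.org/TR/REC-html40/strict.dtd"}),
    ("-//W3C//DTD HTML 4.01//EN",
     {None, "http://www.w3.org/TR/html4/strict.dtd"}),
    ("-//W3C//DTD XHTML 1.0 Strict//EN",
     {"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"}),
    ("-//W3C//DTD XHTML 1.1//EN",
     {"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"}),
]


def doctype_is_parse_error(name: Optional[str],
                           public_id: Optional[str],
                           system_id: Optional[str]) -> bool:
    """Return True if the DOCTYPE token is a parse error under the rules above."""
    unusual = (
        name != "html"
        or public_id is not None
        or system_id not in (None, "about:legacy-compat")
    )
    if not unusual:
        return False
    # An unusual DOCTYPE is still accepted if it matches a permitted legacy set.
    if name == "html":
        for legacy_public, legacy_systems in PERMITTED_LEGACY:
            if public_id == legacy_public and system_id in legacy_systems:
                return False
    return True
```

All comparisons here are case-sensitive, as the text requires. Note that the XHTML entries do not permit a missing system identifier.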

Conformance checkers may, based on the values (including presence or lack thereof) of the
DOCTYPE token's name, public identifier, or system identifier, switch to a conformance checking
mode for another language (e.g. based on the DOCTYPE token a conformance checker could recognize
that the document is an HTML4-era document, and defer to an HTML4 conformance checker.)

Append a DocumentType node to the Document node, with the name attribute set to the name given in the DOCTYPE token, or the empty string
if the name was missing; the publicId attribute set to the public
identifier given in the DOCTYPE token, or the empty string if the public identifier was missing;
the systemId attribute set to the system identifier given in the DOCTYPE
token, or the empty string if the system identifier was missing; and the other attributes
specific to DocumentType objects set to null and empty lists as appropriate.
Associate the DocumentType node with the Document object so that it is
returned as the value of the doctype attribute of the
Document object.

The system identifier is not missing and the public identifier starts with: "-//W3C//DTD HTML 4.01 Frameset//"

The system identifier is not missing and the public identifier starts with: "-//W3C//DTD HTML 4.01 Transitional//"

The system identifier and public identifier strings must be compared to the values given in
the lists above in an ASCII case-insensitive manner. A system identifier whose
value is the empty string is not considered missing for the purposes of the conditions
above.
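
The comparison rule above can be illustrated with a small sketch. It covers only the two prefixes shown in this excerpt and makes no claim about which mode a match selects. The helper names are hypothetical, and None is assumed to represent a missing identifier, so that an empty string still counts as present.

```python
import string
from typing import Optional

# Fold only A-Z to a-z; str.lower() would also fold non-ASCII letters.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

LISTED_PREFIXES = (
    "-//W3C//DTD HTML 4.01 Frameset//",
    "-//W3C//DTD HTML 4.01 Transitional//",
)


def ascii_lower(value: str) -> str:
    """Lowercase ASCII letters only, leaving every other character untouched."""
    return value.translate(_ASCII_LOWER)


def matches_listed_prefix(public_id: Optional[str],
                          system_id: Optional[str]) -> bool:
    """True if the system identifier is present and the public identifier
    starts with one of the listed prefixes, compared ASCII case-insensitively."""
    if system_id is None or public_id is None:
        return False
    folded = ascii_lower(public_id)
    return any(folded.startswith(ascii_lower(p)) for p in LISTED_PREFIXES)
```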

The root element can end up being removed from the Document object, e.g. by
scripts; nothing in particular happens in such cases, and content continues being appended to the
nodes as described in the next section.

8.2.5.4.3 The "before head" insertion mode

When the user agent is to apply the rules for the "before head" insertion mode, the user agent must handle the token as
follows:

A character token that is one of U+0009 CHARACTER
TABULATION, "LF" (U+000A), "FF" (U+000C),
"CR" (U+000D), or U+0020 SPACE

This ensures that, if the script is external, any document.write() calls in the script will execute in-line,
instead of blowing the document away, as would happen in most other cases. It also prevents
the script from executing until the end tag is seen.

Otherwise, for each attribute on the token, check to see if the attribute is already present
on the top element of the stack of open elements. If it is not, add the attribute
and its corresponding value to that element.

Otherwise, set the frameset-ok flag to "not ok"; then, for each attribute on the
token, check to see if the attribute is already present on the body element (the
second element) on the stack of open elements, and if it is not, add the attribute
and its corresponding value to that element.
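
The attribute merging described in the two paragraphs above amounts to adding only those attributes the existing element lacks. A minimal sketch, assuming attributes are modelled as plain dictionaries (a hypothetical representation); the frameset-ok flag is not modelled:

```python
def merge_missing_attributes(element_attrs: dict, token_attrs: dict) -> None:
    """Copy each token attribute onto the element unless it is already present."""
    for name, value in token_attrs.items():
        if name not in element_attrs:
            element_attrs[name] = value
```

Existing values win: a second start tag can add attributes to the element but never overwrites one that is already set.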

Once a start tag with the tag name "plaintext" has been seen, that will be the
last token ever seen other than character tokens (and the end-of-file token), because there is
no way to switch out of the PLAINTEXT state.

In the non-conforming stream
<a href="a">a<table><a href="b">b</table>x, the first
a element would be closed upon seeing the second one, and the "x" character would
be inside a link to "b", not to "a". This is despite the fact that the outer a
element is not in table scope (meaning that a regular </a> end tag at the start
of the table wouldn't close the outer a element). The result is that the two
a elements are indirectly nested inside each other — non-conforming markup
will often result in non-conforming DOMs when parsed.

If the token does not have an attribute with the name "type", or if it does, but that
attribute's value is not an ASCII case-insensitive match for the string "hidden", then: set the frameset-ok flag to "not ok".

Prompt: If the token has an attribute
with the name "prompt", then the first stream of characters must be the same string as given in
that attribute, and the second stream of characters must be empty. Otherwise, the two streams of
character tokens together should, together with the input element, express the
equivalent of "This is a searchable index. Enter search keywords: (input field)" in the user's
preferred language.

This algorithm's name, the "adoption agency algorithm", comes from the way it
causes elements to change parents, and is in contrast with other possible algorithms for dealing
with misnested content, which included the "incest algorithm", the "secret affair algorithm", and
the "Heisenberg algorithm".

8.2.5.4.8 The "text" insertion mode

When the user agent is to apply the rules for the "text" insertion mode, the user agent must handle the token as
follows:

Set the parser pause flag to true, and abort the processing of any nested
invocations of the tokenizer, yielding control back to the caller. (Tokenization will resume
when the caller returns to the "outer" tree construction stage.)

If the token does not have an attribute with the name "type", or if it does, but that
attribute's value is not an ASCII case-insensitive match for the string "hidden", then: act as described in the "anything else" entry below.

If the adjusted current node is an element in the SVG namespace, and the
token's tag name is one of the ones in the first column of the following table, change the tag
name to the name given in the corresponding cell in the second column. (This fixes the case of
SVG elements that are not all lowercase.)

8.2.7 Coercing an HTML DOM into an infoset

When an application uses an HTML parser in conjunction with an XML pipeline, it is
possible that the constructed DOM is not compatible with the XML toolchain in certain subtle
ways. For example, an XML toolchain might not be able to represent attributes with the name xmlns, since they conflict with the Namespaces in XML syntax. There is also some
data that the HTML parser generates that isn't included in the DOM itself. This
section specifies some rules for handling these issues.

If the XML API being used doesn't support DOCTYPEs, the tool may drop DOCTYPEs altogether.

If the XML API doesn't support attributes in no namespace that are named "xmlns", attributes whose names start with "xmlns:", or
attributes in the XMLNS namespace, then the tool may drop such attributes.

The tool may annotate the output with any namespace declarations required for proper
operation.

If the XML API being used restricts the allowable characters in the local names of elements and
attributes, then the tool may map all element and attribute local names that the API wouldn't
support to a set of names that are allowed, by replacing any character that isn't
supported with the uppercase letter U and the six digits of the character's Unicode code point
when expressed in hexadecimal, using digits 0-9 and capital letters A-F as the symbols, in
increasing numeric order.

For example, the element name foo<bar, which can be
output by the HTML parser, though it is neither a legal HTML element name nor a
well-formed XML element name, would be converted into fooU00003Cbar, which
is a well-formed XML element name (though it's still not legal in HTML by any means).

As another example, consider the attribute xlink:href. Used on a
MathML element, it becomes, after being adjusted,
an attribute with a prefix "xlink" and a local name "href". However, used on an HTML element, it becomes an attribute with no prefix
and the local name "xlink:href", which is not a valid NCName, and thus might
not be accepted by an XML API. It could thus get converted, becoming "xlinkU00003Ahref".

The resulting names from this conversion conveniently can't clash with any
attribute generated by the HTML parser, since those are all either lowercase or those
listed in the adjust foreign attributes algorithm's table.
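
The name mapping above can be sketched as follows. The set of "supported" characters depends on the XML API in use; the ASCII-only sets below are a deliberately narrow assumption for illustration, and the function name is hypothetical.

```python
import string

# Assumption: an ASCII-only approximation of what an XML API accepts in a
# local name. Real XML name rules admit many more (non-ASCII) characters.
NAME_START = set(string.ascii_letters + "_")
NAME_CHARS = NAME_START | set(string.digits + "-.")


def coerce_local_name(name: str) -> str:
    """Replace each unsupported character with 'U' plus six uppercase hex digits."""
    out = []
    for index, char in enumerate(name):
        allowed = NAME_START if index == 0 else NAME_CHARS
        if char in allowed:
            out.append(char)
        else:
            # Six hexadecimal digits of the code point, using 0-9 and A-F.
            out.append("U{:06X}".format(ord(char)))
    return "".join(out)
```

With these assumptions, coerce_local_name("foo<bar") gives "fooU00003Cbar" and coerce_local_name("xlink:href") gives "xlinkU00003Ahref", matching the two examples above.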

If the XML API restricts comments from having two consecutive U+002D HYPHEN-MINUS characters
(--), the tool may insert a single U+0020 SPACE character between any such offending
characters.

If the XML API restricts comments from ending in a "-" (U+002D) character, the tool
may insert a single U+0020 SPACE character at the end of such comments.
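
The two comment adjustments above could be implemented along these lines. This is a sketch, with a hypothetical function name, that applies both rules unconditionally; a real tool would apply each only if its XML API imposes the corresponding restriction.

```python
import re


def coerce_comment(data: str) -> str:
    """Make comment data acceptable to an XML API that forbids '--' and a trailing '-'."""
    # Insert a space after every hyphen that is directly followed by another hyphen.
    data = re.sub(r"-(?=-)", "- ", data)
    # A comment must not end in a hyphen either, so pad it with a space.
    if data.endswith("-"):
        data += " "
    return data
```

The lookahead leaves the second hyphen unconsumed, so runs of three or more hyphens are separated pairwise.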

If the XML API restricts allowed characters in character data, attribute values, or comments,
the tool may replace any "FF" (U+000C) character with a U+0020 SPACE character, and any
other literal non-XML character with a U+FFFD REPLACEMENT CHARACTER.
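
A sketch of the character replacement above, assuming "non-XML character" means any code point outside the Char production of XML 1.0. That reading and the function names are assumptions for illustration.

```python
def is_xml_char(code_point: int) -> bool:
    """True if the code point is matched by the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def coerce_character_data(text: str) -> str:
    """Replace FF with SPACE and any other non-XML character with U+FFFD."""
    out = []
    for char in text:
        if char == "\f":
            out.append(" ")
        elif is_xml_char(ord(char)):
            out.append(char)
        else:
            out.append("\ufffd")
    return "".join(out)
```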

If the tool has no way to convey out-of-band information, then the tool may drop the following
information:

The mutations allowed by this section apply after the HTML
parser's rules have been applied. For example, a <a::> start tag
will be closed by a </a::> end tag, and never by a </aU00003AU00003A> end tag, even if the user agent is using the rules above to
then generate an actual element in the DOM with the name aU00003AU00003A for
that start tag.

8.2.8 An introduction to error handling and strange cases in the parser

This section is non-normative.

This section examines some erroneous markup and discusses how the HTML parser
handles these cases.

8.2.8.1 Misnested tags: <b><i></b></i>

This section is non-normative.

The most often discussed example of erroneous markup is as follows:

<p>1<b>2<i>3</b>4</i>5</p>

The parsing of this markup is straightforward up to the "3". At this point, the DOM looks like
this:

The next token is a character ("4"), which triggers the reconstruction of the active formatting elements, in this case just
the i element. A new i element is thus created for the "4"
Text node. After the end tag token for the "i" is also received, and the "5"
Text node is inserted, the DOM looks as follows:

Upon receiving the end tag token with the tag name "b", the "adoption
agency algorithm" is invoked, as in the previous example. However, in this case, there
is a furthest block, namely the p element. Thus, this
time the adoption agency algorithm isn't skipped over.

The common ancestor is the body element. A conceptual
"bookmark" marks the position of the b in the list of active formatting
elements, but since that list has only one element in it, the bookmark won't have much
effect.

As the algorithm progresses, node ends up set to the formatting element
(b), and last node ends up set to the furthest
block (p).

The last node gets appended (moved) to the common
ancestor, so that the DOM looks like:

8.2.8.3 Unexpected markup in tables

The highlighted b element start tag is not allowed directly inside a table like
that, and the parser handles this case by placing the element before the table. (This is
called foster parenting.) This can be seen by examining the DOM tree
as it stands just after the table element's start tag has been seen:

The tr start tag causes the b element to be popped off the stack and
a tbody start tag to be implied; the tbody and tr elements
are then handled in a rather straightforward manner, taking the parser through the "in table body" and "in row" insertion modes, after which the DOM looks as follows:

Thus it is that the "bbb" character tokens are found. These trigger the "in table text" insertion mode to be used (with the original
insertion mode set to "in table body").
The character tokens are collected, and when the next token (the table element end
tag) is seen, they are processed as a group. Since they are not all spaces, they are handled as
per the "anything else" rules in the "in table"
insertion mode, which defer to the "in body"
insertion mode but with foster parenting.

8.2.8.4 Scripts that modify the page as it is being parsed

This section is non-normative.

Consider the following markup, which for this example we will assume is the document with
URL http://example.com/inner, being rendered as the content of
an iframe in another document with the URL http://example.com/outer:

This script does execute, resulting in an alert that says "http://example.com/inner".

8.2.8.5 The execution of scripts that are moving across multiple documents

This section is non-normative.

Elaborating on the example in the previous section, consider the case where the second
script element is an external script (i.e. one with a src attribute). Since the element was not in the parser's
Document when it was created, that external script is not even downloaded.

In a case where a script element with a src
attribute is parsed normally into its parser's Document, but while the external
script is being downloaded, the element is moved to another document, the script continues to
download, but does not execute.

In general, moving script elements between Documents is
considered a bad practice.

8.2.8.6 Unclosed formatting elements

This section is non-normative.

The following markup shows how nested formatting elements (such as b) get
collected and continue to be applied even as the elements they are contained in are closed, but
that excessive duplicates are thrown away.

Note how the second p element in the markup has no explicit b
elements, but in the resulting DOM, up to three of each kind of formatting element (in this case
three b elements with the class attribute, and two unadorned b elements)
get reconstructed before the element's "X".

Also note how this means that in the final paragraph only six b end tags are
needed to completely clear the list of formatting elements, even though nine b start
tags have been seen up to this point.

8.3 Serializing HTML fragments

The following steps form the HTML fragment serialization algorithm. The algorithm
takes as input a DOM Element, Document, or DocumentFragment
referred to as the node, and either returns a string or throws an
exception.

This algorithm serializes the children of the node being serialized, not
the node itself.

The attribute's serialized name is the string "xlink:"
followed by the attribute's local name.

If the attribute is in some other namespace

The attribute's serialized name is the attribute's qualified name.

While the exact order of attributes is UA-defined, and may depend on factors such as the
order that the attributes were given in the original markup, the sort order must be stable,
such that consecutive invocations of this algorithm serialize an element's attributes in the
same order.

If current node is a pre, textarea, or
listing element, and the first child node of the element, if any, is a
Text node whose character data has as its first character a "LF" (U+000A) character, then append a "LF" (U+000A) character.

Append the value of running the HTML fragment serialization algorithm on the
current node element (thus recursing into this algorithm for that
element), followed by a "<" (U+003C) character, a U+002F SOLIDUS character
(/), tagname again, and finally a U+003E GREATER-THAN SIGN character
(>).

Append the literal string <? (U+003C LESS-THAN SIGN, U+003F QUESTION
MARK), followed by the value of current node's target IDL attribute, followed by a single U+0020 SPACE character, followed
by the value of current node's data IDL attribute,
followed by a single ">" (U+003E) character.

It is possible that the output of this algorithm, if parsed with an HTML
parser, will not return the original tree structure.

For instance, if a textarea element to which a Comment node
has been appended is serialized and the output is then reparsed, the comment will end up being
displayed in the text field. Similarly, if, as a result of DOM manipulation, an element contains
a comment that contains the literal string "-->", then when the result
of serializing the element is parsed, the comment will be truncated at that point and the rest of
the comment will be interpreted as markup. More examples would be making a script
element contain a Text node with the text string "</script>", or
having a p element that contains a ul element (as the ul
element's start tag would imply the end tag for the
p).

This can enable cross-site scripting attacks. An example of this would be a page that lets the
user enter some font family names that are then inserted into a CSS style block via
the DOM and which then uses the innerHTML IDL attribute to get
the HTML serialization of that style element: if the user enters
"</style><script>attack</script>" as a font family name, innerHTML will return markup that, if parsed in a different context,
would contain a script node, even though no script node existed in the
original DOM.

Escaping a string (for the purposes of the algorithm above)
consists of running the following steps:

Replace any occurrence of the "&" character by the string "&amp;".

Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".

If the algorithm was invoked in the attribute mode, replace any occurrences of the
U+0022 QUOTATION MARK character (") by the string "&quot;".

If the algorithm was not invoked in the attribute mode, replace any
occurrences of the "<" character by the string "&lt;", and any occurrences of the ">" character by
the string "&gt;".
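
The escaping steps above translate directly into code. A minimal sketch, where the function name and the boolean parameter used to select the attribute mode are illustrative choices:

```python
def escape(text: str, attribute_mode: bool = False) -> str:
    """Escape a string following the steps above, in order."""
    # Ampersands go first so that later replacements are not double-escaped.
    text = text.replace("&", "&amp;")
    text = text.replace("\u00a0", "&nbsp;")
    if attribute_mode:
        text = text.replace('"', "&quot;")
    else:
        text = text.replace("<", "&lt;").replace(">", "&gt;")
    return text
```

Note that in the attribute mode "<" and ">" are left alone, and outside it the quotation mark is left alone.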

8.4 Parsing HTML fragments

The following steps form the HTML fragment parsing algorithm. The algorithm
optionally takes as input an Element node, referred to as the context element, which gives the context for
the parser, as well as input, a string to parse, and returns a list of zero or
more nodes.

Parts marked fragment case in algorithms in the parser section are
parts that only occur if the parser was created for the purposes of this algorithm (and with a
context element). The algorithms have been annotated
with such markings for informational purposes only; such markings have no normative weight. If it
is possible for a condition described as a fragment case to occur even when the
parser wasn't created for the purposes of handling this algorithm, then that is an error in the
specification.

For performance reasons, an implementation that does not report errors and
that uses the actual state machine described in this specification directly could use the
PLAINTEXT state instead of the RAWTEXT and script data states where those are mentioned in the
list above. Except for rules regarding parse errors, they are equivalent, since there is no
appropriate end tag token in the fragment case, yet they involve far fewer state
transitions.

The parser will reference the context element as part of that algorithm.

Set the parser's
form element pointer to the nearest node to the context element that is a form element
(going straight up the ancestor chain, and including the element itself, if it is a
form element), if any. (If there is no such form element, the form
element pointer keeps its initial value, null.)
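
The ancestor walk in the step above might look like the following. The Element class here is a hypothetical stand-in with only a tag name and a parent reference.

```python
from typing import Optional


class Element:
    """Hypothetical minimal element: a tag name and an optional parent."""

    def __init__(self, tag: str, parent: Optional["Element"] = None):
        self.tag = tag
        self.parent = parent


def nearest_form(context: Element) -> Optional[Element]:
    """Return the closest form element, starting from the context element itself."""
    node: Optional[Element] = context
    while node is not None:
        if node.tag == "form":
            return node
        node = node.parent
    # No form ancestor: the form element pointer keeps its initial value, null.
    return None
```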