For HTML documents, user agents must use the parsing
rules described in this section to generate the DOM trees. Together,
these rules define what is referred to as the HTML
parser.

While the HTML form of HTML5 bears a close resemblance to SGML
and XML, it is a separate language with its own parsing rules.

Some earlier versions of HTML (in particular from HTML2 to
HTML4) were based on SGML and used SGML parsing rules. However, few
(if any) web browsers ever implemented true SGML parsing for HTML
documents; the only user agents to strictly handle HTML as an SGML
application have historically been validators. The resulting
confusion — with validators claiming documents to have one
representation while widely deployed Web browsers interoperably
implemented a different representation — has wasted decades
of productivity. This version of HTML thus returns to a non-SGML
basis.

Authors interested in using SGML tools in their authoring
pipeline are encouraged to use XML tools and the XML serialization
of HTML5.

This specification defines the parsing rules for HTML documents,
whether they are syntactically correct or not. Certain points in the
parsing algorithm are said to be parse
errors. The error handling for parse errors is well-defined:
user agents must either act as described below when encountering
such problems, or must abort processing at the first error that they
encounter for which they do not wish to apply the rules described
below.

Conformance checkers must report at least one parse error
condition to the user if one or more parse error conditions exist in
the document and must not report parse error conditions if none
exist in the document. Conformance checkers may report more than one
parse error condition if more than one parse error condition exists
in the document. Conformance checkers are not required to recover
from parse errors.

Parse errors are only errors with the
syntax of HTML. In addition to checking for parse errors,
conformance checkers will also verify that the document obeys all
the other conformance requirements described in this
specification.

8.2.1 Overview of the parsing model

The input to the HTML parsing process consists of a stream of
Unicode characters, which is passed through a
tokenization stage (lexical analysis) followed by a
tree construction stage (semantic analysis). The output
is a Document object.

Implementations that do not
support scripting do not have to actually create a DOM
Document object, but the DOM tree in such cases is
still used as the model for the rest of the specification.

There is only one set of state for the
tokeniser stage and the tree construction stage, but the tree
construction stage is reentrant, meaning that while the tree
construction stage is handling one token, the tokeniser might be
resumed, causing further tokens to be emitted and processed before
the first token's processing is complete.

In the following example, the tree construction stage will be
called upon to handle a "p" start tag token while handling the
"script" start tag token:

...
<script>
document.write('<p>');
</script>
...

To handle these cases, parsers have a script nesting
level, which must be initially set to zero, and a parser
pause flag, which must be initially set to false.

8.2.2 The input stream

The stream of Unicode characters that comprises the input to the
tokenization stage will initially be seen by the user agent as a
stream of bytes (typically coming over the network or from the local
file system). The bytes encode the actual characters according to a
particular character encoding, which the user agent must
use to decode the bytes into characters.

For XML documents, the algorithm user agents must
use to determine the character encoding is given by the XML
specification. This section does not apply to XML documents. [XML]

8.2.2.1 Determining the character encoding

In some cases, it might be impractical to unambiguously determine
the encoding before parsing the document. Because of this, this
specification provides for a two-pass mechanism with an optional
pre-scan. Implementations are allowed, as described below, to apply
a simplified parsing algorithm to whatever bytes they have available
before beginning to parse the document. Then, the real parser is
started, using a tentative encoding derived from this pre-parse and
other out-of-band metadata. If, while the document is being loaded,
the user agent discovers an encoding declaration that conflicts with
this information, then the parser can get reinvoked to perform a
parse of the document with the real encoding.

User agents must use the following
algorithm (the encoding sniffing algorithm) to determine
the character encoding to use when decoding a document in the first
pass. This algorithm takes as input any out-of-band metadata
available to the user agent (e.g. the Content-Type metadata of the document)
and all the bytes available so far, and returns an encoding and a
confidence. The
confidence is either tentative, certain, or
irrelevant. The encoding used, and whether the confidence in
that encoding is tentative or certain, is used during the parsing to
determine whether to change the encoding. If no
encoding is necessary, e.g. because the parser is operating on a
stream of Unicode characters and doesn't have to use an encoding at
all, then the confidence is
irrelevant.

If the transport layer specifies an encoding, and it is
supported, return that encoding with the confidence certain, and abort these steps.

The user agent may wait for more bytes of the resource to be
available, either in this step or at any later step in this
algorithm. For instance, a user agent might wait 500ms or 512
bytes, whichever comes first. In general, preparsing the source to
find the encoding improves performance, as it reduces the need to
throw away the data structures used when parsing upon finding the
encoding information. However, if the user agent delays too long to
obtain data to determine the encoding, then the cost of the delay
could outweigh any performance improvements from the
preparse.

For each of the rows in the following table, starting with
the first one and going down, if there are as many or more bytes
available than the number of bytes in the first column, and the
first bytes of the file match the bytes given in the first column,
then return the encoding given in the cell in the second column of
that row, with the confidence certain, and abort these steps:

Bytes in Hexadecimal   Encoding
FE FF                  UTF-16BE
FF FE                  UTF-16LE
EF BB BF               UTF-8

This step looks for Unicode Byte Order Marks
(BOMs).
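The BOM check above can be sketched as follows. This is a
non-normative illustration; the function name and return convention
are not part of this specification.

```python
# Check the first bytes of the stream against the three byte order
# marks, in table order. Returns (encoding, confidence) on a match,
# or None if no BOM is present.

BOM_TABLE = [
    (b"\xFE\xFF", "UTF-16BE"),
    (b"\xFF\xFE", "UTF-16LE"),
    (b"\xEF\xBB\xBF", "UTF-8"),
]

def sniff_bom(data: bytes):
    for bom, encoding in BOM_TABLE:
        # Only decide if at least as many bytes are available as the
        # row requires, and they match exactly.
        if len(data) >= len(bom) and data[:len(bom)] == bom:
            return encoding, "certain"
    return None
```

Note that the three byte patterns have no common prefixes, so the
order of the rows does not affect the result here.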

Otherwise, the user agent will have to search for explicit
character encoding information in the file itself. This should
proceed as follows:

Let position be a pointer to a byte in the
input stream, initially pointing at the first byte. If at any
point during these substeps the user agent either runs out of
bytes or decides that scanning further bytes would not be
efficient, then skip to the next step of the overall character
encoding detection algorithm. User agents may decide that scanning
any bytes is not efficient, in which case these substeps
are entirely skipped.

Now, repeat the following two steps until the algorithm
aborts (either because the user agent aborts, as described above, or
because a character encoding is found):

If position points to:

A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (ASCII '<!--')

Advance the position pointer so that it
points at the first 0x3E byte which is preceded by two 0x2D
bytes (i.e. at the end of an ASCII '-->' sequence) and comes
after the 0x3C byte that was found. (The two 0x2D bytes can be
the same as those in the '<!--' sequence.)

A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and finally one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '<meta' followed by a space or slash)

Advance the position pointer so
that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or
0x2F byte (the one in the sequence of bytes matched
above).

Get
an attribute and its value. If no attribute was
sniffed, then skip this inner set of steps, and jump to the
second step in the overall "two step" algorithm.

If the attribute's name is neither "charset" nor "content",
then return to step 2 in these inner steps.

If the attribute's name is "charset", let charset be
the attribute's value, interpreted as a character
encoding.

Repeatedly get an
attribute until no further attributes can be found,
then jump to the second step in the overall "two step"
algorithm.

A sequence of bytes starting with: 0x3C 0x21 (ASCII '<!')

A sequence of bytes starting with: 0x3C 0x2F (ASCII '</')

A sequence of bytes starting with: 0x3C 0x3F (ASCII '<?')

Advance the position pointer so that it
points at the first 0x3E byte (ASCII '>') that comes after the
0x3C byte that was found.

Any other byte

Do nothing with that byte.

Move position so it points at the next
byte in the input stream, and return to the first step of this
"two step" algorithm.

When the above "two step" algorithm says to get an
attribute, it means doing this:

If the byte at position is one of 0x09
(ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR),
0x20 (ASCII space), or 0x2F (ASCII '/') then advance position to the next byte and redo this
substep.

If the byte at position is 0x3E (ASCII
'>'), then abort the "get an attribute" algorithm. There isn't
one.

Otherwise, the byte at position is the
start of the attribute name. Let attribute
name and attribute value be the empty
string.

Attribute name: Process the byte at position as follows:

If it is 0x3D (ASCII '='), and the attribute
name is longer than the empty string

Advance position to the next byte and
jump to the step below labeled value.

Abort the "get an attribute" algorithm. The attribute's
name is the value of attribute name, its
value is the empty string.

If it is in the range 0x41 (ASCII 'A') to 0x5A (ASCII
'Z')

Append the Unicode character with codepoint b+0x20 to attribute
name (where b is the value of the
byte at position).

Anything else

Append the Unicode character with the same codepoint as the
value of the byte at position to attribute name. (It doesn't actually matter how
bytes outside the ASCII range are handled here, since only
ASCII characters can contribute to the detection of a character
encoding.)

Advance position to the next byte and
return to the previous step.

Spaces. If the byte at position is one of 0x09 (ASCII TAB), 0x0A (ASCII
LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
advance position to the next byte, then,
repeat this step.

If the byte at position is
not 0x3D (ASCII '='), abort the "get an attribute"
algorithm. The attribute's name is the value of attribute name, its value is the empty
string.

Advance position past the 0x3D (ASCII
'=') byte.

Value. If the byte at position is one of 0x09 (ASCII TAB), 0x0A (ASCII
LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
advance position to the next byte, then,
repeat this step.

Process the byte at position as
follows:

If it is 0x22 (ASCII '"') or 0x27 ("'")

Let b be the value of the byte at
position.

Advance position to the next
byte.

If the value of the byte at position
is the value of b, then advance position to the next byte and abort the "get
an attribute" algorithm. The attribute's name is the value of
attribute name, and its value is the
value of attribute value.

Otherwise, if the value of the byte at position is in the range 0x41 (ASCII 'A') to
0x5A (ASCII 'Z'), then append a Unicode character to attribute value whose codepoint is 0x20 more
than the value of the byte at position.

Otherwise, append a Unicode character to attribute value whose codepoint is the same as
the value of the byte at position.

Return to the second step in these substeps.

If it is 0x3E (ASCII '>')

Abort the "get an attribute" algorithm. The attribute's
name is the value of attribute name, its
value is the empty string.

If it is in the range 0x41 (ASCII 'A') to 0x5A (ASCII
'Z')

Append the Unicode character with codepoint b+0x20 to attribute
value (where b is the value of the
byte at position). Advance position to the next byte.

Anything else

Append the Unicode character with the same codepoint as the
value of the byte at position to attribute value. Advance position to the next byte.

Abort the "get an attribute" algorithm. The attribute's
name is the value of attribute name and its
value is the value of attribute value.

If it is in the range 0x41 (ASCII 'A') to 0x5A (ASCII
'Z')

Append the Unicode character with codepoint b+0x20 to attribute
value (where b is the value of the
byte at position).

Anything else

Append the Unicode character with the same codepoint as the
value of the byte at position to attribute value.

Advance position to the next byte and
return to the previous step.

For the sake of interoperability, user agents should not use a
pre-scan algorithm that returns different results than the one
described above. (But, if you do, please at least let us know, so
that we can improve this algorithm and benefit everyone...)
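A deliberately simplified, non-normative sketch of the pre-scan is
shown below. It handles only a quoted or unquoted charset attribute
and approximates comment and tag skipping; the normative "get an
attribute" steps above also cover the content attribute and further
edge cases, so a conforming implementation must follow those steps,
not this sketch.

```python
import re

def prescan_for_charset(data: bytes):
    """Return a tentative encoding label sniffed from a <meta charset>
    byte sequence in the available bytes, or None."""
    pos = 0
    meta = re.compile(rb"<[Mm][Ee][Tt][Aa][\t\n\x0c\r /]")
    while pos < len(data):
        if data.startswith(b"<!--", pos):
            # Skip to the end of the comment. Searching from pos + 2
            # lets the '-->' share its 0x2D bytes with the '<!--'.
            end = data.find(b"-->", pos + 2)
            if end == -1:
                return None
            pos = end + 3
        elif meta.match(data, pos):
            # Look for a charset attribute within this tag only.
            gt = data.find(b">", pos)
            tag = data[pos:gt if gt != -1 else len(data)]
            m = re.search(
                rb"charset[\t\n\x0c\r ]*=[\t\n\x0c\r ]*"
                rb"([\"']?)([^\"'\t\n\x0c\r >/]+)\1",
                tag, re.I)
            if m:
                return m.group(2).decode("ascii", "replace").lower()
            pos = gt + 1 if gt != -1 else len(data)
        elif data.startswith((b"<!", b"</", b"<?"), pos):
            # Skip markup declarations, end tags, and processing
            # instructions up to the next 0x3E byte.
            gt = data.find(b">", pos)
            if gt == -1:
                return None
            pos = gt + 1
        else:
            pos += 1
    return None
```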

If the user agent has information on the likely encoding for
this page, e.g. based on the encoding of the page when it was last
visited, then return that encoding, with the confidence tentative, and abort these steps.

The user agent may attempt to autodetect the character
encoding from applying frequency analysis or other algorithms to
the data stream. If autodetection succeeds in determining a
character encoding, then return that encoding, with the confidence tentative, and abort these steps. [UNIVCHARDET]

Otherwise, return an implementation-defined or
user-specified default character encoding, with the confidence
tentative. In non-legacy environments, the more
comprehensive UTF-8 encoding is recommended. Due to its use in
legacy content, windows-1252 is recommended instead as a default in
predominantly Western demographics. Since these encodings
can in many cases be distinguished by inspection, a user agent may
heuristically decide which to use as a default.

The document's character encoding must immediately
be set to the value returned from this algorithm, at the same time
as the user agent uses the returned value to select the decoder to
use for the input stream.

8.2.2.2 Character encoding requirements

User agents must at a minimum support the UTF-8 and Windows-1252
encodings, but may support more.

It is not unusual for Web browsers to support dozens
if not upwards of a hundred distinct character encodings.

User agents must support the preferred MIME name of every
character encoding they support that has a preferred MIME name, and
should support all the IANA-registered aliases. [IANACHARSET]

When comparing a string specifying a character encoding with the
name or alias of a character encoding to determine if they are
equal, user agents must ignore all characters in the ranges U+0009
to U+000D, U+0020 to U+002F, U+003A to U+0040, U+005B to U+0060, and
U+007B to U+007E (all whitespace and punctuation characters in
ASCII) in both names, and then perform the comparison in an
ASCII case-insensitive manner.

For instance, "GB_2312-80" and "g.b.2312(80)" are
considered equivalent names.
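The comparison rule above can be sketched as follows (a
non-normative illustration; the function names are not part of this
specification):

```python
# Characters to ignore when comparing encoding names: all ASCII
# whitespace and punctuation, per the ranges given above.
IGNORED = set(
    list(range(0x09, 0x0E)) + list(range(0x20, 0x30)) +
    list(range(0x3A, 0x41)) + list(range(0x5B, 0x61)) +
    list(range(0x7B, 0x7F))
)

def encoding_names_equal(a: str, b: str) -> bool:
    def normalize(name: str) -> str:
        kept = "".join(c for c in name if ord(c) not in IGNORED)
        # ASCII case-insensitive: fold only A-Z, leaving any
        # non-ASCII characters untouched.
        return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c
                       for c in kept)
    return normalize(a) == normalize(b)
```

With this rule, "GB_2312-80" and "g.b.2312(80)" both normalize to
"gb231280" and so compare equal.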

When a user agent would otherwise use an encoding given in the
first column of the following table, it must instead use the
encoding given in the cell in the second column of the same row. Any
bytes that are treated differently due to this encoding aliasing
must be considered parse
errors.

Support for encodings based on EBCDIC is not recommended. These
encodings are rarely used for publicly-facing Web content.

Support for UTF-32 is not recommended. This encoding is rarely
used, and frequently misimplemented.

This specification does not make any attempt to
support EBCDIC-based encodings and UTF-32 in its algorithms; support
and use of these encodings can thus lead to unexpected behavior in
implementations of this specification.

8.2.2.3 Preprocessing the input stream

Given an encoding, the bytes in the input stream must be
converted to Unicode characters for the tokeniser, as described by
the rules for that encoding, except that the leading U+FEFF BYTE
ORDER MARK character, if any, must not be stripped by the encoding
layer (it is stripped by the rule below).

Bytes or sequences of bytes in the original byte stream that
could not be converted to Unicode characters must be converted to
U+FFFD REPLACEMENT CHARACTER code points.

Bytes or sequences of bytes in the original byte
stream that did not conform to the encoding specification
(e.g. invalid UTF-8 byte sequences in a UTF-8 input stream) are
errors that conformance checkers are expected to report.

One leading U+FEFF BYTE ORDER MARK character must be ignored if
any are present.

All U+0000 NULL characters in the input must be replaced by
U+FFFD REPLACEMENT CHARACTERs. Any occurrence of such a character is
a parse error.

U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
characters are treated specially. Any CR characters that are
followed by LF characters must be removed, and any CR characters not
followed by LF characters must be converted to LF characters. Thus,
newlines in HTML DOMs are represented by LF characters, and there
are never any CR characters in the input to the
tokenization stage.
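The preprocessing rules above (BOM stripping, NULL replacement, and
newline normalization) can be sketched as follows; this is a
non-normative illustration operating on an already-decoded string:

```python
def preprocess_input(chars: str) -> str:
    if chars.startswith("\uFEFF"):
        chars = chars[1:]                    # one leading BOM is ignored
    chars = chars.replace("\x00", "\uFFFD")  # NULL -> U+FFFD (a parse error)
    chars = chars.replace("\r\n", "\n")      # CR followed by LF: CR removed
    chars = chars.replace("\r", "\n")        # lone CR: converted to LF
    return chars
```

The two newline replacements must run in this order, so that the CR
of a CRLF pair is removed rather than converted to a second LF.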

The next input character is the first character in the
input stream that has not yet been consumed. Initially,
the next input character is the first character in the
input. The current input character is the last character
to have been consumed.

The insertion point is the position (just before a
character or just before the end of the input stream) where content
inserted using document.write() is actually
inserted. The insertion point is relative to the position of the
character immediately after it, it is not an absolute offset into
the input stream. Initially, the insertion point is
uninitialized.

The "EOF" character in the tables below is a conceptual character
representing the end of the input stream. If the parser
is a script-created parser, then the end of the
input stream is reached when an explicit "EOF"
character (inserted by the document.close() method) is
consumed. Otherwise, the "EOF" character is not a real character in
the stream, but rather the lack of any further characters.

8.2.2.4 Changing the encoding while parsing

When the parser requires the user agent to change the
encoding, it must run the following steps. This might happen
if the encoding sniffing algorithm described above
failed to find an encoding, or if it found an encoding that was not
the actual encoding of the file.

If the new encoding is a UTF-16 encoding, change it to
UTF-8.

If the new encoding is identical or equivalent to the encoding
that is already being used to interpret the input stream, then set
the confidence to
certain and abort these steps. This happens when the
encoding information found in the file matches what the
encoding sniffing algorithm determined to be the
encoding, and in the second pass through the parser if the first
pass found that the encoding sniffing algorithm described in the
earlier section failed to find the right encoding.

If all the bytes up to the last byte converted by the current
decoder have the same Unicode interpretations in both the current
encoding and the new encoding, and if the user agent supports
changing the converter on the fly, then the user agent may change
to the new converter for the encoding on the fly. Set the
document's character encoding and the encoding used to
convert the input stream to the new encoding, set the confidence to
certain, and abort these steps.

Otherwise, navigate to the document again, with
replacement enabled, and using the same source
browsing context, but this time skip the encoding
sniffing algorithm and instead just set the encoding to the
new encoding and the confidence to
certain. Whenever possible, this should be done without
actually contacting the network layer (the bytes should be
re-parsed from memory), even if, e.g., the document is marked as
not being cacheable. If this is not possible and contacting the
network layer would involve repeating a request that uses a method
other than HTTP GET (or
equivalent for non-HTTP URLs), then instead set the confidence to
certain and ignore the new encoding. The resource will be
misinterpreted. User agents may notify the user of the situation,
to aid in application development.
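The control flow of these steps can be sketched as follows. This is
a non-normative illustration: the return values and the
parser_state dictionary are illustrative stand-ins, and the
byte-compatibility check and re-navigation of the later steps are
only indicated, not implemented.

```python
def change_encoding(current: str, new: str, parser_state: dict) -> str:
    # Step 1: any UTF-16 variant is changed to UTF-8.
    if new.upper().startswith("UTF-16"):
        new = "UTF-8"
    # Step 2: if the new encoding is the one already in use, just
    # raise the confidence and stop.
    if new.lower() == current.lower():
        parser_state["confidence"] = "certain"
        return "keep"
    # Steps 3 and 4 (switching the converter on the fly when the
    # decoded prefixes agree, or re-navigating with the new encoding)
    # are omitted here; a real implementation must perform them.
    parser_state["confidence"] = "certain"
    return "reparse"
```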

8.2.3 Parse state

8.2.3.1 The insertion mode

The insertion mode is a flag that controls the primary
operation of the tree construction stage.

When the insertion mode is switched to "in CDATA/RCDATA", the original
insertion mode is also set. This is the insertion mode to
which the tree construction stage will return when the corresponding
end tag is parsed.

When the insertion mode is switched to "in foreign content", the
secondary insertion mode is also set. This secondary mode
is used within the rules for the "in foreign content" mode to handle HTML
(i.e. not foreign) content.

When the steps below require the UA to reset the insertion
mode appropriately, it means the UA must follow these
steps:

8.2.3.2 The stack of open elements

Initially the stack of open elements is empty. The
stack grows downwards; the topmost node on the stack is the first
one added to the stack, and the bottommost node of the stack is the
most recently added node in the stack (notwithstanding when the
stack is manipulated in a random access fashion as part of the handling for misnested tags).


The stack of open elements is said to have an element in table
scope when the following algorithm terminates in a match
state:

Initialize node to be the current
node (the bottommost node of the stack).

If node is the target node, terminate in
a match state.

Otherwise, if node is an html or
table element, terminate in a failure state.

Otherwise, set node to the previous
entry in the stack of open elements and return to step
2. (This will never fail, since the loop will always terminate in
the previous step if the top of the stack — an
html element — is reached.)

Nothing happens if at any time any of the elements in the
stack of open elements are moved to a new location in,
or removed from, the Document tree. In particular, the
stack is not changed in this situation. This can cause, amongst
other strange effects, content to be appended to nodes that are no
longer in the DOM.
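The table scope check above can be sketched as follows, modeling the
stack as a list whose last item is the bottommost (most recently
added) node. This is a non-normative illustration; real
implementations compare element nodes and namespaces, not tag-name
strings.

```python
def has_element_in_table_scope(open_elements: list, target: str) -> bool:
    # Walk from the bottommost node (most recently added) upward.
    for node in reversed(open_elements):
        if node == target:
            return True          # match state
        if node in ("html", "table"):
            return False         # failure state
    # Unreachable in practice: the stack always ends (at its top)
    # with an html element, which is in the failure list.
    raise AssertionError("stack of open elements lacked an html element")
```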

8.2.3.3 The list of active formatting elements

Initially the list of active formatting elements is
empty. It is used to handle mis-nested formatting element tags.

The list contains elements in the formatting
category, and scope markers. The scope markers are inserted when
entering applet elements, buttons, object
elements, marquees, table cells, and table captions, and are used to
prevent formatting from "leaking" into applet elements,
buttons, object elements, marquees, and tables.

When the steps below require the UA to reconstruct the
active formatting elements, the UA must perform the following
steps:

This has the effect of reopening all the formatting elements that
were opened in the current body, cell, or caption (whichever is
youngest) that haven't been explicitly closed.

The way this specification is written, the
list of active formatting elements always consists of
elements in chronological order with the least recently added
element first and the most recently added element last (except for
while steps 8 to 11 of the above algorithm are being executed, of
course).

When the steps below require the UA to clear the list of
active formatting elements up to the last marker, the UA must
perform the following steps:

If entry was a marker, then stop the
algorithm at this point. The list has been cleared up to the last
marker.

Go to step 1.
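The clearing loop above amounts to popping entries until, and
including, the most recently added marker. A non-normative sketch,
with an illustrative sentinel object standing in for a scope marker:

```python
MARKER = object()  # illustrative stand-in for a scope marker

def clear_up_to_last_marker(active_formatting: list) -> None:
    while active_formatting:
        entry = active_formatting.pop()  # remove the most recent entry
        if entry is MARKER:
            break  # the list has been cleared up to the last marker
```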

8.2.3.4 The element pointers

Initially the head element
pointer and the form element
pointer are both null.

Once a head element has been parsed (whether
implicitly or explicitly) the head
element pointer gets set to point to this node.

The form element pointer
points to the last form element that was opened and
whose end tag has not yet been seen. It is used to make form
controls associate with forms in the face of dramatically bad
markup, for historical reasons.

8.2.3.5 Other parsing state flags

The scripting flag is set to "enabled" if scripting was enabled for the
Document with which the parser is associated when the
parser was created, and "disabled" otherwise.

The frameset-ok flag is set to "ok" when the parser is
created. It is set to "not ok" after certain tokens are seen.