HTML5 Authoring Conformance Study

This page reviews some notable sites and reports their conformance both when interpreted as HTML5 and when interpreted as their declared doctype. In addition, HTML5 conformance errors are broken down in detail.

Methodology

For each of these sites, validator.nu was used to determine what DOCTYPE is reported for the main page. Based on this, the following tests were applied:

http://validator.nu/ was used to check for HTML5 conformance and count errors (but not warnings or info messages). HTML5 validation and parsing mode were forced for pages that are not declared as HTML5.

For pages that declared themselves to be something other than HTML5, http://validator.w3.org/ was used to validate them as their declared type.

For pages with an XHTML doctype, http://validator.nu/ was used to check for XHTML5 conformance. XHTML5 validation mode and XML parsing mode were forced, and the "lax about content types" checkbox was checked.
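The selection of tests above can be sketched as a small function. This is only an illustration of the decision logic described in this section: the labels "html5", "html4" and "xhtml" and the name plan_checks are assumptions made for the sketch, and nothing here contacts either validator.

```python
def plan_checks(doctype_kind):
    """Return the (validator, mode) checks applied to one page.

    doctype_kind is a hypothetical label for the reported doctype:
    "html5", "html4" or "xhtml".
    """
    declared_html5 = doctype_kind == "html5"

    # Every page is checked as HTML5; validation and parsing mode are
    # forced when the page does not declare itself as HTML5.
    checks = [("validator.nu", "HTML5" if declared_html5 else "HTML5 (forced)")]

    # Pages declaring another type are also validated as that type.
    if not declared_html5:
        checks.append(("validator.w3.org", "declared doctype"))

    # XHTML doctypes get an additional XHTML5 check with XML parsing.
    if doctype_kind == "xhtml":
        checks.append(("validator.nu", "XHTML5 (forced, XML parser, lax content type)"))

    return checks
```

For example, plan_checks("xhtml") yields three checks, while plan_checks("html5") yields only the HTML5 check.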

Self-closing syntax on non-void elements (used in a way that may invoke the adoption agency algorithm)

Conformance Errors Under Discussion

Unclosed tag, in cases where the element is not void and the close tag is not implied; these seem likely to indicate an authoring error.

Why are close tags implied for some elements and not for others? In the cases used, the pages render as intended, arguing for at most a warning.

Some elements have implied end tags because that's what HTML4 historically allowed. In most cases these are implicitly closed by a particular open tag. For instance, <p> is closed by a subsequent <p> or by many other block-level elements. However, other elements will end up containing the whole document if unclosed, or will invoke the adoption agency algorithm. For the one specific example in the study, we have inside information that it was unintentional. --Maciej Stachowiak 17:40, 3 April 2010 (UTC)
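To make the difference concrete, here is a deliberately simplified sketch of the implied end tag behavior for <p>. It is not the HTML5 tree construction algorithm: it only models <p>, only looks at the innermost open element, and the CLOSES_P set is a partial list assumed for illustration. Python's html.parser is used purely as a tokenizer.

```python
from html.parser import HTMLParser

# Assumption: a partial list of start tags that implicitly close an open <p>.
CLOSES_P = {"p", "div", "ul", "ol", "table", "h1", "h2", "h3"}


class ImpliedEndSketch(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # elements currently open, outermost first
        self.events = []  # readable trace of what happened

    def handle_starttag(self, tag, attrs):
        # An open <p> on top of the stack is closed by the new start tag.
        if tag in CLOSES_P and self.stack and self.stack[-1] == "p":
            self.stack.pop()
            self.events.append("implied </p>")
        self.stack.append(tag)
        self.events.append("open <%s>" % tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching element.
            while self.stack.pop() != tag:
                pass
            self.events.append("close </%s>" % tag)


def trace(markup):
    parser = ImpliedEndSketch()
    parser.feed(markup)
    parser.close()
    return parser.events, parser.stack
```

With trace("<p>one<p>two") the first <p> is closed by the second. With trace("<div>one<div>two") nothing closes the first <div>, so it stays open and would contain everything that follows.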

Almost standards / limited quirks mode doctype - triggers nonstandard behavior which may not be fully interoperable in legacy UAs

The behavior is defined by the HTML5 spec, and is interoperable. At most a warning is in order.

Which doctype triggers which mode is specified by the HTML5 spec. The spec also defines a few of the behaviors of quirks mode. The full behavior of the modes other than standards mode is largely undocumented and not interoperable.--Maciej Stachowiak 17:40, 3 April 2010 (UTC)

In the case of Facebook it is autocomplete off; Facebook makes extensive use of CSS and JavaScript, both of which can change the appearance. A good case can be made for a warning, but non-conforming?

I'm not sure what the CSS and JavaScript have to do with it - <input type=hidden> is never autofilled, even if you somehow make it render, which the Facebook page in question does not do. In general, particular <input> elements only allow the attributes that are applicable to their particular input subtype. Input types should be treated as effectively distinct elements. So <input type=hidden autocomplete=off> makes about as much sense as <div autocomplete=off>. It has no effect, and seems to indicate misunderstanding of the model. --Maciej Stachowiak 17:40, 3 April 2010 (UTC)
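A rough sketch of the kind of check being discussed: treat each input type as its own element and flag attributes that do not apply to it. Only the single case from this thread (autocomplete on type=hidden) is modelled; the class and function names are hypothetical and this is not how validator.nu is implemented.

```python
from html.parser import HTMLParser


class HiddenAutocompleteCheck(HTMLParser):
    def __init__(self):
        super().__init__()
        self.errors = []  # (line, column) of each offending start tag

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        # A missing type attribute means the default text type.
        input_type = (attrs.get("type") or "text").lower()
        if input_type == "hidden" and "autocomplete" in attrs:
            self.errors.append(self.getpos())


def hidden_autocomplete_errors(markup):
    checker = HiddenAutocompleteCheck()
    checker.feed(markup)
    checker.close()
    return checker.errors
```

The check reports <input type=hidden autocomplete=off> and stays silent for <input type=text autocomplete=off>.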

And yet, the pages that use this are interoperable. In fact, these tags are a key part of the strategy used by these sites to obtain interoperability despite the existence of less than fully standards compliant browsers that are still in wide use.

In all the cases I am aware of, this mechanism is used to request less standards-compliant and less interoperable behavior (IE7 mode) on a browser that is capable of delivering more standards-compliant and more interoperable behavior (IE8). You are correct that it is typically part of a comprehensive strategy. The goal seems to be to avoid having to send IE8 down the standards code path (which may require extra work, since IE8 still has many divergences from standards/consensus behavior), or creating a separate IE8 code path.--Maciej Stachowiak 17:40, 3 April 2010 (UTC)

Unknown attributes and unknown elements - these have to be flagged to detect typos and to protect future extensibility of the language.

No question on typos, but people will invent attributes. We need to separate out what is reserved to HTML (example: names consisting entirely of alphabetic characters) and what will never be used by HTML. This has already been relaxed partially, and there are proposals for more.

Creating a distinguished syntax for extensions would indeed not be a problem. Most of the cases in the study do not fall into a readily identifiable pattern.--Maciej Stachowiak 17:40, 3 April 2010 (UTC)

Why would an author deliberately add markup that has no effect? This is likely a bug at least some of the time it occurs, and does not have a valid use case, since it has no effect, so on the whole it seems better to flag it.--Maciej Stachowiak 17:40, 3 April 2010 (UTC)

elements after </body> - seems likely to be an error in context, will result in DOM that author does not expect, not interoperable with UAs that have limited error handling

These pages appear to be rendered as the author intended. Calling all such cases an error is likely overreaching.

My baseline assumption is that any time an element gets reparented relative to what appears in the markup, it is an error. That's so even if specific cases render as the author intended. In fact, the reason all of the reparenting rules exist in the first place is to render documents as the author intended. However, such cases are typically not interoperable in legacy UAs, and not interoperable with limited error handling or streaming UAs. </body><script> may be a little less surprising in its effects than <i>foo<b>bar</i>baz</b>, but the basic principle is the same.--Maciej Stachowiak 17:49, 3 April 2010 (UTC)
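A minimal sketch of how the </body><script> case could be detected, assuming that recording any start tag seen after the body end tag is enough. It does not model the reparenting itself, and the names are made up for the example.

```python
from html.parser import HTMLParser


class AfterBodyCheck(HTMLParser):
    def __init__(self):
        super().__init__()
        self.body_closed = False
        self.late_tags = []  # start tags seen after </body>

    def handle_starttag(self, tag, attrs):
        if self.body_closed:
            self.late_tags.append(tag)

    def handle_endtag(self, tag):
        if tag == "body":
            self.body_closed = True


def tags_after_body(markup):
    checker = AfterBodyCheck()
    checker.feed(markup)
    checker.close()
    return checker.late_tags
```

For the markup <body><p>hi</p></body><script>var x;</script> this returns a list containing only "script".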

Unscoped <style> inside <body> - will result in unexpectedly bad performance, since it causes a style recalc of the full page when it may have already been incrementally rendered.

Seems to render acceptably in at least some browsers; at most this should be a warning.

Whitespace in name attribute - likely results in unexpected behavior, since name is treated as a single token with no whitespace.

"likely"? HTML5 needs to be interoperable even in the face of bad markup. And if this is something used in high profile sites, and works today, it needs to be supported.

Agree on "supported", but "supported" is not the same as "allowed".--Maciej Stachowiak 17:40, 3 April 2010 (UTC)
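Whether this ends up as an error or a warning, the check itself is simple. The sketch below flags any name attribute whose value contains whitespace; applying it to every element is an assumption of the example, not something the study specifies.

```python
from html.parser import HTMLParser


class NameWhitespaceCheck(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hits = []  # (tag, offending name value)

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            # value is None for attributes written without a value.
            if key == "name" and value and any(ch.isspace() for ch in value):
                self.hits.append((tag, value))


def names_with_whitespace(markup):
    checker = NameWhitespaceCheck()
    checker.feed(markup)
    checker.close()
    return checker.hits
```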

Worth further discussion. If it behaves interoperably and is useful, people will do it. I suggest a warning.

object element used to load plugin via classid/codebase instead of data/type - triggers non-interoperable behavior by requesting a specific piece of code by name, rather than code to handle particular content

Useful as a fallback

Looking closer, in the specific example in the study, there is a nested <embed> that does have a type attribute. Since nested object/embed is treated specially by most browsers, this particular pattern at least should be allowed.--Maciej Stachowiak 17:49, 3 April 2010 (UTC)

Bad <script type> value

the value is "text/javascript"

No, the bad value was "Javascript", not "text/javascript" (on microsoft.com). "Javascript" is not a valid MIME type.--Maciej Stachowiak 17:40, 3 April 2010 (UTC)
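The distinction can be illustrated with a rough shape check: a MIME type has the form type/subtype, which "Javascript" lacks. The pattern below is a simplified assumption that ignores parameters such as charset, and it does not decide whether the type actually names a script language.

```python
import re

# Simplified: one token, a slash, another token; parameters are not handled.
MIME_SHAPE = re.compile(r"^[A-Za-z0-9!#$&^_.+-]+/[A-Za-z0-9!#$&^_.+-]+$")


def looks_like_mime_type(value):
    """Return True if value has the basic type/subtype shape."""
    return bool(MIME_SHAPE.match(value.strip()))
```

Here looks_like_mime_type("text/javascript") is true and looks_like_mime_type("Javascript") is false.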

language attribute on script element with value other than "JavaScript"

appears on lines where the value is "javascript". Claims this attribute is obsolete.

No, the specific case of this that is an error had a value of "Javascript1.1" (on amazon.com). The spec makes this attribute obsolete, but allows the specific value of "Javascript" as conforming, but with a warning. I could imagine changing the validator error message or else changing the spec to make the allowed case not be a warning.--Maciej Stachowiak 17:49, 3 April 2010 (UTC)

Bogus image map

I can't find this one. An image map appears on amazon.com, ebay.com, and yahoo.co.jp, each of which uses some presentational attributes; in one case the attribute is misspelled.

On cnn.com there is an img element with a usemap attribute pointing to a nonexisting <map>--Maciej Stachowiak 17:49, 3 April 2010 (UTC)
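A sketch of a check for this situation, assuming it is sufficient to compare each img usemap reference against the name attributes of the <map> elements in the same document. The names used are hypothetical.

```python
from html.parser import HTMLParser


class UsemapCheck(HTMLParser):
    def __init__(self):
        super().__init__()
        self.map_names = set()  # names declared by <map name=...>
        self.references = []    # names referenced by <img usemap=...>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "map" and attrs.get("name"):
            self.map_names.add(attrs["name"])
        usemap = attrs.get("usemap")
        if tag == "img" and usemap:
            # usemap is written as "#name"; drop the leading hash.
            self.references.append(usemap[1:] if usemap.startswith("#") else usemap)


def dangling_usemaps(markup):
    checker = UsemapCheck()
    checker.feed(markup)
    checker.close()
    # Compare only after the whole document is read, since a <map>
    # may appear after the <img> that refers to it.
    return [ref for ref in checker.references if ref not in checker.map_names]
```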

is and will be widely ignored: usefulness of consecutive dashes in comments outweighs theoretical usage; validators won't protect those that wish to serialize with XML as this content will continue to exist; justifies a warning.

All the examples in the study used three dashes instead of two to close the comment, which in context seems like a typo rather than something useful.--Maciej Stachowiak 17:40, 3 April 2010 (UTC)
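For illustration, a three-dash close can be spotted with a regular expression over the raw markup. This is a sketch only: it assumes comments are found by scanning for the first "-->" after "<!--", and it flags a comment body that contains "--" or ends in "-".

```python
import re

# Non-greedy body so that each comment ends at the first "-->".
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def suspicious_comments(markup):
    """Return comments whose body contains '--' or ends with '-'."""
    found = []
    for match in COMMENT.finditer(markup):
        body = match.group(1)
        if "--" in body or body.endswith("-"):
            found.append(match.group(0))
    return found
```

Given "<!-- ok --><!-- typo --->", only the second comment is reported.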

Unknown element names - likely author error

Need to look at on a case by case basis. The ones I can readily find are cases like "wbr" which are unknown to the standard, but apparently useful.

This wasn't referring to presentational elements with a specific effect - I classed all those as "presentational" even if the validator claimed they were unknown. The specific example of an unknown element is <n> on yahoo.co.jp.--Maciej Stachowiak 17:40, 3 April 2010 (UTC)

<script defer="true"> (value should just be "defer")

Does it work interoperably? If so, a warning may be in order.

The spec makes it an error to use values like "true" or "yes" or "on" for boolean attributes that actually take effect through mere presence, to avoid confusion about what "false", "no" or "off" would do. This is explained in the Conformance Requirements for Authors section. --Maciej Stachowiak 18:42, 3 April 2010 (UTC)
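The rule can be sketched as follows: a boolean attribute is acceptable when it has no value, an empty value, or a value equal to its own name (compared case-insensitively). The BOOLEAN_ATTRS set is a partial list chosen for the example, and the names are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: a partial list of boolean attributes, for illustration only.
BOOLEAN_ATTRS = {"defer", "async", "checked", "disabled", "selected"}


class BooleanAttrCheck(HTMLParser):
    def __init__(self):
        super().__init__()
        self.bad = []  # (tag, attribute, value)

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key not in BOOLEAN_ATTRS:
                continue
            # None means the attribute was written without a value.
            if value is None or value == "" or value.lower() == key:
                continue
            self.bad.append((tag, key, value))


def bad_boolean_values(markup):
    checker = BooleanAttrCheck()
    checker.feed(markup)
    checker.close()
    return checker.bad
```

So <script defer="true"> is reported, while <script defer>, <script defer=""> and <script defer="defer"> are not.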

<meta> Content-Type claiming an XML document is text/html

Content is served as text/html

This error is no longer flagged on the site in question (w3.org). I believe that at the time the study was done originally, the page was served as XML to the validator.--Maciej Stachowiak 17:51, 3 April 2010 (UTC)

<acronym> - will save people time not to wonder whether to use this or abbr

Sounds like at most a suggestion.

Conformance Errors with Unclear Value

Other presentational table attributes (besides the three that have a bug already). It seems that these typically do not harm accessibility, and depending on circumstances may not be less compact than the alternative.

Presentational attributes on the body element. These seem unlikely to create accessibility problems. They can also only possibly occur once per page, and so omitting them does not save compactness.

Note: the autocomplete attribute is allowed on <input> elements by HTML5, but not on <input type=hidden> or <select>. It seems like a bug in the XHTML5 mode of the validator that it doesn't flag these errors. Also it's not clear if those restrictions are desirable.