Web Authoring Statistics: Metadata

The <meta> element

The first thing to notice is the huge number of markup errors
involving the meta element. Markup such
as:

<meta name=description value=the best site for hot air balloons>

...which results in a meta element with eight
attributes, and which doesn't help anyone (least of all the search
engines it's aimed at, since the second attribute should have been
content, not value, and therefore the
entire element is likely to be ignored). These markup errors
explain the 'value', '""', 'the', and 'and' entries that appear
as "attributes" on the graph above, as well as
most of the thousands of other attributes that aren't shown
here.

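Before those we have the mysterious charset
attribute. This comes from another, equally common, markup
error: a Content-Type declaration whose content value
(text/html; followed by a charset parameter) is written
without quotes.

Because that value contains a space, if there are no quotes around
the content attribute's value, this looks like an element with three
attributes, the third of which is called charset.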
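
The two markup errors described above can be reproduced with a short sketch. It uses the HTML parser from Python's standard library, which is not the parser used to gather these statistics, so it only illustrates how a tolerant parser splits unquoted values. The helper name meta_attribute_names and the charset value utf-8 in the second sample are our own assumptions.

```python
from html.parser import HTMLParser


class _MetaCollector(HTMLParser):
    """Collects the attribute lists of every meta start tag it sees."""

    def __init__(self):
        super().__init__()
        self.meta_attrs = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.meta_attrs.append(attrs)


def meta_attribute_names(markup):
    """Return the attribute names of the first meta element in markup.

    Hypothetical helper for illustration only.
    """
    collector = _MetaCollector()
    collector.feed(markup)
    collector.close()
    return [name for name, _value in collector.meta_attrs[0]]


# Unquoted multi-word value: every word after "the" becomes an attribute.
balloons = "<meta name=description value=the best site for hot air balloons>"
print(meta_attribute_names(balloons))

# Unquoted Content-Type value (charset value assumed): the space after the
# semicolon ends the content value, so charset becomes a third attribute.
content_type = "<meta http-equiv=Content-Type content=text/html; charset=utf-8>"
print(meta_attribute_names(content_type))
```

Quoting the attribute values in either sample makes the stray attributes disappear.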

Continuing up the chart we see scheme and
lang. Further research will be needed to find out how
scheme is used.

Finally we have the three most commonly used attributes of the
meta element, present on most pages:
http-equiv and name (the two possible key
attributes for the metadata) and content (the value
attribute).

Content-Type is naturally the most-used value,
since it's the standard way of giving the character encoding of an
HTML page.

Next we have two name values:
keywords, which, ironically, is mostly useless these days,
and description, which is still somewhat
useful.

Four more name values follow, with progressively less
usage: robots, to control whether spiders should
index the page or follow any of its links; generator,
used to indicate what tool was used to generate the page;
author, used to give the name of the author; and
revisit-after, supposedly used to tell search engines
how often to recrawl the page. To our knowledge only one search
engine has ever supported it, and that search engine was never
widely used — at this point, it is nothing more than a good
luck charm. A remarkably widely used one. More pages use
the completely worthless <meta
name="revisit-after"> than use the
<em> element!

Next is the Content-Language value (used on the
http-equiv attribute). Almost as many people use this
as specify the lang attribute on the html
element. The HTML5 spec currently allows the http-equiv
attribute only for the one case of setting the character
encoding, which can't really be dropped, as the graph above
demonstrates. However, http-equiv="Content-Language"
is supported by at least one browser and, as we see here, is
widely used, so maybe http-equiv should not be
removed after all.

Next we have the last sane name value worth talking
about, namely copyright. This, and the fact that
copyright is a really popular
class name, suggests that either <meta
name="copyright"> should be an official way of giving the
copyright, or that the Web needs a <copyright>
element, or both.

progid seems to be a sign of pages made by
Microsoft editors (yes, that's a lot of
Microsoft-generated pages... or a lot of copy-and-paste from pages
made by Microsoft tools).

The http-equiv values pragma and
expires are attempts at bypassing caches without
having to set the HTTP headers correctly. These uses are probably
unnecessary: in any scenario where there is a legitimate reason
to limit caching, the author is going to have enough control over
the server to send the appropriate headers. In addition, the
meta tags can't be considered reliable (e.g. proxies
and transparent caches aren't going to honour them).
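
As a rough illustration of sending real headers instead of relying on meta tags, here is a minimal sketch using Python's standard http.server module. The text above does not say which headers are "appropriate"; Cache-Control: no-store is our assumption, and the names NO_CACHE_HEADERS, build_response_headers, and NoCacheHandler are hypothetical.

```python
from http.server import BaseHTTPRequestHandler

# Assumed policy: forbid any cache from storing the response.
# The article does not name specific headers; this is one common choice.
NO_CACHE_HEADERS = [("Cache-Control", "no-store")]


def build_response_headers(body):
    """Return the full header list for an uncacheable HTML response."""
    return [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ] + NO_CACHE_HEADERS


class NoCacheHandler(BaseHTTPRequestHandler):
    """Serves one fixed page with real HTTP caching headers."""

    def do_GET(self):
        body = b"<!DOCTYPE html><title>Fresh</title><p>Not cached.</p>"
        self.send_response(200)
        for name, value in build_response_headers(body):
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)
```

Passing NoCacheHandler to http.server.HTTPServer serves the page. Because the directive travels in the HTTP response itself, proxies and transparent caches can see it, which is not true of a meta element inside the body.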

The distribution value is supposedly used to
control who can access the document. Search engine "optimisers"
tell people to set it to "global" to ensure that
search engines index their pages.

The http-equiv="Content-Style-Type" value corresponds to an
HTTP header, defined by HTML4, that supposedly controls the
language that the page uses in the style attribute. In theory, it has no
default value, and so any page that uses the style
attribute must specify it. Since there is only one language that
can be used in the style attribute, it can only ever
usefully be set to one value, text/css. And since all
browsers assume that it is set to text/css by default,
there is really no reason to give it. We were very surprised to see
this many people set it. The Content-Script-Type
value is the same but for event handler attributes like
onclick.

The presence of the imagetoolbar value among the ten
most frequently used http-equiv values
is probably a sign to Microsoft that people don't like
their popup image toolbar. One of the common name
values is mssmarttagspreventparsing, too. Sorry
guys!

The Dublin Core people can take some comfort from the fact that
although their keywords didn't appear in the top ten chart above,
they were quite well featured in the next few dozen. Here are the
ten most used dc.foo values, most popular
first: dc.title, dc.language,
dc.creator, dc.subject,
dc.publisher, dc.description,
dc.identifier, dc.date,
dc.format, dc.rights. In fact the order
maps relatively closely to the frequency of similar metadata in
other constructs, like class names or rel values. Nice
to know people are consistent!

Finally it is worth noting the confusion of having two "key"
attributes on the meta element. The most common
http-equiv value, ignoring the magic
content-type incantation, is
content-language. The 19th most common
name value is content-language. In fact
this name value is specified once for every five
occurrences of the http-equiv value! One in five pages
specifying its language using meta is confused as to
which key attribute is appropriate! Similarly, the most common
name value is the same as the 10th most common
http-equiv value, keywords. The
http-equiv form is specified once for every fifty or
so occurrences of the name form.
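
The kind of tally behind these ratios can be sketched as follows. This is not the tooling used for the survey, just a minimal illustration with Python's standard library; the class MetaKeyCounter, the function count_meta_keys, and the four sample pages are all hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class MetaKeyCounter(HTMLParser):
    """Tallies (key attribute, lowercased value) pairs on meta elements."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        for key, value in attrs:
            # Only the two "key" attributes matter; skip valueless ones.
            if key in ("name", "http-equiv") and value:
                self.counts[(key, value.strip().lower())] += 1


def count_meta_keys(pages):
    """Sum the per-page tallies over an iterable of HTML strings."""
    total = Counter()
    for page in pages:
        parser = MetaKeyCounter()
        parser.feed(page)
        parser.close()
        total.update(parser.counts)
    return total


# Made-up sample pages, one meta element each.
sample_pages = [
    '<meta http-equiv="Content-Language" content="en">',
    '<meta name="Content-Language" content="en">',
    '<meta name="keywords" content="hot air balloons">',
    '<meta http-equiv="Keywords" content="hot air balloons">',
]
counts = count_meta_keys(sample_pages)
for (key, value), n in sorted(counts.items()):
    print(key, value, n)
```

Comparing counts[("name", v)] with counts[("http-equiv", v)] for the same value v gives the kind of ratio quoted above.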