Research and Teaching Updates from the Web Science and Digital Libraries Research Group at Old Dominion University.

Monday, August 7, 2017

2017-08-07: rel="canonical" does not mean what you think it means

The rel="identifier" draft has been submitted to the IETF. Some of the feedback we've received via Twitter and email consists of variations of 'why don't you use rel="canonical" to link to the DOI?' We discussed this in our original blog post about rel="identifier", but in fairness that post covered a great many things, and through updates and comments it has become quite lengthy.

The short answer is that rel="canonical" handles cases where there are two or more URIs for a single resource (AKA "URI aliases"), whereas rel="identifier" specifies relationships between multiple resources.

The first URI is what I got when I searched amazon.com for "dj shadow" and clicked on a search result. The second URI is the "canonical" version that should be indexed by Google et al. The first URI uses an HTML <link> element to inform search engines about the second URI so they know they haven't found two different resources with two different URIs:

We can see that the raw HTML is not exactly the same (if it were, deduplication would be trivial for the search engines), but the rendered HTML is essentially the same, with the exception of the navigation trail ("‹ Back to search results for "dj shadow"") vs. the categorization ("CDs & Vinyl › Dance & Electronic › Electronica") on the left-hand side, right above the EP artwork:

It is clear there is no need for a search engine to index both pages. The raw HTML is nearly (but not exactly!) the same, and unless your crawler is aware of amazon.com URI patterns, it would not easily discover that the two URIs refer to the same resource. We can construct a similar example with ebay.com: again the raw HTML differs slightly, but in this case I cannot tell a difference in the rendered HTML:

So why can't we use rel="canonical" for, say, DOIs and publisher pages? In the case of DOIs, a technical reason is that the resource identified by the DOI and the resource identified by the publisher's page are not the same resource. Admittedly this is a detour into the esoteric realm of HTTP 303 semantics, but the HTTP URI of a DOI does not have a representation and the publisher's URI does; the resources identified by these URIs are related but fundamentally different.

Another reason arises when you wish to specify part-whole relationships between the resources that together comprise the resource identified by a DOI: for example, XML vs. HTML versions, Zip file(s) of associated code and data, or embedded (and "recontextualizable"!) images, sound, or video. This would be for the purpose of expressing identity, and would not preclude combination with navigation (e.g., rel="up") or SEO links (e.g., rel="canonical"). These identification patterns are presented in more detail at the Signposting web site.

Another argument against using rel="canonical" for linking to DOIs (and friends) is that publishers are already using canonical to manage SEO within their own domain. In the example below, springer.com signals to search engines that the URI in the third redirect from the DOI is canonical and not the previous two:

Furthermore, publishers are specifying DOIs with a variety of incompatible ad hoc approaches (see the prior blog post for examples), meaning there is demand for this function even though there is currently not a standardized method of achieving it.

But there are other applications for rel="identifier" outside of scholarly content. Consider the Wikipedia page for DJ Shadow. As I type this, it has not yet been edited to include the upcoming EP mentioned above, but there's a good chance that by the time you read this that will have changed.

I can reference the particular version of the page using the "permalink", which yields the URI https://en.wikipedia.org/w/index.php?title=DJ_Shadow&oldid=787867397. That page will remain static, and never mention "The Mountain Has Fallen". That page does use rel="canonical" to link back to the generic, current version of the page:

$ curl --silent -i "https://en.wikipedia.org/w/index.php?title=DJ_Shadow&oldid=787867397" | grep "rel=.canonical"
<link rel="canonical" href="https://en.wikipedia.org/wiki/DJ_Shadow"/>

Which is entirely expected and desirable: we don't want Google to separately index the 1000s of prior versions of this page, just the latest version. The generic version of the page also asserts that it is canonical:

$ curl --silent -i "https://en.wikipedia.org/wiki/DJ_Shadow" | grep "rel=.canonical"
<link rel="canonical" href="https://en.wikipedia.org/wiki/DJ_Shadow"/>
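For readers who would rather not rely on grep, here is a minimal Python sketch of how a client might pull link relations out of a page's HTML. It uses only the standard library; the class name LinkRelParser and the function link_targets are illustrative names of my own, and the sample markup simply mirrors the Wikipedia <link> element discussed above rather than a live fetch.

```python
from html.parser import HTMLParser


class LinkRelParser(HTMLParser):
    """Collect href values of <link> elements, keyed by each rel token."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <link ... />.
        if tag != "link":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel is a space-separated list of tokens, compared case-insensitively.
        for rel in (attr.get("rel") or "").lower().split():
            self.links.setdefault(rel, []).append(href)


def link_targets(html):
    """Return a dict mapping each rel value to the list of hrefs using it."""
    parser = LinkRelParser()
    parser.feed(html)
    return parser.links


# Sample markup (not fetched from the network).
sample = (
    '<html><head>'
    '<link rel="canonical" href="https://en.wikipedia.org/wiki/DJ_Shadow"/>'
    '</head><body></body></html>'
)
print(link_targets(sample).get("canonical"))
```

A real crawler would also need to look at HTTP Link headers and resolve relative hrefs against the page URI; this sketch ignores both.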

But if I were using a reference manager to cite https://en.wikipedia.org/wiki/DJ_Shadow, and if that page also had:

Then the reference manager would cite the specific version of the page, providing a machine-readable version of the human-readable guidance already provided under the "Cite This Page" link. This use of rel="identifier" would not collide with the rel="canonical" which is already in place for SEO*. In this Wikipedia example, the two rels coexist and specify URI preferences for different purposes:

rel="canonical": preferred for content indexing

rel="identifier": preferred for referencing
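One way to picture this division of labor is a client that picks a link target according to what it is about to do. The sketch below is only an illustration under stated assumptions: preferred_uri and the purpose names "indexing" and "referencing" are hypothetical, and the rel="identifier" entry is the hypothetical link from this example, not something Wikipedia currently publishes.

```python
# Map each purpose to the rel type that expresses the preferred URI for it.
REL_FOR_PURPOSE = {
    "indexing": "canonical",      # what a search engine should index
    "referencing": "identifier",  # what a reference manager should cite
}


def preferred_uri(page_uri, links, purpose):
    """Return the first link target for the purpose, else the page's own URI."""
    rel = REL_FOR_PURPOSE[purpose]
    targets = links.get(rel, [])
    return targets[0] if targets else page_uri


page = "https://en.wikipedia.org/wiki/DJ_Shadow"

# Links as a client might have extracted them from the page.
# The "identifier" entry is an assumption made for illustration.
links = {
    "canonical": ["https://en.wikipedia.org/wiki/DJ_Shadow"],
    "identifier": [
        "https://en.wikipedia.org/w/index.php?title=DJ_Shadow&oldid=787867397"
    ],
}

print(preferred_uri(page, links, "indexing"))
print(preferred_uri(page, links, "referencing"))
```

Falling back to the page's own URI when no matching link is present is a design choice of this sketch, not something the draft prescribes.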

Herbert insisted on a New Mexico specific example, so we'll consider the ubiquitous multi-page articles, designed to expand content to increase advertising revenue. Of interest to us is page 5 of this particular article about TV continuity errors: http://www.coolimba.com/view/huge-tv-mistakes-no-one-noticed-c/?page=5. It uses rel="canonical" to inform search engines to strip off any common, superfluous arguments that might also be present (e.g., "&utm_source=...&utm_medium=...&utm_campaign=..."):

In this case, rel="up" also serves as a simple navigation function, if you choose to view these pages as a tree and not a list (if this is indeed a list, then "up" is probably not applicable). But note that rel="up" would not be applicable in the Wikipedia (or even DOI) example(s) above. Also note that rel="up" and rel="identifier" sharing the same URI is something of a coincidence: if a multi-page article has more than two "levels" then we would expect the URIs to diverge.

In conclusion, SEO/indexing and referencing are different functions and thus require different rel types; cases where the target URIs overlap should be considered coincidences. rel="canonical" is used to collapse multiple URIs that yield duplicative text into a single, preferred URI to facilitate indexing, and rel="identifier" is used to select a single URI from among multiple URIs that yield different text to facilitate referencing.

* Note that rel="permalink" and rel="bookmark" (the former was never registered and ultimately supplanted by the latter) do different things and are not usable in HTTP Link headers; see the prior blog post for details.

2017-08-09 edit: See also this Twitter moment about rel="bookmark". I'll try to turn this into a separate blog post in the future.