Canonical chaos: doubling down on duplicate content

Search engines are getting smarter. There is little doubt about that. However, in a CMS-driven web where content can often exist on several URLs, it is not always clear which URL is the authoritative one for a given piece of content. Having content on several URLs can also split link and ranking signals across the variations of that content.

It is hard enough standing out in an often hypercompetitive search landscape, so you would imagine that most businesses would have these foundational SEO issues under control. Unfortunately, our experience tells us otherwise. In fact, in the wake of many sites moving to HTTPS for the promised ranking boost, we are seeing even more URL-based duplicate content issues than before.

Fortunately, we have the canonical tag. With rel=canonical, we can easily specify the authoritative URL for any piece of content, and Google and the other engines will then consolidate link and ranking signals from all variations of that content onto a single URL. That is, of course, only if rel=canonical is correctly implemented.

In this article, I take a look at how incorrect implementation of canonical URLs can exacerbate URL-based duplicate content. I also share an example of a UK-based e-commerce store that recently saw its home page de-indexed (just the home page) due to what turned out to be an issue with its canonical URLs.

Dodgy duplicates

It is not unusual for a piece of content to exist on multiple URLs. This could be on one site or many. It could be due to subdomains. It could be due to your CMS creating multiple entry points for a single piece of content. It could also be due to moving your site to HTTPS, in line with Google's recommendations, while the HTTP version remains accessible.

There are a bunch of potential situations that can lead to a piece of content being available on multiple URLs, but the most common tend to be:

Dynamic URLs — e.g., http://example.com/?post=1&var=2&var=3

Mobile sites — e.g., m.example.com and www.example.com

International sites without correct geo-targeting

www and subdomain issues — e.g., www.example.com or example.com

CMS generating multiple URLs

Content syndication on other blogs

Running your site on both HTTP and HTTPS

We also tend to see a mixture of these issues, and it is not unusual to find sites that run both HTTP and HTTPS and have content available on both the www and non-www versions of the site. This can quickly create a situation where the same piece of content (or the home page) is available on several different URLs.

As an example, the very common case of a site running with and without www, and over both HTTP and HTTPS, creates four potential URLs for every piece of content on the site:

http://example.com/page

http://www.example.com/page

https://example.com/page

https://www.example.com/page

Canonical chaos

Now, in an ideal world, your canonical URL would sort this out, and each of the four URLs would have the same canonical URL specified. It could be any of the above, but if you have HTTPS, you may as well run with HTTPS, so let's say your canonical URL is https://www.example.com/page. You'd put this piece of code into the HTML head of all four versions:

<link rel="canonical" href="https://www.example.com/page" />

I have seen debate about whether the actual canonical page should canonicalize to itself — in practice we do, and I have seen this sentiment echoed by other SEOs over the years (and have never run into any issues doing so).

Unfortunately, what we are seeing quite a bit recently is that the canonical tag is present, yet each page has a canonical that matches the URL shown in the browser window.

http://example.com/page canonical = http://example.com/page

http://www.example.com/page canonical = http://www.example.com/page

https://example.com/page canonical = https://example.com/page

https://www.example.com/page canonical = https://www.example.com/page
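
To illustrate the fix, the snippet below (using the same hypothetical example.com URLs) shows what every variation should serve instead, assuming https://www.example.com/page is chosen as the primary:

```html
<!-- Served in the <head> of ALL four variations,
     including https://www.example.com/page itself: -->
<link rel="canonical" href="https://www.example.com/page" />
```

With this in place, all link and ranking signals are consolidated onto one URL instead of four competing ones.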

Clearly, this is not ideal. The canonical tag is designed to resolve these very issues, but in this instance, it further exacerbates the situation. Each URL here is saying, “Me, me, index me!!!” The search engine then has to do what it can with this mess.

Issues like this impact trust and confidence. Trust and confidence impact rankings. Poor rankings impact your business. That may all sound like something an SEO Yoda might say, but the reality is that a goofed canonical tag will only hurt your results.

We recently worked with a UK business that saw their home page mysteriously de-indexed, which hit them hard for the big keywords they target. They typically sit among amazon.co.uk and other huge brands in the top three, so there is no room for these issues. After checking all the usual suspects, we identified issues with the canonical tag implementation — this was fixed, the site was crawled, and the home page popped back in again. I was somewhat staggered, but it drives home the importance of solid technical SEO.

Fortunately, this happened and we resolved it just before the big Christmas rush — but had this issue cropped up now, the financial impact could have been far worse.

HTTP and HTTPS

The move to HTTPS is generally a good thing. Security matters. And the web is faster than it once was. However, we have seen all manner of problems here, usually due to the site being indexed on both HTTP and HTTPS URL variations.

Unfortunately, we also tend to see the canonical tags use both HTTP and HTTPS, which again further exacerbates the underlying issue that the canonical tag should resolve.

Why does this happen?

I believe there are a couple of reasons we see these issues:

The site is running on HTTP and HTTPS, and the CMS has no way to force the protocol and/or subdomain for canonical URLs.

Developers take a checklist approach to SEO, implementing the canonical tag without really understanding what it is for and populating it with the address bar URL.

Correcting your canonicals

In most cases, duplicate content issues can be resolved pretty easily. Fixing the canonical is one way, but this can be tricky with some web CMS software, so we can instead use permanent (301) redirects. This is typically the fastest and most logical approach: the duplicate variation never resolves as a page in its own right, so Google simply follows the redirect rather than having to analyze multiple versions of the page.
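
As a rough sketch, assuming an Apache server where you can edit .htaccess (the directives differ on nginx, IIS and others, and example.com is a placeholder for your domain), 301 rules along these lines would funnel all four variations to the https://www. version:

```apacheconf
# Sketch only — adapt and test before deploying.
RewriteEngine On

# Send any HTTP request to the HTTPS www host.
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]

# Send the non-www HTTPS host to the www host.
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]
```

Every request for a duplicate variation then lands on the single primary URL with a permanent redirect, which search engines treat as a strong consolidation signal.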

Correct canonicals. Where a canonical is required, you need to implement a page-level canonical pointing from each variation to the primary version. As above, determine your primary subdomain and protocol, and ensure all duplicates have a canonical pointing to the primary page.

That is pretty much it — always redirect if you can, as it deals with duplicate content issues in the quickest and most efficient way (from a workload and ranking perspective).

Then, where this is not possible or desirable, implement page-level canonical tags. This may need some developer support.

Certainly, for WordPress there is a simple fix using the wpseo_canonical filter from the WordPress SEO plugin. This allows you to force HTTP or HTTPS or the subdomain with some fairly basic PHP. Your developer can often do the same to help you with other CMS and bespoke builds. This is not terribly complicated — it just requires a clear understanding of why the canonical exists.
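
As a sketch of that approach, a snippet like the following in a theme's functions.php (assuming the Yoast WordPress SEO plugin is active, and with example.com standing in for your domain) uses the wpseo_canonical filter to force HTTPS and the www host:

```php
<?php
// Sketch only: force canonical URLs onto https://www.example.com.
// The wpseo_canonical filter is provided by the WordPress SEO (Yoast) plugin.
add_filter( 'wpseo_canonical', function ( $canonical ) {
    // Force the HTTPS protocol.
    $canonical = str_replace( 'http://', 'https://', $canonical );
    // Force the www subdomain (example.com is a placeholder).
    $canonical = str_replace( '://example.com', '://www.example.com', $canonical );
    return $canonical;
} );
```

The same idea applies elsewhere: whatever the CMS, the canonical should be built from your chosen primary protocol and host, not echoed from the address bar.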

One URL to rule them all

It’s not unusual for a piece of content to appear on multiple URLs. There is no duplicate content penalty as such. However, for a search engine to be 100 percent confident in the correct URL to return and to ensure all equity is consolidated in one primary version of a page, we need accurate redirections and canonical URLs.

Simply adding an SEO plugin or having your developer hack in a canonical URL is not enough — it must be implemented in a way that ensures that each piece of content has one authoritative URL.

One URL to rule them all. One URL to find them. One URL to bring them all and in the search results bind them.

Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land.

About The Author

Marcus Miller is an experienced SEO and PPC consultant based in Birmingham, UK. Marcus focuses on strategy, audits, local SEO, technical SEO, PPC and just generally helping businesses dominate search and social. Marcus is managing director of the UK SEO and digital marketing company Bowler Hat and also runs wArmour aka WordPress Armour which focuses on helping WordPress owners get their security, SEO and site maintenance dialled in without breaking the bank.