Canonical URL is invalid on multilingual site

1. Run a site with two languages, e.g. German and English, with German as the default.
2. Create a language-neutral node.
3. Viewed in German, the node's canonical URL is de/content/foo.
4. Switch to English: the canonical URL is now en/content/foo. BUG - this is duplicate content. The canonical URL must be [default site language]/content/foo, e.g. de/content/foo.

According to Google, we shouldn't be using canonical on multilingual sites. Instead, each link should use rel="alternate" with hreflang set to the language of the page it points to. So, if you have 5 languages, you'd have 5 links, one for each language. I've done that on my site by using the following code:
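For illustration only, and not the poster's own snippet: a minimal Python sketch of the "one alternate link per language" idea. The function name `hreflang_links`, the base URL `https://example.com`, and the language-prefix URL scheme are assumptions made for this example; a real Drupal site would generate these links in PHP or through a module.

```python
from html import escape


def hreflang_links(base_url, path, languages):
    """Build one <link rel="alternate"> tag per language.

    Assumes (hypothetically) that each language version lives under
    a language prefix, e.g. /de/content/foo and /en/content/foo.
    """
    tags = []
    for lang in languages:
        href = f"{base_url.rstrip('/')}/{lang}/{path.lstrip('/')}"
        tags.append(
            '<link rel="alternate" '
            f'hreflang="{escape(lang, quote=True)}" '
            f'href="{escape(href, quote=True)}" />'
        )
    return tags


if __name__ == "__main__":
    # Two languages give two links, one for each language.
    for tag in hreflang_links("https://example.com", "content/foo", ["de", "en"]):
        print(tag)
```

With five languages in the list, the same call would yield five links, matching the rule described above.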

Additionally, any translated strings on the page would be in English, not German, on en/content/foo.

Since the page is language-neutral, one of these language markings would be wrong, presumably the one for en/content/foo on a site whose default language is German. But there is no guarantee that a search engine would index the correct language marking.

That is, if Google (or any other search engine, though at present Google gives us the majority of visits) last visited en/content/foo, then de/content/foo would be marked as being in English. As a result, de/content/foo would not show up in searches restricted to German (wrong) and would show up in searches restricted to English (also wrong).

The solution I implemented [yesterday] was to leave the canonical URL the way it is and, on language-neutral pages, to mark the non-default-language versions with a robots noindex meta tag.

That is, in the context of the current issue:
de/content/foo has canonical de/content/foo
en/content/foo has canonical en/content/foo AND has robots noindex
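The rule above can be sketched roughly as follows. This is a simplified Python illustration, not the code running on the site; `head_tags` and its parameters are hypothetical names, and `node_lang=None` is assumed to stand for a language-neutral node.

```python
def head_tags(path, request_lang, default_lang, node_lang=None):
    """Return the head tags for a page under the scheme described above.

    The canonical URL is left pointing at the URL being viewed.
    Language-neutral nodes (node_lang is None) additionally get a
    robots noindex tag when viewed in a non-default language.
    """
    tags = [f'<link rel="canonical" href="/{request_lang}/{path}" />']
    if node_lang is None and request_lang != default_lang:
        tags.append('<meta name="robots" content="noindex" />')
    return tags


if __name__ == "__main__":
    # Default language (German): canonical only.
    print(head_tags("content/foo", "de", "de"))
    # Non-default language (English): canonical plus noindex.
    print(head_tags("content/foo", "en", "de"))
```

Nodes that do have a language of their own are left alone in this sketch, since the issue only concerns language-neutral content.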

(Eventually, I'll change that to noindex, nofollow, but I have to wait at least until Google no longer has any of the pages that were indexed under the wrong languages.)

Additionally, if the user is authenticated (staff), we do the equivalent of redirecting en/content/foo to de/content/foo. Of course, that would have to be done by role if the site allowed the public to have user accounts.
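A role-based version of that redirect might look something like the sketch below. The function name, the `staff` role name, and the idea of returning a target path (or `None` for "no redirect") are all assumptions for illustration; they are not taken from the actual site.

```python
def staff_redirect_target(path, request_lang, default_lang, roles,
                          staff_roles=frozenset({"staff"}), node_lang=None):
    """Return the path to redirect to, or None if no redirect applies.

    Only language-neutral nodes (node_lang is None) viewed in a
    non-default language by a user holding one of the (hypothetical)
    staff roles are redirected to the default-language URL.
    """
    if node_lang is not None:
        return None  # node has its own language
    if request_lang == default_lang:
        return None  # already on the default-language URL
    if not staff_roles & set(roles):
        return None  # anonymous or public account: no redirect
    return f"/{default_lang}/{path}"


if __name__ == "__main__":
    print(staff_redirect_target("content/foo", "en", "de", ["staff"]))
    print(staff_redirect_target("content/foo", "en", "de", ["authenticated"]))
```

Checking roles rather than mere authentication keeps ordinary public accounts on the URL they requested.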

This is outside the scope of the current issue, but related: #1518224 would not be appropriate for us.

If #1518224 is implemented, it needs to be a setting, not a given. That is, I see it as potentially problematic if:

de/content/foo has canonical de/content/foo
en/content/foo has canonical de/content/foo AND has robots noindex

in that Google (and other search engines) might inadvertently remove de/content/foo from the index due to the robots noindex tag.

I think we should promote usage of the Alternative hreflang module instead. FYI, I've submitted a patch to fix its language selection so that it uses LANGUAGE_TYPE_CONTENT instead of LANGUAGE_TYPE_INTERFACE.