Duplicate Content Issue. What is the Issue?

This article addresses the problem of having identical or nearly identical versions of a web page available at more than one URL, on one site or across many sites. This can occur unintentionally (because of site architecture, for example), or it can be something of which the webmaster is well aware (as with syndicated content, or content "scraped" or illegally copied from other websites).

Canonical URLs

The problem with so-called canonical URLs is not much of an issue today, because most webmasters are aware of it. Google even has a tool in its Webmaster Central application that lets webmasters specify which version is the primary one: the one with "www." or the one without. Using that tool does not require any changes to the site itself. The webmaster should make them anyway, because other search engines do not provide a mechanism like Google's.

Some sites are configured to return the website whether or not the "www" is entered, e.g. http://www.domain.com returns the site and so does http://domain.com. Those are two distinct URLs, and search engines treat them as different sites, because technically you could have completely different sites at each address, even on different servers. When the search engine notes that all pages are identical at both addresses, how to handle this is left to the search engine. However, you should not leave that choice up to the search engines. Make the choice for them! You should redirect one of the versions to the other. (This will also prevent PageRank leakage caused by people linking to your site via both types of URLs.) By making this choice proactively, you decide which version becomes the "master", and the duplication no longer poses an SEO problem.
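The redirect decision described above can be sketched in a few lines of Python. This is only an illustration, not a server configuration: it assumes the "www" form was chosen as primary and that the redirect is permanent (HTTP 301). The names PREFERRED_HOST, ALTERNATE_HOSTS and redirect_for are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the site owner has picked the "www" form as the primary one.
PREFERRED_HOST = "www.domain.com"
ALTERNATE_HOSTS = {"domain.com"}


def redirect_for(url):
    """Return (301, target_url) if the URL uses a non-preferred host, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in ALTERNATE_HOSTS:
        # Keep scheme, path, query and fragment; swap only the host.
        # Explicit port numbers are ignored in this sketch.
        target = urlunsplit(
            (parts.scheme, PREFERRED_HOST, parts.path or "/", parts.query, parts.fragment)
        )
        return 301, target
    return None


print(redirect_for("http://domain.com/page.html?id=1"))
print(redirect_for("http://www.domain.com/page.html"))
```

In practice the same rule is usually expressed in the web server's own configuration. The point of the sketch is only that every request to the non-preferred host maps to exactly one address on the preferred host.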

HTML Attribute Value "Canonical"

To solve the canonical issue of the www versus the non-www version of a website, Google provided a tool via its Webmaster Central site that lets webmasters specify which version is the primary one for the domains they have registered with the Google Webmaster Central service.

In February 2009, the three major search engines, Google, Yahoo and Microsoft, introduced a second HTML attribute value (alongside "nofollow", which was introduced by Google in 2005) intended specifically to help webmasters with their search engine optimization efforts. It is called "canonical" and, like "nofollow", it is a value for the HTML attribute "rel".

Unlike "nofollow", which is used in the "rel" attribute of the "a" (hyperlink) element inside the body of an HTML document, "canonical" is used in the "rel" attribute of the "link" element, which belongs in the head section of the HTML document.

The "canonical" attribute value also serves a different purpose than "nofollow". The latter was designed to indicate to search engines that the link carrying it should not pass any link value (or "PageRank"), which is, in one way or another, the most important ranking factor in any of the major search engines today.

The new attribute value "canonical" gives webmasters the ability to indicate to search engines which URL is the main or primary one for the current document and, by extension, that any other URL variations of the same (duplicate) document should not be used.
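As a rough illustration of what this looks like in a page, the Python sketch below builds the "link" element for a primary URL and shows that every URL variation of the document carries the same tag. The helper name canonical_link and the example URLs with their query parameters are assumptions made for this example.

```python
from html import escape


def canonical_link(primary_url):
    """Build the <link> element that belongs in the head section of the page."""
    return '<link rel="canonical" href="%s">' % escape(primary_url, quote=True)


# Hypothetical URL variations that all serve the same document.
PRIMARY = "http://www.domain.com/product.html"
variations = [
    "http://www.domain.com/product.html",
    "http://www.domain.com/product.html?sessionid=123",
    "http://www.domain.com/product.html?ref=home&sort=asc",
]

# Every variation emits the same tag, pointing at the one primary URL.
for url in variations:
    print(url, "->", canonical_link(PRIMARY))
```

The href value is HTML-escaped so that a primary URL containing "&" or quotes still yields a well-formed attribute.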

In an interview with WebProNews during the SMX West conference, where the new "tag" was introduced, Google engineer Matt Cutts elaborated on how Google uses the information provided by webmasters who use "canonical" in their web pages.

The tag can only be used to refer to the primary version of the document's URL on the same domain name. That means that it cannot be used to refer to a version of the same document on another domain. Furthermore, if the tag is implemented correctly and indexed properly by Googlebot, other URL versions of the same document on the same domain will be treated as if the webmaster had 301 redirected them to the primary URL version.
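The same-domain rule can be checked mechanically. The Python sketch below reads the canonical URL from a page and accepts it only if it points to the page's own host. It is a simplification under stated assumptions: "same domain" is treated as an exact host match, only the first canonical link is considered, and the names CanonicalFinder and same_domain_canonical are invented for this example. It does not reproduce how any search engine actually processes the tag.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class CanonicalFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # "rel" may hold several space-separated values.
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonical = attr["href"]


def same_domain_canonical(page_url, html_text):
    """Return the canonical URL if it is on the page's own host, else None."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    if finder.canonical is None:
        return None
    target = urljoin(page_url, finder.canonical)  # resolve relative hrefs
    if urlsplit(target).hostname == urlsplit(page_url).hostname:
        return target
    return None


page = '<html><head><link rel="canonical" href="/product.html"></head><body></body></html>'
print(same_domain_canonical("http://www.domain.com/product.html?sessionid=123", page))
```

A canonical link that names a different host is rejected here, which mirrors the restriction described above.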

This means that webmasters who have duplicate URLs for the same content on their website, with other websites linking to those various URLs, should see an increase in PageRank and improved ranking in Google search results after implementing the new "canonical tag", without having to implement any server-side 301 redirects for their duplicate URLs.

Scraper Sites

A bigger concern is pages with the same content but more than one URL because of the site's architecture. With the increased popularity of syndication, aggregation and "mash-ups", it has become much easier for "black hat SEO" types to create scraper sites.

"Scraper sites" are sites that are thrown together as quickly and with as much automation as possible. They are designed to rank well, get users to click on contextual ads such as Google AdSense links, and thereby generate revenue. The chances of generating revenue are high, because the ads are the only text that makes sense (compared to the gibberish produced by the scraper). Another goal of a scraper site could be to indirectly boost the ranking of a less visible site. A site that is included in the search index and starts to rank for terms can pass on "link juice" like any other website on the Internet. That means that if you link from your scraper site to another website of yours, it will boost the other site's ranking. This is a very risky business, because search engines will not only get rid of the scraper site if they detect it but also penalize your real site if it is clear what you did.

Landing on a scraper site is, in most cases, a bad experience for the user. Search engines will try to get rid of scraper sites as efficiently as possible. The preponderance of scraper sites contributes to an overall sense of mistrust (by search engines) of webmasters who may very well be promoting legitimate website content.