Duplicate Content – The Problem and its Solutions

Lots of sites have duplicate content problems. For the most part, this is not a huge issue. When search engines find duplicate content, they choose one of the pages to list in the index and ignore the others. This assumes, of course, that the duplication is not so bad that the search engine would want to ban you. That can happen if a review of your situation leads them to believe that you are deliberately trying to rank multiple times for the same search terms.

Doorway pages are a classic example of this. One kind of doorway page is a separate domain that has some content on it, but which in fairly short order sends the user over to your “master” site. Another example would be two different, fully functional sites whose content is not identical but is substantially similar, where the search engine is able to figure out that you own both.

Assuming that you haven’t implemented some illicit scheme to dominate all the results for your search terms, what are the reasons why you may have duplicate content? Here are some of the most common reasons sites have duplicate content:

Affiliate Programs: If you have an affiliate program, chances are your affiliates use a URL that looks something like: http://www.yourdomain.com?affid=123456. Search Engines will see this as a different page than http://www.yourdomain.com, and they will see it as having duplicate content.

Syndication Deals: If you syndicate your content, or distribute it through article directories, you are creating duplicate content on other domains. Keeping in mind that the search engines want to select one copy of the page to list, be aware that they might not choose your copy.

Parameters Being Ignored by your Page Rendering Code: Perhaps you don’t have an affiliate program, but if someone links to a page on your site and sticks on an arbitrary parameter (http://www.yourdomain.com?gobbledygook=123456), your web site code may still render the same page as if there was no parameter. This would still be duplicate content to a search engine if they find the link.

Site Architecture with Multiple Paths to the Same Page: (http://www.yourdomain.com/prod=1&type=7 resulting in the same page content as http://www.yourdomain.com?type=7). Many sites have this type of problem. It’s considered duplicate content.

Subdomaining: On some sites http://subdomain.yourdomain.com renders the same content as http://www.yourdomain.com. This is also duplicate content if someone links to these pages.

Cross publishing of articles on your site(s): Some sites have perfectly innocent reasons for wanting to show the same article in more than one place.

Machine Built Web Sites: You can end up with duplicate content if you machine render your whole site with lots of pages that differ in insubstantial ways. For example, you could have an online bookstore which has the same text on every page, except you perform string substitutions of the book name. This is also duplicate content.

http://www.yourdomain.com is the same as http://yourdomain.com: Nearly everyone has this problem unless they explicitly address it. This is an easy one: just 301 redirect from the “non-www” version of your site to the “www” version of your site (or vice versa).
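Several of the cases above (affiliate IDs, arbitrary parameters, and www versus non-www) come down to many URLs mapping to one page. Here is a minimal Python sketch of how you might compute a single canonical URL for each of them. The set of meaningful parameters and the choice of the “www” host are assumptions made for illustration; your own site will have its own list.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: only these query parameters change what the page renders.
MEANINGFUL_PARAMS = {"type"}
# Assumption: the "www" host is the version you want indexed.
CANONICAL_HOST = "www.yourdomain.com"
HOST_ALIASES = {"yourdomain.com", "www.yourdomain.com"}


def canonical_url(url: str) -> str:
    """Return the one URL that should represent this page."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host in HOST_ALIASES:
        host = CANONICAL_HOST
    # Drop parameters that do not affect content (affid, gobbledygook, ...)
    # and sort the rest so parameter order cannot create new variants.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in MEANINGFUL_PARAMS
    )
    path = parts.path or "/"
    return urlunsplit((parts.scheme, host, path, urlencode(kept), ""))


if __name__ == "__main__":
    for example in (
        "http://yourdomain.com?affid=123456",
        "http://www.yourdomain.com?type=7&gobbledygook=123456",
    ):
        print(example, "->", canonical_url(example))
```

Any requested URL that differs from its canonical form is a candidate for a 301 redirect to that form. Note that this sketch leaves other subdomains untouched, since whether they duplicate the main site depends on how your site is set up.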

So what’s the real downside of duplicate content? First, if there are significant portions of your site which have duplicate content (such as a machine built site), you may not be indexed at all. Google, in particular, does a good job of finding sites that have little added value and screening them out. If this is not your situation, then the most obvious downside is that the search engine will choose which page to put in the index, and ignore the other copies. However, there are 4 types of consequences that are less obvious:

The search engine might not choose the “right” copy of the page to index. This can be particularly painful in a syndication situation where your partner shows up in the search engine and you don’t. This does happen!

If you have an affiliate link that comes from a page that is seen as more authoritative than you, your site’s listing could show up in the search engine results with that affiliate link on it. For example, if you searched for your site’s home page, you would see something like: http://www.yourdomain.com?affid=123456 instead of http://www.yourdomain.com. This would result in your paying commissions to your affiliate for traffic that should be organically yours at no cost.

Crawl Budget: If the search engine comes to your site planning to crawl 1000 pages, and spends time crawling 500 pages that are duplicate content that it won’t index, you are not getting other pages with different content crawled, and will probably have fewer pages indexed than you could. That’s a shame.

Link Power Dilution: Some of the link value of your site will be spent on pages that don’t get indexed. If you eliminate the duplicate content, this link power will be distributed only among pages that will get indexed, resulting in a potential improvement in the ranking of those pages.

So what to do? All of the above problems have solutions, but the solution depends greatly on the exact nature of your problem. For example, if you are having problems due to affiliate links, the simplest solution is to require your affiliates to place a rel="nofollow" on the affiliate links to your site. This is a bit of a shame, because you lose the link credit in the process. However, it will resolve any problem you have with affiliates who accidentally hijack your search results. One thing you can do is to reserve the right to make any affiliate who signs up with you add the rel="nofollow" to their link to you upon your request.
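If you do ask affiliates to add rel="nofollow", you will probably want a way to check that they did. The following is a rough Python sketch using only the standard library. It scans a page’s HTML for links to your host that lack the attribute. The class and function names are made up for this example, and fetching the affiliate’s page is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class AffiliateLinkAudit(HTMLParser):
    """Collect links to target_host that are missing rel="nofollow"."""

    def __init__(self, target_host: str):
        super().__init__()
        self.target_host = target_host
        self.followed_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href") or ""
        # Only links that point at your own host matter here.
        if urlsplit(href).netloc.lower() != self.target_host:
            return
        # rel can hold several space-separated tokens, e.g. "nofollow sponsored".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" not in rel_tokens:
            self.followed_links.append(href)


def links_missing_nofollow(html: str, target_host: str = "www.yourdomain.com"):
    """Return hrefs pointing at target_host that do not carry rel="nofollow"."""
    audit = AffiliateLinkAudit(target_host)
    audit.feed(html)
    return audit.followed_links
```

An empty list means every link to your host on that page carries the attribute. Anything else is a link you may want to raise with the affiliate.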

For problems on your own domain, you need to invest the time to close out the various potential duplicate content problems that you have. Then take the variations of the pages that you have eliminated and 301 redirect them to the surviving version of the page. This is a critical step, as the redirect tells the search engine that you have made the move. Do NOT use a 302 redirect or a meta-refresh (the one exception: if you can’t access your server files to set up redirects, you may need to rely on a meta-refresh set to “0”).
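Most sites set up 301 redirects in their web server configuration, but the idea is easy to see in code. Here is a minimal sketch of a Python WSGI application that answers requests for retired duplicate paths with a 301 pointing at the surviving page. The paths in the mapping are hypothetical placeholders.

```python
# Hypothetical mapping from eliminated duplicate paths to the surviving page.
REDIRECTS = {
    "/old-article": "/articles/duplicate-content",
    "/print/duplicate-content": "/articles/duplicate-content",
}


def app(environ, start_response):
    """WSGI app: 301 for retired paths, a plain page for everything else."""
    path = environ.get("PATH_INFO", "/")
    target = REDIRECTS.get(path)
    if target is not None:
        # 301 marks the move as permanent; a 302 would mark it as temporary.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"surviving page\n"]
```

To try it locally, you could serve `app` with the standard library’s `wsgiref.simple_server.make_server` and request one of the retired paths.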

As outlined above, there are lots of reasons to clean up your duplicate content problems. For many sites, this can result in more pages indexed by the search engine, or higher rankings for their pages. These are two very good reasons indeed.

Comments

I have been a webmaster for over a year, and I publish articles of 1,500+ words three times per week on my blog. The problem is visitors who copy and paste my articles onto their own blogs exactly, without any change. So what is my fault here, that I get penalized by Google for duplicate content?

Karait, first there is no such thing as a Google duplicate content “penalty.” In other words, Google takes no actions against sites for duplicate content, except for “doorway pages” which are deliberate attempts to trick Google into ranking you more than once for the same keyword. However, that is an internal site problem, and not what you are talking about.

Your issue is scraped content, content being taken from your site and duplicated elsewhere. Unfortunately there is not much you can do about it, other than trying to contact the site owner and asking them to take it down, or at least link to your page as the original.

Here the only issue is which page Google will choose to put higher in its rankings. When Google sees multiple pages across the web with the same content, it doesn’t want to rank them all, so it chooses one. It picks that page either because it is able to detect that the page is the likely original, or because of normal ranking factors.

In the former case, if the copies of your content contain a link back to your page (such as “originally published at….”) then that makes it easier for Google to see yours is the original that should rank. However, if they don’t, then it will be a matter of which page has higher Google ranking factors, such as the domain authority of the site, links to the site and page, and of course, many other factors.