Duplicate Content Demystified

Earlier in the week, Google announced that it will begin alerting webmasters about duplicate content issues. That’s very exciting news, but it presumes that everyone knows what duplicate content is. Since I’m sure not everyone is as big an SEO nerd as I am, I thought I’d take a few moments to explain what duplicate content is and why it’s important to your website.

What Is Duplicate Content?

According to Google, “duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar.” Um, OK. But what does that mean? Basically, it just means you should use unique content on each page of your site. And when I say “unique,” I don’t mean changing a word here or there. I mean truly unique chunks of text. As a quick rule of thumb: if you think two pages are a little too similar, the search engines probably will too.
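To make “appreciably similar” a little more concrete, here is a minimal sketch of one common way to score how much two chunks of text overlap: break each into overlapping three-word sequences and compare the sets. This is only an illustration. The search engines don’t publish exactly how they measure similarity, and the function names and the three-word window are my own assumptions.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word sequences ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Very short text: treat the whole thing as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=3):
    """Score overlap from 0.0 (nothing shared) to 1.0 (identical shingles)."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 1.0  # two empty texts are trivially identical
    return len(a & b) / len(a | b)


original = "the quick brown fox jumps over the lazy dog"
tweaked = "the quick brown fox leaps over the lazy dog"
print(jaccard_similarity(original, original))  # identical text
print(jaccard_similarity(original, tweaked))   # one word changed
```

Changing a single word still leaves a large share of the sequences in common, which is why swapping a word here or there doesn’t make a page unique.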

Why Do Search Engines Care About Duplicate Content?

A search engine’s purpose in life is to find the most relevant content that matches a searcher’s query. With that in mind, duplicate content is considered noise by search engines because searchers don’t want to see multiple versions of the same thing in their search results. Consequently, when search engines discover similar pages, they designate one of the pages as the original, and they ignore the duplicates (in some cases, they even remove the duplicates from the index). Then, when a searcher submits a relevant query, the original page is the only one that appears in the search results.

Why Should Webmasters Care About Duplicate Content?

When search engines identify similar pages, they use their own algorithms to determine which page is the original (you can help them make this decision, but I’ll discuss that in my next post). At first glance, this might not seem like a big problem, but let’s assume you have two pages that the search engines consider duplicates. The first page is highly optimized for conversions and conveys a very effective sales pitch. It has flashy graphics and amazing testimonials. Just thinking about it brings a smile to your face. The second page doesn’t have any of these things. It’s a bare-bones summary of the first page, but the content is almost identical. What will happen if the search engines select the second page as the original? You guessed it! Wave goodbye to those conversions.

Another negative aspect of duplicate content is what I like to call the “dilution effect.” When you write a really popular piece of content, you expect it to receive a lot of links, retweets, likes, +1s, etc. However, if that content is duplicated elsewhere, all of those wonderful endorsements from others are going to be spread out across the duplicate versions of your content. Instead of having one incredibly popular piece of content, you’ll have many quasi-popular pieces of duplicate content. Sadly, your endorsements will be diluted.
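Here is a quick back-of-the-envelope sketch of the dilution effect. It assumes the endorsements are split evenly across the copies, which is a simplification of my own (in practice, the split is usually lopsided), and the numbers are made up for illustration.

```python
def split_endorsements(total, copies):
    """Spread `total` endorsements as evenly as possible across `copies` pages."""
    base, extra = divmod(total, copies)
    # The first `extra` copies each receive one leftover endorsement.
    return [base + 1 if i < extra else base for i in range(copies)]


print(split_endorsements(900, 1))  # one page gets everything
print(split_endorsements(900, 3))  # three duplicates share the same total
```

The total never changes; it is simply divided among more pages, so no single page looks as popular as it should.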

What Are The Most Common Causes Of Duplicate Content?

Technically, the most common cause of duplicate content is spammers scraping content from legitimate sites. But since we’re not spammers, let’s focus our attention on much less obvious ways that your site might be generating duplicate content.

If you find a spammer scraping your site’s content, submit a DMCA request to claim ownership of your content.

WWW vs. Non-WWW – This is the most common (non-spammy) cause of duplicate content on the Web, and it’s also one of the biggest sources of confusion for webmasters. To help explain, let’s look at the URL for my favorite website (http://www.webgnomes.org). As you know, this is our site’s address. But what you might not realize is that http://webgnomes.org is NOT the same address as http://www.webgnomes.org (yes, that WWW actually does make a difference). We’ve made sure the Non-WWW address maps to the WWW address (and I’ll discuss how to do that in my next post), but it’s important to realize that the search engines treat each address separately. Consequently, unless you map one of the addresses to the other, the search engines will treat one of the pages as a duplicate of the other.
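If you want to see the difference for yourself, the sketch below uses Python’s standard urllib.parse module to show that the two addresses have different hostnames, and then rewrites any address to its WWW form so the two can be compared. The helper name prefer_www is my own, and the sketch assumes simple URLs without ports or login details. It only helps you spot the issue in a list of URLs; it is not the server-side mapping mentioned above.

```python
from urllib.parse import urlsplit, urlunsplit


def prefer_www(url):
    """Return the WWW form of a URL, leaving path and query untouched."""
    parts = urlsplit(url)
    host = parts.netloc
    if not host.lower().startswith("www."):
        host = "www." + host
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


with_www = "http://www.webgnomes.org"
without_www = "http://webgnomes.org"

# As plain addresses, the two hostnames are simply different strings.
print(urlsplit(with_www).hostname == urlsplit(without_www).hostname)

# After rewriting both to one preferred form, they match.
print(prefer_www(with_www) == prefer_www(without_www))
```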

Multiple Domain Names – This is similar to the WWW vs. Non-WWW issue above, but it’s slightly less confusing. Many people own numerous domains, and they use them all to serve up the same content. These domains might have different TLDs (e.g., http://www.domain.com, http://www.domain.net, http://www.domain.org, etc.), or they might have different root domains (e.g., http://www.one.com, http://www.two.com, http://www.three.com, etc.). Regardless, if each domain returns the same content, all but one of them are considered duplicates.
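One rough way to check whether your own domains are serving identical content is to fingerprint each page’s text and group the URLs that share a fingerprint. The sketch below does that with a SHA-256 hash. The page text is made up, the function name is my own, and exact-match hashing only catches identical copies (not “appreciably similar” ones), so treat it as a starting point.

```python
import hashlib
from collections import defaultdict


def group_by_content(pages):
    """Group URLs whose text is identical after whitespace and case cleanup.

    `pages` maps each URL to the text it returns.
    """
    groups = defaultdict(list)
    for url, text in pages.items():
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    # Only groups with more than one URL are duplicates of each other.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical page text, for illustration only.
pages = {
    "http://www.domain.com": "Welcome to our  store",
    "http://www.domain.net": "welcome to our store",
    "http://www.domain.org": "About our company",
}
print(group_by_content(pages))
```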

Syndication – Many sites syndicate their content to other sites (e.g., blogs, RSS aggregators, etc.) because it’s a great way to get additional exposure and build a bigger audience. However, you have to be careful with syndication because if your content is not handled correctly on other sites, you can have a duplicate content situation on your hands. In the worst case, the search engines might even select the syndicated version of your content as the original!

CMS Issues – As more and more people rely on Content Management Systems to drive their websites, duplicate content issues arise from the various entry points that are introduced by these CMSs. For example, many CMSs will create archives based on various pieces of information (e.g., dates, categories, authors, etc.). These archives are simply reproductions of existing content on the site, and as a result, they are regularly responsible for duplicate content problems.

Mobile-friendly Pages – Every year, the amount of mobile Web traffic increases, prompting sites to create mobile-friendly versions of their content. Unfortunately, mobile-friendly pages typically remove only extraneous formatting, so the content remains largely similar (or even identical) to the original version of the pages.

Printer-friendly Pages – Similar to mobile-friendly pages, printer-friendly pages are almost identical to their originals; they simply lack extraneous formatting. Consequently, they are also a frequent contributor to duplicate content problems.

Now that you know what duplicate content is, why it’s important, and how it can happen, be on the lookout for my next post about how to avoid it!

About The Author

Steve Webb is an SEO audit specialist at Web Gnomes. He received his Ph.D. from Georgia Tech, where he published dozens of articles on Internet-related topics. Professionally, Steve has worked for Google and various other Internet startups, and he's passionate about sharing his knowledge and experiences with others. You can find him on Twitter, Google+, and LinkedIn.
