Fat Pandas and Thin Content

If you’ve been hit by the Panda update or are just worried about its implications, you’ve probably read a lot about “thin” content. We spend our whole lives trying to get thin, and now Google suddenly hates us for it. Is the Panda update an attempt to make us all look like Pandas? Does Google like a little junk in the trunk?

It’s confusing and it's frustrating, especially if you have real money on the line. It doesn’t help that “thin” content has come to mean a lot of things, and not every definition has the same solution. To try to unravel this mess, I'm going to present 7 specific definitions of “thin” content and what you can do to fatten them up.

Quality: A Machine’s View

To make matters worse, “thin” tends to get equated with “quality” – if you’ve got thin content, just increase your quality. It sounds good, on the surface, but ultimately Google’s view of quality is defined by algorithms. They can’t measure the persuasiveness of your copy or the manufacturing standards behind your products. So, I’m going to focus on what Google can measure, specifically, and how they might define “thin” content from a machine’s perspective.

1. True Duplicates (Internal)

True, internal duplicates are simply copies of your own pages that make it into the search index, almost always a result of multiple URLs that lead to the same content. In Google’s eyes, every URL is a unique entity, and every copy makes your content thinner:

A few duplicates here and there won’t hurt you, and Google is able to filter them out, but when you reach the scale of an e-commerce site and have 100s or 1000s of duplicates, Google’s “let us handle it” mantra fails miserably, in my experience. Although duplicates alone aren’t what the Panda update was meant to address, these duplicates can exacerbate every other thin content issue.

The Solution

Get rid of them, plain and simple. True duplicates should be canonicalized, usually with a 301-redirect or the canonical tag. Paths to duplicate URLs may need to be cut, too. Telling Google that one URL is canonical while linking to 5 versions on your own site will only prolong your problems.
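To make the fix concrete, here's a minimal sketch (the parameter names are made up for illustration) of the kind of URL normalization that collapses duplicate variants onto one canonical URL - the same decision you'd then encode as a 301 rule or canonical tag:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Hypothetical parameters that spawn duplicate URLs on this imaginary site.
IGNORED_PARAMS = {"sessionid", "utm_source", "utm_medium", "sort"}

def canonical_url(url):
    """Collapse duplicate URL variants onto a single canonical form."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    # Drop session/tracking parameters and the fragment, then sort what's
    # left so parameter order can't create yet another "unique" URL.
    params = sorted(
        (k, v) for k, v in parse_qsl(query) if k.lower() not in IGNORED_PARAMS
    )
    return urlunsplit((scheme, netloc, path.rstrip("/") or "/", urlencode(params), ""))
```

Every variant that normalizes to the same string is a candidate for a 301 to that URL (or a canonical tag pointing at it).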

2. True Duplicates (Cross-site)

Google is becoming increasingly aggressive about cross-site duplicates, which may differ by their wrapper but are otherwise the exact same pieces of content across more than one domain:

Too many people assume that this is all an issue of legitimacy or legality – scrapers are bad, but syndication and authorized duplication are fine. Unfortunately, the algorithm doesn’t really care. The same content across multiple sites is SERP noise, and Google will try to filter it out.

The Solution

Here’s where things start to get tougher. If you own all of the properties or control the syndication, then a cross-domain canonical tag is a good bet. Choose which version is the source, or Google may choose for you. If you’re being scraped and the scrapers are outranking you, you may have to build your authority or file a DMCA takedown. If you’re a scraper and Panda knocked you off the SERPs, then go Panda.

3. Near Duplicates (Internal)

Within your own site, “near” duplicates are just that – pages which vary by only a small amount of content, such as a couple of lines of text:

A common example is when you take a page of content and spin it off across 100s of cities or topics, changing up the header and a few strategic keywords. In the old days, the worst that could happen was that these pages would be ignored. Post-Panda, you risk much more severe consequences, especially if those pages make up a large percentage of your overall content.

Another common scenario is deep product pages that only vary by a small piece of information, such as the color of the product or the size. Take a T-shirt site, for example – any given style could come in dozens of combinations of gender, color, and size. These pages are completely legitimate, from a user perspective, but once they multiply into the 1000s, they may look like low-value content to Google.

The Solution

Unfortunately, this is a case where you might have to bite the bullet and block these pages (such as with META NOINDEX). For the second scenario, I think that can be a decent bet. You might be better off focusing your ranking power on one product page for the T-shirt instead of every single variation. In the geo-keyword example, it’s a bit tougher, since you built those pages specifically to rank. If you’re facing large-scale filtering or devaluation, though, blocking those pages is better than the alternative. You may want to focus on just the most valuable pages and prune those near duplicates down to a few dozen instead of a few thousand. Alternatively, you’ve got to find a way to add content value, beyond just a few swapped-out keywords.
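If you go the NOINDEX route for product variants, the logic can be as simple as this sketch - the URL scheme and the idea of one designated indexable URL per product are assumptions for illustration, not a prescribed implementation:

```python
# One indexable URL per product; every near-duplicate variant gets
# "noindex,follow" so it stays out of the index but still passes links.
# The URL scheme and product paths here are invented for illustration.

def robots_meta(url, canonical_for_product):
    """Pick a robots meta value for a product page or one of its variants."""
    product_path = url.split("?")[0]
    if canonical_for_product.get(product_path) == url:
        return "index,follow"
    return "noindex,follow"

# The bare product page is the designated canonical for this product.
canonical_for_product = {"/t-shirt/classic": "/t-shirt/classic"}
```

The "follow" half matters: you're pruning the index, not cutting off crawl paths or link equity.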

4. Near Duplicates (Cross-site)

You can also have near duplicates across sites. A common example is a partnered reseller who taps into their customers’ databases to pull product descriptions. Add multiple partners, plus the original manufacturer’s site, and you end up with something like this:

While the sites differ in their wrappers and some of their secondary content, they all share the same core product description (in red). Unfortunately, it’s also probably the most important part of the page, and the manufacturer will naturally have a ranking advantage.

The Solution

There’s only one viable long-term solution here – if you want to rank, you’ve got to build out unique content to support the borrowed content. It doesn’t always take a lot, and there are creative ways to generate content cost-effectively (like user-generated content). Consider the product page below:

The red text is the same, but here I’ve supplemented it with 2 unique bits of copy: (1) a brief editorial description, and (2) user reviews. Even a 1-2 sentence lead-off editorial that’s unique to your site can make a difference, and UGC is free (although it does take time to build).

Of course, the typical argument is “I don’t have the time or money to create that much unique content.” This isn’t something you have to do all at once – pick the top 5-10% of your best sellers and start there. Give your best products some unique content and see what happens.

5. Low Unique Ratio

This scenario is similar to internal near-duplicates (#3), but I’m separating it out because I find it manifests in a different way on a different set of sites. Instead of repeating body content, sites with a low ratio of unique content end up with too much structure and too little copy:

This could be a result of excessive navigation, mega-footers, repeated images or dynamic content – essentially, anything that’s being used on every page that isn’t body copy.

The Solution

Like internal near-duplicates, you’ve got to buckle down and either beef up your unique content or consider culling some of these pages. If your pages are 95% structure with 1-2 sentences of unique information, you really have to ask yourself what value they provide.
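If you want a rough number before deciding, you can approximate the unique-content ratio yourself. This sketch compares a page's visible text against the boilerplate text shared by every page; a real audit would work on rendered HTML, so treat it as a back-of-the-envelope check:

```python
# Back-of-the-envelope unique-content ratio: what fraction of a page's
# visible words isn't part of the boilerplate (navigation, mega-footer,
# etc.) repeated on every page. Plain-text inputs are assumed.

def unique_ratio(page_text, boilerplate_text):
    boilerplate = set(boilerplate_text.lower().split())
    words = page_text.lower().split()
    if not words:
        return 0.0
    unique = [w for w in words if w not in boilerplate]
    return len(unique) / len(words)
```

Pages that score near zero are exactly the "95% structure" pages worth beefing up or culling.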

6. High Ad Ratio

You’ve all seen this site, jam-packed with banner ads of all sizes and AdSense up and down both sides (and probably at the top and bottom):

Of course, not coincidentally, you’ve also got a low amount of unique content in play, but Google can take an especially dim view of loading up on ads with nothing to back it up.

So, how much is too much? Last year, an affiliate marketer posted a very interesting conversation with an AdWords rep. Although this doesn’t technically reveal anything about the organic algorithm, it does tell us something about Google’s capabilities and standards. The rep claims that Google views a quality page as having at least 30% unique content, and it can only have as much space devoted to ads as it does to unique content. More importantly, it strongly suggests that Google can algorithmically measure both content ratio (#5) and ad ratio.
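Taken at face value, the rep's claimed thresholds are easy to express as a check. To be clear, the 30% figure and the ads-no-bigger-than-content rule come from that reported conversation, not from any published Google documentation:

```python
# The AdWords rep's reported thresholds expressed as a check: at least 30%
# of the page is unique content, and ad area is no larger than the area
# devoted to that unique content. These numbers are hearsay from the cited
# conversation, not an official Google formula.

def passes_quality_heuristic(unique_area, ad_area, total_area):
    unique_share = unique_area / total_area
    return unique_share >= 0.30 and ad_area <= unique_area
```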

The Solution

You’ve got to scale back, or you’ve got to build up your content. Testing is very important here. Odds are good that, if your site is jammed with ads, some of those ads aren’t getting much attention. Collect the data, find out which ones, and cut them out. You might very well find that you not only improve your SEO, but you also improve the CTR on your remaining ads.

7. Search within Search

Most large (and even medium-sized) sites, especially e-commerce sites, have pages and pages of internal search results, many reachable by links (categories, alphabetical, tags, etc.):

Google has often taken a dim view of internal search results (sometimes called “search within search”, although that term has also been applied to Google’s direct internal search boxes). Essentially, they don’t want people to jump from their search results to yours – they want search users to reach specific, actionable information.

While Google certainly has their own self-interest in mind in some of these cases, it’s true that internal search can create tons of near duplicates, once you tie in filters, sorts, and pagination. It’s also arguable that these pages create a poor search experience for Google users.

The Solution

This can be a tricky situation. On the one hand, if you have clear conceptual duplicates, like search sorts, you should consider blocking or NOINDEXing them. Having the ascending and descending version of a search page in the Google index is almost always low value. Likewise, filters and tags can often create low-value paths to near duplicates.

Search pagination is a difficult issue and beyond the scope of this post, although I’m often in favor of NOINDEXing pages 2+ of search results. They tend to convert poorly and often look like duplicates.
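For sites that generate robots meta tags dynamically, the policy above can be sketched as a small function - the `sort`, `filter`, and `page` parameter names are assumptions about a particular site's URL scheme:

```python
from urllib.parse import urlsplit, parse_qs

# Index the first page of a search/category listing; NOINDEX sorted,
# filtered, and paginated variants. The "sort", "filter", and "page"
# parameter names are assumptions about this particular site.

def search_robots_meta(url):
    params = parse_qs(urlsplit(url).query)
    page = int(params.get("page", ["1"])[0])
    if "sort" in params or "filter" in params or page >= 2:
        # "follow" keeps crawl paths open even though the page stays unindexed.
        return "noindex,follow"
    return "index,follow"
```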

A Few Words of Caution

Any change that would massively reduce your search index is something that has to be considered and implemented carefully. While I believe that thin content is an SEO disadvantage and that Google will continue to frown on it, I should also note that not all of these scenarios are necessarily reflected in the Panda update. These issues do reflect longer-standing Google biases and may exacerbate Panda-related problems.

Unfortunately, we’ve seen very few success stories of Panda recovery at this stage, but I strongly believe that addressing thin content, increasing uniqueness, and removing your lowest value pages from the index can have a very positive impact on SEO. I’d also bet good money that, while the Panda algorithm changes may be adjusted and fine-tuned, Google’s attitude toward thin content is here to stay. Better to address content problems now than find yourself caught up in the next major update.

I think your post is even more useful than being just about Panda... it's a perfect small manual on what duplicate content is; something easy to say but, as I've discovered, hard to truly make people understand.

About #7 - Don't forget you can use Webmaster Tools Parameter Handling to filter out those parameters you want Googlebot to ignore. This might be quicker to do than trying to NOINDEX etc. Coupling parameter handling with canonicalisation is a good step in the right direction.

There must be millions of ecommerce sites with 'Thin' content. This is a helpful post...

My only issue with GWT parameter handling (and this applies to seo-himanshu's comment as well) is that it's Google-only. It won't impact Bing or your SEO tools and analytics (including our own campaign tools). Google could also just up and change how it works, so it makes me a little jumpy.

That said, it does seem to be effective and often faster than other methods. I think of it more as a short-term solution, though - something to patch the leak while you rebuild the levee.

I would like to give one small tip here. Generally, websites that duplicate their own content through faceted navigation, filters, internal search, session IDs, etc. append a parameter to the end of a URL. For example:

http://www.abc.com/i-am-original.php

http://www.abc.com/i-am-original.php?_SID=U

http://www.abc.com/i-am-original.php?ocode=02916595

http://www.abc.com/i-am-original.php?ocode=02916592

..

Through the Google Webmaster Tools 'parameter handling' feature, you can suggest that Google ignore such parameters. This method can quickly reduce a large number of duplicate pages from your website in Google's index. It is also more efficient than slapping noindex or canonical tags on each and every page, especially if your website is very big and produces/removes a large volume of content on a daily/weekly basis.
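The same parameter-ignoring logic is easy to replicate in-house, which is handy for auditing how many distinct pages you actually have before (or instead of) relying on Google's tools. This sketch uses the `_SID` and `ocode` parameters from the example URLs above:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Mimic parameter handling locally: drop the parameters you'd tell Google
# to ignore, then count how many distinct pages actually remain.
IGNORED = {"_sid", "ocode"}  # from the example URLs above

def strip_ignored(url):
    scheme, netloc, path, query, _fragment = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query) if k.lower() not in IGNORED]
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))

urls = [
    "http://www.abc.com/i-am-original.php",
    "http://www.abc.com/i-am-original.php?_SID=U",
    "http://www.abc.com/i-am-original.php?ocode=02916595",
    "http://www.abc.com/i-am-original.php?ocode=02916592",
]
distinct_pages = {strip_ignored(u) for u in urls}  # all four collapse to one
```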

Brilliant post. The visuals in this do more to explain duplicate content and solutions for it than any overview I have seen. This also really speaks to a company's overall content strategy. By reading this, I hope it becomes clear to many that usefulness is very important - probably more important than content saturation. I am certain that Google and the like will continue to change the rules, making relevance and usefulness matter more and more. I love that with every update to their algorithm there are fewer chances of "cheating the system".

What do you suggest for a website like hotels.com, which shows pages based on listings? The listings differ based on query type, but essentially, they are the same.

Is syndicating content from other pages a good idea?

Example: NYC hotels and cheap NYC hotels are 2 different URLs and they pull content from a destination page which has around 7 parts. The NYC hotels page pulls content from 1 part of NYC writeup and links to that page with a read more link. Similarly, cheap NYC hotels will pull content from 2nd part and so on.

Does that make the pages thin, considering that the listing entries are different for NYC hotels and cheap NYC hotels, and it's ensured that the content is syndicated and not scraped?

That's a good (and very tough) question. Let me start by saying that it's not a level playing field. Big brands and high authority sites can get away with things that other sites can't.

Technically, I do think this is duplicate content, and I think it has low value for search visitors. These pages are created purely to rank on alternate terms. The thing is - it's probably working for them and it may be generating tons of traffic. Does that mean Hotels.com will be immune forever? Probably not, but you'd be hard pressed to convince them to cut out ranking pages today.

It's a risk calculation, from an SEO perspective. Originally, large-scale duplication just meant some pages might go supplemental or get filtered out. Not ideal, but not a disaster. Later, duplication started creating indexation issues and stronger filters that started hurting pages beyond the duplicates. Still not a Capital-P Penalty, but the consequences could be serious.

Now, you've got Panda, and what really feels like a penalty to a lot of people. I don't think duplicates alone are the cause in most cases, but now you're looking at a situation where large-scale duplication could impact an entire site. So, when do you start to move toward the next algorithm change? That's a very difficult calculation.

Websites (not necessarily hotels.com, but websites of similar status) have numerous legitimate social and website mentions. Folks refer to them (like we are doing here), and yet they have these issues on particular pages.

My question is:

does Google consider a webpage in full, or does it take components? It can happen that a block of text has been syndicated, but the page as a whole is not similar to any other page.

Would you consider a page thin if its syndicated content from another page makes up only about 10% of the real estate, and the rest of the area is informative listings which are not replicated elsewhere?

What are your thoughts on large ecommerce sites that have a large number of affiliates? Post-Panda, do you think the potential duplicate content issues make it a bad idea to have any affiliates at all?

We've lost a lot of traffic after the Panda updates, so I've been trying to figure out how to eliminate duplicate content.

I just did a quick Google Analytics check to see how much sales were actually generated from users who landed on page 2+ of our category or search pages (the only pages on our site that have multi-page results). I ran a "landing page vs. sales" report with some simple filters. In our case, this was fairly easy, since everything is controlled by "cat=", "search=" and "page=" URL parameters.

I discovered that 1.41% of our sales (in dollars) over the past year were generated by customers who landed on a page 2+ category or search page.

After adding another filter to separate the category and search pages, I came up with these results:

1.25% of sales came from a page 2+ category landing page

0.19% of sales came from page 2+ search results landing page

This was just a quick and dirty test, I didn't bother to drill down and see whether all this traffic actually came from Google search (some of it came from other search engines and traffic sources).

However, since only a small percentage of sales came from page 2+ search results, it would probably be safe to add a NOINDEX tag to those pages. If adding the NOINDEX tag to those pages would result in regaining the traffic we've lost due to duplicate content, it would probably be worthwhile.

I've already added the "canonical" tag to those pages, but it's too early to tell if it has made any difference. If that doesn't work, I'll try the NOINDEX. From what I've read in the Google guidelines, you should only use one tag or the other and not both.

If you are running Google Analytics to track ecommerce, you should be able to do a similar analysis.

UPDATE: I just drilled down to filter only Google search traffic and discovered that only 0.02% of our annual sales came from landing pages that were page 2+ search results. In our case, this was just one sale.

Since only a negligible amount of sales are coming from these pages, it could mean that Google may not even have most of them indexed. Even if they are indexed, it certainly wouldn't hurt to add a "NOINDEX" tag, since they don't do much for sales and might be creating a duplicate content penalty.
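For anyone who wants to replicate this analysis outside of Google Analytics, here's a sketch of the same calculation on exported rows of (landing page, sales); the `page=` parameter mirrors the URL scheme described above, and the sample figures are made up:

```python
from urllib.parse import urlsplit, parse_qs

# Share of revenue from visitors landing on page 2+ of category/search
# results, given exported (landing_page, sales) rows. The "page=" parameter
# and the sample figures are illustrative, not real data.

def deep_page_sales_share(rows):
    total = sum(sales for _url, sales in rows)
    deep = sum(
        sales
        for url, sales in rows
        if int(parse_qs(urlsplit(url).query).get("page", ["1"])[0]) >= 2
    )
    return deep / total if total else 0.0

rows = [
    ("/category?cat=shoes", 5000.0),        # page 1 landing
    ("/category?cat=shoes&page=2", 50.0),   # deep category landing
    ("/search?search=boots&page=3", 21.0),  # deep search landing
]
```

A share this small is the kind of evidence that makes NOINDEXing page 2+ an easy call.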

Hi, after reading your article, it occurred to me that pages such as monthly archives and category archives (which are created in one way or another by all publishing platforms I can think of) may be viewed as spun content by Google. We use both monthly archives, such as this: http://www.masternewmedia.org/2011/02/ and category archives like this: http://www.masternewmedia.org/search_tools_and_technologies.htm

What you discussed under point #3, Near Duplicates (Internal), means that these pages we've had for years are now seen as partial copies of the articles. We have already noindexed the monthly archives and are about to do the same for the category archives. Is there anything else you would suggest doing on such pages, or would you do something different?

Also, if you look at any article, we have a "related articles" section at the bottom of each article. That section (yet again) has excerpts from other related articles. If we show those via Javascript, do you think Google may be happier that way?

I wouldn't worry too much about the excerpts (especially if they're short) - this isn't an all-or-none thing, and excerpts like that have clear intent and some value. Meanwhile, hiding content with JS could prove more risky.

Category pages (which are essentially searches) are tougher, since some of them may rank. I usually start with the lowest value pages - like sorts and filters - and work up, tackling pagination last. I also think your aggressiveness in culling near-duplicates has to be weighed against risk. If you've taken a hit, it's worth being more aggressive. If you're trying to preempt future problems, then ease into it. There's a balance, and it's not easy to find.

I think we are slowly beginning to see some websites on the road to recovery from Panda - maybe not the ones that were hit the hardest on a global level, but as far as I could tell, most of them took some drastic measures to improve the quality of their content and to remove duplicate content issues. The only thing bugging me is that most non-English websites have yet to see the light of Panda, as they were not yet affected; the question is whether they will be at all. But for smart users this gives a time window to sort out all the issues before Panda comes to their homes :)

I was very busy last month, and now that I'm back, I've found the new "thin content" expression. I searched, but I couldn't find a good definition of it. It must be a language issue (I'm from Argentina, Patagonia). Could somebody tell me the meaning of it?

It really just means that the web pages in question are light on original content. Google talks a lot about "thin affiliates", for example. These are sites that repurpose other people's products for sale (with affiliate links) but then add essentially nothing of value to them. They buy a keyword-loaded domain, slap up some pages, and wait for the money to come in. Meanwhile, the content on those pages is on 100 other sites across the web. So, while the pages may be long or short, the original content is "thin".

We've been scratching our heads trying to figure out what to do after taking the pounding from the G-Panda.

Ours is an e-commerce website with about 4.2 million pages (unique URLs). All the content is from publishers since we sell books.

I fully agree with your solution of padding up the content....

"The red text is the same, but here I’ve supplemented it with 2 unique bits of copy: (1) a brief editorial description, and (2) user reviews. Even a unique 1-2 sentence lead-off editorial that’s unique to your site can make a difference, and UGC is free (although it does take time to build)."

.... but I am really struggling to see how to do it with the number of pages we have.

There must be at least 400-plus websites using the same content, and this starts from the publisher to Amazon to people like us.

This is exactly the problem: where your site brings nothing new to the index, you will always lose out to big authority sites like Amazon. I am working with a client now who has a similar problem, although not on the scale you are talking about here.

Really, you can look at this as a problem or an opportunity and whilst rewriting all that content is going to take forever, if you are the only vendor online with original descriptions for these products you will possibly be able to piggy back over lots of other more authoritative sites due to the unique content that makes for better search results.

4.2 Million products is a hell of a rewrite though so best of luck with that!

Google sent my site an email saying: "Thin content with little or no added value. This site appears to contain a significant percentage of low-quality or shallow pages which do not provide users with much added value (such as thin affiliate pages, cookie-cutter sites, doorway pages, automatically generated content, or copied content)."

I've posted a few articles in a How-to/Tips format and sent a reconsideration request to Google, but after 7 days they still have not indexed my new posts.

Bottom line ... avoid low quality pages in the first place. But "search within search" is an interesting factor (and possible dilemma) to consider. Something you don't really think about until you get into something like a large e-commerce site.

I am working on one e-commerce website where we have added 300+ pages to target different local cities in the USA. We have added quite different paragraphs on 100+ pages to remove the internal duplication issue and save our website from a Panda penalty.

You can visit the following page to learn more about it. We have added unique paragraphs on a few pages, but I have big concerns about the other elements on each page - like the banner gallery, front banner, tools, and a few other attributes that are common to every page aside from the 4-5 sentence paragraph.

I compiled one XML sitemap with all the local pages and submitted it to Google Webmaster Tools on 1st June 2013. But I can see only 1 page indexed by Google in Google Webmaster Tools.

Great post. From what I can tell, the biggest impact of Panda is keeping SEOs honest. A lot of the issues that Panda has raised relate to us as SEOs not properly taking the time to diagnose and address duplicate content.

I strongly believe this is part of why Panda is not just one ranking signal, but a combination of inputs being fed into a machine-learning system. What you end up with is a sort of "PandaRank", and that becomes just one more ranking factor in the 200+. So, in other words, you have 200+ ranking factors, with Panda being one of them, and even Panda represents dozens of signals.

As you said, any given signal, especially user signals (like bounce rate), could false-positive in some situations. Honestly, that's true of many ranking factors. That's why search engines look at so many factors and combine them in such complex ways. Regardless, though, there are always situations where they get it wrong. It's certainly no coincidence that we're on Panda 2.3 at this point.

Agreed. The way Panda differs though is that in a way it is binary. You are punished or not. Where all other individual parameters may have an incremental effect on your ranking, Panda may sweep away almost all traffic in one go if you trigger it via any unknown combination of parameters.

I’m curious how directories affect duplicate content, specifically address and descriptions. Are unique descriptions necessary for each individual directory? Can the description be a snippet from the website, or does it need to be unique?

Great article, Pete. I've bookmarked it for an in-depth read over a cup of coffee later on.

I have three ads per page all above the fold in their respective sizes (160,300,728).

One of my competitors that now lands on page 1/2 most of the time has 8 ads, plus a "click here to view the actual content" link that leads to another 12 ads on the actual page. 95% of the ads are AdSense. I have no AdSense ads on mine, and my pages that used to be on pages 1/2 are now on pages 13/17. I have seen some sites at the forefront with roughly the same number of ads as mine, or more, but with less unique content.

So either ads are not that much of a concern to google or they are giving sites showing their ads preferential treatment.

I have tons of duplicate content on the web 100% of which are from fruitcakes copying my meta descriptions or pages.

One thing I've never done is RSS feeds or social sites like Facebook and MySpace. The new sites I'm seeing at the front appear to be more socially intertwined (Facebook shares, bookmarks, etc.).

Some of the sites I now see daily (for new content) at pages 1/2 are high-traffic sites that pre-Panda used to rank on pages 3-5, so the social interaction must carry a lot of weight.

I have a feeling that once legitimate sites fix their crawl errors, duplicate content, dmca's etc then they'll begin to rise. I'm slowly beginning to see a slight increase with site fixes that I'm continuing to do now.

Unfortunately, the rules don't get applied evenly, and you can find an exception for any SEO best practice. Google has been vague about ad guidelines, and for obvious reasons - they make a fortune from Adsense. Still, they've been pretty clear about frowning on abuse, since it harms the buyer side.

How does this translate into organic SEO? That's a lot tougher. We saw some Panda cases where ad ratio seemed to play in, but it's not clear, and there were sites that seemed to be affected by Panda that didn't have this issue. There were also, as you said, sites with tons of ads that don't seem to be affected at all. We're getting to the point where these algo changes aren't just based on one single variable or an IF statement. They're getting more and more sophisticated, and even Google can't always predict the results when they roll complex changes out.

This post couldn't have come at a better time. Earlier this morning, I was researching why my friend's site in the UK (a comparison site) had been affected by Panda, after its traffic had dropped massively from mid-April onwards.

Your comments on the "Search within Search" issue got me thinking and I soon discovered and realised that some of their search results were getting indexed as well as the main category landing pages. We're going to work on noindexing them ASAP and hopefully it'll resolve the issue.

I think the suggestion of noindexing the search-within-search pages is the most valuable. I have a category page for manufacturers - one that will actually be of use to my users - so I have added lots of unique content around the listings to make it unique and 'fat'. The majority of my category pages, and practically all of the categories on my clients' sites, have been noindexed since long before Panda.

Good notes for Panda, thanks. What I am seeing: if you are an original content creator and your domain has less authority, and a big scraper website scrapes your content before Google indexes the original, Google may identify *you* as the scraper - the bad part of Panda, and there is no solution so far. Another point: for those who republish RSS feeds from major websites/blogs in their industry verticals, Panda actually seems to give preference to those websites, with professional writing sites getting outranked in the SERPs. Incredible! Another bad part is that Panda hit authoritative websites that share content with their partners - even sites like Digital Trends and some medical research websites got affected. Third, if someone rewrites your 5-year-old articles, Panda may outrank your old content with the rewritten or stolen copy. Filing DMCA takedowns is possible for those maintaining 1000+ pages, but near impossible for those with 100k+ pages. To some extent, Panda gives similar treatment to the content originator and the scraped-content provider.

I wish we had good answers for people who were unfairly affected by Panda (such as being treated like scrapers when they were the content source), but we really don't yet. You can appeal to Google, you can build authority, and you can push the legal side, but any of those may be ineffective or may require a large investment of time and money.

Yes, indeed. Though we are digging into the points of Panda's content classifier and ad placement policy besides the scraping issues, we hope that when Google Panda settles down after some slicing and dicing, it will again identify the originator.

I am an ecommerce web developer, and this is by far my biggest problem. I'm still not 100% sure that noindexing those duplicate pages is the best thing to do. Instead, buying plugins may be the way to go for us ecommerce web developers.

In my friend's site's instance (see my comment below), deeper search pages are getting indexed, which I'm happy to noindex because they offer little to no value to people visiting from the search engines. We would rather the main category landing page (which is the start of the search results) be what they land on anyway.

That's what you have to think about at the end of the day:

Will removing these pages from Google's index affect people coming onto the site?

On the contrary, will it be better (they come through a main landing page rather than an odd search result)?

If it gets rid of thin/duplicate content issues while also improving customer experience then it's a double-win, but like you say, if you're worried it might kill traffic then checking Analytics for the offending pages should help to do the trick.

Search within search is tough, and blocking search pagination isn't the right solution for all sites. In general, though, I do think that you should focus on your top level search pages. Landing someone on Page 17 of results for subtopic Q isn't high-value, and those visitors are probably going to bounce. Usually, it makes more sense to focus your index and your internal link-juice.

As for plug-ins, too many of them copy content across pages (internally) or even across sites. There are exceptions, but I find that many of them don't add valuable content. In addition, if they're JavaScript-based or use other dynamic technology, they may not even be crawled as content.
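For readers implementing this, the usual mechanism is a meta robots tag on the low-value search pages. A minimal sketch, assuming the deep pagination URLs are the ones you want out of the index (the URL in the comment is purely illustrative):

```html
<!-- In the <head> of a deep internal-search page, e.g. /search?q=widgets&page=17 -->
<!-- "noindex" keeps the page out of the index; "follow" still lets crawlers
     pass link-juice through the links on it -->
<meta name="robots" content="noindex, follow">
```

The "follow" value matters here: you drop the page from the index without cutting off the crawl paths to the items it lists.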

Great post. Any post about Panda these days is comforting. It shows you haven't forgotten those slapped by the evil Panda :-(

I lost 30% of my Google traffic from Panda 2.0 in April and am losing more every day (after gaining 25% from Panda 1.0).

70% of my traffic (40k/day) lands on my category pages which contain a unique 100 word introduction and 10 listings per page. The listings are like a search result page, the item title, description and link.

How and why should I make these category pages unique? My visitors don't want to read a 300-word introduction. They are there for the listings.
I can't noindex the category pages, because they are the landing pages for 70% of my traffic.

I'm a PPC manager, and I've seen a global (15 countries) decrease in conversions in April.

I suspect Panda is guilty. Why? Because, beyond Google searches, Google has "search partners": the internal search within websites. It isn't a "content campaign." So, if those sites were banned from Google organic, perhaps that would explain the decrease in conversions for almost all campaigns.

Even on the organic side, Google makes something on the order of a change EVERY day, and only a handful get names or press. Tie in the PPC side, and it's really tough to pin down whether any given algo change impacted someone.

It's possible that Panda impacted the content network dramatically, although I haven't seen direct evidence of that. On the AdWords side, though, you could see this by looking at placements. Were there dramatic shifts in which sites showed your ads? Maybe some of the high-converting sites went away, dropping your average conversion rate.

I had a similar thing happen to a client a year or two ago. They have an ambiguous name that's shared with other companies and entities, but they advertise a lot, so it tends to be cost-effective to bid on the broad term. One content network site was generating solid conversions, and then we saw a dip. We later realized that that site was originally a parked domain but then got bought out and became a dating site (no relevance to my client or their business), so just that ONE site caused a decent-sized drop in conversion rate.

Thanks for the post Dr. Pete. I work with a lot of project managers that ask about duplicate content issues all the time, and I've been putting off a manual like this for far too long.

I think you did an excellent job of explaining situations that we face with our clients on a daily basis, and this is a perfect way for our employees and clients to understand exactly how important unique content really is.

One issue that I am quite close to is that of additional attributes via faceted navigation. The ability to add Brand + refinement + refinement is very useful, not only from an on-site perspective but also for capturing the long tail. Unfortunately, there seems to be no good way to go about this, since many of the pages have the same products even if they are refined with additional attributes.

I hesitate to recommend noindexing those kinds of pages. On a page-by-page basis, they add little to traffic stats, but cumulatively speaking, noindexing them would turn away a significant portion of traffic.

I suppose I am left with choosing important attributes on a category-by-category basis, and applying noindex,follow to those pages with less important attributes (such as price).

Internal search is a lot tougher than URL-based duplication, where canonicalization is kind of a no-brainer. I think you have to look at your data. If those deep search pages have traffic, that's important. If they have links, then you have to know that, too (and you may choose to canonicalize instead of noindexing).

I also think you have to separate facets from sorts and display options. For example, ascending/descending results or Show 10 vs. 50 vs. 100 per page are pretty useless for crawlers (while clearly valuable for visitors). Start with those low-value variations. Then, you may want to tackle paginated results. Then, you can look at deep subcategories and see if they have value. It's ok to do this in stages, and it's probably smart if your site is doing well.
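To make the sort/display case concrete, here's a hedged sketch of the canonical-tag approach: each sort or page-size variation points back to the base category URL (the URLs are hypothetical):

```html
<!-- In the <head> of /category?sort=price&show=50 (hypothetical URL) -->
<!-- Tells Google to consolidate these low-value variations into the
     base category page, rather than indexing each one separately -->
<link rel="canonical" href="http://www.example.com/category" />
```

Unlike noindex, the canonical tag consolidates any links those variations have earned, which is why it's the better choice when the duplicate URLs have picked up link equity.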

You've definitely hit the critical "thin content" points. While there are other Panda factors, site owners should definitely pay attention to and address any / all of these you've covered that they might have on their sites. Really good work on this article.

I don't have clear data on that, but I strongly suspect that they do. They clearly have the technology, from the PPC side, and it's become pretty evident that Google can visually parse a page. We're seeing more and more that they can tell headers from navigation from ads from footers, etc. There's almost no way to do that without rendering the HTML. That's also why some tricks, like moving content around in source code with CSS, seem to be very low-impact these days. I think Google has a pretty good sense of what a page looks like.

This article does a great job of explaining duplicate content in all its forms, which is something that a lot of site owners don't fully grasp. There are varying degrees of "duplicate," and each of them can negatively impact your site's performance in Google. Thanks for not only pointing them out, but also offering suggestions on how to fix it.

I have to say, your posts are always right up my alley. I love a good technical SEO post.

I think it's important to point out that Webmaster Tools offers many different ways to tell the search engines what to index and what to avoid. I'm not exactly sure how Bing does it (do they use robots.txt when you "disallow" a certain type of URL, like "/?tag/title"?), but there are many different ways to direct the search engines.

I just discovered this ability in GWT the other day as well. If you log in, then go to Site Configuration -> Settings -> Parameter Handling, there's a list of uniquely generated parameters (from your site), and you can direct Google on how to handle them. Genius!

I'd love to see someone write a full post about Google and Bing's different ways of disallowing content.
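For reference, the robots.txt "disallow" approach mentioned above looks something like this. Note that the wildcard (*) syntax is an extension honored by Google and Bing rather than part of the original robots.txt standard, and the parameter names here are just placeholders:

```
# robots.txt at the site root
User-agent: *
Disallow: /*?tag=
Disallow: /*&sort=
```

GWT's Parameter Handling is a separate, Google-only setting; robots.txt applies to any crawler that chooses to respect it.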

I am recombining pages that I had split back when long pages were considered bad. I used 301 redirects, updated the sitemaps, and I think Google likes it.
Adding unique content as I check pages is helping, too.

Why? Our sales and page views are back up to more normal levels for this time of year. [Were we affected by Panda or is the economy finally hitting our niche? I think a bit of both.]
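For anyone doing the same consolidation, a 301 in Apache's .htaccess can be as simple as the following (the paths are made up for illustration):

```
# .htaccess: permanently redirect the old split pages to the recombined page
Redirect 301 /article-part-2.html http://www.example.com/article.html
Redirect 301 /article-part-3.html http://www.example.com/article.html
```

The permanent (301) status is what tells Google to pass the old pages' authority to the combined URL.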

I think what Google wants isn't necessarily thin, but lean, mean, and muscular, which can still be bulky. It makes sense from Google's point of view. Why should they spend money, time, and effort spidering mass quantities of junk, and repetitious junk at that?

I have seen a lot of change in the e-commerce solutions I work on lately, and the only reasonable explanation I've been able to find is that the content is being treated as duplicate content, due to products that can be found in several categories. Local SEO experts here in Sweden argue that the Google Farmer/Panda update will not treat internal duplicate information as duplicate content, but I am sure it does. Your article is actually the first one I've found that says the same: internal duplicate content is bad, and the Farmer/Panda update will make it an issue.

Running a few ecommerce sites, I have seen product pages decrease in Google after the Panda update. The pages would fall under the above-mentioned "Near Duplicates (Internal)" format. I've often wondered how to get around this, or whether Google will tweak the algorithm for ecommerce sites. The reality is that most of the pages are the same. In the automotive industry, each product is considered different and has a unique stock number, so every item is listed. I may have 5 items that are identical and look like duplicate content, with the exception of the stock number and VIN.