Solving Duplicate Content Issues Arising From Faceted Navigation

By Barry Adams on June 1, 2011

I’m a big fan of faceted navigation on ecommerce websites, also known as layered navigation. With faceted nav users can find exactly what they’re looking for with just a few clicks, even on websites that contain tens of thousands of products. A good implementation of faceted nav is a user experience dream come true.

Faceted nav also has SEO benefits, in that these facets serve as keyword-rich links and ‘tags’ of sorts that add semantic relevance to the products contained within each facet.

But it’s not all good news: faceted nav can also cause indexation problems, specifically duplicate content issues. Because many different facets contain nearly identical sets of products, search engine spiders can end up in crawling loops where they crawl slightly different product lists over and over again.

One example is an ecommerce site under development I came across recently. Built in Magento, this site uses faceted nav and contains about 1,200 products. But when I unleashed Xenu on it, it kept finding new pages until I finally aborted the crawl at over 40,000 URLs.

There are several different ways to solve these faceted nav indexation issues:

Block facets with robots.txt

Using robots.txt to block search engines from crawling faceted navigation pages is probably the most brute force approach to the problem. It will undoubtedly solve the duplicate content issue, as it will block search engines from crawling vast amounts of pages, but it has several side-effects that make this a less than ideal solution.

For one, it will mean that the flow of PageRank within your site will be severely distorted. A natural flow of PageRank within your site should come from a solid site structure. Blocking faceted nav pages with robots.txt effectively distorts your site structure as it is perceived by search engine crawlers, as large parts of your site are basically blacked out for search engines. Also, you lose any semantic SEO value the faceted navigation has.

One small upside is that if your site has a low PageRank and a large amount of products, you’re less likely to run out of crawl budget before all your product pages are indexed.

Verdict: don’t do this unless you really don’t have any other choice.
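
For reference, here is a minimal sketch of what such a block looks like and how you could sanity-check it before deploying. The /filter/ path and the shop.example URLs are hypothetical; your platform will use its own URL scheme. Python's standard-library parser matches plain path prefixes, so the sketch sticks to a simple prefix rule rather than wildcard patterns.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: every faceted listing is assumed to live under /filter/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /filter/
"""


def is_crawlable(url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a generic crawler may fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", url)


# A normal category page stays crawlable, a faceted listing does not.
print(is_crawlable("https://shop.example/dining-tables/"))
print(is_crawlable("https://shop.example/filter/colour-red/"))
```

Running a handful of real URLs from your own site through a check like this is a cheap way to confirm you are not blocking product or category pages by accident.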

Nofollow your faceted nav links

You can tag your faceted navigation links with rel=nofollow, thus preventing search engines from indexing your faceted nav pages. A slightly less blunt instrument than the robots.txt blocking approach, this solution nonetheless suffers from similar problems: it distorts the flow of PageRank within your site, as nofollowed links cause PR to evaporate.

Verdict: don’t do this.

Use rel=canonical on all faceted nav pages

By using the canonical tag on all faceted nav pages and making sure they refer to the most relevant/important facet (or a ‘view all products’ single page), you can ensure the duplicate content faceted nav pages are not included in the search engines’ indices. The flow of PageRank is unaffected, and you also preserve the semantic value of your facets.

However, search engines will still crawl all the duplicate content pages, which means your crawl budget could be used up before all product pages are indexed.

Verdict: best used in conjunction with one of the other preferred solutions.
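
As a rough illustration, the sketch below builds the canonical tag for a faceted page. It assumes facets are carried as query parameters, so that dropping the query string gives the core category page; that assumption, the function name and the example URL are mine, and the right canonical target on your site may well be a specific core facet instead.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link(faceted_url: str) -> str:
    """Build a canonical link element that points at the unfiltered listing.

    Assumption: facets are carried as query parameters, so dropping the
    query string (and any fragment) yields the core category page.
    """
    parts = urlsplit(faceted_url)
    core_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return f'<link rel="canonical" href="{escape(core_url, quote=True)}">'


print(canonical_link("https://shop.example/tables/?colour=oak&price=100-200"))
```

The resulting element belongs in the head of every faceted variant of that listing.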

Use JavaScript/Ajax to hide faceted nav links

With smart use of JavaScript or AJAX you can ensure that search engines don’t actually see the faceted navigation links at all, thus preventing the issue from occurring in the first place. What you do is load all the products in a single page, which you then paginate and divide into facets with JS or AJAX. Search engines see the whole page with all products and will crawl & index all of them, while users are presented with the user-friendly faceted navigation.

This is a very solid solution, but it has one caveat: the semantic value of the faceted nav is lost.

Verdict: good solution if your main focus for faceted nav is user experience.

Meta noindex/follow tag on faceted nav pages

In order to prevent search engine crawlers from indexing all your duplicate content pages, you can tell them to keep these pages out of their indices but to still follow the links contained within. With the meta robots tag using the noindex,follow value, you do just that. The pages that have this meta tag will not appear in search engines, but crawlers will still find the products that are contained within these faceted nav pages. The flow of PageRank is preserved, and the semantic value of the facets is also intact.

However, as with some other solutions, low PR sites may run out of crawl budget.

Verdict: a very good solution, especially when combined with canonical tags and static URLs.
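
One way to wire this up is to decide per page which robots directive to emit, based on which facets are active. The sketch below assumes a hand-picked whitelist of facets worth indexing; the facet names and the function are hypothetical.

```python
from typing import Iterable

# Hypothetical whitelist of facets considered worth indexing.
INDEXABLE_FACETS = {"brand", "type"}


def robots_meta(active_facets: Iterable[str]) -> str:
    """Return the meta robots tag for a listing filtered by `active_facets`.

    A page is indexable only if every active facet is on the whitelist;
    an unfiltered page (no active facets) is always indexable.
    """
    indexable = set(active_facets) <= INDEXABLE_FACETS
    directive = "index,follow" if indexable else "noindex,follow"
    return f'<meta name="robots" content="{directive}">'


print(robots_meta(["brand"]))           # whitelisted facet only
print(robots_meta(["brand", "price"]))  # includes a facet with no SEO value
```

Either way the links on the page stay followable, so crawlers can still reach the products listed on it.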

Static URLs for faceted nav pages

Often a CMS that supports faceted navigation uses parameters in its URLs. Every time a facet is used to filter the listed products, another parameter is appended to the URL. As each URL is different, it will be treated as a separate webpage by search engine spiders, even if it contains the exact same products.

To prevent duplicate content issues arising from these parameter-driven URLs, you can configure the CMS to use static URLs for predefined facets, regardless of the order in which that facet was reached. This will drastically reduce the number of URLs on your site, and thus prevent duplicate content issues. So if a user refines a product listing first by price and then by colour, the URL of the page they end up on will be identical to the page reached by a user that refines first by colour and then by price.

Verdict: if you have faceted nav and you don’t do this, you’re an idiot.
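
The core of the idea is that the URL depends only on which facets are selected, never on the order in which they were clicked. A minimal sketch, assuming a fixed facet order and a simple name-value path scheme (both hypothetical, as are the facet names):

```python
from urllib.parse import quote

# Hypothetical fixed facet order; any stable order works.
FACET_ORDER = ["type", "brand", "colour", "price"]


def static_facet_url(category: str, selected: dict) -> str:
    """Build one static URL for a set of facets, whatever order they were clicked in."""
    segments = [quote(category, safe="")]
    # Known facets follow the fixed order; unknown ones are appended alphabetically.
    known = [name for name in FACET_ORDER if name in selected]
    extra = sorted(name for name in selected if name not in FACET_ORDER)
    for name in known + extra:
        value = quote(str(selected[name]), safe="")
        segments.append(f"{quote(name, safe='')}-{value}")
    return "/" + "/".join(segments) + "/"


price_then_colour = static_facet_url("tables", {"price": "100-200", "colour": "oak"})
colour_then_price = static_facet_url("tables", {"colour": "oak", "price": "100-200"})
print(price_then_colour)
print(price_then_colour == colour_then_price)
```

Both click paths end up on the same URL, so the crawler sees one page per facet combination instead of one per click order.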

Conclusion

Faceted navigation is a very potent instrument, but you need to implement it the right way. In my opinion, the best approach to prevent duplicate content and indexation issues is using static URLs for your facets, combined with meta noindex,follow for facets that have no SEO value. Throw in rel=canonical tags that point to your core facets, and the result is the best of both worlds: a solid user experience and the full SEO value.

There are probably some other solutions out there to faceted navigation issues. If you know of other/better approaches, leave them in the comments.

24 Comments

Doc
June 1, 2011

Best analysis I’ve seen to date, Barry! I’ve been wondering about the best way to deal with this, as I’m working on a sizable ecommerce site now, and know it will become an issue if not handled correctly. Thanks for the insight!

Just had one quick question. Do you think there’s a chance on the horizon that the engines will be able to see through a Javascript/Ajax solution?

I ask because I recently advised a client to go with a static url/noindex/follow solution rather than JS/Ajax. My reasoning was a concern that somewhere down the line the latter may no longer work properly.

This concern was not born from any factual info though, just speculation. I guess it would depend on how clever the solution is. Would be great to get a bit more insight for future reference though.

In theory search engines can already crawl JS/AJAX code that’s embedded within a page. One way around this is to load the JS code from an external .js file, and put this file in a directory that you block with robots.txt. That should, theoretically, prevent search engines from seeing the JS – and thus the faceted nav – at all.

In practice, you never really know how search engines go about things. In my opinion as long as you have the best interests of the user in mind, and aren’t trying to deliberately deceive, you’ll be OK.

Hey Barry – great round up of all the solutions. The verdicts below each method are exactly what everyone needs to hear when dealing with eCom faceted navigation.

I have an eCom client that, when we started working with them, had a number of “Googlebot found an extremely high number of URLs on your site” messages sitting in Google Webmaster Tools. For anyone who hasn’t seen these messages, they look like this: http://www.matthewsdiehl.com/wp-content/uploads/2011/03/googlebot-extremely-high-number-urls.png

By “extremely high” we were looking at several million URLs for a product base that was nowhere near that number.

We are working through the pairing of the rel=canonical and having the developers re-code how the faceted navigation URLs are generated (not a small task). The combination will hopefully clean up what is a can of worms for the crawlers as the site sits today.

Excellent writeup Barry. I’ve recently shifted from heavy informational sites to e-commerce, and faceted navigation is one of the biggest challenges I’ve been facing. Magento has an extension called Mage Works, and the newest release has some nice controls for rel=canonical. There will always be some manual coding that goes in for exceptions, but out of the box it seems to help out a lot.

Hi Barry, I really liked this article, however I had one question. In the section “Static URLs for faceted nav pages” you suggest creating many static URLs. However, sometimes even that cannot be achieved since there are so many possible attributes a user can select. Can you please explain some more on that strategy? Thanks for a great article.

@Michael Taouk: you are, of course, entirely correct. The pages will be indexed but the PR flow is messed up. My bad.

@Mark: yes sometimes static URLs for all facets is simply not feasible. In that case I would suggest you do implement static URLs for your most important facets (ideally facets that people would naturally filter by, such as product type and brand) and the rest can then be parameter-driven URLs. I would then use rel=canonical on these parameter-URLs that point to the most relevant core facet with a static URL. That way you’ll preserve PR flow on your core facets. The one downside here is that search engines still need to crawl all those parameter-URLs.

I must admit I didn’t quite understand the ‘Nofollow your faceted nav links’ points you made. I can’t really understand how adding rel=nofollow to the faceted links will prevent search engines from indexing the faceted nav pages. Nofollow will tell search engines not to follow the links but there is no guarantee these pages won’t get indexed. A noindex on the faceted pages would allow for the pages not to get indexed.

Thanks, Barry. Great article. I encounter this duplicate content issue on my site, since each product page can be found by route of 3 different links. My solution is to use the static URL with .html suffix for the pages I want indexed, while those that I don’t want indexed are rendered only by GET parameters.

– this post assumes that blocking certain parts of your website (certain facets/filters) with robots.txt kills your pagerank flow. I’m not sure if that is the case.

– blocking parts of the site with robots.txt does have an awesome advantage: it’s damn cheap. That’s gotta count for something as well 🙂

– rel=canonical should never ever be used. It’s just a lame excuse for bad information architecture and more often goes wrong than right. It should have never been invented in the first place (even though it wasn’t intended for seo purposes, originally)

– the js option is probably best with regards to performance, but also quite difficult to implement and mostly not very scalable, especially when compared to the robots.txt version

For a recent client of mine i had to work with faceted search (it’s not navigation, but search. SEO’s use it for navigation, but hey, it depends on which viewpoint you take if it’s navigation or search ;)). We did it like this:

– every facet had a nice url

– all facets were prioritized, so the same facets of a resultpage were always displayed in the same order -> to prevent unnecessary duplicate urls

– we looked at analytics extensively to see which types of facets were searched-for, so as to decide which types of facets we wanted to have indexed, and which types should be blocked. We chose about 3 facets that were allowed to be indexed

– we created a url-scheme to recognize which kinds of facet-urls should be blocked and which not. We made it simple: if it had a parameter in the url, it would be blocked with a robots.txt wildcard; if it was a normal url, it would be allowed

Oh, and of course we made sure that when peeps entered a search query in the search box on the website, they would not enter a shadowsite or something, but the faceted search, so it is fully integrated into the normal site. The ‘normal’ navigation also refers to the faceted urls.

Re: robots.txt & PR flow, we have to work on some basic assumptions here, and one that I have is that if a page is blocked with robots.txt it cannot distribute PR properly. So far I have yet to hear a compelling argument to the contrary, but if you have evidence that I might be wrong I’m very keen to see it. Never too old to learn and all that. 🙂

Re: rel=canonical, I disagree. It was invented for a purpose, and yes, while it is an artificial fix for a real problem, it is often simply not feasible to re-design a site’s IA entirely to avoid having to use it. Like you said about robots.txt, the rel=canonical solution is cheap and it works.

Re: JS option, yes it’s not always scalable but it’s definitely an option to consider, especially when building a new site from the ground up and you have the opportunity to implement it.

Re: faceted nav vs faceted search, I think it’s a mistake to approach these things from the coder’s angle. Yes it might function as a search in the back-end, but it’s the user experience that counts. And for a user it’s a form of navigation – clicking on links and all that. Putting the user central is vital imho, which is why I insist on calling it faceted navigation.

As always, YMMV, all roads lead to Rome, and all that jazz. Regardless of what the SEO theory declares to be best (my theories included), use whatever works. That’s the only real measure.

monchito wrote: “with the robots.txt thingy it’s the idea that if you block the page, you don’t NEED to distribute pagerank from it”

I do know the difference between noindex,follow metatag and robots.txt -> what I meant is that with this method, these many, many pages accrue much less pagerank, and thus we don’t need to distribute it. It prevents a problem instead of fixing it.

But it certainly doesn’t mean that this is the best way for all sites (or probably even for this site). There are many ways to Rome 🙂

I respectfully disagree with most of the information here. Selective robots.txt exclusion is likely the only instrument that will work in certain cases — esp. in the context of multiple selection.

Noindex/follow will not close off a spider trap, and canonicalizing variously filtered pages to the unfiltered state is not recommended by at least 1 Googler I’ve met. Filtered pages are not the same as unfiltered. Canonicalization is for substantially similar pages — a silent 301.

Again, it also won’t close a spider trap, esp. in the context of multiple selection. Multiple selection must be excluded. It’s something like N! + (N-1)! … 0! in scope.

I have had these same problems myself and decided to tighten up the robots.txt file a lot more a few days ago; it is now blocking over 50,000 different pages according to Webmaster Tools. I can’t see this as a bad thing, as these are just bad pages that I want rid of, and it will hopefully encourage the search engines to cache more of the good pages that are in our sitemap.xml. It would be good to hear from anyone who knows whether, after several months of blocking via the robots.txt file, the likes of Google will start to ignore these pages and stop going back to them. Some of the pages are tricky to block via noindex / nofollow for various reasons, so I found that a more targeted robots.txt could block the bad pages much more effectively. I just want to get the pages tidied up so that we can get on and sell more dining tables!

While the meta noindex solution *might* be better from the PR flow perspective, it doesn’t solve the endless URL permutations and “crawler fatigue” caused by faceted navigation. Because of that I support the robots.txt solution, since it doesn’t allow the crawler to get lost in the faceted structure.