Spider Traps: An SEO issue that even Google suffers from

A spider trap is an incredibly common problem that can be found on websites of all sizes; it can cause rankings to plummet and could even be considered a security issue.

Spider traps affect your search engine rankings because they reduce the ability of search engine spiders to crawl your site efficiently and waste crawl equity.

They can also massively increase the likelihood that your website will be seen to contain large volumes of duplicate content, cause keyword cannibalisation, and may even allow anyone to create rogue pages on your website that can then be easily indexed by Google, leaving you wide open to negative SEO attacks.

Understanding and identifying spider traps

To help you understand and easily identify spider traps I am going to share examples of common spider traps that I’ve come across in the past and then advise on how to address the problem.

I’ve used some real-world examples, but it’s worth noting that these issues are sometimes very difficult to identify and, once identified, can be difficult to fix, especially on large sites. This is by no means a criticism of anyone else’s SEO strategy.

Keyword search spider trap

Just about every website has a search function and often developers forget that these pages should not be crawled and indexed by search engines. This leads to one of the most common, and potentially the worst, spider trap issues as it sometimes allows others to easily add indexable content to your website without even being logged in.

An example of how a spider trap may look, found on Google Trends:

Notice how the ‘search term’ information is placed on the page and how a unique URL is generated. This means, if you wanted, you could manipulate the data in the URL and then get Google to crawl and index the page, as can be seen in the example below.

This URL was eventually crawled and indexed by Google…

This probably doesn’t cause too many problems for Google as they don’t need to worry about search engine rankings but you can imagine the problems this may cause for a smaller business who cares about their rankings on the SERPs. It is also worth considering how this could be exploited by someone with malicious intent.

As well as pages being indexed intentionally as a result of this spider trap, pages can also be indexed automatically by the search engines without any human intervention.

How to identify keyword search spider traps

If you are auditing an established website that has a keyword search function, there is a chance that this problem may already exist. A quick way to identify a keyword search spider trap is by using Google search operators.

During a site audit, identify whether the website’s search function generates unique URLs when a search is carried out. Try to identify a common character or phrase that is included in that URL.

For example, the word ‘search’ may appear in a URL after a search is carried out. So in this case, the search operator you type into Google would look something like ‘site:websiteaddress.com inurl:search’.

If you don’t see any results but suspect that there may still be an issue then you can use Google Webmaster Tools to try and index a search result page.
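This check can be partly automated. Below is a minimal sketch (assuming you have already fetched the page HTML yourself, e.g. with urllib) that inspects a search-results page for a robots noindex directive; a search page that returns 200 with no noindex is a candidate for this trap.

```python
from html.parser import HTMLParser

class RobotsMetaCheck(HTMLParser):
    """Collects the content of any <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr = {k: (v or "") for k, v in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            self.directives.append(attr.get("content", "").lower())

def is_indexable(html):
    """True if the page HTML carries no noindex robots directive."""
    checker = RobotsMetaCheck()
    checker.feed(html)
    return not any("noindex" in d for d in checker.directives)
```

Run this against one of your own search-result URLs; if it reports the page as indexable, the trap is at least possible.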

I’m not going to go into detail on how to get these pages indexed on sites that you don’t have access to as it can be easily exploited but most experienced SEOs can probably easily work this out…

How to fix a keyword search spider trap

Fortunately, on most websites this one is easy to fix. Here is how I’ve addressed it in the past.

Add a noindex, nofollow robots meta tag to search result pages and get the site re-crawled; this should hopefully remove some of the results from the search engine result pages. You also have the option of manually removing pages via Google Webmaster Tools.

Once the site has been recrawled and the offending pages have dropped out of the index, I like to block the pages via Robots.txt to prevent further crawling.
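As a sketch of the two steps above (the `/search` URL pattern is an assumption; substitute whatever pattern your own search function generates):

```html
<!-- Step 1: on every search-results page, until they drop out of the index -->
<meta name="robots" content="noindex, nofollow">
```

```
# Step 2: robots.txt, added only after the pages have been deindexed
User-agent: *
Disallow: /search
```

The ordering matters: if you block via Robots.txt first, the spiders can never see the noindex tag, and the already-indexed pages may linger in the results.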

Dynamically inserted content spider trap

A dynamically inserted content spider trap is when you visit a URL that should 404 but instead returns a 200 OK status code; it is similar to the keyword search spider trap discussed earlier in the article.

There are many scenarios where this could happen, and it is often a result of an oversight during the development process.

How to identify a dynamically inserted content spider trap

If you are working on a website that dynamically populates content based on the URL path then there is a chance that this type of spider trap exists.

An example of this you may have come across is on websites that have extremely similar (boilerplate) content across a section of the website.

Notice how by changing the URL in the example, the content changes? Often what you’ll find is that you can change the URL to whatever you want and the information in the URL will be dynamically inserted into the content area.

As with the issue mentioned previously in the article related to search, this is obviously open to abuse and could even generate a large volume of useless, near duplicate pages without any intervention.

How to fix a dynamically inserted content spider trap

Unfortunately, there is usually no easy way to fix this without the help of a skilled developer. Ideally, you want to be able to specify the URLs that should return a 200 status code and make sure that any pages that shouldn’t exist return a 404.
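A minimal sketch of that idea (the slugs are hypothetical; in practice this lives in your application’s routing layer and the lookup would hit a database): check the URL’s dynamic segment against real records, and 404 anything unknown rather than echoing the URL text back into a template.

```python
# Hypothetical set of slugs backed by real records (e.g. a database lookup).
KNOWN_SLUGS = {"red-widgets", "blue-widgets", "green-widgets"}

def status_for(slug, known_slugs=KNOWN_SLUGS):
    """Return the HTTP status a dynamic URL segment should produce."""
    # Only URLs that map to real content get a 200; everything else
    # must 404 instead of generating a page from the URL text itself.
    return 200 if slug in known_slugs else 404
```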

Product category filter spider traps

This next spider trap is extremely common on ecommerce websites and I’ve seen it cause severe problems for websites in the past especially from a search engine ranking perspective.

The reason this is such a problem is because it can result in search engine spiders crawling and indexing a huge volume of duplicate and near-duplicate pages, which wastes crawl equity and dilutes the amount of authority directed to important pages.

How to identify product category filter spider traps

If you are working on a website that has any type of filtering option that changes the page URL then there is potential for this problem to exist.

Notice how the ‘size’ filtering option is included in the URL which is crawlable and has been indexed by Google. In this example, the ‘international’ filter has also been included in the URL.

Notice how this spider trap has caused filtered pages to be indexed, which could dilute the site’s ranking potential.

To identify this issue I simply looked at how URLs are formed during the filtering process and typed ‘site:johnlewis.com inurl:size=10 inurl:size=11’ into Google.

As you can imagine, there is potential for a huge number of URL variations to be indexed by Google, which could have a detrimental effect on the website’s rankings. However, in this instance the negative effect is likely being offset by the fact that John Lewis is a massive authority website, so it is unlikely to suffer as badly as a smaller website would.

How to fix product category filter spider traps

Unfortunately, this problem can be fairly tricky to fix, and ironically, fixing the problem could actually cause you to drop some rankings so careful research has to be done to measure the volume of traffic landing on filtered pages.

For example, if there was a large search volume for some of the keywords contained on filtered pages, removing these pages or canonicalising them may cause the site to lose rankings, so you really need to keep that in mind and come up with a strategy to maintain those rankings.

Here are the options for fixing this problem:

The option recommended by Google is to canonicalise these pages using the rel canonical tag, but I’ve found that sometimes this doesn’t work and is actually ignored by Google, especially on ecommerce sites. This is possibly due to the mixed signals being presented to Google, such as large volumes of internal links.

Using a noindex, follow tag on filtered pages has worked for me in the past, but it is worth remembering that if you have a large site then the pages may still be crawled, which could use crawl equity.

Blocking these pages using Robots.txt is another option, but the URLs may still appear in the index. I’ve found blocking in Robots.txt has been a quick fix that produced positive results in the past, but it is worth remembering that Google appears to be changing its opinion on blocking pages.
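The three options above might look like this in practice (the URLs and the `size=` parameter are illustrative; adapt them to however your own filters build URLs):

```html
<!-- Option 1: on /dresses?size=10, canonicalise to the unfiltered category -->
<link rel="canonical" href="https://www.example.com/dresses">

<!-- Option 2: keep the filtered page out of the index but let link equity flow -->
<meta name="robots" content="noindex, follow">
```

```
# Option 3: robots.txt wildcard blocking any URL containing the filter parameter
User-agent: *
Disallow: /*size=
```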

This can be a complex issue to fix, especially on a well-established website, so it is best to try and tackle it during the development stage. However, the rewards for identifying and fixing this type of issue can be huge in terms of how it can improve your search performance.

Some quick tips

There are quite a large number of ways for spider traps to occur and I would love for you to share any that you have come across in the comment section below.

I’d also like to summarise some of the ways I identify potential spider traps on a website.

Screaming Frog: Crawling a website using Screaming Frog can be a quick way to identify spider traps. If you have set the crawler so that it mimics search engine spiders and find that the crawl goes on forever, it could be a good indication that a spider trap exists on your site.

Google Search Operators: Using the site: command can be a quick way to identify spider traps, as mentioned previously in this article.

If you use site:websiteurl.com and use search tools to view results from the past day or week, it will show you pages that have recently been indexed, which can sometimes be a result of spider traps.

The site:websiteurl.com inurl:insertfilternamehere command can be used to quickly see if filtered pages are being indexed by Google.

Using site:websiteurl.com “insert snippet of content from your website here” can be used to determine whether or not multiple versions of the same content have been indexed, which could be a result of a spider trap.

To determine whether or not search results on your website are being indexed, you can use site:websiteurl.com inurl:insertanystringnormallycontainedinsearchurl.

IIS Toolkit is based on Bing’s crawler technology. As such it often provides a better overview of how a search engine views a site than might be found using typical SEO crawlers such as Screaming Frog.

Log file analysis: Spider traps can be identified through analysis of web server log files, using tools such as Splunk, AWStats or even Excel for small log files. By identifying where spiders such as Googlebot are spending most of their time, it becomes clear where your crawl budget is being wasted (thanks to Ian Daniels & Scott McLay at Yard for sharing the last two tips).
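For small log files, that analysis can be sketched in a few lines of Python. This assumes the common “combined” log format; field order varies between servers, so treat the regular expression as a starting point rather than a universal parser.

```python
import re
from collections import Counter

# Matches a request line plus the trailing referer and user-agent fields
# of a combined-format log entry (an assumption about your log layout).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" \d{3} \d+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(lines):
    """Count requests per URL path made by a Googlebot user agent."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if m and "Googlebot" in m.group("agent"):
            counts[m.group("path")] += 1
    return counts
```

Sorting the resulting counter by value quickly shows which sections of the site (search URLs, filter URLs, and so on) are eating the crawl budget.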


