6 Site Crawlability Mistakes & How To Fix Them

Learn how to fix six common issues that could be impacting your site's crawlability and damaging your ability to drive SEO performance.

By Jacob Stoops

September 8, 2009

So it’s been a couple of weeks since you launched that flashy new website, and it’s still not indexed. What you may have is a nasty case of poor crawlability. POOR CRAWLABILITY IS THE #1 KILLER OF SEARCH RANKINGS. IT DOESN’T MATTER HOW PRETTY A WEBSITE IS IF NOBODY CAN FIND IT!

Crawlability refers to a search engine’s ability to crawl through the entire text content of your website, easily navigating to every one of your web pages without hitting an unexpected dead end. Poor crawlability basically means that something you’re doing in one or more areas is restricting access to your website, eliminating any possibility that those pages will be scanned and indexed by search engines.

Since the only way people can find your site (other than typing the address directly or clicking a link on another website) is via search engines, this is a pretty big issue. If you have a good SEO, they should be able to catch this pretty quickly, but if you have a bad one…yikes!?!

Luckily, you have me. So if you’re having problems getting rankings, check to see if you’re guilty of doing one of the following 6 things.

#1. You’ve Disallowed Indexing via Your Robots.txt File

Your robots.txt file lives in your root directory, and you can use it to restrict access to certain areas or folders on your website (so that they won’t be indexed). However, if you used it to block crawlers during a site build-out, for example, and forgot to fix it afterward, well, then there’s your problem.

Re-write it to ensure that you’re not restricting key pages on your site.
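For example, a development-time robots.txt that never got fixed often looks like the first block below. The corrected version blocks only folders you genuinely want hidden (the folder names here are hypothetical):

```
# Leftover from the build-out: blocks ALL crawlers from the ENTIRE site
User-agent: *
Disallow: /

# Corrected: crawling allowed, with only non-public folders restricted
User-agent: *
Disallow: /staging/
Disallow: /admin/
```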

#2. You’ve Disallowed Indexing via the Noindex, Nofollow Tag

Sometimes, it’s as simple as a Meta tag. Check your website’s <head> section to see if you’re using a potentially restrictive Meta tag.

It will look like this:

<meta name="robots" content="noindex, nofollow" />

Change it to this:

<meta name="robots" content="index, follow" />

(Alternatively, you can simply remove the tag altogether; “index, follow” is the default behavior when no robots Meta tag is present.)

#3. Too Many URL Parameters To Pass

In this day and age, this shouldn’t be too much of a problem. Many websites use dynamic URLs for large inventories, etc. However, it is still something to check. If your URLs make a webcrawler pass through more than three parameters, chances are they need to be shortened.

Personally, I recommend just using keywords in your URLs instead, as this is a better SEO practice.
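As a quick sanity check, you can count a URL’s query-string parameters in a few lines of JavaScript (the URLs below are made-up examples, and three is just the rule-of-thumb limit mentioned above):

```javascript
// Count the query-string parameters in a URL.
function countParameters(url) {
  return [...new URL(url).searchParams.keys()].length;
}

// A parameter-heavy dynamic URL vs. a keyword-friendly one:
console.log(countParameters(
  "http://example.com/products?cat=12&sess=a9f&sort=price&page=3")); // 4
console.log(countParameters(
  "http://example.com/products/blue-widgets"));                      // 0
```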

#4. Poor Internal Link Structure

Maybe you have one page indexed and that’s it. If this is you, better check your site’s internal link structure. Check to see whether your homepage links to other pages on your website, and whether those pages in turn link to additional interior pages.

Remember, if you don’t link to your site’s internal pages then a webcrawler won’t be able to do its job (which is to crawl through your site’s links, scan the pages, and index them).
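For what it’s worth, crawlable internal linking is nothing fancier than plain anchor tags. A minimal homepage navigation (with made-up URLs) might look like this:

```html
<!-- Plain href links a crawler can follow from the homepage... -->
<ul>
  <li><a href="/products/">Products</a></li>
  <li><a href="/services/">Services</a></li>
  <li><a href="/about/">About Us</a></li>
</ul>
<!-- ...and each of those pages should link onward to its own
     interior pages (e.g. /products/widgets/) in the same way. -->
```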

#5. Limiting Yourself With Technology

You would be surprised how many people have done this unknowingly. Ever see a website whose homepage forces a user to fill out a form to get through to the site’s internal pages? Ever see a website whose content shifts via AJAX or JavaScript, but the URL doesn’t change?

Avoid using forms as the only means to pass through to another page. Webcrawlers can’t follow forms, which means they can’t get to the pages that those forms lead to.

Stay away from designing your whole website using JavaScript and AJAX. This isn’t to say you can’t use them at all; just be wary. JavaScript, AJAX, and other such dynamic scripting techniques effectively reduce a webcrawler’s ability to scan your site’s content. Think of them like the Great Wall of China: crawlers can’t get in, and content can’t get out.
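As a minimal illustration, compare a script-only “link” with a real anchor that degrades gracefully (the `loadSection` function here is hypothetical):

```html
<!-- Crawler-hostile: no href, so there is no URL for a crawler to follow -->
<span onclick="loadSection('about')">About Us</span>

<!-- Crawler-friendly: a real link crawlers can follow; the script can still
     intercept the click and load the content dynamically for users -->
<a href="/about" onclick="loadSection('about'); return false;">About Us</a>
```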

#6. Server & Redirection Errors

This is what I like to think of as “Oh, Shit!” mode for your site. Ever seen a pesky 404 error page, or set up a 301 redirect? Time to double-check your site to see if this is the problem. Here are some server response codes (SRCs) and what type of problem each indicates…

100-199: SRCs provide confirmation that a request was received and is being processed.

100 – Continue. The initial portion of the request was received, and the client should proceed with the rest of the request.

101 – A request to switch protocols (via the Upgrade header) was accepted.

200-299: SRCs report that requests were performed successfully.

200 – It simply means all is OK. What the client requested is available.

201 – This means a new resource was successfully created, for example through a CGI script or posted form data.

202 – The client’s request was accepted, although not yet acted upon.

203 – The accepted information in the entity header is not from the original server, but from a third party.

204 – There is no content to return for the request. Say you click an image-map region not attached to a page: this code lets the server just sit there waiting for another click rather than throwing an error.

205 – Reset content. This tells the client to clear the form (or document view) that produced the request, often after a CGI form submission.

206 – Only partial content is being returned for some reason.

300-399: Request was not performed, a redirection is occurring.

300 – The requested address refers to more than one entity. Depending on how the server is configured, you get an error or a choice of which page you want.

301 – Page has been moved permanently, and the new URL is available. You should be sent there by the server.

302 – Page has been moved temporarily, and the new URL is available. You should be sent there by the server.

303 – This is a “see other” SRC. Data is somewhere else and the GET method is used to retrieve it.

304 – This is a “Not Modified” SRC. If the request’s “If-Modified-Since” header carries a date and the page hasn’t changed since then, the server returns this code so the cached copy can be used instead.

305 – This tells the client that the requested document must be accessed using the proxy given in the Location header.

400-499: The request failed because of a problem on the client’s side.

400 – There is a syntax error in the request. It is denied.

401 – The header in your request did not contain the correct authorization codes. You don’t get to see what you requested.

402 – Payment is required. Don’t worry about this one. It’s not in use yet.

403 – You are forbidden to see the document you requested. It can also mean that the server doesn’t have the ability to show you what you want to see.

404 – Document not found. The page you want is not on the server. Most likely you have misspelled the filename or used an incorrect capitalization pattern in the URL.

405 – The method you are using to access the file is not allowed.

406 – The page you are requesting exists but you cannot see it because your own system doesn’t understand the format the page is configured for.

407 – The request must be authorized before it can take place.

408 – The request timed out. For some reason the server took too much time processing your request. Net congestion is the most likely reason.

409 – Conflict. The request clashes with the current state of the resource (for example, two people trying to change the same file at the same time). Try again.

410 – The page used to be there, but now it’s gone for good.

411 – Your request is missing a Content-Length header.

412 – The page you requested has some sort of precondition set up; that is, if something is a certain way, you can have the page. If you get a 412, that condition was not met. Oops.

413 – Too big. What you requested is just too big to process.

414 – The URL you entered is too long. Really. Too long.

415 – The page is an unsupported media type, like a proprietary file made specifically for a certain program.

500-599: Errors have occurred in the server itself.

500 – Internal server error. The server hit an unexpected condition and could not complete the request.

501 – What you requested of the server cannot be done by the server. Stop doing that!

502 – The server you reached is acting as a gateway and received an error from the server it tried to contact on your behalf. This is better known as the “Bad Gateway” error.

503 – The service you are requesting is temporarily unavailable, often because the server is overloaded or down for maintenance.

504 – The gateway has timed out. This is a lot like the 408 error, except the time-out occurred specifically at the gateway of the server.
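
If you want to bucket response codes programmatically while auditing a site, the ranges above map to a trivial helper. A sketch in JavaScript:

```javascript
// Map an HTTP status code to the category ranges described above.
function classifyStatus(code) {
  if (code >= 100 && code <= 199) return "informational";
  if (code >= 200 && code <= 299) return "success";
  if (code >= 300 && code <= 399) return "redirection";
  if (code >= 400 && code <= 499) return "client error";
  if (code >= 500 && code <= 599) return "server error";
  return "unknown";
}

console.log(classifyStatus(301)); // "redirection"
console.log(classifyStatus(404)); // "client error"
console.log(classifyStatus(503)); // "server error"
```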