
Google Sitemaps Review

You are an SEO; you want to know what Google’s search crawlers see, what errors the spider encounters, how it views and responds to your robots.txt, and most of all, you want to help it find more of your pages. Google Sitemaps has evolved over the past year and now offers helpful tools that webmasters should be using. SEO Chat spoke to Google about the services, so read on to see more.

Google Sitemaps keeps Google’s search results fresh. It helps them to deliver the most recent pages to searchers, but webmasters certainly benefit from it too. With a little extra work, sites have more visibility on Google, and they can push information at the same time as pulling statistics and errors to help increase indexing.

Sitemaps helps both new and existing sites improve their Google presence. It can help new sites with few links get more extensive crawls when Googlebot visits, and large sites with PageRank should see deeper crawls and speedier page discovery by showing Googlebot the layout of their site. Google representatives reviewed the highlights of the service for us and answered some questions.

To Create and Submit a Sitemap

So you’re intrigued by the idea of a Google Sitemap, and you want to build one. Great! I’d rather spend time talking about the features you get out of the service, but suffice it to say all you need to use the service is a Google account. If you have Gmail, you can use that identity to establish a sitemap. Just visit http://www.google.com/webmasters/sitemaps.

In this beta period, there are also some limitations as to how many sitemaps and URLs you can submit. You can submit 200 sitemaps. If you have more than that, you can add them to a sitemap file, which can contain up to 50,000 URLs. Despite having 15 sites and hundreds of thousands of URLs, all of Developer Shed’s sitemaps are in the same account. The limitations shouldn’t stop anyone if they don’t stop our network.
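As a rough illustration of working within the 50,000-URL limit, here is a minimal Python sketch that splits a long URL list into several sitemap files. The namespace shown is the sitemaps.org 0.9 schema, which is an assumption (check the schema Google currently documents), and example.com and the function names are made up for the example.

```python
from xml.etree import ElementTree as ET

# Assumption: sitemaps.org 0.9 namespace; verify against Google's current docs.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit quoted in the article


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Yield successive lists of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_sitemap(urls):
    """Return one sitemap document (as a string) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical site with 120,000 pages, which needs three sitemap files.
all_urls = [f"http://www.example.com/page/{n}" for n in range(120_000)]
sitemaps = [build_sitemap(chunk) for chunk in chunk_urls(all_urls)]
print(len(sitemaps))  # 3
```

Each string in sitemaps could then be written to its own file and submitted separately.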

Site Overview

In the site overview, you can see that you’re able to download the information as a .csv (comma separated value) file. Google has added this feature to most of the data in the service so you can collect the information and use it more flexibly. For example, you could use the .csv in a script to help track your statistics.
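To give an idea of what such a script might look like, here is a small Python sketch that reads a downloaded export and sorts queries by position. The column names (query, average_position) and the sample rows are hypothetical; the real .csv from Google may be laid out differently.

```python
import csv
import io

# Hypothetical export: the real column names in Google's .csv may differ.
SAMPLE_CSV = """query,average_position
seo forum,3
google sitemaps,5
robots.txt help,12
"""


def load_query_stats(csv_text):
    """Parse the export into (query, position) tuples, best position first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(row["query"], int(row["average_position"])) for row in reader]
    return sorted(rows, key=lambda item: item[1])


for query, position in load_query_stats(SAMPLE_CSV):
    print(f"{position:>3}  {query}")
```

Saving each day’s download and running it through a script like this would let you track how positions move over time.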

After we have logged in and entered SEO Chat’s sitemap, Google displays all the information in a tidy tabbed interface. The first tab, Sitemaps, shows us a few basic details of our sitemap file.

If there were errors in our sitemap, Google would tell us via the Sitemap Status. Also, they obviously show how frequently our sitemap is downloaded as Google watches it for changes. Google seems to watch the frequency of changes and considers it when deciding how often it crawls.

The XML sitemap format is an open standard released under a Creative Commons license, so Google has left it open for other search engines to use the same file. Right now Yahoo! only appears to use text files full of URLs, but maybe the Google sitemap file will be useful for reporting links to other engines like Y! eventually. It can’t hurt to be ready.


Google Sitemap Statistics: Query Stats

If your site is deeply crawled and Googlebot discovers your new pages fast enough, you are probably wondering why you would need sitemaps. What does the service have that is worth the trouble of setting up an account and a sitemap? The Google Sitemaps statistics are reason enough to sign up, and they don’t even require uploading a sitemap.

The sitemap statistics are probably the best thing you get from interacting with Google. The first set of results, the Query stats, can help to show you details about your visitors that you probably would have never found out otherwise. They show what search queries and keywords people are looking for when they find you, and which ones you get the most clicks for. There are three sections of these stats:

Top searches

Top searches from mobile devices

Top searches from the mobile web

Top searches is the most interesting, but the mobile sections may pique your interest if you are running a site targeted specifically at mobile devices. There are two tables in the top searches. The first shows us the top search queries. These are the most common searches in which your site appears in the results. The second table is very similar, but it shows the most common terms people searched for when they clicked through your result listing.

Top Search Queries

Top Search Query Clicks

There are a few things this information can help with. You can see whether people are finding you for the keywords you have chosen, and whether they click through for those keywords. If they aren’t, treat it as a red flag that something is wrong. Run the same search queries yourself to see the SERPs your searchers are seeing, and optimize your content to bring in the most relevant traffic. Finally, the numbers might signal that you should consider picking new keywords or reworking your static content to reflect the kind of visitors you want to bring in.
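One simple way to spot that red flag is to compare the two tables. The Python sketch below lists queries that appear in Top Search Queries but not in Top Search Query Clicks; the query lists are invented for illustration, and the function name is hypothetical.

```python
# Hypothetical data copied by hand from the two Top searches tables.
top_search_queries = [
    "seo forum",
    "google sitemaps",
    "robots.txt help",
    "keyword density",
]
top_search_query_clicks = ["seo forum", "keyword density"]


def queries_without_clicks(shown, clicked):
    """Return queries from `shown` that are missing from `clicked`, in order."""
    clicked_set = {query.lower() for query in clicked}
    return [query for query in shown if query.lower() not in clicked_set]


print(queries_without_clicks(top_search_queries, top_search_query_clicks))
# ['google sitemaps', 'robots.txt help']
```

The queries it returns are the ones where your listing shows up but does not draw clicks, so they are good candidates for a closer look at your titles and snippets.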

Also, some sites have explained that their visitors have been finding them for only a small fraction of their content. If you are running a blog of consumer reviews, for example, you may notice that people are finding your site most often for your reviews of kids’ toys and Legos. While this is only a small part of your site, you might consider expanding on this content to keep visitors’ interest and maybe get a few more to bookmark the site. You might think of more creative things you can do with this information, but obviously it’s really helpful to see into the mind of a Google searcher.

You will notice too that the search rankings are not real-time. They are the average of your rank over three weeks, and Google said that this value is only updated once a day. With all the fluctuations in SERPs, it’s easier to deal with the information this way.

The Crawl stats are a little less interesting, but they give an overview of what is on other pages. Through distribution bars, like you can see below, the site shows the percentage of your pages that Google crawled successfully. It looks like 95% of SEO Chat’s links were successfully crawled. The ones that weren’t were restricted by robots.txt and nofollow tags for the most part.

Crawl Stats

A second graph shows the amount of PageRank your pages have. Again, it uses distribution sliders to display how many pages have PR that is: High, Medium, Low, and PageRank not yet assigned. You can also see which of your pages has the highest PageRank for the months you have used sitemaps. It should probably be your homepage.

Page analysis displays some more interesting keyword information. The two tables here go hand in hand with the query stats. They show what words are most common on your site and what words are most common in the anchor text of links to your site. Of course, both help to determine what terms you show up for in search queries. So, they might give you a clue why your site shows up for any surprising keywords.

Page Analysis Keywords

In our case, there is really nothing surprising at all. To weed out less relevant keywords for SEO Chat, we might be interested in reducing the number of times we use “dev” on our site, but it’s part of our network name. We might also want to reduce how much we use “thread,” but that’s kind of hard to reduce in the forums.

This section also shows the type of content Google is indexing the most (HTML, PDF, plain text, etc.) and the encoding that is used most often on the site (ASCII, ISO8859, etc.).

Index stats has links to a few Google features you were probably using already. They link you to search queries of your site using site:, allinurl:, link:, cache:, info:, and related:. All SEOs knew about these Google tools before, but being linked within this category centralizes everything you can research.


Errors

If Google fails to crawl some of your pages, you would probably want to know about it. It means that your pages might not be linked or set up correctly, and they aren’t getting indexed. Before Google Sitemaps you had to guess what Googlebot was doing by looking at your webserver log files. Now, Google gives you these failures so you can troubleshoot them.
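For comparison, the older log-file approach can be sketched in a few lines of Python. This example assumes the Apache/NCSA combined log format and uses invented log lines; other servers will need a different pattern, and matching on the user-agent string alone does not prove a request really came from Google.

```python
import re

# Assumption: Apache/NCSA combined log format; adjust for other servers.
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_errors(lines):
    """Return (path, status) for Googlebot requests that got a 4xx or 5xx."""
    errors = []
    for line in lines:
        match = LOG_PATTERN.search(line)
        if not match or "Googlebot" not in match["agent"]:
            continue
        status = int(match["status"])
        if status >= 400:
            errors.append((match["path"], status))
    return errors


# Hypothetical log lines for illustration.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample_log = [
    f'66.249.66.1 - - [10/Jun/2006:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Jun/2006:10:00:05 +0000] "GET /old-page HTTP/1.1" 404 312 "-" "{GOOGLEBOT_UA}"',
    '10.0.0.7 - - [10/Jun/2006:10:00:09 +0000] "GET /missing HTTP/1.1" 404 312 "-" "Mozilla/5.0 (Windows)"',
]
print(googlebot_errors(sample_log))  # [('/old-page', 404)]
```

The error reports in Google Sitemaps save you from maintaining a script like this, and they also cover failures that never reach your logs.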

The first section gives the basic HTTP errors of pages that were linked within the site, in the sitemap you submitted, or linked from another site. You can look for the errors of the first two, but obviously can’t do much about the latter. Unreachable URLs shows when Googlebot encountered network issues, such as when it finds that it is overloading your server. URLs not followed are pages that Google could not reach because of your redirects. If you have pages in this category, you really need to work on troubleshooting them.

URLs restricted by robots.txt is pretty self-explanatory, and it can help you to be sure that your robots file is blocking the right pages. These are not necessarily errors, but they explain why Google is not indexing the pages. We have an image of our robots blocking page below to demonstrate what all the error pages basically look like.

Robots Error Sheet

The details show the URL, the detail (in this case “URL restricted by robots.txt”) and the last date that Google tried to follow a link to the page.

In detailing robots, this much information is fine. However, with a section like HTTP errors, it would be more helpful to know the link referrer. Without knowing where the bad link is, or even whether it is on your own site, you are left stabbing in the dark to solve the error unless there is something obviously wrong with the linked page.

Our robots information leads us onto the next section of information that Google Sitemaps gives us.


Robots.txt

The Google representatives we spoke to said that one of the worst problems people have with the robots.txt is that they accidentally block their home page. This can have some pretty harmful effects on getting spidered. Also, pages that appear both in a robots.txt file and your sitemap are excluded from Google’s index; the robots.txt is always obeyed. The robots analysis tool from Google Sitemaps makes it pretty obvious if you have something set terribly wrong. The header tells you right away if your robots.txt is blocking access to your homepage and when Googlebot last downloaded the file.

There’s a lot more than just this to the tool. You can also see Google’s view of your robots.txt and change it to see what effects it will have. This is Google’s editable view of your robots:

Google’s View of Robots.txt

You can edit the textbox above, which will help you with the robots tool below. Basically, Google allows you to test drive their various spiders: Googlebot, Googlebot-Mobile, Googlebot-Image, Googlebot-MediaPartners. First, you edit the robots.txt the spider sees above, then enter the URLs to query (seeing if they are blocked or allowed to be indexed), and choose the crawlers you want to test. The results end up looking something like the ones below.

Robots Test

As you can see in the results at the bottom of the shot, SEO Chat’s homepage is allowed to be indexed, while my author bio page is blocked. In case the screenshot above is too small, here are the details Google gives me about the blocked page:

Blocked by line 2 : Disallow: /cp/bio Detected as a directory; specific files may have different restrictions

Sure enough, if you scroll up, line 2 of the robots file is: Disallow: /cp/bio.
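A similar check can be approximated offline with Python’s standard urllib.robotparser module. This is only a sketch: the article quotes just line 2 of the file, so the catch-all User-agent line is an assumption, example.com stands in for the real host, and Python’s parser is not Google’s, so its verdicts may differ from what the Sitemaps tool reports.

```python
from urllib.robotparser import RobotFileParser

# Assumption: line 1 is a catch-all group; the article only quotes line 2.
ROBOTS_TXT = """User-agent: *
Disallow: /cp/bio
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical host used in place of the real site.
print(is_allowed(ROBOTS_TXT, "Googlebot", "http://www.example.com/"))         # True
print(is_allowed(ROBOTS_TXT, "Googlebot", "http://www.example.com/cp/bio/"))  # False
```

Editing ROBOTS_TXT and rerunning the checks mirrors the edit-and-test loop of the Google tool, although Google’s own report remains the authoritative answer for Googlebot.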

Having options to change the user-agents allows you to test if the right crawlers are hitting the right pages, in case your robots.txt makes different demands of different spiders. This is great assurance that you aren’t blocking access to the wrong ones by accident, especially since Google is basically validating the instructions itself instead of using a third-party script. SEO Chat doesn’t give different instructions to different crawlers, so all of them turn out the same for us.
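To show what different demands for different spiders can look like, here is a Python sketch built on the standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the access_report helper are all hypothetical, and Python’s matching rules are not guaranteed to agree with Google’s.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a separate group for the image crawler.
ROBOTS_TXT = """User-agent: Googlebot-Image
Disallow: /images/

User-agent: *
Disallow: /cp/bio
"""

CRAWLERS = ["Googlebot", "Googlebot-Mobile", "Googlebot-Image"]


def access_report(robots_txt, crawlers, url):
    """Map each crawler name to True (allowed) or False (blocked) for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in crawlers}


for url in ("http://www.example.com/images/logo.png",
            "http://www.example.com/cp/bio/"):
    print(url, access_report(ROBOTS_TXT, CRAWLERS, url))
```

Note that in this sample file the image crawler’s own group replaces the catch-all group, so Googlebot-Image is not blocked from /cp/bio. That is exactly the kind of surprise a per-crawler test is meant to catch.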

While experienced SEOs might be comfortable with the robots.txt already, it’s definitely worthwhile to visit this anyway. Everyone makes typos, and not checking could create real indexing problems. For those a little unsure about whether your robots.txt file is working exactly like you want it to, this tool has everything you should need.


Google’s Performance and Details

This much of Google Sitemaps is fairly introductory, and I’ve mostly just shown what you can get from it. However, after using the service for Developer Shed sites like SEO Chat, our web developers had a few questions. Thankfully, after Jill Lindenbaum and Shaluinn Fullove from Google showed me around the service, they helped to answer those concerns.

I first asked about a new site that started using Google Sitemaps with only one backlink; after two weeks, the index page had been crawled once and the spider did not go deeper. So how should we expect to see Google crawling new sites that use this program and have next to no back-links?

Google representatives explained that new sites shouldn’t expect the robot to crawl on demand. There’s no way to force the spider to your site, but Sitemaps makes the crawler more effective when it does come. It understands the structure of your site better, and the other pages of the site may be waiting in queue to be crawled. Sure enough, a week later, Googlebot returned and tried to browse pages that were on the site originally and had since been removed; whether from the initial crawl or from the original sitemap, Googlebot was trying to visit the old links it had stored in its database.

We also noticed that Google downloaded the sitemap of our new site at least once a day, but it was not crawling pages. Despite the “lastmod” date and the importance you set for your pages, Google only takes these factors into account as a part of their algorithm. Low PageRank and back-links will tell Google not to crawl often, even if you update the site a few times a day and report this to Google. Googlebot’s daily download of our new sitemap may have just been its way of watching the site to learn patterns.
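For reference, the hints discussed here live in each URL entry of the sitemap file. The Python sketch below builds one such entry; it assumes the "importance" mentioned above corresponds to the priority element and also includes changefreq, and the URL and values are invented for the example.

```python
from xml.etree import ElementTree as ET


def url_entry(loc, lastmod, changefreq, priority):
    """Return one <url> element (as a string) carrying the crawl hints."""
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod        # date, e.g. YYYY-MM-DD
    ET.SubElement(url, "changefreq").text = changefreq  # e.g. daily, weekly
    ET.SubElement(url, "priority").text = f"{priority:.1f}"  # 0.0 to 1.0
    return ET.tostring(url, encoding="unicode")


# Hypothetical page that changes daily and matters more than most.
print(url_entry("http://www.example.com/news/", "2006-06-10", "daily", 0.8))
```

As the Google representatives made clear, these values are only hints: they feed into the crawl schedule alongside PageRank and back-links rather than overriding them.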

Also, for very large sites that grow by thousands of URLs a day with respectable PR, even reporting all links to Google will not result in having them all crawled. One of our sites (Dev Archives) grows quickly, yet Google’s spider is far behind on indexing the pages. Again, Sitemaps is not an on-demand crawler, but the Google reps explained there was more to it than this. Googlebot tries not to overwhelm your server by requesting thousands of URLs a day. It tries to respect your bandwidth, and it should learn the patterns of your site over time. It hasn’t learned how to spider Dev Archives yet, but Sitemaps may help it given more time.

The Google Sitemap is considered in the search algorithm, but it does not work on demand. The best way to think of the Sitemaps service is as a complement to the crawl, where you can tell Google how to find your information most easily when it does arrive.

Webmasters being able to communicate with Google could promise great improvements to optimization efforts and search results in the future. Still, there are a few things we definitely would like in the future. Google could show more information about the query stats, such as the pages of your site that visitors most often find in their searches and the clickthrough ratio of your “top search query clicks.” Showing the pages that link to erroneous locations on your website might also speed up troubleshooting.

That’s not to say these features and others are not coming soon. Jill Lindenbaum has alerted me that Google Sitemaps is due for some new features in a few weeks. We don’t have information on what these features are yet, but we will be sure to show them off when we have more details.