The 18th annual International World Wide Web Conference, WWW 2009, was held this past April 20-24 in Madrid, Spain; it has become the premier venue for publishing research and development on the evolution of our favorite medium.

A fascinating (if you're a web geek) paper was presented by Uri Schonfeld of UCLA and Narayanan Shivakumar of Google, called "Sitemaps: Above and Beyond the Crawl of Duty". The main thrust of the paper is that the traditional web crawlers employed by search engines are being overwhelmed by the number of new websites and pages appearing on the web every day; by one count, there are more than 3 trillion (!) pages that need to be indexed, deduplicated, and tracked for inbound/outbound links.

The Sitemaps protocol is becoming more and more important to search engines as they try to prioritize and filter this mound of information. Part of the problem is the rise of large-scale content management systems (CMS), which generate pages dynamically whether or not there is any real content in them. The authors used the example of Amazon.com, where any given product has dozens of subsidiary pages: reader reviews, excerpts, images, specifications, and so on. Even when there is nothing to show, the link to a dynamically generated page still returns a page, just one with no data in it, creating literally tens of millions of unique URLs at amazon.com that "dumb" crawlers must follow and index.

The Sitemaps protocol defines an XML file format that not only lists all the URLs a search engine should index, but also indicates how important each page is, how often it changes, and when it was last updated. Search engines can use this file to rapidly index the important content and skip what isn't there, improving accuracy and reducing the time taken to index a website. Every site should have a sitemap, yet as of October 2008 it was estimated that only 35 million sitemaps had been published, out of billions of URLs.
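To make this concrete, here is a minimal sitemap file showing the fields described above (the URLs and dates are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2009-04-20</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/about.html</loc>
    <lastmod>2009-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
```

Only the `<loc>` element is required; `<lastmod>`, `<changefreq>`, and `<priority>` are the optional hints that tell the crawler when and how urgently to revisit each page.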

Amazon makes a concerted effort to publish accurate sitemap data, as it dramatically reduces the time required to index new content. Even so, Amazon's robots.txt file lists more than 10,000 sitemap files, each holding between 20,000 and 50,000 URLs, for a total of more than 20 million URLs on amazon.com alone! The authors note that there is still a lot of content duplication and many null-content pages there, but the number is staggeringly large. After monitoring URLs on another website, they also noted that sitemap-based crawlers picked up new content significantly faster than the simple link-discovery crawling used when there is no sitemap file.
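For reference, a site advertises its sitemaps to crawlers with `Sitemap:` lines in robots.txt; a large site like Amazon's simply lists many of them (the paths below are made up for illustration):

```
User-agent: *
Disallow: /private/

# Each Sitemap line points to one sitemap file (or a sitemap index)
Sitemap: http://www.example.com/sitemap-products-1.xml
Sitemap: http://www.example.com/sitemap-products-2.xml
Sitemap: http://www.example.com/sitemap-reviews-1.xml
```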

We said before that every website should have a properly constructed sitemap, as it improves the quality and accuracy of search engines as a whole. Beyond creating the sitemap, registering it with the major search engines will give the webmaster valuable feedback on crawl and index rates, and provide insight into what the search engine "sees" when it looks at your website. Please create a sitemap for your website today, or just ask us if you need help!

Twitter...how is it being used to profile you for search marketing? I signed up for a new Twitter account and within an hour had a dozen mysterious "followers", none of whom I'm acquainted with, all beguiling young women if you believe their avatars...hmmm.

Google's Webmaster Tools provides webmasters with a way to upload XML sitemaps to improve the accuracy of Google's index. Registering and maintaining an accurate sitemap (Google, Yahoo, and Microsoft all accept sitemap data) is important to proper indexing of your website pages, and Google provides two methods for notifying them when the sitemap is updated: manually through the Google website, and "semi-automatically" by sending an HTTP request that signals Google to reload the sitemap.

Ping me when you're ready

The second method can be automated through server-side scripting, so that when content on a website or blog is updated, the sitemap file is updated as well and the update request is sent to Google at the same time. In theory, this should mean Google's index is rapidly updated to include the latest content on your website.
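As a sketch, the "semi-automatic" ping can be sent with a few lines of Python. The endpoint shown is Google's documented sitemap ping URL; the sitemap address is a placeholder:

```python
import urllib.request
import urllib.parse

GOOGLE_PING = "http://www.google.com/ping?sitemap="

def build_ping_url(sitemap_url):
    """Build the ping URL: the sitemap's address must be percent-encoded."""
    return GOOGLE_PING + urllib.parse.quote(sitemap_url, safe="")

def ping_google(sitemap_url):
    """Send the ping; a 200 response means Google accepted the request."""
    with urllib.request.urlopen(build_ping_url(sitemap_url)) as response:
        return response.status

# Example (would issue a real HTTP request):
# ping_google("http://www.example.com/sitemap.xml")
```

The same pattern works for the other engines that accept ping requests; only the endpoint URL changes.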

Depending on a number of factors, Google will automatically reload your sitemap file without your specifically requesting it to do so. One factor is the content of the sitemap itself. Besides a list of URLs on your website, the sitemap file can also record the date each URL was last updated and how frequently it changes. For example, if your homepage content changes every day, you can assign a frequency of "daily" to that URL, telling the search engine to check that page every day.

It should be noted that incorrect use (or "abuse") of a sitemap, such as indicating pages are new when the content hasn't changed, can cause problems if the search engine recrawls the page too many times without seeing any new data. Empirical data have shown that pages may be dropped from the search engine index under this scenario, and new pages added to this "unreliable" sitemap may be ignored or crawled more slowly.

It's a popularity contest

Another factor in sitemap reloading is link popularity. If a lot of websites are linking to particular pages on your website, search engine spiders will crawl those pages more often, and if the site is large, the sitemap will help prioritize which pages are crawled first.

To Submit, or Not to Submit...

We have seen that once a sitemap is submitted and indexed by search engines, they will regularly come back and reload it looking for new URLs, whether you re-submit it or not. As your website's PageRank (on Google) and general link popularity grow, the sitemap will be reloaded more and more frequently without your taking any action...so do you need to submit it manually or automatically?

The answer is "it depends". Google itself warns webmasters not to resubmit sitemaps more than once per hour, probably because that's as fast as it will process the changes and redirect Googlebot to the URLs in the sitemap. If you auto-submit sitemaps more than once an hour, the "punishment" could range from the search engine ignoring the subsequent re-submits to something more dire...but no one really knows the consequences. It is probably safest to resubmit sitemaps on a regular schedule, though we do not have any hard data on this at this time.
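A simple guard against over-pinging is to record the time of the last submission and skip any request that arrives less than an hour later. A minimal sketch (the one-hour limit comes from Google's guideline above; the actual ping call is passed in so it can be anything):

```python
import time

MIN_INTERVAL = 3600  # seconds: Google asks for at most one resubmit per hour

class SitemapPinger:
    def __init__(self, now=time.time):
        self._now = now          # injectable clock, handy for testing
        self._last_ping = None   # timestamp of the last successful ping

    def maybe_ping(self, send):
        """Call send() only if an hour has passed since the last ping.

        Returns True if the ping was sent, False if it was throttled."""
        now = self._now()
        if self._last_ping is not None and now - self._last_ping < MIN_INTERVAL:
            return False
        send()
        self._last_ping = now
        return True
```

Hooking `maybe_ping` into your content-update script gives you automatic resubmission without any risk of tripping the once-per-hour limit.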

When You Should Re-submit a Sitemap

So when should you re-submit a sitemap? The obvious answer is whenever your content changes, but not more than once an hour. Google does not yet provide an API to query when it last loaded your sitemap, although you can see this data in its Webmaster Tools. If you have some very timely news that the SE really needs to know about, then resubmit the sitemap...it may not increase the crawl rate, but it may impact which URLs are crawled first.

The bottom line is that sitemaps are becoming increasingly important in helping search engines prioritize the content they crawl, so use them, don't abuse them, and help make the internet a better place!

Fascinating post on musicmachinery.com a couple of days ago about Time Magazine's annual online "100 Most Influential People" poll getting hacked by Anonymous. Time Magazine allowed users to vote on its website, via a simple form, for the person they considered most influential in 2008. Anonymous seized the opportunity to skew the results by spelling out a message with the first letters of the top 21 entries.

Anonymous used an army of bots to drown out Time's legitimate votes. In an effort to stem the attack, Time first took the form offline, where it continued to be exploited, and then finally put reCAPTCHA, a popular anti-spam visual text-matching system, on the form (SearchPartner.Pro uses reCAPTCHA on our contact form). reCAPTCHA is quite effective at defeating known exploits that use OCR (optical character recognition) to read the image and translate it to text, so Anonymous resorted to a "brute force" attack, using its members (humans) to place as many votes as possible.

Anonymous also revealed many sophisticated techniques for defeating reCAPTCHA's pattern logic so that humans could submit entries faster. In the end, Time was unable to stop the hack, and you can see the results in the image above. Time did not deny that it had been hacked, but downplayed the importance of the results.

The news provoked a strong debate on the reCAPTCHA newsgroup. Was reCAPTCHA hacked? Typically, hacking a CAPTCHA means using a computer to defeat the protection, so that a human does not have to interact with the form at all. No one really knows whether an OCR system capable of this exists right now, although hackers are constantly evolving the technology. Using brute force to defeat the system with human interaction is also quite common, and there are many teams of hackers in China, India, Russia, and elsewhere that advertise these services, but this is less a hack than the overwhelming of a single point of protection.

The lesson learned here is that relying on a single technology for protection will inevitably fail, while adding additional hurdles can slow brute-force attacks by orders of magnitude: for example, restricting the number of submissions per IP address, embedding hidden text fields on forms (which only a bot would see and try to fill in), or adding a second verification factor (e.g. a CAPTCHA plus a random problem to solve).

Following up on yesterday's post about Googlebot and crawlers: I see that Googlebot is coming back to read the sitemap on a regular basis without my needing to resubmit it...probably based on the update-frequency values specified in the sitemap...a good thing, I hope, although Google is not adding more pages to the index yet.

Registering the sitemap with MSN/Live and Yahoo resulted in immediate crawls by the MSNbot and Slurp crawlers, which of course is a good thing...we'll see what the indexing rate is in a future post.

To follow up on my previous post, after the initial long delay before Google downloaded the sitemap.xml for searchpartner.pro, the sitemap was resubmitted 24 hours later and Google downloaded it within minutes.

A quick review of the server logs showed Googlebot hitting the website shortly thereafter, which corresponds to behavior reported by Adam at BlogIngenuity. Adam also reported pages quickly appearing in Google's index, but no additional pages appear to have made it in yet for searchpartner.pro. This could be attributable to the time of day, as well as to how recently the website was added to Google's index. Presumably the page text is "in the hopper" and being processed (wouldn't a progress bar be a cool webmaster's tool?).

Appropriately tagging the sitemap file with date/frequency/importance data for each URL should help build the site's reputation in Google's index and hopefully prioritize content indexing. We know that the better a website's reputation, the faster Google will add its pages to the index.

Launching the Searchpartner.pro website was an interesting experiment in measuring Google's crawl rate. The domain had been parked at a registrar for some time, nearly a year, so Googlebot and other crawlers would have known about it, but would not have found any content. This may have been a negative factor in the subsequent crawl rate.

Before launching the website, all the appropriate actions were taken to ensure a rapid crawl and index rate:

Creation of all relevant pages, with informational pages of high quality and narrow focus

Implementation of appropriate META data

Validation of all links and HTML markup

Implementation of crawler support files such as robots.txt and an XML sitemap

Finally a sitemap was registered with Google and the site brought online...and then the waiting began.

It took more than two days (approx. 57 hours) after registering the sitemap for Google to actually parse it. Google found no errors.

It took three more days after parsing the sitemap for Googlebot to actually crawl the site.

More than 24 hours after crawling the site, Google had added only three pages to its index.

It seems that the days of "launch today, indexed tomorrow" are in the past. Even when a website is published according to Google's best practices, Google appears somewhat overwhelmed at this point, and crawl rates for new sites are being delayed.

Two unknowns:

Does leaving a domain parked for a long time negatively impact the initial crawl rate?