Unofficial news and tips about Google

April 13, 2008

Google Starts to Index the Invisible Web

The Google Webmaster Central blog recently announced that Google has started to index web pages hidden behind web forms. "In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn't find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page." For now, only a small number of websites will be affected by this change, and Google will only fill in forms that use GET to submit data and don't ask for personal information.
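The process Google describes, picking candidate values for each input and generating the GET URLs a user's query would have produced, can be sketched roughly like this (the form fields, values, and site URL are made up for illustration; Google's actual value-selection logic is far more sophisticated):

```python
from itertools import product
from urllib.parse import urlencode

def candidate_urls(action, fields, limit=10):
    """Enumerate GET URLs for a form, one per combination of input values.

    `fields` maps each input name to its candidate values: the option
    values for select menus, check boxes, and radio buttons, or a few
    words picked from the site's own text for free-text boxes.
    """
    names = list(fields)
    urls = []
    for combo in product(*(fields[n] for n in names)):
        urls.append(action + "?" + urlencode(dict(zip(names, combo))))
        if len(urls) >= limit:  # only "a small number of queries" per form
            break
    return urls

# Hypothetical search form on a used-car site.
form = {"make": ["honda", "toyota"], "color": ["red", "blue"]}
for url in candidate_urls("http://example.com/search", form):
    print(url)
```

Each resulting URL is then crawled like any other page, and kept only if the response is valid and not already in the index.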

Many web pages are difficult to find because they're not indexed by search engines and they're only available if you know where to search and what to use as a query. All these web pages create the Invisible Web, which was estimated to include 550 billion documents in 2001. "Traditional search engines create their indices by spidering or crawling surface Web pages. To be discovered, the page must be static and linked to other pages. Traditional search engines can not see or retrieve content in the deep Web -- those pages do not exist until they are created dynamically as the result of a specific search."

Anand Rajaraman found that the new feature is related to a low-profile Google acquisition from 2005.

Between 1995 and 2005, Web search had become the dominant mechanism for finding information. Search engines, however, had a blind spot: the data behind HTML forms. (...) The key problems in indexing the Invisible Web are:

1. Determining which web forms are worth penetrating.

2. If we decide to crawl behind a form, how do we fill in values in the form to get at the data behind it? In the case of fields with check boxes, radio buttons, and drop-down menus, the solution is fairly straightforward. In the case of free-text inputs, the problem is quite challenging: we need to understand the semantics of the input box to guess possible valid inputs.

Transformic's technology addressed both problems (1) and (2). It was always clear to us that Google would be a great home for Transformic, and in 2005 Google acquired Transformic. (...) The Transformic team have been working hard for the past two years perfecting the technology and integrating it into the Google crawler.

Seems like a win all around. The end user wins by getting what they're looking for, the publishers win by having their content more readily indexed, and Google wins by getting advertising revenues and increasing their search market share.

One (partial?) solution to the invisible web and how to populate forms is to "ask" the web site: define a standard way of exposing typical or possible results of a form. Add an extra element to the FORM tag, like "PRESULTS=/results.xml" which could be a URL to a static or dynamic list of URLs that might result from a search. For example: if you are searching by author in a box, and you search your database for the author, results.xml could contain a dynamically generated list of URLs for all authors in your database.
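The commenter's proposal could work along these lines: the site itself generates the results.xml file, listing one crawlable URL per known database value, so the crawler never has to guess inputs. (PRESULTS, results.xml, and the author field are the commenter's hypothetical names, not any real standard; this is only a sketch of the idea.)

```python
from urllib.parse import quote
from xml.etree.ElementTree import Element, SubElement, tostring

def presults_xml(action, field, values):
    """Build the hypothetical results.xml: one <url> entry per known
    value of a form field, each pointing at the page that form value
    would have returned."""
    root = Element("presults")
    for value in values:
        SubElement(root, "url").text = f"{action}?{field}={quote(value)}"
    return tostring(root, encoding="unicode")

# All authors currently in the (hypothetical) site database.
print(presults_xml("/search", "author", ["Austen", "Borges", "Calvino"]))
```

In practice this is close to what XML Sitemaps already do: the site enumerates its own URLs instead of waiting for the crawler to discover them.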

I think there is some possibility of people abusing this for black hat SEO, but it'd be a great tool for white hat SEO folks.

I'd call this evil. Let's assume the form on your website is used to collect information and store it in your database for future use. If the technique works as described, you will have garbage data in your database. Even if your form is not about collecting the data, but just searching, that search data is now less meaningful because it includes the queries google "generates."

I think that using link popularity as a way to determine what is seen or can be found on the internet is flawed. I am always annoyed at having to sort through the chaff served by Google and others before I find something relevant. Sure, it's great for ad serving, but it wastes huge amounts of time that might be better spent elsewhere. What if we asked page creators to embed some sort of Dewey Decimal or Library of Congress classification in their headers? Might that improve sorting and seeking? We use DNS to find pages on the internet, so why not have some sort of information naming system to help with building a useful index? You could still do your ad serving, but to a less frustrated and potentially more receptive audience.
