Google’s bots learn to read interactive webpages more like humans

Google can now index parts of the Web that weren't indexable before

Google feeds its search engine's index with site data from a virtual army of "bots"—Web-crawling applications that scour sites for content. But in the past, Google's bots hit a wall when they ran into interactive content that was loaded through JavaScript—especially on pages that use Asynchronous JavaScript and XML (AJAX) to allow users access to additional content without reloading pages. But now, according to Vancouver-based developer Alex Pankratov, it appears Google's bots have been trained to act more like humans to mine interactive site content, running the JavaScript on pages they crawl to see what gets coughed up.
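To make the problem concrete, here is a bare-bones sketch of the kind of AJAX pattern that used to stop the bots cold. The endpoint and element ID are invented for illustration; the point is simply that the text a user sees never appears in the static HTML a crawler downloads.

```javascript
// Bare-bones AJAX pattern of the sort described above (the "/comments" endpoint
// and the "comments" element ID are invented for this example). The static HTML
// a crawler downloads contains only an empty placeholder; the visible content
// arrives only when this script actually runs in a browser.
var xhr = new XMLHttpRequest();
xhr.open("GET", "/comments?page=1", true);
xhr.onload = function () {
  // A bot that never executes this callback never sees the comments.
  document.getElementById("comments").innerHTML = xhr.responseText;
};
xhr.send();
```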

Google has in the past offered up proposals to make AJAX content more searchable, but this put the burden on Web developers rather than on Google's bots—and the proposals didn't gain as much traction as Google had hoped. During the last quarter of 2011, Google finally started to figure out how to efficiently solve the problem from its end, and began to roll out bots that could explore the dynamic content of pages in a limited fashion—crawling through the JavaScript within a page and finding URLs within it to add to the crawl. This required Google to allow its crawlers to send POST requests to websites in some cases, depending on how the JavaScript code was written, rather than the GET request usually used to fetch content. As a result, Google was able to start indexing Facebook comments, for example, as well as other "dynamic comment" systems.
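Google hasn't published its crawler code, but a rough sketch of that "find URLs inside the scripts and occasionally POST to them" step might look something like the following (Node-style JavaScript; the page URL, the regex heuristic, and the POST fallback are all assumptions for illustration, not Google's method):

```javascript
// Rough sketch only, not Google's actual crawler: fetch a page, scan its
// inline <script> blocks for URL-looking string literals, then follow them.
const pageUrl = "https://example.com/article"; // hypothetical page

async function crawl() {
  const html = await (await fetch(pageUrl)).text();

  // Collect string literals inside <script> blocks that look like paths or URLs.
  const scripts = html.match(/<script[^>]*>[\s\S]*?<\/script>/gi) || [];
  const found = new Set();
  for (const block of scripts) {
    for (const m of block.matchAll(/["'](\/[\w\/?&=.%-]+|https?:\/\/[^"']+)["']/g)) {
      found.add(m[1]);
    }
  }

  for (const url of found) {
    // Most discovered URLs can be fetched with an ordinary GET...
    let res = await fetch(new URL(url, pageUrl));
    // ...but comment endpoints sometimes expect a POST, so fall back to one.
    if (res.status === 405) {
      res = await fetch(new URL(url, pageUrl), { method: "POST", body: "" });
    }
    console.log(res.status, url);
  }
}

crawl().catch(console.error);
```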

Now, based on the logs Pankratov has shown, it appears that rather than just mining for URLs within scripts, the bots are crawling even deeper than comments, processing JavaScript functions in a way that mimics how they run when users click on the objects that activate them. That would give Google search even better access to the "deep Web"—content hidden in databases and other sources that generally hasn't been indexable before.
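Here is a small, hypothetical example of why merely scraping URLs out of script source isn't enough. Because the request URL below is assembled at runtime from an element's attributes, it never appears as a literal string anywhere in the page; a crawler has to actually execute the handler, much as a user's click would, to discover where the extra content lives (the element IDs and the /api/ path are made up):

```javascript
// Hypothetical click handler: the request URL is built on the fly from data
// attributes, so it never exists as a literal string in the page source. Only
// running the handler, as a user's click would, reveals where the content is.
document.getElementById("show-more").addEventListener("click", function () {
  var section = this.getAttribute("data-section");          // e.g. "reviews"
  var page = parseInt(this.getAttribute("data-page"), 10) + 1;
  var url = "/api/" + section + "?page=" + page;            // assembled at runtime

  var xhr = new XMLHttpRequest();
  xhr.open("GET", url, true);
  xhr.onload = function () {
    document.getElementById(section).innerHTML += xhr.responseText;
  };
  xhr.send();
});
```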

Flash-based content, such as interactive websites developed in Flash instead of HTML? Yes, this has already been going on for quite a while. Several versions of Flash back, Adobe added methods that let developers make Flash content both accessible to Google for SEO and Section 508 compliant for screen readers, etc.

But the coding aspect of it, ActionScript (built on the same ECMAScript foundation as JavaScript), has never been readable by Google's bots, since they were developed to scrape only for readable text and basically ignore the functional code.

Sounds like Google just finally got around to tweaking its algorithms and stopped being lazy about its search methods, rather than having been genuinely unable to index dynamic content until now.

Once they refine this, it will probably grow their index exponentially. And it's about damned time: I'm tired of coding workarounds to JS implementations simply because they affect the way Google's indexing sees content (kind of like coding workarounds for IE6 and 7).

Quote:

Google has in the past offered up proposals to make AJAX content more searchable, but this put the burden on Web developers rather than on Google's bots—and the proposals didn't gain as much traction as Google had hoped.

Hate to burst your bubble here, Google, but JavaScript existed a long time before you came around, and just because you're all big and grown up like other tech companies doesn't mean we developers are going to conform to you (as M$ has been trying to make us do for over a decade); it's the other way around. Otherwise we'd never develop dynamic content; everything would be hand-coded, static pages only. Now wouldn't that be fun.

I can't wait until these bots get smart enough with Flash to start playing browser games. "How are our bots doing?" "Let's see... oh my... they've been doing nothing but playing Angry Birds clones for the past week!"

Any privacy issue with any website would then cascade into Google's search results, meaning that content not intended for the Web at large could be permanently exposed by a security flaw in the site's programming.

There's no 1-to-1 relation between dynamic content and private content. Websites and apps can use fancy JavaScript/AJAX code for public content, and they can serve private content as plain static resources that are protected only by user login and encryption. Plus, if your site's security depends to any extent on scripts that are served publicly and that any kid can load and deobfuscate, U R doin it rong...
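A minimal sketch of that point, assuming a plain Node.js server and an invented cookie/session scheme: if the private resources are gated on the server, it doesn't matter how thoroughly a crawler executes the site's JavaScript; without a valid session it only ever gets a 401.

```javascript
// Minimal sketch (invented paths, cookie name, and session store): private
// content is gated server-side, so no amount of client-side JavaScript
// execution by a crawler exposes it without a valid session.
const http = require("http");

const sessions = new Set(["deadbeef"]); // stand-in for a real session store

http.createServer((req, res) => {
  if (req.url.startsWith("/private/")) {
    const match = (req.headers.cookie || "").match(/session=([^;]+)/);
    if (!match || !sessions.has(match[1])) {
      res.writeHead(401, { "Content-Type": "text/plain" });
      return res.end("Login required\n");
    }
  }
  // Everything else, including AJAX-loaded public content, is fair game to index.
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end(req.url.startsWith("/private/") ? "account statement\n" : "public page\n");
}).listen(8080);
```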

If I recall Turing theory correctly, you cannot predict the effect of a program written in a Turing-complete language without actually running it. However, Google probably doesn't need to know all the effects of a program; it only needs a reasonable guess about some aspects. This, then, is not really about "reading like humans"; it's more about "being a smartly written program."
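A toy illustration of that limit, with an invented endpoint: whether this browser snippet ever requests /secret-data hinges on whether the loop halts, and there is no general procedure that decides that for every program and input (the halting problem), so a bot either runs the code or settles for a guess.

```javascript
// Toy illustration (the endpoint is made up): whether /secret-data is ever
// requested depends on whether reachesOne() halts. For a fixed input you can
// find out by running it, but no static analysis can decide it for every
// program and input, which is the halting problem in miniature.
function reachesOne(n) {
  while (n !== 1) {
    n = (n % 2 === 0) ? n / 2 : 3 * n + 1; // Collatz step
  }
  return true;
}

if (reachesOne(27)) {        // happens to halt here, but a bot can't know that statically
  fetch("/secret-data");     // only visible to a crawler that actually runs the code
}
```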

Disregarding all the reasons why crawlers shouldn't submit forms, it should be possible for Google to do that unless the website uses a CAPTCHA. But... hypothetical question: couldn't Google leverage its own reCAPTCHA to bypass the CAPTCHAs on forms that use that particular solution?

In any case, I bet they'll stop at PayPal donation forms. Unfortunately. Would love to get me some spider money!

Just wait until Google's bots start brute-forcing website accounts and indexing all private data for even better and more relevant ads. Mistakenly, of course.

Why brute force? Why not just use the user's Gmail password?

Considering the volume of users who reuse their passwords across all of their services, Google could potentially index a person's entire online presence. I would love to find things in my search results for "cats setting dogs on fire" like people's private FB messages, Yahoo emails, ICQ conversations from 1998, descriptions of items in their TigerDirect wish list, private secure messages from their bank's website, PayPal transaction history... oh the lulz.

Sean Gallagher / Sean is Ars Technica's IT Editor. A former Navy officer, systems administrator, and network systems integrator with 20 years of IT journalism experience, he lives and works in Baltimore, Maryland.