I have a new (one-month-old) site. It's an online service related to website security, and it makes extensive use of AJAX. This morning, in Analytics, I found that Google had sent me a visitor via a search query about types of hidden spam. I went back to Google and was glad to see that my site ranked #1 among the 14,300,000 results for that search.

However, the strange thing was that the search result linked to this "page": unmaskparasites.com/security-tools/find-hidden-links/site/?siteUrl=

I use this URL (or rather a part of the URL) in JavaScript to dynamically build customized links to display in reports. No pages on my or any other site link to unmaskparasites.com/security-tools/find-hidden-links/site/?siteUrl=

On the other hand, there is another, static web page on my site about finding compromised WordPress blogs, and it has direct inbound links and similar text about common types of hidden spam links.

So why did Google prefer that incomplete dynamic URL (no incoming links, buried inside JavaScript) over a similar web page with a static URL and direct incoming links?

I entered a site:unmaskparasites.com query to check whether the static page was indexed. It was. The site is very small, and all of its pages are indexed. Moreover, in the results I found pages that were not supposed to be indexed: URLs used only for AJAX requests (see the unmaskparasites.com/results/ and unmaskparasites.com/token/ results in the screenshot).

http://unmaskparasites.com/results/ and http://unmaskparasites.com/token/ are service URLs, used exclusively in AJAX (JavaScript) requests. There are no other links to these "pages." Here are the JavaScript snippets that contain these URLs:

$.get('/token/', function(txt){ ...

and

$.post("/results/", { ...

As you can see, finding links in this code is not a trivial task. The links are relative and don't contain any "http://". A crawler would need to understand the code to distinguish such links from other, non-link string literals. Once it has parsed them, Google adds the domain name to construct absolute links.
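For context, here is roughly how these calls look on the page (a simplified sketch, not the exact production code; the request parameters and the token variable are my assumptions):

// Simplified sketch -- parameter names are assumptions, not the real code.
// Fetch a one-time token for the current check:
$.get('/token/', function(txt) {
    window.checkToken = txt; // remember the token for the next request
});

// Later, submit the check and inject the returned report into the page:
$.post('/results/', { siteUrl: $('#id_siteUrl').val(), token: window.checkToken },
    function(html) {
        $('#results').html(html); // the div with id "results" receives the report
    });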

Links in JavaScript strings

The http://unmaskparasites.com/security-tools/find-hidden-links/site/?siteUrl= URL, likewise, appears only as part of the following string inside JavaScript:
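The exact snippet isn't reproduced here; roughly, it concatenates the static path with the current value of the id_siteUrl text box, something along these lines (a reconstruction for illustration, not the original code):

// Illustrative reconstruction, not the original code:
var reportLink = '<a href="/security-tools/find-hidden-links/site/?siteUrl=' +
                 $('#id_siteUrl').val() + '">';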

When a crawler visits the page with this script, the value of the id_siteUrl field is blank, so executing the JavaScript produces the following string: '<a href="/security-tools/find-hidden-links/site/?siteUrl=">'. That is the URL Google indexed (again, with the domain name added).

Google crawler's JavaScript is not the same as in your web browser

It looks like Google's crawler executes only the parts of your JavaScript that have to do with links and skips the rest of the code.

In my case, the cached page clearly shows that Google fetched http://unmaskparasites.com/results/ using a GET request with empty parameters. If it had really executed all of the code, 1) the request would not have passed validation and the page would not have loaded, and 2) it would have been a POST request, not a GET.

So I assume Google's crawler is not equipped with a full-featured JavaScript interpreter. It just parses JavaScript, finds links, and maybe executes a reduced set of operations, such as string concatenation.
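For example (my own illustration of what such a reduced interpreter might and might not be able to evaluate; the variable names are made up):

// A purely static concatenation -- trivial to reduce to a link:
var toolPath = '/security-tools/' + 'find-hidden-links/site/';

// A concatenation that depends on user input -- with no real browser and no
// typed-in value, the field is empty, so the "evaluated" link ends up
// truncated at ?siteUrl= (exactly what Google indexed):
var reportUrl = toolPath + '?siteUrl=' + $('#id_siteUrl').val();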

jQuery

My other guess is that Google knows how to interpret JavaScript that is based on well-known libraries. I use jQuery and load it directly from Google's servers:

http://ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js

This is the only external JavaScript file my pages load. So Google can be pretty sure that the $.post(...) and $.get(...) functions send AJAX requests, and that the $('#results').html(...) call adds HTML code to the div with the "results" id.

Google Toolbar

I have the Google Toolbar installed, and it could send information about the URLs I visit back to Google. That way, Google could have learned about those JavaScript links. But several facts make me think the toolbar is not to blame:

My AJAX URLs never appear in the address bar, so there is no reason to request PageRank info for them.

The toolbar reports URLs visited in real life. So the indexed page would have a URL like http://unmaskparasites.com/security-tools/find-hidden-links/site/ or http://unmaskparasites.com/security-tools/find-hidden-links/site/?siteUrl=example.com

Other visited "secret" maintenance URLs that are not mentioned in JavaScript are not indexed.

Have you ever seen a web page with no incoming links, on a one-month-old domain, rank #1 for a query with 14 million other results?

Some information from Google

I've just found some indirect confirmation of my point on the official Google Webmaster Central Blog:

"One of the main issues with Ajax sites is that while Googlebot is great at following and understanding the structure of HTML links, it can have a difficult time finding its way around sites which use JavaScript for navigation. While we are working to better understand JavaScript, your best bet for creating a site that's crawlable by Google and other search engines is to provide HTML links to your content."

So they say it is "difficult" but not "impossible," and they are "working to better understand JavaScript." Now, nine months later, they do seem to be able to understand some JavaScript.

This backs up my point: Googlebot executes JavaScript, but its JavaScript support is limited.

"Regarding ActionScript, we’re able to find new links loaded through ActionScript."

If they can find links in ActionScript, why not find links in JavaScript too?

New era?

Flash, JavaScript. Is this the beginning of a new era of more sophisticated search engine spiders that can "see" web pages the way human surfers see them? Check your JavaScript. Maybe you expose too much to Google. I have just added a few more "Disallow" rules to my robots.txt.
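For example, the kind of rules I mean (the exact contents of my robots.txt aren't shown here; the paths are the service URLs discussed in this post):

User-agent: *
Disallow: /results/
Disallow: /token/
Disallow: /security-report/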

Do you think I'm paranoid?

P.S. Google has indexed http://unmaskparasites.com/security-report/, which appears only as the action attribute of my HTML forms. Are form action URLs followed too?

P.P.S. Hopefully, despite my terrible English, you were able to find some interesting information in the article.

This was an excellent post!
I'll just add a couple of things to the discussion:
1.) To prevent Googlebot from reading your JS, place it in an external file that is disallowed via robots.txt. (In other words, don't just disallow the "hidden" URLs that the JS renders--disallow the actual JS so Googlebot never sees those "hidden" URLs in the first place.)
2.) I have also seen a lot of evidence that suggests that Google interprets JavaScript. The most startling example was when I discovered a bunch of indexed URLs that pointed to a 3rd-party ad server. Due to the way the ad server worked, the JS code would build a random dynamic URL using the Math.random() function. The overall effect was that the ad server inserted a random advertisement banner (out of dozens to choose from). This is the interesting part....
Google had apparently interpreted Math.random() as 0.5!
Due to the way the JS was coded, every page on the site had its own ID (to allow us to track which pages were getting the most banner clicks) that was built into the dynamic JS-created URLs. After I discovered that Google had indexed HUNDREDS of these random URLs, I soon realized that every single one had the same exact "random" number in the URL--the number that you would get if you plugged 0.5 into the JS!
(BTW... for those who don't know... Math.random() produces a random number between 0 and 1. So if Google was going to program some kind of interpreter, it makes sense that they would choose 0.5 as an arbitrary number.)
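To make point 2 concrete, the ad code was doing something along these lines (a simplified reconstruction; the host name and parameter names are made up):

// Simplified reconstruction (hypothetical host and parameter names):
var pageId = 'page-123';     // per-page ID used to track banner clicks
var rnd = Math.random();     // meant to pick/rotate the banner "randomly"
document.write('<script src="http://ads.example.com/banner?page=' + pageId +
               '&rand=' + rnd + '"><\/script>');
// If a crawler evaluates Math.random() as 0.5, every "random" URL it builds
// and indexes ends up identical: ...&rand=0.5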

If most web pages include the same code, placing it in external files may also improve page load time since browsers usually cache static files.

However, if the scripts are short and differ from page to page, I'd prefer to keep them inside the HTML files. Putting such scripts in external files can make website development and maintenance more troublesome. And, in this case, external .js files will (slightly) increase page load times, since they require additional server requests and you can't take advantage of a cached browser copy (every page needs a unique .js file).

There may be a compromise solution: compile a list of the URLs used in JavaScript and place them all in an external .js file (say, const.js) as named constants/variables. Then load this const.js before the scripts that work with URLs, and use the constants in those scripts instead of the actual URLs, as in the sketch below. As a bonus, if you decide to change some URL, you'll need to change it only once, in const.js. No need to modify every file that uses the modified URL.
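A minimal sketch of what I mean (file and variable names are just placeholders):

// const.js -- the single place where all service URLs are defined:
var TOKEN_URL   = '/token/';
var RESULTS_URL = '/results/';

// Scripts loaded after const.js refer to the constants instead of literal URLs:
$.get(TOKEN_URL, function(txt) { /* ... */ });
$.post(RESULTS_URL, { /* ... */ }, function(html) { /* ... */ });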

"We already do some pretty smart things like scanning JavaScript and Flash to discover links to new web pages..."

There is also a description of the algorithm used by googlebot to explore HTML forms:

Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.

"we might choose", "we may include"... Not very specific. And what do they consider to be high-quality sites?

Anyway, tell your webmasters not to use GET forms for requests with side effects. Otherwise Googlebot may create new accounts, post comments, etc.
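In other words, something like this (a trivial example; the action URL is made up):

<!-- Risky: a crawler experimenting with this GET form can trigger the side effect -->
<form method="get" action="/post-comment/"> ... </form>

<!-- Safer: requests that change state should go through POST -->
<form method="post" action="/post-comment/"> ... </form>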

Google seems to have a very basic way of understanding/following JavaScript URLs. I think the implementation is TERRIBLE for the following reasons:

- it fails to generate complete URLs. Since it doesn't actually execute the JavaScript, broken-down URLs remain broken down, resulting in invalid or unpredictable URLs

- it fails to distinguish between link URLs and data URLs. Since on an HTML page only <a...> URLs are followed, the same should AT ALL TIMES hold for JavaScript: only a URL that ends up in an anchor tag should be followed. Many URLs are destined to be passed to data-exchange functions such as XMLHttpRequest that implement site-internal functions. By following _any_ URL ever generated, Google could trash an entire website.

- robots.txt is useless because some of the generated links are unpredictable. For instance, I have a token-based AJAX service: each request is assigned a distinct token (ID) that is valid for only one postback. Needless to say, this generates a different URL every time Google crawls.

- there is no transparent documentation of this behavior and no way to turn it off. Awful.

Request:

Google should document this feature and BY ALL MEANS provide an opt-out; actually, this behavior is so unwise that it should really require an EXPLICIT OPT-IN.

dynu,
I agree 100% that Google needs to be more transparent about this issue. This would make an excellent YOUmoz post, which would also give it the exposure it deserves.
Or better yet... try submitting it to Matt Cutts so he can address it in one of his videos.
The most "robust" solution I could think of is to define all JS functions in an external file and then disallow that file in robots.txt. I proposed this solution to Matt Cutts, and he responded to it here.
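For example, assuming all the functions live in a single external file like /js/site.js (a hypothetical path), the robots.txt rule would be:

User-agent: *
Disallow: /js/site.js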

Yup, Google is definitely crawling links inside JavaScript. I have seen loads of 404s reported in Webmaster Tools recently for template URLs that exist only in JavaScript and get parts substituted at display time depending on user actions. The URL is not valid in its template form, only once the appropriate bits have been added.

Google most definitely does parse and execute JavaScript. Not only that, it also executes and renders Flash objects. If you google slinkymedia, the photography site slinkymedia.net comes up and shows an image on the right of the page WITH a photograph showing. The photograph is embedded within the SWF, which in turn is loaded by SWFObject. The only conclusion is that Google does parse, execute, and render JavaScript and ActionScript.

Awesome post. Your English is far from terrible. I've heard a variety of arguments both affirming and denying Googlebot's ability to effectively index Flash and JavaScript. Still, you have to think logically about this... would it be beneficial or detrimental to the end user if Google (among other search engines) could effectively index Flash? What would be the logical step for the Google team if there was value in it?

We'll certainly find out in the not too distant future, right? No need for paranoia. Positive - mental - attitude.

The point is not how good Google is at indexing Flash and JavaScript. The point is that it does try to find links in JavaScript, and that this may break your SEO efforts.

You may find maintenance pages (not meant for humans) competing with your real web pages for positions in the SERPs.

The problem I see is that there are no clear guidelines from Google on how webmasters can control what Googlebot extracts from JavaScript.

It's clear that Googlebot uses JavaScript for link discovery. But it's not clear when this happens, or what sorts of JavaScript code the bot can and cannot investigate. What can we do to hide certain URLs in JavaScript from Google? When you write $.get('/token/'), it is impossible to add "nofollow", since it is not an HTML tag.

The only workaround I see is to Disallow service URLs used in JavaScript in robots.txt.

Googlebot is definitely limited in understanding JavaScript. But since this affects my site's representation in the SERPs, I want clear instructions from Google about what sorts of JS links are followed and what I can do to hide links in scripts from robots.

By the way, I've just found a blog post (in Russian) about how googlebot followed JavaScript redirects (document.location.href = "http://www.example.com"). It even followed encrypted links (that can't be revealed without executing JavaScript).

I have no way of confirming this at the moment. But on the one site where I used JavaScript for a lot of the links, to turn full div tags into clickable banners, those links aren't being picked up by Google, so it's still working as intended for me.

Maybe there is a bug in your code, or in how it is being executed, and Google is seeing something weird.

Of course, there may be bugs in my code. However, I double-checked everything. My site is small, and the code is very short and pretty clear. And Google says in their blog that Googlebot executes JavaScript. So I'm pretty sure Google used my scripts for link discovery.

But, as you pointed out, this doesn't happen for all sites. So the question is: why does Googlebot use JavaScript for link discovery on some sites, and can we, as webmasters, control this?

Excellent post! And I agree with the others that your English is not terrible at all.

If you have access to your web server log files, what referer is listed when Googlebot crawls the pages in question? Knowing the page it was on before reaching your JavaScript-linked pages should show conclusively whether Googlebot is processing JavaScript.

Anyone else wonder whether Googlebot's ability to follow and index JS could cause inadvertent duplication where previously there was none? Or changes in link structure? Our site has experienced a lot of weird Google behavior lately, including a ton of 404 results and pages that never used to be in the index now appearing. We allow users to select font size and to float the page using JS links, and there is a lot of other JS in the code.

For a small site it's not a big issue to change, but for a large or very large site, Googlebot's ability to follow and read JS links has potentially "added" a ton of links where there used to be none, diluting page link equity, causing duplication, and redirecting link flow and Googlebot in ways it never used to.

Actually, Darren, that's terrible advice and a clear indication that you have no idea what you're talking about.
A page that is linked to with JavaScript can still be discovered.
A page with a noindex Meta tag can still be crawled.
A page that is disallowed in the robots.txt file can still be indexed.

That is general advice, good for a general lay person who doesn't want to go into scripting, coding, etc.
No, Darren, that's NOT general advice, good for a general lay person. Wrong information isn't suitable for any audience.
I don't see you giving any advice?
Really? How far did you look? Is it too much trouble to scroll all the way up to here?
I went to your site and see that you do SEO for others. And you come to a community site to leave rude comments.
Different people have different styles of commenting. I like being rude--you like being wrong. To each his own.
So much for professionalism.
I find that professionalism is a trait that's usually emphasized by people who can't get by on skill alone.
Please stop talking nonsense here if you don't know how to communicate properly.
I appreciate your concern, but "talking nonsense" is actually a hobby of mine, and I just can't abandon something I'm so passionate about. As an "article distributor," surely you can relate.