Does Google Use Chrome to Discover New URLs for Crawling?

Many people assume that Google leverages all of its vast information resources in every way possible. For example, the great majority of people who think about it probably assume that Google uses their Chrome browser as a data source, and in particular, that they use it to discover new URLs for crawling.

We decided to put that to the test.

Brief Methodology Note

The test methodology was pretty simple. In principle, all we needed to do was setup a couple of pages that Google didn’t know about, have a bunch of people visit those pages from a Chrome browser, and then wait and see if Googlebot came to visit it. In practice, there was a bit more discipline involved in the process, so here is what we did:

Created four brand new articles as web pages. We used two of these as test pages, and two of them as control pages.

Uploaded them to a web site by direct FTP to a web server. By that, I mean we didn’t use any Content Management Systems (e.g. WordPress, Drupal, …) to upload them to make sure that something in that process didn’t make Google aware of the pages.

Waited a week to make sure nothing went wrong, causing Googlebot to visit. During that week, we checked the site log files every day to make sure that no Google bot visits occurred.

Enlisted 27 people to follow this process:

Open Chrome

Go into settings and disable all their extensions

Paste the URL of the first test page in their browser and then visit it

Paste the URL of the second test page in their browser and then visit it

Reenable their extensions after completing the steps above

Grab their IP address and send it to me so I could verify who followed all the steps and who didn’t

Checked the log files every single day until a week after the last user completed their steps

Note that the control pages were never visited by humans, and that’s what makes them controls. If something went wrong in the upload process, then they might get visited, but that never happened. No views of either control page ever occurred, either by humans or bots, so this confirmed that our upload process worked.

What do people believe about this?

In anticipation of this test, Rand Fishkin of Moz put out a poll on Twitter to see what people believed about whether Google uses Chrome data for this purpose. Here’s his result:

Click on image to see original tweet

As you can see, a whopping 88% believe Google sniffs new URLs from Chrome data, and the majority of those are sure they definitely do it. Let’s see how that compares with the results of our test.

The Results

The results are pretty simple: Googlebot never came to visit either page in the test.

As it turns out, two people in the test did not actually disable their extensions, and this resulted in visits from Open Site Explorer (someone had the Mozbar installed and enabled) and Yandex (due to a different person, though I’m not clear on what extension they had installed and enabled).

Summary

This is a remarkable result. Google has access to an enormous amount of data from Chrome, and it’s hard to believe that they don’t use it in some fashion.

However, bear in mind that this test was specific to testing if they used Chrome to discover new URLs. In an earlier test, we showed that Google does not use smartphone clipboards to discover new URLs either. My guess is that Google still is of the mindset that if there is no web-based link path to a page, it doesn’t have enough value to rank for anything anyway.

This does not mean that Google does not use Chrome data in other ways, such as to collect aggregate user behavior data, or other metrics. Google’s Paul Haahr confirmed that they do use CTR data as a quality control in highly controlled testing to measure search quality.

Note: He did NOT say that it was a live ranking factor, but more a way of validating that other ranking factors were doing their job well. That said, perhaps Chrome is a source of data in such testing. It could easily be made truly anonymous, and perhaps add a lot of insight into user satisfaction with search results.

In any event, this part of the conversation is all speculation, and for now, we’ve shown that Google does not appear to use simple Chrome visits to new web pages as a way to discover URLs for crawling.

Comments

We have a CMS that pushes Product pages live before completion. They are 200 status, not complete Product pages, and are not the final URL. I.e. once complete the URL 301 redirects to a friendly URL. I forgot the exact reason, but people internally search for the non-complete URL on browsers. Google is finding and crawling these URLs daily according to our log files, but not indexing them (a site: search returns 0 results). An example of such a URL would behttp://www.example.com/product/-/4568756. Eventually it would 301 redirect to something like http://www.example.com/product/brand-product-name/4568756. I don’t know how else Google would find these pages: they aren’t linked to internally, no one externally should know they exist and link to them, yet Google crawls about 100 to 150 of them daily.

HiBob – I don’t have a clear answer for you, but they have to be discovering them somehow. There are many ways that this can happen. For example, if you visit a page using a browser that has a toolbar that calls the Google Plus API when it loads. We’ve seen them crawl URLs when pages are exposed that way. Just not with Chrome by itself (with no active toolbars).

Hi Nick. We do repeat many of our tests where it makes sense, but since this is a belief that has persisted for years, a one time test probably suffices to lay it to rest, at least for a good while. It’s something Google has said they don’t do, and we’ve simply confirmed that. Of course, our methodology is all in the test report, and anyone would be welcome to reproduce the test on their own.

Actually, Google is very open that they do “read” your Gmail in order to serve you contextual ads and to perform personalized services such as events going automatically to your Google Calendar, and reservations showing up on your phone. I put “read” in quotes because no human actually looks at your emails; it’s all done by machines for those services. Also, as Eric said in his post, we are not making any claim about whether or not Google uses data from Chrome for anything else. Our test only checked on one very specific possible use.

Mark this case study is really weak. Instead you should create one epic webpage and share it on some Facebook close groups. And I bet Google will index those pages very fast based on popularity of that webpage.

I have seen many popular webpages index very fast with no links, no sitemap and no search console account.

Google main abstract is to crawl important webpages first, and they can’t simply rely on sitemap and links based crawler. There are lot’s of webpages out there which are popular and it should index very fast, and hence I think they may be have other type of crawler as well.

Goyllo, we respectfully disagree, as the test accomplished the thing it was designed to test: whether or not Google uses user behavior on its Chrome browser to discover new pages. We were not testing whether Google has other means (such as social media links) of discovering new pages, nor do we dispute that it does. In fact, in another test we demonstrated that just tweeting a link to a previously unrevealed page can cause googlebot to visit and crawl a page (no guarantee; it didn’t happen every time).

Furthermore, the pages were hosted on our site, which has established a very high crawl priority with Google. Normally, pages we haven’t intentionally hidden from Google get indexed within hours of their publication.

Hi,
Thanks for sharing your results.
So Googlebot didn’t show up but you do not say if those urls were indexed in Google SERPs. Did you checked that point?
Do the test urls contain enough quality content to be indexed?
I wonder if Chrome could replace the Googlebot as it has already all the data needed for indexation, and far more.

Hi Francois. No URL can get indexed for Google Search unless it is first crawled by the googlebot. That’s simply how Google Search works. And yes, the test pages contained well-written content, and were posted on a reputable site.

I have to disagree with the conclusions here – I’ve seen Google index sites on staging with no links pointing to them whatsoever and the site still shows up in in index and the only way Google could know that that page exists was if it discovered it via Chrome.

Of course, that was a while ago – it is possible that Google was using Chrome at one point, and simply stopped using it because it was getting a lot of noisy info.

Hi Jason – this is something we also tested 3 years ago, and we got the same results. Two other reasons pages may have been crawled, simply because you loaded them with a Chrome browser:

1. You had a toolbar running which included a Google Plus +1 count checker – we know this will spawn a crawl.
2. When you loaded the page, it had some other Google code on the page that ran, and Google used this for URL discovery.

What we did in our test is that we loaded pages that were simple HTML text files, with no Google code on them, and we made sure that everyone disabled all of their Google Add-in prior to loading the page.

What if Google’s threshold is 28 direct visits before they take notice and crawl?
Maybe 200, or 2000 would do it. I think you have proven that 27 direct visits don’t cause a crawl in a week. Maybe it happened on the 8th day….

Agreed, maybe there’s a threshold at 28, or 32, or 67, but I do think that those possibilities are a bit less remote. Maybe it’s 1000.

But consider this idea. Google doesn’t crawl things that don;t show some level of significance. A simple volume of vists, or social shares does not meet that test. If it did, ever cat or dog photo ever shared would be indexed, and prominent in the search results.

My belief is that they need other signals, such as links to decide to index something. If the content does not draw links, why index it? Lot of page views, or social shares could just be another cat pic.

Hi,
Thanks for sharing your results.
So Googlebot didn’t show up but you do not say if those urls were indexed in Google SERPs. Did you checked that point?
Do the test urls contain enough quality content to be indexed?
I wonder if Chrome could replace the Googlebot as it has already all the data needed for indexation, and far more.
thanks

Jim, it’s unclear just what you think is a rumor, myth, or “fake news.” Many SEOs and search experts have speculated that Google was using Chrome data to discover new URLs, simply because Google could do that. Our test shows that it is highly unlikely that they are. That’s it.

In the past our website had a folder for a new version of the website that had a robots.txt to block all bots and it was crawled and went on google serps, in spite of the folder url never been submited to Google at all…

Thanks for being mythbusters wuth this article. I was also preaty sure they use it but after reading this, it’s totally making sens. Imagine the open door to google bombing if it does. Many bots, many chrome instense… many mistake

You can learn what pages your site Google is crawling by looking into your log file. This is best using a tool called a log file analyzer (you can search on that to find some good ones).

As for other reasons you’re site is not completely indexed, the list pf possibilities is long. Google might see the content aas being “thin” (search on thin content to learn about that), or it may just not be attracting much attention from others on the web (e.g. it may have no high quality links to it), the Sitemaps file may have problems in it, it could be badly structured, or it couild one of many other potential problems. It’s hard to say without doing a detaled analysis of it.

I’m glad Google doesn’t use Chrome to discover URLs, because I sometimes temporarily put private or draft documents onto a server for myself or others to fetch later by direct URL and forget to write a deny entry in robots.txt (and if you’re on a shared server you might not even have access to robots.txt). As long as said private documents don’t link to external sites that might give the game away by publishing their referral logs, I can be reasonably confident that they won’t get crawled as long as I can trust everybody who knows about them to refrain from linking to them. (Somebody could brute-force fetch every possible URL, but if they’re going to go that far then there’s not much hope anyway: yes I could put it behind a password, but what’s the difference between brute-forcing a URL and brute-forcing a password? Thankfully the private documents I’m talking about are not worth going THAT far for.)

The answer is yes. In fact, all browsers, extensions collect user data. Currently there is only one open source project declaring that they do not collect any user’s information. That is Tor Project. (As I know)

Monjoli, read the study carefully. We are not denying that they have the ability to collect the data. That is not in question. The question is whether or not they use it for new URL discovery, and our test demonstrates that they do not.

I think if they did use it, they’d get a LOT of noise when you factor in dev sites, localhosts, session ids, cookie ids, and all the URLs that are normally blocked from the crawler. It would probably be a huge waste of resources to crawl them all. Besides, if they aren’t linked to on the web from anywhere, are they really important?