When I search for a file on my hard drive in Windows 7 or Windows XP, it takes several minutes to finish. If I enter a search term in Google, the answer is on my screen in milliseconds.

How is it possible for Google to search the Internet, which is many times larger than my hard drive, faster than my OS can search my computer? Is it only a matter of computing power and the right algorithm?

93

Have you tried indexing all the files on your drive and searching only the index? Try Everything and see.
–
Karan, Apr 3 '13 at 18:53

11

Google Desktop "used" to do that for Windows also...
–
rogerdpack, Apr 3 '13 at 20:39

14

Google searches through indices stored in RAM, not through files on a hard drive.
–
Ari, Apr 4 '13 at 1:12

13

The index is important, but Google also uses a map-reduce algorithm to conduct a massively parallel set of operations. No matter how many cores you have in your computer, I guarantee Google has more.
–
Adam Wuerl, Apr 4 '13 at 3:05

39

There's nothing precluding a desktop search implementation from using indexing. However, remember that Google has enough cash for a) lots of very fast CPUs/servers to parallelise a query; b) lots of very fast RAM to avoid ever having to access a disk; c) lots of hard drives much faster than the one you use; d) lots of very smart engineers to optimize the algorithms involved (e.g. caching the results for a lot of frequently used queries, and much, much more). It's not "only" a question of any one of these; it's all of these acting in concert.
–
millimoose, Apr 4 '13 at 3:14

10 Answers
10

Google is not searching the internet: it is searching an index. Google has huge server farms which are constantly scanning and indexing the internet. This process takes a lot of time, just like the search of your unindexed hard drive. In Windows 7, there is an option to index your hard drives. This process takes some time at first but once it is up and running the results of a search will be instantaneous.

Last paragraph: this link is much more authoritative and overall better.
–
ulidtko, Apr 3 '13 at 21:03

4

Pardon my curiosity, but don't file systems already index the files on the disk? Isn't what you see in your file explorer a mere index of links to the actual physical sectors on the disk? Why do we, then, need to do even more indexing?
–
Adnan, Apr 4 '13 at 7:30

8

@Adnan the file system's index is designed to find the position where a file is stored on a physical medium. It is like the table of contents of a book that tells you on which page a chapter starts. A search index is designed to find content. A good search index indexes not only a file's name but also the content of known file types like pdf, doc, html, ... Advanced indexes also use synonyms, so if you search for "car" it might also find results with the word "automobile".
–
Simon, Apr 4 '13 at 8:47

3

@Adnan, a file system isn't really an "index", just a tree of file names. Searching such a tree isn't fast, because its structure isn't optimized for searching. OTOH Google (and databases) use specific sorted index structures which make lookup of a particular entry lightning fast. Even then, not all searches can benefit from such an index, and those will be slow(er).
–
PiRX, Apr 4 '13 at 10:36

8

@Adnan In a sense, the FS Tree is optimised against searching. It's designed to allow addressing of known locations. From your root node, all you get is a list of directories and files under root. Every directory just knows about the files in it, and the directories below it. Accessing a known filepath is very fast under this, and it offers a lot of flexibility, but there does not exist a global listing of files to search through. You must always descend through the directory tree, and that makes for a lot of distinct lookups.
–
Phoshi, Apr 4 '13 at 15:28

By comparison, a hard drive search without an index has to read through every file on the drive, and this can take a lot of time.

Additionally, you can think of both a filesystem and an index as a tree. In the filesystem the root of the tree is the top-level folder, and it can have branches (folders) or leaves (files) in that one folder. Each branch can have sub-branches for more folders and leaves for more files. To search this structure you have to 'walk' all of the branches (and sub-branches) to find the leaf you are looking for. An index flips this hierarchy around. The base becomes the alphabet, and all of the sub-branches are further refinements of this. The leaves are the locations of the items you are looking for. Searching this structure allows you to prune (exclude) large sections of the tree (e.g. the first letter of your search term allows you to trim 25 other branches right away).
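The "flipped" tree described above can be sketched as a small trie keyed on the letters of each file name. This is only an illustration under simple assumptions: the class names, file names, and paths are made up, and real desktop indexers are considerably more elaborate.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next letter -> child node
        self.locations = []  # paths of files whose name ends at this node


class FileNameIndex:
    """Toy index keyed by the letters of each file name (hypothetical)."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, name, path):
        node = self.root
        for letter in name.lower():
            node = node.children.setdefault(letter, TrieNode())
        node.locations.append(path)

    def find_prefix(self, prefix):
        # Follow one branch per letter; every other branch is pruned
        # without ever being visited.
        node = self.root
        for letter in prefix.lower():
            node = node.children.get(letter)
            if node is None:
                return []
        # Collect every location stored at or below the node we reached.
        found, stack = [], [node]
        while stack:
            current = stack.pop()
            found.extend(current.locations)
            stack.extend(current.children.values())
        return sorted(found)


index = FileNameIndex()
index.add("report.doc", "C:/work/report.doc")
index.add("recipe.txt", "C:/home/recipe.txt")
index.add("budget.xls", "C:/work/budget.xls")

print(index.find_prefix("re"))  # ['C:/home/recipe.txt', 'C:/work/report.doc']
```

Looking up "re" never touches the branch that starts with "b", which is the pruning effect: the work depends on the length of the search term and the number of matches, not on the total number of files.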

About 4 years ago I asked myself the same question. As I googled around doing my research, I eventually read that there is more to it than hiring the best of the best to come up with some of the most sophisticated search algorithms.

One of the key designs they use is, I think, similar to the idea of MapReduce. You have a lot of cheap computers on farms. Give each of these computers only about 80 GB of hard disk space, and push hard to give them 16 GB or, better still, 32 GB of RAM (as much as possible). Remember that they are connected through some sophisticated system that Google designed. The key idea is that when a query is submitted, it is passed into this system, which tries to find the answer in the fresh data held in RAM. Keep in mind that they have a lot of these cheap computers. Since the data is in RAM, it is found a lot faster than it would be on a hard disk. And don't forget that they also have a sophisticated system (indexing and all those algorithms) that helps greatly.
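A rough sketch of the "many cheap machines, each holding part of the data in RAM" idea: the query is sent to every shard at once and the partial answers are merged. Threads stand in for separate machines here, and the shard contents and function names are invented for illustration; this is not a description of Google's actual system.

```python
from concurrent.futures import ThreadPoolExecutor

# Each "machine" holds a small in-memory index for its share of the documents.
# The shard contents are made up for illustration.
SHARDS = [
    {"cat": ["page1", "page4"], "dog": ["page2"]},
    {"cat": ["page7"], "fish": ["page9"]},
    {"dog": ["page12"], "cat": ["page15"]},
]


def search_shard(shard, term):
    # A dictionary lookup in RAM; no disk access involved.
    return shard.get(term, [])


def search_all(term):
    # Scatter: ask every shard at once. Gather: merge the partial results.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partial_results = pool.map(search_shard, SHARDS, [term] * len(SHARDS))
    merged = []
    for hits in partial_results:
        merged.extend(hits)
    return merged


print(search_all("cat"))  # ['page1', 'page4', 'page7', 'page15']
```

Each shard only has to look through its own small slice, so adding more machines keeps each lookup short even as the total amount of data grows.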

And this data doesn't have to be fresh, because we all know that Google stores everything. As for what should be in RAM, the same principle as with splay trees can be used: keep whatever people are searching for the most in RAM and flush the least-searched stuff to hard disk.
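One way to picture the RAM-versus-disk split is a two-tier store with a least-recently-used (LRU) eviction rule. Note that LRU tracks recency rather than search frequency, so it is a simple stand-in for the principle, not a splay tree and not anything Google is known to use. Both tiers are plain Python dictionaries, and the class name and capacity are assumptions made for the example.

```python
from collections import OrderedDict


class TieredStore:
    """Keeps recently searched entries in a small 'RAM' tier and pushes
    the rest down to a slower 'disk' tier (both simulated with dicts)."""

    def __init__(self, ram_capacity):
        self.ram_capacity = ram_capacity
        self.ram = OrderedDict()  # least recently used entry comes first
        self.disk = {}

    def put(self, query, results):
        self.ram[query] = results
        self.ram.move_to_end(query)
        self._evict()

    def get(self, query):
        if query in self.ram:
            self.ram.move_to_end(query)      # mark as recently used
            return self.ram[query], "ram"
        if query in self.disk:
            results = self.disk.pop(query)   # promote back into RAM
            self.put(query, results)
            return results, "disk"
        return None, "miss"

    def _evict(self):
        # Flush the least recently used entries once RAM is over capacity.
        while len(self.ram) > self.ram_capacity:
            old_query, old_results = self.ram.popitem(last=False)
            self.disk[old_query] = old_results


store = TieredStore(ram_capacity=2)
store.put("weather", ["w1"])
store.put("news", ["n1"])
store.put("maps", ["m1"])      # "weather" is flushed to the disk tier
print(store.get("news"))       # (['n1'], 'ram')
print(store.get("weather"))    # (['w1'], 'disk'), and "maps" is flushed
```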

This little idea coupled with their indexing and all the other things others have mentioned in their answers, might be one of the reasons why it is faster than a hard-drive search.

The power to predict based on other searches.

The data is most likely in RAM which we all know is faster.

Use multiple systems to divide and conquer

Searching is their main priority.

Of course I could be wrong, but this made sense to me. And I was happy with what I learned.

You nailed it on some of the things that the other, more popular posters missed. Google doesn't search everything as often. Definitely not on the whole internet, and not even everything in its own caches. Moreover, when you search on Google.com, the actual search is not happening in real-time, just a quick copying and displaying of search results that have already been produced and organized in the past months by Google. It's extremely complicated to describe the producing/organizing process, but it can vaguely be called "indexing" as someone said.
–
Joseph Myers, Apr 3 '13 at 23:28

It's extremely complicated to describe the producing/organizing process.... Yep, that's what I refer to as the sophisticated part of it. Thumbs up, you summarized it well.
–
Touch, Apr 3 '13 at 23:34

@Touch I agree about searches in RAM. This was the fourth point in my post about caching
–
Brad Patton, Apr 4 '13 at 0:25

@Brad Patton True. I had to mention it because it was the basis of what I learned. As for the part about indexing constantly, well, the indexing part is kind of the organizing part. Therefore the statement holds that you search what has been organized and not what is being indexed at the moment. As for why the result is showing, Stack Overflow has more credibility than many websites, so it's a good idea to index it more frequently. That's why it shows up. If it wasn't for that, you would have to wait a day or two before what you search for shows up. I think that's what Mr JosephMyers is saying.
–
Touch, Apr 4 '13 at 0:38

Google uses an extremely sophisticated indexing system, parallel operations, and a number of load balancing techniques not available to a standard standalone computer. There is really very little similarity between a web search and a hard disk file search, and Google optimizes heavily for its specific use cases.

In 2004, some Google employees published a paper on MapReduce, and since then they have improved on it hundreds of times.

Also, they use the Google File System (GFS), which is a distributed file system like the Hadoop Distributed File System (HDFS) and is extremely optimized for their purposes. As far as I know, GFS may work as much as a thousand times faster than HDFS.

Just adding something to the wonderful answers here.
Google caches the results of popular search phrases.
The results of these searches reside in memory. So if you search for something that is searched for a lot, the results will show up almost immediately.

To answer the question on a simplistic level: imagine you have a textbook with a keyword index at the back.

Searching a hard disk (naively, at least) is like going through the book, page by page, scanning each line for an occurrence of your keyword.

Using an Internet search engine is like looking up the keyword in the index, and then turning directly to the page number it gives.

In reality of course, it's a lot more complex than this. For example, you would usually search your hard disk for different kinds of information than the Internet. But the basic thing to take away is that the search engine is using an index. It has already gone through the "book", word by word, and it has compiled a list of those words along with where to find them, and it has organised the list in such a way that it can look up things in it very quickly.
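The difference between scanning the "book" and consulting a prepared word list can be sketched as follows. The page texts and function names are invented for the example, and a real search engine's index stores far more than page numbers.

```python
from collections import defaultdict

# A tiny "book": page number -> text. The contents are made up.
PAGES = {
    1: "the quick brown fox",
    2: "the lazy dog sleeps",
    3: "a quick dog runs",
}


def scan_search(pages, word):
    # No index: read every page in full, every time someone searches.
    return [number for number, text in pages.items() if word in text.split()]


def build_index(pages):
    # Done once, ahead of time: record which pages each word appears on.
    index = defaultdict(list)
    for number, text in pages.items():
        for word in set(text.split()):
            index[word].append(number)
    return index


def index_search(index, word):
    # A single dictionary lookup, however many pages there are.
    return index.get(word, [])


INDEX = build_index(PAGES)

print(scan_search(PAGES, "quick"))   # [1, 3]
print(index_search(INDEX, "quick"))  # [1, 3]
```

Both calls return the same pages, but `scan_search` reads the whole book on every query, while `index_search` pays the reading cost once in `build_index` and then answers from the prepared list.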

For example, think about the organisation of an index in a book. Firstly, it is usually sorted alphabetically, and secondly it may have letter headings. When you look up a word in the index you can see straight away the list of words beginning with the letter you want. And because the list is sorted, it is easy to find the word you want within the list, or to tell quickly if it is missing.
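Why sorting helps can be shown with a binary search over a sorted word list, here using Python's standard `bisect` module. The word list and the `contains` helper are illustrative only.

```python
from bisect import bisect_left

# A sorted list of index entries (illustrative words).
ENTRIES = sorted(["apple", "banana", "cherry", "grape", "lemon", "mango", "peach"])


def contains(sorted_entries, word):
    # Binary search: each comparison halves the remaining range, so a
    # list of a million entries needs only about 20 comparisons.
    position = bisect_left(sorted_entries, word)
    return position < len(sorted_entries) and sorted_entries[position] == word


print(contains(ENTRIES, "grape"))  # True
print(contains(ENTRIES, "kiwi"))   # False
```

The second call shows the "tell quickly if it is missing" part: the search lands where "kiwi" would have to be, finds a different word there, and stops.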

So to summarize, it's like your hard disk just has a book, while the search engine has the index. Though as some others have pointed out, it's possible to use software to index your hard disk, and then you can use the index instead of the whole thing.

I guess one of the reasons Google introduced autocomplete and used AJAX was the speed problem. Now, while you are typing, the words are sent in the background, so Google can do part of the job before you have finished. Also, indices are based on multiple-word combinations (which you can find as suggestions at the bottom of the page). Network speeds are currently higher than hard drive speeds, and probably many of those indices reside in the RAM of the servers in their farm.