Suppose you were asked in an interview "How would you implement Google Search?"
How would you answer such a question? There might be resources out there that explain how some pieces in Google are implemented (BigTable, MapReduce, PageRank, ...), but that doesn't exactly fit in an interview.

What overall architecture would you use, and how would you explain this in a 15-30 minute time span?

I would start with explaining how to build a search engine that handles ~ 100k documents, then expand this via sharding to around 50M docs, then perhaps another architectural/technical leap.

This is the 20,000-foot view. What I'd like are the details: how you would actually answer that in an interview. Which data structures would you use? What services/machines is your architecture composed of? What would a typical query latency be? What about failover / split-brain issues? Etc.

This question came from our site for professional and enthusiast programmers. Votes, comments, and answers are locked due to the question being closed here, but it may be eligible for editing and reopening on the site where it originated.


That's quite an interview question. How much detail were they looking for? – Paddy, Jan 19 '11 at 16:38

As much detail as possible to build a Google maybe? – Anurag, Jan 19 '11 at 16:39

Actually, that's a question that I used when I did some interviews a while back. The beauty is that the amount of detail you give is really up to you, and the time your interviewer wants to spend on this. – ripper234, Jan 19 '11 at 16:40

2 Answers

A mammoth question like that isn't looking for you to waste your time in the nitty-gritty of implementing a PageRank-type algorithm or how to do distributed indexing. Instead, focus on the complete picture of what it would take. It sounds like you already know all of the big pieces (BigTable, PageRank, Map/Reduce). So the question is then, how do you actually wire them together?

Here's my stab.

Phase 1: Indexing Infrastructure (spend 5 minutes explaining)

The first phase of implementing Google (or any search engine) is to build an indexer. This is the piece of software that crawls the corpus of data and produces an index: a data structure that is far more efficient for reads.

To implement this, consider two parts: a crawler and an indexer.

The web crawler's job is to spider web page links and dump them into a set. The most important step here is to avoid getting caught in an infinite loop or on infinitely generated content. Place each of these links in one massive text file (for now).
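The loop-avoidance idea boils down to a visited set. Here's a minimal sketch, using an in-memory link graph as a stand-in for real HTTP fetching (the graph and the `get_links` callback are illustrative, not part of the original answer):

```python
from collections import deque

def crawl(seed, get_links, max_pages=10_000):
    """Breadth-first crawl; the 'seen' set prevents infinite loops
    on cyclic or self-referencing link structures."""
    seen = {seed}
    frontier = deque([seed])
    discovered = []
    while frontier and len(discovered) < max_pages:
        url = frontier.popleft()
        discovered.append(url)          # in practice, appended to the big text file
        for link in get_links(url):
            if link not in seen:        # skip anything already queued or visited
                seen.add(link)
                frontier.append(link)
    return discovered

# Toy link graph with a cycle a -> b -> a that would trap a naive crawler.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": []}
pages = crawl("a", lambda u: graph.get(u, []))
```

The `max_pages` cap is the crude defense against infinitely generated content (e.g., calendar pages that link forever into the future); real crawlers also use per-domain budgets and URL normalization.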

Second, the indexer will run as part of a Map/Reduce job. (Map a function to every item in the input, and then Reduce the results into a single 'thing'.) The indexer will take a single web link, retrieve the website, and convert it into an index file. (Discussed next.) The reduction step will simply be aggregating all of these index files into a single unit. (Rather than millions of loose files.) Since the indexing steps can be done in parallel, you can farm this Map/Reduce job across an arbitrarily-large data center.
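The map/reduce shape of the indexer can be sketched on a single machine; assume the map step emits (word, url) pairs and the reduce step aggregates them into an inverted index (the corpus and function names are invented for illustration):

```python
from collections import defaultdict

def map_page(url, text):
    """Map step: turn one fetched page into (word, url) pairs."""
    return [(word, url) for word in set(text.lower().split())]

def reduce_pairs(pairs):
    """Reduce step: aggregate all per-page output into one inverted index."""
    index = defaultdict(set)
    for word, url in pairs:
        index[word].add(url)
    return index

corpus = {
    "site-a": "hash table lookup",
    "site-b": "hash brownies recipe",
}
pairs = [p for url, text in corpus.items() for p in map_page(url, text)]
index = reduce_pairs(pairs)
```

Because each `map_page` call is independent, this is exactly the part that parallelizes across an arbitrarily large data center; only the reduce step needs to see all the output.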

Phase 2: Acquiring Meaning (spend 10 minutes explaining)

Once you have stated how you will process web pages, the next part is explaining how you can compute meaningful results. The short answer here is 'a lot more Map/Reduces', but consider the sorts of things you can do:

For each web site, count the number of incoming links. (More heavily linked-to pages should be 'better'.)

For each web site, look at how the link was presented. (Links in an <h1> or <b> should be more important than those buried in an <h3>.)

For each web site, look at the number of outbound links. (Nobody likes spammers.)

For each web site, look at the types of words used. For example, 'hash' and 'table' probably means the web site is related to Computer Science. 'hash' and 'brownies' on the other hand would imply the site was about something far different.
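The first of those signals, incoming-link counts, is one more pass over the crawler's link graph. A toy version of that pass (the graph is hypothetical; a real system would run this as a Map/Reduce job):

```python
from collections import Counter

def score_by_inlinks(link_graph):
    """Count how many pages link to each page;
    more heavily linked-to pages rank higher."""
    inlinks = Counter()
    for source, targets in link_graph.items():
        for target in targets:
            inlinks[target] += 1
    return inlinks

# "b" is linked from both "a" and "c", so it scores highest.
graph = {"a": ["b"], "b": ["c"], "c": ["b"]}
scores = score_by_inlinks(graph)
```

This is the degenerate, weight-free version of the idea PageRank refines: PageRank additionally weights each incoming link by the rank of the page it comes from, iterating to a fixed point.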

Unfortunately I don't know enough about the sorts of ways to analyze and process the data to be super helpful. But the general idea is scalable, parallelizable passes over your data, each one producing another index or signal.

Phase 3: Serving Results (spend 10 minutes explaining)

The final phase is actually serving the results. Hopefully you've shared some interesting insights into how to analyze web page data, but the question is how do you actually query it? Anecdotally, around 10% of Google search queries each day have never been seen before. This means you cannot simply serve every query from a cache of previous results.

You cannot have a single 'lookup' against all of your web indexes, so what would you try? How would you look across different indexes? (Perhaps by combining results -- maybe the keyword 'stackoverflow' ranked highly in multiple indexes.)

Also, how would you look it up anyway? What sorts of approaches can you use for reading small pieces of data out of massive datasets quickly? (Feel free to namedrop your favorite NoSQL database here and/or look into what Google's BigTable is all about.) Even if you have an awesome index that is highly accurate, you still need a way to find data in it quickly. (E.g., find the rank number for 'stackoverflow.com' inside of a 200GB file.)
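One way to make both of those concrete: keep each signal as its own index, merge scores at query time, and use a sorted key structure for the "find one row in a huge file" lookup. Everything below (the two toy indexes, the weights, the scoring formula) is invented for illustration; it's a sketch of the shape, not of what Google does:

```python
import bisect

# Each index maps a term to {url: score}; in reality these would be
# BigTable-style distributed tables, not in-memory dicts.
keyword_index = {"stackoverflow": {"stackoverflow.com": 0.9, "blog.example": 0.2}}
link_scores = {"stackoverflow.com": 0.8, "blog.example": 0.1}

def query(term, keyword_weight=0.6, link_weight=0.4):
    """Merge per-index scores into one ranked result list."""
    merged = {}
    for url, kw_score in keyword_index.get(term, {}).items():
        merged[url] = keyword_weight * kw_score + link_weight * link_scores.get(url, 0.0)
    return sorted(merged, key=merged.get, reverse=True)

results = query("stackoverflow")

# For the 'find one rank inside a 200GB file' problem: if the file is
# sorted by key, a binary search over the keys gives O(log n) lookups
# instead of a linear scan.
sorted_keys = ["aaa.com", "mmm.com", "stackoverflow.com", "zzz.com"]
pos = bisect.bisect_left(sorted_keys, "stackoverflow.com")
found = sorted_keys[pos] == "stackoverflow.com"
```

The sorted-file-plus-binary-search idea is roughly what SSTable-style storage (the building block under BigTable and most LSM-tree databases) does at scale, with an in-memory index of block offsets standing in for `sorted_keys`.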

Random Issues (time remaining)

Once you have covered the 'bones' of your search engine, feel free to rat hole on any individual topic you are especially knowledgeable about.

Performance of the website frontend

Managing the data center for your Map/Reduce jobs

A/B testing search engine improvements

Integrating previous search volume / trends into indexing. (E.g., expecting frontend server loads to spike 9-5 and die off in the early AM.)

There's obviously more than 15 minutes of material to discuss here, but hopefully it is enough to get you started.

This is a great answer, but I feel that it doesn't begin to address the scale issues with building Google. I think the more challenging part is the Serving Results part of your answer, and where a lot of Google's magic lies. I do have some idea of how to architect something like that, but I'm interested in hearing others. – ripper234, Feb 2 '11 at 10:19

I asked this on Quora - I think it may have the audience to answer this question. quora.com/… – ripper234, Feb 2 '11 at 10:38