Wednesday, October 04, 2006

The advantages of big data and big clusters

I particularly enjoyed Peter's thoughts on the advantages of big data and big clusters. Near the beginning of the talk, Peter said:

Rather than argue about whether this algorithm is better than that algorithm, all you have to do is get ten times more training data. And now all of a sudden, the worst algorithm ... is performing better than the best algorithm on less training data.

Worry about the data first before you worry about the algorithm.

Later, near the end of the talk, Peter extended this point:

Is it just that Google has more data and more machines? But it couldn't be just the more data because [in some competitions] ... everyone got the same data.

So, I think that having more machines is a very important part because it allows us to turn around the experiments much faster than the other guys. So, it's not the online performance where you are actually doing the search that matters, but it's the -- gee, I have an idea, I think we should change this -- and we can get the answer in two hours which I think is a big advantage over someone else who takes two days. And, I think it also helps that we took an engineering approach of -- well, we'll try anything.

Amazon was similar in many ways. While Amazon did not have Google's mighty parallel processing tools and massive cluster, Amazon did have big data (transactional and log data) which it used extensively for website visible features like personalization and search query refinements and backend work like supply chain optimizations. In addition, Amazon was very early if not the first to do website A/B tests, a framework for rapidly testing new algorithms and designs live on Amazon.com, which encouraged behavior like Google's "try anything" engineering approach.

I find Peter's words particularly interesting when thinking about the Netflix contest. Netflix may be demonstrating how to do a Google-like experimental effort if you do not have Google-scale resources. The Netflix contest uses other people's machine resources and the power of many minds "trying anything" to attempt to find improvements to the Netflix recommender system.

See also notes by Brian Mingus on what appears to be the same talk a few weeks later at U of Colorado at Boulder.

5 comments:

I believe he will be giving the same talk again, tomorrow (Thursday), at PARC. I will definitely be attending.

I had heard of a similar talk @ UMass by Googler Jay Ponte a few months ago. My question back then was "ok, well, that is fine when you have big data. But what about when you don't have big data?"

Norvig seems to answer my question, when he says that big clusters still let you do rapid iteration, and so you can try even more things until you find what works. Excellent point.

But the next question arises: With this rapid iteration, engineering approach, aren't you just overfitting the problem? Not overfitting the data.. I trust that everything is properly broken into train and test sets. But overfitting the problem. Coming up with an approach so specific that it only works for that particular sub-sub-task.

For example, suppose you were working on an IR system for retrieving newspaper documents. The Google approach might become so over-engineered so as to blow the competition away on newspaper documents, but then completely fall apart when retrieving, say, magazine articles. They might pick up on something so specific to the way a newspaper article is written that they overfit the problem.

As a result, we actually learn nothing new about the magazine search problem, or about search in general, and we also really don't make any advancement in the field of AI.

Thoughts?

Would also be curious, Greg, about your reaction, given the big but small data that you have to work with. On one hand, if you look at all the data you collect from Findory users, you probably have big data. But on the other hand, with your intense focus on personalization, you really only have as much data as any one person has data. Each person's data is separate, different, personal. The best algorithm for one person might be very different from the best algorithm for another person.

So even if you throw big clusters at your small ("per person") data, would the Google approach make sense? Suppose you optimized/engineered for 1 person, or for a set of 10 or 50 people.. would you trust that the solutions you came up with would be good for all the rest of the people using Findory?

Interesting question about potential overfitting of algorithms to problems, Jeremy, but I don't see much evidence that that is happening.

In fact, I would think that would be more likely to occur with tiny data, where handcoding domain-specific rules might be feasible, than for massive data, where there is too much going on to fiddle with things manually.

Greg, here is one quick, blaringly obvious example: Enterprise search. Google has optimized its search to the web, using pagerank, anchortext, etc. as strong features. So when those features are missing, such as in an enterprise document search domain, the Google enterprise search does not do as well as web search.

That is a real example, something I know for sure is different.. the web has hyperlinks and enterprise documents do not. Google's whole raison d'être was hyperlinks; everything is engineered toward that. What happens when hyperlinks are missing?

Well, you write, "In fact, I would think that would be more likely to occur with tiny data, where handcoding domain-specific rules might be feasible, than for massive data, where there is too much going on to fiddle with things manually."

That is exactly my point. You have two options when your data is small: 1. More intelligent, more complex mathematical models (e.g. IBM's translation model), or 2. More highly engineered programming solutions, "rules" you might even say, iterated and tested many times until the best solution is reached.

Google takes the second approach, it seems. So isn't there a risk of overfitting? Does Google's Arabic-to-English translation algorithms work, modulo character set issues, just as well for Japanese-to-Spanish, given equal amounts of training data? I feel more confident that IBM's algorithms would hold up under both. Not so sure about Google, given their approach.

Here is the real issue: Anyone who has worked with TREC data knows that a new algorithm or feature does not always work consistently from collection to collection. Something like stemming (conflating morphological variants) might work wonders on one set of 300,000 documents, but decrease precision on another set of 300,000 documents.

So if Google has engineered/optimized their 300,000-document search appliance on a set of Chrysler corporate documents, does it still work just as well for Ford or GM? Or for Starbuck's or SBC, for that matter?

Of course, neither of us can speak for Google, which returns me to my other question: Do you ever find that when you come up with a personalization algorithm for one user or set of users, it actually makes things worse for other users? Given that each user is almost their own, separate "vertical", with separate training data and separate quirks, what is your philosophical approach to developing a general solution?

For the moment, I think Google's experimentation is mainly intended to adapt and improve algorithms that work well on small data sets to the massive data sets they have. There has been surprisingly little research work on that, so much exploration is needed to learn what works and what doesn't.

I suppose it is possible that, as this continues, there could be some work that does less to advance the state of the art. For example, someone might run an automated, dumb process that, given a data set, tries every possible combination of every parameter on every conventional algorithm, finally picking the one that performs best on this data.

I don't think that is what Google is doing but, if someone did do that, it would be overfitting the problem.

Greg, I think you're right that the main focus of Google's current experimentation is throwing more data at existing algorithms. And I agree that it is very interesting and should indeed be done. Norvig even showed a data plot at PARC on Thursday showing 4 or 5 algorithms and their absolute performance scores as you increased the size of the data set. Things were all over. The worst algs on small data got better, the best algs got worse, the middle algs got worse then better, etc. Very useful, very good of Google to examine something like that.

That said, what prompted me to write so much was this comment from Norvig about what happens when you only have small data: Is it just that Google has more data and more machines? But it couldn't be just the more data because [in some competitions] ... everyone got the same data. So, I think that having more machines is a very important part because it allows us to turn around the experiments much faster than the other guys. So, it's not the online performance where you are actually doing the search that matters, but it's the -- gee, I have an idea, I think we should change this -- and we can get the answer in two hours which I think is a big advantage over someone else who takes two days.

While this rapid iteration is not "every possible combination of every parameter on every conventional algorithm", it certainly rapid iteration of a lot of combinations of a lot of algorithms with a lot of parameters.. simply because big machines on small data means they can. And do.