Wolfram Alpha a Google Killer? Not… Supposed… To… Be

I’m getting tired of reading about whether Alpha is a Google-killer. I’ve seen Stephen Wolfram’s presentations a couple of times; he’s quite careful to say that it isn’t. There’s a fundamental difference that many people out there are just missing. Google is a search engine. Alpha looks like a search engine, but it isn’t; it’s all about curated data, and the analysis of that data.

What’s the difference? Look at one simple query: “earth circumference”. Alpha gives you one result, translated into a couple of units, along with information about the exact data source. Google gives you “about 1,190,000” results. Some of them answer the question “what is the Earth’s circumference”; some answer other questions, like “how did Fermi propose computing the Earth’s circumference”; some are cute, maybe even useful, to a particular audience (I’m sure there are elementary school papers and science curriculum assignments buried in there); and some are probably just plain bogus (I bet you could find pages from the Flat Earth Society in those 1.2 million).

Asking which result is “right” misses the point. Google is a search engine; it did exactly what it’s supposed to do. It isn’t making any assumptions about what you’re looking for, and will give you everything the cat dragged in. If you’re an elementary school teacher or a flat-earther, you can find the result you want somewhere in the big, messy pile. If you want accurate data from a known and reliable source, and you want to use that data in other computations, you don’t want Google’s answer; you want Alpha’s. (BTW, the Earth’s circumference is .1024 of the distance to the Moon.)

When is this important? Imagine we were asking a more politically charged question, like the correlation between childhood vaccinations and autism, or the number of civilians killed in the Six-Day War. Google will (and should) give you a wide range of answers, from every part of the spectrum. It’s up to you to figure out where the data actually came from. Alpha doesn’t yet have data about autism or Six-Day War casualties, and even when it does, no one should blindly assume that all data that’s “curated” is valid; but Wolfram does its homework, and when data like this is available, it will provide the source. Without knowing the source, you can’t even ask the question.

Collecting and curating all the world’s data is an insanely ambitious project, but that’s only the start. The bigger problem is creating a common taxonomy that makes data useful. It was trivial to ask Alpha the ratio of the Earth’s circumference to the distance to the Moon, because the data is stored in a way that makes it easily accessible for computation. You can ask Google for web pages that contain the same data, but before you can use the data, you’ll have to do a lot of “screen scraping” that’s much more difficult than getting the data in the first place. Again, this isn’t to say that Google or Wolfram is right or wrong–it’s just that they’re answering different questions. I’m working with a couple of authors who’ve done some brilliant work with R that collects online foreclosure data and analyzes it. Most of the code, and certainly the most difficult code, is screen-scraping and data-scrubbing, not statistics or analysis. Search results, returned as a web page, and data that’s compute-ready aren’t the same thing.
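To make that gap concrete, here’s a minimal sketch of the difference between “data on a web page” and compute-ready data. The HTML snippet, the parser, and the curated record are all invented for illustration; real scraping is far messier, which is exactly the point.

```python
# Fragile path: scrape a number out of hypothetical HTML markup.
# Any change to the page layout breaks this parser.
from html.parser import HTMLParser

RAW_HTML = """
<html><body>
  <p>The circumference of the Earth is about <b>40,075 km</b>.</p>
</body></html>
"""

class NumberScraper(HTMLParser):
    """Pull the bold figure out of the page text."""
    def __init__(self):
        super().__init__()
        self.in_bold = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        if self.in_bold:
            # Strip the units and thousands separator before parsing.
            self.values.append(float(data.replace(",", "").replace("km", "").strip()))

scraper = NumberScraper()
scraper.feed(RAW_HTML)
scraped_km = scraper.values[0]

# Compute-ready path: the curated record is already a number with units
# and a source attached, so there is nothing to scrape or scrub.
curated = {"earth_circumference_km": 40075.0, "source": "hypothetical curated table"}

print(scraped_km, curated["earth_circumference_km"])
```

The scraper is longer than the computation it enables, which mirrors the foreclosure project above: most of the work is getting the data, not analyzing it.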

Why would we care about making the world’s data accessible to computation? At O’Reilly’s FOO camps, we’ve been talking a lot about “citizen science”–for example, Cornell’s many birding projects. But citizen science is usually about creating the data–counting the birds in your back yard, and so on. That’s great, but the analysis is still done by professionals. Putting a computation engine together with a curated, structured data source takes citizen science a step further. With all the panic about Swine Flu, I’ve been thinking about data from the 1918 flu epidemic. With time-sequenced, location-specific data (how many people are sick at any given time in any given city), it would be fun to study how the flu spread. This particular data isn’t yet available in Alpha (Stephen Wolfram, take note!), but when the data becomes available, creating an animation that shows the geographical distribution of flu cases over time should be easy; you could watch the flu move from city to city (or not). If Alpha’s not up to the task, it can be done simply enough with a Mathematica/Alpha bridge. I’m not an epidemiologist, and I won’t pretend that this animation would reveal anything fundamental, but I also believe that the world is full of under-analyzed data. Citizen data analysis? This is a New Kind of Science indeed.
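A toy sketch of what that citizen analysis might look like once the data exists: given (week, city, cases) records, pivot them into per-week snapshots, which is exactly the shape an animation frame needs. The case counts below are invented for illustration, not 1918 data.

```python
# Pivot flat (week, city, cases) records into animation-ready frames.
from collections import defaultdict

records = [
    (1, "Boston", 120), (1, "New York", 10),
    (2, "Boston", 450), (2, "New York", 200), (2, "Philadelphia", 15),
    (3, "Boston", 300), (3, "New York", 900), (3, "Philadelphia", 400),
]

# frames[week][city] = cases; each week is one geographic snapshot.
frames = defaultdict(dict)
for week, city, cases in records:
    frames[week][city] = cases

# "Playing" the frames in order shows the epidemic moving between cities.
for week in sorted(frames):
    snapshot = frames[week]
    peak = max(snapshot, key=snapshot.get)
    print(f"week {week}: peak in {peak} ({snapshot[peak]} cases)")
```

With real curated data, each snapshot would feed a map renderer instead of a print statement, but the pivot step is the same.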

Wolfram|Alpha builds a big library of commonly used, composable transforms of its curated data, ultimately into presentations. No big deal there, either – “here’s two tables, join them”, “here’s a table, draw a line graph”, etc.

There is a big combinatorial space of ways to compose and apply the data transforms to the library of data. That space is roughly the space of Mathematica programs written over that library. That’s the target language. We’ll compile things to that target language and then Mathematica can run what we compiled.

On top of that we have a high-level, precise query language. This is not the query language you type into the box. This precise language doesn’t necessarily make it possible to write programs describing all meaningful compositions of the Mathematica library over the data, but it covers a lot of cases.

The icing on the cake – sitting above the precise query language – is an NLP-based query language that looks at a query and, rather than compiling to exactly one “correct” translation into the precise query language, offers a range of choices, ranked in heuristic order of what is likely to be most interesting.

On top of that, feedback from people exercising the software stack is used to better tune the heuristics for ranking the possible translations of a query, and to decide what to add to the library or the data.

Thus: it is a search engine, but what it searches is the space of translations from ambiguous queries to real programs; secondarily, it helps the authors search the space of what to add to the Mathematica library and the data sets.
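The layered pipeline this comment describes can be sketched in miniature: an ambiguous query is translated into several candidate “precise” programs, ranked by a heuristic score, rather than compiled to a single answer. The program catalog, descriptions, and scoring rule below are all invented for illustration; the real system is vastly more sophisticated.

```python
# Hypothetical catalog mapping precise-program templates to a
# description and a popularity weight used for ranking.
CATALOG = {
    "Circumference[Earth]": ("circumference of the earth", 0.9),
    "Circumference[Earth]/Distance[Earth, Moon]": ("earth circumference over distance to moon", 0.4),
    "Radius[Earth]": ("radius of the earth", 0.7),
}

def translate(query):
    """Return candidate precise programs ranked by a crude heuristic:
    keyword overlap with the description, weighted by popularity."""
    words = set(query.lower().split())
    ranked = []
    for program, (description, weight) in CATALOG.items():
        overlap = len(words & set(description.split()))
        if overlap:
            ranked.append((overlap * weight, program))
    ranked.sort(reverse=True)          # best-scoring translation first
    return [program for _, program in ranked]

candidates = translate("earth circumference")
print(candidates)
```

The feedback loop in the comment above would then adjust the weights (and grow the catalog) based on which candidate users actually pick.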

It has many elegant touches like carrying attribution information through calculations. It comes with arguably problematic terms of service and limits on software freedom.

See my Twitter posts for three errors in a row! WolframAlpha has a long way to go.

Falafulu Fisi

I believe that WolframAlpha is an inference engine (IE) that operates on a knowledge database for retrieval. The system becomes smarter as more knowledge is captured (knowledge acquisition) by the knowledge database. Google’s search engine doesn’t have an inference engine, while WolframAlpha has one (dubbed a computational knowledge engine), and this is the main difference. Inferencing is closer to how a human reasons every day, i.e., both deductively & inductively (compared to a non-inference system like Google PageRank), in conducting daily business in order to survive: cooking, driving to work, going to the toilet, vacuuming the house, etc. These daily actions require knowledge nuggets already stored in the individual’s knowledge database (neurons) to be manipulated by a reasoning process (i.e., inferencing, in computing terms), whereas a search engine doesn’t manipulate these facts and knowledge to synthesize entirely new knowledge.

A search engine is entirely different from a computational knowledge engine, which is operated by an inference engine over a knowledge database. In short, WolframAlpha can reason (infer or synthesize entirely new information from existing facts in the knowledge database), while Google PageRank cannot.
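To make the distinction this comment draws concrete, here is a toy forward-chaining inference engine: rules derive new facts from stored ones, which keyword lookup alone never does. The facts and the single rule are invented for illustration.

```python
# Stored facts as (relation, subject, object) triples -- hypothetical data.
facts = {("capital_of", "Paris", "France"), ("in_continent", "France", "Europe")}

def apply_rules(facts):
    """Rule: if X is the capital of Y, and Y is in continent Z,
    then X is in continent Z. Iterate to a fixed point."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (r1, x, y) in list(derived):
            for (r2, y2, z) in list(derived):
                if r1 == "capital_of" and r2 == "in_continent" and y == y2:
                    new = ("in_continent", x, z)
                    if new not in derived:
                        derived.add(new)
                        changed = True
    return derived

all_facts = apply_rules(facts)
# The engine synthesized a fact that was never stored explicitly.
print(("in_continent", "Paris", "Europe") in all_facts)
```

No document in the “knowledge base” says Paris is in Europe; the conclusion exists only because the rule was applied, which is the synthesis step the comment says a pure retrieval engine lacks.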

Jacem Yorob

Hi Mike, you do well to point out some key differences between WolframAlpha and a regular search engine, but some of your comparisons are a bit of a stretch, if not disingenuous:

Alpha gives you one result, translated into a couple of units, along with information about the exact data source. Google gives you “about 1,190,000” results

This comparison of 1 answer to 1.19 million isn’t really legit, because for most people and most queries, the first 3–10 results (if not the very first) are all they are going to need and use. It’s pretty much on the same order of magnitude, not the 6 orders of magnitude difference you imply.

Also, stating that Google “isn’t making any assumptions about what you’re looking for” is frankly rather insulting to the amazing work Google’s search engine algorithm engineers have done and continue to do. Google assumes that you want the relevant answers to a query, and does quite an amazing job ordering answers according to relevance.

That said, having a single relevant answer versus a list of pages containing relevant answers is a critical difference, and you are right to point it out. Whether Alpha will be able to extend its functionality to other domains beyond its already admirable work is still up for debate, but if it can, it would be an incredible resource. The mind boggles at the potential applications one could build on top of such a system if it were made general-purpose enough. I for one intend to pay attention to Alpha’s progress.

Falafulu Fisi

Jacem said… Also, stating that Google “isn’t making any assumptions about what you’re looking for” is frankly rather insulting to the amazing work Google’s search engine algorithm engineers have done and continue to do.

Jacem, you accuse Mike of insulting Google researchers; where the hell did you read that in Mike’s post? Now, I can see that you’re using your brain to infer (reason by deduction) that Mike in fact said what you accuse him of. You’re using your brain’s computational knowledge engine to infer a wrong conclusion from correct inputs: Mike posited facts in his post, and your brain’s knowledge engine twisted those inputs into a wrong conclusion. If you cannot infer correctly, then you operate exactly as Google PageRank does, i.e., with no reasoning involved. You should debate the issues, the WHYs and the IFs, between Google & WolframAlpha, and not accuse anyone of insulting others.

Jacem said… Google assumes that you want the relevant answers to a query, and does quite an amazing job ordering answers according to relevance.

First you should try to understand what Google PageRank is and how it operates. In PageRank, the user assumes no prior knowledge about the data; this is called unsupervised learning, or a self-discovery process. The fact that your search results from Google are relevant doesn’t imply that you (the user) formulated a prior hypothesis about the data itself. There is no prior knowledge, prior assumption, or prior hypothesis about the data; it is just pure blind faith, in the realm of the unsupervised learning domain.

Jacem Yorob

Hi Falafulu Fisi,

Thank you for your reply to my comment. Since you are quite familiar with the principles behind PageRank, you’ll know that Google’s search algorithm is much more today than the simple PageRank version which launched in 1998. That added functionality is, in part, what I was referring to when I mentioned the engineering talent at Google. Google today can display much more ‘reasoning-like’ functionality than its pure PageRank predecessor. Google’s search algorithm employs a lot of smart assumptions in order to answer queries effectively, and they are constantly improving them. We really have no idea precisely how much reasoning might be going on behind the scenes when a Google query is processed, so don’t be too quick to dismiss Google (or myself) as not being able to ‘infer correctly’.

That said, I would contend that even plain old simple PageRank performs an inference task. The principle behind PageRank is basically this: if lots of people find a webpage to be relevant, you can infer that a new person will find it relevant.
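That inference can be written down in a few lines. Here is plain PageRank as power iteration over a tiny link graph; the four-page graph is invented for illustration, and the damping factor 0.85 is the conventional choice.

```python
# Minimal PageRank by power iteration over a hypothetical link graph.
links = {            # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
damping = 0.85
rank = {p: 1.0 / len(pages) for p in pages}   # start uniform

for _ in range(50):  # iterate until the ranking settles
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outs in links.items():
        share = damping * rank[page] / len(outs)
        for target in outs:
            new_rank[target] += share
    rank = new_rank

# C receives links from three pages, so it ends up ranked highest:
# the "inference" that a new visitor will probably want C.
best = max(rank, key=rank.get)
print(best, round(rank[best], 3))
```

The loop only redistributes link weight until it stabilizes; whether that counts as reasoning in Falafulu’s sense, or just aggregation, is exactly the disagreement in this thread.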
