Jeopardy, IBM, and Wolfram|Alpha

January 26, 2011

About a month before Wolfram|Alpha launched, I was on the phone with a group from IBM, talking about our vision for computable knowledge in Wolfram|Alpha. A few weeks later, the group announced that they were going to use what they had done in natural language processing to try to make a system to compete on Jeopardy.

I thought it was a brilliant way to showcase their work—and IBM’s capabilities in general. And now, a year and a half later, IBM has built an impressive level of anticipation for their upcoming Jeopardy television event. Whatever happens (and IBM’s system certainly should be able to win), one thing is clear: what IBM is doing will have an important effect in changing people’s expectations for how they might be able to interact with computers.

When Wolfram|Alpha was launched, people at first kept on referring to it as a “new search engine”—because basically keyword search was the only model they had for how they might find information on a large scale. But IBM’s project gives a terrific example of another model: question answering. And when people internalize this model, they’ll be coming a lot closer to realizing what’s possible with what we’re building in Wolfram|Alpha.

So what really is the relation between Wolfram|Alpha and the IBM Jeopardy project?

IBM’s basic approach has a long history, with a lineage in the field of information retrieval that is in many ways shared with search engines. The essential idea is to start with textual documents, and then to build a system to statistically match questions that are asked to answers that are represented in the documents. (The first step is to search for textual matches to a question—using thesaurus-like and other linguistic transformations. The harder work is then to take the list of potential answers, use a diversity of different methods to score them, and finally combine these scores to choose a top answer.)
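The retrieve, score, and combine steps described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not IBM’s system: the three-sentence corpus, the candidate list, the two scorers (`overlap_score` and `type_score`), and the hand-picked weights are all hypothetical stand-ins chosen for the example.

```python
# Hypothetical toy corpus standing in for the textual documents.
DOCUMENTS = [
    "Bram Stoker wrote the novel Dracula, published in 1897.",
    "Mary Shelley wrote the novel Frankenstein, published in 1818.",
    "Dracula is a Gothic novel set partly in Transylvania.",
]

# Hypothetical candidate answers; a real system would extract these from the documents.
CANDIDATES = ["Bram Stoker", "Mary Shelley", "Transylvania"]


def tokens(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def overlap_score(question, candidate):
    """Scorer 1: best word overlap between the question and any document naming the candidate."""
    q = tokens(question)
    best = 0.0
    for doc in DOCUMENTS:
        if candidate.lower() in doc.lower():
            best = max(best, len(q & tokens(doc)) / len(q))
    return best


def type_score(question, candidate):
    """Scorer 2: crude answer-type check (assumption: two-word candidates are people)."""
    wants_person = bool({"who", "author"} & tokens(question))
    looks_like_person = len(candidate.split()) == 2
    return 1.0 if wants_person == looks_like_person else 0.0


# Hypothetical weights; a real system would tune these on training data.
WEIGHTS = [(overlap_score, 0.7), (type_score, 0.3)]


def rank(question):
    """Combine the individual scores into one weighted total per candidate, best first."""
    scored = []
    for candidate in CANDIDATES:
        total = sum(weight * scorer(question, candidate) for scorer, weight in WEIGHTS)
        scored.append((total, candidate))
    return sorted(scored, reverse=True)


def answer(question):
    """Return the top-ranked candidate."""
    return rank(question)[0][1]


print(answer("Who wrote the novel Dracula?"))  # prints "Bram Stoker"
```

The point of the sketch is only the shape of the computation: no single scorer decides the answer, and the final choice comes from a weighted combination of several weak signals.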

Early versions of this approach go back nearly 50 years, to the first phase of artificial intelligence research. And incremental progress has been made—notably as tracked for the past 20 years in the annual TREC (Text Retrieval Conference) question answering competition. IBM’s Jeopardy system is very much in this tradition—though with more sophisticated systems engineering, and with special features aimed at the particular (complex) task of competing on Jeopardy.

Wolfram|Alpha is a completely different kind of thing—something much more radical, based on a quite different paradigm. The key point is that Wolfram|Alpha is not dealing with documents, or anything derived from them. Instead, it is dealing directly with raw, precise, computable knowledge. And what’s inside it is not statistical representations of text, but actual representations of knowledge.

The input to Wolfram|Alpha can be a question in natural language. But what Wolfram|Alpha does is to convert this natural language into a precise computable internal form. And then it takes this form, and uses its computable knowledge to compute an answer to the question.
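As a rough illustration of the idea of turning free-form text into a precise internal form and then computing on it, here is a deliberately tiny Python sketch. It says nothing about how Wolfram|Alpha is actually implemented: the single regular-expression pattern, the `Comparison` form, and the three-entry `KNOWLEDGE` table are all assumptions made up for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical miniature knowledge base: entity -> property -> value.
KNOWLEDGE = {
    "gold": {"atomic number": 79},
    "iron": {"atomic number": 26},
    "oxygen": {"atomic number": 8},
}


@dataclass(frozen=True)
class Comparison:
    """Precise internal form for 'which has the greater/smaller P, A or B?'."""
    prop: str
    direction: str  # "greater" or "smaller"
    entities: tuple


# A single hypothetical pattern; a real grammar would cover vastly more phrasings.
PATTERN = re.compile(
    r"which has the (greater|smaller) (.+?), (\w+) or (\w+)\??$",
    re.IGNORECASE,
)


def parse(query):
    """Convert free-form text into a Comparison, or None if it is not understood."""
    match = PATTERN.match(query.strip())
    if match is None:
        return None
    direction, prop, first, second = (group.lower() for group in match.groups())
    return Comparison(prop, direction, (first, second))


def compute(form):
    """Evaluate the internal form against the knowledge base (KeyError if data is missing)."""
    values = {entity: KNOWLEDGE[entity][form.prop] for entity in form.entities}
    pick = max if form.direction == "greater" else min
    return pick(values, key=values.get)


form = parse("Which has the greater atomic number, gold or iron?")
print(form)
print(compute(form))  # prints "gold"
```

Note that the answer is not stored anywhere as text; it is computed from the stored values once the question has been reduced to a structured form.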

There’s a lot of technology and new ideas that are required to make this work. And I must say that when I started out developing Wolfram|Alpha I wasn’t at all sure it was going to be possible. But after years of hard work—and some breakthroughs—I’m happy to say it’s turned out really well. And Wolfram|Alpha is now successfully answering millions of questions on the web and elsewhere about a huge variety of different topics every day.

And in a sense Wolfram|Alpha fully understands every answer it gives. It’s not somehow serving up pieces of statistical matches to documents it was fed. It’s actually computing its answers, based on knowledge that it has. And most of the answers it computes are completely new: they’ve never been computed or written down before.

In IBM’s approach, the main part of the work goes into tuning the statistical matching procedures that are used—together in the case of Jeopardy with adding a collection of special rules to handle particular situations that come up.

In Wolfram|Alpha most of the work is just adding computable knowledge to the system. Curating data, hooking up real-time feeds, injecting domain-specific expertise, implementing computational algorithms—and building up our kind of generalized grammar that captures the natural language used for queries.

In developing Wolfram|Alpha, we’ve been steadily building out different areas of knowledge, concentrating first on ones that address fairly short questions that people ask, and that are important in practice. We’re almost exactly at the opposite end of things from what’s needed in Jeopardy—and from the direct path that IBM has taken to that goal. There’s no doubt that in time Wolfram|Alpha will be able to do things like the Jeopardy task—though in an utterly different way from the IBM system—but that’s not what it’s built for today.

(It’s an interesting metric that Wolfram|Alpha currently knows about three quarters of the entities that arise in Jeopardy questions—which I don’t consider too shabby, given that this is pretty far from anything we’ve actually set up Wolfram|Alpha to do.)

In the last couple of weeks, though, I’ve gotten curious about what’s actually involved in doing the Jeopardy task. Forget Wolfram|Alpha entirely for a moment. What’s the most obvious way to try doing Jeopardy?

What about just using a plain old search engine? And just feeding Jeopardy clues into it, and seeing what documents get matched. Well, just for fun, we tried that. We sampled randomly from the 200,000 or so Jeopardy clues that have been aired. Then we took each clue and fed it as input (without quotes) to a search engine. Then we looked at the search engine result page, and (a) saw how frequently the correct Jeopardy answer appeared somewhere in the titles or text snippets on the page, and (b) saw how frequently it appeared in the top document returned by the search engine. (More details are given in this Mathematica notebook [download Player here]. Obviously we excluded sites that are specifically about Jeopardy!)

If nothing else, this gives us pretty interesting information about the modern search engine landscape. In particular, it shows us that the more mature search systems are getting to be remarkably similar in their raw performance—so that other aspects of user experience (like Wolfram|Alpha integration!) are likely to become progressively more important.

But in terms of Jeopardy, what we see is that just using a plain old search engine gets surprisingly far. Of course, the approach here isn’t really solving the complete Jeopardy problem: it’s only giving pages on which the answer should appear, not giving specific actual answers. One can try various simple strategies for going further. Like getting the answer from the title of the first hit—which with the top search engines actually does succeed about 20% of the time.
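The measurements described above (answer somewhere on the results page, answer in the top document, answer in the title of the first hit) can be sketched as follows. This is a minimal Python sketch, not the code from the notebook: the record layout and the three sample records are invented purely to exercise the functions, and no real search engine is called.

```python
def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return " ".join(cleaned.split())


def contains_answer(answer, text):
    """True if the answer occurs in the text as a whole-word phrase."""
    return f" {normalize(answer)} " in f" {normalize(text)} "


def evaluate(records):
    """Return the fractions of clues whose answer appears (a) anywhere in the
    titles or snippets of the results page, (b) in the top document, and
    (c) in the title of the first hit."""
    on_page = in_top_doc = in_top_title = 0
    for record in records:
        hits = record["hits"]
        page_text = " ".join(hit["title"] + " " + hit["snippet"] for hit in hits)
        if contains_answer(record["answer"], page_text):
            on_page += 1
        if hits and contains_answer(record["answer"], hits[0]["document"]):
            in_top_doc += 1
        if hits and contains_answer(record["answer"], hits[0]["title"]):
            in_top_title += 1
    n = len(records)
    return on_page / n, in_top_doc / n, in_top_title / n


# Hypothetical records: each holds the correct answer plus already-fetched hits.
SAMPLE = [
    {"answer": "Chicago",
     "hits": [{"title": "Airports of Chicago",
               "snippet": "Two airports serve the city.",
               "document": "Chicago has two major airports."}]},
    {"answer": "Bram Stoker",
     "hits": [{"title": "Dracula (novel)",
               "snippet": "A Gothic horror novel.",
               "document": "Dracula is an 1897 novel."},
              {"title": "Vampire fiction",
               "snippet": "Bram Stoker's novel shaped the genre.",
               "document": "A survey of vampire fiction."}]},
    {"answer": "Toronto",
     "hits": [{"title": "Generals of the war",
               "snippet": "A list of commanders.",
               "document": "Biographies of commanders."}]},
]

page_rate, top_doc_rate, top_title_rate = evaluate(SAMPLE)
print(page_rate, top_doc_rate, top_title_rate)
```

On real data one would first draw a random sample of clues, fetch the results for each, and exclude Jeopardy-specific sites before building records like these.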

But ultimately it’s clear that one’s going to have to do more work to actually compete on Jeopardy—which is what IBM has done.

So what’s the broader significance of the Jeopardy project? It’s yet another example of how something that seems like artificial intelligence can be achieved with a system that’s in a sense “just doing computation” (and as such, it can be viewed as yet another piece of evidence for the general Principle of Computational Equivalence that’s emerged from my work in science).

But at a more practical level, it’s related to an activity that has been central to IBM’s business throughout its history: handling internal data of corporations and other organizations.

There are typically two general kinds of corporate data: structured (often numerical, and, in the future, increasingly acquired automatically) and unstructured (often textual or image-based). The IBM Jeopardy approach has to do with answering questions from unstructured textual data—with such potential applications as mining medical documents or patents, or doing ediscovery in litigation. It’s only rather recently that even search engine methods have become widely used for these kinds of tasks—and with its Jeopardy project approach IBM joins a spectrum of companies trying to go further using natural-language-processing methods.

When it comes to structured corporate data, the Jeopardy project approach is not what’s relevant. And instead here there’s a large industry based on traditional business intelligence and data mining methods—that in effect allow one to investigate structured data in structured ways.

And it’s in this area that there’s a particularly obvious breakthrough made possible by the technology of Wolfram|Alpha: being able for the first time to automatically investigate structured data in completely free-form unstructured ways. One asks a question in natural language, and a custom version of Wolfram|Alpha built from particular corporate data can use its computational knowledge and algorithms to compute an answer based on the data—and in fact generate a whole report about the answer.

So what kind of synergy could there be between Wolfram|Alpha and IBM’s Jeopardy approach? It didn’t happen this time around, but if there’s a Watson 2.0, it should be set up to be able to call the Wolfram|Alpha API. IBM apparently already uses a certain amount of structured data and rules in, for example, scoring candidate answers. But what we’ve found is that even just in natural language processing, there’s much more that can be done if one has access to deep broad computational knowledge at every stage. And when it comes to actually answering many kinds of questions, one needs the kind of ability that Wolfram|Alpha has to compute things.

On the other side, in terms of data in Wolfram|Alpha, we mostly concentrate on definitive structured sources. But sometimes there’s no choice but to try to extract structured data from unstructured textual sources. In our experience, this is always an unreliable process (achieving at most perhaps 80% correctness)—and so far we mostly use it only to “prime the pump” for later expert curation. But perhaps with something like IBM’s Jeopardy approach it’ll be possible to get a good supply of probabilistic candidate data answers—that can themselves be used as fodder for the whole Wolfram|Alpha computational knowledge engine system.

It’ll be interesting to see what the future holds for all of this. But for now, I shall simply look forward to IBM’s appearance on Jeopardy.

IBM has had a long and distinguished history of important R&D—something a disappointingly small number of companies can say today. I have had some good friends at IBM Research (sadly, not all still alive), and IBM as a company has much to be admired. It’s great to see IBM putting on such an impressive show, in an area that’s so close to my own longstanding interests.

48 comments

“And in a sense Wolfram|Alpha fully understands every answer it gives. It’s not somehow serving up pieces of statistical matches to documents it was fed. It’s actually computing its answers, based on knowledge that it has. And most of the answers it computes are completely new: they’ve never been computed or written down before.”

I would so love to understand how this is structured. Should have majored in computer science! Keep up the exciting work Wolfram! Someday we may be able to bite our own teeth

What an amazing universe we live in!

Also, the Times has a page where you can challenge IBM’s Watson to a Jeopardy match:

Re: the Bing stat question — I assume the statistic is properly interpreted as GIVEN the answer is in the search result page, what is the probability it is the very first item of the search result? In the case of Bing, 63% of the time the answer is in the search result page, and of those cases, 65% of the time it is the very first document.

The comparison is whether it appears on the search engine’s own results page (you can just look at the result page and see the answer) or whether it appears on the full document that the top result links to (if you know where to look).

In particular, for Google, it compares the two buttons on the front page, “Google Search” and “I’m Feeling Lucky,” in terms of whether the answer is on the resulting page.

In the first case, it’s whether the answer appeared in the results or not. In the Ask example, the answer appeared in the results 68% of the time.

In the second case, it’s “when the answer appeared, how often was it in the first result?” So in the Ask example, of the 68% of searches in which the answer appeared, it appeared in the first result 51% of the time.

Hmm, I also talked to some IBM folks. Don’t know if they are involved in Jeopardy, but they are definitely not using statistics.

“you hear can me now”

Information is data in context
Context is organized data
Learning is the self organization of data/points

So one wants a natural self organizing algorithm and the above problem solution is a fall out. Seemed to me IBM was mostly interested in a codified version of the def., some work I had done on a Context Oriented Language.

Do you know IBM’s system works the way you outlined it or are you assuming it?

For those wondering whether IBM Watson’s approach is statistical: of course it is. If looking at the way they score and rank answers during the Jeopardy show rehearsals does not convince you, you can dig into their technical documents, particularly their joint collaboration with the natural language group of Carnegie Mellon, whose approach is statistical machine learning.

Concerning the fact that Watson is not connected to the Internet, that’s right: it basically has a copy of it (or a subset) over which it applies information retrieval technology (basically the same type of technology as search engines) to extract the relevant data.

As the Wolfram Research employee who conducted the search engine research, let me clear up a persistent misinterpretation: the first graph shows the probability of finding the answer on the first page of search results, the second graph shows the probability that the answer is on the “I’m Feeling Lucky” result — the page one finds by following the top link on the search engine results.

In general, this “I’m Feeling Lucky” approach doesn’t perform as well, presumably because the search results themselves sample a greater diversity of content, and therefore are more likely to contain the answer in snippet summary form.

For those who would like to know more, there are plenty of interesting details in the hyperlinked notebook, including more statistical details about the results (and an entity classification of the types of answers).

@Taliesin Beynon: What you are claiming makes no sense. The graphs claim that, for Bing, there is a 63% chance that the result is on the first page of search results, and a 65% chance that the result is in the first link itself. Since the first link is presumably in the first page of results, there is no possible way that it should be higher based on the description you are giving.

Hi Stephen, to me IBM’s Watson is much, much more impressive than Wolfram Alpha. In fact, that is kind of what I expected from Wolfram Alpha: something that can actually answer questions. Unfortunately, it still feels a lot like just a search engine. Maybe slightly better than some of the current search engines, but still far away from answering questions.

To give a simple example, type into Wolfram Alpha: “which has the greater area, the atlantic or the pacific?” I think that something that can answer questions would answer Pacific. Instead, what Wolfram Alpha does is completely throws away the first part of the question, and just searches for the Atlantic and the Pacific. Somewhere in that data I can indeed find the answer, but along with tons of other information that I don’t care about, such as bordering countries and so on. So to me this is not impressive, and from what I saw in Jeopardy! I think that Watson would be able to answer this easily.

Also, since the first part of the question is just thrown away, I can write there anything I want. For example, if I ask, where is Cyprus, the Atlantic or the Pacific, the correct answer would be just neither. Instead, I again get a lot of useless information about the relative salinities of the two oceans, and such similar stuff.

Millions of people use our site every day to get answers to questions related to science, engineering, mathematics, and the humanities. For these people, Wolfram|Alpha is capable of answering their questions.

While it is true that Alpha can become confused by longer questions, it is perhaps worth going into detail about why. IBM’s Watson will often throw out the majority of information in a question and hone in on a few “clues”. Depending on the question, these might be dates, people, places, punning words, or question words such as “who” and “what”. These clues are then crunched using a variety of statistical algorithms to rank possible answers.

However, unlike Watson, Wolfram|Alpha insists on understanding in its entirety the question that was asked, so that we can give accurate and relevant results. If this does not work, we will sometimes return a “Did You Mean” result that will at least give the user some potentially relevant information, as well as expose what we *do* know about a topic, to inspire further questions. It turns out our users prefer this fallback to a page that says “we don’t understand your question”.

But while Wolfram|Alpha might be more pedantic about understanding questions than Watson, this reflects a strength, not a weakness. Alpha can compute new answers based on an *exact* understanding of the question. But because Watson can only perform a variety of domain-specific scoring algorithms to select an existing answer, it can easily deliver nonsensical results.

For example, Watson responded to the clue “To bring back someone to his original function or position” with “Reinstate 2”, no doubt because it performed a search on a dictionary and inadvertently included the beginning of one of the numbered definitions in its answer. It is, in a sense, an idiot-savant. We aim for Alpha to be a genuine genius.

We acknowledge that this approach is more difficult, but we believe it is far more likely to result in the long term in the kind of “giant artificial brain” of popular science-fiction, one that can answer any question with intelligent and useful computations, not just memorized answers.

As for your point that there are still many questions we cannot answer, we might point out in return that there are many more questions we *can* answer that we could not just one year ago. Alpha continues to improve, and unlike statistical approaches to question answering, is based on a solid foundation of curated symbolic knowledge that can continue to grow gracefully into the next decades of the 21st century.

Stephen, thanks for your post on this issue. Taliesin, thanks for keeping up with the comments.

I noticed in Tuesday’s Final Jeopardy (where Watson was given the clue “This U.S. city contains one airport that is named for a World War 2 general and another that’s named after a World War 2 battle”) that Watson returned the question of “what is Toronto?”

Since this result clearly didn’t meet any of the stipulations of the clue (1. Toronto isn’t a U.S. city, 2. Toronto’s major airport isn’t named after a WW2 general, 3. Toronto doesn’t have a second major airport), I’m left to wonder just how Watson arrived at this result.

Out of curiosity, I asked Alpha the same question. It responded with “World War II”. How would Alpha go about coming up with the proper answer (“Chicago”) in this situation?

thanks for your reply. I have no doubt that millions of people ‘use’ your site every day to ‘get answers’ to their questions, but that is much different from the site actually answering their questions (which is why I added the emphasis).

Indeed, billions of people use Google, Bing, Wikipedia… and many many other sites to get answers to their questions every day. But I thought Wolfram Alpha was supposed to be different than just a search engine. Sure, sometimes it is more useful than other search engines. I definitely find it very useful from time to time. But that doesn’t mean it’s displaying any intelligence, or understanding of the question. To me it seems to just have a more restricted and numerically oriented database that it searches.

If I look up some word in a dictionary (an actual paper dictionary), and I find the answer inside, sure you can say that the dictionary answered my question. But I can see no way that you can claim the dictionary ‘understands’ my question. And Wolfram Alpha seems just like an extended dictionary, which returns a bunch of numbers instead of definitions.

As I have pointed out before, and as Decade and other users have also pointed out, after entering a question into Alpha, it just strips away most of the question, and tries to find one or maximum two keywords. And that’s it. By no stretch can you then say that it “insists on understanding in its entirety the question that was asked, so that we can give accurate and relevant results”. It insists only on understanding one or two words in the question, and in that way it’s just like a dictionary or encyclopedia entry. Maybe you can claim it understands the answer (not the question) it gives, but only if your definition of understanding is “being able to perform mathematical operations”, because that’s all it can do with the answer.

Try this very simple example, so you see it’s not about the length of the question. Type in the word ‘big’. It gives you the definition of big. Then type ‘not big’. I would expect some information on the antonym of big. Instead it just gives the definition of ‘not’ instead, and completely throws away ‘big’. On the other hand, typing ‘antonym of big’ gives me what I want. So it has the necessary data to provide what I want, but is not able to understand a very simple negation.

Contrast that with the behavior of Watson. Sure it made some mistakes. Maybe as Douglas Hofstadter predicted it will turn out that if we are looking to create artificial intelligence it will have to make mistakes sometimes. But instead of focusing on the 3 or 4 mistakes it made, focus on all the subtleties of language it has to understand just to be able to know what it is we are asking it for. And it does it so well that it’s able to give the right answer the overwhelming majority of the time. Look at the final Jeopardy clue: “William Wilkinson’s ‘An Account of the Principalities of Wallachia and Moldavia’ inspired this author’s most famous novel.” To even understand what the question is asking for, an author who wrote a book that was inspired by another author’s book, is really subtle for a computer. And it cannot be arrived at by just throwing away 90% (or more) of the question like Alpha does.

Sorry for the long message, I would write even more but it’s already too long. But I’m not trying to put down your work, I find Alpha very useful from time to time. And I honestly do wish that it becomes better than Watson, since I can easily access it for free on the internet. It would be extremely useful if a website that is able to actually answer questions was available. But sadly right now Alpha is nowhere close to what we saw Watson do.

First, let me say that I (and probably most people) think Alpha is a completely brilliant product with many uses and it is a revolutionary system. That being said, I find your answers to its criticism both defensively unapologetic and unadmitting of its flaws. Every system has flaws. It’s how we correct those flaws and make better systems that leads to “genuine genius”. Instead of defending Alpha’s shortcomings by saying millions use it, and it actually understands, etc., you should be looking at why it fails when it does.

When the question is “Which is greater, a or b?” and Alpha replies with the standard information for both a and b, that isn’t a particularly great answer. Instead of defending this answer, you should acknowledge its limitations and take that as a challenge to improve Alpha, so it can properly answer questions of this nature.

I look forward to seeing future revolutions from Wolfram Research. Thanks for the great work!

Undoubtedly there is a need for precise, accurate and up-to-date information, and that seems to be the realm that Alpha addresses. What I need in a personal research assistant is actually a blend of Watson, Alpha and an agent-based system. Your description actually underscores how complementary Watson and Alpha seem to be.

When Alpha can’t provide the answer (as is often the case–you really have to know the nature of the questions Alpha will answer to find the tool useful), something like Watson could guess. As you point out, Google’s guesses are often useful, but Watson could deliver an answer and state the sources, rather than you having to dig through the sources to get to a likely answer.

I’m wondering how tools like Siri also fit into the mix here, as the point of all of this research is to take effective action, and getting machines to take more of the mundane actions for you would certainly help you be more effective.

Well I’m sure this is an impressive piece of engineering, but it has nothing to do with creative intelligence. Until computers start displaying creativity and intelligence above insect-level (to say nothing of Einstein-level), they remain glorified slide rules which pose no threat to the thinking members of our species. What humanity needs far more than more powerful computers is more powerful thinkers who can revolutionize human life with new energy sources, new space propulsion systems, new forms of computing, etc. — i.e. we need more physicists and fewer computer scientists, new Feynmans more than Watson/Wolfram Alpha/Google. Even in this age of Moore’s Law-driven cyber-hype, it’s still all about the human brain!

I think that we are misunderstanding the strengths of WolframAlpha. As an experiment, I asked both Wolfram and Google “Differences between apples and lettuce”. While Google spit out a bunch of search results, which provided some textual answers about recipes and many somewhat related responses that only dealt with one or the other, Wolfram supplied me with a data-based comparison of nutritional (and other) values. A device such as Watson would have a hard time quantifying the hundreds of differences between the two food items.

When using any search service you must know how to operate within its constraints. The strength of Google (and Bing, et al) is that it has access to a catalog of the world’s knowledge which it can present to you. Watson is able to decide which result best matches your answer. Wolfram is able to quantify and compute the results.

Some of the criticisms of W|A would not arise if it was recognised that W|A is poor at understanding freeform input.
However in its response it shows its interpretation in its own fixed-form. It also shows its assumptions as to the meaning of some words which are ambiguous and allows you to choose from the alternatives. With this feedback we can interact until it understands me correctly most of the time.

I would like to see W|A offer fixed-form input as an alternative. It would be unambiguous and based on grammatically correct English. An ambiguous word in my input would trigger a popup menu of alternative definitions from which I could choose.

Some criticisms do not recognise the complexity of the questions W|A can answer. It can answer one layer of questions and then answer a higher layer until it reaches the top, the layers being signified by the use of braces (()).

I find this a strange comparison, as the three competitors (Watson, search engines, and Wolfram Alpha) solve very different problems.

Watson solves the PR problem that IBM is no longer perceived as being at the cutting edge of tech, as it once was. Search Engines solve a navigation problem analogously to the index in the back of a book. And Wolfram seems to solve the same problem as those txt message services people use to cheat in pub quizzes.

Hi Dr. Wolfram,
I was also quite fascinated by Watson’s success at Jeopardy, and also intrigued by what may be possible if the resources behind Watson were adapted for building Wattminder.
Wattminder is a web startup that is focused on detecting and diagnosing faults in solar power plants, over the web. It is based on a ‘narrow’ knowledgebase and mathematics. It would be an almost trivial first commercial application for Watson, or Wolfram, given Wattminder’s web infrastructure of sensor instrumentation and algorithms.
I don’t have a way to approach you or your team at Wolfram.
Please let me know if you can refer my request to explore whether there may be enough interest and synergy to justify a collaboration with Wattminder.
Thanks and May the Sun always warm your face and your panels!
-Steve aka solarMD, PVSleuth

Many knowledge frontiers are now multi-disciplinary. Addressing a question that spans many domains needs the systematic integration of relevant knowledge from multiple ‘secondary’ domains into the primary domain of the investigator (which if done satisfactorily enables agreement and accelerated progress across the relevant team and sponsors); both Watson and W|A will be useful today but it seems more probable that W|A’s future growth will lead to an exponential increase in its ability to support these emerging requirements.
DavidPJ

Caught this from the Reddit AMA from Wolfram. The only thing I might add to the discussion is a review I came across from a librarian at an actual library. It was her opinion that students are idiots, finding bad results on their searches because they have no original interpretive context to work with, i.e. which sites are reliable, which sites are opinion, what information is factual, and what information is heavily slanted politically or socially. Basically, determining fact from fiction in their internet searches. The reliable vs. the unreliable.

These are children, mind you. They just don’t have the cognitive ability to discern what is likely reliable information from what is not reliable on the internet. The key point the librarian was making is that most children, in her experience helping them find things on the internet, just don’t have a framework of who to trust or why to trust. They are an “open book” to what they see on the internet, and that was a real problem from her perspective.

I’m into education, so I just thought I would throw this out there. I mean, just imagine countries that censor/block parts of the internet they “don’t like”. Combine this with the above and it becomes an even worse situation. Learning to properly search and find good answers should be a class learned early, early on IMO.

There is a third approach, represented by neither of these, yet smarter in the long run, and represented by cloud computing engines such as BioBike and SalesForce, which long predated either Watson (which is not a cloud engine) or WA. In this approach you have an open computing platform loaded with the data and offering numerous tools, which might include the Watson approach or the WA approach, or, more interestingly, combinations thereof created by the user. What you cannot say to either Watson or WA, and which is critical to being what one might call a “Really Useful Engine” (hey, maybe we should call it Thomas!), is:

Please do this arbitrary computation…
[Which in the very simplest case could be
a lookup, as Watson, or a computation, as WA]
Now take THAT RESULT and do the following with it…

and so on.

The ability to do open-ended NEW and specific computations, and then to use the results of these in later computations, is what would make a Really Useful Engine.