Overcoming Artificial Stupidity

April 17, 2012

Today marks an important milestone for Wolfram|Alpha, and for computational knowledge in general: for the first time, Wolfram|Alpha is now on average giving complete, successful responses to more than 90% of the queries entered on its website (and with “nearby” interpretations included, the fraction is closer to 95%).

I consider this an impressive achievement—the hard-won result of many years of progressively filling out the knowledge and linguistic capabilities of the system.

The picture below shows how the fraction of successful queries (in green) has increased relative to unsuccessful ones (red) since Wolfram|Alpha was launched in 2009. And from the log scale in the right-hand panel, we can see that there’s been a roughly exponential decrease in the failure rate, with a half-life of around 18 months. It seems to be a kind of Moore’s law for computational knowledge: the net effect of innumerable individual engineering achievements and new ideas is to give exponential improvement.
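To make the half-life arithmetic concrete, here is a minimal sketch. It assumes the failure rate follows a clean exponential with the 18-month half-life quoted above, starting from the roughly 10% failure rate implied by a 90% success rate. The function name and the chosen horizons are illustrative assumptions, and the printed numbers are an extrapolation rather than a forecast.

```python
HALF_LIFE_MONTHS = 18.0  # half-life quoted in the text
FAILURE_NOW = 0.10       # implied by a 90% success rate


def projected_failure_rate(months_from_now,
                           failure_now=FAILURE_NOW,
                           half_life=HALF_LIFE_MONTHS):
    """Failure rate after `months_from_now`, assuming pure exponential decay."""
    return failure_now * 0.5 ** (months_from_now / half_life)


# Each additional half-life cuts the remaining failures in half.
for months in (0, 18, 36, 54):
    rate = projected_failure_rate(months)
    print(f"{months:3d} months: failure {rate:.2%}, success {1 - rate:.2%}")
```

Under these assumptions the failure rate halves every 18 months: from 10% to 5%, then to 2.5%, and so on, which is why the curve looks like a straight line on a log scale.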

But to celebrate reaching our 90% query success rate, I thought it’d be fun to take a look at some of what we’ve left behind. Ever since the early days of Wolfram|Alpha, we’ve been keeping a scrapbook of our favorite examples of “artificial stupidity”: places where Wolfram|Alpha gets the wrong idea, and applies its version of “artificial intelligence” to go off in what seems to us humans to be a stupid direction.

Here’s an example, captured over a year ago (and now long-since fixed):

When we typed “guinea pigs”, we probably meant those furry little animals (which for example I once had as a kid). But Wolfram|Alpha somehow got the wrong idea, and thought we were asking about pigs in the country of Guinea, and diligently (if absurdly, in this case) told us that there were 86,431 of those in a 2008 count.

At some level, this wasn’t such a big bug. After all, at the top of the output Wolfram|Alpha perfectly well told us it was assuming “‘guinea’ is a country”, and offered the alternative of taking the input as a “species specification” instead. And indeed, if one tries the query today, the species is the default, and everything is fine, as below. But having the wrong default interpretation a year ago was a simple but quintessential example of artificial stupidity, in which a subtle imperfection can lead to what seems to us laughably stupid behavior.
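The idea of a default interpretation with offered alternatives can be sketched as a toy ranking. This is not how Wolfram|Alpha is actually implemented; the `Interpretation` class, the `choose_default` function, and all the scores below are hypothetical, made up only to show how a small shift in ranking flips the default while the alternative stays available.

```python
from dataclasses import dataclass


@dataclass
class Interpretation:
    description: str  # human-readable reading of the query
    score: float      # hypothetical plausibility score (invented for this sketch)


def choose_default(candidates):
    """Return (default, alternatives): the top-scoring reading and the rest."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    return ranked[0], ranked[1:]


# Made-up scores: a slightly miscalibrated ranking picks the country reading.
year_ago = [
    Interpretation("pigs in the country Guinea", 0.55),
    Interpretation("guinea pig (species specification)", 0.45),
]
# With better calibration, the species reading becomes the default.
today = [
    Interpretation("pigs in the country Guinea", 0.10),
    Interpretation("guinea pig (species specification)", 0.90),
]

for label, candidates in (("a year ago", year_ago), ("today", today)):
    default, alternatives = choose_default(candidates)
    others = ", ".join(a.description for a in alternatives)
    print(f"{label}: assuming {default.description}; alternative: {others}")
```

In both cases the same two readings are available; only the ordering changes, which is what makes the bug subtle in cause and conspicuous in effect.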

Here’s what “guinea pigs” does today—a good and sensible result:

Below are some other examples from our scrapbook of artificial stupidity, collected over the past 3 years. I’m happy to say that every single one of these now works nicely; many actually give rather impressive results, which you can see by clicking each image below.

There’s a certain humorous absurdity to many of these examples. In fact, looking at them suggests that this kind of artificial stupidity might actually be a good systematic source of things that we humans find humorous.

But where is the artificial stupidity coming from? And how can we overcome it?

There are two main issues that seem to combine to produce most of the artificial stupidity we see in these scrapbook examples. The first is that Wolfram|Alpha tries too hard to please—valiantly giving a result even if it doesn’t really know what it’s talking about. And the second is that Wolfram|Alpha may simply not know enough—so that it misses the point because it’s completely unaware of some possible meaning for a query.

Curiously enough, these two issues come up all the time for humans too—especially, say, when they’re talking on a bad cellphone connection, and can’t quite hear clearly.

For humans, we don’t yet know the internal story of how these things work. But in Wolfram|Alpha it’s very well defined. It’s millions of lines of Mathematica code, but ultimately what Wolfram|Alpha does is to take the fragment of natural language it’s given as input, and try to map it into some precise symbolic form (in the Mathematica language) that represents in a standard way the meaning of the input—and from which Wolfram|Alpha can compute results.

By now—particularly with data from nearly 3 years of actual usage—Wolfram|Alpha knows an immense amount about the detailed structure and foibles of natural language. And of necessity, it has to go far beyond what’s in any grammar book.

When people type input to Wolfram|Alpha, I think we’re seeing a kind of linguistic representation of undigested thoughts. It’s not a random soup of words (as people might feed a search engine). It has structure—often quite complex—but it has scant respect for the niceties of traditional word order or grammar.

And as far as I am concerned one of the great achievements of Wolfram|Alpha is the creation of a linguistic understanding system that’s robust enough to handle such things, and successfully to convert them to precise computable symbolic expressions.

One can think of any particular symbolic expression as having a certain “basin of attraction” of linguistic forms that will lead to it. Some of these forms may look perfectly reasonable. Others may look odd—but that doesn’t mean they can’t occur in the “stream of consciousness” of actual Wolfram|Alpha queries made by humans.

And usually it won’t hurt anything to allow even very odd forms, with quite bizarre distortions of common language. Because the worst that will happen is that these forms just won’t ever actually get used as input.

But here’s the problem: what if one of those forms overlaps with something with a quite different meaning? If it’s something that Wolfram|Alpha knows about, Wolfram|Alpha’s linguistic understanding system will recognize the clash, and—if all is working properly—will choose the correct meaning.

But what happens if the overlap is with something Wolfram|Alpha doesn’t know about?

In the last scrapbook example above (from 2 years ago) Wolfram|Alpha was asked “what is a plum”. At the time, it didn’t know about fruits that weren’t explicitly plant types. But it did happen to know about a crater on the moon named “Plum”. The linguistic understanding system certainly noticed the indefinite article “a” in front of “plum”. But knowing nothing with the name “plum” other than a moon crater (and erring—at least on the website—in the direction of giving some response rather than none), it will have concluded that the “a” must be some kind of “linguistic noise”, gone for the moon crater meaning, and done something that looks to us quite stupid.
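The “plum” failure can be illustrated with a toy model. It is a sketch under stated assumptions, not Wolfram|Alpha’s actual algorithm: the knowledge bases, the `interpret` function, the priors, and the penalty for proper names after an indefinite article are all hypothetical. The point it shows is that the article is a useful hint only when there is a competing meaning for it to favor.

```python
ARTICLES = {"a", "an", "the"}
FILLER = {"what", "is"}
NOISE = ARTICLES | FILLER


def interpret(query, knowledge_base):
    """Map a query to the best-scoring known meaning, or None if nothing matches.

    knowledge_base maps a lowercase name to a list of (kind, prior) pairs.
    """
    words = query.lower().split()
    has_indefinite_article = any(w in {"a", "an"} for w in words)
    name = " ".join(w for w in words if w not in NOISE)
    meanings = knowledge_base.get(name, [])
    if not meanings:
        return None

    def score(meaning):
        kind, prior = meaning
        # An indefinite article hints at a common noun, so penalize proper names.
        penalized = has_indefinite_article and kind.startswith("proper:")
        return prior * (0.5 if penalized else 1.0)

    return max(meanings, key=score)[0]


# Hypothetical knowledge two years ago: only the lunar crater is known.
kb_then = {"plum": [("proper: lunar crater", 0.2)]}
# Hypothetical knowledge today: the fruit is known as well.
kb_now = {"plum": [("proper: lunar crater", 0.2), ("common: fruit", 0.8)]}

print(interpret("what is a plum", kb_then))  # the crater wins by default
print(interpret("what is a plum", kb_now))   # the fruit wins
```

With only the crater in the knowledge base, the penalty from the article “a” changes nothing, because there is no other candidate; the article is effectively treated as noise. Once the fruit is known, the same linguistic machinery picks the sensible reading.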

How can Wolfram|Alpha avoid this? The answer is simple: it just has to know more.

One might have thought that doing better at understanding natural language would be about covering a broader range of more grammar-like forms. And certainly this is part of it. But our experience with Wolfram|Alpha is that it is at least as important to add to the knowledgebase of the system.

A lot of artificial stupidity is about failing to have “common sense” about what an input might mean. Within some narrow domain of knowledge an interpretation might seem quite reasonable. But in a more general “common sense” context, the interpretation is obviously absurd. And the point is that as the domains of Wolfram|Alpha knowledge expand, they gradually fill out all the areas that we humans consider common sense, pushing out absurd “artificially stupid” interpretations.

Sometimes Wolfram|Alpha can in a sense overshoot. Consider the query “clever population”. What does it mean? The linguistic construction seems a bit odd, but I’d probably think it was talking about how many clever people there are somewhere. But here’s what Wolfram|Alpha says:

And the point is that Wolfram|Alpha knows something I don’t: that there’s a small city in Missouri named “Clever”. Aha! Now the construction “clever population” makes sense. To people in southwestern Missouri, it would probably always have been obvious. But with typical everyday knowledge and common sense, it’s not. And just like Wolfram|Alpha in the scrapbook examples above, most humans will assume that the query is about something completely different.

There’ve been a number of attempts to create natural-language question-answering systems in the history of work on artificial intelligence. And in terms of immediate user impression, the problem with these systems has usually been not so much a failure to create artificial intelligence but rather the presence of painfully obvious artificial stupidity. In ways much more dramatic than the scrapbook examples above, the system will “grab” a meaning it happens to know about, and robotically insist on using this, even though to a human it will seem stupid.

And what we learn from the Wolfram|Alpha experience is that the problem hasn’t been our failure to discover some particular magic human-thinking-like language understanding algorithm. Rather, it’s in a sense broader and more fundamental: the systems just didn’t know, and couldn’t work out, enough about the world. It’s not good enough to know wonderfully about just some particular domain; you have to cover enough domains at enough depth to achieve common sense about the linguistic forms you see.

I always conceived Wolfram|Alpha as a kind of all-encompassing project. And what’s now clear is that to succeed it’s got to be that way. Solving a part of the problem is not enough.

The fact that as of today we’ve reached a 90% success rate in query understanding is a remarkable achievement—that shows we’re definitely on the right track. And indeed, looking at the Wolfram|Alpha query stream, in many domains we’re definitely at least on a par with typical human query-understanding performance. We’re not in the running for the Turing Test, though: Wolfram|Alpha doesn’t currently do conversational exchanges, but more important, Wolfram|Alpha knows and can compute far too much to pass for a human.

And indeed after all these years perhaps it’s time to upgrade the Turing Test, recognizing that computers should actually be able to do much more than humans. And from the point of view of user experience, probably the single most obvious metric is the banishment of artificial stupidity.

When Wolfram|Alpha was first released, it was quite common to run into artificial stupidity even in casual use. And I for one had no idea how long it would take to overcome it. But now, just 3 years later, I am quite pleased at how far we’ve got. It’s certainly still possible to find artificial stupidity in Wolfram|Alpha (and it’s quite fun to try). But it’s definitely more difficult.

With all the knowledge and computation that we’ve put into Wolfram|Alpha, we’re successfully making Wolfram|Alpha not only smarter but also less stupid. And we’re continuing to progress down the exponential curve toward perfect query understanding.

9 comments

I may have overlooked something, but it seems to me that the issue is as much a question of knowing too much as of knowing too little.

Throughout the article you point out that W|A doesn’t know what a plum is, doesn’t know what a guinea pig is, etc. While this is true, to me there’s another problem: it knows too many things that nobody else knows. I didn’t know about the Polar BEAR project. I didn’t know about the Plum crater.

More important than knowing everything about the things the person who wrote the query knows is, I think, knowing all the things he’ll never ask you a question about, because he’s unlikely to know about the subject.

If my brother asks me a question about CKY, I’m not going to explain everything I know about the algorithm, because I don’t expect him to know about it. I may just ask whether he could be talking about that, or, if he replies no, say that I don’t know anything about it.

More about the “knowability” problem: how likely someone is to know about something changes rapidly. For example, imagine that only 10 people know about the Plum crater today, but that in a month we discover some primitive form of extraterrestrial life in that crater. In one day, the news would spread and almost anybody could ask questions about it.

This is a known problem for search engines, and I wonder whether W|A might benefit from working together with a search engine (Bing+Powerset or Google+Freebase) to learn how many people search for a certain keyword and, if possible, the disambiguations used for those keywords. This evolves at a rapid pace, and maybe W|A is simply too small to know about such things. It may also help to answer questions on subjects W|A simply doesn’t know about.

Maybe creating profiles would help, too. Not everybody is interested in astronomy, but maybe people doing astronomy are more likely to know about the Plum crater.

Here’s a behaviour I’ve been wondering about for a while that you seem to touch on in this article.

Say I ask Alpha “number of words in the English language”. Alpha is able to give a perfectly good answer – it’s clear that it understands the question and it has the knowledge to answer it. But if I try ostensibly the same question with a different language, say German, Alpha appears flummoxed.

This is interesting because it seems to the user that Alpha hasn’t understood the question, whereas because I asked the first question I can infer that the real problem is it doesn’t know enough about German. I wonder if providing feedback to the user about to what extent the input has been understood is something that you would consider important to include in a knowledge engine, as it seems to be important in human to human communication. For example could Alpha ask an intelligent question to clarify what the user meant in certain circumstances?

Also, I wonder whether this behaviour is just an artifact of Alpha trying a bit too hard to give some kind of answer, or whether its interpretation of the language semantics is inextricably linked to what knowledge it has?

Yes, the success rate for Wolfram|Alpha query responses is remarkable as a somewhat arbitrary 90% numerical milestone and an even more remarkable milestone for the progress of a New Kind of Science (NKS). A scant decade ago NKS computational irreducibility of natural language seemed to explain the discouraging prospects for solving the artificial intelligence problem, while at the same time NKS offered a vague promise that mining the computation universe might find simple rules that could simulate or even process natural language. Today NKS has brought us to the point where we can now dare to speak confidently about an 18-month half-life for banishing artificial stupidity. This rapid progress is illustrated by a brief timeline of Stephen’s public statements about the role of NKS in Wolfram Alpha:

2007. Quest for Ultimate Knowledge, Celebrating Gregory Chaitin’s 60th birthday, http://www.stephenwolfram.com/publications/recent/ultimateknowledge/
“If one chooses to restrict oneself to computationally reducible issues, then this provides a constraint that makes it much easier to find a precise interpretation of language. … I believe we are fairly close to being able to build technology that will [...] take issues in human discourse, and when they are computable, compute them. … And the consequence of it will be something [...] of quite fundamental importance. That we will finally be able routinely to access what can be computed about our everyday world.”

2009. First killer NKS app, http://blog.wolfram.com/2009/05/14/7-years-of-nksand-its-first-killer-app/
“Wolfram|Alpha is [...] still prosaic relative to the full power of the ideas in NKS. … It is the very ubiquity of computational irreducibility that forces there to be only small islands of computational reducibility—which can readily be identified even from quite vague linguistic input. … For now, for the first time, anyone will be able to walk up to a computer and immediately see just how diverse a range of possible computations it can do.”

2011. Computing and Philosophy, http://blog.stephenwolfram.com/2011/05/talking-about-computing-and-philosophy/
“[The NKS principle of computational equivalence implies] there is no bright line that identifies “intelligence”; it is all just computation. … That’s the philosophical underpinning that makes possible the idea that building a Wolfram Alpha isn’t completely crazy. Because if one had to build the whole artificial intelligence one knows that one is a long way from doing that. But in fact it turns out that there’s a more direct route that just uses the pure idea of computation.”

2012. Overcoming Artificial Stupidity, http://blog.stephenwolfram.com/2012/04/overcoming-artificial-stupidity/
“One might have thought that doing better at understanding natural language would be about covering a broader range of more grammar-like forms. … But our experience with Wolfram|Alpha is that it is at least as important to add to the knowledgebase of the system. … As the domains of Wolfram|Alpha knowledge expand, they gradually fill out all the areas that we humans consider common sense, pushing out absurd ‘artificially stupid’ interpretations.”

Regarding SELF AWARENESS, I am sure you have heard at least SOMETHING about the potential for higher but attainable forms of SELF AWARENESS unrelated to external computation, data gathering or analysis.
In particular the METHOD OF GURDJIEFF (sadly distorted and misunderstood by almost everybody who knows the name) is a profound way (I hesitate to say “method” because of the limitations it implies) to be SELF AWARE at a “quantum level” higher than the state of consciousness in which most of us pass our daily lives.
For someone with a “big mind” and likely a correspondingly “big” power of attention and concentration, the learning (NOT EASY, but oh so SIMPLE) of this method, experience of first “results” of application, and the ongoing and very demanding EFFORT to sustain this form of “more objective” self-consciousness, can lead to something far beyond the highest aspirations of “mind” -even the most brilliant of minds- in its present condition. It is not a matter of intelligence, per se, but one with intelligence, unquenchable wish, and relentless perseverance will find rewards not measurable in earthly or conventional religious ideation.
Hoping you will consider this, if you have not already done so.
With very best wishes, Mike Beigel

Fremy Company has a good point that the issue is as much a question of knowing too much as too little. WA needs to be able to establish a context for selecting appropriate input interpretations.

There are a number of possible approaches, such as learning a user’s areas of interest (as does Google, and various social networking sites), but it might be easier initially to provide a context selection field (perhaps as a branching tree of topics of increasing specialization) for the user to point WA in the right direction. Given a contextual field of interest, WA could then rank the input keywords & phrases with regard to their likelihood and so probable meaning in that context.

Suitable ranking data could be gleaned by an on-going automated search system, much like the popular search engines, trawling the internet noting the frequency of usage of words and phrases in various fields and assessing the level of specialization of the sites involved (this would be the tricky part). There is probably already a considerable amount of this kind of data collected for other purposes, e.g. search engines, so the technology is available to make this possible.

When WolframAlpha launched it got a lot of attention and people entered questions on a very wide variety of topics into it. Many of them did not get answers and presumably stopped using the site for those queries, but returned frequently for the things they saw the system could answer. This process would result in a growth in the percentage of answered queries even if no additional knowledge had been added.

It seems to me that this is where the problem is (uptake of input into tokens Alpha knows of, lack of ability to sectionalize). To pick which data, the idea of “best Google hit” (DBM search)

Google doesn’t need to understand you to be smart; it only needs to allow you to find a needle in a haystack. A proper library search would not hit only by content, but by all bibliography entries, i.e., “word1 AND word2 in TI” (in title, or formula).