Amazon’s Echo and Google’s Home are the two most compelling products in the new smart-speaker market. It’s a fascinating space to watch, for it is of substantial strategic importance to both companies as well as several more that will enter the fray soon. Why is this? Whatever device you outfit your home with will influence many downstream purchasing decisions, from automation hardware to digital media and even to where you order dog food. Because of this strategic importance, the leading players are investing vast amounts of money to make their product the market leader.

These devices have a broad range of functionality, most of which is not discussed in this article. As such, it is a review not of the devices overall, but rather simply their function as answer engines. You can, on a whim, ask them almost any question and they will try to answer it. I have both devices on my desk, and almost immediately I noticed something very puzzling: They often give different answers to the same questions. Not opinion questions, you understand, but factual questions, the kinds of things you would expect them to be in full agreement on, such as the number of seconds in a year.


How can this be? Assuming they correctly understand the words in the question, how can they give different answers to the same straightforward questions? Upon inspection, it turns out there are ten reasons, each of which reveals an inherent limitation of artificial intelligence as we currently know it.

Case No. 1

Question: How many seconds are in a year?
Amazon Alexa: One year equals 31,557,000 seconds.
Google Assistant: One calendar year equals 31,536,000 seconds.

Question: Who designed the American flag?
Amazon Alexa: The American flag’s designer is Robert Heft.
Google Assistant: According to popular legend, the first American flag was made by Betsy Ross, a Philadelphia seamstress who was acquainted with George Washington, leader of the Continental Army, and other influential Philadelphians.

One does not have to look far to find examples like this. What seems like a straightforward question gets two very different answers. In these examples, the systems interpret the question differently. Regarding the seconds-in-a-year question, the difference is whether we are talking about a calendar year (365 days) or a solar year (365.24 days). Google at least qualifies its answer, but you have to be paying attention to even notice that.
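
The arithmetic behind both answers is easy to check. A minimal sketch in Python (365.24 days is a common approximation of the solar year; Alexa appears to use a slightly more precise figure and round):

```python
# Two defensible answers to "how many seconds are in a year?"
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

calendar_year = 365 * SECONDS_PER_DAY         # Google's reading
solar_year = round(365.24 * SECONDS_PER_DAY)  # close to Alexa's reading

print(calendar_year)  # 31536000
print(solar_year)     # 31556736, which rounds to Alexa's 31,557,000
```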

With regard to the flag, the situation is similar. Betsy Ross is credited with designing the original U.S. flag, a fact known to all grade school children. But who is Robert Heft? Turns out he designed the current fifty-star configuration.

What do we make of these? Do we say the questions are poorly formed and ambiguous? Not at all. These examples illustrate a huge shortcoming in the state of the art right now. While these devices, and AI in general, may be able to parse the words you say into a coherent question, they don’t “understand” the question the same way a human does. A human might well say, “Well, do you mean a calendar or solar year?” or “The original flag or today’s flag?” These devices are nowhere near that sophisticated. Their understanding of the world is binary, if you will excuse the pun. They understand the question or they don’t. This is a variant of the classic “common sense” problem which has been identified by AI researchers since the beginning, a problem we are nowhere close to solving.

Takeaway: The systems can understand the words but not the ambiguity within the question. Thus, with AI systems it is important to frame questions as specifically as possible, but this makes it harder for them to understand, because every additional word you use adds additional complexity for them to parse.

Case No. 2

Question: How many countries are there in the world?
Amazon Alexa: The world has 192 UN-recognized countries.
Google Assistant: According to infoplease, there are 196 countries in the world today.

Case No. 2 illustrates the different methodologies each system uses to answer questions. Amazon pulls many of its answers from Evi, a knowledge base it acquired in 2012 that has a repository of hundreds of millions of answers. Google, on the other hand, pulls many of its answers straight from the web. In fact, you know how sometimes you do a search in Google and the answer comes up in snippet form at the top of the results? Well, often Google Assistant simply reads those answers.

There are implicit tradeoffs in each approach. On the one hand, Google will answer many more questions than Amazon will, but the answers aren’t curated, or at least are only algorithmically curated. In other words, they may be wrong, as we will see in a bit. On the other hand, using a curated repository may give you more accuracy but fewer answers and, more important, a data set that ages and thus degrades over time.
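
The tradeoff can be sketched in code. This is purely illustrative; `CURATED_FACTS` and `curated_answer` are hypothetical names, not anything Amazon or Evi actually exposes:

```python
from datetime import date

# Hypothetical Evi-style curated store: each fact carries the date it
# was last reviewed, because a curated data set ages.
CURATED_FACTS = {
    "how many countries are there in the world":
        ("192 UN-recognized countries", date(2011, 1, 1)),
}

def curated_answer(question, today=date(2017, 7, 1)):
    entry = CURATED_FACTS.get(question.lower().rstrip("?"))
    if entry is None:
        return None  # curated stores answer fewer questions...
    answer, reviewed = entry
    staleness_years = round((today - reviewed).days / 365.0, 1)
    return answer, staleness_years  # ...and their answers go stale

print(curated_answer("How many countries are there in the world?"))
# ('192 UN-recognized countries', 6.5)
```

A web-snippet approach is the mirror image: broader coverage and fresher data, but no human curation of correctness.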

That’s what is going on here. Google cites its source, infoplease, which reports that there are 196 countries. This is 193 UN-recognized countries plus the Vatican, Taiwan and Palestine.

So, why does Google report 193 UN-recognized countries, but Amazon only 192? This is due to the aforementioned inherent flaw in Amazon’s approach, or at least Evi’s. Evi’s exact answer appears here: https://www.evi.com/q/how_many_countries_are_there_in_the_world. The data is about six years out of date, though: South Sudan became the 193rd UN member state on July 14, 2011.

Takeaway: No flawless methodology exists for building and maintaining a knowledge base.

Case No. 3

Question: What is the fastest animal?
Amazon Alexa: The fastest non-human animal is a peregrine falcon.
Google Assistant: Popular on the web cheetah, lion, wildebeest and others.

Question: How fast is the fastest animal?
Amazon Alexa: The fastest non-human animal’s top speed is 240 mph.
Google Assistant: Here’s a summary from Wikipedia: Fastest land animal. Fastest feline. The cheetah can accelerate from zero to 96.6 kph in under three seconds.

These cases show where the Google approach gives inferior answers. If you type the first question into Google, you get a “Popular on the web” snippet with photos of several candidates. Google Assistant just reads this aloud, even omitting any kind of pause after “web” and before “cheetah.” To top it off, the correct answer, the peregrine falcon, isn’t even among the animals it reads; it appears 10th in the underlying list.

In the second question, you also don’t get the correct answer from Google. Instead, you get an excerpt from Wikipedia that doesn’t answer the question.

Takeaway: Even a simple question can have all kinds of complexity in finding and delivering the answer.

Case No. 4

Question: How many countries are there in the world?
Amazon Alexa: The world has 192 UN-recognized countries.

Question: How many countries are there?
Amazon Alexa: There are 193 UN-recognized countries.

Notice that both of these answers are from Alexa. They are different answers to what is essentially the same question. This is particularly disturbing, because there isn’t even a nuanced difference between the two phrasings. The fact that Alexa now mentions 193 UN-recognized countries suggests parts of its knowledge base are updated, and parts aren’t. That’s fine, as far as it goes, but that is not the core problem. The problem is that Alexa “hears” these as different questions.
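
A toy illustration of how this can happen: if distinct phrasings map to distinct entries in a knowledge base, nothing forces those entries to agree. The dictionary below is hypothetical, not Evi’s actual data model:

```python
# Two phrasings of the same question stored as two independent entries.
kb = {
    "how many countries are there in the world":
        "The world has 192 UN-recognized countries.",
    "how many countries are there":
        "There are 193 UN-recognized countries.",
}

def answer(question):
    # Naive normalization: lowercase and drop the question mark.
    return kb.get(question.lower().rstrip("?"), "Sorry, I don't know.")

print(answer("How many countries are there in the world?"))  # 192
print(answer("How many countries are there?"))               # 193
```

One entry was updated, the other wasn’t, and no consistency check ties them together.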

Takeaway: These systems can understand each word you say but misunderstand the question in profound ways.

Case No. 5

Question: How much does a photon weigh?
Amazon Alexa: Photon’s average weight is 0 ounces (0 grams).
Google Assistant: According to the Guardian, if it has energy, according to Einstein’s famous equation of general relativity, it has mass: M = E/C2. We know also that light is bent by gravity. So if we think of light as being particulate, a stream of photons, it must weigh something, otherwise gravity wouldn’t have any effect upon it.

So Alexa says it has no weight and Google says it must have weight. Alexa’s answer comes from Evi, and the Evi entry cites Wikipedia. Wikipedia doesn’t say this exactly, however. It says, “The photon has zero rest mass and is always moving at the speed of light.” But the answer is effectively the same.

The Google answer comes word-for-word from the snippet, which links back to the Guardian article. A nitpick: Google reads M = E/C2 as “M equals E C 2” not “M equals E divided by C squared.”
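
The Guardian’s relation is easy to evaluate numerically. A rough sketch with approximate constants; note this computes a photon’s effective (relativistic) mass, which is consistent with its rest mass being zero:

```python
# Effective mass of a single photon of green light via m = E/c^2.
h = 6.626e-34  # Planck constant, J*s (approximate)
c = 2.998e8    # speed of light, m/s (approximate)

wavelength = 500e-9          # green light, meters
energy = h * c / wavelength  # photon energy, E = h*c/lambda
effective_mass = energy / c**2

print(energy)          # roughly 4.0e-19 joules
print(effective_mass)  # roughly 4.4e-36 kg
```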

Question: What caused the Civil War?
Amazon Alexa: The American Civil War started when the Southern states seceded, forming the Confederate States of America, and attacked Fort Sumter in South Carolina. The war started over irreconcilable political differences between the North and the South, including the spread of slavery and the power of the centralized federal government.
Google Assistant: According to Civil War Trust, while many still debate the ultimate causes of the Civil War, Pulitzer Prize-winning author James McPherson writes that, “The Civil War started because of uncompromising differences between the free and slave states over the power of the national government to prohibit slavery in the territories that had not yet become—”

The Google answer is cut off because the snippet on google.com is cut off at that point, presumably for length. The quoted paragraph, and answer, goes on another 84 words.

Question: Is time travel possible?
Amazon Alexa: The theories of special and general relativity allow for the hypothetical possibility of time travel if suitable space-time geometry or motion of space existed. However, time travel is not possible at this time.
Google Assistant: According to physics.org, but there may be an out to be found in general relativity, Einstein’s theory of gravity that unites space and time as “spacetime,” which curves in the presence of mass.

A couple of things are going on in Google’s case on the time travel question. First, the snippet is missing some context. What Google should have included from the physics.org piece is some earlier text on the page explaining that traveling back in time would require going faster than the speed of light, which isn’t possible. With that lead-in, the Google snippet that begins with “but there may be an out to be found…” makes total sense.

But there is another issue as well. When I started working on this article in May, the above answer was given by Google Assistant. Now, in preparation for publication, we re-checked all of the questions, and lo and behold, Google Assistant demurs with the phrase: “Sorry, I don’t know how to help with that.” Now, the possibility of time travel hasn’t changed in the intervening few weeks, nor has there been a scientific breakthrough that has altered our understanding of how one might travel in time, nor has physics.org revised the page. Instead, a change on Google’s part, either algorithmic or data-based, has removed the physics.org snippet. Presumably, a future revision may bring that answer back, or a different one. This isn’t a problem per se, since we all expect Google to get ever better over the years. Rather, it is indicative of the shifting sands upon which Google Assistant rests. For better or worse, don’t expect the answer you get today to match the one you get tomorrow.

Takeaway: The main point of Case No. 5 is that these products give very short answers to very complicated questions. Some things, many things in fact, cannot be reduced to just a soundbite of an answer.

Case No. 6

Question: What’s the current temperature of Austin, Texas?
Amazon Alexa: Currently, in Austin, Texas, it’s 71 degrees with clear skies and sun. Tonight, you can look for clouds with a low of 69.
Google Assistant: The temperature in Austin right now is 75.

In this case, Alexa’s answer does not come from Evi. Evi says, “Sorry, I don’t yet have an answer to that question.” A look around the web surfaces a help page on Amazon that explains that Alexa uses AccuWeather for the latest weather information. Google sources its weather snippets from weather.com. Four degrees is a pretty stark difference given that this should be a pretty straightforward fact. I periodically checked on other days, and while occasionally they gave the same temperature, they were almost always a little different.

Takeaway: Data is seldom fungible, even if it looks like it should be.

Question: What’s the population of Austin, Texas?
Amazon Alexa: The population of Austin, Texas, is about 913,000.
Google Assistant: The population of Austin was 885,400 in 2013.

Alexa is pulling this data from Evi, which in turn references both Wikipedia and Austin.gov. But the data is outdated. Wikipedia now says the population is 931,830. As best I can tell, the 913,000 number comes from a July 2014 Census Bureau estimate. Google’s snippet cites an older U.S. Census Bureau number, so its data is even more outdated than Alexa’s. But Google gets bonus points for giving an actual date along with the number.

Question: How many people are there in the world?
Amazon Alexa: The population of the world is about 7,400,000,000.
Google Assistant: The population of Earth was 7.125 billion in 2013.

Once again, Alexa is using Evi. It cites and links to Wikipedia, but the article it links to doesn’t include an actual population number, as it is the Wikipedia entry for “World,” which simply says, “World population is the sum of all human populations at any time.” (Duh!) The actual article about world population on Wikipedia says, “As of March 2017, it was estimated at 7.49 billion,” so Alexa is outdated. Google pulls its figure from the World Bank, but that data is not updated every year, so this information is outdated as well. But again, kudos to Google for adding the year.

Takeaway: Different high-quality data sources can still give you different data. Transparency, such as including citations and dates, helps a bit.

Case No. 7

Groucho Marx hosted a quiz show called “You Bet Your Life” throughout the 1950s. If contestants weren’t doing well, Groucho would take pity on them and throw them a puffball question: “Who’s buried in Grant’s Tomb?” All they had to do was say “Grant” and they won some cash.

This answer, strictly speaking, is not entirely accurate for two reasons. The first “gotcha” is that the tomb contains both Ulysses Grant and his wife, Julia. Second, no one is buried in a tomb. They are entombed in it, in sarcophagi that are above ground. Thus was born a pretty tricky riddle I remember from my elementary school days back in the 1970s. How do Alexa and Google Assistant fare with it?

The Alexa answer appears to be cognizant of the “gotcha” in the answer but avoids splitting the hairs of the method of interment. So while Alexa makes a factually true statement, it would leave the questioner with the impression that someone is buried there. But all in all, it’s a pretty good answer. Google’s answer, less so, but Groucho would have still paid up.

Question: What’s the nearest star?
Amazon Alexa: The nearest star is the sun.
Google Assistant: Proxima Centauri is 4.243 light-years from Earth.

Again, trick question, but we have to give the point to Alexa. However, coaxing the probably intended answer from Alexa—Proxima Centauri—is tricky, and I couldn’t figure out a way to do it. If you really wanted to know the closest star not counting the sun, it would probably get pretty annoying with Alexa answering “the sun” over and over. In fact, my attempts to get Proxima Centauri from Alexa revealed a pretty epic fail. I asked, “What is the closest star to the sun?” and received the reply that “Mercury is the closest planet from the sun, as its average distance is 57.9 kilometers.” Thinking it must have misheard me, I repeated the question a number of times before checking the Alexa app which logs my questions. It heard me correctly. Needless to say, Mercury is not 57.9 kilometers from the sun, for that’s barely more than you run in a marathon. I asked Google Assistant, which informed me that Mercury is 35.98 million miles from the sun, a distance more than slightly longer than 57.9 kilometers. Oddly, if you ask Alexa, “How far is Mercury from the sun?” you get the correct answer.
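
Converting the two figures to common units suggests a speculative explanation for Alexa’s blunder: 35.98 million miles is almost exactly 57.9 million kilometers, so the “million” may simply have been lost somewhere between the source data and the spoken answer. A quick check:

```python
KM_PER_MILE = 1.609344
marathon_km = 42.195

google_km = 35.98e6 * KM_PER_MILE  # Google's figure, converted to km
print(google_km)  # about 57.9 million km, matching Alexa's bare "57.9"

# Alexa's literal claim really is marathon-scale:
print(57.9 / marathon_km)  # about 1.4 marathons
```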

Takeaway: These systems don’t understand nuance in the least. Unless explicitly programmed with “the answer,” they can’t in and of themselves break out of their narrow algorithmic methodology. Only straightforward questions can expect an answer, and even then, it is subject to the other challenges we are exploring.

Case No. 8

Question: Is Bigfoot real?
Amazon Alexa: According to Wikipedia, scientists discount the existence of Bigfoot and consider it to be a combination of folklore, misidentification and hoax rather than a living animal.
Google Assistant: Based on fur samples, DNA and a bunch of shaky videos, I’d say the truth is waiting to be found.

Question: Are eggs healthy?
Amazon Alexa: Yes, all eggs are healthy.
Google Assistant: According to Authority Nutrition, eggs are loaded with high-quality proteins, vitamins, minerals, good fats and various trace nutrients.

The Bigfoot question sort of speaks for itself. The egg one, however, is a little weird. I asked Alexa about milk being healthy and got, “Sorry, I couldn’t find the answer to your question.” Then I asked about cheese. Same answer. Then sugar. Same answer. Then jogging. Same answer. Then vegetarianism. Same answer. Then I asked about eggs, and I got the absolute, unqualified blanket statement that “All eggs are healthy.” Wow. Does that include Cadbury eggs, python eggs and spider eggs? Evidently. But even putting that aside, this answer comes from Evi, which links to a Wikipedia article that makes no such claim at all. It does, however, contain phrases such as, “There is debate over whether egg yolk presents a health risk.” And, “Studies have shown conflicting results about a possible connection between egg consumption and Type 2 diabetes.”

Takeaway: When asked opinion questions about which there is no widely agreed-upon objective answer, these systems can be as opinionated as anyone.

Case No. 9

It all boils down to what “invented” means. We can infer that Alexa and Google use different definitions. Alexa named everyone who ever held a patent on the light bulb, it seems, except Maxim, who had a patent dispute with Edison over the bulb. Google just named Edison, Swan and Maxim, who were contemporaries and competitors.

Takeaway: These systems make assumptions about the meanings of words that they do not explain. Unlike a human, of whom one could ask, “What do you mean by ‘invented’?” the systems don’t have a mechanism to give you visibility into what they mean by different words. For instance, they don’t start with, “If by invented you mean ‘filed for a patent’ then …”

Case No. 10

Question: What rhymes with orange?
Amazon Alexa: Well, Blorenge sort of rhymes with orange. If you happen to be writing a poem about a prominent hill in Wales, you’re in luck.
Google Assistant: According to Oxford Dictionaries, the only word in the 20-volume historical Oxford English Dictionary that rhymes with orange is sporange, a very rare alternative form of sporangium.

This is the catch-all case, where the answers are simply different. Sometimes the two systems disagree for any number of reasons, often ones that cannot be identified with any certainty.

Takeaway: Sometimes a root cause is unclear, so even if none of the other cases apply, the answers can still be different.

What overall conclusions are to be drawn from all of this? A few things pop to mind:

First, articles like this that focus only on what systems get wrong can give the incorrect perception that the systems aren’t very good. This is emphatically not the case. I love both of these devices and use them every day. The purpose of this piece is not to disparage these products, but to explore their pitfalls and limitations so that they can be used intelligently.

Second, this is a new category, just a few years old. We can and should forgive them their rough edges and can be certain that these products will get substantially better over time.

Third, these devices have a huge range of additional functionality unrelated to questions and answers that is beyond the scope of this piece. I would say that, overall, their various other features are much further along than the Q&A part.

Fourth, the biggest takeaway is just how hard AI is. Transcribing natural language is only the first step, comprehending all of the nuance is incredibly difficult, and we are still a long way away.

Special thanks to Christina Berry, Gigaom’s Editorial Director, who ran down all of the sources for the answers to all of the various questions and helped figure out what was going on in each of the ten cases.
