Bug lets humans grab Daily Double as Watson triumphs on Jeopardy

Though Jennings got the final Jeopardy question right, he knew he'd been defeated when Watson scored the last Daily Double. His answer shows his concession.

Note: In this article, Jeopardy's "answers" are referred to as "questions" and vice versa.

The humans tried to hold on in the second game of Jeopardy against the IBM computer, but ultimately were no match. Watson finished with a two-game total of $77,147 to Ken Jennings' $24,000 and Brad Rutter's $21,400. Jennings and Rutter managed to make a larger dent in Watson's progress in the second game, but the computer managed to take both Daily Doubles away from the human contestants, not affording them enough of an opportunity to make up for Watson's $25,000 lead from the first game. Still, there were a few aspects of the game that gave the humans some ins, including a bug that let Ken Jennings score the first Daily Double.

During a panel at Rensselaer Polytechnic Institute's Experimental Media and Performing Arts Center, Dr. Chris Welty, a member of Watson's algorithms team, noted that the start-and-stop nature of filming the episode got Watson mixed up and allowed a bug to surface. Watson begins every round looking for Daily Double clues, because they are crucial to progress in the game. After one filming pause in the first round, Welty said, Watson resumed play thinking the Daily Double had already been found. So it stopped looking for the clue, allowing Jennings to find it first.

"They were having a lot of problems in that particular round and they kept stopping," Welty said. "There was still a Daily Double left in that round, and the front end that keeps track of the game state had thought the Daily Double was already revealed." Because Watson thought the Daily Double was gone, it started working its secondary strategy of selecting the lowest level clues to allow it to learn about a category. This left Jennings free to sort through the remaining higher value clues where the Daily Double was, allowing him to pick it up while Watson was cherry picking the top rows.
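As a rough illustration of the behavior Welty describes, the sketch below picks a clue differently depending on whether the game-state tracker believes the Daily Double is still on the board. It is a minimal Python sketch, not IBM's code: the `Clue` type, the `pick_clue` function, and the assumption that "hunting" means going for the highest-value clue are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clue:
    category: str
    value: int  # dollar value; low values sit in the top rows of the board


def pick_clue(board, daily_double_believed_found):
    """Choose the next clue from the remaining board (hypothetical rule).

    If the tracker believes the Daily Double is gone, fall back to the
    cheapest clue to learn about a category. Otherwise hunt among the
    expensive clues (an assumption about where Daily Doubles tend to be).
    """
    if not board:
        raise ValueError("no clues left on the board")
    if daily_double_believed_found:
        return min(board, key=lambda clue: clue.value)
    return max(board, key=lambda clue: clue.value)
```

With a stale flag set to true while a Daily Double is still hidden, this rule keeps choosing low-value clues and leaves the high-value ones to the other players, which matches the outcome described above.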

Another of Watson's biggest weaknesses was laid bare by a category from the first round, "Actors Who Direct." The questions in the topic were shorter than standard clues, usually only the names of two movies pointing to one man, and didn't give enough time for Watson to process and hit the buzzer first. "The answers were not ready in time because the questions were so quick," said Chris Welty. "One of the things that Watson actually doesn't know is that it's losing the buzzer because its answers aren't ready."

Not only was this bad from a score standpoint, but it formed a vicious circle for Watson's clue selection. Welty pointed out that Watson will select clues from categories based on where it's getting responses correct, which it was in the case of Actors Who Direct, but Watson doesn't get any information on whether its right answers are actually allowing it to buzz in first and get the points. "It's going to keep going back because it's getting all the right answers," Welty said.
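The vicious circle can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Watson's actual selection logic: the `CategoryStats` class and both ranking methods are hypothetical names. It contrasts ranking categories by how often the answer was right with ranking them by how often a right answer also won the buzzer.

```python
from collections import defaultdict


class CategoryStats:
    """Hypothetical per-category bookkeeping for clue selection."""

    def __init__(self):
        self.attempts = defaultdict(int)
        self.correct = defaultdict(int)
        self.buzz_wins = defaultdict(int)

    def record(self, category, was_correct, won_buzz):
        self.attempts[category] += 1
        self.correct[category] += int(was_correct)
        # Points only arrive when the answer is right AND the buzz is won.
        self.buzz_wins[category] += int(was_correct and won_buzz)

    def favorite_by_accuracy(self):
        # The behavior described in the article: only correctness counts.
        return max(self.attempts,
                   key=lambda c: self.correct[c] / self.attempts[c])

    def favorite_by_points(self):
        # A ranking that would also see the lost buzzer races.
        return max(self.attempts,
                   key=lambda c: self.buzz_wins[c] / self.attempts[c])
```

Both ranking methods raise `ValueError` if nothing has been recorded yet. A category where every answer is right but every buzz is lost tops the first ranking and sits at the bottom of the second.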

Aside from issues of timing, Watson's algorithms worked well in the sense that it was very rarely certain of a wrong answer. On answers it was certain of, it nearly always beat Jennings and Rutter to the buzzer; if the answer didn't turn up a high-confidence response, as was often the case with subtly worded questions, Watson would remain silent.
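The buzz-or-stay-silent behavior amounts to a confidence threshold. The Python sketch below is a simplified illustration: the `decide` function, the candidate dictionary, and the 0.5 threshold are assumptions made for the example, not values reported by IBM.

```python
def decide(candidates, threshold=0.5):
    """Return the top-scoring response, or None to stay silent.

    `candidates` maps each candidate response to a confidence in [0, 1].
    The threshold value is a placeholder chosen for illustration.
    """
    if not candidates:
        return None
    answer, confidence = max(candidates.items(), key=lambda item: item[1])
    return answer if confidence >= threshold else None
```

Under this rule a player is rarely certain of a wrong answer only if the confidence scores themselves are well calibrated; the threshold just decides how much risk to accept.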

That's not to say there weren't outliers—Watson was occasionally unsure of answers that were correct. For example, in a Daily Double question on art from the first game, Watson came up with the correct answer, Baghdad, but with only 32 percent confidence. And as happened with the infamous Final Jeopardy question from the first game, Watson seems to struggle with the relationship that categories can have to a correct response. In the topic "On the Keyboard" during the second game, the clue "A loose-fitting dress hanging straight from the shoulders to below the waist," prompted Watson to ask "What is a chemise?" The correct response was the dress shape and keyboard key "shift."

But in regular Jeopardy rounds, Watson was able to learn during the game based on previous answers in the category what type of answer was required. For example, in the first Jeopardy game, Watson eventually figured out—albeit a bit late—that the "Name that Decade" category did, in fact, want a decade as the answer. Even Watson's handlers were impressed: "It actually kind of figured out on its own that decades were important," Dr. Adam Lally, a senior software engineer from IBM, said.

Towards the end of the panel, Welty and Lally were prompted to discuss the choice of gender for Watson's voice, which is currently of the smooth, genial male variety. "We did experiment a lot with female voice as well," Welty said. "But the speech software we had, the way you could change the settings of the voice, and I mean this in the best possible way, it just was not possible to get a female voice that wasn't a little bit grating." This drew sounds of ire from the crowd, but Welty added that having the voice operate in lower ranges made it easier to soften, and that both men and women on the development team preferred the male voice.

Watson's machine learning may come in handy in the future its creators envision for it, which includes medical diagnoses and tech support. Of course, phone or voice input is out of the question for now, as parsing sounds isn't something Watson can currently do. But with text input, Watson could do great things from an information standpoint, especially given that it is able to find high-level connections between tiny details.

As a result of Watson's two-game win, 100 percent of its prize money, $1 million, will be donated to charity. Jennings and Rutter walk away with $300,000 and $200,000, respectively, and each is donating half of their prize to a charity of his choice.

60 Reader Comments

Jennings and Rutter walk away with $300,000 and $200,000, respectively, and each is donating half of their prize to a charity of his choice.

Everyone in my family watching this (the interest in this challenge from all generations was astonishing in itself) kind of shook our heads at that. "Those nice boys should have donated it all" was a comment from the grey-haired section of the room.

"The correct response was the dress shape and keyboard key "shift," offered by Rutter after Watson got it wrong."

Hate to nitpick, but I'm pretty sure no one got that question correct. I seem to recall Trebek revealing the answer.

Watson's language comprehension is certainly impressive, but I would have liked to see more linguistically challenging categories. I know that the point of the show was to demonstrate what Watson could do, so they're not going to include anything that it definitely couldn't handle, but I still would have liked to see it attempt something like a fill-in-the-blank category, one with multi-part answers, or one where the answer is two rhyming words. All are staples of Jeopardy, and would have been fascinating to see how Watson handled them.

Edit: Also well done to Ken Jen for the Deep Space Homer reference, and kudos to ars for the picture.

In case anyone is curious, a kindly YouTube user (who is doubtless an affiliate of Sony Television and doing so in an official, sanctioned capacity) has uploaded the Watson episodes. They can be found here: http://www.youtube.com/user/Rashad8821

Yeah -- unless Watson had been specifically instructed as to what the "Before & After" category means, I think it would have bonked hard on it.

Actually he was programmed to be able to handle that category, as well as I think 6 others. They added special profiles for common categories that were hard for the standard profile to handle. They did discover that for the most part having those was of little use. Apparently at one point in the process of making Watson he had a lot more special profiles but it actually lowered his accuracy. Remember this game had to meet US game show regulations so IBM had no way of knowing what questions they would be receiving, plus they trained him on every question on jeopardy for the past 20 or so years....thousands of questions of all types.

Of course, phone or voice input is currently out of the question, as parsing sounds isn't something Watson can currently do. But with text input, Watson could be able to do great things from an information standpoint, especially given that it is able to find high-level connections between tiny details.

I will donate my G2 for its voice-to-text capabilities, if IBM thinks it'll help..

Jennings and Rutter walk away with $300,000 and $200,000, respectively, and each is donating half of their prize to a charity of his choice.

Everyone in my family watching this (the interest in this challenge from all generations was astonishing in itself) kind of shook our heads at that. "Those nice boys should have donated it all" was a comment from the grey-haired section of the room.

Those nice boys also have a life and other things going on, so the money was a motivation to actually do the show and to work hard towards winning while still providing something for charity. Also there is nothing that says they can't give the money toward charity at a later date.

phone or voice input is currently out of the question, as parsing sounds isn't something Watson can currently do.

Given that:
1: most smartphones today can do reasonably good voice recognition (camera OCR too), and
2: IBM has 20 years of experience with voice recognition software, then
there's no valid technical reason why Watson couldn't have done AV input. Money isn't a valid reason either; it would take maybe a few weeks of one mid-level developer's time to write, and a cheap laptop (how about a Thinkpad?) to sit on the podium, which is trivial compared to the millions spent building and programming Watson itself.

plus they trained him on every question on jeopardy for the past 20 or so years....thousands of questions of all types.

I find the fact that you used him instead of it quite interesting. Watson is clearly an example of what modern AI can do, and gives us a real example that scifi depictions of self-aware AI may indeed someday be possible; in response it is easy for us humans to lend Watson human pronouns such as him.

A seemingly innocuous point I know, but yet still very compelling IMO. Cool stuff indeed.

I would have liked to have seen more categories like before and after, rhyme time, not an X (not sure how to name this, but where they list a few things and the correct question is the one that is not an X) that presumably would have been harder for Watson to interpret.

Towards the end of the panel, Welty and Lally were prompted to discuss the choice of gender for Watson's voice, which is currently of the smooth, genial male variety. "We did experiment a lot with female voice as well," Welty said. "But the speech software we had, the way you could change the settings of the voice, and I mean this in the best possible way, it just was not possible to get a female voice that wasn't a little bit grating."

They should have just gone the SHODAN/GLaDOS route and instead of trying to make the voice pleasing, left it as grating as possible. Poor inflection and random tonal shifts are a bonus. That would have let those meat sacks know what they were up against...

More seriously, I find the proposed uses to be quite interesting. Obviously, there would be problems it could not solve, but if having a Watson-like system as a sort of level 0 support cut down on a significant number of phone calls it could be very beneficial. Of course, you would have to be careful that it did not give away (at least immediately) that it was a robot, to avoid the "Oh, God, I am not good with computers" reactions.

phone or voice input is currently out of the question, as parsing sounds isn't something Watson can currently do.

Given that:
1: most smartphones today can do reasonably good voice recognition (camera OCR too), and
2: IBM has 20 years of experience with voice recognition software, then
there's no valid technical reason why Watson couldn't have done AV input. Money isn't a valid reason either; it would take maybe a few weeks of one mid-level developer's time to write, and a cheap laptop (how about a Thinkpad?) to sit on the podium, which is trivial compared to the millions spent building and programming Watson itself.

Voice Recognition in the context of a game show is almost as difficult a task as the NLP Watson does. Not only would it have to recognize the words (easy), it would have to know when it was being talked to, or about. You neglect the role of body language (and where our eyes are looking, etc.) that humans use to determine if we're being talked to or not, and if we're understanding what we're hearing correctly or not.

Maybe next year though they'll add in voice and image/video recognition (I personally would watch WAAY more episodes with Watson).

plus they trained him on every question on jeopardy for the past 20 or so years....thousands of questions of all types.

I find the fact that you used him instead of it quite interesting. Watson is clearly an example of what modern AI can do, and gives us a real example that scifi depictions of self-aware AI may indeed someday be possible; in response it is easy for us humans to lend Watson human pronouns such as him.

A seemingly innocuous point I know, but yet still very compelling IMO. Cool stuff indeed.

While you do make a good point, I would like to point out that I do a lot of programming and as a cultural artifact we do tend to ascribe gender and intention to programs even though we know good and well they don't. What I found interesting is that people without my background do use him as well and I find myself wondering if maybe this program might have the tiniest sliver of intelligence after all even though it lacks understanding, intention or self awareness. I often find the definition of intelligence as human-like understanding to be a little weak.

Towards the end of the panel, Welty and Lally were prompted to discuss the choice of gender for Watson's voice, which is currently of the smooth, genial male variety. "We did experiment a lot with female voice as well," Welty said. "But the speech software we had, the way you could change the settings of the voice, and I mean this in the best possible way, it just was not possible to get a female voice that wasn't a little bit grating."

They should have just gone the SHODAN/GLaDOS route and instead of trying to make the voice pleasing, left it as grating as possible. Poor inflection and random tonal shifts are a bonus. That would have let those meat sacks know what they were up against...

Having Watson sound like the classic Cylon would have put a nice spin on things..

Towards the end of the panel, Welty and Lally were prompted to discuss the choice of gender for Watson's voice, which is currently of the smooth, genial male variety. "We did experiment a lot with female voice as well," Welty said. "But the speech software we had, the way you could change the settings of the voice, and I mean this in the best possible way, it just was not possible to get a female voice that wasn't a little bit grating." This drew sounds of ire from the crowd, but Welty added that having the voice operate in lower ranges made it easier to soften, and that both men and women on the development team preferred the male voice.

Another of Watson's biggest weaknesses was laid bare by a category from the first round, "Actors Who Direct." The questions in the topic were shorter than standard clues, usually only the names of two movies pointing to one man, and didn't give enough time for Watson to process and hit the buzzer first. "The answers were not ready in time because the questions were so quick," said Chris Welty. "One of the things that Watson actually doesn't know is that it's losing the buzzer because its answers aren't ready."

Not only was this bad from a score standpoint, but it formed a vicious circle for Watson's clue selection. Welty pointed out that Watson will select clues from categories based on where it's getting responses correct, which it was in the case of Actors Who Direct, but Watson doesn't get any information on whether its right answers are actually allowing it to buzz in first and get the points. "It's going to keep going back because it's getting all the right answers," Welty said.

This, along with some of the other problems it had, would be easily fixed by adding some feedback loops. For instance, in this case Watson would only need to know that A) it is consistently finding the correct answer within the time allotted for a response and B) it is not buzzing in early enough to capitalize on that.

Then program it to, after maybe two such clues, insta-buzz on that category regardless of confidence (or at a much lower confidence), then use the available time to further compute before responding.

Of course, improving the game AI would be beside the point. Watson still did what they set out to have it do, for the most part.
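The feedback rule proposed in this comment could look something like the Python sketch below. Everything in it is hypothetical: the `BuzzPolicy` class, its method names, and the threshold values are made up for illustration, and nothing here reflects how Watson was actually built.

```python
class BuzzPolicy:
    """Lower the buzz threshold in categories where buzzes keep being lost."""

    def __init__(self, normal_threshold=0.5, relaxed_threshold=0.2,
                 misses_before_relaxing=2):
        self.normal_threshold = normal_threshold
        self.relaxed_threshold = relaxed_threshold
        self.misses_before_relaxing = misses_before_relaxing
        self.missed = {}  # category -> count of correct-but-too-late answers

    def record_missed_buzz(self, category):
        """Call when the answer turned out right but was not ready in time."""
        self.missed[category] = self.missed.get(category, 0) + 1

    def threshold_for(self, category):
        if self.missed.get(category, 0) >= self.misses_before_relaxing:
            return self.relaxed_threshold
        return self.normal_threshold

    def should_buzz(self, category, confidence):
        return confidence >= self.threshold_for(category)
```

For example, after two calls to `record_missed_buzz("Actors Who Direct")`, a confidence of 0.3 is enough to buzz in that category while other categories keep the normal threshold. The trade-off is a higher risk of buzzing in with a wrong answer.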

While you do make a good point, I would like to point out that I do a lot of programming and as a cultural artifact we do tend to ascribe gender and intention to programs even though we know good and well they don't.

In the topic "On the Keyboard" during the second game, the clue "A loose-fitting dress hanging straight from the shoulders to below the waist," prompted Watson to ask "What is a chemise?" The correct response was the dress shape and keyboard key "shift."

My wife the fabric geek pointed out that Watson at least got the dress type right: a chemise is a shift.

In the topic "On the Keyboard" during the second game, the clue "A loose-fitting dress hanging straight from the shoulders to below the waist," prompted Watson to ask "What is a chemise?" The correct response was the dress shape and keyboard key "shift."

My wife the fabric geek pointed out that Watson at least got the dress type right: a chemise is a shift.

Sorry, but the discussion around this question/answer was useless for actually understanding what was going on. This made 3x more sense to me:

Watson was not infallible, however, answering one clue incorrectly — ironically, in the category "Also on your computer keys." Watson answered the clue, "A loose-fitting dress hanging straight from the shoulders to below the waist," with, "What is a chemise?" The correct answer was "What is "shift?"

In the topic "On the Keyboard" during the second game, the clue "A loose-fitting dress hanging straight from the shoulders to below the waist," prompted Watson to ask "What is a chemise?" The correct response was the dress shape and keyboard key "shift."

My wife the fabric geek pointed out that Watson at least got the dress type right: a chemise is a shift.

Ok, I didn't see it, and I'm NOT a fabric geek, but I just can't grok this question/answer..

The clue was: A loose-fitting dress hanging straight from the shoulders to below the waist.

And the correct answer would have been: What is a chemise and a shift???? What is a shifted chemise?? None of the discussion of this gives me any hint as to what they were looking for...

I think the idea was that the correct response would be a word that matched both the description in the clue and would be the name of a key on the keyboard. So "chemise" was "A loose-fitting dress hanging straight from the shoulders to below the waist" but there's no "chemise" key on your keyboard (maybe on one of those Optimus Maximus keyboards). "Shift" is both the key and a term for a loose-fitting dress. Sadly I was stuck at work when the episode aired locally so I don't know what else they had in the category. My guesses would be Insert, Home, F1, Tab and maybe Control or End.

In the topic "On the Keyboard" during the second game, the clue "A loose-fitting dress hanging straight from the shoulders to below the waist," prompted Watson to ask "What is a chemise?" The correct response was the dress shape and keyboard key "shift."

My wife the fabric geek pointed out that Watson at least got the dress type right: a chemise is a shift.

Sorry, but the discussion around this question/answer was useless for actually understanding what was going on. This made 3x more sense to me:

Watson was not infallible, however, answering one clue incorrectly — ironically, in the category "Also on your computer keys." Watson answered the clue, "A loose-fitting dress hanging straight from the shoulders to below the waist," with, "What is a chemise?" The correct answer was "What is "shift?"

3x more sense despite them getting the category wrong? The category was, as the Ars article stated, "On the Keyboard". In many categories in Jeopardy! you have to combine the clue in the answer with the clue in the category to get the right question. So while "chemise" might fit the answer, it doesn't fit the category. Not unless you have a "chemise" key on your keyboard.

Edit: oh never mind, Aleph_Xero beat me to it. I had restarted my reply after I noticed you had edited yours.

plus they trained him on every question on jeopardy for the past 20 or so years....thousands of questions of all types.

I find the fact that you used him instead of it quite interesting. Watson is clearly an example of what modern AI can do, and gives us a real example that scifi depictions of self-aware AI may indeed someday be possible; in response it is easy for us humans to lend Watson human pronouns such as him.

A seemingly innocuous point I know, but yet still very compelling IMO. Cool stuff indeed.

While you do make a good point, I would like to point out that I do a lot of programming and as a cultural artifact we do tend to ascribe gender and intention to programs even though we know good and well they don't. What I found interesting is that people without my background do use him as well and I find myself wondering if maybe this program might have the tiniest sliver of intelligence after all even though it lacks understanding, intention or self awareness. I often find the definition of intelligence as human-like understanding to be a little weak.

"Then program it to, after maybe two such clues, insta-buzz on that category regardless of confidence (or at a much lower confidence), then use the available time to further compute before responding."

I would have liked to have seen more categories like before and after, rhyme time, not an X (not sure how to name this, but where they list a few things and the correct question is the one that is not an X) that presumably would have been harder for Watson to interpret.

Gotta give the meatbags a chance, after all..

It definitely would've been interesting, as the weakness of the current Watson is that it doesn't really "figure things out" other than well known linguistics rules and more complex questions would push up against what rules they infused him with.

However the producers or whoever did claim that no special consideration was given to the selection of categories/items other than avoiding the video/audio questions.