How to build your own Watson Jeopardy! supermachine

To rule humanity, download the following open source code...

Ask mom for her credit card...

However, unless you're a geek sheik or a billionaire bit-basher, you're obviously not going to buy all this iron. But you could use your mom's credit card if you're working from your basement bedroom – or your wife's if you're working from your man cave in the garage – to reserve some server instances on Amazon's EC2 compute cloud.

We would go for the Cluster Compute Instances that Amazon announced last July. Each Cluster Compute Instance delivers 33.5 EC2 compute units of power, running in 64-bit mode, and presents 23GB of virtual memory to the operating system (that's not much). The physical hardware underneath each CCI slice is a two-socket x64 server based on Intel's 2.93GHz Xeon X5570 processors.

That means each slice has 8 cores, 16 threads, and 23GB of memory. The nodes are interconnected with 10 Gigabit Ethernet switches. To match the core count of the Power-based Watson machine, you'd need 360 of these slices. To match the thread count, you'd need 720 slices. And to match the aggregate main memory, you'd need 712 boxes. So it looks like 720 boxes will do the trick, provided that the overhead of the Xen-based Amazon EC2 hypervisor is not too high. At $1.60 per hour for the CCI slices, you are in for $1,152 per hour. Trust me, your ma or your wife won't mind. It's all for the benefit of science.
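For anyone who wants to check the sums before handing over the credit card, here is a minimal Python sketch of the sizing arithmetic. Every figure comes from the paragraph above; the function and constant names are made up for illustration, and the "implied totals" are simply the slice counts multiplied back out, not official Watson specs.

```python
# Back-of-the-envelope sizing for a DIY Watson on EC2 Cluster Compute Instances.
# All numbers are taken from the article text; nothing here is measured.

CORES_PER_SLICE = 8
THREADS_PER_SLICE = 16
MEMORY_GB_PER_SLICE = 23
PRICE_CENTS_PER_SLICE_HOUR = 160  # $1.60, kept in cents to avoid float rounding

# Slices needed to match Watson on each resource, as quoted in the article.
SLICES_TO_MATCH = {"cores": 360, "threads": 720, "memory": 712}


def implied_totals(slices_to_match):
    """Multiply the slice counts back out to see what totals they imply."""
    return {
        "cores": slices_to_match["cores"] * CORES_PER_SLICE,
        "threads": slices_to_match["threads"] * THREADS_PER_SLICE,
        "memory_gb": slices_to_match["memory"] * MEMORY_GB_PER_SLICE,
    }


def cluster_size(slices_to_match):
    """The cluster must satisfy the most demanding resource."""
    return max(slices_to_match.values())


def hourly_cost_dollars(slices):
    """Hourly bill in dollars for the given number of slices."""
    return slices * PRICE_CENTS_PER_SLICE_HOUR / 100


if __name__ == "__main__":
    size = cluster_size(SLICES_TO_MATCH)
    print("Implied totals:", implied_totals(SLICES_TO_MATCH))
    print("Slices to reserve:", size)
    print("Hourly cost: $%.2f" % hourly_cost_dollars(size))
```

The thread count is the binding constraint, which is why 720 slices is the number to reserve. The sketch ignores any hypervisor overhead, just as the estimate above does.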

The thing that makes Watson a question-and-answer machine and not just a cluster running Linux is a mountain of code that IBM has developed called DeepQA. IBM has said very little publicly about the DeepQA stack. Two key elements of the DeepQA stack are open source programs available through the Apache Software Foundation.

The first is Apache Hadoop, the open source distributed data-crunching system created by Doug Cutting after he read about Google's back-end infrastructure. Hadoop joined the Apache incubator program in 2005 and was a workable system by about 2008.

The other key piece of code in the DeepQA stack that Watson ran is Apache UIMA – Unstructured Information Management Architecture – which is an information-management framework created by IBM database gurus back in 2005 to help them cope with unstructured information such as text, audio, and video streams. The UIMA code performs the natural-language processing (NLP is the term of art in AI) that parses text and helps Watson figure out what a Jeopardy! clue is about.

IBM has embedded UIMA functions in various systems programs it sells, the first being the OmniFind semantic search engine that Big Blue put into its DB2 data warehouses. IBM has proposed UIMA as an OASIS standard, and took it open source to get people on board with its way of creating frameworks for managing unstructured data. UIMA has frameworks for Java and C++, but could no doubt be extended to whatever language you wanted to code your Watson QA machine in.

Gondek tells El Reg that IBM used Prolog to do question analysis. Some Watson algorithms are written in C or C++, particularly where the speed of the processing is important. But Gondek says that most of the hundreds of algorithms that do question analysis, passage scoring, and confidence estimation are written in Java. So maybe you want to use a RHEL-JBoss stack for your Watson.

Now here is the real problem with a DIY Watson: the algorithms that IBM's DeepQA team created to teach Watson how to play Jeopardy! consist of about a million lines of code. That's going to take you and your friends a bit more than a few weekends to create. But, if you do it, you can launch a deep analytics startup and sell it to HP or Microsoft for ba-zillions.

Let me offer you a few pointers from Gondek for when you build your machine. First, don't stuff it full of anything you can find on the Internet. In creating Watson, IBM's researchers figured out that authoritative texts like the Oxford English Dictionary, Bartlett's Familiar Quotations, Wikipedia – yes, Wikipedia – and various encyclopedias were the data sets best suited to playing Jeopardy!. You want precise data, to be sure, but you don't want to surround it with so much extraneous text that the machine has to churn through tons o' text to find an answer.

For example, you don't put in Moby Dick itself. Instead, you put in lots of authoritative texts that talk about Moby Dick, and pull out the important passages. As it turns out, Watson needed about 200 million pages of text, or roughly the equivalent of 1 million books, to play Jeopardy!.

The other key insight that Gondek offers is to really focus on the question-parsing algorithms. By finding out what the key words are in any sentence and dispensing with the noise, you can not only get to the answer faster, but do a better job of coming up with the correct answer.
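To give a flavor of what "dispensing with the noise" might look like at its very crudest, here is a toy Python sketch that strips stop words from a clue and keeps what is left. To be clear, this is not how DeepQA does it: the stop-word list, the function name, and the sample clue are all invented for illustration, and real question analysis involves far more than throwing away small words.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would need
# a much larger one, plus proper linguistic analysis.
STOP_WORDS = frozenset({
    "a", "an", "the", "this", "that", "these", "those",
    "of", "in", "on", "at", "to", "for", "by", "with", "from",
    "is", "was", "are", "were", "be", "been",
    "it", "its", "his", "her", "and", "or", "as",
})


def key_words(clue):
    """Return the clue's non-stop-word tokens, lowercased, in first-seen order."""
    tokens = re.findall(r"[a-z0-9']+", clue.lower())
    seen = set()
    result = []
    for token in tokens:
        if token in STOP_WORDS or token in seen:
            continue  # skip noise words and repeats
        seen.add(token)
        result.append(token)
    return result


if __name__ == "__main__":
    # Made-up clue, not taken from any real Jeopardy! game.
    clue = "This white whale is the title character of a novel by Herman Melville"
    print(key_words(clue))
```

Fewer, better search terms mean less text to churn through, which is the speed half of Gondek's point. The accuracy half is where the million lines of code come in.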

These two insights are what turned Watson from a crap Jeopardy! player into a champion. Good luck building your own. And dominating the world. ®