As one of the founders of data warehousing back in the mid-1980s, a question I increasingly ask myself over 25 years later is: Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances? I'll pose this and related questions in this blog as I see industry announcements and changes in the way businesses make decisions. I'd love to hear your answers and, indeed, questions in the same vein.

Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation.

Barry's interest today covers the wider field of a fully integrated business, spanning informational, operational and collaborative environments and, in particular, how to present the end user with a holistic experience of the business through IT. These aims, and a growing conviction that the original data warehouse architecture struggles to meet modern business needs for near real-time business intelligence (BI) and support for big data, drove Barry’s latest book, Business unIntelligence: Insight and Innovation Beyond Analytics, now available in print and eBook editions.

Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.

Recently in Data and content Category

The announcements by HP yesterday set the Web rippling with the opinion that HP is pulling out of the consumer-facing business by dropping WebOS and TouchPad and spinning off its PC business. Probably of more interest to readers of BeyeNetwork, though, is HP's decision to acquire Autonomy for a cool $10.2B. Following on from HP's February purchase of Vertica, it seems fair to say that HP is moving (or returning?) strongly into the enterprise information management business.

As a long-time proponent of the view that the divisions between different "types" of data are breaking down rapidly, I am not surprised by the move. Autonomy uses the tag-lines "meaning based computing" and "human-friendly data" and focuses on what I call soft (or unstructured, as it's usually and misleadingly called) information. As I discussed in my last couple of posts on IDC's Digital Universe Study, this type of information represents an enormous and rapidly growing proportion of the information resource of the world, and one that requires a very different way of thinking about and managing it. And much of the interest in big data stems directly from the insight one can gain from mining and analyzing exactly this type of information. The acquisition of Autonomy gives HP a significant foothold in this soft information space, given Autonomy's positioning as a leader in the content management and related spaces by Gartner and Forrester.

I have long characterized the traditional approach to computing as being partitioned between operational, informational and collaborative. In the past, these areas have been developed separately, built on disparate platforms and supported by different parts of the IT organization, and they have ended up on users' desks as three sets of dis-integrated applications. Business intelligence, although receiving all its base data from the operational environment, operated as a stand-alone environment. HP bought into that environment with its Vertica acquisition. With the Vertica Connector for Hadoop, HP already has access to some of the big data / collaborative data area. However, the Autonomy acquisition takes the use and analysis of soft, collaborative information to an entirely new level. And we can speculate just how far HP will be able to go in aligning and perhaps integrating the functionality in these two areas.

While operational data is still very much the preserve of SAP and similar tools (not to mention home-grown applications from previous generations), the informational and collaborative worlds are growing ever more intertwined. It's into this converging arena that HP is clearly now throwing its hat, competing against the big players such as IBM, Microsoft and Oracle, who already have offerings spanning both areas, although with varying levels of integration. Teradata has also seriously entered this field with its recent acquisition of Aster Data. This arena is already populated with strong players.

So, while HP has acquired a strong and well-respected tool with inventive developers in Vertica and now a major player in the content market, I believe there remains a serious question about how easy it will be for them to gain traction in the information management market. I'll be looking out for some seriously innovative developments from HP to convince me that they can gain the respect of the BI and content communities and compete seriously with the incumbents.

"Quickly Watson, get your service revolver!" Is Watson about to put business intelligence out of its misery? Is the good doctor about to surpass Sherlock Holmes in his ability to solve life's enduring mysteries? Or are we in jeopardy of falling into another artificial intelligence rabbit hole?

Yes, I know. Although I haven't found a reference to prove it, I'm pretty sure that IBM Watson, the computer that recently won "Jeopardy!", is named after one of the founding fathers of IBM--Thomas J. Watson Sr. or Jr.--rather than Sherlock Holmes' sidekick. But the questions above remain highly relevant.

IBM Watson is, of course, an interesting beast. The emphasis in the popular press has been on the physical technology specs--10 refrigerator-sized cabinets containing approximately 3,000 CPU cores, 15 TB of RAM and 500 GB of disk running at about 80 teraflops, and cooled by two industrial air-conditioning units. But, in comparison to some of today's "big data" implementations, IBM Watson is pretty insignificant. eBay, for example, is running up to 20 petabytes of storage. As of 2010, Facebook's Hadoop cluster was running on 2300 servers with over 150,000 cores and 64 TB of memory between them. The world's current (Chinese) supercomputer champion is running at 2.5 petaflops.

On the other hand, a perhaps more telling comparison is to the size and energy consumption of the human brain that Watson beat, but certainly did not outclass, in the quiz show!

However, what's really more interesting from a business intelligence viewpoint is the information stored, the architecture employed and the effort expended in optimizing the processing and population of the information store.

We know from the type of knowledge needed in Jeopardy! and, indeed, from the possible future applications of the technology discussed by IBM that the raw information input to the system was largely unstructured, or soft information, as I prefer to call it. During the game, Watson was disconnected from the Internet, so its entire knowledge base was only 500 GB in size. This suggests the use of some very effective artificial intelligence and learning techniques to condense a much larger natural language information base to a much more compact and usable structure prior to the game. Over a period of more than four years, IBM researchers developed DeepQA, a massively parallel, probabilistic, evidence-based architecture that enables Watson to extract and structure meaning from standard textbooks, encyclopedias and other documents. When we recall that the natural language used in such documents contains implicit meaning, is highly contextual, and often ambiguous or imprecise, we can begin to appreciate the scale of the achievement. A wide variety of AI techniques, such as temporal reasoning, statistical paraphrasing, and geospatial reasoning, were used extensively in this process.

Dr. David Ferrucci, leader of the research project, states that no database of questions and answers was used nor was a formal model of the world created in the project. However, he does say that structured data and knowledge bases were used as background knowledge for the required natural language processing. It makes sense to me that such knowledge, previously gathered from human experts, would be needed to contextualize and disambiguate the much larger natural language sources as Watson pre-processed them. Watson's success in the game suggests to me that IBM have succeeded in using existing human expertise, probably gathered in previous AI tools, to seed a much larger automated knowledge mining process. If so, we are on the cusp of an enormous leap in our ability to reliably extract meaning and context from soft information and to use it in ways long envisaged by proponents of artificial intelligence.

What this means for traditional business intelligence is a moot point. Our focus and experience are directed mainly towards structured, or hard, data. By definition, such data has already been processed to remove or minimize ambiguity in context or content by creating and maintaining a separate metadata store, as I've described elsewhere.

However, there is no doubt that the major growth area for business intelligence over the coming years is soft information, which, according to IDC, is growing at a compound annual growth rate of over 60%, about three times as fast as hard information, and which already accounts for over 95% of the information stored in enterprises. It is in this area, I believe, that Watson will make an enormous impact as the technology, already based on the open-source Apache UIMA (Unstructured Information Management Architecture), moves from research to full-fledged production. There already exists a significant pent-up demand to gain business advantage by mining and analyzing such information. Progress in releasing the value tied up in soft information has been slowed by a lack of appropriate technology. That is something that Watson and its successors will certainly change.

While I have focused so far on the knowledge/information aspects of Watson--that being probably the most relevant aspect for BI experts--there is one other key feature of the technology that should be emphasized. That is Watson's ability to parse and understand the sort of questions posed in everyday English with all their implicit assumptions and inherent context. Despite appearances to the contrary in the game show, Watson was not responding to the spoken questions from the quiz master; the computer had no audio input, so the exact same questions heard by the human contestants were passed to it as text. In fact, speech recognition technology has also advanced significantly to the stage where very high levels of accuracy can be achieved. (As an aside, I use this technology myself extensively and successfully for all my writing...) The opportunities that this affords in simplifying business users' communication with computers are immense.

It seems likely that over the next few years this combination of technologies will empower business users to ask the sort of questions that they've always dreamed of, and perhaps haven't even dreamed of yet. They will gain access, albeit indirectly, to a store of information far in excess of what any human mind can hope to amass in a lifetime. And they will receive answers based directly on the sum total of all that information, seeded by the expertise of renowned authorities in their respective fields and analyzed by highly structured and logic-based methods.

Of course, there is the danger that if a given answer happens to be incorrect, it is difficult to see how the business user would discover that error or be able to figure out why it had been generated.

And that, as Sherlock Holmes never said, is far from "Elementary, my dear Watson!"

Synchronicity is a wonderful thing! I get yet another follower notice from Twitter today, and for the first time in ages I am curious enough to check the profile. It turns out that @LaurelEarhart is marketing director for the Smart Content Conference, among other things, including Biz Dev Maven! And there, I read "Perfect storm: #Google acquired #Metaweb" announced on July 16. Having just done a webinar with Attivio yesterday on the topic "Beyond the Data Warehouse: A Unified Information Store for Data and Content" my interest was piqued. Let me tell you why.

I suspect that very few data warehouse vendors or developers have paid much attention to Metaweb or its acquisition. As far as I can tell, it hasn't turned up on the data warehouse or BI analyst blogs either. Perhaps the reason is that Metaweb's business is in providing a semantic data storage infrastructure for the web, and Freebase, an "open, shared database of the world's knowledge". For data warehouse geeks, the former is probably a bit off-message, while the latter may sound like Wikipedia, although the mention of a shared database may raise the interest level slightly.

But, if you're thinking about what lies beyond data warehousing (as I am), and wondering how on earth we're ever going to truly integrate relevant content with the data in our warehouses, what Metaweb and now Google are doing should be of some interest. Here's a quote from Jack Menzel, director of product management at Google on his blog:

"Type [barack obama birthday] in the search box and see the answer right at the top of the page. Or search for [events in San Jose] and see a list of specific events and dates. We can offer this kind of experience because we understand facts about real people and real events out in the world. But what about [colleges on the west coast with tuition under $30,000] or [actors over 40 who have won at least one oscar]? These are hard questions, and we've acquired Metaweb because we believe working together we'll be able to provide better answers."

For me, the interesting point here is the inclusion in the hard questions of conditions that would make sense to even the most inexperienced BI user. Take either of these two hard questions and you can easily imagine the SQL statements required, provided you defined and populated the right columns in your tables. The problem is that you need to have defined the columns and the tables in advance of somebody asking the questions.
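To make that concrete, here is a minimal sketch of the second hard question expressed as SQL, run against an in-memory SQLite database from Python. The table names (actor, award), the column names and the sample rows are all hypothetical, invented purely for illustration; nothing here reflects how Google or Metaweb actually store or query their data.

```python
import sqlite3

# Hypothetical schema and sample rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    CREATE TABLE award (actor_id INTEGER, award_name TEXT, year INTEGER);
    INSERT INTO actor VALUES (1, 'Actor A', 52), (2, 'Actor B', 35), (3, 'Actor C', 61);
    INSERT INTO award VALUES (1, 'Oscar', 2001), (2, 'Oscar', 2009), (3, 'Emmy', 1998);
""")

# "actors over 40 who have won at least one oscar"
QUERY = """
    SELECT a.name
    FROM actor AS a
    WHERE a.age > 40
      AND EXISTS (SELECT 1
                  FROM award AS w
                  WHERE w.actor_id = a.actor_id
                    AND w.award_name = 'Oscar')
    ORDER BY a.name
"""

oscar_winners_over_40 = [row[0] for row in conn.execute(QUERY)]
print(oscar_winners_over_40)
```

The query itself is trivial. The catch is that the age and award_name columns had to exist, and be populated, before anybody thought to ask the question.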

What Metaweb on the Internet and Attivio on the intranet (and, of course, other vendors in both areas) are trying to do is to bridge the gap between data and content, so that users can ask mixed search and BI queries based on the implicit understanding that exists in the data/content stores of the semantics of the information. And, perhaps more importantly, to be able to do that in a fully ad hoc manner that doesn't require prior definition of a data model and its instantiation in columns and tables of a relational database. If you want to dig deeper, I invite you to take a look at my recent white paper.
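As a rough illustration of what "no prior data model" can mean, the sketch below stores facts as simple (subject, property, value) triples and filters them with ad hoc conditions. It is a toy, and an assumption on my part about the general idea; it is not a description of how Metaweb, Freebase or Attivio actually work. The sample facts and the helper names build_index and find are hypothetical.

```python
from collections import defaultdict

# Facts as (subject, property, value) triples. Adding a new kind of fact
# needs no schema change: it is just another triple.
facts = [
    ("Actor A", "type", "actor"),
    ("Actor A", "age", 52),
    ("Actor A", "won", "Oscar"),
    ("Actor B", "type", "actor"),
    ("Actor B", "age", 35),
    ("Actor B", "won", "Oscar"),
    ("College X", "type", "college"),
    ("College X", "region", "west coast"),
    ("College X", "tuition", 28000),
    ("College Y", "type", "college"),
    ("College Y", "region", "west coast"),
    ("College Y", "tuition", 41000),
]


def build_index(triples):
    """Group values by subject and then by property."""
    index = defaultdict(lambda: defaultdict(list))
    for subject, prop, value in triples:
        index[subject][prop].append(value)
    return index


def find(index, conditions):
    """Return subjects where every property in `conditions` has a value passing its test."""
    matches = []
    for subject, props in index.items():
        if all(any(test(value) for value in props.get(prop, []))
               for prop, test in conditions.items()):
            matches.append(subject)
    return sorted(matches)


index = build_index(facts)

# "actors over 40 who have won at least one oscar"
actors = find(index, {
    "type": lambda v: v == "actor",
    "age": lambda v: v > 40,
    "won": lambda v: v == "Oscar",
})

# "colleges on the west coast with tuition under $30,000"
colleges = find(index, {
    "type": lambda v: v == "college",
    "region": lambda v: v == "west coast",
    "tuition": lambda v: v < 30000,
})

print(actors, colleges)
```

Both hard questions run against the same store, and neither needed a table or column to be designed for it beforehand. What the toy leaves out, of course, is the genuinely difficult part: extracting such facts reliably from content in the first place.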

In the meantime, my thanks to @LaurelEarhart and the wonder of synchronicity.