IBM Showcases Software Vision and Hadoop Research

At IBM’s 8th annual Connect meeting with analysts, Steve Mills, Senior VP and Group Executive, had much to crow about. Software is the engine driving IBM’s profitability, anchoring its customer relationships, and enabling the vaulting ambition to drive the company’s Smarter Planet theme into the boardroom. Mills’ assets are formidable: 36 labs worldwide with more than 100 software developers each, plus 49 more with over 20 each – 25,000 developers in all. Mills showcased all this in a matter-of-fact, businesslike fashion with minimal hype and little competitor bashing. A research project aimed at extending Hadoop usage to a broader audience was among the highlights.

Mills gave us a look at his organizing principle:

“We have been working on extending the notion of what middleware is. It’s about connecting an organization’s applications, the codification of business process and function.”

Companies from mid-market to large enterprises run thousands of applications; understanding customers’ business scenarios, addressing identified gaps, and promoting recommended patterns for success – adoption routes, solution stacks – is the driver. “It’s very easy to make a mess if you’re not guided,” Mills points out. He’s an effective, dedicated proponent of IBM’s Smarter Planet theme, and returned to it at this event, pointing out how IBM-supported projects that instrument and enhance the world’s often aging physical systems pay for themselves in efficiency savings even before the larger goals they enable are considered. He also held forth on other favorite topics: Industry Models, Cloud Computing (“You have to talk about Cloud”), and more, but told us he’d promised not to use all his team’s best slides before they could. “Not that I can’t talk about all of it,” he joked – and we’ve seen him do it. But no 3-hour keynotes here, mercifully, unlike some other vendors’ recent events.

Bringing Hadoop to Business Users

In his presentation, Rod Smith, VP, Emerging Internet Technologies, made it clear that the company is not ignoring the MapReduce/Hadoop phenomenon. He referred graciously to Cloudera’s work and picked up their phrase: “big data.” With the world creating nearly 15 PB of new data per day, a new class of content-centric WebApps is on the horizon, typically “longer running apps” – customers Smith talks with don’t like the word “batch,” he noted. But his focus was different from that of other vendors I’ve been hearing from, who assume the “big data” opportunity is limited to the sophisticated programmers who have so far led the way. Instead, “Put the business person in the center of the data,” Smith suggested. “They want their own Google” – here meaning not a search engine, but a data interaction tool capable of visualization and other forms of manipulation.

It’s clear that the need for such solutions will be there, and someone will fill it. When a firm like Extrabux can process 40Gb/day – loading and indexing 70 million constantly changing input records for MapReduce on Amazon’s EC2 cloud – for less than $5,000 per year, with no DBA, others will follow. (See the September issue of Charles Brett’s Insight-Spectra for details of this case study.) Like other explorers in this new mode, Smith offered his own great examples, including a Visa risk-modeling app using Hadoop with the R statistical libraries that cut an analysis from one month to 13 minutes. “This is not incrementally better; it changes everything,” he said.
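The pattern behind speedups like the Visa example is straightforward data parallelism: partition the records, compute partial statistics on each partition independently (the map step), then merge the partials (the reduce step). A minimal sketch in Python – the toy data and the mean/variance statistic are my own illustration, not Visa’s actual model:

```python
from functools import reduce

def map_partition(records):
    """Map step: compute partial aggregates for one data partition.
    Each partition could run on a different machine in a cluster."""
    n = len(records)
    s = sum(records)
    ss = sum(x * x for x in records)
    return (n, s, ss)

def combine(a, b):
    """Reduce step: merge two partial aggregates."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(partitions):
    n, s, ss = reduce(combine, (map_partition(p) for p in partitions))
    mean = s / n
    variance = ss / n - mean * mean  # population variance
    return mean, variance

# Toy amounts split across three partitions.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
print(mean_and_variance(partitions))  # mean of 1..6 is 3.5
```

Because the partial aggregates combine associatively, the partitions can be processed in any order on any number of machines – which is exactly what lets a month-long serial analysis collapse to minutes on a cluster.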

Smith’s BigSheets project showed off analysis performed on over 2 million patent documents – a “one-person project, like all my things.” He referred to the iTunes interface and showed a similarly clean, intuitive model. And he pointed out that “the data operated on does not always get reduced; here it exploded, because one analysis was of how patents made references to other patents.” Similar things happen when analyzing social graphs; it’s why focusing on MapReduce alone to describe these cases doesn’t always paint the full picture. It’s just one step in more complex processes that can be distributed across large systems that scale on demand. Similar thinking about user empowerment, without the elastic scaling (yet), is behind Microsoft’s PowerPivot, which treats Excel as the UI and adds operators to the Excel language that mimic the kinds of things MDX programmers can do with OLAP cubes, among other things.
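Smith’s point about data “exploding” is a general property of the map step: one input record can emit many output records. A hypothetical sketch of inverting a patent-citation graph – the patent IDs are made up, and this is my illustration of the idea rather than BigSheets’ implementation:

```python
from collections import defaultdict

# Each input record: (patent_id, list of patents it cites).
patents = [
    ("P1", ["P2", "P3"]),
    ("P2", ["P3"]),
    ("P4", ["P1", "P2", "P3"]),
]

def map_citations(record):
    """Map: emit one (cited, citing) pair per reference --
    output volume grows with the number of citations, so the
    data expands rather than reduces."""
    patent, cites = record
    for cited in cites:
        yield (cited, patent)

def reduce_by_key(pairs):
    """Reduce: group citing patents under each cited patent."""
    cited_by = defaultdict(list)
    for cited, citing in pairs:
        cited_by[cited].append(citing)
    return dict(cited_by)

pairs = [p for rec in patents for p in map_citations(rec)]
print(len(patents), "input records ->", len(pairs), "map outputs")  # 3 -> 6
print(reduce_by_key(pairs))
```

Here three input records become six intermediate pairs before the reduce step regroups them – the same dynamic that makes citation and social-graph analyses grow rather than shrink their data.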

IBM is looking past today’s MapReduce cases, which are often reminiscent of early computing days, when specialists spent days setting up machines for a single program run. The problem then was scale too, and learning to use machine resources efficiently was job one. Today the economics have flipped: we understand that people’s time is the more valuable resource, and we have to empower them. IBM is looking beyond complex setup, Java coding, and single-run models for “big data” processing, toward interactive big data analysis – at Web scale. In Smith’s view, that’s the key to moving into an “evidence-based business world.” IBM is focused on hiding the complex details of system parallelization, fault tolerance, load balancing, and the like behind the UI. Tech details weren’t at the top of Smith’s agenda for this presentation, but REST interfaces, the use of Jaql, extensibility via UDFs, integration of Pig, and exporting results into feeds and XML were briefly highlighted. As IBM continues to push in this area, we can expect to see breakthrough innovations emerge in larger, end-to-end scenarios.
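The division of labor described here – the user supplies simple per-record logic while the platform handles parallelization, fault tolerance, and load balancing – is visible even in Hadoop’s basic Streaming model. A minimal word-count sketch in Python; in a real deployment the two functions would run as separate mapper and reducer processes, with Hadoop handling distribution and the shuffle between them:

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """User's per-record logic: emit (word, 1) for each word.
    The framework, not the user, fans these calls out across machines."""
    for line in lines:
        for word in line.strip().split():
            yield (word.lower(), 1)

def reducer(pairs):
    """User's per-key logic: sum the counts for each word.
    The framework's shuffle routes all pairs for a key to one reducer."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data big sheets", "big plans"]
print(dict(reducer(mapper(lines))))
```

The user-visible code knows nothing about clusters or failures; that separation is what a business-user UI like BigSheets then builds on, hiding even this much coding.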

Follow me at Gartner

I am a Gartner analyst, covering information management with a strong focus these days on big data and NoSQL-related issues. I’ll continue to post here, subject to Gartner’s guidelines, as well as in my Gartner blog. Posts here will link there.