
What's IBM's take on Hadoop as the new enterprise data warehouse and disruptor of data-integration and mainframe workloads? Bob Picciano, appointed in February as general manager of IBM's Information Management Software Division, says there's no doubt that Hadoop will displace certain workloads, but he's more dismissive about NoSQL and upstart databases including SAP Hana.

A veteran of multiple IBM software units, including a prior stint in Information Management, Picciano is tasked with revitalizing a business that has been "a little flat for the last year or two," according to Gartner Analyst Merv Adrian. "No doubt their feistiness is going to be more evident," Adrian says.

Feisty is a fitting description for Picciano, who in this in-depth interview with InformationWeek is at turns effusive and dismissive. He talks at length about IBM's vision for five big data use cases while ceding nothing to database competitors SAP, MongoDB and Cassandra. Read on for the big data perspective from inside IBM.

InformationWeek: There's a growing view that companies will use Hadoop as a reservoir for big data. Everyone agrees that conventional databases will still have a role, but some see big enterprise data warehouses being displaced. What's your view?

Bob Picciano: Sometimes people drastically overuse the term big data. We've done more than 3,000 engagements with customers around our big data portfolio, and almost all of them have fallen into one of five predominant use cases: developing a 360-degree view of the customer; understanding operational analytics; addressing threat, fraud and security; analyzing information that you didn't think was useable before; and offloading and augmenting data warehouses.

Some of these use cases are more Hadoop-oriented than others. If you think about exploring new data types including a high degree of unstructured data, for example, it doesn't make sense to transform that data into structured information and put it into a data warehouse. You'd use Hadoop for that. We have an offering called Data Explorer, which is based on our Vivisimo acquisition, that helps index and categorize unstructured information so you can navigate, visualize, understand and correlate it with other things.

Operational analytics is another use case involving Hadoop. There we just delivered a new offering with our Smart Cloud and Smarter Infrastructure that focuses on helping clients to pull in and analyze log information to spot events that could be used to help improve the resiliency of operational systems.

In the case of developing a 360-degree view of customers, maybe you have a system of master data [like CRM], so you have customer data files, but how do you also include information from public or social domains?... And how do you sew together interactions on Web pages? That's very much a Hadoop data workload.

IW: IBM has a Hadoop offering (with IBM BigInsights), but so, too, do Microsoft, Oracle, Pivotal, Teradata, Cloudera and others. How does IBM stand out in the big data world?

Picciano: One of the use cases that's unique to IBM is streaming analytics. In a big data world, sometimes the best thing to do is persist your question and have the data run through that question continuously rather than finding a better place to persist the data. Hadoop is, in many ways, just like a different kind of big database. That may be insufficient to differentiate company performance on a variety of different workloads.

Data is becoming a commodity, information is becoming a commodity and even insight is becoming a commodity. What's going to become a differentiator is how fast you can develop that insight. If you have to pour data into a big data lake on Hadoop and then interrogate that information, then you have to figure out, "is this the right day to ask that question?" With streaming analytics you can ask important questions continuously.
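The idea of persisting the question rather than the data can be sketched in a few lines. This is a minimal illustration only, not IBM InfoSphere Streams code: the `standing_query` function, the `(timestamp, value)` record shape and the threshold are all hypothetical names chosen for the example. The question (a predicate over a sliding-window mean) is registered once, and each arriving reading is evaluated against it immediately instead of being stored for a later batch query.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical record shape: (timestamp_seconds, measured_value)
Reading = Tuple[int, float]


def standing_query(
    stream: Iterable[Reading],
    window_size: int,
    predicate: Callable[[float], bool],
) -> Iterator[Tuple[int, float]]:
    """Yield (timestamp, window_mean) whenever the persisted question matches.

    Only the last `window_size` readings are kept in memory; everything
    older is discarded as new data flows through.
    """
    window: deque = deque(maxlen=window_size)
    total = 0.0
    for ts, value in stream:
        if len(window) == window_size:
            total -= window[0][1]  # subtract the reading about to be evicted
        window.append((ts, value))
        total += value
        if len(window) == window_size:
            mean = total / window_size
            if predicate(mean):
                yield ts, mean


# Made-up sample readings, one per second
readings = [(0, 97.0), (1, 96.0), (2, 90.0), (3, 85.0), (4, 84.0), (5, 95.0)]

# The "question": is the 3-reading average below 90?
alerts = list(standing_query(readings, window_size=3, predicate=lambda m: m < 90.0))
alert_times = [ts for ts, _ in alerts]
print(alert_times)
```

Because `standing_query` is a generator, it works the same way on an unbounded source such as a socket or message queue; the list above is used only to keep the sketch self-contained.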

IW: Aspirations around the Internet of Things seem to be reinvigorating the complex event processing market. Is this the kind of analysis you're talking about?

Picciano: Yes. If you think about machine-to-machine data and areas like health care and life sciences, we've done some great work with amazing institutions like UCLA and the Hospital for Sick Children in Toronto by analyzing data in motion with IBM InfoSphere Streams. When you look at neonatal care, for example, a nurse typically comes by once an hour and writes down vital signs. That's one chart point, and they'll come back in another hour and so on. But there's so much volatility around blood oxygen levels, heart rates and respiratory rates. By streaming that information and analyzing on a constant basis, you can spot when infants are having what they call spells, which increase their susceptibility to life-threatening infections. In some instances they can also over-oxygenate babies, and when that happens they can go blind.

IW: You hear a lot of talk about real-time applications, but there seem to be far fewer real-world examples. Is real-time really in high demand?

Picciano: There are many other examples. In the telco space, providers are constantly trying to analyze call quality and spot potentially fraudulent activity. They typically do that based on call data records that they load into a warehouse on a daily basis. We're doing it in real time so there's a whole different degree of remediation for customer experience management. We can identify dropped calls and whether they were related to call quality. You can look at the profile of callers, particularly pre-paid callers, and see if they're trying to burn up their minutes. That means they're likely to churn to another carrier, but we've found that there are ways to intercede in those cases and prevent churn.


Sorry @rklopp894, I just realized that I didn't respond to your BTW comment. Mr. Picciano did not say that Netezza can't do under 50 TB at all; in fact, there are loads of Pure Data for Analytic systems (which many will know through the Netezza name) that are below 50 TB. Hadoop indeed plays in that petabyte space as well (and below, for that matter), and there is tight integration between Netezza and Hadoop (not to mention that IBM has its own non-forked distribution called BigInsights, for which you get a limited-use license free with Netezza). What's more, Netezza lets you execute in-database MapReduce programs, which can really bridge the gap for the right applications and provide a unified programming method across the tiers (Netezza and Hadoop).

@Lori Vanourek, please see my response to rklopp894 regarding the inefficient column partition replacement LRU algorithm that Mr. Picciano was referring to. With respect to decompression, you actually call out the difference Mr. Picciano is stating. You say that decompression "is not done until it is already in the CPU cache." And THAT IS the issue: you have to decompress the data when loading it into registers from cache so that you can evaluate the query. DB2 with BLU Acceleration doesn't decompress the data. In fact, the data stays compressed and encoded in the registers for predicate evaluation (including range predicates, not just equality) as well as join and aggregate processing. That's the clear advantage that Mr. Picciano is pointing out for DB2.
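For readers wondering how a range predicate can be evaluated without decoding values at all, one general technique is order-preserving dictionary encoding. The sketch below is a toy illustration of that technique, not a description of how DB2 with BLU Acceleration or SAP HANA is actually implemented; the function names and sample column are hypothetical. Because codes are assigned in sorted order, comparing two codes gives the same answer as comparing the original values, so the filter touches only small integers.

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence


def build_dictionary(values: Sequence[str]) -> List[str]:
    """Sorted distinct values; a value's position is its code."""
    return sorted(set(values))


def encode(values: Sequence[str], dictionary: List[str]) -> List[int]:
    """Replace each value with its (order-preserving) integer code."""
    return [bisect_left(dictionary, v) for v in values]


def range_filter(codes: Sequence[int], dictionary: List[str],
                 low: str, high: str) -> List[int]:
    """Row indices where low <= value <= high, comparing codes only.

    The two bounds are translated to codes once; the scan itself
    never looks up or decodes a stored value.
    """
    lo_code = bisect_left(dictionary, low)
    hi_code = bisect_right(dictionary, high) - 1
    return [i for i, c in enumerate(codes) if lo_code <= c <= hi_code]


# Hypothetical sample column of country codes
column = ["DE", "US", "FR", "US", "BR", "FR"]
dictionary = build_dictionary(column)   # ["BR", "DE", "FR", "US"]
codes = encode(column, dictionary)      # [1, 3, 2, 3, 0, 2]

# Rows whose value lies between "C" and "FR" inclusive
matching_rows = range_filter(codes, dictionary, "C", "FR")
print(matching_rows)
```

Real column stores typically add bit-packing and SIMD comparisons on top of this idea; those are omitted here to keep the example short.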

@rklopp, I think Mr. Picciano's understanding of memory usage is EXACTLY in line with the blog posting you point to. In fact, that blog posting clearly states, "in other words where there is not enough memory to fit all of the vectors in memory even after flushing everything else out... the query fails." That's EXACTLY what Mr. Picciano points out when he talks about how a client might have issues at a Qtr-end close when they start to really stress the system. From what I can tell, and DO correct me (my wife always does, swiftly I may add) if I've read the paper you sent us to wrong, but SAP HANA resorts to an entire column partition as the smallest unit of memory replacement in its LRU algorithm. All other vendors that I know of (including columnar ones that I've looked at) work on a much better block/page level memory replacement algorithm. In today's Big Data world, I just find it unacceptable to require a client to have to fit all their active data into memory; I talk to enough of them that this just doesn't seem to be reality.
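The general trade-off being argued here, the size of the smallest unit an LRU policy can load or evict, can be shown with a toy model. This is a sketch under stated assumptions and does not model SAP HANA, DB2, or any other product: `LRUPool`, the 64-unit memory budget, the 100-unit column and the 4-unit page size are all made up for illustration.

```python
from collections import OrderedDict
from typing import Hashable


class LRUPool:
    """Toy memory pool that evicts the least recently used unit first."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0
        self.units: "OrderedDict[Hashable, int]" = OrderedDict()  # id -> size

    def touch(self, unit_id: Hashable, size: int) -> None:
        """Access a unit, loading it (and evicting others) if necessary."""
        if unit_id in self.units:
            self.units.move_to_end(unit_id)  # mark as most recently used
            return
        if size > self.capacity:
            raise MemoryError(f"unit {unit_id!r} cannot fit in the pool")
        while self.used + size > self.capacity:
            _, evicted_size = self.units.popitem(last=False)  # drop oldest
            self.used -= evicted_size
        self.units[unit_id] = size
        self.used += size


CAPACITY, COLUMN_SIZE, PAGE_SIZE = 64, 100, 4  # hypothetical sizes

# Coarse granularity: the whole column is the unit of replacement.
coarse = LRUPool(CAPACITY)
try:
    coarse.touch("sales.amount", COLUMN_SIZE)
    column_fits = True
except MemoryError:
    column_fits = False

# Fine granularity: the same column scanned as 25 pages of 4 units each.
fine = LRUPool(CAPACITY)
for page_no in range(COLUMN_SIZE // PAGE_SIZE):
    fine.touch(("sales.amount", page_no), PAGE_SIZE)

print(column_fits, fine.used, len(fine.units))
```

In this toy setup the column-sized unit cannot be loaded at all, while the page-sized scan completes with only the 16 most recent pages resident. Whether a real system behaves this way depends on details the model leaves out, such as scale-out across nodes and how partitions are sized.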

Here is a description of how HANA utilizes memory (http://wp.me/p1a7GL-lo ) to better inform Mr. Picciano. This information is available to IBM via the HANA Blue Book and other resources as they are one of SAP's best partners and very active in the HANA community.

BTW: The surprise to me was that Netezza is the preferred solution for petabyte-sized solutions... but not below 50TB. I do not believe that they have a large footprint in the space above a petabyte... and Hadoop plays somewhere in that petabyte place?

Thank you Doug for your post. For clarification, SAP HANA does not need to decompress data in order to determine whether or not it fits a query. SAP HANA can select and run operations on compressed data. When data needs to be decompressed, it is not done until it is already in the CPU cache. Also, if an SAP HANA system should run scarce on memory, columns (selected by LRU mechanisms) are unloaded from memory down to Data Volume (HANA organized disks), in a manner that leverages database know-how, thus preventing the usual brutal SWAP activities of the OS. Of course, SAP offers scale-out capabilities with the SAP HANA platform so that customers can grow their deployments to multiple nodes, supporting multi-terabyte data sets.

I was surprised by Picciano's dismissive take on MongoDB and Cassandra. Oracle seems to be taking NoSQL more seriously, but then, they had Berkeley DB IP to draw from when they developed the Oracle NoSQL database. I'd note that MySQL has offered NoSQL data-access options for some time, but that hasn't curbed the rapid growth of NoSQL databases including Cassandra, Couchbase, MongoDB, Riak and others. DB2 may have NoSQL access, but cost, development speed and, frankly, developer interest in using it for Web and mobile apps just aren't the same as what we're seeing with new-era options.

I was also surprised by the idea of running Hadoop on mainframe, but then, Cray recently put Hadoop on one of its supercomputers. That's not exactly cheap, commodity hardware.
