Traditional Tools Still Big Part of Big Data

Imagine a big data project in which old-tech tools such as relational databases and enterprise applications are prominent players. It's more popular than you might think, according to a new survey.

By Stephen Swoyer

September 11, 2012

A new survey from open source software (OSS) business intelligence (BI) specialist JasperSoft Inc. sheds light on how and why adopters are using big data, and in so doing raises a bevy of important questions.

JasperSoft's Web survey of 631 respondents indicates that almost two-thirds (62 percent) have already "deployed" big data projects, while -- among those that haven't -- "a lack of understanding about Big Data" is cited as the chief impediment.

According to director of product marketing Mike Boyarski, JasperSoft was surprised by the number of respondents who say they're already working with big data. "That for us was interesting because we've been talking about it for a while, but ... we didn't expect to see such a large [number of] production projects," he comments.

Among respondents who plan to deploy, a sizeable percentage have already secured budgets.

This, too, is surprising -- and significant, says Boyarski.

"There's this suspicion around the ROI with these types of data projects ... [but] it's clear that the business sponsors are recognizing an opportunity and they're okay-ing it, whether it's time and effort or dollars and budget to go and implement [it]."

The survey unearthed a few ostensible surprises.

Take, for example, the high use of ETL, which almost three-fifths (59 percent) of respondents said was "very important" to their big data projects. JasperSoft, says Boyarski, was surprised that ETL -- or other traditional data integration (DI) tools -- figures so prominently in big data projects. "We were surprised at how frequent [is] the use of ETL ... or the desired use of ETL within the context of big data [projects]," he said, speculating that respondents could be using ETL as a "sort of intermediary tool looking to put some structure into the data."

Industry veteran Marc Demarest, a principal with management consultancy Noumenal Inc., says he isn't surprised by this. Demarest says that in most cases, Hadoop is being used to pool large amounts of file-oriented data, while MapReduce and conventional DI tools -- such as ETL and ELT -- "are being used to extract data from Hadoop -- or to get data into Hive or something similar, which then becomes a 'source system' for ETL, ELT," or other traditional DI tools.

Another seemingly surprising finding was the prominence of conventional relational databases in many (so-called) "big data" projects. Given the use of ETL, however, this shouldn't come as a surprise. In fact, nearly the same proportion of respondents -- three-fifths (60 percent) -- are using vanilla relational databases in their big data projects as are using ETL.

Respondents were able to select multiple repositories, and what is surprising is the comparatively low representation of Hadoop and NoSQL repositories in the survey data. Fewer than one in five respondents (18 percent) say they're using Hadoop -- the archetypal platform for big data -- and slightly more (19 percent) say they're using MongoDB, a NoSQL data store that's also touted for use with big data projects. Other well-known NoSQL solutions included Apache Cassandra (used by just 7 percent of respondents), CouchDB (3 percent), and DynamoDB (4 percent). Elsewhere, analytic database platforms such as those marketed by Teradata Inc., IBM Netezza, and ParAccel Inc. (among others) were used in 11 percent of big data projects.

This seems counter-intuitive. After all, platforms such as Hadoop, MongoDB, Cassandra, and others are marketed as solving the shortcomings of conventional relational platforms in a big data context.

What conclusions can we draw from the data? It's hard to say. As Boyarski concedes, the structuring of the survey question (viz., "What Big Data stores are you using for your project?") doesn't tell us much.

The lack of any detailed follow-up question -- asking, for example, which relational platforms were in use -- doesn't help, either. Boyarski suggests that perhaps "a lot of these [big data analytic] projects are ... combining data from various places and trying to supplement what they have, so that's probably where the relational data is coming into play."

Demarest, on the other hand, says he doesn't find this surprising.

After all, several of his clients are primarily using relational databases with their big data projects, he indicates. "'Big data' technologies are just [being used as a] pre-ETL pooling technology for [relational databases]," Demarest explains. In this scheme, he continues, data that's "persisted in Hadoop ends up in my 'normal' [data warehousing or business intelligence] infrastructure, in relational form, for normal consumption through normal mechanisms by normal users."

Similarly surprising -- again, at first glance -- is the representation of traditional enterprise applications in the big data mix. Almost four-fifths (79 percent) of respondents say they're piping enterprise application data -- from e-commerce, financial, ERP, CRM, SCM, PLM, and other applications -- into their big data projects. This came as a surprise to JasperSoft.

"It's interesting that the number-one source was application data, number two was machine-generated, and ... number three [was] human-generated," Boyarski comments.

Although the idea of analyzing big data information in context with traditional enterprise data is commonly touted as the end-game of the (nascent) big data paradigm shift, most such efforts are believed to be in the early-adopter stage. Given the nebulousness of some survey questions -- and the lack of any questions asking how enterprise application data is being used in big data projects -- it's difficult to draw any conclusions, Madsen says.

"I find it odd that enterprise sources [such as OLTP applications] are a major source, because that's not what I've seen, but then it depends on the market," he comments. "Banks, retailers, insurance companies, etc. are doing analytics that were expensive [and/or] resource-constrained in the old environment, so transferring the data makes sense."

Once again, Noumenal's Demarest says he isn't surprised.

Outside of cutting-edge or unconventional sectors -- such as social media -- many adopters are approaching "new" (i.e., big) data the same way they approached ... "old" data, and with good reason, he argues: they've invested millions in developing "old" data skills -- and "old" data infrastructures -- for starters. From the perspective of many big data adopters, he argues, "there's nothing about the 'new' data that invalidates how I deal with the 'old' data." True, Demarest concedes, "there are some cases in which the 'new' data overwhelms the 'old' infrastructure," but -- because these are exception scenarios -- "why should I use new stuff?"

For this reason, and as JasperSoft's survey suggests, the venerable enterprise data warehouse (EDW) is going to be powerfully difficult to dislodge from its place of primacy.

"In all cases, the centerpiece of the all-encompassing architecture is still the relational 'EDW' and its dependent marts," he concludes, adding that "[w]hether this is sidelining the EDW or cementing it in place remains to be seen longer-term."