Wanted: IT Pros with Hadoop Management Skills

Never mind Java programmers or data scientists: there's an acute need for IT technologists with Hadoop management skills.

By Stephen Swoyer

April 16, 2013

Nearly a year ago, David Inbar, senior director of big data products with data integration (DI) specialist Pervasive Software Inc., ventured an appreciation of Hadoop on both technological and aesthetic grounds. Hadoop, Inbar said, is "a beautiful platform for all kinds of computation."

A lot has changed since then, especially in the world of Hadoop. Last June, for example, Inbar couldn't point to an example of an ACID-compliant RDBMS for Hadoop; as of this February's Strata 2013 conference, he could point to Pervasive's pending acquisition by analytic database specialist Actian Inc. (Pivotal HD, which includes an ACID-compliant implementation of the Greenplum RDBMS for Hadoop, is marketed by EMC Corp., an Actian competitor.)

"I don't think our view on Hadoop has changed. [It is] still a beautiful platform for what it accomplishes," he explains. "If anything, I'd say from what we're seeing out there ourselves and [what we're hearing from] other users, I think more of the potential benefits of this kind of computing paradigm are being realized, and at the same time, the Hadoop ecosystem -- the combination of open source and commercial vendors -- is moving along at a pretty good clip."

At the same time, Inbar acknowledges, there's a great amount of work still to do. "I think it's fair to say that there are still immaturities and there are still challenges for companies that ultimately want to harness and extract value from big data and from rapidly moving data," he concedes. "One of the biggest of these is a desperate shortage of skills, knowledge, and experience both in terms of acquiring and provisioning and managing Hadoop-based clusters for distributed environments. [Another is] a desperate shortage of people with the skills to actually make use of them once they're up and running."

Much has been made of the shortage of skilled Java, HiveQL (HQL), and Pig Latin programmers, to say nothing of the acute shortage of data scientists.

Because this latter group, in particular, possesses a highly specialized skill set that cuts across multiple domains -- viz., business, technology, and mathematics -- it's destined to be frustratingly rare, Inbar suggests.

They can't be mass-produced; they can't easily be "trained up"; they'll likely always remain highly prized. "Nowadays, everybody's talking about data scientists. The archetype of the data scientist is someone who is both a programmer and a data analyst and probably to some extent a business domain expert. Maybe not an 'expert,' per se, but maybe at least highly knowledgeable in a particular business domain."

"Surprise, surprise: there's a shortage. Arguably, we're chasing after the wrong and unrealistic combination there," Inbar continues. "The traditional way of handling [this problem] in a more traditional data management ecosystem is that while we do end up with more specialized roles, we [support these roles] by mak[ing] more powerful tools available to the individuals [who fill them]. This is part of the evolution [of the technology], but it will take time."

As a case in point, he cites the claim -- popular among DI vendors such as Pervasive -- that the most time-consuming aspects of a data scientist's work involve accessing and preparing data.

The task for Pervasive and other DI vendors is to simplify both the scheduling and the processing that's involved in preparing data for analysis. "You really do need to capture and mung the data before you can do your analysis," he says, noting that this is a relatively old problem in DI.

"Today, we provide a management layer ... for choreographing data integration and data preparation and analytics and being able to drive that on a scheduled basis across different architectures and across internal data centers and even across the cloud."

When it comes to big data, Pervasive and other vendors are on less sure -- or less automated -- ground. Frustratingly, there's a dearth of technologists with expertise in Hadoop management and configuration. This shortage has received far less attention than the shortage of programmers and data scientists, although it's no less critical. The parameters in traditional data management (DM), like those in most areas of IT, are typically tightly controlled; those in the big data realm are comparatively relaxed. This is as much a function of the complexity of the big data model as of its technological immaturity, Inbar suggests. It makes the problem of orchestrating and managing interactions (i.e., jobs) between and among platforms even more challenging.

"Being able to do this [choreographing] and integrating between [on-premises and SaaS] platforms is something that's old news from a data integration perspective. As the big data solutions start to mature and be adopted, they're going to have to live in the same kind of framework. It's a different problem [with big data], however. How do you make all of this happen together without adding too many more dimensions of complexity and maybe even more vulnerabilities -- more exposure -- of one kind or another?" he asks.

The answer, again, is time: the pool of potential Hadoop administrators is much larger than that of potential data scientists, and IT technologists can be trained up on Hadoop and big data analytics. Over the same period, vendors will focus on delivering more management amenities, both for Hadoop and for the platforms, services, or technologies that are (or will be) part of its ecosystem.

"The sysadmins who are handling Hadoop clusters need better systems management tools. This is a gap that's being sort of addressed right now, and [which] will be [addressed] over time with compliance and security and controls," he says. "You're going to have the same issues [that you have today], although maybe not to the same extent: you need data governance -- data quality and reliability and trackability and auditability -- because ultimately your analytics need reliable data. If your underlying data isn't reliable, then your brilliant analytics, no matter how lightning fast they may become, aren't going to be helpful; in fact, they're going to be downright dangerous."