Big Data: We Have the Technology, but Do We Have the People?

Organizations are awash in big data, opening up huge opportunities to understand and predict customer preferences and market growth. In a hyper-competitive global economy, having the right information means competitive advantage.

There is a catch, however. To reach information nirvana, companies need people with the right skills: people who know how to manage data, build analytics systems, and help make sense of the data.

A recent survey of data scientists by EMC bears this out. A total of 83% felt that new technology would increase the demand for data scientists, and 64% believed that demand would outpace the supply of available talent. In fact, a McKinsey Global Institute study predicts that within the next six years, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

TechTarget’s Beth Stackpole also pointed out that today’s professional workforce is trained to manage traditional, structured data environments, but is not ready to handle big data environments and open-source platforms such as Hadoop and MapReduce. “While data management teams typically have a well-defined set of expertise around managing and organizing highly structured data and modeling and creating reports in SQL, those conventional skill sets don’t translate well to the unstructured, flat-file part of the big data world, where command lines and NoSQL database technologies are the core building blocks of most of the emerging platforms.”

Hadoop, an Apache open-source project, is a collection of open-source components designed to store massive amounts of data across multiple nodes in the Hadoop Distributed File System (HDFS) and make it accessible for processing. MapReduce, often used in conjunction with Hadoop, is a programming model for building analytical capabilities on top of that data. NoSQL (“not only SQL”) databases typically handle unstructured data, including Weblogs, documents, text, PDF, video and audio.
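For readers who have not seen the MapReduce idea in practice, here is a minimal sketch of the pattern as a word count, written in plain Python. It is an illustration only: it runs in a single process and does not use Hadoop or any of its APIs, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example. On a real cluster the framework would spread the map and reduce steps across many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (a framework would do this between phases)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input standing in for lines of a large log file.
    sample = [
        "big data needs people",
        "people learn Hadoop",
        "big clusters store data",
    ]
    print(word_count(sample))
```

The point of the split is that each map call looks at only one line and each reduce call at only one key, which is what lets a framework run them independently on different machines.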

At the same time, companies shouldn’t have to look too far to find the talent they need to manage big data challenges and opportunities. As part of a series of Webcasts co-sponsored by Informatica and Cloudera, I had the opportunity to speak with executives and consultants on the front lines of the big data explosion.

For example, Binh Tran, CTO and co-founder of Klout, pointed out that skillsets are the “number one” challenge the social networking rating service is wrestling with. “When we first started, it was a matter of digging into it, getting into the online documentation. Finding people with production experience on a large scale is basically difficult. We had to hire people out of the Yahoo and Facebook world.” Tran reports seeing more colleges, at least in the Silicon Valley region, offering Hadoop and MapReduce as part of their curricula.

David Menninger, analyst with Ventana Research, pointed to recent survey results in which more than three-fourths of 169 executives say staffing and training issues are the greatest obstacles to making the most of big data.

Skills are short, but the situation is not hopeless, Cloudera’s Omer Trajman points out. The ability to address big data solutions such as Hadoop “isn’t rocket science, people can learn it,” he states. Just a few years ago, there were “only two people who knew Hadoop,” but now those numbers are expanding. “We encourage organizations [to] look at skillsets they have internally and train people. There are a lot of folks who have the right background and can learn to use Hadoop. It’s more than just finding individuals who already learned and hiring them… there are individuals within your organizations who can really grow into these roles… there’s a lot of folks who can learn Hadoop.”

Here are the positions that will play a role in big data:

System administrators: Responsible for the day-to-day operation of the cluster. “They may manage the hardware components either directly or indirectly, identify the need for additional hardware and bring it on-board.” Responsibilities also include monitoring and configuration, Trajman adds. “They’re also responsible for integration of Hadoop with other systems.”

Developers: Build the platform and analytics apps. “They have familiarity of the tools or algorithms, they might be writing or packaging or optimizing or deploying different MapReduce jobs. They’ll source and maintain different libraries,” Trajman adds. “Their role is similar to the DBA’s role in the database world.”

Data analysts/data scientists: Data analysts and data scientists fall into the same category, Trajman says. These professionals apply algorithms to analytic problems, and do data mining. “Their ability to tell a story with the data is what defines them.” In addition, Trajman says, “they may have domain expertise. They’ll help create data products, create data solutions that drive the business.”

Data stewards: Ultimately responsible for the collection of quality data. “Data stewards curate and catalog all the incoming data. There’s a lot of data floating around organizations, and Hadoop can get it centralized. So identifying the upstream data models, having a background in ETL [extract, transform, load] and data modeling are all typical skillsets and backgrounds.”