Big Data Analytics: Unleashing the Power of Hadoop

Data analytics, long the obscure pursuit of analysts and quants toiling in the depths of enterprises, has emerged as the must-have strategy of organizations across the globe. Competitive edge not only comes from deciphering the whims of customers and markets but also being able to predict shifts before they happen. Fueling the move of data analytics out of back offices and into the forefront of corporate strategy sessions is big data, now made enterprise-ready through technology platforms such as Hadoop and MapReduce. The Hadoop framework is seen as the most efficient file system and solution set to store and package big datasets for consumption by the enterprise, and MapReduce is the construct used to perform analysis over Hadoop files.

Hadoop was first conceived as a web search engine for Yahoo!, whose developers were inspired by Google’s now-well known MapReduce paper. It has become the cornerstone of a thriving big data marketplace. Estimates from Wikibon, the open source IT research community, put the worldwide big data market currently at approximately $18 billion, destined to reach roughly $50 billion by 2017.

For years, and into the present day, enterprises have been applying data analytics against structured, relational datasets derived from transactional systems, using a wide range of tools from a variety of vendors, from data warehousing platforms to front-end desktop-based analysis software. Now, with the universe of unstructured data rapidly expanding, a new frontier is opening up for analysis, enabling potentially far-reaching insights. Hadoop handles data that traditional relational databases, data warehouses, and other analytics platforms have been unable to effectively manage—including user-generated data through social media and machine generated data from sensors, appliances, and applications. Hadoop accomplishes this by applying more efficient formats and file systems to large datasets that would normally have been out of the reach of standard analytics solutions.

Currently, the most prevalent application seen among Hadoop sites is log and event data analysis, particularly against the machine-generated data coming from web activity and devices. This may include the gathering and analysis of network traffic applications, capacity requirements, security events, and web interactions. As adoption grows, Hadoop-based data may increasingly play a role in more strategic business information, such as sales analysis and workforce allocation.

WHY HADOOP?

Hadoop offers a range of advantages to data analytics efforts. First and foremost, it enables the processing and analysis of all forms of data—regardless of whether it is highly structured or unstructured. Hadoop is also more cost-effective than traditional analytics platforms such as data warehouses. With data warehouses, for example, investment needs to be made in the platform itself, along with investment in extract, transform, and loading (ETL) of data, and data cleansing, and modeling technologies. As a result, data has to be deemed important enough for the data warehouse investment, limiting its use and any ability to experiment with or pilot new forms of analysis. In Hadoop environments—which also can accommodate data warehouse data—big data stores can be brought in and processed cost effectively.

While enterprise adoption of Hadoop is expanding, it also brings new types of challenges

At Hadoop’s core is the principle of moving analytics closer to where the data resides. The framework is based on clusters that distribute the computing jobs required for big data analysis across various nodes. Hadoop is also cloud-friendly. While many enterprises choose to implement the framework within their data centers, Hadoop clusters can also be run from the cloud, either via cloud vendors or through hosting services.

There is also a robust ecosystem of tools and technologies that has developed around Hadoop. Not only is the framework supported by a range of commercial software vendors, but there are also a number of open-source tools available as well, which enable enterprises to derive value out of big data. Many advanced analytic tools on the market also now support Hadoop, enabling visualization, data mining, predictive analytics, and text analytics against big datasets. There is also greater accuracy and flexibility possible in big data via Hadoop. First, analysis can be run against entire datasets, versus smaller samples, as has been the case in the past. In addition, the Hadoop Distributed File System packages datasets into files that can be easily absorbed by existing applications, without the need to upgrade them to a massively parallel version that can absorb big datasets.

CHALLENGES

While enterprise adoption of Hadoop is expanding, it brings new types of challenges, from manual coding demands, skills requirements, and a lack of native real-time capabilities. For example, the Hadoop Distributed File System does not offer the native resiliency or the real-time capabilities that enterprises have come to expect with enterprise-grade software packages. Hadoop is natively batch oriented, and thus real-time analysis may not be available without additional tools. Plus, if the Hadoop system goes down, it may take some time to recover and restore the framework. In addition, the technology—first developed and released in 2006—is still relatively new on the scene, and implementations still relatively immature.