Realize the value of your data using Hadoop: Evaluation, adoption and value of data and analytics

Authors: Ashim Bose and Jan Jonak

As organizations strive to identify and realize the value locked inside their enterprise data, including new data sources, many now seek more agile and capable analytics system options. The Apache Hadoop ecosystem is a rapidly maturing technology framework that promises measurable value and savings and is enjoying significant uptake in the enterprise environment.

This ecosystem brings a modern data processing platform with storage redundancy and a rich set of capabilities for data integration and analytics, from query engines to advanced machine learning and artificial intelligence (AI).

However, very real challenges remain for organizations seeking to adopt this evolving data and analytics model effectively and to leverage it, along with other advanced analytics services from cloud providers, to deliver business value. Companies and public institutions may struggle to acquire the skills, tools and capabilities needed to successfully implement Hadoop analytics projects in both cloud and on-premises environments. Others are working to clarify a logical path to value for data- and analytics-oriented deployments.

In this paper, DXC Technology describes a proven strategy to extract value from Hadoop-oriented investments. The paper examines the opportunities and obstacles many organizations face and explores the processes, tools and best practice methods needed to achieve business objectives.
Hadoop realities

As organizations of all kinds embrace the use of data and analytics, most are now moving beyond traditional business intelligence (BI) to a more advanced and comprehensive analytics environment with much richer data sources. Forward-looking executives now view data and analytics involving machine learning and AI capabilities as vital tools needed to improve customer relationships, accelerate speed to market, survive in an increasingly dynamic marketplace and drive sustainable value.

The value of data

The very nature of the data revolution — driven by the four V’s: volume, variety, velocity and veracity — now poses unique challenges to many organizations. Informed observers expect the global volume of data to swell to 163 zettabytes by 2025, 10 times the amount today.1 Unstructured data now comprises 80 percent of enterprise data,2 with growing volumes of data flowing from increasingly ubiquitous sensors, mobile devices, video streams and social networks. This is in addition to the varieties of data already being stored within enterprises but not easily available for analytics. Given the speed and reach of this data revolution, it is perhaps not surprising that many organizations are less than prepared to meet these challenges.

Hadoop ecosystem emerges and matures

As businesses, public agencies and other organizations struggle with these challenges, many now view the incrementally maturing Hadoop ecosystem and related complementary technology offerings as a logical way to process and analyze big data as well as smaller datasets (see Figure 1). Driven by the needs associated with broadening adoption, the ecosystem is evolving with more enterprise integration and operational features.

The Apache Hadoop ecosystem is an open source software framework for distributed storage and processing of very large datasets, using simple programming models across clusters of commodity computing hardware. Reliable and scalable, Hadoop is designed to run on anything from a single server to thousands of machines. Most ecosystem tools are inherently scalable as well, although early releases of individual tools may not yet be mature enough for demanding production use.
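To make the "simple programming models" point concrete, the sketch below shows a classic word-count job written in Python for Hadoop Streaming. It is a minimal illustration only; the input and output paths are assumptions, and the location of the streaming jar varies by distribution.

```python
#!/usr/bin/env python3
"""Minimal word-count job for Hadoop Streaming (illustrative sketch).

The same script acts as mapper or reducer depending on its first argument.
A typical invocation looks roughly like this (jar path varies by distribution):

  hadoop jar /path/to/hadoop-streaming.jar \
      -input  /data/raw/text \
      -output /data/out/wordcount \
      -mapper  "python3 wordcount.py map" \
      -reducer "python3 wordcount.py reduce" \
      -files wordcount.py
"""
import sys


def mapper():
    # Emit one "<word>\t1" line per token; Hadoop shuffles and sorts by key.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word.lower()}\t1")


def reducer():
    # Input arrives sorted by key, so all counts for a word are contiguous.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")


if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else ""
    mapper() if role == "map" else reducer()
```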

The core Hadoop project incorporates common utilities, a distributed file system, frameworks for resource management and job scheduling, and technology for parallel processing of large data volumes. The broader ecosystem adds complementary tools, such as the Spark family of components and database engines (e.g., Hive and HBase), for different types of analytics processing. These tools extend to advanced analytics with machine learning, including AI-oriented functional components.
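As a brief illustration of these ecosystem tools working together, the following sketch uses Spark SQL (PySpark) to read raw data from the distributed file system, run a parallel aggregation and publish the result as a Hive table for downstream BI and query engines. The paths, column names, database and table name are assumptions for the example only.

```python
# Minimal sketch: Spark SQL reading files from HDFS and publishing an
# aggregate to a Hive table. All names and paths are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("sales-by-region")   # hypothetical job name
    .enableHiveSupport()          # lets Spark read/write Hive metastore tables
    .getOrCreate()
)

# Read semi-structured source data straight from the distributed file system.
orders = spark.read.json("hdfs:///data/raw/orders/")  # assumed HDFS path

# A simple analytical aggregation executed in parallel across the cluster.
by_region = (
    orders
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"),
         F.countDistinct("customer_id").alias("customers"))
)

# Persist the result as a Hive table so BI and query tools can pick it up
# (assumes an "analytics" database already exists in the metastore).
by_region.write.mode("overwrite").saveAsTable("analytics.sales_by_region")

spark.stop()
```

The same DataFrame could equally be handed to Spark's machine learning libraries, which is where the ecosystem's advanced analytics and AI-oriented capabilities come into play.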

Most enterprises have deployed, or are considering the deployment of, Hadoop environments. Business and IT leaders expect Hadoop to help them extract value from their data and to reduce their total cost of ownership (TCO) for BI and analytics. The Hadoop stack continues to evolve rapidly and now incorporates features that allow organizations to build and deploy IT and business solutions that operate at production grade and across multiple corporate locations. Many deployments combine permanent multitenant core clusters, hosted on premises or in a private or public cloud, with on-demand or periodic scale-out clusters in the public cloud for specific processing needs. And as analytics uses become increasingly business critical, more deployments are incorporating disaster recovery (DR) cluster solutions with commensurate recovery time objectives.
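One common way such on-demand scale-out clusters are provisioned is through a cloud provider's managed Hadoop service. The sketch below assumes AWS EMR via boto3 purely for illustration; the cluster sizing, IAM roles and script locations are placeholder assumptions, not a recommended configuration.

```python
# Illustrative sketch only: launching a transient, on-demand scale-out cluster
# in a public cloud (here AWS EMR via boto3) that runs one Spark step and then
# terminates, alongside a permanent core cluster managed elsewhere.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

response = emr.run_job_flow(
    Name="nightly-scoring",                 # hypothetical job name
    ReleaseLabel="emr-6.15.0",              # pick the release your stack supports
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # cluster shuts down after the step
    },
    Steps=[{
        "Name": "score-customers",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/score.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",       # assumed default instance profile
    ServiceRole="EMR_DefaultRole",           # assumed default service role
)
print("Launched transient cluster:", response["JobFlowId"])
```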

Challenges remain

While Hadoop and its rich ecosystem have quickly gained traction as a viable open source technology in the data and analytics marketplace, a number of significant challenges have emerged, as with the broader digital revolution. A Hadoop implementation presents complex planning, deployment and long-term management challenges. Hadoop skills remain in short supply in the marketplace. And although the Hadoop technology stack continues to evolve, it is still maturing and thus poses a higher degree of difficulty and uncertainty.

This is especially true, and risky, around the evolving domains of data ingestion and data governance, where the ground is still shifting and independent vendors are attempting to provide a measure of distribution/release independence and stability. Also, distribution vendors are exploring different approaches to dynamic cluster management, and that can pose some challenges for companies that have, or intend to have, different types of clusters to manage as the company’s business grows.

The business drivers for many data and analytics projects are often unclear or imprecise. Discovery projects frequently lack focus or a sufficiently clear view of the business benefits expected from the use cases being evaluated. Estimating benefits and prioritizing use cases may require thorough preparation with key stakeholders across the organization before the evaluations are viewed as representative and robust enough to justify investment and to mobilize the resources needed to execute. Not surprisingly, many organizations are struggling to identify a clear path to value for their current Hadoop and other data, analytics and BI investments.

A logical adoption cycle

A carefully planned, phased and cost/benefit-balanced adoption environment is crucial for the successful implementation and adoption of Hadoop or any other complex technology system. DXC Technology recognizes a proven three-step approach to implementing sophisticated data or analytics systems:

1. Discovery. In this initial exploratory phase, organizations consider potential project candidates, build and evaluate the business case for potential Hadoop and analytics initiatives and reject, when appropriate, poor project candidates. Many Hadoop projects fall into the discovery category. Some include new analytics use cases; others may be about optimizing workloads and off-loading legacy systems.

2. Development and integration. Once a project has proven its value, it must be built and integrated with existing applications and with the larger BI and analytics landscape. The enterprise integration needs to include connectivity, security and operational support, as well as application interfaces.

3. Implementation. In this final, critical phase, organizations may roll out multiple, industrial-strength Hadoop applications. Depending on the nature of the organization, those deployments may occur across a complex IT landscape, in multiple clusters and on a global scale. The rollout needs to enable adoption by targeted user communities; these communities may be addressed in incremental waves and may require tailored training.

Methods and tools

The good news is that a growing range of techniques and technologies is now available to support the Hadoop framework environment. The following outlines some of the major tools and resources organizations can use to ensure more successful data and analytics project outcomes.