Q&A: John Schroeder of MapR on Hadoop in the Enterprise

Apache Hadoop has become one of the most important technologies in the Big Data ecosystem. We recently spoke with John Schroeder, the founder, president and CEO of MapR, to learn more about the latest on MapR's mission to make it easier for companies to develop Hadoop applications.

Apache Hadoop has become one of the most important technologies for managing large volumes of data, and has given rise to a growing ecosystem of tools and applications that can store and analyze "Big Data" datasets on commodity hardware. MapR is one of the companies building solutions atop Hadoop, offering a distribution for enterprise users that seeks to reduce the complexity of setting up and managing Hadoop. We recently conducted an email Q&A with John Schroeder, the president and CEO of MapR, to learn more about the latest on Hadoop, and MapR's mission to make it easier for companies to develop Hadoop applications.

Data Center Knowledge: For those who may not be familiar with it, what is Hadoop?

John Schroeder: Apache Hadoop is a software framework that supports data-intensive distributed applications. Hadoop was inspired by a published Google MapReduce whitepaper. Apache Hadoop provides a new platform to analyze and process Big Data. With data growth exploding and new unstructured sources of data expanding, a new approach is required to handle the volume, variety and velocity of this growing data. Hadoop clustering exploits commodity servers and increasingly less expensive compute, network and storage.

DCK: How did it get developed and for what purposes?

Schroeder: After reading Google’s paper in 2003, a Yahoo engineer developed a Java-based implementation of MapReduce, and named it after his son’s stuffed elephant, Hadoop. Essentially, Hadoop provides a way to capture, organize, store, search, share and analyze disparate data sources (structured, semi-structured and unstructured) across a large cluster of commodity computers, and is designed to scale up from dozens to thousands of servers, each offering local computation and storage. The raw technology initially developed for use by Web 2.0 companies like Yahoo and Facebook has now been made ready for broad adoption due to developments by MapR and a large ecosystem of vendors.
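For readers new to the programming model mentioned above, the following is a minimal single-process sketch of the map, shuffle and reduce steps, using a word count as the example. It is an illustration only, not Hadoop's Java API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real cluster would run the map and reduce steps in parallel across many servers.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    In a real cluster the framework does this between map and reduce;
    here it is a plain in-memory dictionary.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data on commodity servers"]))
```

Because each map call looks at only one record and each reduce call at only one key, the work can be split across machines, which is what lets the approach scale out on commodity servers.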

DCK: What is MapR's involvement in the Open Source community around Hadoop?

Schroeder: MapR is a corporate committer on the Apache Hadoop project. Our involvement is very similar to that of some of our competitors and consistent with industry-standard practices popularized by open source leaders including Red Hat. MapR combines twelve open source Apache projects with some of our own intellectual property to produce a full distribution for Hadoop. Our innovations transform Hadoop into a reliable compute platform and dependable data store, with fast performance. We've also made it easier to build Hadoop applications by opening up a much broader set of open APIs to use against the data that's stored in Hadoop.

We’ve also made a commitment to the open source community by defining a new Apache project, Apache Drill, to provide low-latency, interactive data analysis on large data sets. Google developed its internal tool, Dremel, to provide low-latency, interactive data analysis. Now, through Drill, we’re working with an initial group of committers to develop this capability as an Apache open source project, in order to establish ubiquitous APIs and a flexible and robust architecture that will support a broad range of data sources, data formats and query languages.

DCK: What is the current level of development and maturity of the Open Source version of Hadoop?

Schroeder: Hadoop is a breakthrough technology that provides a great deal of value to organizations, but the breadth of use cases that are supported is limited. Experienced users consider Hadoop a batch-oriented scratch pad for large-scale analytics. In other words, the production deployments of Hadoop are limited. There is work going on to attempt to address some of these issues, but a long-term re-architecture of the underlying platform is required to support enterprise-grade high availability and data protection. For example, the Hadoop Distributed File System is append-only. It does not support concurrent read/write or provide support for snapshots or mirroring.

DCK: What is required to make Hadoop work for the enterprise?

Schroeder: Hadoop must be easy to integrate into the enterprise, as well as more enterprise-grade in its operation, performance, scalability and reliability. We’ve made specific innovations to provide enterprise-grade support. Hadoop platforms should provide:

Data Protection – The same level of enterprise data protection required for other applications must be provided for Hadoop. Businesses must be protected against data loss and assured that they can recover from application and user error. With support for volumes, snapshots and mirroring for all data within the Hadoop cluster, data protection and reliability are greatly improved to satisfy recovery time objectives and business continuity across multiple data centers, as well as integration between on-premise and private clouds.

High Availability (HA) – High availability with automated stateful failover and self-healing is essential in every enterprise environment. As businesses roll out Hadoop, the same set of standards should apply as for other enterprise applications. With automated stateful failover and full HA, enterprises can eliminate single points of failure from unplanned and planned downtime to protect users against the unexpected.

Ability to integrate into existing environments – Connections, support for standards. The limitations of the Hadoop Distributed File System require wholesale changes to existing applications and extensive development of new ones. Enterprise-grade Hadoop requires full random read/write support and direct access with NFS to simplify development and ensure business users can access the information they need directly.
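To make the last point concrete, here is a minimal sketch of what random read/write through a file-based interface looks like to an application: ordinary open, seek, read and write calls, with no Hadoop-specific client code. It assumes the cluster is exposed as a mounted directory; in this sketch a temporary local directory stands in for that hypothetical mount point, and the helper names (overwrite_in_place, read_at) and the file name are made up for illustration.

```python
import os
import tempfile


def overwrite_in_place(path, offset, data):
    """Overwrite bytes at an arbitrary offset of an existing file.

    This is the kind of update an append-only file system cannot do.
    """
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def read_at(path, offset, length):
    """Read `length` bytes starting at `offset`, without scanning from the start."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


if __name__ == "__main__":
    # Assumption: a temp directory stands in for a hypothetical NFS mount
    # of the cluster; a real deployment would use its own mount path.
    mount = tempfile.mkdtemp()
    path = os.path.join(mount, "job_status.txt")

    with open(path, "wb") as f:
        f.write(b"status=PENDING\n")

    # Rewrite the 7-byte value in the middle of the file, padded to keep the length.
    overwrite_in_place(path, 7, b"DONE   ")
    print(read_at(path, 0, 15))
```

Existing tools that already read and write files this way would need no changes to work against such a mount, which is the integration argument being made.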

DCK: What does enterprise-grade Hadoop mean for end users?

Schroeder: Enterprise-grade is shorthand for describing a Hadoop platform that is constructed to support the broadest range of use cases. Don’t assume that only enterprise companies need the enterprise qualities of reliability and dependability. Even Web 2.0 companies rely on their Hadoop applications. If you look at the initial architecture, Hadoop provided a platform for batch map/reduce processing. Now, through the efforts of the community and through the efforts of MapR, mission-critical use cases can be supported on MapR with full data availability and protection. One of the reasons Hadoop is so rapidly gaining in popularity is that we've moved it from batch-only map/reduce toward real time. MapR has also expanded the supported programming and data access interfaces. MapR added a POSIX-compliant storage system, along with the ability to access Hadoop using file-based interfaces. We also added JDBC and ODBC drivers, so you can access data in Hadoop from all the standard database tools.

DCK: What are the current use cases for Hadoop in the enterprise and how are they expanding?

Schroeder: Hadoop initially provided the optimal platform for predictive analytics. The advent of Apache Drill provides for interactive, real-time query processing. HBase is improving and gaining share in the NoSQL market for lightweight OLTP and table-based applications.

Hadoop has very wide applicability in nearly every industry and application domain. For example, financial services companies are using Hadoop for fraud detection and analysis, providing a scalable way to detect many types of fraud, support loss prevention and mitigate the risk of financial positions.

In the competitive media and entertainment landscape, the effectiveness of information plays a critical role. Companies rely on Hadoop to target marketing offers, mine customer insights and run ad platforms that store and analyze large data sets consisting of billions of objects. Manufacturing firms are performing failure analysis on equipment and optimizing processes throughout the supply chain. Healthcare companies are able to search and analyze disparate data sources such as patient populations, treatment protocols, and clinical outcomes to accelerate discovery and insight.