Interview: MapR’s Steve Jenkins on Hadoop and how it is making small work of big data

Its name is derived from a cute toy elephant, but Hadoop is anything but a soft toy. It has been attracting a lot of press lately thanks to the growing number of high-profile technology companies using it; indeed, the list of Hadoop users reads like a Who's Who of Silicon Valley. We caught up with Steve Jenkins, VP EMEA at MapR, a California-based enterprise software company that develops and sells Apache Hadoop-based software, to shed some light on the phenomenon.

In a nutshell, what is Hadoop?

Hadoop is an open-source software framework that allows an application to be divided into many small fragments of work, each of which can be executed on any node within a cluster of computing systems. Hadoop also provides a distributed file system that stores data on the compute nodes to improve overall performance.
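The divide-and-conquer model described here is Hadoop's MapReduce: a map phase turns each fragment of input into key-value pairs, and a reduce phase combines the values for each key. As a rough illustration only, not MapR's or Hadoop's actual API, here is a single-process word-count sketch in plain Python:

```python
from collections import defaultdict

# Map phase: each input fragment is turned into (key, value) pairs.
# On a real Hadoop cluster, many mappers run in parallel on different nodes.
def map_phase(fragment):
    for word in fragment.split():
        yield (word.lower(), 1)

# Shuffle: group all values emitted for the same key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine each key's values into a final result.
def reduce_phase(key, values):
    return key, sum(values)

fragments = ["big data big insight", "big data everywhere"]
pairs = [p for frag in fragments for p in map_phase(frag)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'insight': 1, 'everywhere': 1}
```

In a real cluster the map and reduce steps run on many nodes at once, and the framework handles the shuffle, scheduling and failure recovery.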

Why is the technology important to big data processing and analysis?

The Hadoop platform is designed to solve problems where you have a lot of data, often a mix of complex, unstructured and structured data, that does not fit well into the tables of a traditional database. The system is suited to running sophisticated analysis models that are constrained on traditional databases, which typically run on a single large computer.

Can you give an example of how it improves the cost/time dynamic over traditional techniques?

One of the most striking examples of how Hadoop can help solve challenging problems comes from recent TeraSort benchmarks, in which a terabyte of data is sorted using Hadoop and a compute cloud. Using the MapR Hadoop platform on Google Compute Engine, this complex big data job took 54 seconds, with the tasks spread across roughly 1,000 processing cores. To buy and build an equivalent non-Hadoop platform to perform this task would cost several million dollars and probably take longer. In this example, the compute resources rented from Google cost roughly $9!

Which organisations are early adopters of the technology?

Hadoop is used by many organisations with access to large volumes of data that, when analysed, can reveal valuable insights. This could include calculating insurance premiums by cross-referencing accident data against claimant profiles, or examining data from a jet engine to help determine how airframe modifications may improve fuel efficiency. The scope and scale of the big data problems that Hadoop can assist with are extremely broad.

Can you give us a real world example of a company benefiting from Hadoop?

One great example is a major credit card company that analyses transaction data flooding in from lots of different sources to spot potentially fraudulent activity. A major telecom provider is performing data warehouse offloading to Hadoop to take advantage of a platform that is 50 times cheaper.

Who is behind the technology and what is the relationship with Google?

The original idea that became Hadoop was the brainchild of engineers at Google who needed a way of sorting and searching huge amounts of data for the company's search engine. Google published a white paper on the concept that inspired the creation of Hadoop, which was named after a stuffed toy elephant owned by the child of one of its creators. Today, Google continues to use its MapReduce framework internally alongside other big data technologies. From a Hadoop standpoint, Google has a partnership with MapR to provide customers with Hadoop functionality in conjunction with the Google Cloud Platform.

Why are there so many versions of the technology and how do they differ?

The Hadoop platform is supported by the Apache Software Foundation and consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS) and a number of related projects such as Apache Hive, HBase and ZooKeeper. However, a number of vendors have created their own versions offering different sets of features that address challenges or limitations around performance, storage management and integration with other enterprise applications, or that meet certain specialist use cases. In a similar fashion to enterprise versions of Linux such as Red Hat and SUSE, Hadoop also has distributions that offer additional support and enterprise-grade features beyond those available through open source.

So how does Hadoop compare with traditional relational database technology from the likes of Oracle and IBM?

Much of the world’s structured data is stored in traditional relational databases, but increasingly a lot of data, for example natural text, images or website searches, is not suited to a traditional database structure. Hadoop instead allows data to be processed without requiring a predefined structure, with the ability to scale across multiple machines while still maintaining a coherent process. So rather than relational databases, Hadoop works well with non-relational, distributed databases such as HBase, which is modelled on Google's BigTable technology.
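The "no predefined structure" point is often called schema-on-read. Here is a hedged sketch, using made-up sample records, of how mixed-format data might be interpreted only at processing time rather than being forced into tables up front:

```python
import json

# Hypothetical mixed records as they might arrive from different sources:
# key=value logs, JSON events, and free text (illustrative data only).
records = [
    "user=alice action=search query=elephant",
    '{"user": "bob", "action": "click", "page": "/products/42"}',
    "ERROR 2013-05-01 payment gateway timeout",
]

# Schema-on-read: each record is interpreted at processing time, rather
# than being forced into a predefined table structure at load time.
def parse(record):
    if record.startswith("{"):
        return json.loads(record)               # structured JSON event
    if "=" in record:
        # naive key=value parsing, enough for this sketch
        return dict(p.split("=", 1) for p in record.split())
    return {"raw": record}                      # unstructured free text

actions = [parse(r).get("action", "unknown") for r in records]
print(actions)  # ['search', 'click', 'unknown']
```

A relational database would reject or mangle the third record; here it simply falls through to a raw-text representation and the analysis carries on.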

What additional benefits will future versions of Hadoop potentially offer?

The Hadoop ecosystem is a fast changing marketplace with innovations happening at all levels. One exciting area is the commercial innovations that are expanding the use cases that are possible with Hadoop. These include support for mission critical applications, the combination of file based analytics and NoSQL, integrated search, and support for streaming, real-time analytics.

Is there anything else you would like to add?

Hadoop represents the biggest paradigm shift to impact enterprise computing that we’ve seen in decades. However, it need not be completely disruptive. Innovations that merge 100 per cent POSIX compliance with Hadoop enable organisations to access Hadoop as they would enterprise storage, as well as leverage advanced MapReduce functionality. This greatly speeds up adoption and makes development and administration easier. Similarly, innovations that provide business continuity and high availability transform Hadoop into a platform suitable for mission-critical workloads and enterprise SLAs.