Apache Announces Hadoop Upgrade, Elevates Spark Project

The Apache Software Foundation this week voted to release a Hadoop upgrade that adds in-memory caching of data and support for different classes of storage. The foundation also elevated Spark, the Big Data analytics project, to top-level status.

Vendors have recently flooded the Big Data industry with in-memory analytics product announcements. Now, in-memory caching of Hadoop Distributed File System (HDFS) data in the new Hadoop 2.3.0 release will help developers boost performance of the baseline open source Hadoop distribution.

The problem addressed, according to release notes, is that "HDFS currently has no support for managing or exposing in-memory caches at datanodes. This makes it harder for higher level application frameworks like Hive, Pig and Impala to effectively use cluster memory, because they cannot explicitly cache important datasets or place their tasks for memory locality."

That issue has been fixed, explained Arun Murthy, co-founder of Hortonworks Inc., a major Hadoop distributor. "It is now possible to use memory available in the Hadoop cluster to centrally cache and administer data sets in-memory in the datanode’s address space," Murthy said. "Applications such as MapReduce, Hive, Pig [and so on] can now request for memory to be cached ... and then read it directly off the datanode’s address space for extremely efficient scans by avoiding disk all together."

Cloudera Inc., another major Hadoop distributor, said the in-memory caching was developed by two of its engineers. By letting developers target certain files and directories for caching, the feature "enables memory-speed reads in HDFS," Cloudera said. "Preliminary benchmarks show that optimized applications can achieve read throughput on the order of gigabytes per second."
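
As a rough illustration of what targeting files and directories for caching can look like, the Python sketch below assembles the `hdfs cacheadmin` command lines that create a cache pool and add a cache directive for a path. The pool name and path are hypothetical, and the commands are only built and printed, not executed; running them would require a cluster on Hadoop 2.3.0 or later.

```python
import shlex


def add_pool_cmd(pool):
    """Build the command that creates a cache pool (a group of cache directives)."""
    return ["hdfs", "cacheadmin", "-addPool", pool]


def add_directive_cmd(path, pool, replication=None):
    """Build the command that asks HDFS to cache a file or directory in a pool."""
    cmd = ["hdfs", "cacheadmin", "-addDirective", "-path", path, "-pool", pool]
    if replication is not None:
        # Optional: number of cached replicas to keep in datanode memory.
        cmd += ["-replication", str(replication)]
    return cmd


def render(cmd):
    """Render an argument list as a shell-safe command string."""
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    # Hypothetical pool name and path, for illustration only.
    print(render(add_pool_cmd("analytics")))
    print(render(add_directive_cmd("/data/hot/sales", "analytics", replication=2)))
```

Building argument lists rather than raw strings keeps the sketch safe to pass to `subprocess.run` later, should the commands actually be executed against a cluster.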

The other major improvement, Heterogeneous Storage Hierarchy, means developers can work with different kinds of storage in HDFS. "We now can take advantage of different storage types on the same Hadoop clusters," Murthy said. "Hence, we can now make better cost/benefit tradeoffs with different storage media such as commodity disks, enterprise-grade disks, [solid-state drives], memory [and so on]."

Other improvements in the new Hadoop release include hundreds of bug fixes and new features such as "simplified distribution of MapReduce binaries via the YARN Distributed Cache," noted Cloudera.

In other news, Apache yesterday announced that Spark has been elevated from its previous incubator status to a top-level project. That means "the project's community and products have been well-governed under the ASF's meritocratic process and principles," Apache said.

Spark is a distributed computing framework that allows for advanced analytics in Hadoop. "Spark is well suited for machine learning, interactive queries, and stream processing, and can read from HDFS, HBase, Cassandra, as well as any Hadoop data source," Apache said.