Hadoop Upgrades Address Enterprise-Scale Data Analytics

Recent updates from the Apache Software Foundation should help enterprises looking to capitalize on the promise of Big Data analytics. Hadoop, the main open source tool for processing huge amounts of data from disparate sources, has been updated to a new version and can now more easily work with different classes of storage. It can also process data more quickly with a new in-memory caching capability.

Meanwhile, Apache graduated its Spark project from incubator status to a top-level project. The advanced analytics engine for enterprise-scale data processing is used by many large organizations such as IBM, Intel and Yahoo.

The new in-memory caching follows a general trend of database vendors moving to in-memory analytics, both in their proprietary products and in their own Hadoop distributions built on the core open source software. Now, with the Hadoop 2.3.0 release, the baseline distribution itself lets developers more quickly process data stored in the Hadoop Distributed File System (HDFS).

The problem in the distribution, according to release notes, is that "HDFS currently has no support for managing or exposing in-memory caches at datanodes. This makes it harder for higher level application frameworks like Hive, Pig and Impala to effectively use cluster memory, because they cannot explicitly cache important datasets or place their tasks for memory locality."

That issue has been fixed, explained Arun Murthy, co-founder of Hortonworks Inc., a major Hadoop distributor. "It is now possible to use memory available in the Hadoop cluster to centrally cache and administer data sets in-memory in the datanode’s address space," Murthy said. "Applications such as MapReduce, Hive, Pig [and so on] can now request for memory to be cached ... and then read it directly off the datanode’s address space for extremely efficient scans by avoiding disk all together."

Cloudera Inc., another major Hadoop distributor, said the in-memory caching was developed by two of its engineers. By letting developers target certain files and directories for caching, the feature "enables memory-speed reads in HDFS," Cloudera said. "Preliminary benchmarks show that optimized applications can achieve read throughput on the order of gigabytes per second."
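
For readers who want a feel for the idea, the following is a minimal Python sketch of centrally managed caching and memory-local task placement. It is a toy model, not Hadoop's API: the class names (DataNode, CacheManager), the methods (add_directive, place_task), the paths and the sizes are all hypothetical, chosen only to illustrate how an application might ask for a path to be cached and then be steered to the datanode that holds it in memory.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Hypothetical datanode with a fixed memory budget for cached data."""
    name: str
    cache_capacity: int                          # units of memory set aside for caching
    cached: dict = field(default_factory=dict)   # path -> size held in memory

    def try_cache(self, path: str, size: int) -> bool:
        """Pin the path in memory if it fits in the remaining budget."""
        if sum(self.cached.values()) + size > self.cache_capacity:
            return False
        self.cached[path] = size
        return True


class CacheManager:
    """Toy central coordinator (an assumption, not Hadoop's implementation)."""

    def __init__(self, nodes, files):
        self.nodes = {node.name: node for node in nodes}
        self.files = files   # path -> (size, [names of nodes with an on-disk copy])

    def add_directive(self, path: str):
        """Request caching; returns the node that cached the path, or None."""
        size, holders = self.files[path]
        for name in holders:
            if self.nodes[name].try_cache(path, size):
                return name
        return None

    def place_task(self, path: str):
        """Prefer a node with the path in memory; otherwise fall back to disk."""
        _, holders = self.files[path]
        for name in holders:
            if path in self.nodes[name].cached:
                return name, "memory"
        return holders[0], "disk"


# Illustrative cluster: two datanodes and two files (all values invented).
manager = CacheManager(
    nodes=[DataNode("dn1", cache_capacity=100), DataNode("dn2", cache_capacity=300)],
    files={
        "/warehouse/hot": (200, ["dn1", "dn2"]),
        "/warehouse/cold": (50, ["dn1"]),
    },
)

cached_on = manager.add_directive("/warehouse/hot")   # dn1 is too small, so dn2 caches it
hot_placement = manager.place_task("/warehouse/hot")
cold_placement = manager.place_task("/warehouse/cold")
print(cached_on, hot_placement, cold_placement)
```

In this model the scan of the cached path is scheduled on the node that holds it in memory, while the uncached path falls back to a node with a copy on disk, which is the kind of memory-locality decision the release notes say frameworks previously could not make.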

The other major improvement, Heterogeneous Storage Hierarchy, means developers can work with different kinds of storage in HDFS. "We now can take advantage of different storage types on the same Hadoop clusters," Murthy said. "Hence, we can now make better cost/benefit tradeoffs with different storage media such as commodity disks, enterprise-grade disks, [solid-state drives], memory [and so on]."

Other improvements in the new Hadoop release include hundreds of bug fixes and new features such as "simplified distribution of MapReduce binaries via the YARN Distributed Cache," noted Cloudera.

In other news, Apache yesterday announced that Spark has been elevated from its previous incubator status to a top-level project. That means "the project's community and products have been well-governed under the ASF's meritocratic process and principles," Apache said.

Spark is a distributed computing framework that allows for advanced analytics in Hadoop. "Spark is well suited for machine learning, interactive queries, and stream processing, and can read from HDFS, HBase, Cassandra, as well as any Hadoop data source," Apache said.