From the Dev Team

Not a day passes without someone tweeting or re-tweeting a blog on the virtues of Apache Spark.

At a Memorial Day BBQ, an old friend proclaimed: “Spark is the new rub, just as Java was two decades ago. It’s a developers’ delight.”

Spark as a distributed data processing and computing platform offers much of what developers desire and delight in, and much more. To the ETL application developer, Spark offers expressive APIs for transforming data; to data scientists, it offers machine learning libraries through its MLlib component; and to data analysts, it offers SQL capabilities for inquiry.…

The Apache Accumulo community has announced its 1.7.0 release. As the community’s first major release of 2015, it represents the culmination of a year of effort from many Accumulo committers and contributors. Apart from the many notable changes enumerated below, Accumulo is now well integrated with Apache Ambari.

In this release, 43 different individuals fixed 691 JIRA issues, and we thank everyone who helped in any way to make Apache Accumulo 1.7.0 a reality.…

SQL is the most popular use case for the Hadoop user community, and Apache Hive is still the de facto standard. Early this week, the Apache Hive community released Apache Hive 1.2.0.

With this, already its third release this year, the Hive developer community continues to improve Hive and grow its team, with 11 Hive contributors promoted to committers in the last three months. Dedicated to making Hive enterprise-ready, the community has made improvements in the following areas:

Additional SQL functionality

Security enhancements

Performance gains

Stability and usability

For the complete list of features, improvements, and bug fixes, see the release notes.…

This is the third post in a series that explores the theme of enabling diverse workloads in YARN. See our introductory post to understand the context around all the new features for diverse workloads as part of YARN in HDP 2.2, and a related post on CPU scheduling.

Introduction

One of the core responsibilities of YARN is monitoring and limiting resource usage of application containers. When it comes to resource management there are two parts:

Resource allocation: Application containers should be allocated on nodes that have the required resources and

Enforcement and isolation of Resource usage: Containers should only be allowed to use the resources they get allocated on a NodeManager (NM).
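As a rough illustration of the two parts above, here is a minimal Python sketch. It is a toy model, not YARN's scheduler or NodeManager code: the names `Node`, `allocate`, and `enforce` are hypothetical, the node sizes are made up, and memory is the only resource it tracks.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical worker node with a fixed memory capacity (in MB)."""
    name: str
    capacity_mb: int
    allocated: dict = field(default_factory=dict)  # container id -> MB granted

    def free_mb(self) -> int:
        """Memory on this node that has not been granted to any container."""
        return self.capacity_mb - sum(self.allocated.values())


def allocate(nodes, container_id, request_mb):
    """Resource allocation: place the container on the first node with room."""
    for node in nodes:
        if node.free_mb() >= request_mb:
            node.allocated[container_id] = request_mb
            return node
    return None  # no node can satisfy the request right now


def enforce(node, usage_mb):
    """Enforcement: list containers using more memory than they were granted."""
    return [cid for cid, used in usage_mb.items()
            if cid in node.allocated and used > node.allocated[cid]]


nodes = [Node("node-a", 4096), Node("node-b", 8192)]
first = allocate(nodes, "c1", 3072)   # fits on node-a
second = allocate(nodes, "c2", 2048)  # node-a has only 1024 MB free, so node-b
over_limit = enforce(nodes[0], {"c1": 3500})  # c1 exceeds its 3072 MB grant
```

In this sketch, allocation only decides where a container may run, while enforcement is a separate per-node check against the amount that was granted. A real system would also have to decide what to do with the offending containers.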

Kristen Hardwick, Vice President of Big Data Solutions at Spry, Inc., is our guest blogger. In this blog, Kristen shares a performance analysis from Spry’s evaluation of Apache Hive with Tez as a fast query engine.

In early 2014, Spry developed a solution that heavily utilized Hive for data transformations. When the project was complete, three distinct data sources were integrated through a series of HiveQL queries using Hive 0.11 on HDP 2.0.…

With YARN and HDFS at the architectural center, Hadoop has emerged as a key component of any modern data architecture. Today, enterprises use Hadoop to store critical datasets and power many of their critical workloads. With this in mind, the services and data within a Hadoop cluster need to be highly available in the face of failures and continue to function while the cluster is upgraded to the latest software version.

With the Hortonworks Data Platform (HDP) 2.2, we have enhanced the core platform packaging to put in place support for rolling upgrades of the HDP stack while the cluster is actively servicing users.…

This is the fourth post in a series that explores the theme of enabling diverse workloads in YARN. See the introductory post to understand the context around all the new features for diverse workloads as part of YARN in HDP 2.2.

Introduction

When it comes to managing resources in YARN, there are two aspects that we, the YARN platform developers, are primarily concerned with:

Resource allocation: Application containers should be allocated on the best possible nodes that have the required resources and

Enforcement and isolation of Resource usage: On any node, don’t let containers exceed their promised/reserved resource allocation.

From its beginning in Hadoop 1, all the way to Hadoop 2 today, the compute platform has always supported memory based allocation and isolation.…

Historically, the strength of a platform lies in the ability of developers to learn, try, and build against the platform APIs and capabilities. As Apache Hadoop matures as a platform, it’s the creativity and efforts of the developer community that are driving the innovation that makes Hadoop a vibrant and impactful foundation of a modern data architecture.

A successful developer community leads to a successful platform, and at Hortonworks we are committed to reducing the friction to speed up the success of our customers.…

With Apache Hadoop YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it in different ways. As YARN propels Hadoop’s emergence as a business-critical data platform, the enterprise requires more stringent data security capabilities. The Apache Knox Gateway (“Knox”) provides HTTP based access to resources of the Hadoop ecosystem so that enterprises can confidently extend Hadoop access to more users, while maintaining compliance with enterprise security policies.…

Two weeks ago, Apache ORC became an Apache top-level project within the Apache Software Foundation (ASF). This is a major step forward for the project, and it reflects the momentum built by a broad community of developers.

What is ORC and why is it useful?

Back in January 2013, we created ORC files as part of the Stinger initiative to massively speed up Apache Hive and improve the storage efficiency of data stored in Apache Hadoop.…

This is the 3rd post in a series that explores the theme of supporting rolling-upgrades & downgrades of a Hadoop YARN cluster. See the introductory post here.

Background and Motivation

Before HDP 2.2, Hadoop MapReduce applications depended on MapReduce jars being deployed on all the nodes in a cluster. The Java classpath of all the tasks and the ApplicationMaster of a MapReduce job was set to point to the deployed jars.…

This is the third post in a series that explores the theme of supporting rolling-upgrades & downgrades of a Hadoop YARN cluster. See here for an introductory post.

Introduction

Carrying out a rolling upgrade/downgrade of all nodes in a Hadoop cluster can be a very disruptive process. Before HDP 2.2, if a NodeManager (NM) were brought down, all active containers on that node would be killed. This would significantly interrupt all applications in the cluster being upgraded/downgraded.…