Data Eng Weekly

Hadoop Weekly Issue #158

21 February 2016

It was a busy week of announcements and releases, many of which coincided with Spark Summit East that took place in New York. Among the highlights are the new Apache Arrow project (in-memory columnar storage) and the community edition of Databricks. Netflix and Google both wrote a bit about their big data infrastructure, and there are great articles about Cassandra, MLlib, Python & Hadoop, Kafka Connect, and more.

Technical

This post looks at why a time series is a good model for storing many types of medical data, why Cassandra is a good system for storing time series data, how to model time series data in Cassandra (including some example table definitions and row-level inserts), and more.
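A common pattern in Cassandra time-series modeling of the kind the post describes is bucketing rows by a time window so no single partition grows without bound. A minimal sketch of that idea in plain Python (table, column, and key names here are illustrative, not taken from the post):

```python
from datetime import datetime, timezone

def partition_bucket(patient_id: str, ts: datetime) -> tuple:
    """Compute the partition key for a reading: (patient_id, day bucket).

    Bucketing by day keeps each Cassandra partition bounded; the
    timestamp then serves as the clustering key within the partition.
    """
    day = ts.strftime("%Y-%m-%d")
    return (patient_id, day)

# The hypothetical CQL table this bucketing corresponds to:
# CREATE TABLE readings (
#     patient_id text, day text, ts timestamp, value double,
#     PRIMARY KEY ((patient_id, day), ts)
# ) WITH CLUSTERING ORDER BY (ts DESC);

reading_ts = datetime(2016, 2, 21, 9, 30, tzinfo=timezone.utc)
print(partition_bucket("patient-42", reading_ts))  # ('patient-42', '2016-02-21')
```

With this layout, "latest readings for a patient on a given day" is a single-partition query ordered by the clustering key.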

The Netflix blog has a post on the evolution of their data pipeline, which is currently handling over 1 petabyte of data per day. Originally built on Chukwa and Amazon EMR, the latest system (called Keystone) is based on Kafka and EMR, supplemented by Elasticsearch and streaming consumers (Spark and others).

The Cloudera blog has an example of using Spark's MLlib to do churn analysis. The post shows how to use Spark SQL (from Python) to define the schema of a CSV file, build a feature vector, use a RandomForestClassifier to train a model, and validate the generated model.
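The feature-vector step in that pipeline can be illustrated without a Spark cluster. The sketch below mirrors it in plain Python (the post itself uses Spark SQL and MLlib; the column names and sample rows here are hypothetical):

```python
import csv
import io

# Hypothetical churn CSV; in the post, Spark SQL declares this schema
# and MLlib's RandomForestClassifier consumes the resulting vectors.
RAW = """churned,day_mins,eve_mins,intl_calls
yes,265.1,197.4,3
no,161.6,195.5,3
"""

def to_labeled_point(row):
    """Build a (label, feature-vector) pair, analogous to an MLlib LabeledPoint."""
    label = 1.0 if row["churned"] == "yes" else 0.0
    features = [float(row["day_mins"]), float(row["eve_mins"]),
                float(row["intl_calls"])]
    return label, features

points = [to_labeled_point(r) for r in csv.DictReader(io.StringIO(RAW))]
print(points[0])  # (1.0, [265.1, 197.4, 3.0])
```

In the actual post, this transformation runs distributed over a DataFrame, but the shape of the data handed to the classifier is the same: a label plus a numeric feature vector per row.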

The IBM Hadoop blog has a presentation on Spark troubleshooting—covering topics ranging from compiling Spark to optimizing cluster utilization to collecting thread dumps for debugging production issues. The presentation covers over 10 different troubleshooting tasks.

Hortonworks is starting a new weekly blog series highlighting articles from their Hortonworks Community Connection. This week, the selected articles cover NiFi, Storm, Kafka, and Ambari. They also highlighted three community questions from the week.

Apache Arrow is a new top-level project for columnar, in-memory data spun out of the Apache Drill project. The Apache blog has the announcement, MapR has a post about the origins and design of Arrow (aka Value Vectors), and Cloudera has a post about their plans for integrating Arrow with other big data projects.
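Arrow's core idea, a columnar in-memory layout shared across systems, can be illustrated with a toy sketch. This is plain Python showing the layout difference, not Arrow's actual API:

```python
# Row-oriented: one record per entry; reading a single field still
# touches every record object.
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 3.0},
    {"id": 3, "price": 7.5},
]

# Column-oriented (Arrow-style): each field is one contiguous array,
# which is cache-friendly for scans and can be handed between engines
# without per-record serialization.
columns = {
    "id": [1, 2, 3],
    "price": [9.5, 3.0, 7.5],
}

total = sum(columns["price"])  # scans a single contiguous array
print(total)  # 20.0
```

In Arrow proper these columns are typed, memory-aligned buffers, so multiple processes (e.g., a Python DSL and a JVM query engine) can operate on the same data with zero-copy access.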

Cloudera has an update on the state of Python and Hadoop. The past year has seen improvements to PySpark and the emergence of several Python DSLs, which have greatly improved the utility of Python for big data processing. The post recaps the current landscape and describes two new initiatives—efficient data interchange via Apache Arrow and Cloudera Manager integration with Continuum Analytics' Anaconda Python.

While modern tools have automated much of Hadoop configuration, many settings aren't "set it and forget it." As a cluster grows or utilization goes up, settings that made sense initially can cause major problems. This post describes one such issue (and how to resolve it)—an HDFS NameNode that became overwhelmed by the number of HDFS blocks it had to track.
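The scale of the NameNode metadata problem is easy to see with back-of-the-envelope arithmetic. The ~150 bytes-per-object heap figure below is a commonly cited rule of thumb, not a number from the post:

```python
def namenode_heap_estimate(num_files, avg_blocks_per_file, bytes_per_object=150):
    """Rough NameNode heap needed for file and block metadata.

    Rule of thumb: each file object and each block object costs on the
    order of 150 bytes of NameNode heap. Since every file needs at
    least one block, many small files inflate the object count fast.
    """
    objects = num_files * (1 + avg_blocks_per_file)
    return objects * bytes_per_object

# 100M small files with 1 block each: ~30 GB of heap for metadata alone.
print(namenode_heap_estimate(100_000_000, 1) / 1e9)  # 30.0
```

This is why block count (and the small-files problem in particular) is one of the settings-versus-scale issues that resurfaces as a cluster grows.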

The Confluent blog has a post on Kafka Connect, a new tool for moving data between Kafka and other systems. Part of the recently released Kafka 0.9.0, Kafka Connect ships with connectors for HDFS and JDBC. The introductory blog post has many more details on the design and implementation.
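A Kafka Connect connector is driven by a small properties file handed to a Connect worker. The fragment below sketches what an HDFS sink configuration looks like; the topic name and URLs are placeholder values, not from the post:

```properties
# Illustrative Kafka Connect HDFS sink configuration (standalone mode).
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
# Topic(s) to drain into HDFS (placeholder name)
topics=page_views
# Target cluster and commit granularity (placeholder values)
hdfs.url=hdfs://namenode:8020
flush.size=1000
```

The appeal of Connect is that this declarative config replaces custom consumer code: the framework handles offsets, parallelism, and delivery semantics.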

Google has written about several large-scale sorting experiments that they've run over the past 10 years. Describing the road to sorting 50PB in 2012, the post shares a lot of interesting anecdotes (such as the use of Reed-Solomon encoding in 2010, and the fact that their 2012 benchmark run was faster than the 2015 GraySort winner).

News

This post looks at the big data landscape, which the author argues has matured and is in a deployment phase. It's still difficult to start a big data system from scratch, but the ecosystem has matured (the post includes some financial numbers to back up this claim). The post also includes a map of the landscape, covering areas such as infrastructure, analytics, and open source.

Hortonworks has a post about what they've seen with enterprise adoption of Spark and how they're helping to speed up adoption. This includes making analytics/data science easier, hardening Spark for enterprise, and innovating core Hadoop.

Releases

Versions 2.1.10 and 3.1.0 of Apache Curator, the high-level framework for interacting with Apache ZooKeeper, were released this week. The releases each contain several bug fixes, improvements, and new features.

At Spark Summit in NYC this week, Databricks announced a new community edition of their Spark-as-a-Service platform and Databricks Dashboards. The community edition (currently in beta) provides free access to Databricks, and Dashboards provide a mechanism for building interactive web pages, logically separating charts and graphs from a single notebook, and more.

Apache Bigtop 1.1.0 was released this week. Bigtop is a packaging, smoke/integration testing, and virtualization system for the Hadoop ecosystem. This release is built on Hadoop 2.7.1, supports five operating systems, adds support for Apache Hama and Zeppelin (incubating), adds support for producing Docker images, and more.

The Apache Hive team disclosed CVE-2015-7521, which is an authorization bug that can be used to circumvent some authorization checks. The team has published a new jar and configuration settings that serve as a runtime workaround. Affected versions are Hive 0.13.x through various versions of 1.2.x (see the announcement for the full list).

Cloudera and Continuum have announced an integration for deploying Anaconda Python via Cloudera Manager. This introductory post describes how to configure CM to find and install the Anaconda Parcels and demonstrates how to take advantage of the install using pyspark.

Version 0.5.1 (with bug fixes to the recently announced version 0.5.0) of Apache NiFi was released this week. NiFi is a system for processing and distributing data. The new release improves S3, Hive, and encryption support, adds new extensions for Riemann, Elasticsearch, and Avro, and has improved data inspection and state management.

Version 1.7.0 of Apache Accumulo, the distributed key-value store, was released this week. The new version is backwards compatible with previous versions, and focuses on bug fixes and improvements in the areas of security (new Kerberos authentication), availability (improved data center replication), and extensibility (e.g., support for HTrace for distributed tracing). There are many more details about the release on the Accumulo web site.