Elasticsearch for Apache Hadoop 5.0.0

Elasticsearch for Apache Hadoop, affectionately known as ES-Hadoop, enables Hadoop users and data-hungry businesses to enhance their workflows with a full-blown search and analytics engine, in real time. And now, the moment you’ve been waiting for. Drumroll please. Developers and Data Scientists, I am pleased to present to you Elasticsearch for Apache Hadoop 5.0.0!

After several early access releases, a boatload of feedback posts, and plenty of waiting, it’s finally here! The Elastic Stack has made it to 5.0, and crossing the finish line along with it is ES-Hadoop! This release contains a substantial number of stability improvements, bug fixes, and shiny new features that we hope all of you will enjoy. And so, without further ado…

What’s New in ES-Hadoop 5.0?

Out with the Old … In with the New!

Sometimes you need to step backward to move forward. We’ve bumped up the versions for a handful of integrations. In doing so, we’ve removed support for some older versions. If you are using the older versions, it would be best to update them before moving to ES-Hadoop 5.0 for maximum compatibility.

Hello Hive 1.0, Goodbye Hive 0.13 and 0.14

Hive 1.0 has been out for quite a while and the majority of distributions have already moved to it. As such, support for Hive 0.13 and Hive 0.14 (two releases that were plagued by serious issues) has now been dropped, which also cleans up the code base.

Hello Storm 1.x, Goodbye Storm 0.9

Storm support has been upgraded to 1.0.x. As this version is not backwards compatible with Storm 0.9.x, support for the 0.9.x line had to be dropped.

Hello Spark 2.0, Goodbye Spark 1.0-1.2

Our support for Spark has been updated with the recent release of Spark 2.0. This version of Spark is not backwards compatible with any previous Spark versions, so we have decided to keep support for Spark 1.3-1.6 as a separate compatibility artifact. SparkSQL was originally released in Spark 1.0-1.2 as an alpha component. It became stable in Spark 1.3, but with a significantly changed API. Supporting three very different versions of Spark is a bit much, so support for Spark 1.0-1.2 has been removed.

HDFS Repository

The HDFS Repository has experienced a substantial upgrade and is now part of Elasticsearch proper. Because of this upgrade, we have removed it from the ES-Hadoop project. Note that the HDFS plugin in Elasticsearch 5.0 is not just conveniently packaged but also better integrated. Among these improvements is no longer needing to disable the JVM SecurityManager - an option that isn’t even available anymore.
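For anyone who has not registered an HDFS snapshot repository before, here is a minimal sketch of the request involved, assuming the repository-hdfs plugin is already installed on every node. The repository name, namenode address, and path are placeholders, and the helper function is hypothetical; it only builds the request rather than sending it.

```python
import json

def hdfs_repository_request(name: str, uri: str, path: str) -> tuple:
    """Return (method, endpoint, body) for registering an HDFS snapshot repository.

    Nothing is sent over the network; this only assembles the request.
    """
    body = {"type": "hdfs", "settings": {"uri": uri, "path": path}}
    return "PUT", f"/_snapshot/{name}", json.dumps(body, indent=2)

# Placeholder values: swap in your own repository name, namenode, and path.
method, endpoint, body = hdfs_repository_request(
    "my_hdfs_repository",
    "hdfs://namenode:8020/",
    "elasticsearch/repositories/my_hdfs_repository",
)
print(method, endpoint)
print(body)
```

The printed method, endpoint, and JSON body can be sent to the cluster with whichever HTTP client you prefer.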

(Hadoop/Spark) + Slice API = More Parallel

A substantial change has been added to support the use of Elasticsearch’s new Scroll Slicing functionality. Now you can state the maximum number of documents you wish to see per input task and the framework will attempt to subdivide input splits to increase your computing parallelism. Isn’t sharing beautiful?
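Here is a rough sketch of what that looks like from the job's side. The setting that caps documents per input task is `es.input.max.docs.per.partition`; the node address, index name, and the helper function below are hypothetical. The arithmetic is only an approximation of how a shard might be subdivided, not the connector's actual logic.

```python
import math

# Read-side settings for an ES-Hadoop job. The node address and index name
# are placeholders; the last entry caps the documents handled per input task.
read_conf = {
    "es.nodes": "localhost:9200",
    "es.resource": "logs/event",
    "es.input.max.docs.per.partition": "100000",
}

def estimated_partitions(docs_in_shard: int, max_docs_per_partition: int) -> int:
    """Approximate how many input partitions one shard could be sliced into.

    Illustration only: the connector makes the real decision at runtime.
    """
    if max_docs_per_partition <= 0:
        raise ValueError("max_docs_per_partition must be positive")
    return max(1, math.ceil(docs_in_shard / max_docs_per_partition))

limit = int(read_conf["es.input.max.docs.per.partition"])
print(estimated_partitions(450_000, limit))
```

In this sketch, a shard holding 450,000 documents with a cap of 100,000 comes out to five partitions, where previously it would have been a single input split.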

Ingest Node

We heard about this cool new feature called the Ingest Node that was available in the alpha releases and coming out in Elasticsearch v5.0. We thought “Oh man, we ingest stuff, this node ingests stuff. We need to schedule a brunch with it immediately to trade gossip.” With the release of ES-Hadoop 5.0 you can now specify an ingest pipeline to send your data to, as well as target only ingest nodes to cut down on unnecessary traffic. We’re still waiting to hear back from you about brunch, Ingest Node. Call us!
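A minimal sketch of the write-side settings involved, assuming the pipeline has already been created in the cluster. `es.ingest.pipeline` and `es.nodes.ingest.only` are the connector settings; the pipeline name, index name, and helper function are made up for illustration.

```python
def ingest_write_conf(resource: str, pipeline: str, ingest_only: bool = True) -> dict:
    """Build write-side ES-Hadoop settings that route documents through an ingest pipeline."""
    return {
        "es.resource": resource,
        # Name of an ingest pipeline that must already exist in the cluster.
        "es.ingest.pipeline": pipeline,
        # When "true", the connector sends its requests only to ingest nodes.
        "es.nodes.ingest.only": "true" if ingest_only else "false",
    }

# Hypothetical index and pipeline names.
conf = ingest_write_conf("logs/event", "brunch-pipeline")
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

In a PySpark job, a dictionary like this could be handed to a DataFrame writer using the `org.elasticsearch.spark.sql` format via `.options(**conf)`.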

Native Support for Spark Streaming

Spark is pretty fast, but sometimes you need your data even faster. We loved hearing that some of you were using ES-Hadoop with Spark Streaming, but we also felt the same heartache about the limitations that you were running into. We decided to do something about it. ES-Hadoop now natively supports consuming DStreams from Spark Streaming! We’ve also included fixes for the most commonly reported Spark Streaming issue: running out of connection resources during small processing windows. May your TIME_WAITs be few, and your Spark Streaming jobs live long and prosper.

Fast Acting Bug Repellant

Computers are hard. We thank our lucky stars every day that our friends in the community are so helpful when it comes to reporting issues. When you open up your copy of ES-Hadoop, you’ll find a fresh batch of bug fixes already applied. These range from issues with overwriting data with SparkSQL, to memory leaks in the network code, to sub-fields in your mapping named “properties”, and a bunch more. If we listed them all out here there would be no room for anything else. Cheers to the bug hunters out there! This one’s for you.

Feedback

As always, we love to hear from our users about what we’re doing well and what needs improving. So drop us a line some time on Twitter, GitHub or on the forum. Operators are standing by.

Special Thanks

We on the ES-Hadoop team would like to especially thank all of the early adopters for aiding us through the last few months of alpha and beta releases. 5.0 is the best release that it can be thanks to all of you. Stay classy.