Elasticsearch for Apache Hadoop 2.2.0 and 2.1.3 released

Today, we announce new versions of the entire Elastic Stack, including a tighter integration of Shield with Kibana and an updated version of ES-Hadoop. Detailed posts for each product release are available in the releases category of the blog. And yes, the blog has categories – you know, for searchability.

I am pleased to announce that ES-Hadoop is joining the release bonanza with the GA release of Elasticsearch for Apache Hadoop (ES-Hadoop) 2.2.0 and the bug-fix release of ES-Hadoop 2.1.3.

Overhauled geo support

As in Elasticsearch, geo support has been overhauled in ES-Hadoop 2.2: not only are the geo_point and geo_shape types properly detected, but their schema is also inferred, even though the two types span over a dozen data formats between them.
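To give a sense of the variety involved, the sketch below lists the four representations Elasticsearch accepts for a geo_point value and tells them apart with a small helper. The helper classify_geo_point and the sample coordinates are hypothetical and for illustration only; this is not how ES-Hadoop performs its own detection.

```python
# Four ways Elasticsearch accepts a geo_point value (sample coordinates).
GEO_POINT_SAMPLES = [
    {"lat": 41.12, "lon": -71.34},  # object with lat/lon keys
    "41.12,-71.34",                 # "lat,lon" string
    "drm3btev3e86",                 # geohash string
    [-71.34, 41.12],                # array, note the [lon, lat] order
]


def classify_geo_point(value):
    """Return a label for the geo_point representation of `value`.

    Hypothetical helper: a deliberately naive classifier, not the
    connector's detection logic.
    """
    if isinstance(value, dict) and {"lat", "lon"} <= set(value):
        return "object"
    if isinstance(value, (list, tuple)) and len(value) == 2:
        return "array"
    if isinstance(value, str):
        # A comma separates latitude and longitude; a geohash has none.
        return "string" if "," in value else "geohash"
    raise ValueError(f"not a recognised geo_point value: {value!r}")


if __name__ == "__main__":
    for sample in GEO_POINT_SAMPLES:
        print(classify_geo_point(sample), sample)
```

geo_shape adds a further set of shapes (point, polygon, envelope and so on) on top of these, which is where the rest of the formats come from.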

Network improvements

ES-Hadoop 2.2 introduces support for WAN/cloud/gated Elasticsearch environments where all access goes through one central point. This extends the set of topologies that ES-Hadoop works with, alongside client-node-only and direct connections. Direct connections have also been optimized: traffic is now routed only to data nodes, and master nodes are filtered out.
The configuration options have also been extended to allow configuring the JVM HTTPS proxy and resolving hostnames to IPs (useful when using Elasticsearch with network publishing enabled).
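As a rough sketch of how these topologies map onto configuration, the helper below builds the relevant connector settings as a plain dictionary. The function name, its parameters and the example hosts are hypothetical; the es.* keys are the connector settings I believe correspond to the features described above, so check them against the configuration reference for your version.

```python
def es_hadoop_network_settings(topology, nodes, https_proxy=None,
                               resolve_hostname=None):
    """Build ES-Hadoop connection settings for a given topology.

    topology: "wan" (single entry point), "client" (client nodes only)
              or "direct" (talk straight to the data nodes).
    Hypothetical helper for illustration; only the es.* keys matter.
    """
    settings = {"es.nodes": nodes}
    if topology == "wan":
        # Only the declared nodes are contacted; no cluster discovery.
        settings["es.nodes.wan.only"] = "true"
    elif topology == "client":
        # Client nodes hold no data, so data-only routing is turned off.
        settings["es.nodes.client.only"] = "true"
        settings["es.nodes.data.only"] = "false"
    elif topology == "direct":
        # Route requests to data nodes only.
        settings["es.nodes.data.only"] = "true"
    else:
        raise ValueError(f"unknown topology: {topology!r}")

    if https_proxy is not None:
        host, port = https_proxy
        settings["es.net.proxy.https.host"] = host
        settings["es.net.proxy.https.port"] = str(port)
    if resolve_hostname is not None:
        settings["es.nodes.resolve.hostname"] = str(resolve_hostname).lower()
    return settings


if __name__ == "__main__":
    # Example hosts are placeholders.
    print(es_hadoop_network_settings(
        "wan", "es.example.com:9243",
        https_proxy=("proxy.example.com", 3128)))
```

The resulting key/value pairs can then be handed to the connector in whatever way the runtime expects, for instance as Hadoop job properties or as Spark options.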

Better runtime diagnostics

To prevent user error and misconfiguration, ES-Hadoop 2.2 introduces classpath checks that make sure only one version of the library is in use at any given time; this catches deployments where different versions of the project end up side by side, which is not supported. Furthermore, incorrect usage of the libraries (such as saving a DataFrame without the Spark SQL support) is now reported as well.

Such features not only provide richer constructs for the user but also improve performance by pushing more and more of Spark SQL down to Elasticsearch.

Extended configuration options

Support for multi-dimensional fields (arrays) has been enhanced: one can now specify the dimensions of a given field up front (whether it is nested or not), which is quite useful in strictly typed environments (like Spark SQL), especially when the data does not conform exactly to its declaration.
Additionally, options to include or exclude certain fields, as well as to limit the number of documents being read, have been added.
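A minimal sketch of what such read-side settings could look like is shown below. The helper and the field names are hypothetical; the es.read.field.* keys and es.scroll.limit are the settings I believe these options refer to, so treat the mapping as an assumption and confirm it against the configuration reference.

```python
def es_hadoop_read_settings(array_fields=None, include=None, exclude=None,
                            limit=None):
    """Build ES-Hadoop read settings as a dict of strings.

    array_fields: mapping of field name -> number of array dimensions.
    include/exclude: lists of field names to read or to skip.
    limit: cap on the number of documents returned per scroll.
    Hypothetical helper for illustration; only the es.* keys matter.
    """
    settings = {}
    if array_fields:
        # "name" declares a one-dimensional array, "name:N" an N-dimensional one.
        settings["es.read.field.as.array.include"] = ",".join(
            name if dims == 1 else f"{name}:{dims}"
            for name, dims in array_fields.items()
        )
    if include:
        settings["es.read.field.include"] = ",".join(include)
    if exclude:
        settings["es.read.field.exclude"] = ",".join(exclude)
    if limit is not None:
        settings["es.scroll.limit"] = str(limit)
    return settings


if __name__ == "__main__":
    # Field names are placeholders.
    print(es_hadoop_read_settings(
        array_fields={"tags": 1, "nested.bar": 3},
        include=["name", "tags", "nested.bar"],
        exclude=["internal_notes"],
        limit=1000,
    ))
```

Note that, as far as I understand it, the scroll limit applies per scroll (typically one per task), so the total number of documents read by a job can be higher than the value given.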

YARN enhancements

The YARN module received a batch of updates: it has been upgraded to Elasticsearch 2.2.x and now offers an option to pass JVM system properties directly to the child containers.

Repository HDFS is moving soon

The HDFS snapshot and restore plugin (repository HDFS) has been ported to Elasticsearch master and is undergoing a significant overhaul in terms of security. Shout out to Robert for his support in making this happen.
It has been quite an effort, considering that Hadoop is not compatible with the Java Security Manager: it simply asks for a plethora of permissions, many of them far too dangerous (such as execute on all permissions during a basic startup).
The current plan is for the plugin to become part of Elasticsearch proper, as an official plugin, in an upcoming release. Until that happens, it remains available as part of the ES-Hadoop project.

More on that in a future blog post.

Improved reliability

While not something tangible to the user, behind the scenes ES-Hadoop 2.2 has grown its test suite by 50% (!), bringing it to over 4900 tests. The plan is for the next major release to cross the 5K threshold.

Last 2.1.X release

Alongside 2.2, ES-Hadoop 2.1.3 is released as the last planned maintenance release in the 2.1.X line. It contains a series of backported bug fixes for those with conservative upgrade paths. However, even if you are on ES 1.x, upgrading to ES-Hadoop 2.2 is highly recommended.