Elasticsearch 5.0.0-beta1 released

Today we are excited to announce the release of Elasticsearch 5.0.0-beta1 based on Lucene 6.2.0. This is the sixth in a series of pre-5.0.0 releases designed to let you test out your application with the features and changes coming in 5.0.0, and to give us feedback about any problems that you encounter.

More than 300 enhancements and bug fixes have gone in since 5.0.0-alpha5 (all of which you can read about in the release notes), but three changes in this release deserve special mention: huge improvements to indexing performance, switching geo_point fields to Lucene’s LatLonPoint, and making Painless the new default scripting language.

Migration Helper

The Elasticsearch Migration Helper is a site plugin designed to help you to prepare for your migration from Elasticsearch 2.3.x/2.4.x to Elasticsearch 5.0. It comes with three tools:

Cluster Checkup

Runs a series of checks on your cluster, nodes, and indices and alerts you to any known problems that need to be rectified before upgrading.

Reindex Helper

Indices created before v2.0.0 need to be reindexed before they can be used in Elasticsearch 5.x. The reindex helper upgrades old indices at the click of a button.

Deprecation Logging

Elasticsearch comes with a deprecation logger which will log a message whenever deprecated functionality is used. This tool enables or disables deprecation logging on your cluster.

Indexing performance improvements

This release includes a number of changes which have increased indexing performance by 80% in our append-only two-node benchmarks. The first change benefits the append-only use case where document IDs are auto-generated by Elasticsearch. Because we know that a document with the same ID does not already exist, Elasticsearch can skip the version check and add the document directly. We first tried to enable this optimization two years ago, but back then it resulted in duplicate documents being added during shard relocation. The shard relocation and handover process has since evolved enough that we can ensure that duplicate documents are not added.
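The idea behind the auto-generated ID optimization can be illustrated with a toy model. This is only a sketch: `ToyShard`, `version_lookups`, and the use of `uuid` are invented for illustration and say nothing about how the Elasticsearch engine is actually implemented.

```python
import uuid


class ToyShard:
    """Toy in-memory stand-in for a shard (hypothetical, not the real engine)."""

    def __init__(self):
        self.docs = {}            # doc ID -> (version, source)
        self.version_lookups = 0  # how often an existing ID had to be looked up

    def index(self, source, doc_id=None):
        if doc_id is None:
            # Auto-generated ID: assumed unique, so there is no existing
            # copy to look up and the document is appended directly.
            doc_id = uuid.uuid4().hex
            self.docs[doc_id] = (1, source)
            return doc_id, 1
        # Caller-supplied ID: the current version must be looked up first.
        self.version_lookups += 1
        current_version = self.docs.get(doc_id, (0, None))[0]
        new_version = current_version + 1
        self.docs[doc_id] = (new_version, source)
        return doc_id, new_version


shard = ToyShard()
shard.index({"msg": "first log line"})          # append-only path
shard.index({"msg": "second log line"})         # append-only path
shard.index({"msg": "v1"}, doc_id="config")     # needs a version lookup
shard.index({"msg": "v2"}, doc_id="config")     # needs a version lookup
print(shard.version_lookups, len(shard.docs))
```

In this model only the two writes with a caller-supplied ID pay for a lookup; the append-only writes skip it entirely.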

In 2.0, we added the guarantee that the transaction log would be fsync’ed to disk before a write is acknowledged to the user. The fsync was a synchronous call which effectively blocked indexing progress until the call returned. This release changes the fsync call to be asynchronous so that indexing and document replication can continue during fsync, yet it maintains the same guarantees as before. This is a big win for users with spinning disks for whom fsync is a slow operation.

Search in Elasticsearch is near real-time, meaning that a new segment must be written before the documents it contains become visible to search. Real-time GET (retrieving a document by ID) was implemented by maintaining an in-memory list of the documents that have been written to the transaction log but have not yet been written to a segment, along with their offsets in the translog. This added a lot of overhead and complexity for a relatively infrequent use case, since most of the documents you GET are already in a segment. Instead, we now maintain just a list of document IDs without translog offsets. If a recently written document is requested, Elasticsearch performs a refresh and returns the document from the new Lucene segment. Removing the offsets from memory frees up more space in the indexing buffer and greatly reduces the amount of young garbage that has to be collected. This does mean that frequent updates to the same document (e.g. a counter, which is not a recommended use of Elasticsearch) will be slower, because each update must first fetch the current document and may therefore trigger a refresh.
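The refresh-on-GET behavior described above can be sketched with a small toy model. Everything here (`ToyEngine`, `pending_ids`, `refresh_count`) is a made-up name used for illustration, and plain dictionaries stand in for the indexing buffer and Lucene segments.

```python
class ToyEngine:
    """Toy model of real-time GET backed by a refresh (hypothetical names)."""

    def __init__(self):
        self.buffer = {}          # indexed, but not yet in a searchable segment
        self.segments = {}        # stand-in for documents in Lucene segments
        self.pending_ids = set()  # IDs written since the last refresh (no offsets)
        self.refresh_count = 0

    def index(self, doc_id, source):
        self.buffer[doc_id] = source
        self.pending_ids.add(doc_id)

    def refresh(self):
        # Move buffered documents into the "segment" and forget pending IDs.
        self.segments.update(self.buffer)
        self.buffer.clear()
        self.pending_ids.clear()
        self.refresh_count += 1

    def get(self, doc_id):
        # Only a recently written document forces a refresh.
        if doc_id in self.pending_ids:
            self.refresh()
        return self.segments.get(doc_id)


engine = ToyEngine()
engine.index("1", {"counter": 1})
first = engine.get("1")    # document is pending, so this refreshes
second = engine.get("1")   # already in a segment, no refresh needed
print(first, second, engine.refresh_count)
```

The model also shows why a tight update loop on one document is costly: every read that follows a write to the same ID pays for a refresh.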

Geo-points use LatLonPoint

Elasticsearch 2.3 already brought significant improvements to geo-point search. In this release, the implementation of geo-point fields has been switched from GeoPoint to LatLonPoint with doc values. This change uses a bit more disk space but doubles the speed of geo-distance queries, as can be seen in Lucene’s geo benchmarks.
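For reference, a geo-distance query of the kind that benefits from this change might be built as follows. This is a sketch of the request body only: the field name `location`, the coordinates, and the helper `geo_distance_query` are assumptions made for illustration, and the body would be sent to the search endpoint of an index whose mapping declares that field as geo_point.

```python
import json


def geo_distance_query(field, lat, lon, distance):
    """Build a search body that filters documents within `distance` of a point."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        # `field` must be mapped as a geo_point field.
                        field: {"lat": lat, "lon": lon},
                    }
                }
            }
        }
    }


# Hypothetical field name and coordinates.
geo_body = geo_distance_query("location", 40.0, -70.0, "200km")
print(json.dumps(geo_body, indent=2))
```

The query syntax itself does not change with the switch to LatLonPoint; only the underlying field implementation does.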

Painless is the new default scripting language

Scripting in Elasticsearch has been harder to use than it should be because of the security risks involved with enabling languages like Groovy by default. We have spent the last year and a half writing a new, safe, and fast language called Painless. We are so pleased with the progress that we have decided to make Painless the new default scripting language, and to deprecate Groovy, JavaScript, and Python. Any new script that doesn’t specify a language will be considered to be in Painless. For backwards compatibility purposes, any existing inline scripts in an index (e.g. in the percolator or in Watcher) will use the language specified by script.legacy.default_lang, which defaults to Groovy.
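As a rough illustration, the body of an update request that uses a Painless script could be assembled like this. The helper name `painless_update_body`, the `counter` field, and the parameter values are hypothetical. The `inline` key reflects the script syntax as we understand it for the 5.0 series and should be checked against the documentation for your version. Naming the language explicitly, as done here, avoids any dependence on the default.

```python
import json


def painless_update_body(script_source, params):
    """Build the body for an update-by-script request using Painless."""
    return {
        "script": {
            "lang": "painless",       # explicit, although Painless is now the default
            "inline": script_source,  # assumed key name for inline scripts in 5.0
            "params": params,
        }
    }


# Hypothetical example: increment a numeric field by a parameter.
update_body = painless_update_body(
    "ctx._source.counter += params.count",
    {"count": 4},
)
print(json.dumps(update_body, indent=2))
```

Passing values through `params` rather than concatenating them into the script text keeps the script source constant across requests.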

Other notable changes

Elasticsearch now uses Log4j2 for logging, which exposes new log management options.

Deprecation logging is now enabled by default, as the logs are limited by size.

The update-aliases action now supports deleting an index and adding an alias as a single step, allowing an existing index to be replaced with a newer index+alias atomically.
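The atomic index replacement mentioned in the last item could look roughly like this. It is a sketch of the request body for the aliases endpoint: the index names `logs` and `logs_v2` and the helper `swap_index_for_alias` are invented for illustration.

```python
import json


def swap_index_for_alias(old_index, new_index):
    """Build an update-aliases body that replaces an index with an alias.

    Both actions are submitted in one request: the old index is deleted
    and its name becomes an alias pointing at the new index.
    """
    return {
        "actions": [
            {"add": {"index": new_index, "alias": old_index}},
            {"remove_index": {"index": old_index}},
        ]
    }


# Hypothetical index names; the body would be POSTed to /_aliases.
alias_body = swap_index_for_alias("logs", "logs_v2")
print(json.dumps(alias_body, indent=2))
```

After such a request, clients that keep using the name `logs` would reach `logs_v2` through the alias without any change on their side.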