Reindex is coming!

_reindex and _update_by_query are coming to
Elasticsearch 2.3.0 and 5.0.0-alpha1! Hurray!

_reindex
reads documents from one index and writes them to another index. It can be used
to copy documents from one index to another, enrich documents with fields, or
recreate the index to change settings that are locked when the index is created.

_update_by_query
reads documents from an index and writes them back to the same index. It can be
used to update fields in many documents at once or to pick up mapping changes
that can be made online.

_reindex copies documents

The _reindex API is really just a convenient way to copy documents from one
index to another. Everything else that it can do is an outgrowth of that. If all
you want to do is copy all the documents from the src index into the dest index,
you POST a body to _reindex that names src as the source index and dest as the
destination index.
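As a minimal sketch of that basic copy, the request could be wrapped in a small shell function. The function name reindex_copy and the ES_HOST variable are assumptions made for this illustration, not part of Elasticsearch; the body uses the same source and dest fields as the longer example later in this post.

```shell
# Hypothetical helper, not part of Elasticsearch.
# $1 is the index to read from, $2 is the index to write to.
# ES_HOST is an assumed variable that defaults to the host used in this post.
reindex_copy() {
  curl -XPOST "${ES_HOST:-localhost:9200}/_reindex?pretty" -d @- <<EOF
{
  "source": { "index": "$1" },
  "dest": { "index": "$2" }
}
EOF
}
```

Calling reindex_copy src dest would then ask Elasticsearch to copy every document from src into dest.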

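A _reindex request can also carry a script that modifies each document as it is
copied. Using an inline script requires that you have dynamic scripts enabled,
but you can do the same thing with non-inline scripts.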
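A hedged sketch of a scripted copy follows. It assumes the "inline" script key used by the Elasticsearch 2.x releases and assumes dynamic scripting is enabled. The helper name reindex_mark_source, the ES_HOST variable and the copied_from field are all made up for this illustration.

```shell
# Hypothetical helper, not part of Elasticsearch.
# Copies index $1 into index $2 and records where each copy came from.
# "copied_from" is an invented field name; "inline" is the script key
# assumed here for Elasticsearch 2.x.
reindex_mark_source() {
  curl -XPOST "${ES_HOST:-localhost:9200}/_reindex?pretty" -d @- <<EOF
{
  "source": { "index": "$1" },
  "dest": { "index": "$2" },
  "script": {
    "inline": "ctx._source.copied_from = '$1'"
  }
}
EOF
}
```

For example, reindex_mark_source test_1 test_2 would add a copied_from field with the value test_1 to every copied document.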

Recreating an index to change settings that are locked at index creation is a
bit more involved, but still simpler than it was before _reindex:

# Say you have an old index that you made like this
curl -XPUT localhost:9200/test_1 -d'{
  "aliases": {
    "test": {}
  }
}'
for i in $(seq 1 1000); do
  curl -XPOST localhost:9200/test/test -d'{"tags": ["bananas"]}'
  echo
done
curl -XPOST 'localhost:9200/test/_refresh?pretty'
# But you don't like having the default number of shards
# You can make a copy of it with the new number of shards
curl -XPUT localhost:9200/test_2 -d'{
  "settings": {
    "number_of_shards": 1
  }
}'
curl -XPOST 'localhost:9200/_reindex?pretty&refresh' -d'{
  "source": {
    "index": "test"
  },
  "dest": {
    "index": "test_2"
  }
}'
# Then just swing the "test" alias to the new index
curl -XPOST 'localhost:9200/_aliases?pretty' -d'{
  "actions": [
    { "remove": { "index": "test_1", "alias": "test" } },
    { "add": { "index": "test_2", "alias": "test" } }
  ]
}'
# Then when you are good and sure you are done with it you can
curl -XDELETE 'localhost:9200/test_1?pretty'

_update_by_query modifies documents

The simplest way to invoke update by query isn't particularly useful on its own:

curl -XPOST localhost:9200/test/_update_by_query?pretty

That will just increment the document version number on each document in the
test index and fail if you modify a document while it is running. A more
interesting example is adding the chocolate tag to all documents with the
bananas tag.
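One way such a request might look is sketched below. It is an illustration under assumptions rather than the post's original command: it assumes the "inline" script key and Groovy list syntax of Elasticsearch 2.x, assumes every matching document has a tags list, and uses an invented helper name and ES_HOST variable. The query skips documents that already carry the new tag, which is what makes a retried run continue from where the last one stopped.

```shell
# Hypothetical helper, not part of Elasticsearch.
# In index $1, add tag $3 to every document that has tag $2 but not $3 yet.
# Excluding already-tagged documents means a rerun only touches the rest.
add_tag_by_query() {
  curl -XPOST "${ES_HOST:-localhost:9200}/$1/_update_by_query?pretty" -d @- <<EOF
{
  "query": {
    "bool": {
      "must": [ { "match": { "tags": "$2" } } ],
      "must_not": [ { "match": { "tags": "$3" } } ]
    }
  },
  "script": {
    "inline": "ctx._source.tags += '$3'"
  }
}
EOF
}
```

Calling add_tag_by_query test bananas chocolate would target the test index built earlier in this post.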

Like the last version, this request will fail if any documents are changed while
it is running, but it is written in such a way that you can just retry it and
it'll pick up from where it left off. If you've already modified whatever
application is making the concurrent updates so that it adds the chocolate tag
whenever it sees bananas, then you can safely ignore version conflicts in the
_update_by_query. You can tell it to do so by setting conflicts=proceed. It will
just count the version conflicts and continue performing updates.
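A small sketch of passing conflicts=proceed as a URL parameter is shown here. The helper name update_by_query_proceed and the ES_HOST variable are assumptions for illustration; the second argument is whatever JSON body (query and script) you would otherwise send.

```shell
# Hypothetical helper, not part of Elasticsearch.
# $1 is the index to update, $2 is the JSON request body to send.
# conflicts=proceed makes the request count version conflicts and keep going.
update_by_query_proceed() {
  curl -XPOST "${ES_HOST:-localhost:9200}/$1/_update_by_query?pretty&conflicts=proceed" -d "$2"
}
```

For example, update_by_query_proceed test '{"query": {"match": {"tags": "bananas"}}}' would touch every bananas document in test without stopping on conflicts.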

You can read the docs for more, but the gist is that the status of a running
_reindex reports that it plans to do total operations and has already done
updated + created + deleted + noops of them. So you can estimate how complete
the request is by dividing the second number by the first.
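The division described above can be sketched as a tiny shell function. The name reindex_progress is hypothetical, and the five arguments are assumed to be the total, updated, created, deleted and noops counters copied by hand from a task's status.

```shell
# Hypothetical helper: estimate completion as a whole-number percentage.
# Arguments: total updated created deleted noops
reindex_progress() {
  total=$1
  finished=$(( $2 + $3 + $4 + $5 ))
  if [ "$total" -eq 0 ]; then
    # Nothing is planned, so avoid dividing by zero.
    echo 0
    return
  fi
  # Integer division, so the result is rounded down.
  echo $(( 100 * finished / total ))
}
```

With a total of 1000 and counters of 200 updated, 50 created, 0 deleted and 0 noops, reindex_progress 1000 200 50 0 0 prints 25.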

Cancelling

_reindex was so long in coming because Elasticsearch lacked a way to cancel
running tasks. For short-running tasks like _search and indexing that is fine.
But, like I wrote above, _reindex and _update_by_query can touch millions of
documents and take a long time. The tasks themselves are ok with that, but you
may not be. Say you realize ten minutes into a three-hour-long _update_by_query
that you made a mistake in the script. There isn't a way to roll back the
changes that the request already made, but you can cancel it so it won't make
any more such changes:

curl -XPOST localhost:9200/_tasks/{taskId}/_cancel

And where do you get the taskId? It is the name of the object returned by the
task listing API in the last section of this blog post. The one in the example
response is BHgHr0cETkOehwqZ2N_-aQ:28295.
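For reference, a sketch of asking the task listing API for running tasks is given below. It assumes the _tasks endpoint with its detailed and actions parameters; the helper name list_tasks and the ES_HOST variable are invented for this illustration.

```shell
# Hypothetical helper, not part of Elasticsearch.
# $1 is an action name pattern; it defaults to "*reindex" when omitted.
# The task ids are the keys of the task objects in the response.
list_tasks() {
  curl -XGET "${ES_HOST:-localhost:9200}/_tasks?pretty&detailed=true&actions=${1:-*reindex}"
}
```

Running list_tasks with no argument lists _reindex tasks, and list_tasks '*byquery' should match _update_by_query tasks.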

In Elasticsearch task cancellation is opt-in. It kind of has to be that way in
any Java application. Anyway, tasks that can be canceled, like _reindex and
_update_by_query, periodically check to see if they have been canceled and then
shut themselves down. This means that you might still see the task if you list
its status immediately after it has been canceled. It will go away on its own
and you can't cancel it any harder without stopping the node it is running on.
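Because a canceled task shuts itself down on its own schedule, a script that needs to know when it is really gone has to poll. The loop below is only a sketch under assumptions: it assumes the task id appears as a quoted key in the _tasks response, and the helper name wait_until_gone plus the ES_HOST and POLL_SECONDS variables are hypothetical.

```shell
# Hypothetical helper, not part of Elasticsearch.
# $1 is the task id, for example BHgHr0cETkOehwqZ2N_-aQ:28295.
# Keeps listing tasks until the id no longer shows up in the response.
# Note: an unreachable node also produces no match and ends the loop.
wait_until_gone() {
  while curl -s "${ES_HOST:-localhost:9200}/_tasks?detailed=true" | grep -qF "\"$1\""; do
    sleep "${POLL_SECONDS:-1}"
  done
}
```

You would call it right after the cancel request, for example wait_until_gone BHgHr0cETkOehwqZ2N_-aQ:28295.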

Remember that Elasticsearch is a search engine

Every update has to mark the document as deleted and index the entire new
document. The deleted documents have to then be merged out of the index.
_reindex and _update_by_query don't save anything in that process. They work
just as though you performed a scroll query and indexed all the results. Running
a zillion _reindex or _update_by_query requests is unlikely to be the most
efficient use of computer resources to accomplish some task. You will almost
always be better off making changes to the application that adds data to
Elasticsearch rather than updating the data after the fact. _reindex and
_update_by_query are most useful for turning the data that you already have in
Elasticsearch into the data that you want to be in Elasticsearch.