Elasticsearch Synonyms And Aliases

You've probably heard of Post Traumatic Stress Disorder (PTSD). Recently, we had a visitor come to our office and talk about PTSD, and he argued that it should be referred to as "Post Traumatic Stress," without the "Disorder" part, because really this condition is a normal human response to extreme events.

We all thought that was a good point, but we don't think the name will change overnight, so we'd like to support both names. For most searches on our site we use Elasticsearch, and we wanted PTSD and PTS to be treated as synonyms. Elasticsearch provides a synonym filter for just such a case. But rolling out synonyms in production with minimal downtime proved harder than expected. Here's the process we followed:

Before this change, we didn't have a custom configuration; we just went with the Elasticsearch defaults. The new configuration defines a set of analysis filters, and then creates an analyzer called english_plm_synonyms that uses those filters. The filters are applied in order:

english_possessive_stemmer: This handles possessives in search, so, for example, a search for Ernie's rubber ducky will still match Ernie rubber ducky.

lowercase: This converts text to lower case for search, so search is case insensitive.

plm_synonyms: Use our synonyms filter so PTSD and PTS are synonyms, which is the whole point!

english_stop: This removes common English stop words, such as "the" and "and".

english_keywords: This prevents stemming of PTS. We needed this because searching for PTS on our site, especially in forums, would often turn up matches for PT (Physical Therapy), with the trailing "S" stripped by the stemmer. We don't want that: PTS should match PTS and PTSD, but not PT.

english_stemmer: This handles suffixes like -ing and -s.
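To make the filter chain concrete, here is a minimal sketch of what such settings could look like, written as a Ruby hash that serializes to JSON. This is not our actual configuration file: the method name, the standard tokenizer, and the exact synonym and keyword lists are assumptions for illustration. The filter types (stemmer, synonym, stop, keyword_marker) are Elasticsearch's built-in token filter types.

```ruby
require 'json'

# Sketch of index settings for an analyzer like english_plm_synonyms.
# ASSUMPTIONS: the method name, the 'standard' tokenizer, and the default
# synonym/keyword lists are illustrative, not our real configuration.
def plm_analysis_settings(synonyms: ['ptsd, pts'], keywords: ['pts'])
  {
    settings: {
      analysis: {
        filter: {
          english_possessive_stemmer: { type: 'stemmer', language: 'possessive_english' },
          # Entries are lowercase because this filter runs after `lowercase`.
          plm_synonyms: { type: 'synonym', synonyms: synonyms },
          english_stop: { type: 'stop', stopwords: '_english_' },
          # Marks terms as keywords so the stemmer leaves them alone.
          english_keywords: { type: 'keyword_marker', keywords: keywords },
          english_stemmer: { type: 'stemmer', language: 'english' }
        },
        analyzer: {
          english_plm_synonyms: {
            type: 'custom',
            tokenizer: 'standard',
            # Order matters: filters run top to bottom.
            filter: %w[
              english_possessive_stemmer
              lowercase
              plm_synonyms
              english_stop
              english_keywords
              english_stemmer
            ]
          }
        }
      }
    }
  }
end

puts JSON.pretty_generate(plm_analysis_settings)
```

The keyword filter has to come before the stemmer in the chain, otherwise PTS would already be stemmed by the time it could be protected.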

Create a reindexing rake task

We also created a rake task to copy an existing index over to a new index. Here it is, slightly simplified:

desc 'reindex from existing index to new index'
task :es_reindex, [:new_index] => :environment do |_task, args|
  [Condition, Symptom, Lab].each do |klass|
    # a method we have that saves a mapping,
    # which we modified to take an index argument
    klass.es_save_mapping(index: args[:new_index])
    client = klass.es_client

    # Open the "view" of the index with the `scan` search_type
    r = client.search(
      index: klass.es_index,
      search_type: 'scan',
      scroll: '5m',
      size: 100,
      type: klass.name,
      fields: ["_parent", "_source", "_routing"]
    )

    # Call the `scroll` API until empty results are returned
    loop do
      r = client.scroll(scroll_id: r['_scroll_id'], scroll: '5m')
      break if r['hits']['hits'].empty?

      # getting this right was a bit tricky...
      body = r['hits']['hits'].map do |hit|
        # carry over _parent and _routing when the hit has them
        extras = (hit['fields'].reject { |k, _v| k == 'fields' } rescue {})
        { index: { _id: hit['_id'], data: hit['_source'] }.merge(extras) }
      end
      client.bulk(index: args[:new_index], type: klass.name, body: body)
    end
  end
end
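The tricky part is turning each search hit into a bulk "index" action. The sketch below isolates that step as a standalone method so it can be tried without a cluster. The method name bulk_action_for and the sample hits are made up for illustration, and it uses `|| {}` rather than the inline rescue to cover hits that have no fields.

```ruby
# Turn one scroll hit into a bulk "index" action, carrying over
# _parent and _routing when the hit has them.
# ASSUMPTION: `bulk_action_for` is an illustrative name, not part of
# the rake task itself.
def bulk_action_for(hit)
  extras = (hit['fields'] || {}).reject { |k, _v| k == 'fields' }
  { index: { _id: hit['_id'], data: hit['_source'] }.merge(extras) }
end

# Made-up sample hits, shaped like the entries in r['hits']['hits'].
child_hit = {
  '_id' => '7',
  '_source' => { 'name' => 'PTS' },
  'fields' => { '_parent' => '42', '_routing' => '42' }
}
plain_hit = { '_id' => '8', '_source' => { 'name' => 'PTSD' } }

p bulk_action_for(child_hit)
p bulk_action_for(plain_hit)
```

The extra keys are merged into the inner action hash, next to _id and data, so that parent and routing information travels with each document.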

Moving to the new index

Once those pieces were in place, we were ready for the next step. To minimize downtime, the plan was to:

Create an alias for the Elasticsearch index

Bring down the site, change the configuration to point to the alias, and bring the site back up

Build the new index

Change the alias to point to the new index

Change Elasticsearch to use an alias

Create an alias called plm_production_alias that points to plm_production
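As a rough sketch, the request body for this step could be built like this. The method name add_alias_body is made up for illustration. With the elasticsearch-ruby client, a hash like this could be passed to client.indices.update_aliases(body: ...), and with curl the same JSON would be POSTed to the _aliases endpoint.

```ruby
require 'json'

# Build the body for Elasticsearch's _aliases endpoint that adds one alias.
# ASSUMPTION: `add_alias_body` is an illustrative helper name.
def add_alias_body(index:, alias_name:)
  { actions: [{ add: { index: index, alias: alias_name } }] }
end

body = add_alias_body(index: 'plm_production', alias_name: 'plm_production_alias')
puts JSON.pretty_generate(body)
```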

We ran the rake task again to catch any data that came in during the operation

Once we were sure all was well, we deleted the old index

curl -XDELETE 'http://localhost:9200/plm_production/'

Handling the change in other environments

The changes to use the new analyzer involve changes to our code base, and those changes assume the new analyzer exists. So if a developer checks out the new code without creating a new index, they will get errors that the analyzer does not exist. They must then either apply the change to their index (adapting the "Apply the new settings.." steps above), or drop and recreate their index. The same would apply for QA environments, etc.

Future changes

In the future, if we want to make additional changes, we will need to:

Modify elastic_index.json with our new settings

Create a new index

Apply the settings to the new index

Run the reindex rake task for the new index

Change the plm_production_alias alias to point to the new index
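The steps above can be sketched as one method. This is a hedged outline, not code we run: roll_out_new_index, the reindex callable, and the index name plm_production_v2 in the usage comment are all hypothetical. The client is expected to respond to indices.create and indices.update_aliases the way the elasticsearch-ruby client does.

```ruby
# Create the new index, fill it, then repoint the alias.
# ASSUMPTIONS: `roll_out_new_index` and its parameters are illustrative.
# `index_body` stands for the parsed contents of elastic_index.json and
# `reindex` for anything callable that copies data into the new index.
def roll_out_new_index(client:, index_body:, alias_name:, old_index:, new_index:, reindex:)
  # Create the new index with the new settings applied.
  client.indices.create(index: new_index, body: index_body)

  # Copy the existing documents into it.
  reindex.call(new_index)

  # Remove and add in a single request, so the alias switches in one step.
  client.indices.update_aliases(
    body: {
      actions: [
        { remove: { index: old_index, alias: alias_name } },
        { add: { index: new_index, alias: alias_name } }
      ]
    }
  )
end

# Possible usage inside a Rails/rake context (not executed here):
#   roll_out_new_index(
#     client: Condition.es_client,
#     index_body: JSON.parse(File.read('elastic_index.json')),
#     alias_name: 'plm_production_alias',
#     old_index: 'plm_production',
#     new_index: 'plm_production_v2',
#     reindex: ->(name) { Rake::Task['es_reindex'].invoke(name) }
#   )
```

Sending the remove and add actions together means searches through the alias never see a moment where it points at nothing.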

Automation

Much of this could be automated. I imagine that you would run a rake task that would generate a new configuration JSON file, with a timestamp in the file name. Then you would make your changes to that configuration file, and then run another task that would create a new index, using the same timestamp, apply the new settings to the index, run the reindex, and reassign the alias. Perhaps that will be the topic for a future blog post.
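As a starting point, the shared timestamp could come from a small helper like the one below. This is only a sketch of the idea: the method name timestamped_names and the naming patterns for the index and the configuration file are assumptions, not something we have built.

```ruby
# Derive an index name and a config file name from one timestamp,
# so the two are easy to match up later.
# ASSUMPTION: `timestamped_names` and both naming patterns are illustrative.
def timestamped_names(base_index, now = Time.now.utc)
  stamp = now.strftime('%Y%m%d%H%M%S')
  {
    index: "#{base_index}_#{stamp}",
    config_file: "elastic_index_#{stamp}.json"
  }
end

p timestamped_names('plm_production')
```

Passing the time in as an argument keeps the helper easy to test, and using UTC avoids names that depend on the server's time zone.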