ELK Operational Tips

I’ve been running ELK clusters for over a year now, and want to share tips and tricks that I’ve found to be useful.

Feel free to post questions and corrections. I’ll try to answer and update when possible.

Elasticsearch

Split brain – this is when more than one node in your cluster becomes master.

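It is best to avoid ever having this happen. The rule of thumb: if you have N master-eligible nodes, require a quorum of N/2 + 1 of them (rounding N/2 down) before a master can be elected. In the 2.x and 5.x releases this is the discovery.zen.minimum_master_nodes setting. Even better, set aside a dedicated pool of master nodes (I recommend a minimum of 3 master-eligible nodes).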
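
To make the N/2 + 1 rule of thumb concrete, here is a minimal Python sketch of the arithmetic. The function name minimum_master_nodes is my own, chosen for illustration, and N is assumed to count only master-eligible nodes, with the division rounded down.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Return the quorum size N // 2 + 1 for N master-eligible nodes.

    Illustrative helper only (not part of any Elasticsearch client).
    """
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


if __name__ == "__main__":
    # Print the quorum for a few common cluster sizes.
    for n in (1, 2, 3, 4, 5):
        print(f"{n} master-eligible node(s) -> quorum of {minimum_master_nodes(n)}")
```

Note that with 2 master-eligible nodes the quorum is also 2, so losing either node leaves the cluster unable to elect a master. That is one reason I recommend at least 3.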

If split brain does happen, you want to stop one of the master nodes ASAP. Depending on whether you have replicas, it could be an easy fix, or you might end up having to re-index if your indices have gotten out of sync because a replica was promoted to primary and new index data was sent to it.

Failed node(s) – one or more failed nodes. There are many scenarios, from failing hardware to outages causing data corruption, etc.

FAQs

How to fix a corrupted Elasticsearch translog?

In 5.0 there is a tool which can be used to truncate corrupt translog files. This doesn't exist in 2.x but there is a workaround:
POST my_index/_close
PUT my_index/_settings
{ "index.engine.force_new_translog": true }
POST my_index/_open
PUT my_index/_settings
{ "index.engine.force_new_translog": false }
NOTE: Any data in the corrupted translog will be lost.

How to size a cluster?

I want to create a new Elasticsearch cluster. What are the recommended sizing guidelines?
Answer:
The answer depends very much on the use case. The factors that should be taken into consideration are:

How much data do you expect to index?

Frequency of new data. How often is new data to be indexed? Daily? Hourly?