03
Jan

January 2013 Service Interruptions Postmortem

January was something of a nightmare for us due to service interruptions and performance issues. We have since resolved most of the issues and made some critical infrastructure changes, which led to 100% uptime in February.

What was the problem?

We had several issues:

Latency problems

Split brain issues

GC pressure

Cluster timeout problems, with some nodes getting kicked out by the master

Some of these issues can be solved with well-known best practices; others need more than just configuration. Our cluster was built with m2.medium nodes, and the cluster was sufficient in terms of memory and CPU. Even though ElasticSearch plays very nicely when it is scaled horizontally, that does not mean horizontal scaling solves every capacity problem. If a node does not have enough memory, the JVM starts garbage collecting, and if garbage collection runs very frequently, that node becomes a candidate for removal from the cluster because it fails to answer ping requests in time. Single-core instances suffer from this much more. Another point to consider when scaling horizontally is network latency: adding more nodes means more network connections, which also increases latency.
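To make the two effects above more concrete, here is a rough back-of-the-envelope sketch in Python. It is not our actual tooling. It assumes a simplified model in which the master drops a node once the node stays unresponsive for the ping timeout multiplied by the number of retries (loosely modeled on ElasticSearch's zen fault-detection settings), and it assumes every node keeps a link to every other node. The function names and all the numbers are hypothetical and chosen only for illustration.

```python
def fault_detection_window(ping_timeout_s: float, ping_retries: int) -> float:
    """Roughly how long a node can stay silent before it is dropped.

    Simplified assumption: each failed ping waits the full timeout,
    and the node is removed after `ping_retries` consecutive failures.
    """
    return ping_timeout_s * ping_retries


def is_dropped(gc_pause_s: float, ping_timeout_s: float, ping_retries: int) -> bool:
    """True if a stop-the-world GC pause outlasts the detection window."""
    return gc_pause_s >= fault_detection_window(ping_timeout_s, ping_retries)


def mesh_links(node_count: int) -> int:
    """Number of node-to-node pairs when every node talks to every other node."""
    return node_count * (node_count - 1) // 2


if __name__ == "__main__":
    # Illustrative values only, not measurements from our cluster.
    timeout_s, retries = 30.0, 3
    for pause_s in (10.0, 120.0):
        print(f"GC pause {pause_s:>5.0f}s -> dropped: {is_dropped(pause_s, timeout_s, retries)}")

    # Pair count grows quadratically as nodes are added.
    for nodes in (4, 8, 16):
        print(f"{nodes:>2} nodes -> {mesh_links(nodes)} node pairs")
```

The point of the sketch is the shape of the numbers rather than the exact values: a long enough GC pause crosses the detection window no matter how many nodes you have, and doubling the node count roughly quadruples the number of node pairs that have to communicate.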