Friday, March 21, 2014

Elastic search comes with pretty good defaults. But they may not be suitable for your prod environment. I am giving here a list of most important configuration options in general with reasons. In all means this might not be the complete list for your requirement, so please explore all possible options you want to look for. One needs to give a serious thought to various configuration options based on requirements of the application before moving to prod environment.Configuration

Set cluster name to your application specific name, should be different from cluster name in your QA/Dev environments to prevent unwanted node to join the cluster in prod.

cluster.name: mycluster

Set node name to some meaningful name specific to application

node.name: mynode

Set paths to locations outside of ES installation directory to avoid any overwriting while upgrading to next versions

If you do not want to create index automatically set following property to false. This may be desired if you want to add custom settings to index i.e. custom analyzers, number of shards and replicas

action.auto_create_index: false

Set the following property to false if you want to disable schemaless feature which may desired in prod. ES will not create mapping for unmapped types automatically if set to false.

index.mapper.dynamic: false

Specify default field for query, search is performed against this field when no field is specified. By default it's _all

index.query.default_field: _all

Set type of storage, set it to non-blocking I/O type for file system based storage as given below

index.store: niofs

Set the number of shards of an index, 5 is by default. This value can be overridden by specifying it in index settings json while creating index via REST API.

index.number_of_shards: n

Set the number of replicas of an index, 1 is by default. This value can be overridden by specifying it in index settings json while creating index via REST API.

index.number_of_replicas: 1

set minimum number of nodes that should come up after full restart of cluster before they start replication of data nodes will start swapping data as soon as they come up if this property is not set. This may result in very high I/O traffic in large clusters.

gateway.recover_after_nodes: n

Prefer setting node discovery to unicast and disable multicast. Add names of few hosts to seed list of unicast.

Set minimum number of nodes to quorum to elect a master in split brain situation. So master is not elected in a split brain part if total number of master-eligible nodes in that part are not more than half of total number of nodes in normal cluster. Here n is total number of master-eligible nodes in cluster. This is a dynamic setting, so it's value can be changed on live cluster as number of nodes changes by using cluster API i.e. curl -XPUT localhost:9200/_cluster/settings -d '{"persistent" : {"discovery.zen.minimum_master_nodes": 3}}'

Number of max. open file descriptors, in the machine running ES, should be set to a high value i.e. 65535. Verify this using curl -XGET localhost:9200/_nodes/process?pretty

JVM HeapJVM heap size, ideally half of the available memory can be set to ES JVM. Set ES_MIN_MEM and ES_MAX_MEM env variables to same value in ES script and enable mlockall asexplained above. Verify this using curl -XGET localhost:9200/_nodes/jvmHardware

Prefer to use SSD in place of spinning disk. SSDs can give hundred of times faster IO ops.

More CPU cores the better. Since ES is asynchronous it can use multiple cores efficiently.