Installation

ES_PATH_CONF defines the root where all ES configuration lives, so it's easy to set up portable configuration on new Docker containers, for example.

modules are plugins that are core to running ES.

plugins are useful extensions for ES. TODO: look into these.

Always put configuration in the persistent config files such as jvm.options. While it's possible (and convenient) to define these on the command line, such as -Xms512m, this is not designed for long-term use.

Top configuration tips:

Always change path.data (never use the local OS volume). Multiple paths are supported, e.g. path.data: [/home/elastic/data1,/home/elastic/data2], and all paths will be used.

The elasticsearch binary supports a daemon mode with -d, and a -p for storing the current ES PID in a text file.

The default configuration path can be changed using ES_PATH_CONF.

Set the node.name explicitly.

Set the cluster.name

Use explicit port numbers (when multiple nodes are spun up on a single machine, ports in the range 9200-9299 are used).

Starting and Stopping Elasticsearch

Killing

kill `cat elastic.pid`

TODO (BenS): add more details

Communication

REST API interaction (port range 9200-9299)

Internode communication between nodes within the cluster (port range 9300-9399)

Discovery module (networking)

The default module is known as the zen module. By default it will sniff the network for

discovery.zen.ping.unicast.hosts : ["node1:9300", "node2"]

Network settings, there are 3 top level setting namespaces:

transport.* transport protocol

http.* controlling the HTTP/REST protocol

network.* for defining settings across both the above

Special values for network.host:

_local_ loopback

_site_ bind to a site-local (private network) address (e.g. 10.1.1.14)

_global_ any globally scoped address

_network-interface_ (e.g. _eth0_ for binding to the addressable IP of a network device)

Security

Essential infrastructure:

firewall

reverse proxy

elastic security

Read-only

Consider a read-only cluster for splitting reads from writes. CCR (cross-cluster replication) makes this a handy pattern to roll out.

To lock down the REST API, the reverse proxy could permit only GET requests for certain credentials or IPs.

The same goes for Kibana, providing read-only dashboards and visualisations.

Enabling X-Pack (Elasticsearch Security)

Most easily done via Kibana, under Management > License Management.

For Elasticsearch jump into elasticsearch.yml and set xpack.security.enabled: true. Then generate some credentials for the built-in accounts:

./elasticsearch/bin/elasticsearch-setup-passwords interactive

For Kibana, so it can communicate with the now secured Elasticsearch cluster, jump into kibana.yml and set:
elasticsearch.username: "kibana"
elasticsearch.password: "kibanapassword"

If cleartext passwords are a no-go, an encrypted keystore is provided:

bin/kibana-keystore create

Then load it up with key/value pairs:

CRUD

Ingestion

Because ES is a distributed document store, it works well with complex document structures. Documents must be represented as JSON. Beats and Logstash are aimed at making ingestion a smooth process.

An index can be related to a table in a relational store, and has a schema (a mapping type).

ES will automatically infer the mapping type (schema) for you, the first time you attempt to store a document.

A shard is one piece of an index (by default there are 5 primary shards per index in versions before 7.0).

By default, indexing a document with an existing id automatically overwrites it (the version number is incremented). If you don't want automatic overwrites, use the _create API. Similarly there is an _update API.

When a document is DELETEd, its space can later be reclaimed.

The _bulk API allows many operations to be loaded up together. The body is newline-delimited JSON, with each action and each document source on its own line.
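As a rough illustration of the _bulk body format, here is a minimal Python sketch that renders documents as newline-delimited JSON. The function name, index name and documents are made up for the example; versions that still use mapping types would also expect a type in the action line or the URL.

```python
import json

def build_bulk_body(index, docs):
    """Render (doc_id, document) pairs as an NDJSON _bulk request body."""
    lines = []
    for doc_id, doc in docs:
        # Action line: what to do, and against which index/id.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        # Source line: the document itself, serialised on a single line.
        lines.append(json.dumps(doc))
    # The _bulk body must be terminated by a trailing newline.
    return "\n".join(lines) + "\n"

# Hypothetical documents, purely for illustration.
body = build_bulk_body("blogs", [
    ("1", {"title": "Ingest nodes"}),
    ("2", {"title": "Shards"}),
])
print(body)
```

The resulting string would be sent as the body of POST _bulk with a Content-Type of application/x-ndjson.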

Reading

To fetch a specific document you need to know the (1) cluster, (2) index, (3) mapping type and (4) id of the document.

To obtain multiple documents, the _mget API is provided.

The _search API exposes the ES searching functionality.

Search

Precision is the ratio of true positives to the total number of results returned (true and false positives combined). It's tempting to constrain the net of results to improve precision, but this is a tradeoff with recall, which will drop.

Recall is the ratio of true positives to all documents that should have been returned. Recall improves by widening the net (for example by using partial matches), at the cost of precision.
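The two ratios can be made concrete with a small Python sketch. The document ids and the "relevant" set are invented for the example.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a result set against the known relevant set."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

relevant_docs = ["d1", "d2", "d3", "d4"]  # hypothetical ground truth

# A tight net: everything returned is relevant, but half the relevant docs are missed.
narrow = precision_recall(["d1", "d2"], relevant_docs)

# A wide net: more relevant docs are found, but noise creeps in.
wide = precision_recall(["d1", "d2", "d3", "d5", "d6", "d7"], relevant_docs)

print(narrow, wide)
```

The narrow search scores higher on precision and lower on recall than the wide one, which is the tradeoff described above.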

Scoring is done by a technique dating from the 1950s known as TF/IDF. TF (term frequency): the more often a term appears, the more relevant it is. IDF (inverse document frequency): the more documents that contain the term, the less relevant it is.

Okapi BM25 is the 25th iteration of TF/IDF and is the default used by ES.

Claude Shannon showed in 1948 that information content = log2(1/P), and this has been factored into BM25.
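To get a feel for these ideas, here is a minimal sketch of textbook TF/IDF and Shannon's information content in Python. This is not Lucene's BM25 implementation, only the classic formulas, and the tiny corpus is invented.

```python
import math

def information_content(probability):
    """Shannon information content in bits: log2(1/P)."""
    return math.log2(1 / probability)

def tf_idf(term, doc_tokens, corpus):
    """Textbook TF/IDF: raw term count times log(N / document frequency)."""
    tf = doc_tokens.count(term)
    docs_with_term = sum(1 for tokens in corpus if term in tokens)
    idf = math.log(len(corpus) / docs_with_term) if docs_with_term else 0.0
    return tf * idf

# Hypothetical pre-tokenised corpus.
corpus = [
    ["ingest", "nodes"],
    ["ingest", "pipeline"],
    ["ingest", "shards"],
]

common_score = tf_idf("ingest", corpus[0], corpus)  # appears in every document
rare_score = tf_idf("nodes", corpus[0], corpus)     # appears in one document
print(common_score, rare_score)
```

A term found in every document carries no information and scores zero, while the rarer term scores higher.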

Two methods:

Query string can be encoded in the HTTP URL.

Query DSL a full blown JSON based DSL.

When querying, only pull back the fields you are interested in (or exclude those you are not) with the _source option, for example "_source": { "excludes": ["content"] }

To increase precision (and drop recall), include the operator option set to and (by default the or operator applies), e.g.:

Snippet:

"query": "ingest nodes",
"operator": "and"

minimum_should_match instructs that a minimum number of search terms need to match.

match_phrase specifies an exact phrase match, e.g. a new way must include all terms in that specific sequence.

If the phrase open data was searched, the slop option can relax (or tighten) the match by specifying the number of terms that can exist between each search term.

Date math is supported, e.g. "gte": "now-3M", or "now+1d"

A bool query allows a number of conditions to be articulated using must, must_not, should and filter. filter is similar to the WHERE clause in a SQL statement: it is not optional, documents must match it.

Query and Filter Contexts

should and must influence the score and operate in the query context: they determine the shade of grey of a match by scoring it. It is handy to combine them, for example a must with several shoulds.

The must_not and filter options operate in what is known as the filter context, which is black and white: results MUST meet the criteria. A result can't be more January than another; they are just January.

When a search with only should’s is specified, this will implicitly define a minimum_should_match term of 1.

A should could nest a bool that in turn contains a must_not to down score documents if they contain a certain term.
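The clauses discussed above can be combined in one request body. The sketch below builds such a body as a Python dict; the field names (content, publish_date) and search terms are assumptions made for the example, not taken from a real index.

```python
import json

def build_bool_query(must_text, boost_phrase, demote_term, since):
    """Build a bool query mixing query context (scored) and filter context (yes/no)."""
    return {
        "query": {
            "bool": {
                # Query context: must match, and contributes to the score.
                "must": [{"match": {"content": must_text}}],
                # Query context: optional, but matching raises the score.
                "should": [
                    {"match_phrase": {"content": boost_phrase}},
                    # Nested bool: documents WITHOUT the term get the extra
                    # score, so documents containing it are down-scored.
                    {"bool": {"must_not": [{"match": {"content": demote_term}}]}},
                ],
                # Filter context: black and white, no effect on the score.
                "filter": [{"range": {"publish_date": {"gte": since}}}],
            }
        }
    }

request_body = build_bool_query("ingest nodes", "ingest nodes", "deprecated", "now-3M")
print(json.dumps(request_body, indent=2))
```

The printed JSON could be pasted as the body of a GET index/_search request.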

TODO: Include table on P.164 of Engineer I notes.

TODO: Include should not query, and search tip for phrases

If a user searches TODO: continue from Engineer I notes from P.175

Mapping

Basically a schema, with field level definitions such as data typing.

To view the mapping for an index via the API: GET fooindex/_mapping

A data type of text means that the value can be broken apart into tokens.

The keyword type instructs ES to index the field as a single, unanalysed term.

Prior to version 6.0, multiple document types within the same index were supported. This was a design flaw and has been removed; splitting types out into separate indexes is now required. Cross-index searching is well supported, so this isn't really a big deal, e.g. GET uni_student,uni_lecturer/_search

The keyword data type is used for exact value strings, and text for full text searchable fields.

Specialised types like geo_point and geo_shape are supported.

The percolator type, TODO investigate this.

Be aware that the automatically inferred mappings, while convenient, typically make a number of errors when typing fields.

Inverted Index

Very similar to the index in the back of a book: common terms, and where they are located, in a convenient lookup structure. Lucene creates this inverted index for text fields.

Text is broken apart (tokenised) into individual terms. These are converted to lower case, and special characters are stripped.

Interestingly the search query is also tokenised by the analyzer in the same way.

The inverted index is ordered, which for search efficiency allows algorithms like binary search to be used.
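A toy version of this structure can be sketched in a few lines of Python. It is a simplification (Lucene's real data structures are far more elaborate), and the sample documents are invented.

```python
import bisect
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of doc ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Crude analysis: lowercase, then keep runs of letters/digits only.
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            postings[token].add(doc_id)
    terms = sorted(postings)  # ordered term dictionary
    return terms, {term: sorted(postings[term]) for term in terms}

def lookup(terms, postings, query_term):
    """Binary-search the ordered term list; the query is analysed the same way."""
    token = query_term.lower()
    position = bisect.bisect_left(terms, token)
    if position < len(terms) and terms[position] == token:
        return postings[token]
    return []

docs = {1: "Elasticsearch ingest nodes", 2: "Ingest pipelines!"}
terms, postings = build_inverted_index(docs)
print(lookup(terms, postings, "INGEST"))
```

Because the query term passes through the same lowercasing as the indexed text, a search for INGEST still finds both documents.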

Elasticsearch's default analyzer does not remove stop words. Stop words are also handled much better by BM25 than by traditional TF/IDF.

Stemming reduces words like "node" and "nodes" to a common root so that they return the same match. By default, Elasticsearch does not apply stemming. Some examples: configuring > configur, ingest > ingest, pipeline > pipelin

Multi Fields (keyword fields)

text fields are broken down into pieces, and are not appropriate for literal text comparisons. For example, analysing "I like Elasticsearch!" strips the special characters and the casing, and loses the sequence of terms.

GET analysis_test/_analyze
{
"analyzer": "my_analyzer",
"text": "C++ can help it and your IT systems."
}

Some reasons for doing this:

You want to tokenize a comma-delimited field within the document.

Language specific analyzer (e.g. spanish, english).

Stop words, terms that are just noise and add little value in search.

Stemming (with the snowball filter) to boil down words to their roots.

Token filters are applied in the sequence they are defined.

Mapping terms of interest into a common representation, such as C++, c++, CPP should all map to cpp.

TODO: For fun, try to create a custom filter for handling Aussie names (baz to barry)

Standard tokenizers:

whitespace does not lowercase terms and does not remove punctuation

Token filters are applied with the filter keyword. There are dozens of built-in filters.

Snowball filter for stemming words back to their root (Snowball is an agnostic stemming definition language)

Lowercase

Stop words, in addition to the standard stopwords provided by the underlying Lucene engine.

Mapping filter e.g. X-Pack to XPack

ASCII Folding converts non-ASCII characters (such as accented letters) to their closest ASCII equivalents. Stripping open/closing tags from markup is handled separately, by the html_strip character filter.

Shingle filter

Many more
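To see why the order of token filters matters, here is a toy analysis chain in Python. It only imitates the idea (character mapping, whitespace tokenizer, then filters in sequence); the stop word list, mappings and filter functions are all invented for the example and are not Elasticsearch code.

```python
STOP_WORDS = {"it", "and", "your"}                    # hypothetical stop list
CHAR_MAPPINGS = {"C++": "cpp", "c++": "cpp", "CPP": "cpp"}

def strip_punctuation(tokens):
    return [t.strip(".,!?") for t in tokens]

def lowercase(tokens):
    return [t.lower() for t in tokens]

def stop(tokens):
    # Case-sensitive on purpose, to make the ordering effect visible.
    return [t for t in tokens if t not in STOP_WORDS]

def analyze(text, filters):
    """Apply character mappings, split on whitespace, then run filters in order."""
    for source, target in CHAR_MAPPINGS.items():
        text = text.replace(source, target)
    tokens = text.split()  # whitespace tokenizer: keeps case and punctuation
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

text = "C++ can help it and your IT systems."
lower_then_stop = analyze(text, [strip_punctuation, lowercase, stop])
stop_then_lower = analyze(text, [strip_punctuation, stop, lowercase])
print(lower_then_stop)
print(stop_then_lower)
```

Lowercasing first turns IT into it, which the stop filter then removes; running the stop filter first lets IT survive.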

The reindex API

The _reindex API clones one index to another index.

A handy pattern is to reindex an index into a temporary staging index, then test custom analyzers, mappings, etc. there. If successful, reindex the staging index back to the live index.

Beware of large indexes, as this can take a significant amount of time. TODO check out scrolling and some internals around this.
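The staging pattern boils down to two requests, sketched here as plain Python data rather than live HTTP calls. The index names and the staging settings are placeholders; only the source/dest shape of the _reindex body is assumed from the API.

```python
import json

def staging_reindex_requests(live_index, staging_index, staging_settings):
    """Return the (method, path, body) requests for the staging pattern."""
    return [
        # 1. Create the staging index with the analyzers/mappings under test.
        ("PUT", f"/{staging_index}", staging_settings),
        # 2. Copy every document from the live index into the staging index.
        ("POST", "/_reindex", {
            "source": {"index": live_index},
            "dest": {"index": staging_index},
        }),
    ]

# Hypothetical settings for the experiment.
experimental = {"settings": {"number_of_replicas": 0}}
requests_to_send = staging_reindex_requests("blogs", "blogs_staging", experimental)
for method, path, body in requests_to_send:
    print(method, path, json.dumps(body))
```

Creating the destination index first matters, otherwise the mapping would be inferred automatically during the copy.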

Node Types

A node can take on several roles:

Master (low CPU, low RAM, low I/O): the leader of the cluster, which manages the creation/deletion of indices, adding/deleting nodes, and adding/deleting shards. By default all nodes have node.master enabled and are eligible for master. The number of votes needed to win an election is defined by discovery.zen.minimum_master_nodes. It should be set to N / 2 + 1 (using integer division), where N is the number of master-eligible nodes. This is very important to configure to avoid split brain (multiple, inconsistent master nodes). The recommendation is to have 3 master-eligible nodes, with minimum_master_nodes set to 2.
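The quorum formula is easy to check with a few lines of Python (the function name is just for illustration):

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum size: N / 2 + 1, using integer division."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1

for n in (1, 2, 3, 4, 5):
    print(n, "master-eligible ->", minimum_master_nodes(n))
```

Note that 2 master-eligible nodes need a quorum of 2, so losing either one blocks elections; that is part of why 3 is the usual recommendation.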

The number of data nodes and replicas can be increased dynamically on the existing cluster, which is useful for managing bursts of load (e.g. eBay during the Christmas period).

The hashing algorithm called murmur3 modulo the total number of shards, is used to determine the shard number to assign to a specific document.
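The routing rule can be sketched as follows. Python's standard library has no murmur3, so zlib.crc32 is used here purely as a stand-in hash; the shard numbers it produces will not match what Elasticsearch would compute.

```python
import zlib

def shard_for(routing_value, number_of_primary_shards):
    """Pick a shard: hash(routing value) modulo the number of primary shards."""
    # NOTE: crc32 is a stand-in for illustration; Elasticsearch uses murmur3.
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % number_of_primary_shards

doc_ids = [f"doc-{i}" for i in range(10)]  # hypothetical document ids
with_five = [shard_for(doc_id, 5) for doc_id in doc_ids]
with_six = [shard_for(doc_id, 6) for doc_id in doc_ids]
print(with_five)
print(with_six)
```

Since the result depends on the shard count, changing the number of primary shards would send existing ids to different shards, which is why that number is fixed when the index is created.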

Updates and deletes are difficult to manage in this distributed system, because documents are essentially treated as immutable entities.

An index operation must occur on the primary shard, prior to being done on any replicas.

Anatomy of Search (Shards)

Each shard is required to run the query locally.

Each shard returns its best results, to the coordinating node, which is responsible for globally merging the results.

With TF/IDF, the term frequency makes sense even when calculated locally on the shard.

With the default query_then_fetch behaviour, IDF (inverse document frequency) can be skewed because it is calculated locally on the shard. IDF would be very expensive to calculate globally across the cluster. Interestingly, in practice this is rarely an issue, especially with a large dataset that is evenly distributed across shards, as each shard holds an even sample.

A global IDF can be computed if desired by setting search_type to dfs_query_then_fetch, which is useful for testing on small datasets: GET blogs/_search?search_type=dfs_query_then_fetch
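The skew is easy to reproduce with made-up numbers. This sketch uses the textbook IDF formula (not Lucene's exact one) on two hypothetical shards where the term is distributed unevenly.

```python
import math

def idf(total_docs, docs_with_term):
    """Textbook inverse document frequency."""
    return math.log(total_docs / docs_with_term)

# Hypothetical, deliberately uneven distribution of one term over two shards.
shards = {
    "shard0": {"docs": 10, "with_term": 1},
    "shard1": {"docs": 10, "with_term": 8},
}

local_idf = {name: idf(s["docs"], s["with_term"]) for name, s in shards.items()}
global_idf = idf(
    sum(s["docs"] for s in shards.values()),
    sum(s["with_term"] for s in shards.values()),
)
print(local_idf, global_idf)
```

On shard0 the term looks rare and scores high, on shard1 it looks common and scores low; the global figure sits in between. With many evenly spread documents the local values would converge on the global one.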

For testing you can stop and start nodes to observe the spread of replicas across nodes.

You can also change the replica setting live for an index:

PUT test1/_settings
{
"settings": {
"number_of_replicas": 0
}
}

Troubleshooting

Configuration

Settings to various artifacts are applied at various levels:

Index level, PUT fooindex/_settings { "index.blocks.write": true }

Node level, the elasticsearch.yml

Cluster level, PUT _cluster/settings { "persistent": { "discovery.zen.minimum_master_nodes": 2 } }. Note the persistent setting: it is stored by the cluster and survives a full cluster restart. A transient property is also supported, which does not survive a full cluster restart.

Precedence of settings:

Transient settings

Persistent settings

CLI arguments

Config files
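The precedence order above behaves like a chain of lookups, where the first level that defines a setting wins. A minimal sketch with collections.ChainMap, using invented values:

```python
from collections import ChainMap

def effective_settings(transient, persistent, cli_args, config_file):
    """Resolve settings by precedence: earlier mappings shadow later ones."""
    # ChainMap searches its maps left to right and returns the first hit.
    return ChainMap(transient, persistent, cli_args, config_file)

# Hypothetical values at each level.
settings = effective_settings(
    transient={"discovery.zen.minimum_master_nodes": 3},
    persistent={"discovery.zen.minimum_master_nodes": 2},
    cli_args={"cluster.name": "from_cli"},
    config_file={"cluster.name": "from_yml", "path.data": "/home/elastic/data1"},
)

for key in sorted(settings):
    print(key, "=", settings[key])
```

Here the transient value shadows the persistent one, and the CLI argument shadows the config file.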

Responses

Given the REST API is based on HTTP, two things:

The HTTP response code.

Can’t connect, investigate network and path.

Connection just closed. Retry if possible (i.e. it won't result in data duplication). This is one benefit of always indexing with explicit ids.

4xx, busted request.

429, Elasticsearch is too busy, retry later. Clients should have a backoff policy, such as linear or exponential backoff.

5xx, look into ES logs.

JSON body, always includes some basic shard metadata.

"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},

Breaking this down:

Total, how many shard copies the operation applies to.

Successful, the count of shard copies that were updated.

Failed, a count, which will also come with a descriptive failures structure giving the reason.

Search responses:

Skipped, ES 6.X onwards has a cheeky optimisation that applies when more than 128 shards are involved. A pre-filter phase avoids hassling shards when it knows there is just no point (i.e. documents that relate to the requested operation cannot exist in those shards).
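Pulling the response notes in this section together, a client might decide what to do next roughly as follows. This is a sketch only: the function names, return labels and backoff defaults are invented, and no real HTTP client is involved.

```python
def backoff_delays(base_seconds=0.5, factor=2.0, max_retries=5, cap_seconds=30.0):
    """Exponential backoff schedule: base, base*factor, base*factor^2, ... (capped)."""
    return [min(cap_seconds, base_seconds * factor ** attempt)
            for attempt in range(max_retries)]

def next_step(status_code, body):
    """Classify a response using the HTTP status first, then the _shards block."""
    if status_code == 429:
        return "retry_with_backoff"   # cluster is too busy
    if 400 <= status_code < 500:
        return "fix_request"          # busted request
    if status_code >= 500:
        return "check_es_logs"
    if body.get("_shards", {}).get("failed", 0) > 0:
        return "inspect_shard_failures"
    return "ok"

print(backoff_delays())
print(next_step(200, {"_shards": {"total": 2, "successful": 2, "failed": 0}}))
```

A 2xx status alone is not the whole story: the _shards block can still report failures, so it is checked before declaring success.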

Cluster and Shard Health

Shard health:

Red, at least one primary shard is not allocated in the cluster

Yellow, all primaries are allocated but at least one replica is not

Green, all shards are allocated

Index health, will always report on the worst shard in that index.

Cluster health, will report the worst index in that cluster.
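The "worst colour wins" roll-up from shard to index to cluster can be sketched like this. The data layout (a list of primary/replica allocation flags per index) is an assumption made for the example, not how Elasticsearch represents it.

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def shard_health(primary_allocated, replicas_allocated):
    """Colour for one shard: its primary plus its list of replica copies."""
    if not primary_allocated:
        return "red"
    if not all(replicas_allocated):
        return "yellow"
    return "green"

def worst(colours):
    """Return the most severe colour in a non-empty iterable."""
    return max(colours, key=SEVERITY.__getitem__)

def index_health(shards):
    return worst(shard_health(primary, replicas) for primary, replicas in shards)

def cluster_health(indices):
    return worst(index_health(shards) for shards in indices.values())

# Hypothetical cluster: (primary allocated?, [replica allocated?, ...]) per shard.
indices = {
    "blogs": [(True, [True]), (True, [False])],
    "logs": [(True, [True])],
}
print(index_health(indices["blogs"]), index_health(indices["logs"]), cluster_health(indices))
```

One unallocated replica in blogs is enough to turn both that index and the whole cluster yellow.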

Shard lifecycle:

UNASSIGNED, when shards haven't yet been allocated to nodes

INITIALIZING, when shards are being provisioned and accounted for

STARTED, shard is allocated and ready to store data

RELOCATING, when a shard is in the process of being shuffled to another node

Shard promotion, can occur in the instance of a node failure, where a replica will evolve into a primary.

Detailed shard- and index-specific information can be obtained using the _cluster API:

Hot tip: use a fuzziness setting of auto, so the allowed edit distance adjusts to the length of the term. Consider, for example, applying a fuzziness of 2 to a 2-character search term such as hi. Every one- and two-character term, and many longer ones, would be within two edits, so this would hit terms across the whole index. Pointless.
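As I understand the documented defaults, auto allows no edits for terms of 1-2 characters, one edit for 3-5 characters and two edits for anything longer. A small sketch of that rule (the function name is made up):

```python
def auto_fuzziness(term):
    """Allowed edit distance under the default AUTO thresholds (3 and 6)."""
    length = len(term)
    if length <= 2:
        return 0  # must match exactly
    if length <= 5:
        return 1  # one edit allowed
    return 2      # two edits allowed

for term in ("hi", "shard", "elastic"):
    print(term, "->", auto_fuzziness(term))
```

Under this rule the hi example from the tip gets no fuzziness at all, which avoids the pointless match-everything case.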

Exact Terms

Explicitly use the keyword sub-field of a field, for example category.keyword.

Exact keyword matches should often be applied in the filter context.

Sorting

Simple sorting removes the need to score results, which ES will jump at as it's a huge optimisation:

Highlighting

Enables matched search terms to be wrapped in a tag for later rendering in a UI. By default they are wrapped in <em> tags.

Aggregations

Basically a GROUP BY clause.

Types of aggregations:

Bucket, uses a field within the document to aggregate on. For example, people by gender. Buckets can be nested: people by country, then by gender, for example. Buckets can also be sorted by _key (the value of the bucketing term in context).

Term, what is the biggest contributor (e.g. by country) for a specific search term. Terms aggregations are not precise due to a distributed computing problem: aggregates are calculated per shard by each data node, and are then tallied up by the coordinating node. To reduce inaccurate tallying, you can ask for more aggregation results to be returned to the coordinator by specifying a "shard_size": 500
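The tallying problem can be simulated in a few lines. This is a simplified model of the per-shard top-N merge, not Elasticsearch's actual implementation, and the per-shard counts are invented.

```python
from collections import Counter

def terms_aggregation(shards, size, shard_size):
    """Each shard reports only its top shard_size terms; the coordinator merges them."""
    merged = Counter()
    for shard_counts in shards:
        for term, count in Counter(shard_counts).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)

# Hypothetical per-shard document counts by country.
shards = [
    {"au": 10, "nz": 9, "us": 8},
    {"uk": 10, "us": 8, "nz": 1},
]

true_totals = Counter()
for shard_counts in shards:
    true_totals.update(shard_counts)

print(true_totals.most_common(1))
print(terms_aggregation(shards, size=1, shard_size=1))
print(terms_aggregation(shards, size=1, shard_size=3))
```

With shard_size=1 neither shard reports us, even though it is the true overall winner; raising shard_size lets the coordinator see enough to tally it correctly.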

Tricks:

Set a "size": 0 to completely strip everything, but the aggregate result itself.

Queries and aggregations can be coupled together.

The cardinality aggregation reports the number of distinct values. The count is approximate; its precision_threshold option has a default value of 3000.

Best Practices

Index Aliases

Think symbolic links for indices. Avoid coupling clients to the underlying index. For example, the frontend index alias might be called blogs and the underlying index blogs_v1.

Aliases can also have filters built in, for example only documents that relate to the engineering department.
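Both ideas can be expressed as bodies for the _aliases API, built here as Python dicts. The index names blogs_v2 and blogs_engineering and the department field are assumptions for the example.

```python
import json

def alias_swap_actions(alias, old_index, new_index):
    """Repoint an alias from one index to another in a single request."""
    return {"actions": [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]}

def filtered_alias_action(alias, index, field, value):
    """Create an alias that only exposes documents matching a term filter."""
    return {"actions": [
        {"add": {"index": index, "alias": alias,
                 "filter": {"term": {field: value}}}},
    ]}

swap = alias_swap_actions("blogs", "blogs_v1", "blogs_v2")
engineering = filtered_alias_action("blogs_engineering", "blogs_v1", "department", "engineering")
print(json.dumps(swap))
print(json.dumps(engineering))
```

Either body would be sent with POST _aliases; clients that only know the alias blogs never notice the underlying index changing.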