(You can also find this article on medium) This article describes how we made the full-text organic search faster at the Internet Archive — without scaling horizontally — allowing our users to search in just a few seconds across our collection of 35 million documents containing books, magazine, newspapers, scientific papers, patents and much more.

By organic search we mean the “method for entering one or a plurality of search items in a single data string into a search engine. Organic search results are listings on search engine results pages that appear because of their relevance to the search terms.”(1).

The relevance should be scored on the search query matches for every document. In other words, if the search query has a perfect match in some of our documents, we want to return these documents; otherwise, we want to return the documents containing a part of the query or one of its subset.

Let’s consider some of our initial conditions:

Our corpus is very heterogeneous. It contains a lot of documents with content, context and language.

The size of our documents varies considerably, from a few kilobytes to several megabytes of text. We have documents consisting of only a few pages and books with thousands of pages.

This index should also be agnostic about the kind, content, context and language of a document. For simplicity and for an easier connection with our documents ingestion pipeline, we need to keep everything in a single index.

At the Internet Archive we have thousands of new documents to index every day so the search must continue to work properly in a continuous indexing mode.

The search has to work as fast as possible, serving each search request without slowing down the execution of the other ones.

:: Cluster ElasticSearch

We built the cluster on KVMvirtual machines on Intel hardware using Ubuntu 14.4 and ElasticSearch 2.3.4:

Number and type of node

CPU

RAM

Storage

10 datanodes

14 CPU

45GB

6.4TB SSD consumer

3 masters

2 CPU

2GB

–

4 clients

22 CPU

34GB ram

–

The Java Virtual Machine (JVM) runs with 30GB: -Xms30g -Xmx30g

Our document structure is a bit more complicated, but to keep it simple let’s consider every document as consisting of only these fields: Title, Author, Body. All of these fields are typestring and analyzed with a custom analyzer using the International Components for Unicode.

tokenizer

normalizer

folding

icu_tokenzer

icu_normalizer

icu_folding

We created an index of 40 shards with 1 replica, for a total footprint of ~14TB of data, with shards of ~175GB.

:: First results and first slow queries

At the beginning this index was not too bad. With the first 10 million documents loaded, we got good response time and performance. Our goal is to index and make the search available for all 35 million documents. We soon learned that the performance got worse when adding more documents.

We discovered that some queries were taking too many seconds to be executed, keeping our resources busy and slowing down all the cluster metrics, in particular the indexing latency and the search latency. Some queries were taking more than 30 seconds even when the cluster was not particularly busy.

As an example, a search for “the black cat” took longer than the more simple query “black cat” because of the “the” in the query. Another good example was: “to be or not to be” — particularly slow because it is comprised of only high-frequency words. An even slower query was “to be or not to be that is the question,” which took over one minute.

:: The analysis

To better understand the problem, we monitored the cluster’s nodes while the slow query was running. We used profile and hot-threads from the API to profile the queries and to monitor the threads execution on the cluster. We also used iostat to keep monitored the IO activities on the datanodes.

We discovered:

Lots of hot threads were about the highlighting phase. We need this feature because we don’t want to know only what documents contain the best matches, but we also want to show the results in their context with a snippet. This operation can be expensive, especially considering that we want to display snippets for every document returned in the results list.

The search for queries with high-frequency words was particularly expensive.

The CPU use was ok.

The ratio of JVM ram versus the Operative System ram was not well balanced, so the file system cache was unable to manage our big shards and documents.

The Garbage Collector (GC) pauses often and for too long (sometimes more than a minute) — this is probably because some of our documents contain very big values for the body field.

:: Modeling the problem

In this step we addressed the indexsettings and mappings, looking for ways to improve the performance without adding new nodes or hardware upgrades.

The ElasticSearch stack is big and complex so it is easy to get lost in an empiric quest for the best configuration. We therefore decided to follow a strict approach to the problem. To better analyze it we decided to experiment with a simpler deployment: building a single node cluster, with a single shard index, with the same shard size as one of our production shards.

A simple copy of one of our shards from production to the development context was not enough. We wanted to be sure that the distribution of the documents in the shard is normalized; the shard must contain documents with the same type distribution ratio as the production one, so we wrote a little script to extract the dump of the index in a set of small chunks containing random documents distribution (good enough for our purposes). Then we indexed some of these chunks to build a single shard of 180GB.

Running some benchmarks we focused only on 90% of the most popular queries, extracted form our production nginx log, compiling a list of ~10k queries well distributed for response time.

We use siege to run our benchmark tests with the flags: one single process, verbose output, don’t wait between searches, run for 5 minutes and load the search to siege from a file.

siege -c 1 -v -b -t 5 -f model.txt

We ran this test several times with two different queries: one with the highlighting feature and one without.

Without Highlight

1.3 trans/sec

With Highlight

0.8 trans/sec

Analyzing the hot threads we discovered that our bottleneck was in the parsing of the lucene‘s inverted index.

We have a lot of high-frequency words, some of them present in almost all of the 35M documents.

To simplify a bit: with our resources and our deployed cluster, we weren’t able to search and filter the inverted index fast enough because we have millions of matches to be filtered and checked against the query.

The combination of large documents (we index a whole book as a single document) and the need to support full phrase queries (and show the user the correct results) leads to a serious problem: when a phrase query includes a specific token present in a large portion of the documents, all of these candidate documents must be examined for scoring. The scoring process is very IO intensive because relative proximity data for the tokens in a document must be determined.

:: Exploring the solutions

– Operational improvements:

Optimizing reads: The operating system, by default, caches file data that is read from disk. A typical read operation involves physical disk access to read the data from disk into the file system cache, and then to copy the data from the cache to the application buffer (2). Because our documents are big, having a bigger disk cache allows us to load documents more quickly. We added more RAM to the datanodes, for a total of 88GB, to have a better JVM/disk cache ratio.

Number and type of node

RAM

10 datanodes

88GB

Optimizing writes: In a linux system, for performance reasons, written data goes into a cache before being sent to disk (called “dirty cache“). Write caches allow us to write to memory very quickly, but then we will have to pay the cost of writing out all the data to the discs, SSD in our case (3). To reduce the use of the SSD bandwidth we decided to reduce the size of this dirty cache, to make the writings smaller but more frequent. We set the dirty_bytes value from the default value 33.554.432 to 1.048.576 with:sudo sysctl -w vm.dirty_bytes=1048576By doing this, we got better performance from the disks, reducing the SSD saturation during writings drastically.

– The highlighting problem:
This was the easy one. Upon reading additional documentation and running some tests with different mappings we learned that making the highlighting faster requires storing the term vector with the position offset payloads.

– The scoring problem and the Common Grams:The first idea, that we discovered is the most popular in the industry, was: “let’s scale horizontally!“, that is “add new data nodes.” Adding new data nodes would reduce the size of a single shard and the activity for every data node (reducing the SSD bandwidth, distributing the work better, etc.).

This is usually the easy — and expensive — solution, but Internet Archive is a nonprofit and our budget for the project was already gone after our first setup.

We decided that the right direction to go was to find a smarter way to add more information to the inverted index. We want to make lucene’s life easier when it’s busy parsing all its data structures to retrieve the search results.

What if we were to encode proximity information at document index time? And then somehow encode that in the inverted index? This is what the Common Grams tokenizer was made for.

For the few thousand most common tokens in our index we push proximity data into the token space. This is done by generating tokens which represent two adjacent words instead of one when one of the words is a very common word. These two word tokens have a much lower document hit count and result in a dramatic reduction in the number of candidate documents which need to be scored when Lucene is looking for a phrase match.

Normally, for the indexing and the searching phase, a text or a query string is tokenized word by word. For example: "the quick and brown fox" produces the tokens: the - quick - and - brown - fox
These tokens are the keys stored in the inverted index. Using the common grams approach the high frequency words are not inserted in the inverted index alone, but also with their previous and following word. In this way the inverted index is richer, containing the relative positional information for the high-frequency words.
For our example the tokenization become: the - the_quick - quick - quick_and - and_brown - brown - fox

Using luke to extract the most frequent words:
To implement this solution we needed to know the most high-frequency words in our corpus, so we used luke to explore our normalized shard to extract the list of the most frequent words in our corpus. We decided to pick the 5000 most frequent words:

Rank

Frequency

Word

1

817.695

and

2

810.060

of

3

773.635

in

4

753.855

a

5

735.514

to

…

…

…

49999

44.417

391

50000

44.416

remedy

We ran the siege benchmark in our test single node cluster and obtained great results:

Without Highlight

5 trans/sec

With Highlight using the Fast Term Vector

2.4 trans/sec

:: The new index

With this last step we had everything we needed to reindex the documents with a new custom analyzer using the icu tools we saw before, but this time we also added the common grams capabilities to the analyzer and the search_analyzer of our document fields. In particular we added a specialized filter in the common gram tokenization using our common words file. With these new features enabled the disk footprint is bigger so we decided to add more shards to the cluster, building the new index with 60 shards and 1 replica, for a total of 25 TB of data, ~210 GB for shard.
Check out the end of this document for an example of our new mapping configuration.

:: The final results

The performance of the new index looked very good immediately.
For example let’s consider the results of the query “to be or not to be“:

hits found

max score

execution time

Old Index

6.479.232

0.4427

20s 80ms

New Index

501.476

0.7738

2s 605ms

With the new index 95% of the queries are served in less than 10 seconds and the number of slower queries is marginal. There are less hits found because the new inverted index is more accurate. We have less noise and less poorly relevant results, with a higher max score and an execution time 10x faster for this particular query.

## Branch coverage
In addition to the usual statement coverage, coverage.py also supports branch coverage measurement. Where a line in your program could jump to more than one next line, coverage.py tracks which of those destinations are actually visited, and flags lines that haven’t visited all of their possible destinations.

If the edit to a page contains any of the spam words or email of the user is from the blacklisted domains, the edit won’t be accepted. New registrations with emails from those domains are also not accepted.

Measures taken:

I’ve added the common words found in the recent spam to the spam words

blacklisted mail.com as almost all of the spam was coming from that domain. This may stop some genuine people from registering and making edits.

WARNING: You can only restore a backup to exactly the same version of GitLab that you created it on.

:: Create a backup of the GitLab system:

sudo gitlab-rake gitlab:backup:create

sudo gitlab-rake gitlab:backup:create

the backup file will be created at /var/opt/gitlab/backups/
or where specified in the gitlab config file.

You can skip the backup for come components using:

sudo gitlab-rake gitlab:backup:create SKIP=db,uploads

sudo gitlab-rake gitlab:backup:create SKIP=db,uploads

Please be informed that a backup does not store your configuration files. One reason for this is that your database contains encrypted information for two-factor authentication. Storing encrypted information along with its key in the same place defeats the purpose of using encryption in the first place!
At the very minimum you should backup /etc/gitlab/gitlab-secrets.json (Omnibus) or /home/git/gitlab/.secret (source) to preserve your database encryption key.

:: Restore a previously created backup:

We will assume that you have installed GitLab from an omnibus package and run

sudo gitlab-ctl reconfigure

sudo gitlab-ctl reconfigure

at least once.

Note: that you need to run gitlab-ctl reconfigure after changing gitlab-secrets.json.

Memory problem: if running git push you have this error: fatal: Out of memory, calloc failed, you can configure git to use only one thread for “packing”:

git config--global pack.threads 1

git config --global pack.threads 1

another “solution” is to remove the limit:

ulimit-v unlimited

ulimit -v unlimited

Git prompt:

If you’re a Bash user, you can tap into some of your shell’s features to make your experience with Git a lot friendlier. Git actually ships with plugins for several shells, but it’s not turned on by default. Take a look to: Git prompt [2].

If you’re using zsh, and also make use of oh-my-zsh, many themes include git in the prompt. It is recommended that you set git config –local oh-my-zsh.hide-dirty 1 within the petabox repo to prevent a slow prompt.

[petabox tree file last mod times]
~tracey/scripts/post-checkout is nice to add in petabox/.git/hooks/ when one first clones a tree (technically, clone the tree, then add this hook, then checkout a file — it will take a *long* time to run, then remove the hook). this will make all files have modtime of their last commit (which can be v. helpful to determine quick age/modtimes of files at a “ls” glance) for those who like that kind of thing