OS’s cache does matter

These days I’ve been working on the performance benchmarks that I mentioned in my previous post. They consist of running a large number of queries against a Solr server and measuring how each feature (sharding, highlighting, faceting, etc.) affects the average query time.

To generate this large number of queries, I’ve made a tool that takes the terms from the TermsComponent, combines them at random using the operators AND, OR and NOT, and then writes a file with an arbitrary number of queries. To run the tests I use Solrmeter (http://code.google.com/p/solrmeter). This program lets me run all the queries in the file easily, and it also displays some beautiful plots, which help me check that everything is working OK. The Solr server is running on Ubuntu 10.04 Server Edition, on a machine with 8 GB of RAM and a quad-core processor. The client is running on a laptop with Windows XP.
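A minimal sketch of what such a query generator could look like, in Python. It is not the actual tool: it assumes the terms have already been fetched from the TermsComponent into a plain list, and the function names, the `max_terms` limit and the one-query-per-line file format are all illustrative choices.

```python
import random

OPERATORS = ("AND", "OR", "NOT")


def random_query(terms, max_terms=3, rng=random):
    """Join 1..max_terms distinct terms with randomly chosen operators."""
    count = rng.randint(1, min(max_terms, len(terms)))
    chosen = rng.sample(terms, count)
    query = chosen[0]
    for term in chosen[1:]:
        query += " {} {}".format(rng.choice(OPERATORS), term)
    return query


def write_query_file(terms, path, num_queries, seed=0):
    """Write num_queries random queries to path, one per line.

    A fixed seed makes the generated file reproducible (an assumption
    of this sketch, not something the original tool is known to do).
    """
    rng = random.Random(seed)
    with open(path, "w", encoding="utf-8") as out:
        for _ in range(num_queries):
            out.write(random_query(terms, rng=rng) + "\n")
```

For example, `write_query_file(["solr", "lucene", "cache"], "queries.txt", 2000)` would produce a file with 2000 lines such as `cache NOT solr`.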

By default, Solrmeter runs the queries from the file in random order. That was not what I wanted, because I needed to repeat exactly the same test several times. That’s why I wrote a little class that loads all the queries from a file and runs them in sequential order.

After running some tests, I found that the results weren’t always the same. In fact, I found something totally different from what I was expecting. As shown in the next figure, when running the first 1000 queries, the average query time was 104 ms. The first queries were slower, but I assumed that it was because the Solr caches were filling.

After restarting Solr and running the first 2000 queries, I got the result shown in the second figure: the first 1000 queries ran really fast (around 20 ms), whereas the time for the next 1000 queries increased abruptly to about 60 ms.
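A jump like this is easy to spot if the per-query times are averaged in consecutive windows instead of over the whole run. The following is a small sketch of that arithmetic; `window_averages` is a hypothetical helper, and it assumes the query times (in ms) are available as a list in the order the queries were executed.

```python
def window_averages(times_ms, window=1000):
    """Average the query times over consecutive windows of `window` queries.

    The last window may be shorter if the number of queries is not a
    multiple of `window`.
    """
    averages = []
    for start in range(0, len(times_ms), window):
        chunk = times_ms[start:start + window]
        averages.append(sum(chunk) / len(chunk))
    return averages
```

With 2000 recorded times and the default window, this returns two numbers: the average for the first 1000 queries and the average for the next 1000.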

So my question was: why is this happening? I can’t understand it! How can Solr remember something about those first 1000 queries if I’m completely restarting the server? Well, the answer was easier than I thought. In the first run, the operating system had to read all the needed blocks from disk, and it then kept them in its cache. In the second run, there was virtually no reading time for the blocks needed to answer the first 1000 queries, because they were already cached (the OS’s cache survives a Solr restart). The next 1000 queries were new, however, so the operating system had to load a lot of new blocks from disk. This explains the abrupt increase.

Why the new queries need some previously unused blocks is something that I don’t know for sure, but I think that maybe it has to do with the documents that are returned only by the new queries. As a side note, the OS’s cache can be really big. In this example, it reached almost 2 GB.

After discovering this, I started to run the benchmarks, taking care to clear the OS’s cache and restart Solr before every run. But this time another unexpected thing happened! The first feature that I tried to test was the FastVectorHighlighter. I wanted to measure how much faster it was than the default highlighter (from now on, simply called Highlighter). But the FastVectorHighlighter was running slower! After a series of tests, I figured out what was going on: the time the OS lost loading the term vectors from disk was greater than the time saved by using the FastVectorHighlighter.
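For reference, one common way to clear the OS’s cache on Linux is to write to `/proc/sys/vm/drop_caches` as root. The sketch below assumes that mechanism (the exact procedure used for these benchmarks is not stated here); the function name and the `path` parameter are illustrative, and the parameter exists only so the function can be tried against an ordinary file.

```python
import os

DROP_CACHES = "/proc/sys/vm/drop_caches"


def drop_os_cache(path=DROP_CACHES):
    """Ask the Linux kernel to drop its clean caches (needs root).

    Writing "3" drops the page cache plus dentries and inodes.
    Dirty pages are not dropped, so they are flushed first with sync().
    """
    os.sync()
    with open(path, "w") as f:
        f.write("3\n")
```

Calling `drop_os_cache()` before each run, followed by a Solr restart, puts every run in the same cold-cache state.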

So, to confirm my suspicions, I repeated all the tests, but this time I executed each one of them twice: the first time to fill the OS’s cache, and the second time to measure the real query time. Now everything worked as expected! The FastVectorHighlighter is 30% faster than the Highlighter!
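The "run twice, measure the second time" procedure can be sketched as follows. This is only an illustration, not the Solrmeter-based setup used for the benchmarks: `execute` is a hypothetical callable that sends one query to the server, and both function names are made up for the example.

```python
import time


def timed_pass(queries, execute):
    """Run every query once, in order, and return the average time in ms."""
    start = time.perf_counter()
    for query in queries:
        execute(query)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(queries)


def warm_then_measure(queries, execute):
    """Run the same queries twice and return (cold_ms, warm_ms).

    The first pass fills the OS's cache (and any other caches);
    the second pass gives the time to use when comparing features.
    """
    cold_ms = timed_pass(queries, execute)
    warm_ms = timed_pass(queries, execute)
    return cold_ms, warm_ms
```

Comparing two features then means comparing their `warm_ms` values, with the cold pass kept only as a record of how much the cache was helping.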

Therefore, we may conclude that:

If you are doing benchmarks, look at the initial conditions of the OS’s cache, and of other caches that you may have.

The OS’s cache is really useful: it significantly decreases the time required to answer a query (even after completely restarting the server!), so always remember to keep some memory free for the OS.

I hope to show the results of the benchmarks in the next post, so stay tuned!