At the end of August we started seeing the percentage of user queries taking more than 30 seconds begin to climb. At first we weren't too concerned. Seven new nodes had just been added to the Isilon cluster that we use to store our Solr indexes, and the internal data redistribution within the cluster was resulting in a high I/O load. Since slow queries are the ones with the highest disk I/O requirements, we expected performance to suffer during the data redistribution. However, we started getting nervous when the percentage of user queries taking more than 30 seconds exceeded 10% and kept climbing. We also noticed that the initial query we run when Solr is first started every day did not return within 60 seconds for the 3 shards on our server number 1. Our sysadmins investigated and we discovered we were getting Java Out Of Memory (OOM) errors for the shards on that server:

" java.lang.OutOfMemoryError: GC overhead limit exceeded"

That particular error message generally means that the Java Garbage Collector is spending too much time trying to collect garbage without recovering much memory. The usual solution is to increase the amount of memory allocated to the JVM.

We also got an interesting Solr exception, but assumed that it was triggered by the memory problems:

"org.apache.solr.common.SolrException: Impossible Exception"

We noticed that the OOM errors were only occurring on server number 1, and when we checked the size of the indexes on that server, we found they ranged from 405GB to 412GB. None of our other servers had indexes over 400GB. We temporarily rolled back to the last index before the OOM errors occurred in the logs, and increased the memory allocated to the JVM for the tomcats running the 3 Solr instances on each of our production servers from 16GB to 20GB.
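For Tomcat, the JVM heap is typically set through CATALINA_OPTS; a minimal sketch of the kind of change involved (the file location and exact variable usage vary by install, so treat the details as illustrative, not our actual configuration):

```shell
# Illustrative only: raising each Solr JVM's heap from 16 GB to 20 GB.
# Assumes a Tomcat install where $CATALINA_HOME/bin/setenv.sh is sourced
# at startup; -Xms/-Xmx set the initial and maximum heap sizes.
CATALINA_OPTS="$CATALINA_OPTS -Xms20g -Xmx20g"
export CATALINA_OPTS
```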

We then did some investigation to try to determine what was going on: we pointed our test server at a snapshot of those indexes and turned on garbage collection logging.

We discovered that with 20GB, even though we weren't getting OOM errors, we were getting frequent "stop-the-world" full garbage collections that took around 30 seconds but did not recover much memory. We increased the memory allocated to the JVM to 32GB, and that reduced the frequency of the full garbage collections. We then pointed the test server at the most recent index (which had 10 days of index growth since the indexes we rolled back to in production) and found that 32GB was enough to prevent the OOMs and the frequent full garbage collections. We wanted to put the current index into production, so as a temporary measure we increased the memory allocated to the JVM to 32GB on the production servers and resumed serving the current (updated nightly) index.

We then set up some tests on our test server and monitored memory use with jconsole. What we saw is that when the first query hit the server, total memory use, which was around 1GB, rocketed up to around 18GB and never went below 18GB again. As more queries hit the server, memory use increased to about 24GB before garbage collection kicked in. We still got some full GCs with about 20 second pauses. (Of the 5 full GCs listed in jconsole below, the first two took less than a second and occurred when we first started up.)

[Image: jconsole graph of JVM memory use]

In order to determine what was taking up the 18GB of memory, we started the server again, hit it with the first query, and then used jmap to get a list of the objects using the most memory, in descending order of memory use. (The table below is the reformatted output for the top 20 objects. See http://www.hathitrust.org/node/300 for the original.)

jmap -histo <PID>

Rank   Class name                            Instances     GB
1      [C (Java UTF-16 character arrays)     86 million    4.5
2      java.lang.String                      86 million    3.4
3      [I (Java integer arrays)              1 million     1.7
4      org.apache.lucene.index.TermInfo      29 million    1.1
5      org.apache.lucene.index.TermInfo      29 million    1.1
6      org.apache.lucene.index.TermInfo      29 million    1.1
7      org.apache.lucene.index.Term          29 million    0.9
8      org.apache.lucene.index.Term          29 million    0.9
9      org.apache.lucene.index.Term          29 million    0.9
10     [J (Java long arrays)                 253           0.69
11     [Lorg.apache.lucene.index.Term;       2             0.23
12     [Lorg.apache.lucene.index.TermInfo;   2             0.23
13     [Lorg.apache.lucene.index.Term;       2             0.23
14     [Lorg.apache.lucene.index.TermInfo;   2             0.23
15     [Lorg.apache.lucene.index.Term;       2             0.23
16     [Lorg.apache.lucene.index.TermInfo;   2             0.23
17     java.lang.ref.SoftReference           380,304       0.02
18     [B (Java byte arrays)                 33,739        0.01
19     <constMethodKlass>                    47,027        0.01
20     [Ljava.util.HashMap$Entry;            7,833         0.01
Total                                                      17.9

What the above shows is that the data structure that represents the "index to the index" (the Lucene tii file) in memory is taking up most of the memory. (See the Lucene File Formats documentation for details on the tii and tis files.)

If you look at rows 4-6 and 7-9 you will see 3 sets of TermInfo and Term objects, one set for each index. There are about 29 million of these for each index. By default the tii file holds every 128th item in the tis file, so the total number of terms in each index is (29 million x 128), or about 3.5 billion. This is consistent with what we know about our indexes. [1]
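That arithmetic can be checked in a few lines; the figures below are the rounded numbers from the histogram above, so the result is a rough estimate rather than an exact term count:

```java
// Back-of-envelope estimate of total terms per index from the jmap counts.
// All inputs are the rounded figures quoted in the text above.
public class TermCount {
    public static void main(String[] args) {
        long tiiEntriesPerIndex = 29_000_000L; // ~29M Term/TermInfo objects per index
        int indexInterval = 128;               // tii keeps every 128th entry of tis
        long termsPerIndex = tiiEntriesPerIndex * indexInterval;
        // On the order of the ~3.5 billion terms we expected per index
        System.out.println(termsPerIndex);
    }
}
```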

Each of the 3 shards has 2 segments. Rows 11-16 show pairs of arrays of org.apache.lucene.index.Term objects alternating with pairs of arrays of TermInfo objects. There are 3 of these sets, one for each shard/index.

At the top of the list, item #1, at 4.5GB, consists of character arrays, and item #2, at 3.4GB, consists of Strings. We asked on the mailing list to confirm that these other data structures were also related to the in-memory representation of the tii index, and Mike McCandless replied:

"Every indexed term that's loaded into RAM creates 4 objects (TermInfo, Term, String, char[]), as you see in your profiler output. And each object has a number of fields, the header required by the JRE, GC cost, etc." [2]

Even though the tii files took only about 2.2 GB on disk (about 750MB per index), once they are read into memory they occupy about 18 GB.

In Solr 1.4 and above there is a feature that lets you configure an "index divisor" for Solr. If you set this to 2, then Solr will only load every other entry from the tii file into memory, thus halving memory use for the tii file representation in memory. The downside is that, once you have a file pointer into the tis file and seek to it, in the worst case you have to scan twice as many entries.[3]
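As we understand it, in Solr 1.4 the divisor is set through the index reader factory in solrconfig.xml; a sketch along these lines (verify the class and parameter names against your release's example solrconfig.xml before relying on them):

```xml
<!-- Sketch: setting the term index divisor in solrconfig.xml (Solr 1.4.x).
     Class and parameter names as we understand them; check your release's
     example solrconfig.xml. A divisor of 2 loads every other tii entry. -->
<indexReaderFactory name="IndexReaderFactory" class="solr.StandardIndexReaderFactory">
  <int name="setTermIndexDivisor">2</int>
</indexReaderFactory>
```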

We upgraded Solr on our test server to Solr 1.4.1 (in production we are currently running a pre-1.4 development release) and ran some tests with different settings.

We set the termInfosIndexDivisor to 2 and then to 4 and ran a query against all 3 shards to cause the tii files to get loaded into memory. We then ran jmap to get a histogram dump of the heap. The table below shows the total memory use for the top 20 classes for each configuration including a base configuration where we don’t set the index divisor.

Configuration                       Total mem use for top 20 classes (GB)
Base (current production config)    17.9
Index divisor = 2                    9.6
Index divisor = 4                    6.1

We ran some preliminary tests and thus far have seen no significant impact on response time for index divisors of 2, 4, 8, and 16, with base memory use dropping as low as a little over 1GB. [4] We plan to do a few more tests to decide on which divisor to use, and then to work on JVM tuning (we should be able to eliminate long stop-the-world collections with the proper settings). Once we get that done, we plan to upgrade our production Solrs to 1.4.1 and reduce the memory allocated to the JVM from 32GB to some level possibly as low as 8GB. That will leave even more memory for the OS disk cache. When we finish the tests and come up with a configuration and JVM settings, we will report them in this blog.

Next up: "Adventures in Garbage Collection" and "Why Are There So Many Words?"

[1] In February, when we had about 550,000 documents per index, we had about 2.5 billion terms. (See http://www.hathitrust.org/blogs/large-scale-search/too-many-words.) However, those indexes were optimized to one segment. The number of segments has a dramatic effect on the total size of the tii data structures in memory. The current indexes hold around 670,000 documents each, but contain two segments. The tii and tis files contain unique terms per segment, so the reason the total number of terms in memory increased by nearly 50% (2.5 billion to 3.5 billion) while the number of documents increased only about 20% is that many common terms are duplicated between the two segments. If we optimized down to 1 segment we would probably have closer to 2.8 billion terms rather than 3.5 billion. In looking at the logs, we realized that what triggered the OOM was not just the size of the index: we had also switched from optimizing to 2 segments to optimizing to 3 segments at the end of August. (We have switched back to 2 segments in production until we resolve the memory issues.)

[3] The reason that it’s a linear scan of the tis file instead of a binary search has to do with decompression of the entries:

“The entry for each term in the terms dict stores a long file offset pointer, into the .frq file, and another long for the .prx file. But, these longs are delta-coded, so as you scan you have to sum up these deltas to get the absolute file pointers. The terms index (once loaded into RAM) has absolute longs, too. So when looking up a term, we first bin search to the nearest indexed term less than what you seek, then seek to that spot in the terms dict, then scan, summing the deltas.”
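The scan that quote describes can be illustrated with a toy sketch. This is not Lucene's actual code, and the offsets and deltas are invented; it only shows why delta coding forces a sequential sum rather than a binary search:

```java
// Toy illustration (not Lucene's code) of why term lookup after the
// binary search is a linear scan: per-term file offsets in the terms
// dictionary are delta-coded, so an absolute offset is only recoverable
// by summing deltas forward from the nearest indexed term.
public class DeltaCodedScan {
    /** Absolute file offset of the n-th term after an indexed term. */
    static long absoluteOffset(long indexedTermOffset, long[] deltas, int n) {
        long offset = indexedTermOffset; // the in-RAM terms index stores absolute longs
        for (int i = 0; i <= n; i++) {
            offset += deltas[i];         // must scan and sum; no random access
        }
        return offset;
    }

    public static void main(String[] args) {
        long[] frqDeltas = {5, 3, 7, 2};                       // invented deltas
        // Reaching the 3rd term past the indexed one costs a 3-entry scan:
        System.out.println(absoluteOffset(100, frqDeltas, 2)); // 100+5+3+7 = 115
    }
}
```

With a divisor of 2, half the indexed terms are dropped, so the worst-case scan (and the number of deltas to sum) doubles.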

[4] We suspect that the reason we aren't seeing a significant performance impact even with a divisor of 16 (which means up to 16 times as many entries might have to be scanned in the worst case) is that the amount of data that has to be scanned fits into a read buffer, so there is still only one disk seek. The difference between a linear scan of 128 entries and one of 2,048 entries (16 x 128) in memory is probably lost in the noise.