Ops Monkey

Wednesday, October 27, 2010

Yesterday I solved a problem that had been vexing me for more than a week. I am helping a client load test an application using Apache Solr (JDK 1.6.0_14; Solr 1.4.1, which uses Lucene 2.9.3).

Under light load, searches returned quickly, but the service was soon overwhelmed, with most queries taking 3+ seconds to return, even against cores with very small indexes (hundreds of documents).

Thread dumps and hprof stack profiling of the application were confusing. Solr seemed to be spending huge amounts of time not in the search or scoring code, as I had expected, but in the query parsing code. Many of the threads looked something like this:

"Thread-234" daemon prio=10 tid=0x00007ff4b7436800 nid=0x4aae waiting on condition [0x000000005a811000]
java.lang.Thread.State: RUNNABLE
at org.apache.lucene.queryParser.QueryParser.jj_2_1(QueryParser.java:1613)
at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1308)
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1265)
at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1254)
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:200)
at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78)
at org.apache.solr.search.QParser.getQuery(QParser.java:131)
at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:103)
(etc)

or this:

"Thread-123" daemon prio=10 tid=0x00007ff4b6d00800 nid=0x4ab6 waiting on condition [0x000000005b011000]
java.lang.Thread.State: RUNNABLE
at org.apache.lucene.queryParser.FastCharStream.readChar(FastCharStream.java:47)
at org.apache.lucene.queryParser.QueryParserTokenManager.jjMoveNfa_3(QueryParserTokenManager.java:529)
at org.apache.lucene.queryParser.QueryParserTokenManager.jjMoveStringLiteralDfa0_3(QueryParserTokenManager.java:87)
at org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1190)
at org.apache.lucene.queryParser.QueryParser.jj_ntk(QueryParser.java:1776)
at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1328)
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1265)
at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1254)
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:200)
at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78)
(etc)

Some of the queries are somewhat complicated, with nested parens and such, but they did not seem complicated enough to justify 2-3 second parse times. Additionally, even uncomplicated queries (no parens, no nesting, just a simple q=xyz) were taking tens or hundreds of milliseconds under load. The code in QueryParser.java is complicated: it is generated by JavaCC and uses many nested method calls. It seemed as if there were some single mutex that all of these method calls were fighting over - especially since the stack traces show the threads as RUNNABLE, yet the state reads "waiting on condition".

While under load, I took a native backtrace with gdb (thread apply all bt 12) and saw several suspicious-looking threads like this:
Thread 37 (Thread 1237547376 (LWP 32044)):
#0 0x00007f1421c594e4 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1 0x00007f1421c59914 in pthread_cond_wait@GLIBC_2.2.5 () from /lib/libpthread.so.0
#2 0x00007f1421388727 in os::PlatformEvent::park () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so
#3 0x00007f14213617c2 in Monitor::IWait () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so
#4 0x00007f1421361e2a in Monitor::wait () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so
#5 0x00007f14214bcb39 in VMThread::execute () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so
#6 0x00007f14213f0e4b in OptoRuntime::handle_exception_C_helper () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so
#7 0x00007f14213f0ef9 in OptoRuntime::handle_exception_C () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so
#8 0x00007f141d411ec6 in ?? ()

It turns out that by default in this environment, apps are launched with jdwp debugger attachment enabled.
(JVM flag: -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=9000)
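As a quick sanity check, a JVM can report its own launch arguments through the standard java.lang.management API, which makes it easy to confirm from inside the process whether the jdwp agent was enabled (a small sketch; the class name is mine):

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Prints any JVM launch arguments mentioning jdwp, so you can confirm
// whether the debug agent was enabled when the process started.
public class JdwpCheck {
    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        jvmArgs.stream()
               .filter(a -> a.contains("jdwp"))
               .forEach(System.out::println);
    }
}
```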

A performance hit is expected when a debugger is actually attached, but this flag adds overhead to certain JVM operations on its own, specifically around exception handling. The Lucene query parser uses exceptions for signaling within a tight loop (the LookaheadSuccess pattern), which may be why query parsing is affected so badly.
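The pattern looks roughly like this (a simplified sketch with hypothetical names, not Lucene's generated source): a preallocated exception is thrown and caught purely as control flow, so every speculative lookahead exercises the JVM's exception machinery, exactly the path the jdwp agent makes more expensive.

```java
// Simplified sketch of the JavaCC lookahead idiom (hypothetical names,
// not Lucene's actual source). A single preallocated Error is thrown to
// abandon a speculative parse; it never represents a real failure.
public class LookaheadSketch {
    // Preallocated once, so no stack trace is filled in on each throw.
    private static final Error LOOKAHEAD_SUCCESS = new Error();

    // Pretend "lookahead" that signals success by throwing.
    static boolean lookahead(boolean matches) {
        try {
            if (matches) {
                throw LOOKAHEAD_SUCCESS; // control flow, not an error
            }
            return false;
        } catch (Error e) {
            if (e != LOOKAHEAD_SUCCESS) throw e;
            return true;
        }
    }

    public static void main(String[] args) {
        int hits = 0;
        for (int i = 0; i < 1_000_000; i++) {
            if (lookahead(i % 2 == 0)) hits++;
        }
        System.out.println(hits); // prints 500000
    }
}
```

Timing a loop like this with and without the jdwp flag should make the per-throw overhead visible.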

After disabling the debugging flag, the average return time for Solr queries dropped from ~3 seconds to under 400ms for complicated queries, and from ~3 seconds to under 30ms for simple ones. Before the change, the Solr process would use at most ~1.2 CPUs of a 4-CPU machine; afterward we can get it to use more than 3 CPUs on the same machine, and the overall throughput of the load test more than doubled.

Friday, March 27, 2009

Recently I was working with a client to support text-search functionality on their website. Given a search category and a list of words, the app was to return the highest-relevancy titles matching those words. For purposes of demonstration, assume the ranking was simple, using only an "or" search on the list of words. (So, for example, a search of "Green tea" would return all titles that matched "Green" or "Tea", and relevancy scoring didn't need to account for matching more than one word.)

This was a high-traffic application with upwards of 100 queries per second.

The most painful part of this query was sorting the results in order of relevancy. The simplified table looks like this:

The primary key was the combination (category_id, keyword, title_id). Queries always included both the category_id and the keyword, and the title_id was returned as an index into another table containing more information about the title:

In the "Extra" column of the query plan, you can see "Using filesort". This is never good: it means MySQL will scan every matching row and build up a temporary store so it can re-order the results and return only 20. For this search term, it is creating a new table with 5074 rows. Creating and sorting a 5000-row table is not a big deal on its own, but multiply that by 100 QPS and the database is doing a lot of extra work.

Without the order by, the query plan is easier on the database. It still matches 5074 rows but no longer needs the filesort. However, this query does not return the required results - they are returned in more or less random order.

The MySQL documentation on ORDER BY optimization says that you can add the column you are ordering by to the index, allowing the sort to be satisfied by an index scan instead of a table scan. With that in mind, I added the "relevancy" column to the index the optimizer was choosing - in this case, the primary key.

That alone did not eliminate the filesort. I did a little digging and found a great post on the MySQL Performance Blog about this: using an "IN" clause causes the index to be used slightly differently, and prevents the "relevancy" column from being used to optimize the sort. That post links to another article describing a UNION-based workaround. I went one step simpler and just had the code make multiple queries to the database, one keyword at a time, then re-order the results in code. The explain plan for that now looks like this:

Even though the code now makes a few trips to the database to satisfy the entire query, each individual query runs very fast. At the beginning of the exercise we were unable to sustain a load of even 50 QPS; by the time the optimizations were finished, we were running 80 threads against the database (4-core Opteron 2216HE) at a combined ~900 QPS, using no cache.
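The per-keyword split and in-code merge can be sketched like this (hypothetical names and schema; the real code issues one indexed SELECT per keyword and merges the rows it gets back):

```java
import java.util.*;

// Sketch of the "one query per keyword, merge in code" approach.
// Each inner list stands in for the rows from one per-keyword query,
// e.g. SELECT title_id, relevancy FROM titles_idx
//      WHERE category_id = ? AND keyword = ? ORDER BY relevancy DESC LIMIT 20
public class KeywordMerge {
    public record Hit(long titleId, int relevancy) {}

    public static List<Hit> merge(List<List<Hit>> perKeyword, int limit) {
        // Keep the best relevancy seen for each title across keywords,
        // matching the simple "or" ranking described above.
        Map<Long, Integer> best = new HashMap<>();
        for (List<Hit> hits : perKeyword) {
            for (Hit h : hits) {
                best.merge(h.titleId(), h.relevancy(), Math::max);
            }
        }
        // Re-sort the combined result set in code instead of in MySQL.
        return best.entrySet().stream()
                .map(e -> new Hit(e.getKey(), e.getValue()))
                .sorted(Comparator.comparingInt(Hit::relevancy).reversed())
                .limit(limit)
                .toList();
    }
}
```

Since each per-keyword query comes back already ordered by the index, the merge only has to combine a few small, pre-sorted lists.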

A welcome side effect of querying on a smaller unit (category ID plus a single keyword) is that we can make better use of a distributed cache (memcached) to store those results. We may not get a lot of queries for "Green tea", but by storing the results of "green" and "tea" separately, we get at least a partial cache hit for other queries like "green tree", "green grass", "oolong tea", etc.
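Per-keyword caching can be sketched like this (a plain HashMap stands in for memcached, and all names are mine; the key format and expiry policy are deployment details):

```java
import java.util.*;
import java.util.function.BiFunction;

// Sketch: cache per (category, keyword) so "green tea" and "green tree"
// share the cached results for "green". A HashMap stands in for memcached.
public class KeywordCache {
    private final Map<String, List<Long>> cache = new HashMap<>();
    private final BiFunction<Integer, String, List<Long>> loader;
    public int misses = 0; // exposed only for illustration

    // loader stands in for the per-keyword database query.
    public KeywordCache(BiFunction<Integer, String, List<Long>> loader) {
        this.loader = loader;
    }

    public List<Long> resultsFor(int categoryId, String keyword) {
        String key = categoryId + ":" + keyword.toLowerCase();
        return cache.computeIfAbsent(key, k -> {
            misses++; // only hit the database on a cache miss
            return loader.apply(categoryId, keyword);
        });
    }
}
```

With this shape, a query for "green grass" after "green tea" only needs the database for "grass"; "green" is already cached.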

Monday, June 18, 2007

This information is specific to the 3Ware 9500 series controllers (more specifically, the 9500-4LP). However, the 3Ware CLI seems to be the same on the other 3Ware 9XXX controllers I have used (the 9550 for sure).

Under Linux, the 3Ware cards can be managed through the "tw_cli" command. (The CLI tools can be downloaded for free from 3Ware's support website.)

I now need to physically replace the bad drive. Unfortunately, since our vendor wired some of our cables cockeyed, I will usually generate some I/O on the disks at this point to see which of the four disks is actually bad. (Hint: the one with no lights on is the bad one.)

dev306:~# find /opt -type f -exec cat '{}' > /dev/null \;

With the bad disk identified and replaced, now I need to go back into the 3Ware CLI and find the new disk, then tell the array to start rebuilding.

dev306:~# /opt/3Ware/bin/tw_cli
//dev306> maint rescan
Rescanning controller /c0 for units and drives ...
Done.
Found the following unit(s): [none].
Found the following drive(s): [/c0/p3].

We run a lot of Supermicro H8DAR-T motherboards in production. These are the BIOS settings that work well for us. I have not done a lot of tweaking trying to get more performance out of our systems with BIOS settings, since stability is key.

Note that unless specified here, we leave the settings at their default values. (Some of the settings listed are defaults, documented because we need them set that way.) Especially important options are in BOLD.

The Supermicro H8DAR-T motherboard comes in (at least) two flavors. The differences that I know about between the two versions are:

* The version 2.01 board will run OpenSolaris/Nexenta out of the box, because of a difference in the SATA controller hardware. The version 1.01 board will not run OpenSolaris without an add-on controller card.

* The 1.01 and 2.01 boards use different hardware sensors (for temperature, fan speed, etc.). We read sensor stats through our IPMI cards, so the IPMI cards need to be flashed with firmware matching the specific hardware revision. The IPMI cards still work for power on/off and console redirection without the matching firmware; only the sensors fail when the IPMI firmware mismatches the motherboard version.

Unfortunately, I do not see enough of a difference at POST time to be able to tell them apart. However, there are two ways I know of to do the detection.

1. With the cover of the machine off, the version can be seen in the back left corner of the board. (Will post pics later)