Details

Description

From profiling of monitor contention, as well as observations of the
95th and 99th response times for nodes that perform distributed search
(or ‟aggregator‟ nodes) it would appear that the HttpShardHandler code
currently does a suboptimal job of managing outgoing shard level
requests.

Presently the code contained within lucene 3.5's SearchHandler and
Lucene trunk / 3x's ShardHandlerFactory create arbitrary threads in
order to service distributed search requests. This is done presently to
limit the size of the threadpool such that it does not consume resources
in deployment configurations that do not use distributed search.

This unfortunately has two impacts on the response time if the node
coordinating the distribution is under high load.

The usage of the MaxConnectionsPerHost configuration option results in
aggressive activity on semaphores within HttpCommons, it has been
observed that the aggregator can have a response time far greater than
that of the searchers. The above monitor contention would appear to
suggest that in some cases its possible for liveness issues to occur and
for simple queries to be starved of resources simply due to a lack of
attention from the viewpoint of context switching.

I've been struggling with the same issue, and resolved it by just bumping up the number of connections per host (see SOLR-2184 which maybe should be marked as a related issue?)

One thing that I would suggest (and be willing to help with!) is doing a great job of documenting what these parameters do, and potentially listing out as examples the two options for optimizing for throughput versus fair balanced response times, and why you would do one or the other. Exposing these params is good, but can often cause folks to seize on them and use the inappropriately and shoot themselves in the foot. As an example, think of the current issues with the "Optimize" command, where I have seen folks optimize after every commit, because who wouldn't want that!?

Eric Pugh
added a comment - 09/Mar/12 15:48 I've been struggling with the same issue, and resolved it by just bumping up the number of connections per host (see SOLR-2184 which maybe should be marked as a related issue?)
One thing that I would suggest (and be willing to help with!) is doing a great job of documenting what these parameters do, and potentially listing out as examples the two options for optimizing for throughput versus fair balanced response times, and why you would do one or the other. Exposing these params is good, but can often cause folks to seize on them and use the inappropriately and shoot themselves in the foot. As an example, think of the current issues with the "Optimize" command, where I have seen folks optimize after every commit, because who wouldn't want that!?

The first one at least (and I'd guess the others as well) apparently don't depend on this seed, they fail in IntelliJ too. They all seem to fail on SearchHandler[298]. I think once it was a result of something not being implemented in one of the Mocks, but can't swear to it.

Erick Erickson
added a comment - 09/Mar/12 19:17 The 3x patch runs through all the tests, but the trunk fails.
The first one at least (and I'd guess the others as well) apparently don't depend on this seed, they fail in IntelliJ too. They all seem to fail on SearchHandler [298] . I think once it was a result of something not being implemented in one of the Mocks, but can't swear to it.
Errors in the debugger like: java.lang.IllegalArgumentException: Invalid uri 'http://://[::1]:33332/solr/select': escaped absolute path not valid. What????
I don't have time to look into it now...
ant test -Dtestcase=TestDistributedGrouping -Dtestmethod=testDistribSearch -Dtests.seed=-7e47ec572525db2b:747a1a2f9e04dca3:611098fa6c7bdbe4 -Dargs="-Dfile.encoding=MacRoman"
ant test -Dtestcase=FullSolrCloudDistribCmdsTest -Dtestmethod=testDistribSearch -Dtests.seed=666cf0bc1fb2bbe2:-307010e33b27cd24:611098fa6c7bdbe4 -Dargs="-Dfile.encoding=MacRoman"
ant test -Dtestcase=DistributedSpellCheckComponentTest -Dtestmethod=testDistribSearch -Dtests.seed=-19364590f7798def:3cfbf5694891d7b:611098fa6c7bdbe4 -Dargs="-Dfile.encoding=MacRoman"
ant test -Dtestcase=TestDistributedSearch -Dtestmethod=testDistribSearch -Dtests.seed=6cdbfdcabaf8471c:af7365c8c8b063:-35863e1b26976412 -Dargs="-Dfile.encoding=MacRoman"
ant test -Dtestcase=OverseerTest -Dtestmethod=testShardLeaderChange -Dtests.seed=47a8a04f4df29e39:7cf1cab9cde30f9f:-35863e1b26976412 -Dargs="-Dfile.encoding=MacRoman"
ant test -Dtestcase=BasicDistributedZkTest -Dtestmethod=testDistribSearch -Dtests.seed=-37854d406161a603:-3d7c63121a0b99fe:-a77be13ddf887d7 -Dargs="-Dfile.encoding=MacRoman"
ant test -Dtestcase=DistributedTermsComponentTest -Dtestmethod=testDistribSearch -Dtests.seed=6fe8b3c1860d0a59:-20d9908f662d6a90:-35863e1b26976412 -Dargs="-Dfile.encoding=MacRoman"

Erick Erickson
added a comment - 12/Mar/12 14:19 OK, this passes all the tests on my mac. I'll have a chance to look over the code today. I'll commit this today or tomorrow unless somebody finds something.

Erick Erickson
added a comment - 12/Mar/12 14:19 OK, this passes all the tests on my mac. I'll have a chance to look over the code today. I'll commit this today or tomorrow unless somebody finds something.

The above monitor contention would appear to suggest that in some cases its possible for liveness issues to occur and for simple queries to be starved of resources simply due to a lack of attention from the viewpoint of context switching.

that's pretty interesting... how many concurrent requests were you running through a single aggregator?

Yonik Seeley
added a comment - 12/Mar/12 21:11 The above monitor contention would appear to suggest that in some cases its possible for liveness issues to occur and for simple queries to be starved of resources simply due to a lack of attention from the viewpoint of context switching.
that's pretty interesting... how many concurrent requests were you running through a single aggregator?

Erick Erickson
added a comment - 13/Mar/12 14:47 Trunk r: 1300160
3.6 r: 1300158
Big thanks Greg!
Greg:
How about an entry in CHANGES.txt? And perhaps a short example somewhere on the Wiki explaining how to make this happen?

Greg Bowyer
added a comment - 14/Mar/12 00:50 No problems, how would I change the wiki, just go ahead and edit it ?
I am loathe to submit a patch for changing the CHANGES.txt, but is that the same as the wiki ?

Yeah, just go ahead and edit the Wiki, all you need to do is create an account.

As for CHANGES.txt. Just attach it to this JIRA and I'll go ahead and commit it without a new JIRA. Two tricky things:
1> there are two files, one in the 3x branch and one in the 4x, both need the text
2> The 4x branch needs to be edited in two places, one for the 3.x section and in the 4x section.

Erick Erickson
added a comment - 14/Mar/12 13:51 Yeah, just go ahead and edit the Wiki, all you need to do is create an account.
As for CHANGES.txt. Just attach it to this JIRA and I'll go ahead and commit it without a new JIRA. Two tricky things:
1> there are two files, one in the 3x branch and one in the 4x, both need the text
2> The 4x branch needs to be edited in two places, one for the 3.x section and in the 4x section.
I'll see it when if this JIRA changes and check them in.
Thanks!

I added some information to the wiki in the solrconfig.xml section talking about the shardHandlerFactory that you might want to review. Documentation is not my strong point

also can we can put the following into the CHANGES.TXT

SOLR-3221: Added the ability to directly configure aspects of the concurrency
and thread-pooling used within distributed search in solr. This allows for finer
grained controlled and can be tuned by end users to target their own specific
requirements. This builds on the work of the HttpCommComponent and uses the same
configuration block to configure the thread pool. The default configuration has
the same behaviour as solr 3.5, favouring throughput over latency. More
information can be found on the wiki (http://wiki.apache.org/solr/SolrConfigXml)

Greg Bowyer
added a comment - 21/Mar/12 19:00 I added some information to the wiki in the solrconfig.xml section talking about the shardHandlerFactory that you might want to review. Documentation is not my strong point
also can we can put the following into the CHANGES.TXT
SOLR-3221 : Added the ability to directly configure aspects of the concurrency
and thread-pooling used within distributed search in solr. This allows for finer
grained controlled and can be tuned by end users to target their own specific
requirements. This builds on the work of the HttpCommComponent and uses the same
configuration block to configure the thread pool. The default configuration has
the same behaviour as solr 3.5, favouring throughput over latency. More
information can be found on the wiki ( http://wiki.apache.org/solr/SolrConfigXml )

David Smiley
added a comment - 13/Apr/12 14:51 Curious; why was the default choice made to favor throughput over low latency? Generally speaking, low latency in search is more important. There should be no issue of backwards compatibility, either.

I'd go with something a bit shorter - no need to get into the gritty details - that's why the JIRA issue number is there. I'd stick to something closer to "Make shard handler threadpool configurable." or "Added the ability to directly configure aspects of the concurrency and thread-pooling used within distributed search in solr."

Mark Miller
added a comment - 08/May/12 14:46 I am loathe to submit a patch for changing the CHANGES.txt
Usually the committer handles this - no rules about it though.
I'd go with something a bit shorter - no need to get into the gritty details - that's why the JIRA issue number is there. I'd stick to something closer to "Make shard handler threadpool configurable." or "Added the ability to directly configure aspects of the concurrency and thread-pooling used within distributed search in solr."

David Smiley
added a comment - 11/May/12 18:57 Only to support the exacting behaviour that was in solr prior to this patch.
I wouldn't of made the same decision but it's moot now, esp. for 3x. I don't see how this change would break anybody, and I see how it would improve.
I think Solr 4 should have better defaults; agreed?