Better faceting for multiple intervals on DV fields

Description

There are currently two ways to facet on value ranges in Solr: “Range Faceting” and “Query Faceting” (doing range queries). Both end up doing something similar to:

searcher.numDocs(rangeQ, docs)

The good thing about this implementation is that it can benefit from caching. The bad thing is that it may be slow with cold caches, and that it requires one query per range.

A different implementation would work similarly to regular field faceting: using doc values and validating ranges for each value of the matching documents. This implementation would sometimes be faster than Range Faceting / Query Faceting, especially in cases where caches are not very effective, such as under a high update rate or when the ranges change frequently.
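To illustrate the difference, here is a minimal sketch (plain Python, not the actual Solr/Lucene code) of counting interval hits in a single pass over the matching documents' values, versus scanning once per interval, which is roughly the shape of the query-based approach when nothing is cached:

```python
def count_intervals_single_pass(values, intervals):
    """values: numeric doc values of the matching docs.
    intervals: list of (start, end) tuples, both bounds inclusive here.
    One pass over the documents, checking each value against all intervals."""
    counts = [0] * len(intervals)
    for v in values:
        for i, (start, end) in enumerate(intervals):
            if start <= v <= end:
                counts[i] += 1
    return counts


def count_intervals_per_query(values, intervals):
    """Equivalent result, but one full scan per interval."""
    return [sum(1 for v in values if start <= v <= end)
            for start, end in intervals]


populations = [0, 150, 999, 1000, 50000, 2000000]
intervals = [(0, 999), (1000, 99999), (100000, 10000000)]
assert count_intervals_single_pass(populations, intervals) == \
       count_intervals_per_query(populations, intervals) == [3, 2, 1]
```

The real implementation can do better than the inner linear scan (for example, by sorting the intervals and binary-searching each value), but the sketch shows why a single pass over doc values avoids the per-range query cost.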

Functionally, the result should be exactly the same as the one obtained by doing a facet query for every interval.

I did some very basic performance testing to compare interval faceting vs facet queries:
Dataset: Geonames.org dataset (added 4 times to make a 33M-doc index)
Query Set: 4960 boolean queries using terms from the dataset
1 document updated every second
autoSoftCommit every second.
HW: MacBook Pro Core i7, 2.7 GHz with 8 GB of RAM with spinning disk (5400 RPM)
All times are in milliseconds
Repeated the test with different numbers of intervals (on the “population” field of the geonames dataset):

Num Intervals            1      2      3      4      5     10

Min      Intervals      25     23     26     23     24     26
         Facet Query     2      2      3      4      4      6
Max      Intervals    1885   2254   2508   2800   2749   3031
         Facet Query  2199   2414   3957   2766   1869   5975
Average  Intervals     181    177    191    183    148    174
         Facet Query   156    277    359    299    216    408
P10      Intervals      53     54     54     54     54     56
         Facet Query    26     30     33     31     29     35
P50      Intervals      96     95     98     97     88     96
         Facet Query    54    211    293    188     58     74
P90      Intervals     453    940    467    458    350    438
         Facet Query   432    656    794    749    660   1066
P99      Intervals     809    884    968    877    857    897
         Facet Query   867   1041   1354   1219   1116   1784

There is some variation between the tests with different numbers of intervals (using the same method) that I don’t understand very well. For each test I restarted Jetty (though index files were probably cached by the OS between tests).

In general, what I see is that the average is similar to or lower than facet query, the p10 and p50 are similar to or higher than facet query (these are probably the cases where the facet queries hit the cache), and the p90/p99 are lower for the Intervals implementation, probably because of facet queries missing the cache.

“Max” varies a lot; I don’t think it’s a very representative number, I just left it in for completeness. “Min” is very similar in all cases: obviously, in the best case (all cache hits) facet query is much faster than intervals.

I also did a quick test on an internal collection with around 100M docs in a single shard: around 6000 queries with around 40 intervals each. For this test I got:

         Intervals   Facet Query

Min            122           124
Max           6626         61009
Average        238           620
P10            155           151
P50            201           202
P90            324           461
P99            836         23662

This collection has updates and soft commits.
I don’t have numbers for distributed tests, but from what I could see, the results were even better on wide collections, presumably because of the lower p90/p99.

Tomás Fernández Löbbe added a comment - 01/Jul/14 21:02

Added a distributed test case. I don't know if it makes sense to have a completely different test class for this, but I need to suppress some older codecs, and I don't know if there is a way from inside the test (in the distributed test cases) to know which codecs are being used.

Mark Miller added a comment - 07/Jul/14 16:28

Thoughts: the numbers look good, the patch looks good, the comments look good, the tests look good; a very clean patch.
Unfortunately, it's probably not something I can look at soon personally, but noting it in case another committer can tackle this in the nearer term.

Erick Erickson added a comment - 19/Jul/14 03:35

Tomas:
Sorry I haven't been able to get to this, but I started looking at it today. Very nice patch! I uploaded a few tweaks in a bit (spelling corrections, and I thought that FacetComponent.countFacets was getting a little long so I refactored the three clauses into their own methods, no functional change at all).
A few questions though:
1> I'm assuming this supports the f.<fieldname>.facet syntax; that's at a higher level than this patch, I'd guess
2> There's a TODO in SimpleFacets [1410] or so....
3> Perhaps the most substantive question I have is the syntax. Why use '()' for exclusive ranges and not '{}' like range facets do? Other things being equal, it seems like this would be more consistent with other Solr syntax.

Tomás Fernández Löbbe added a comment - 19/Jul/14 05:43

Thanks for looking at the patch, Erick.
1> I'm assuming this supports the f.<fieldname>.facet syntax; that's at a higher level than this patch, I'd guess
Yes, that's the way to set the intervals for a field. For example:
f.test_l_dv.facet.interval.set=[10,20]
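As an illustration (not code from the patch), a full request using this parameter could be assembled as in the following sketch; the query string and interval values are taken from examples in this thread, and the parameter is simply repeated once per interval, as with any multi-valued Solr parameter:

```python
from urllib.parse import urlencode

# Hedged sketch of building an interval-faceting request; field name and
# query are example values from this thread, not fixed API requirements.
params = [
    ("q", "name:church name:creek"),
    ("facet", "true"),
    ("facet.interval", "test_l_dv"),
    ("f.test_l_dv.facet.interval.set", "[0,9]"),
    ("f.test_l_dv.facet.interval.set", "[10,20]"),
]
# urlencode accepts a sequence of pairs, so repeated keys are preserved.
query_string = urlencode(params)
print(query_string)
```

The resulting string would be appended to the select handler URL, e.g. `http://localhost:8983/solr/collection/select?` followed by the encoded parameters.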
2> There's a TODO in SimpleFacets ...
There are a couple of TODOs. In SimpleFacets, to add a comment about the distributed case. I'll add that. In IntervalFacets there are a couple more. Two of them I kept from the section of code I used from NumericFacets (maybe this can be refactored in the future). Two TODOs are in the parsing method of the interval. I see those two as possible future improvements, I don't think they are necessary at this point.
3> Perhaps the most substantive question I have is the syntax. Why use '()' for exclusive ranges and not '{}' like range facets do? Other things being equal, it seems like this would be more consistent with other Solr syntax.
Well, that is true, but I guess in range queries "{}" had to be used because "()" was already being used for grouping clauses. I feel that "(10,20)" is a better representation of an interval, but I see your point about being consistent with other Solr syntax; I'm OK either way. I was also planning on adding support for local params on the interval, to allow setting an arbitrary key, and if we use curly braces it would be more confusing:
f.test_l_dv.facet.interval.set={!key='foo'}(10,20)
vs
f.test_l_dv.facet.interval.set={!key='foo'}{10,20}
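For illustration, the bracket/parenthesis syntax being discussed can be handled by a small parser along these lines. This is a sketch of the idea only, not the parsing code in the patch: it assumes numeric bounds and does not handle `*` endpoints, escaping, or local params:

```python
import re

# '[' / ']' mark inclusive bounds, '(' / ')' exclusive ones, e.g. "[10,20)".
_INTERVAL = re.compile(r"^([\[(])\s*([^,\s]+)\s*,\s*([^,\s\])]+)\s*([\])])$")

def parse_interval(spec):
    """Parse an interval spec like '[10,20)' into bounds and inclusivity."""
    m = _INTERVAL.match(spec)
    if m is None:
        raise ValueError("bad interval syntax: %r" % spec)
    open_br, start, end, close_br = m.groups()
    return {
        "start": float(start),
        "end": float(end),
        "start_inclusive": open_br == "[",
        "end_inclusive": close_br == "]",
    }

assert parse_interval("[10,20]") == {
    "start": 10.0, "end": 20.0,
    "start_inclusive": True, "end_inclusive": True,
}
assert parse_interval("(10,20]")["start_inclusive"] is False
```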

Erick Erickson added a comment - 20/Jul/14 21:40

Tomás:
Took another look today; looks really good. I did happen to notice a couple of unused parameters, so I attached a patch that takes them out.
Unless there are objections, I'll commit this next Tuesday or Wednesday.
Thanks!
Erick

Tomás Fernández Löbbe added a comment - 23/Jul/14 20:00

Thanks for committing this, Erick. I tested this in 4x and tests are passing too. The only thing that needs to be added in addition to the merge is "Lucene3x" to the list of codecs to skip (in DistributedIntervalFacetingTest and TestIntervalFaceting).

David Smiley added a comment - 25/Jul/14 13:41

Tomás Fernández Löbbe, do you have any scripts for your performance testing? I'm impressed with that part specifically, as I've got some interval faceting on my horizon that is somewhat similar to what you've got here: LUCENE-5735

Tomás Fernández Löbbe added a comment - 25/Jul/14 16:21

David Smiley, I used the attached Java class for running the queries. As I said, the dataset was geonames, added 4 times (with different IDs) so that the index had 33M docs total.
The queries are all boolean queries with two OR terms, generated by taking terms from the “name” field of the dataset. For example:
name:cemetery name:lake
name:el name:historical
name:church name:el
name:dam name:la
name:al name:church
name:al name:creek
name:baptist name:la
name:la name:mount
name:creek name:de
name:center name:park
name:church name:creek
...
Eyeballing the logs, most of those queries matched a high number of docs in the index. In addition, I had a bash script running to add a document every second:
#!/bin/bash
IFS=$'\n'
while read -r q
do
  echo "$q" > tmp.doc
  curl -v 'http://localhost:8983/solr/geonames/update?stream.file=/absolute/path/to/tmp.doc&stream.contentType=text/csv;charset=utf-8&separator=%09&encapsulator=%0E&header=false&fieldnames=id,name,,alternatenames,latitude,longitude,feature_class,feature_code,country_code,cc2,admin1_code,admin2_code,admin3_code,admin4_code,population,elevation,dem,timezone,modification_date&f.alternatenames.split=true&f.alternatenames.separator=,&f.alternatenames.encapsulator=%0E&f.cc2.split=true&f.cc2.separator=,&f.cc2.encapsulator=%0E'
  sleep 1
done < allCountries.txt
Unfortunately, it looks like I deleted the schema file I used; however, there was nothing crazy about it: population is an int field with docValues=true, and autoSoftCommit is configured to run every second.
For the second test, I can’t upload the code because it’s full of customer-specific data, but the test is very similar. I took some production queries, which had “intervals” in 6 fields, around 40 “intervals” total (originally using facet queries for each of them). For that test I used a similar bash script to upload data every second too.
I have been testing this code in an environment mirroring production for around 2 to 3 weeks now, and QTimes have improved dramatically (on a multi-shard collection). I haven’t seen errors related to this.