It has been over a year since we released our Spatial Solr Plugin (SSP) to the community and its great to see that its serving so many users so well. During that time there has also been a great deal of work done on adding official spatial search support to Solr. Much of this work is now complete. However as it is only available in the yet unreleased Solr 3x and 4x, it remains out of the grasp of many users. This coupled with the fact that the future of spatial search support in Lucene is unclear means that there continues to be a place for SSP in the community.

Also during this period we received a lot of feedback from our SSP users about bugs they had found (thanks!) and features they wished SSP had. We also found some bugs ourselves in the Cartesian tier code which had come from Lucene. Faced with the need to fix these bugs as soon as possible and the desire to extend SSP to meet our users’ wishes, we decided to create SSP 2.0, a totally newly designed and written plugin for Solr providing highly efficient and simple spatial search support. In this blog I will introduce SSP 2.0, explain some of the motivations behind our changes and discuss the future for SSP 1.0 users.

Efficient Filtering vs Poles and Meridians

The largest change made in SSP 2.0 was the removal of the Cartesian tier filtering. While an excellent way to quickly and efficiently filter out documents outside of an area, the tier code suffered from bugs in regards to its handling of poles (the North and South variety) and meridians. The code was also very complex and difficult to maintain. Many of the bugs were also identified by Grant Ingersoll in the Lucene trunk. There its future seems dim, with it most likely being deprecated in Lucene 3x and removed in 4x. Consequently when writing SSP 2.0, I sought for a way to remove this code while still providing efficient filtering.

As it turns out a simple and effective way to do this is to exploit the performance of NumericRangeQuerys (NRQs) and create a bounding box around the search circle. Although requiring two NRQs, the bounding box driven approach addressed the problems of the poles and meridians while still providing efficient filtering.

An additional benefit of removing the Cartesian tier code is that the process of indexing data was simplified. It is no longer necessary for fields to be added at index time through an UpdateRequestProcessor since the bounding box NRQs query against the latitude and longitude fields that users are already indexing. This increases indexing performance while reducing the size of your index.

Trading Performance for Performance

When designing SSP 1.0, one of our primary focuses was ensuring high performance. Spatial searches are expensive in comparison to their free text search query cousins, consequently we sought to reduce the cost where possible. We did this in two ways. First, we used multi-threading to compute distances in parallel and secondly, we stored the computed distances and re-used them for sorting and display purposes. It turned out however that this came at a price. Storing the computed distances in the SpatialFilter meant that we could not let the results of the filtering be cached in Solr’s caches. This meant that the filtering had to occur with every query, even for the same search area. Furthermore, storing the computed distances for hundreds of thousands of documents meant that SSP had high resource requirements in a concurrent querying environment.

In SSP 2.0 we made the decision to change from a Lucene Filter which stored the computed distances, to a Lucene Query which scored documents by distance. This meant that the computed distance would become part of the overall score for a query, reducing the memory footprint. The result of the Query could also be cached, meaning that frequently executed queries would have little to no cost. It also had the additional benefit of allowing boosting by distance which I discuss later. It however also meant that we no longer had cached distances to use during sorting and result displaying. Nor could we use multi-threading to reduce the cost of computing distances since Querys are scored in single threads. Consequently we looked for ways to reduce the cost of these operations.

As it turns out, it is possible to compute a simple form of the distance between two points which can be used for comparison sake but not displaying. Consequently we change SSP to use this simple form for scoring and for sorting. This optimization calculation managed to offset some of the performance lost by having to calculate the distance for a document during both filtering and sorting.

For including for true distances in the search results, we adapted our GeoDistanceComponent which was already retrieving the distances from our distance cache, to calculate the distances for the page of results being returned to the user. The cost of calculating distances for say, 10 documents, is negligible and not noticeable.

Distance Relevancy

As mentioned above, SSP 2.0 is based on a Lucene Query which scores documents using a simplified distance. This means that documents which are closer to the centre of our search area are now considered more relevant. This is logical since I would rather go to a pizza restaurant 100m away rather than 1km away. Like other Querys, SSP also supports score boosting. This means that users of SSP can decide for themselves whether distance plays a large or small role in the relevancy of their search results. See below where I discuss the new query syntax for examples.

Simplified Query Syntax

One feature requested by one of our users was to ability to search in multiple search areas, each with their own latitude, longitude and radius. This presented a usability challenge to SSP’s query syntax which was built around the idea of a single search area. Consequently we used this as a motivation to address and simplify our query syntax. We’ve removed as many parameters as possible and reduced the search area syntax to a single parameter circles which concisely defines the areas that will be searched. The following example illustrates the change:

Here the latitude, longitude and radius are now defined in a unified tuple. It is possible to define multiple search areas by separating the tuples with ‘//’ such as circle=52.347,4.453,10//40.543,10.983,100. Gone are the parameters for units (all distances are now assumed to be in km), calc (all calculations are done using the arc calculator) and threadCount (multi-threading is no longer used).

Two additional parameters have been added:

qtype:Allows the user to specify the QParserPlugin type they wish to use for parsing the free text search component of the query, if one is defined. Users wishing to use dismax queries would set this to dismax. Example {!spatial qtype=dismax …}pizza

boost:Allows the user to define the boost to applied to the spatial queries. Default is 1.0 meaning no boost. Example {!spatial boost=3.0 …}pizza

Query vs Filter Query

A little known feature about SSP is that its not necessary for the spatial aspect of your query to be your main query (q), it can in fact be a filter query (fq). This was not of much use in SSP 1.0 since its Qparser required you to have a query string and because the results of the spatial filtering could not be cached, there wasn’t much benefit to using a filter query.

In SSP 2.0 we addressed these limitations. The SpatialQParser now correctly handles a query string being omitted and since the results of the SpatialQuery are now cachable, you can now benefit from Solr’s filterCache.

Assuming you would normally send the request q={!spatial …}title:pizza, it is now possible to send the request q=title:pizza&fq={!spatial …}.

When should you use which? Well, one limitation of SSP’s support for sorting by distance is that this is handled during the parsing of your main query. Therefore if you wish to sort by distance, you must use the SpatialQParser for your main query. Other than that, it really depends on your use case. If you are doing frequent searches in a few search areas with differing query string, then its best to move the spatial search to a filter query so it can be cached. However if you are often doing a few spatial searches with the same query strings, then its best to have it as your main search query so they can be cached together with the query string results.

Bringing it all Together

So, how do you configure SSP 2.0? The following is an example configuration:

Thats all thats required. Gone is the need for the additional dynamic field in the schema.xml, and gone is the need for the UpdateRequestProcessor in the solrconfig.xml.

Summary of Changes

The following is a summary of the major changes made in SSP 2.0:

Cartesian tier code has been removed and replaced by dual NRQ bound boxes

SpatialFilter has been replaced by SpatialQuery which scores documents using a simplified distance and supports boosting

Spatial queries can be now be cached

The SpatialQParserPlugin can be used to define a filter query (fq) instead of a query (q) when support for sorting is not required

Multi-threaded filtering is no longer supported

All distances are now assumed to be in kilometres

All returned distances are computed using the arc great circle calculation

Support for searching in multiple areas has been added

Query syntax has been simplified with redundant parameters removed, search area definition combined into a single circle parameter

qtype local parameter has been added allow the user to define which QParserPlugin should be used to parse the free text component of a search

boost local parameter has been added to allow the user to add a boost to SpatialQuerys

SpatialTierQueryParserPlugin has been renamed to SpatialQParserPlugin

SSP 1.0

While we are very pleased by the improvements made in SSP 2.0, we appreciate that many people are using SSP 1.0. If possible, we recommend that you upgrade from SSP 1.0 to 2.0 so that you can benefit from the improvements discussed above. The process for upgrading is very simple:

Add the new SSP jar to your Solr installation and remove the existing SSP jar

Replace your configuration of the SpatialTierQueryParserPlugin which that of the SpatialQParserPlugin shown above.

Remove the configuration of the SpatialTierUpdateRequestProcessor

Remove the dynamic _tier_* field from your schema.

Consider re-indexing your lat and lng fields to use a TrieDoubleField with a greater precisionStep. This is optional assuming you used a TrieDoubleField in your SSP 1.0 setup as well. The greater precisionStep will increase the performance of the underlying bounding box queries

For those who are unable to upgrade, JTeam is committed to continue supporting SSP 1.0 users. To find out more about how we can do this, contact us.

Future and Roadmap

JTeam has every intention to continue to develop SSP and will strive to patch any bugs as soon as possible. The following features are on our roadmap:

Add single NRQ morton number bounding boxes

Polygon search areas

Clustering / collapsing of results that are physically close to one another

16 Responses

Nice work Chris! I am also actively working with spatial search, having contributed a spatial filtering technique based on geohashes which support variable numbers of points per document. Your work here and other work for Solr does not support that. Here’s the patch: https://issues.apache.org/jira/browse/SOLR-2155 I have continued the work in the patch with a working polygon search (based on the LGPL licences JTS), benchmarking performance using Geonames.org https://issues.apache.org/jira/browse/LUCENE-2844 and by revamping my geohash filter algorithm to use a more scalable algorithm that will scale similarly to lat-lon NRQs (by indexing a term for each geohash tier level which is conceptually similar to how NRQ trie works). A geohash is a morton number by the way. Assuming your work here is ASL licensed, it would be nice to see my work combined with yours to become the future of geospatial search in Solr! I suggest you not waste time in the NRQ morton numbers or polygon search areas since I’ve already done it.

January 10, 2011 at 12:49by Rich mercer

|

This is great Chris, some awesome improvements. I’m a little disappointed that distance faceting didn’t make it in though. That’s probably my #1 requirement right now.

I have installed and tested Spatial Solr Plugin 2.0 and find some limitations:

1) The simplified distance can’t be used for local-search boosting because the scale runs from +1 at 0 distance to -1 on the other side of the world. Thus all documents within 20km run from 1.0 to 0.99999, so one needs boost=100000 to make a substantial difference, at which point distance boost would outweighs all the other factors.

Fix: the distance used for boosting should rather scale from +1 at the centre to 0 at the radius of the circle – this can be done by scaling the simplified distance appropriately when it is used as a boost.

2) Spatial queries always filter the results to the circle, even when used as a boost query in the “bq” parameter of a dismax handler, so currently there’s no way to boost by distance without the circle filter (workaround: use a big circle).

3) For some reason I always get ArrayIndexOutOfBoundsException at LatLongLocationDataSetFactory.java:68 when attempting to sort by distance.

January 17, 2011 at 12:40by Rob

|

Thank you so much for updating the syntax and allowing for multiple search areas; This is an important feature to me;

I have a requirement for spatial search with the distance in miles. The previous version supported it. Is there way to incorporate this at all with the new version?

January 27, 2011 at 15:17by Spatial Solr user

|

You said:

For including for true distances in the search results, we adapted our GeoDistanceComponent which was already retrieving the distances from our distance cache, to calculate the distances for the page of results being returned to the user. The cost of calculating distances for say, 10 documents, is negligible and not noticeable.

Question:

How can I include calculated distance in search output?

February 17, 2011 at 22:36by Alexander Kanarsky

|

In case anyone is interested in polygons/polylines search on top of the SSP 2.0 RC4; I created a project to support such a thing (with certain limitations). The patch to the SSP source code is here: https://sourceforge.net/projects/ssplex/files/

February 21, 2011 at 10:21by Joe

|

Rob mentioned Sunspot support for SSP 2.0. I too am struggling to ‘add’ this latest Spatial Search Plugin to Sunspot. No luck so far. Just getting no matches when I include geo parameters in the query. Any tips?

I’m still using version 1.0 of your spatial library and now I want to update to the new version. But I got one Problem, I’m using lucene and solr, where are the examples or testcases for this usecase.
Comparing v1 to v2 a lot of classes are gone.

In SSP 2.0 RC 5, it supports boosting, “bq” and qtype. Can any send me the sample query. Because i tried query like

{!spatial qtype=dismax boost=100000 circles=lat,lng,distance}&fq=*. But i cannot find any change in the response.

At the same time, I am unable to boost the records using “bq”

Thanks in advance

Regards,

S. Vanchi Nathan

May 4, 2011 at 00:31by Gavin

|

I wonder if you could help me with an error. I installed a new copy of Tomecat6 and Solr 1.4.1 in a virtual machine, running Ubuntu 9.10. I then added the extra xml into my 2 .xml conf files (shown under `Bringing it all together` above), which are in the …/example/solr/conf folder. I also have the SSP jar file inside …/example/lib and …/example/solr/lib (I’ve seen a couple tutorials mention both folders). And when I view the initial admin page (at http://localhost:8983/solr/admin/) I get the following error:

HTTP ERROR: 500

Severe errors in solr configuration.

Check your log files for more detailed information on what may be wrong.

Is there a way to index multiple lat/long per document? For example i need to do a radius search for all people within a X km, and each person can have multiple office locations. One way would be to index location as a separate document type. But then i lost filtering based on the ‘Person’ document unless i port over all the fields from Person to Location.

You can see that I am pasting a sample query above. My problem is the common one: there is no geoDistance returned. I have the schema.xml and the solrconfig.xml modified just like you advise in this blog post. Yet, Solr just doesn’t want to return the distance no matter what I try.

On the bright side, Solr *is* sorting the results by distance. I manually measured 2 points about 150 km apart on Google Maps. Then I verified that Solr does indeed return the second point when I put the radius about 150, but not when it is below 150. So, that is helpful. But I’d still love to get the geo-distance.

In any event, I suppose I can simply triangulate the distance of returned locations via the PHP code using trigonometry.

Latest Tweets

Meta

About this blog

The content on this blog is developed mostly by our own developers. It contains a selection of best practices, trends and technical information on different open source technologies. Our blog is where we can share our experiences with technologies we are experimenting with, advancing with and more often than not indulging in.
In true open source community style, we don't hold back and we share code snippets, screen shots and provide insight wherever we can. We particularly enjoy setting up a dialogue with our readers and often there is also a thread of question and answers in the blogs too. Our blogs are not however all about just the technical successes but also our less successful efforts too ;-)

trifork

In many of our blogs, you'll find we endeavour to take you on a journey of our thought process. This means some of our blogs can often become epics and have a whole series associated with them.
We think that most of our readers appreciate reading about our experiences on new products, releases, conferences and technologies and to be honest its what makes us tick and what we thrive on most. Every now and then we also invite partners, industry leaders and others around us as guest writers or we share some interesting conversations we have had with them.
Suggestions, input and feedback is always welcome, just drop us a note and we'd be happy to hear from you.