Indexing Freeform-Tagged Data

Dieren, Netherlands

Tuesday, June 26th 2012, 09:11 CEST

At last week's MongoDB UK I presented on "Geolocation, Maps and MongoDB". During this talk I presented a few methods on how to store OpenStreetMap's tags on geographical objects. OpenStreetMap's tags are free form and for example for a road can look like:

As you can see, for every tag that we want to search on we would have to create an index. More indexes make inserts, updates and deletes slower and MongoDB is also limited to 64 indexes per collection. In order to have an index on every tag we need to find another solution. Also note that the second query doesn't even use an index when you check it with calling explain().

Tags as key/value pairs

The second method I introduced was the one where instead of storing the tags as properties of the tags field, I stored objects containing a tag/tagvalue pair each:

This query will find all documents where either the key is name or the value is Strand. This includes documents where there is an array element like { "k" : "name", "v" : "Strand" } in the tags_indexed array, but also every other document where there is one "k" : "name" and a "v" : "Strand" element, such as in:

In the first case, 12 documents are scanned and found. But in the second case, the same 12 documents are returned, but 172145 documents are scanned. This is because at the moment, $elemMatch does not use the index fully and will only use the first part of the compound index1 — tags_indexed.k in our case.

In order to make index usage better, it therefore makes more sense to have the property with the highest cardinality to be the first key in the compound index. In my data-set, the cardinality of tags_indexed.v is (a lot) higher than tags_indexed.k:

Now with the higher cardinality on the first part of the compound key only 44 documents are scanned. And instead of 429 milliseconds it now took less than 1 millisecond. Excellent news!

The second question from the first section ("find roads and paths") can be rewritten as:

db.poi.find( { 'tags_indexed.k': 'highway' } );

However, this query can not be serviced by our rewritten index ('tags_indexed.v' : 1, 'tags_indexed.k' : 1) as MongoDB can only use keys starting from the left side of a compound index for queries. We can now either add another index that only encompasses the k part or perhaps find a different solution.

Tag keys and values combined

There is one other alternative that I would like to explore, and that is where I combine the key and value of each tag into one string, and store those all in an array in the object, like:

Here we use a regular expression search (/^highway=/) anchored to the start of the string (with the ^). This can use an index, and is almost as good as the the case where tags_indexed.k was the first part of the compound key, except that it is a bit slower (527 vs 741 milliseconds):

Right now, I think that having an indexed on tags_combined where each tag is combined with key and value and an array element tags_combined is the best solution. Its index scanning is better for the name=Strand case than the other, and not significantly slower in the highway=* case. I would still like to see how the speed differences progress when I add more objects to the database.

Shortlink

Comments

Ian Mercer

Thursday, June 28th 2012, 22:18 UTC

The combined tags and values approach is the approach I use in my music database (after trying the other approaches).

I store each tag twice, once as 'tag:value' and once as just the value; that way I can also search for a value in any field (artist, album, genre, word,...). I also split every tag up into individual words and add them to the tag array as 'word:value' so I can find words that aren't at the start of the string (without having to use a RegEx search).

Good article.

Guillaume

Wednesday, July 4th 2012, 17:25 UTC

Nice post. But with any of this new structures, I can't see how to use sort() to order the results by "alt_name" for example ...

Can you give an example ?

Cordially.

Add Comment

Name:

Email:

Will not be posted. Please leave empty instead of filling in garbage though!

Comment:

Please follow the reStructured Text format. Do not use the comment form to report issues in software, use the relevant issue tracker. I will not answer them here.