Friday, April 10, 2015

ok, I have some benchmark numbers. Two almost-identical databases, one in mongo, one in postgres, so I can actually compare apples to apples. In Mongo I have everything in one collection. In postgres, I have two tables: tweet_pgh, which is all the data, and tweet_pgh_small, which is just the data I need.

These tables are pretty optimized: the mongo collection has indices on user.screen_name, coordinates (geospatial), coordinates.coordinates.0, and coordinates.coordinates.1. The SQL tables have btree indices on user_screen_name and GiST indices on the coordinates (clustered). Oh, and of course I've got PostGIS installed.
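For concreteness, the Postgres side of that setup would be DDL along these lines (the index names are my own invention; the post doesn't give them, only the table and column names):

```python
# Hypothetical DDL matching the setup described above. Index names are
# made up; only tweet_pgh / user_screen_name / coordinates come from the post.
pg_setup = [
    "CREATE INDEX tweet_pgh_screen_name_idx ON tweet_pgh USING btree (user_screen_name)",
    "CREATE INDEX tweet_pgh_coords_idx ON tweet_pgh USING gist (coordinates)",
    # CLUSTER physically reorders the table by the index (the "clustered" bit)
    "CLUSTER tweet_pgh USING tweet_pgh_coords_idx",
]
```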

Every query here selects everything in a table, then iterates through it (just incrementing a counter) to simulate a semi-realistic use case.
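A minimal sketch of that harness (the real benchmark presumably used pymongo and psycopg2 cursors; this version just takes any iterable, which is all the loop actually needs):

```python
import time

def consume(cursor):
    """Iterate an entire result set, incrementing a counter, and
    return (row_count, elapsed_seconds)."""
    start = time.time()
    count = 0
    for _row in cursor:  # works for pymongo cursors, psycopg2 cursors, any iterable
        count += 1
    return count, time.time() - start

# With a real driver this would be called as, e.g.:
#   consume(db.tweets.find({...}))                        # pymongo (hypothetical names)
#   cur.execute("SELECT * FROM tweet_pgh"); consume(cur)  # psycopg2
```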

Count in mongo: 3653683

Count in postgres: 3653278

Searching for a single user's tweets: User A has 2942 items, User B has 1928 items, User C has 3499 items. (Using A, B, and C here instead of their real Twitter handles, for anonymity.) First three are mongo, the rest are postgres.
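The single-user queries are shaped roughly like this in both systems (the screen name is a placeholder standing in for the anonymized handles):

```python
# Mongo: filter on the indexed nested field
mongo_filter = {"user.screen_name": "user_a"}  # placeholder handle

# Postgres: equality on the indexed flattened column, parameterized
sql_query = "SELECT * FROM tweet_pgh WHERE user_screen_name = %s"
sql_params = ("user_a",)
```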

Wow! So, wait. The same query that took 53 minutes in Mongo took 33 seconds in postgres. And three seconds against the reduced-size table.
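For scale, those timings work out to roughly:

```python
mongo_s = 53 * 60   # 53 minutes, in seconds
pg_s = 33
pg_small_s = 3

print(round(mongo_s / pg_s))        # ~96x faster in postgres
print(round(mongo_s / pg_small_s))  # ~1060x on the reduced-size table
```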

It's a little unfair; in Mongo you can do a little better by using $gt and $lt on the coordinates instead of doing a $geoWithin - it's easier to compare numbers than coordinates, if you're just doing a box query. So we have the following:
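The two formulations look roughly like this (the box coordinates are illustrative values around Pittsburgh, not the actual query bounds):

```python
# Illustrative bounding box (made-up values, roughly around Pittsburgh)
west, south, east, north = -80.1, 40.36, -79.85, 40.51

# 1. Geospatial operator, hitting the geo index:
geo_within = {
    "coordinates": {
        "$geoWithin": {"$box": [[west, south], [east, north]]}
    }
}

# 2. Plain numeric comparisons on the indexed lng/lat array elements:
gt_lt = {
    "coordinates.coordinates.0": {"$gt": west, "$lt": east},
    "coordinates.coordinates.1": {"$gt": south, "$lt": north},
}
```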