postgres – 8KData Bloghttps://www.8kdata.com/blog
ToroDB, PostgreSQL. All NoSQL & SQLThu, 01 Dec 2016 15:09:36 +0000en-UShourly1https://wordpress.org/?v=5.0.3https://www.8kdata.com/blog/wp-content/uploads/2016/04/cropped-logo-8kdata-blog-ico-32x32.pngpostgres – 8KData Bloghttps://www.8kdata.com/blog
3232Announcing ToroDB Stampede 1.0 betahttps://www.8kdata.com/blog/announcing-torodb-stampede-1-0-beta/
https://www.8kdata.com/blog/announcing-torodb-stampede-1-0-beta/#commentsWed, 30 Nov 2016 16:06:15 +0000https://www.8kdata.com/blog/?p=192A few days ago we published a blog post, “The conundrum of BI/aggregate queries on MongoDB”, where we analyzed and measured some performance issues that happen on MongoDB with aggregate/OLAP/DW type of queries. We also showed that if we would transform the JSON data into a relational form and query it with SQL on a PostgreSQL database, performance can be up to orders of magnitude better. Impressive!

However, this requires a significant effort. The DDL needs to be defined –and this may be non-trivial if the origin data is of high variety. Also the data needs to be migrated, and while there are many ETL tools for that, it is still an involved process. And won’t happen in real-time! What if I require real-time statistics? What if my origin data adds a new property that is not reflected in the DDL? Do I need to sacrifice the “schema-less“ness of MongoDB for being able to perform analytic queries?

Of course not! At least, starting today. I’m very pleased to announce ToroDB Stampede, the first member of ToroDB’s family and 8Kdata’s first commercial product.

With ToroDB Stampede you will see how your MongoDB collections are transformed, in real time, to a relational structure in PostgreSQL. From there, you can run your native SQL queries on the data and use your favorite Business Intelligence tools, without requiring any ETL or change in your current database infrastructure.

So how does it work? It’s very simple:

Stampede works as a secondary (hidden) node on a MongoDB replica set.

Once started it will perform a full initial database sync and then will switch to streaming replication mode.

All the incoming data is transformed on-the-fly from a document shape (strictly speaking, BSON) into a set of relational tables. Tables will have the names of your document properties, arrays and nested documents will be transformed into relations, and columns named after the corresponding keys.

You don’t need to provide any DDL. All the DDL is automagically created by Stampede. Even if new keys or embedded documents appear, new columns and/or tables will be automatically and transparently created.

And this is where the fun begins. Now you have all your MongoDB data in perfectly shaped tables in a PostgreSQL database! Visualization and data exploration are greatly improved, and, more importantly, SQL querying, native SQL querying, is at your hand! Use it to connect to your favorite BI tools. Use it to migrate off of MongoDB to PostgreSQL. Use it to have a SQL replica. Unleash your unstructured data, into a relational database! See the example below to understand how ToroDB generates the tables and columns out of JSON documents, and check the documentation for more information.

Surely enough, performance matters. Does ToroDB Stampede deliver on the promise of 10-100x faster queries? There’s only one way to find it out. Benchmark time! The following benchmarks used one or more (when MongoDB was used in a sharded configuration) AWS i2.xlarge instances (4 vCPUs, 30GB RAM, 800Gb local SSD). We used a XFS filesystem and basic tunning was done on both MongoDB and PostgreSQL configuration. For each dataset, we manually created 6 different queries, that try to extract business value out of the information. MongoDB queries were done via the Aggregation Framework and Stampede ones with regular (Postgres) SQL. MongoDB 3.2 with WiredTiger (compression enabled, the default) and PostgreSQL 9.6 were used. All the tests were run 5 times, using the first two to warm up the caches and the numbers show the average of the last three runs.

Based on the Github Archive we performed an initial benchmark over a 500Gb dataset. We run 6 different queries (named A through F) which you may check here: MongoDB (A, B, C, D, E, F) and Stampede/PostgreSQL (A, B, C, D, E, F).

Up to 57x faster! All queries are significantly faster than MongoDB, and only one (A) is slightly slower compared to a 3-node MongoDB cluster. Trying with a smaller dataset reveals even bigger differences. This is likely due to a much better buffer management in PostgreSQL:

Woah! 267x faster! Query D takes 2,400 seconds on a single MongoDB node (about 20 minutes), 383 seconds on a three-node MongoDB shard (better, but still more than 6 minutes) and just 9 seconds on a single node PostgreSQL.

Here both MongoDB and Stampede had an index on both the _id and actor.login fields. Stampede will automatically replicate any index created in MongoDB (no action required on your side). We also wanted to try whether indexes were being used and what impact they had on the performance:

From the results we can conclude that: a) PostgreSQL results are almost the same, which is consistent with the assumption that indexes are usually not required for aggregate queries; b) MongoDB worsened the results for query A without indexes… but significantly improved query time for query D, when the index is removed! This may probably an issue with the query planner.

We also benchmarked another data set, based on the flights stats information from Transtats. Similar aggregate queries were written. Data size is smaller (50Gb) which leads to smaller differences:

Still, results are consistently faster even when pitched against the three-node MongoDB sharded cluster. And up to 11x faster queries, which is a very significant improvement! While developing Stampede we have performed benchmarks where we have observed more than 2000x faster queries. Of course, this may be a degraded case for MongoDB and surely Stampede does not perform always as well on every single circumstance.

So the recommendation is always the same: please don’t trust our numbers. Do your own. Benchmark Stampede, and please let us know the results.

If you need more information or you just simply would like to give us your opinion, please feel free to comment below or join the discussion on Hacker News! Thank you.

]]>https://www.8kdata.com/blog/announcing-torodb-stampede-1-0-beta/feed/2The conundrum of BI/aggregate queries on MongoDBhttps://www.8kdata.com/blog/the-conundrum-of-bi-aggregate-queries-on-mongodb/
https://www.8kdata.com/blog/the-conundrum-of-bi-aggregate-queries-on-mongodb/#respondMon, 28 Nov 2016 18:19:50 +0000https://www.8kdata.com/blog/?p=164MongoDB is a very popular NoSQL database. MongoDB indeed popularized the concept of the “document”, a hierarchical set of key-value pairs, with a non-fixed structure (schema). Having flexible schemas and the 1:1 mapping between the business logic and the database objects are, arguably, key to MongoDB’s popularity. The reasons why MongoDB is used as the source of truth for many OLTP applications.

Surprisingly to some, dynamic schema does not avoid doing data design, nor it prevents from performance implications of that design:

When designing data models, always consider the application usage of the data (i.e. queries, updates, and processing of the data) as well as the inherent structure of the data itself.

But what if queries cannot be known ahead of time? What if different set of queries have almost orthogonal design needs? After all, the relational theory and data normalization are precisely aimed at solving this problem: a data representation that is completely independent of the queries.

Queries that need to retrieve a document or part of it, typically indexed, perform very well. But what are the consequences of performing queries on MongoDB for which data was not modeled after? Indeed, how do aggregate queries –which represent a completely different query pattern of the typical OLTP gimme-this-single-document query– perform? How does it perform when used with Business Intelligence (BI) applications, where queries are unknown and left for the end users?

Let’s reason intuitively first. What is NoSQL’s “schema-less” concept? It means that any document, with any given structure (there’s always a structure) is stored into the system as-is. One after the other. So if there are mixed information, different data belonging to different “parts” of the business logic, some documents with some properties, others without, what does it look like? Chaos!

Aggregate queries typically require scanning many documents, looking for specific properties of the data, often without the help of an index. Since NoSQL stores document with any structure and data, how does the database know which documents belong to the query criteria? It doesn’t! It has to scan the whole dataset (normally, the whole collection) to answer the query.

This scan is not cheap. Below is depicted how it works on MongoDB. First, decompression may need to be performed. Then, for each document, theBSON document needs to be traversedand evaluated with the query criteria. Worst case, every single key-pair of every document needs to be evaluated.

Let’s use the Github Archive dataset as an example, and illustrate a query to return the top 10 most active actors. This is how MongoDB would internally process all the data in the collection to find the query results:

On comparison, a RDBMS with a normalized data design will typically target a subset of the tables, and only the columns involved will be evaluated. So much less I/O, which is normally the bottleneck of a database, will be required to answer the same query. In this case, only the table github_actor is required to answer the query:The query on aPostgreSQL relational database would be:

SELECT count(*), login
FROM github_actor
GROUP BY login
ORDER BY 1 DESC
FETCH FIRST 10 ROWS ONLY;

Using a 1,4Gb sample of the Github Archive dataset, the query in MongoDB requires exactly the same I/O as the collection’s reported storage size (536,37Mb). This was expected, the whole collection is scanned. PostgreSQL scans just a 6th of the data to answer the query. Taking this further, if we would use Citus Data’scstore (a columnar storageextension for PostgreSQL), the results are even more surprising: answering the query requires just 1/100th of the IO that MongoDB requires:

So the performance of aggregate queries on NoSQL, more precisely MongoDB, is not as good as expected. What is indeed the reality? We measured the execution time of the above queries comparing MongoDB and PostgreSQL on a 1Gb, 10Gb and 100Gb Github Archive sample of the dataset: (measuring query execution time; less is better)

Without zoom is hard to appreciate PostgreSQL’s results:

Similar aggregate queries deliver the same results. At 8Kdata, while developing ToroDB, we have performed dozens of similar queries. The results are consistently clear: a relational design almost always outperforms MongoDB aggregate queries by several factors, often orders of magnitude. Although we have seen even more pathological cases (i.e., even faster), it can safely be said that it is “100x faster”.

So one solution for the BI/aggregate queries on MongoDB is to bring them to SQL! Indeed, Stripe did that. There are ETLs too. However, all of these solutions require complex processes and designing the SQL schema of the destination data on your RDBMS. There’s no easy way to bring MongoDB to relational.

Hope is not lost, though. In a few days, 8Kdata will be publicly announcing a solution to this problem. If you are interested, keep watching this blog, subscribe to our newsletter and follow us on Twitter. Stay tuned!