20/09/2016

A Columnar DBMS Benchmark

MonetDB vs. Vertica vs. MariaDB Columnstore

As discussed in
a previous article where PostgreSQL was pit against HPE Vertica, the choice of database is critical to a good BI solution. This article will further dive into columnar databases with a comparison of HPE Vertica, MonetDB and -the still in alpha- MariaDB Column Store (MariaDBCS).

MonetDB was developed at the CWI in Amsterdam and released as open-source software in 2004.

Its architecture is comprised of 3 layers; a query interface front-end, the optimization mid-layer and the MonetDB Assembly Language (MAL) interpreter. It is aimed at maximum efficiency of hardware for large scale processing. MonetDB also has (limited)
support for clustering.

Designed to manage large, fast-growing volumes of data with fast query performance. It offers compression and uses it along with sorting and encoding for ‘cost-effective loads and reads’. Vertica is designed to run in a cluster and as such the results of this test would be vastly different when using even bigger datasets and multiple nodes. For the purpose of this test though, Vertica is set up in a single node format.

The new analytics engine MariaDB ColumnStore is MariaDB’s effort to provide a product which is scalable, reliable and open-source with a hundred percent SQL compatibility. The engine itself runs under the familiar MariaDB interface and as such is backwards-compatible with components like previous JDBC drivers.

Running some tests

Testing is done on a VirtualBox running Ubuntu 14.04 (
Trusty Tahr), with 8GB Memory and 50GB storage. Each database will have full reign of the machine as we’ll have consecutive runs for testing. This dataset is inserted into a data warehouse running on each database using Pentaho Data Integration. It contains 20.520.383 records total.

fact_ratings

dim_movies

dim_links

dim_tags

The DDL statements for each table:

CREATE TABLE dim_movies ( movieId DECIMAL(20)

, title VARCHAR(200)

, genres VARCHAR(120)

)

CREATE TABLE dim_links ( movieId DECIMAL(25)

, imdbId DECIMAL(25)

, tmdbId DECIMAL(25)

)

CREATE TABLE dim_tags ( userId DECIMAL(20)

, movieId DECIMAL(20)

, tag VARCHAR(500)

, "timestamp" VARCHAR(50)

)

CREATE TABLE "fact_ratings"( "fact_id" BIGINT , "userId" DECIMAL

, "movieId" DECIMAL

, "rating" DOUBLE

, "timestamp" VARCHAR(40)

)

Keep in mind that all 3 of these databases will require modification of the create table statements above.

Noticeable when testing MariaDBCS was that it did not agree with some data types and did not allow LOCAL statement when using LOAD DATA.

The DWH writing times for each database:

MonetDB

HPE Vertica

MariaDBCS

3m 23s

3m 49s

4m 13s

Each test will be a simple every-day-use query.

The first test is a simple average query: “SELECT AVG(fact_id) FROM fact_ratings;”

MonetDB

HPE Vertica

MariaDBCS

1.40s

0.39s

28.37s

Result: 10000131 MariaDBCS takes a very long time to calculate and display the average here while the other 2 complete the task under a few seconds.

The second test query is a count of all the ratings:

“
SELECT rating, count(*) from fact_ratings group by rating order by rating;”

MonetDB

HPE Vertica

MariaDBCS

1.22s

0.95ms

29.24s

Result:

rating

count

0.5

239.125

1

680.732

1.5

279.252

2

1.430.997

2.5

883.398

3

4.291.193

3.5

2.200.156

4

5.561.926

4.5

1.534.824

5

2.898.660

The third test is somewhat more complex. The goal is to have the top 20 movies (of the table) and their average ratings. “select dim_movies.title, avg(fact_ratings.rating)

The fourth test builds further on the third one: now we want the top 20 movies, their rating and imdbid; that are larger than the average rating. “select dim_movies.title, fact_ratings.rating, dim_links.imdbid

From these results it's clear that MonetDB and HPE Vertica are the winners of this benchmark. MonetDB is the fastest choice when agglomerating multiple tables together to produce an end result, where as HPE Vertica offers the best results when executing simple queries and calculations. As the newcomer to this type of storage, MariaDB has offered up a valuable candidate which albeit slow in comparison; does offer full SQL.

Again I feel the need to throw in a reminder that if we were using a larger dataset (read: terra/petabytes) with Vertica properly set up, it would dwarf the other two in comparison.

Suggested is that; when choosing a DBMS; look at the dataset and the type of queries that will be used. Considering each database has its pitfalls, which is the best option for you?

To conclude; speed is optimal and it’s clear there isn’t a monopoly on it just yet.