Every once in a while, a vendor war ignites over the question of performance. Due to the fickle and subjective nature of benchmarks, it’s quite possible for one vendor to publish results showing their product is by far the fastest, while their competitor will use the same benchmark to prove exactly the opposite. Is speed in the eye of the beholder?

Redshift vs. BigQuery: Concerns Shift to Performance

Amazon Redshift is a popular cloud-based data warehouse, and Google’s BigQuery is quickly catching up as an alternative. Both products are acclaimed for their ability to process big data at lightning speed.

In the early days of cloud-based data warehouses, data logistics were a prime concern, given the learning curve and the sheer time needed to move big data volumes. Today it is becoming much easier to pipe data into cloud data warehouses, with vendors like Xplenty and Alooma promising to move data from sources like Postgres or MySQL to Redshift or BigQuery in minutes, with tight control over errors and duplicates.

So the concern is shifting to performance. How fast can each data warehouse crunch the numbers?

Amazon’s Benchmark: Redshift 6x Faster, With a Catch

Benchmark results published by Amazon in October 2016 show that across 96 queries in the standard TPC-DS benchmark, Redshift is overwhelmingly faster than Google’s BigQuery: the average query ran in 29.75 seconds on Redshift vs. 181.81 seconds on BigQuery.
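Those two averages alone pin down the headline multiple. A quick sanity check, using nothing but the figures quoted above:

```python
# Average TPC-DS query times quoted in Amazon's October 2016 results.
redshift_avg_s = 29.75
bigquery_avg_s = 181.81

# The headline speedup is simply the ratio of the two averages.
speedup = bigquery_avg_s / redshift_avg_s
print(f"BigQuery avg / Redshift avg = {speedup:.1f}x")  # prints 6.1x
```

So "6x faster" checks out arithmetically, as far as the averages go. The catch, as we'll see, is what it took to get there.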

In the small print, Amazon mentioned they ran the test on an 8-node DC1.8XL Redshift cluster.

As one pundit on Hacker News pointed out:

8-node DC1.8XL Amazon Redshift cluster — well, yeah. That's $28,108.80 a month if you're running queries on demand and don't want a delay/coordination in Amazon instance creation/destruction. BigQuery may or may not be as fast, but it's truly a managed service; I give it our data and it just works. I don't have to worry about instances, boot up time, maintenance, hourly costs, etc.
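The pundit’s monthly figure is easy to reproduce. A quick sketch, assuming the historical on-demand rate of $4.80/hour per dc1.8xlarge node and the common 732-hour billing month (8,760 hours a year divided by 12); both numbers are our assumptions, but they land exactly on the quoted total:

```python
# Assumed historical on-demand price for one dc1.8xlarge node (USD/hour).
node_rate_usd = 4.80
nodes = 8
hours_per_month = 732  # 8,760 hours/year divided by 12 months

monthly_cost = node_rate_usd * nodes * hours_per_month
print(f"${monthly_cost:,.2f}/month")  # prints $28,108.80/month
```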

Interestingly, Google had presented their own test results at the high-profile Horizon event one month earlier, based on the smaller TPC-H benchmark, in which BigQuery was the clear winner (the results were not released to the public). Amazon’s critique of this test was that it focused on only one of the 22 queries in the TPC-H suite.

Is Amazon Redshift really fastest, or is that true only when run on a super-expensive, complex cluster configuration? Is BigQuery the fastest, or does that only hold in very specific query scenarios?

An Independent Benchmark: BigQuery 48x Faster

To get at the truth we should probably look beyond vendor benchmarks. Mark Litwintschik, an independent technologist, ran a number of queries on a dataset of 1.1 billion taxi rides (104GB in size) on both Redshift and BigQuery.

Take into account that this dataset is about 100 times smaller than Amazon’s: roughly 100GB vs. 10TB. The queries are also much simpler, SELECT and GROUP BY without the complex joins that are a feature of the TPC-DS benchmark Amazon selected.

Litwintschik’s results: Amazon Redshift took 48 seconds on the simplest query and one minute, 57 seconds on the most complex one (a SELECT over the entire data set with GROUP BY and ORDER BY).

On BigQuery, he says he was “absolutely blown away” that the same queries executed in one to three seconds. How much faster is that? You do the math.
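The "you do the math" invitation is easy to take up. A quick sketch using the figures above (117 seconds for Redshift’s slowest query, one to three seconds for BigQuery):

```python
# Figures from Litwintschik's results quoted above.
redshift_worst_s = 60 + 57      # slowest Redshift query: 1 minute, 57 seconds
bigquery_range_s = (1.0, 3.0)   # BigQuery ran the same queries in 1-3 seconds

for bq_s in bigquery_range_s:
    print(f"{redshift_worst_s / bq_s:.0f}x faster")
# prints 117x faster, then 39x faster
```

Depending on which end of the BigQuery range you take, that is anywhere from 39x to 117x, so the 48x in the heading falls squarely within that spread.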

So... TPC-DS, a standard industry benchmark, running on a big cluster of Amazon Redshift machines, churned through very complex queries on 10TB of data in about half a minute on average (Query 22 being a notable exception). Litwintschik’s simplified DIY benchmark takes over a minute for a much smaller data set, albeit on roughly one-fourth of the compute resources.

Confused? We sure are. There are additional benchmarks out there, but none seems to provide a clear answer as to which data warehouse is faster, or even how to define the question more precisely.

Wanted: A Better Methodology

Because of the very different nature of the two data warehouses (and other offerings in the market), there is a need for a methodology that compares data processing using standard queries on a dollar-per-throughput basis.

It’s all well and good if we can get blazing performance on Redshift for $28 big ones per month, while BigQuery takes a few seconds more and costs (for example) a mere $0.27.
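What might such a dollar-per-throughput comparison look like? A minimal sketch, prorating an always-on cluster’s monthly cost over a single query’s runtime. The helper name, the 732-hour billing month, and the reading of $0.27 as a per-query charge are all our illustrative assumptions, not a real pricing model:

```python
def cost_per_query(monthly_cluster_cost, runtime_s, hours_per_month=732):
    """Cost of one query on an always-on cluster, prorated by runtime."""
    cost_per_second = monthly_cluster_cost / (hours_per_month * 3600)
    return cost_per_second * runtime_s

# Redshift: the $28,108.80/month cluster running the 29.75 s average query.
redshift_cost = cost_per_query(28_108.80, 29.75)

# BigQuery: per-query billing, e.g. the $0.27 figure mentioned above.
bigquery_cost = 0.27

print(f"Redshift ${redshift_cost:.3f}/query vs. BigQuery ${bigquery_cost:.2f}/query")
```

On these illustrative numbers the two come out in the same ballpark per query, which is exactly the kind of apples-to-apples framing the raw speed benchmarks miss.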

Devising such a methodology is beyond the scope of this article. But if one is in the works, please enlighten us in the comments!