Meta

Author: DavidGruzman

During my work in BigDataCraft.com I saw repeating problem our customers face. The problem is how to get efficient SQL on big data in the cloud.

Lets see a typical case.

First case – daily logs of some nature arrived and stored in the S3. There is a need to do a few dozens reports over new data each day. Daily data size is from dozens to hundreds of gigabytes. There is also need to do ad-hoc reports on the older data.

What are our options?

to run Hive over S3. It will work, but slow. Transer from S3 will be big part of execution time. Hive itself is slow. Data will be transferred between S3 and EC2 instances for each query. We can do things a bit better by manually scripting to keep recent data in the local HDFS.

to run efficient engine – Cloudera Impala. Problem that it does not work with S3 directly. To go this way we will need to script moving data to local HDFS, and cleaning it up when it outgrow the local space. In other words – we will need to treat local HDFS as a cache manually.

To use Redshift. It is good option, but will also require managing moving data from s3. it is also proprietary and will lock us in the AWS. And it is expensive if we do not commit for a several years reservations.

Second case is extension of first one. We have some data accumulated in the S3. And our data science team wants to slice and dice some data to find some interesting XYZ. What should they do? Data scientists usually not so IT savvy to build and manage own hadoop cluster. Sometimes they just want to get SQLs on subset of data with interactive speeds.

Cloud sounds like a perfect platform for the big data processing – you get as much processing power when you need it and release when you don’t. But why does a lot of big data processing happen outside of cloud? Lets try to find out:

The question came from following dilemma in big data processing in cloud :
Store data in S3 and process in EC2. It is elastic and economical per GB, you can resize you cluster as you wish, but you are limited by S3 bandwidth. EMR against S3 is popular example of this approach.

Or, you can also build HDFS or other distributed storage on top of local (ephemeral) drives. There appears to be a clear tradeoff: good bandwidth is available, but storage is going to be expensive and elasticity will suffer, because you can not remove nodes when their processing power is not needed. Redshift or hadoop with HDFS on local drives are the perfect examples of this approach.

Both solutions have drawbacks. Lets take a closer look.
Cloud storage, like s3, is built to store a lot of data in cheap and reliable way. Circa $30 per TB per month. It is also highly reliable: SLA with a lot of nines…
Local drives should be fast. Today it means SSD. This technology provides very good performance but price per GB is high.

The cost of HDD space is 20-25 times lower than on SSD. In Amazon cloud difference in cost of local drive space vs s3 space is even higher. For example 1TB of storage on c3.8xlarge instances costs aroud $2K per month. It is x60 (sixty times!!!) more expensive than to store data in s3.

What about throughput? The difference between access to local data and data on S3 is around 5 times. Moreover, bandwidth to S3 can be throttled by amazon, depending on the current load and other factors. As opposed to always reliable access to local drives.

There is possible counter-argument, that we do not need this storage bandwidth. Assuming that we process data in a speed matching the storage bandwidth – we do not need more of it. S3 can give us 50-100 MB/sec of data for big instance, like c3.8xlarge. If we process data using MapReduce or Hive – it is close to processing speed assuming MR processing to be about 5MB/Sec per core.
In case of more efficient engines – like Redshift or impala – the speed is about 100MB/sec per core or more…
So, we need faster access. To prove this point, you can pay attention that RedShift nodes has 2.4 GB/sec of disk IO. I can trust that AWS engineers know what they are doing.

Now lets recall that usually big data is a huge pile of cheap data. By cheap I mean – low value per GB. Should data be expensive (like financial transactions) it could happily live in Oracle + enterprise storage.

So, how do we utilize our resources more efficiently? On one hand we have a lot of slow and inexpensive storage, and on the other a bit of fast and expensive. The obvious solution is to use fast storage as cache. These days it is rather common: DRAM memory holds disk cache, SRAM memory inside CPU used as cache for DRAM.

In the above situation I suggest to use local SSD drives as a cache for cloud storage (s3).
So, what does it mean? Effective cache should meet the following requirements:
Store hot data set. It is main duty of the cache. Usual heuristics is LRU – last recently used. We assume that data recently used has good chance to be used again.

Prefetch: Predict what data will be needed and load it ahead of time. In case of disk cache – it’s read ahead (if we read first block of the file there’s a good chance we will need the next). In case of CPU – there are very advanced algorithms for pre-fetch. In case of usual data warehouse we can assume that recently added data has better chance to be of interest than old one…

To summarize: I believe that to be able to efficiently process big data in the cloud we need to use local drives as a cache of the dataset stored in the cloud storage. I also believe that other ways will not be efficient, as long as cloud storage is much slower and cheaper than local drives.

There is a lot of criticism of HDFS – it is slow, it has SPOF, it is read only, etc. All of the above is true. Systems built on top of a local file system are more efficient than those built on top of HDFS (like Cassandra vs. HBase). That is also true.
However, HDFS has brought us a new opportunity. We have a platform to share data among different database engines. Let me elaborate.
Today there is a number of database engines built on top of HDFS. Such as Hive, Pig, Impala, Presto, etc. Many Hadoop users develop their own partially generic map-reduce jobs which could be called specialized database engines.
Before we had such a common platform, we had to select database engine for our analytic and ETL needs. It was a tough choice and it was lame. Selected engine owns data by storing it in its own format. We were forced to write some custom data operations not in the language of our choice, but
in database’s internal language – like PL-SQL, Transact-SQL, etc… When our processing became less relational, we wrote a lot of logic in database internal language and had to pay database licence in order to run our own code…
The first spring bird of change was Hive’s external tables. At first glance, it was a minor feature – instead of letting Hive select the place and the format for the data, we got the chance to do this job for it. Conceptually, it was a big change – we are beginning to own our data. Now it can be produced by our own MR jobs, it can be processed in any way we like aside of Hive. And at the same time we can use Hive to query these data.
Then came Impala, and instead of defining its own metadata, it reuses the data from Hive and thus inherits those wonderful external tables. So we have two engines capable of processing the same data. As far as I remember, for the first time.
Why is it so important? Because we are no longer restricted by the given engine problems and bugs. Let’s say Impala is perfect at running 5 out of our 6 queries, but fails at the last one. What do we do? Right, we can run 5 queries using Impala and the last one using Hive, which is slower, but is a much less demanding creature than Impala.
Then came Spark with its real time capabilities, Presto from the Facebook, and, I am sure, many more are to come. How to select the right one?
So, in my opinion, we should stop selecting, but rather combine them. We should shift to single data – multiple engines paradigm. Thanks to the gods of the free open source software it does not involve licensing costs.
What makes it possible: common storage, open data formats and shared metadata. HDFS is a storage, data formats are CSV, Avro, Parquet, ORC, etc., and shared metadata is Hive metastore and HCatalog as its successor.
We do not have to do a revolution, but I think we should prefer engines running on top of common storage (like HDFS) data formats, which are understandable to many engines and database engines that do their best at data processing and give us freedom of data placement and formats.