This is a short collection of the material that helped me prepare for the Google Cloud Data Engineer exam. There are many courses and resources for this; the following links helped me in my preparation.

On Coursera there are several courses provided by Google to prepare for this exam, including a Specialization for the Google Cloud Data Engineer exam.

Google also offers a practice exam for the Google Cloud Data Engineer certification.

Now for a deeper look at the topics I struggled with.

BigQuery

Streaming

When you use streaming to insert data into BigQuery, it is first saved in the streaming buffer and then loaded into the persistent table from there. It can take up to 90 minutes until the data is persistent. However, the buffer and the persistent table are queried together, so your query results include streamed rows immediately.

During streaming inserts each row gets an insertId, which BigQuery uses to detect and remove duplicates on a best-effort basis, if they occur within a time window of one minute.
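This best-effort deduplication can be illustrated with a small local simulation. This is a toy model of the semantics only, not the BigQuery API; the class and field names are made up:

```python
import time

class StreamingBuffer:
    """Toy model of BigQuery's best-effort deduplication:
    a row whose insertId was already seen within the last
    60 seconds is dropped instead of inserted."""

    DEDUP_WINDOW_S = 60

    def __init__(self):
        self.rows = []
        self._seen = {}  # insert_id -> arrival time in seconds

    def insert(self, insert_id, row, now=None):
        now = time.time() if now is None else now
        last = self._seen.get(insert_id)
        if last is not None and now - last < self.DEDUP_WINDOW_S:
            return False  # duplicate within the window, dropped
        self._seen[insert_id] = now
        self.rows.append(row)
        return True

buf = StreamingBuffer()
buf.insert("a", {"v": 1}, now=0)    # accepted
buf.insert("a", {"v": 1}, now=30)   # dropped: same insertId within 60 s
buf.insert("a", {"v": 1}, now=120)  # accepted: the window has passed
```

Note that, as in the real service, deduplication is only attempted within the window; it is not a transactional guarantee.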

If this best-effort deduplication is not reliable enough for your use case, Google Cloud Datastore is an option, since it supports transactions.

You can also stream into partitioned tables by adding a partition decorator $YYYYMMDD to the table name. Streamed data is first written to the UNPARTITIONED partition; to find these rows, filter on the pseudo column _PARTITIONTIME IS NULL.
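Building such a partition target is just string formatting; a minimal sketch (the dataset and table names are illustrative):

```python
from datetime import date

def partition_decorator(table, day):
    """Append the $YYYYMMDD partition decorator to a table name,
    selecting a single day partition for streaming or load jobs."""
    return f"{table}${day:%Y%m%d}"

target = partition_decorator("mydataset.events", date(2019, 3, 1))
# target is "mydataset.events$20190301"
```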

Typical use cases include

high volume input, but not transactional data

analysis on aggregated data and not for single row queries

Restrictions

max row size: 1 MB

HTTP limit: 10 MB

max 10,000 rows / second per project

max 10,000 rows / request; 500 rows recommended

max 100 MB / s per table for streaming inserts

API limit of 100 queries / s per user

max 300 concurrent users

max 500 tablesets / s

Partitioned tables

Partitioned tables are separated into smaller units to optimize query performance. There are several ways to partition tables in BigQuery:

partitioned by loading time (automatically)

partitioned by field with the data type TIMESTAMP or DATE

You can correct the loading time by updating the suffix or the pseudo columns. Ingestion-time partitioned tables contain a system column _PARTITIONTIME that you can use to filter. If you use explicit, column-based partitioning instead of automatic partitioning, the _PARTITIONTIME column does not exist.

There are two special partitions for data that does not belong to a regular partition:

__NULL__ for rows with a NULL value in the partitioning field

__UNPARTITIONED__ for rows with a date outside of the valid time range

Another way to optimize tables is sharding, where you split data across multiple tables instead of using a partitioning column. Join these tables using a UNION. In this case, keep in mind that there is a limit of 1,000 tables per query.
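A sharded query of this kind can be assembled programmatically. The sketch below (with made-up table names) builds a UNION ALL over a list of shards and enforces the 1,000-table limit:

```python
def union_shards(columns, shards):
    """Build one query over manually sharded tables.
    BigQuery allows at most 1,000 tables per query."""
    if len(shards) > 1000:
        raise ValueError("BigQuery allows at most 1,000 tables per query")
    selects = [f"SELECT {', '.join(columns)} FROM `{t}`" for t in shards]
    return "\nUNION ALL\n".join(selects)

q = union_shards(["user_id", "amount"],
                 ["sales_2017", "sales_2018", "sales_2019"])
```

With wildcard tables and _TABLE_SUFFIX, modern Standard SQL can express the same thing more compactly; the sketch just shows the mechanics.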

Access rights

viewer: right to read all tables in a dataset

editor: viewer plus creating, deleting and changing of tables in a dataset

owner: editor plus creating and deleting datasets

admin: everything

Restrictions

There are some restrictions you need to keep in mind, when using BigQuery:

max 100 concurrent interactive queries possible

max 4 concurrent interactive queries on external sources

max 6 UDFs in a single interactive query

max 1,000 updates per table per day

max query duration is 6 hours

max 1,000 tables per query

SQL command can be max 1 MB in size

max 10,000 query parameters

max 100 MB / row

max 1,000 columns per table / query / view

max 16 levels of nested fields

Best Practices

ETL from relational database to BigQuery

There are several ways to import data from a relational database into BigQuery. One is the Web UI, where the schema can be uploaded as a JSON file. You can export the data to Google Cloud Storage in JSON format and import it from there.

Another way is using Google Cloud Dataflow. An automatic setup is possible, but you can also customize the process, for example joining data already in Dataflow to denormalize it, using lookup tables as side input.

Backup / Snapshots

BigQuery supports “point-in-time snapshots”. Using a table decorator @<time> in Legacy SQL, you get an image of the table at the specified timestamp. The decorator can contain relative or absolute values.
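Constructing such a decorator can be sketched as follows. The @-milliseconds format is Legacy SQL's; the helper name and the handling of absolute timestamps are my own:

```python
from datetime import datetime, timedelta, timezone

def snapshot_decorator(table, at):
    """Legacy SQL point-in-time table decorator.
    at: timedelta -> relative offset in milliseconds ("this long ago"),
        naive UTC datetime -> absolute milliseconds since the epoch."""
    if isinstance(at, timedelta):
        return f"[{table}@-{int(at.total_seconds() * 1000)}]"
    ms = int(at.replace(tzinfo=timezone.utc).timestamp() * 1000)
    return f"[{table}@{ms}]"

one_hour_ago = snapshot_decorator("mydataset.mytable", timedelta(hours=1))
# one_hour_ago is "[mydataset.mytable@-3600000]"
```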

BigQuery ML

BigQuery supports machine learning in SQL for structured data. This is easy to learn, as it is based on SQL rather than another programming language. Right now it provides linear regression, binary logistic regression for classification, multi-class logistic regression and k-means clustering. The advantage is that you only need one tool, SQL, for everything, so data does not need to be moved from one place to another.

UDF

BigQuery supports user-defined functions (UDFs) written in SQL or JavaScript. Load external libraries by storing them in GCS.

Denormalize data

BigQuery does not need normalized data for fast query speed, but rather denormalized data. This means you should use nested and repeated fields in tables. Disk usage is not an issue here and denormalized data saves you the need to join.

Nested and repeated fields can be loaded from formats like Avro, Parquet, ORC, JSON, and Datastore or Firestore exports.

BigQuery Transfer Service

There is an option to repeatedly load data into BigQuery from Campaign Manager, GCS, S3, Ad Manager, Google Ads, Google Play and YouTube. You can also load Stackdriver logs into BigQuery.

BigTable

BigTable is Google's solution for a real-time key/value datastore. It handles sparse tables, meaning tables where many column values in a row are null. Empty columns do not use any space. The system is designed for billions of rows, thousands of columns and data in the TB or PB range.

BigTable has only one key per row, called the rowkey. It has low latency and is HBase compatible. It also scales linearly with the number of nodes in a cluster. There is an option for replication across two clusters, which sets one of them up as a cold standby. It is possible to resize clusters while the system is running, but afterwards BigTable takes a few minutes to rebalance the cluster load.

To make the most out of it, its use case is for high scalability, key / value data and streaming.

BigTable groups columns that belong together into column families. To make the most of it, distribute key values evenly across all nodes. Data is not stored on the cluster nodes but on Colossus; the nodes only hold references to the data. This makes it easy to recover data after a system crash, and no data is lost if nodes fail.

The master node distributes data across nodes to optimize performance. Data locality is important and can be influenced by key design: group data that is queried together by designing the key accordingly. But BigTable does not support SQL or joins and is only really recommended for more than 1 TB of data.

Performance

SSD

reads: 10,000 rows/s at 6 ms latency

writes: 10,000 rows/s at 6 ms latency

HDD

reads: 5,000 rows/s at 200 ms latency

writes: 10,000 rows/s at 50 ms latency

Reasons for underperformance

incorrect schema design: the key does not distribute data evenly (only one key per row is possible)

columns are sorted by family and then alphabetically

reads / writes are skewed

all operations need to be atomic on a row

information is distributed over several rows

row is too big (recommended max 10 MB per cell and 100 MB per row)

the rowkey begins with a timestamp

combined keys are not used

key consists of domains, hashes or sequential ids

too little data (< 300 GB)

test was too short (needs to run at least 10 minutes)

cluster too small (CPU > 70 %; storage > 70 %)

HDD is used

instance is still a development instance

data is distributed automatically according to the usage pattern; give the system time to adapt to your needs

Key Visualizer

The Key Visualizer is a tool for supporting you in finding BigTable usage patterns. It provides graphic reports to help you recognize performance problems, like spotting hotspots (rows with a lot of actions) or rows with too much data. It helps you in getting evenly distributed accesses over all rows.

The reports are generated automatically (daily / hourly) if a table has more than 300 GB or at least 10,000 row reads or writes per second. The reports also recognize hierarchical keys and present them as a hierarchy. Related keys are summarized.

Scale BigTable

You can scale BigTable programmatically. This is helpful if you, for example, get monitoring data from Stackdriver and then react to this data by scaling up or down. After scaling a cluster, it can take up to 20 minutes until you see better performance, as data needs to be redistributed.

This automatic process does not help in cases where performance peaks for a short period of time. In this case, you need to scale manually before the event. If scaling does not help performance, then probably the key design leads to imbalanced load distribution.
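A minimal sketch of such a scaling decision, assuming a CPU utilization metric read from Stackdriver. The thresholds, limits and function name are all made up for illustration:

```python
def recommend_nodes(current_nodes, cpu_utilization,
                    high=0.7, low=0.3, min_nodes=3, max_nodes=30):
    """Toy scaling policy: double the node count under heavy CPU load,
    shrink slowly when the cluster is mostly idle. Changes only pay off
    after Bigtable has rebalanced (up to ~20 minutes)."""
    if cpu_utilization > high:
        return min(current_nodes * 2, max_nodes)
    if cpu_utilization < low:
        return max(current_nodes - 1, min_nodes)
    return current_nodes

recommend_nodes(4, 0.85)  # scale up to 8 nodes
recommend_nodes(4, 0.10)  # scale down to 3 nodes
```

As the text notes, a policy like this cannot react to short performance peaks; for those you would scale manually ahead of the event.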

You can scale a running cluster. It is possible to change the following things:

nodes in cluster

clusters per instance

usage profiles

label / names

upgrade from development to production instance

Schema design

Schema design in BigTable is crucial for performance. There is only one index per row, called the rowkey. All rows are automatically sorted by the rowkey in lexicographic order. And as all operations are atomic per row, rows should not depend on one another.

Use key design to make sure data queried together lies close together. And keep in mind that one cell should not be bigger than 10 MB and one row no larger than 100 MB; the hard limits are 100 MB per cell and 256 MB per row.

Up to 100 column families are possible. If all related values are in one family, it is possible to query only this family. Fewer columns are better, and column names should be short.

Key design

There are several best practices for key design:

use inverted domain names, like com.domain.shop

string ids should not contain hashes, for the sake of readability

integrate a timestamp if time is queried often, but never only the timestamp

use an inverted timestamp if the newest data is queried more often

always put the grouping field first in the key

try to avoid many updates on a single row; write several rows instead
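The last few rules can be illustrated with a small sketch of combined-key construction. The separator, the padding width and the names are assumptions, not a Bigtable convention:

```python
def build_rowkey(group_id, event_time_ms, newest_first=True):
    """Combined rowkey: the grouping field comes first, followed by a
    (possibly inverted) zero-padded timestamp, so that rows of one group
    lie next to each other and the newest event sorts first."""
    max_ms = 10**13  # any constant larger than all expected timestamps
    ts = max_ms - event_time_ms if newest_first else event_time_ms
    return f"{group_id}#{ts:013d}"

k_old = build_rowkey("user42", 1_700_000_000_000)
k_new = build_rowkey("user42", 1_700_000_100_000)
# lexicographic order: the newer event sorts before the older one
assert k_new < k_old
```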

Dataflow

Dataflow is a system to develop and run data transformation pipelines. It has the following key concepts:

Pipeline: complete process in a DAG

PCollection: distributed dataset for a pipeline

PTransform: a data processing operation that takes a PCollection, performs processing and returns a new PCollection

ParDo: parallel processing; runs a function on each element of a PCollection

SideInputs: additional inputs for ParDo / DoFns

It is a best practice to write invalid input to its own sink and not to throw it away.
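In Beam's Python SDK this dead-letter pattern is typically built with tagged outputs (pvalue.TaggedOutput); the idea can be sketched in plain Python, with made-up record formats:

```python
def parse_or_deadletter(records):
    """ParDo-style transform with two outputs: parsed rows go to the
    main collection, invalid input is kept in a dead-letter collection
    instead of being thrown away, so it can be inspected and replayed."""
    main, dead_letter = [], []
    for raw in records:
        try:
            user, amount = raw.split(",")
            main.append({"user": user, "amount": float(amount)})
        except (ValueError, AttributeError):
            dead_letter.append(raw)
    return main, dead_letter

good, bad = parse_or_deadletter(["alice,3.50", "broken record", None])
# good holds the one parsed row; bad keeps both invalid inputs
```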

PubSub

A messaging queue provided by Google. It keeps data for up to seven days. All logs are automatically pushed to Stackdriver for analysis. It has the following attributes:

messages might not be delivered in the order they were received, because the system is distributed

if message delivery is not acknowledged within a configurable time span, the message is redelivered
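The acknowledgement behaviour can be modelled in a few lines. This is a toy model of the semantics, not the Pub/Sub client API:

```python
class ToySubscription:
    """Minimal model of Pub/Sub's ack deadline: a message that is not
    acknowledged within `ack_deadline` seconds is delivered again."""

    def __init__(self, ack_deadline=10):
        self.ack_deadline = ack_deadline
        self.outstanding = {}  # msg_id -> delivery time in seconds
        self.acked = set()

    def deliver(self, msg_id, now):
        self.outstanding[msg_id] = now

    def ack(self, msg_id):
        self.acked.add(msg_id)
        self.outstanding.pop(msg_id, None)

    def redeliveries(self, now):
        """Messages whose ack deadline has expired."""
        return [m for m, t in self.outstanding.items()
                if now - t >= self.ack_deadline]

sub = ToySubscription(ack_deadline=10)
sub.deliver("m1", now=0)
sub.deliver("m2", now=0)
sub.ack("m1")
sub.redeliveries(now=15)  # only "m2" is due for redelivery
```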

Data Migrations

Data Transfer Appliance

The Data Transfer Appliance is an encrypted offline data transfer for up to PB of data. You receive the appliance, load your data, and ship it back to Google, where it is uploaded into the cloud. This is faster than uploading over the internet, and the appliance can be handled like a NAS. It is only applicable if the data does not change in the meantime.

Storage Transfer Service

Storage Transfer Service is an online transfer service to GCS, or from bucket to bucket. It can be used to back up online data, or to sync from source to sink, including deleting the source once it is copied. If you need to decide between gsutil and Storage Transfer Service:

local data -> gsutil; this also supports rsync

other cloud providers -> Storage Transfer

Dedicated Interconnect

Dedicated Interconnect is a physical connection between on-premise infrastructure and Google Cloud. You provide your own routing equipment. It is available at 10 or 100 Gbit/s.

Dataproc

Dataproc is a managed Hadoop / Spark cluster in Google Cloud. It offers some specialties:

Cloud Storage Connector: Hadoop or Spark jobs access GCS directly

data is not transferred to HDFS

GCS is HDFS compatible, just use gs:// instead of hdfs://

data still available if cluster is gone

high availability is possible

Preemptible workers:

same machine type as the other workers

own instance group

lost instances are re-added once there are free resources

store no data

clusters with only preemptible nodes are not possible

local disk is not part of HDFS; it is only used for local caching

Scale cluster:

scaling is always possible, even with running jobs

Machine Learning

Machine Learning Engine

The Machine Learning Engine provides predefined scale tiers, but there is also a custom tier, where you can set:

masterType: machine type for the master

workerCount: number of workers in cluster

parameterServerCount: number of parameter servers

Data Loss Prevention API

This API finds sensitive data (PII) and can remove it, e.g. personal data, payment data and device identifiers; custom and country-specific info types are possible too. You can define word lists and regexes. It also provides data masking, removal and encryption. It can be used on streams, text, files, BigQuery tables and images.

Vision API

Others

Datastore

Datastore is a NoSQL DB that is ACID compliant and has a SQL-like query language. There are built-in indexes, and composite indexes are possible too. It also provides an export to a GCS bucket (Firestore export).

Datastudio

Reporting solution provided by Google. It provides caching to speed up reports and reduce costs. There are two different kinds of caches:

responsive cache: remembers the results of a query; if the same data is queried again, the cache supplies it

predictive cache: queries are executed in the background and the responses are stored; if the responsive cache has no data, the predictive cache is asked, and if the predictive cache has no data either, the data is queried from the source; works only with owner credentials

The cache is refreshed automatically, but you can also refresh it manually. The default refresh interval is 15 minutes; 4 and 12 hours are also possible. You should turn off the predictive cache if:

data changes frequently and data freshness is more important than speed

you want to reduce costs

After the Google Cloud Data Engineer Exam

After taking the exam, you get a preliminary result on the spot. You can see it after submission or later in your Webassessor account. Once Google has checked the exam, you get an email with a voucher and a link to their perk shop, where you can choose between a backpack and a hoodie. I chose the backpack.

R Project and Production

Running R in production is a controversial topic, as is everything concerning R vs. Python. Lately there have been some additions to the R ecosystem that made me look into this again. Researching R and its usage in production environments, I came across several packages / projects that can be used as a solution:

Plumber

For reasons of ease of use, and because it is not a hosted solution, I took a deeper look at Plumber. It felt quite natural, as it uses function decorators for defining endpoints and parameters. This is similar to Spring Boot, which I normally use for programming REST APIs.
Using Plumber is really straightforward: the #' @get annotation defines the endpoint for a request, in this case /hello, so the full URL on localhost is http://127.0.0.1:8001/hello. To pass in one or more parameters you can use the decorator #' @param parameter_name parameter_description. A more complicated Plumber example is hosted on our Gitlab. It was built with the help of Tidy Textmining.

Production ready?

Plumber comes with Swagger, so the webserver is automatically available. As the R instance is already running, processing the R code does not take long. If your model is complicated, then of course this is reflected in the processing time. But as R is single-threaded, Plumber can only process one request at a time.
There are ways to tweak this, of course. You can run several instances of the service using a Docker image, as described here. There is also the option of using a webserver to fork requests to several instances of R. Depending on the needs of the API, single-threaded processing can be fast enough. If the service has to be highly available, the Docker solution seems like a good choice, as it comes with a load balancer.

Conclusion

After testing Plumber, I am surprised by the ease of use. This package makes deploying a REST API in R really easy. Depending on your business needs, it might even be enough for a production scenario, especially when used in combination with Docker.

Three Systems for Safe Development

When you are building a productive Data Lake it is important to have at least three environments:

Development: for development, where “everything” is allowed.

Staging: for testing changes in a production like environment.

Production: Running your tested and productive data applications

With these different environments comes the need to keep data structures in sync. You need a way to change each system in a controlled manner when it is time to deploy changes. How data structures are managed differs depending on where in the data processing they occur; we split these differences into the parts “Data Entry to Staging” and “Core and Presentation”.

Managing different data structures in the three systems of your Productive Data Lake

As all three systems can have different data structures, using tools to make this easier is important. Different Apache Avro™ schema versions are managed by publishing them to a tool called Apache Avro™ Schema Service, of which there are also three instances. By publishing changes first on the development, then the staging, then the production system, we make sure only tested data structures are deployed to the production Data Lake. Each change to a schema has a version number, so rolling back to an earlier state in case of an error is possible.
For evolving data structures not contained in Apache Avro™ schemas, we use tables managing all meta information.

Data Structures from Entry to Staging

Apache Avro™ is a good way to manage evolving data structures, as I explained here. Once the data is available in Apache Hive™, mapping the Avro data types to those of your production Data Lake system is necessary. We use Apache HAWQ™ as the base of our Data Lake, so we map the Avro data types to HAWQ data types. To keep this as effortless as possible, we reuse attribute names as often as possible. If an attribute like “user_id” occurs in different schemas with the same meaning, we just reuse it. So the mapping table does not contain an entry for each column of each schema, but only for distinct ones. Considering more than 100 schemas with on average 10 attributes each, this helps keep maintenance down. The table below shows an example of this mapping.

Schema Fieldname | AVRO datatype | HAWQ datatype
user_id | string | varchar(100)
transaction_id | string | varchar(100)
parameters | map | text
request_time | long | bigint
session_id | string | text
list_of_something | array | text
Data Structures for Core and Presentation

A totally different challenge is managing the data structures after data entry. Transformations need to occur for the benefit of technical and business users. This can mean other attribute / column names, or splitting or joining data from one schema into different tables. These tables can evolve again, and evolving database tables is a pain, since in most cases you cannot just drop a table and recreate the new one. Here the processing power of Apache HAWQ™ comes into play, as now we can do exactly that: in most cases, at least for transactional data, we just reload the whole table during the daily batch process. So we manage all table definitions in tables. They contain all table and column attributes, as well as a version number, which makes rolling back possible.

Example of table definitions:

Schema Name | Table Name | Partitioned | distributed by | dropable | Version
core | dim_user | | user_id | true | 3.0
core | rel_user_purchase | | user_id, purchase_id | true | 4.0
core | fact_user_registration | | user_id | true | 4.2
core | fact_usage_page | | time_value_request | false | 5.0

Example of column definitions:

Schema Name | Table Name | Column Name | Data Type | Data Type Length | Version
core | dim_user | user_id | varchar | 100 | 3.0
core | dim_user | user_type | varchar | 50 | 3.0
core | rel_user_purchase | user_id | varchar | 100 | 4.0
core | rel_user_purchase | purchase_id | varchar | 100 | 4.0

Automation of data structure deployment

After defining all the metadata for the data structures, we need a way to automate deployment. For this I created a Java program that uses all the information from the schema definitions and the tables for data type mappings and table definitions. The program runs on any of the three environments and also takes into consideration the version it should create. This only works if no one can create or change table information manually. The program first deploys changes to the data structure on the development system. If they work there and all changes are ready to be deployed on the staging system, the program is run in that environment. With version numbers defined for each change, it is possible to run version 5.0 on production, version 5.1 on staging and version 5.2 on development if necessary.
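The core of such a generator can be sketched in a few lines, using dictionaries in place of the metadata tables. All names are illustrative, the original tool is written in Java, and real DDL generation needs quoting, partitioning and more options:

```python
def create_table_ddl(table_def, column_defs, version):
    """Generate HAWQ/PostgreSQL-style DDL from metadata for one version,
    so tables are never created by hand on any environment."""
    cols = [
        f'{c["column_name"]} {c["data_type"]}'
        + (f'({c["length"]})' if c.get("length") else "")
        for c in column_defs
        if c["table_name"] == table_def["table_name"] and c["version"] == version
    ]
    ddl = f'CREATE TABLE {table_def["schema_name"]}.{table_def["table_name"]} (\n  '
    ddl += ",\n  ".join(cols) + "\n)"
    if table_def.get("distributed_by"):
        ddl += f' DISTRIBUTED BY ({table_def["distributed_by"]})'
    return ddl + ";"

table = {"schema_name": "core", "table_name": "dim_user", "distributed_by": "user_id"}
columns = [
    {"table_name": "dim_user", "column_name": "user_id",
     "data_type": "varchar", "length": 100, "version": "3.0"},
    {"table_name": "dim_user", "column_name": "user_type",
     "data_type": "varchar", "length": 50, "version": "3.0"},
]
ddl = create_table_ddl(table, columns, "3.0")
```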

Conclusion

This approach makes data structure evolution easier and keeps three different systems in sync. It is most important to make sure there are no manual changes to the data structures. The approach works nicely if you can drop and recreate all tables without hesitation. If that is not the case, you can still define your tables this way; just make sure they are not dropped automatically when the program runs. Having all definitions in one place, and being able to deploy the tables and changes automatically with version control, can be a great help. No one has to remember to create new tables manually in either system, so errors due to missing tables or columns are minimized, which matters in a production Data Lake, where errors affect data products needed by other products of your company.

Data Lake vs Datawarehouse

The Data Lake architecture is an up-and-coming approach to making all data accessible through several methods, be that real-time or batch analysis. This includes unstructured as well as structured data. In this approach the data is stored on HDFS and made accessible by several tools.

All of these tools have advantages and disadvantages when used to process data, but combined they make your data accessible. This is the first step in building a Data Lake: you have to make your data, even schemaless data, accessible to your customers.
A classical data warehouse, by contrast, only contains structured data that is at least preprocessed and has a fixed schema. Data in a classical data warehouse is not the raw data entered into the system; you need a separate staging area for transformations. Usually this area is not accessible to all consumers of your data, but only to the data warehouse developers.

Data Lake Architecture using Apache HAWQ

It is a challenge to build a Data Lake with Apache HAWQ, but this can be overcome in the design. One solution for building such a system can be seen in the picture below.

Data Entry

To make utilization of Apache HAWQ possible, the starting point is a controlled data entry. This is a compromise between schemaless and schematized data. Apache Avro is a way to do this: schema evolution is an integral part of Avro, and it provides structures to save unstructured data, like maps and arrays. A separate article about Avro, explaining schema evolution and how to make the most of it, will be one of the next topics here.
Data structured in schemas can then be pushed message-wise into a messaging queue. Choose a queue that fits your needs best: if you need secure transactions, RabbitMQ may be the right choice; another option is Apache Kafka.

Pre-aggregating Data

Processing and storing single messages on HDFS is not an option, so another system is needed to aggregate messages before storing them on HDFS. For this, Apache NiFi is a good choice. It comes with processors that make tasks like this pretty easy, for example the MergeContent processor, which merges single Avro messages and removes all headers but one before writing them to HDFS.
If those files are still below the HDFS block size, there is the possibility to read them back from HDFS and merge them into even larger files.

Making data available in the Data Lake

Use Apache Hive to make the data in Avro format accessible. HAWQ could read the Avro files directly, but Hive handles schema evolution in a more effective way. For example, if there is the need to add a new optional field to an existing schema, add a default value for that field and Hive will fill entries from earlier messages with this value. When HAWQ accesses this Hive table, it automatically reads the default value for fields added later; it could not do this by itself. Hive also has a more robust way of handling and extracting keys and values from map fields right now.
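The reader-side effect of such a default can be sketched like this. The field names and the default are made up, and real schema resolution is done by the Avro and Hive libraries, not by hand:

```python
# Schema v2 added an optional field with a default; records written
# with schema v1 are read as if they contained the default value.
schema_v2_fields = [
    {"name": "user_id", "type": "string"},
    {"name": "country", "type": "string", "default": "unknown"},  # added later
]

def read_with_defaults(record, fields):
    """Fill fields missing from an old record with the schema default."""
    return {f["name"]: record.get(f["name"], f.get("default"))
            for f in fields}

old_record = {"user_id": "42"}  # written with schema v1
read_with_defaults(old_record, schema_v2_fields)
# yields {"user_id": "42", "country": "unknown"}
```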

Data Lake with SQL Access

All data is available in Apache HAWQ now. This enables transformations using SQL and makes all of your data accessible to a broad audience in your company; SQL skills are more common than, say, Spark programming in Java, Scala or PySpark. From here it is possible to give analysts access to all of the data, or to build data marts for single subjects of concern using SQL transformations. Connectivity to reporting tools like Tableau is possible with a PostgreSQL driver. Even advanced analytics are possible, if you install Apache MADlib on your HAWQ cluster.

Using Data outside of HAWQ

It is even possible to use all the data outside of HAWQ, if there is a need for it. Since all data is available in Avro format, accessing it by means of Apache Spark with Apache Zeppelin is also possible. Hive queries are possible too, since all data is registered there using external tables, which we used for the integration into HAWQ.
Accessing results of such processing from HAWQ is possible too: save the results in Avro format for integration in the way described above, or use “hawq register” to access Parquet files directly from HDFS.

Conclusion

Using Apache HAWQ as the base of a Data Lake is possible; just take some constraints into consideration. Entering data semi-structured in Avro format also saves work later when you process the data. The main advantage is that you can use SQL as an interface to all of your data. This enables many people in your company to access your data and will help you on your way to data-driven decisions.

Apache Zeppelin is pretty useful for interactive programming in the web browser. It even comes with its own installation of Apache Spark. For further information you can check my earlier post.
But the real power of using Spark with Zeppelin lies in how easily it connects to your existing Spark cluster using YARN.

There is also the possibility to add your own interpreter to Zeppelin, which makes the tool really flexible.
Another feature is the built-in integration of Apache Spark. It ships with the following features and more:

Automatic SparkContext and SQLContext injection

Runtime jar dependency loading from local filesystem or maven repository.

Canceling job and displaying its progress

It also has built-in visualization, which I think is an improvement over using IPython notebooks. The visualization covers the most basic graphs, like:

Tables

BarCharts

Pies

Scatterplot

Lines

These visualizations can be used with all interpreters and always look the same. So you can show data from Postgres and Spark in the same notebook using the same functions; there is no need to handle different data sources differently.
You can also use dynamic forms in your notebooks, e.g. to provide filter options to the user. This comes in handy if you embed a notebook in your own website.

Apache Spark has released version 2.0, which is a major step forward in usability for Spark users, and especially for people who refrained from using it due to the cost of learning a new programming language or tool. That is in the past now, as Spark 2.0 improves SQL functionality with SQL:2003 support. It can now run all 99 TPC-DS queries. The new SQL parser supports ANSI SQL as well as HiveQL, and subqueries.
Another new feature is native CSV data source support, based on the already existing Databricks spark-csv module. I personally used this module, as well as the spark-avro module, and they make working with data in those formats really easy.
Some new features were also added to MLlib.

Spark also increased its performance with release 2.0. The goal was to make Spark 2.0 10x faster, and Databricks demonstrates this performance tuning in a notebook.

All of these improvements make Spark a more complete tool for data processing and analysis. The added SQL:2003 support makes it available to a larger user base and, more importantly, makes it easier to migrate existing applications from databases to Spark.

In Data Science there are two languages that compete for users: on one side R, on the other Python. Both have a huge user base, but there is some discussion about which is better to use in a Data Science context. Let's explore both a bit:

R
R is a language and programming environment especially developed for statistical computing and graphics. It has been around for some time and has several thousand packages for tackling statistical problems. With RStudio it also provides an interactive programming environment that makes analysing data pretty easy.

Python
Python is a general-purpose programming language that is easy to integrate into a company-wide system. With the packages NumPy, Pandas, Scikit-learn and Matplotlib, in combination with IPython, it also provides a full suite for statistical computing and an interactive programming environment.

R was developed solely for the purpose of statistical computing, so it has some advantages there, since it is specialized and has been around for years. Python comes from being a general programming language and is now moving into the data analysis field, bringing along everything else it can do: websites, and easy integration with Hadoop Streaming or Apache Spark.
And people who want the best of both worlds can always use the R-Python integration rpy2.

I have recently been working with Python for my ETL processes, including MapReduce, and for analysing data, which works great in combination with IPython as an interactive development tool.

Since Apache Spark became a top-level Apache project almost a year ago, it has seen wide coverage and adoption in the industry. Due to its promise of being faster than Hadoop MapReduce, about 100x in memory and 10x on disk, it seems like a real alternative to pure MapReduce.
Written in Scala, it provides the ability to write applications quickly in Java, Python and Scala, and the syntax isn't that hard to learn. There are also tools for SQL (Spark SQL), machine learning (MLlib, interoperating with Python's NumPy), graph processing and streaming. This makes Spark a really good alternative for big data processing.
Another feature of Apache Spark is that it runs everywhere: on top of Hadoop, standalone, or in the cloud, and it can easily access diverse data stores such as HDFS, Amazon S3, Cassandra and HBase.

The easy integration with Amazon Web Services is what makes it attractive to me, since I am using AWS already. I also like the Python integration, because Python has lately become my favourite language for data manipulation and machine learning.

Besides the official parts of Spark mentioned above, there are also some really nice external packages that, for example, integrate Spark with tools such as Pig or Amazon Redshift, add machine learning algorithms, and so on.

Given the promised speed gains, the ease of use, the full range of tools available and the integration with third-party programs such as Tableau or MicroStrategy, Spark seems to have a bright future.

The inventors of Apache Spark also founded a company called Databricks, which offers professional services around Spark.

With Hadoop 2.0 and the new additions Stinger and Impala, I did a (not representative) performance test on a VirtualBox VM running on my desktop computer, with the following setup:

4 GB RAM

Intel Core i5 2500 3.3 GHz

The datasets were the following:

Dataset 1: 71,386,291 rows and 5 columns

Dataset 2: 132,430,086 rows and 4 columns

Dataset 3: partitioned data with 2,153,924 rows and 32 columns

Dataset 4: unpartitioned data with 2,153,924 rows and 32 columns

The results were the following:

Query | Hive (0.10.0) | Impala | Stinger (Hive 0.12.0)
Join tables | 167.61 sec | 31.46 sec | 122.58 sec
Partitioned tables Dataset 3 | 42.45 sec | 0.29 sec | 20.97 sec
Unpartitioned tables Dataset 4 | 47.92 sec | 1.20 sec | 36.46 sec
Grouped Select Dataset 1 | 533.83 sec | 81.11 sec | 444.634 sec
Grouped Select Dataset 2 | 323.56 sec | 49.72 sec | 313.98 sec
Count Dataset 1 | 252.56 sec | 66.48 sec | 243.91 sec
Count Dataset 2 | 158.93 sec | 41.64 sec | 174.46 sec

Compare Impala vs. Stinger

This shows that Stinger provides a faster SQL interface on Hive, but since it still uses MapReduce when computing results, it is no match for Impala, which doesn't use MapReduce. So using Impala makes sense when you want to analyse data in Hadoop using SQL, even on a small installation. It gives you easy and fast access to all data stored in your Hadoop cluster, which was not possible before.
Facebook's Presto should achieve nearly the same results, since the underlying technique is similar. These latest additions and changes to the Hadoop framework really seem like a big boost in making the project more accessible for many people.
