Hadoop, bigdata, cloud computing and mobile BI

Category Archives: NoSQL

Introduction

As discussed in the previous post about Twitter’s Storm, Hadoop is a batch-oriented solution that lacks support for ad-hoc, real-time queries. Many of the players in Big Data have realised the need for fast, interactive queries besides the traditional Hadoop approach. Cloudera, one of the key solution vendors in the Big Data/Hadoop domain, has just recently launched Cloudera Impala to address this gap.

As the Cloudera engineering team described in their blog, their work was inspired by the Google Dremel paper, which is also the basis for Google BigQuery. Cloudera Impala provides a HiveQL-like query language covering a wide variety of SELECT statements with WHERE, GROUP BY and HAVING clauses, ORDER BY (though currently LIMIT is mandatory with ORDER BY), joins (LEFT, RIGHT, FULL, OUTER, INNER), UNION ALL, external tables, etc. It also supports arithmetic and logical operators and Hive built-in functions such as COUNT, SUM, LIKE, IN or BETWEEN. It can access data stored on HDFS but it does not use MapReduce; instead it relies on its own distributed query engine.
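To give a flavour of the syntax, here is a minimal sketch of the kind of statement Impala accepts; it runs against the stockprice table created in the Hive example later in this post, and LIMIT is included because of the ORDER BY restriction mentioned above:

SELECT yyyymmdd, close_price
FROM stockprice
WHERE close_price BETWEEN 500 AND 700
ORDER BY close_price DESC
LIMIT 10;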

The current Impala release (Impala 1.0beta) does not support DDL statements (CREATE, ALTER, DROP TABLE); all table creation, modification and deletion has to be executed via Hive and the tables then refreshed in the Impala shell.

Cloudera Impala is open source under the Apache License; the code can be retrieved from GitHub. Its components are written in C++, Java and Python.

Impalad runs on each Hadoop DataNode and it plans and executes the queries sent from impala-shell.

Impala-state-store stores information (location and status) about all the running impalad instances.

Installing Cloudera Impala

As of writing this article, Cloudera Impala requires 64-bit RHEL/CentOS 6.2 or higher. I was running the tests on RHEL 6.3 (64-bit) on AWS. You need to have Cloudera CDH4.1, Hive and MySQL installed (the latter is used to store the Hive metastore).

Note: AWS t1.micro instances are not suitable for CDH4.1, which requires more memory than a t1.micro provides.

Cloudera recommends using Cloudera Manager to install Impala, but I followed the manual steps, just to ensure that I have a complete understanding of what is going on during the installation.

Step 1: Install CDH4.1

To install CDH4.1 you need to run the following commands. These steps describe how to install Hadoop MRv1; if you want YARN instead, that requires different MapReduce rpms to be installed. However, Cloudera stated in the install instructions that they do not consider MapReduce 2.0 (YARN) production-ready yet, so I decided to stick with MRv1 for these tests. CDH4.1 MRv1 can be installed as a pseudo-distributed or a full cluster solution; for the tests we will use the pseudo-distributed configuration:
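A sketch of the yum-based steps I used; the package and repository names are as I recall them from the CDH4 documentation, so double-check them against the current install guide:

$ wget http://archive.cloudera.com/cdh4/one-click-install/redhat/6/x86_64/cloudera-cdh-4-0.x86_64.rpm
$ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm
$ sudo yum install hadoop-0.20-conf-pseudo
$ sudo -u hdfs hdfs namenode -format
$ for service in /etc/init.d/hadoop-hdfs-*; do sudo $service start; done
$ for service in /etc/init.d/hadoop-0.20-mapreduce-*; do sudo $service start; done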

If you follow the manual installation procedure, there will be no Impala config files created automatically. You need to create the /usr/lib/impala/conf directory and copy the following files into it: core-site.xml, hdfs-site.xml, hive-site.xml and log4j.properties.
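Something along these lines; the source locations assume the default CDH4 configuration directories (/etc/hadoop/conf and /etc/hive/conf):

$ sudo mkdir -p /usr/lib/impala/conf
$ sudo cp /etc/hadoop/conf/core-site.xml /usr/lib/impala/conf/
$ sudo cp /etc/hadoop/conf/hdfs-site.xml /usr/lib/impala/conf/
$ sudo cp /etc/hadoop/conf/log4j.properties /usr/lib/impala/conf/
$ sudo cp /etc/hive/conf/hive-site.xml /usr/lib/impala/conf/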

Run Impala queries

In order to run interactive Impala queries from the impala-shell, we need to create the tables via Hive (remember, the current Impala beta version does not support DDL). I used Google stock prices in this example (retrieved from http://finance.yahoo.com in CSV format):
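The Hive side looked roughly like this; the column layout mirrors the Yahoo Finance CSV (date, open, high, low, close, volume, adjusted close) and the local file path is illustrative:

hive> CREATE TABLE stockprice (
          yyyymmdd STRING, open_price FLOAT, high_price FLOAT,
          low_price FLOAT, close_price FLOAT, stock_volume INT,
          adjclose_price FLOAT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH '/home/ec2-user/goog_stock.csv' OVERWRITE INTO TABLE stockprice;
hive> SELECT yyyymmdd, close_price FROM stockprice WHERE close_price > 650;

The same table is then visible from the Impala shell after a refresh; the connect and refresh commands below follow the beta documentation as I remember it:

$ impala-shell
> connect localhost
> refresh
> SELECT yyyymmdd, close_price FROM stockprice WHERE close_price > 650;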

Running the very same query from impala-shell executes significantly faster. Cloudera claims it can be an order of magnitude faster or more, depending on the query. I can confirm from my experience that impala-shell returned the result in around one second on average, compared to the Hive version which took roughly 82 seconds.

Conclusion

There are more and more efforts in the Big Data world to support ad-hoc, fast queries and real-time data processing for large datasets. Cloudera Impala is certainly an exciting solution that utilises the same concept as Google BigQuery but promises to support a wider range of input formats; and by making it available as open source technology, Cloudera can attract external developers to improve the software and take it to the next stage.

If you are interested in learning more about Impala, please check out our book, Impala in Action, at Manning Publications.

Last time I wrote about Hadoop on Heroku, which is an add-on from Treasure Data; this time I am going to cover NoSQL on Heroku.
There are various datastore services – add-ons in Heroku terms – available from MongoDB (MongoHQ) to CouchDB (Cloudant) to Cassandra (Cassandra.io). This post is devoted to Cassandra.io.

Cassandra.io

Cassandra.io is a hosted and managed Cassandra ring based on Apache Cassandra, made accessible via a RESTful API. As of writing this article, the Cassandra.io client helper libraries are available in Java, Ruby and PHP, and there is also an Objective-C version in private beta. The libraries can be downloaded from GitHub. I used the Java library in my tests.

Heroku, and the Cassandra.io add-on with it, is built on Amazon Elastic Compute Cloud (EC2) and it is supported in all of Amazon’s locations. Note: the Cassandra.io add-on is in public beta now, which means that only one option, called Test, is available; it is free.

Installing Cassandra.io add-on

To install the Cassandra.io add-on you just need to follow the standard way of adding an add-on to an application:
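Something like the commands below; the add-on slug (cassandraio) and the plan name (test, corresponding to the free Test option mentioned above) are my assumptions, so verify them in the Heroku add-on catalogue:

$ heroku addons:add cassandraio:test --app my-cassandra-app
$ heroku config --app my-cassandra-app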

The Java RESTful API library has one simple configuration file called sdk.properties. It stores very few parameters: the API URL and the version. The original sdk.properties file cloned from GitHub has the wrong version in it (v0.1); it needs to be changed to 1. You can verify the required configuration parameters using the heroku config command.

Step 1./ The code creates a keyspace named AAPL using HTTP POST, URL: https://api.cassandra.io/1/keyspace/AAPL/
It uses the KeySpaceAPI class with Token and AccountId as parameters for the constructor. The Token is used as the username, while the AccountId is the password. (Remember: these attributes can be retrieved using the heroku config command or via the Heroku admin console.)
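For reference, the same Step 1 call can be exercised without the Java library; a rough curl equivalent, where the two environment variables are placeholders standing in for the Token and AccountId values retrieved from heroku config:

$ curl -X POST -u "$CASSANDRAIO_TOKEN:$CASSANDRAIO_ACCOUNT_ID" \
       https://api.cassandra.io/1/keyspace/AAPL/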

Step 4./ Then the code prepares the data as name/value pairs (Open = “533.96”, Close = “530.38”, etc.), defines a row key (“18-05-2012”) and then uses the DataAPI postBulkData method to upload the data into Cassandra.io. The DataAPI credentials are the same as above.

If you want to try out a robust, highly available Cassandra datastore without any upfront infrastructure investment and with an easy-to-use API, it is certainly worth having a closer look at Cassandra.io on Heroku. It takes only a few minutes to get started and the APIs offer simple REST-based data management for Java, Ruby and PHP developers.

This time I write about Heroku and the Treasure Data Hadoop solution; I found it to be a real gem in the Big Data world.

Heroku is a cloud platform as a service (PaaS) owned by Salesforce.com. Originally it started with supporting Ruby as its main programming language but it has been extended to Java, Scala, Node.js, Python and Clojure, too. It also supports a long list of add-ons including, among others, RDBMS and NoSQL capabilities and a Hadoop-based data warehouse developed by Treasure Data.

Treasure Data Hadoop Architecture

The architecture of the Treasure Data Hadoop solution is as follows:

Heroku Toolbelt

The Heroku Toolbelt is a command line tool set that consists of the heroku, foreman and git packages. As described on the Heroku Toolbelt website, it is “everything you need to get started using heroku”. (The heroku CLI is based on Ruby so you need Ruby under the hood, too.) Once you have signed up for Heroku (you need a verified account, meaning that you have provided your bank details for potential service charges) and you have installed the Heroku Toolbelt, you can start right away.

Depending on your environment (I am using Ubuntu 12.04 LTS) you can use an alternative installation method like:

$ sudo apt-get install git
$ gem install heroku
$ gem install foreman

Heroku and Treasure Data add-on

If you want to use Treasure Data on Heroku, you need to add the Treasure Data Hadoop add-on: you need to log in, create an application (Heroku will generate a fancy name like boiling-tundra for you) and then add the add-on to the application you just created:
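In practice this boils down to a couple of commands; the add-on slug (treasure-data) is how I recall it being listed, so double-check it against the Heroku add-on catalogue, and the application name is whichever one heroku create generated for you:

$ heroku login
$ heroku create
$ heroku addons:add treasure-data --app boiling-tundra-1234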

Treasure Data Hadoop – td commands

Now we are ready to execute td commands from Heroku. td commands are used to create databases and tables, import data, run queries, drop tables, etc. Under the hood td queries are basically HiveQL queries. (According to their website, Treasure Data plans to support Pig as well in the future.)
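A few representative td commands, run via the td toolbelt authenticated with the add-on’s credentials; the database and table names are illustrative and the exact sub-commands may differ slightly between td toolbelt versions:

$ gem install td
$ td db:create aapl
$ td table:create aapl stockprices
$ td tables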

By default the Treasure Data td-agent prefers JSON-formatted data, though it can process various other formats (Apache log, syslog, etc.) and you can write your own parser to process the uploaded data.

Now we are ready to run HiveQL (td query) against the dataset. This particular query lists the AAPL stock prices with the highest prices on top (the time value is based on the UNIX epoch):
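A sketch of what such a query might look like; the database, table and column names are illustrative, and the v[...] notation reflects how, as far as I recall, Treasure Data exposes schemaless JSON fields to Hive:

$ td query -w -d aapl \
    "SELECT time, v['high'] AS high_price FROM stockprices ORDER BY high_price DESC"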

Cassandra, originally developed at Facebook, is another popular NoSQL database that combines Amazon’s Dynamo distributed systems technologies and Google’s Bigtable data model based on column families. It is designed for distributed data at large scale. Its key components are as follows:

Keyspace: it acts as a container for data, similar to a schema in an RDBMS. It determines the replication parameters such as the replication factor and the replication placement strategy, as we will see later in this post. More details on replication placement strategy can be read here.

Column family: within a keyspace you can have one or more column families. This is similar to tables in the RDBMS world. They contain multiple columns which are referenced by row keys.

Column: it is the smallest increment of data. It is a tuple having a name, a value and a timestamp.

Installing Cassandra from binaries

Datastax is the commercial leader in Apache Cassandra; they offer a complete big data platform (Enterprise Edition) built on Apache Cassandra as well as a free Community Edition. This post is based on the latter. In 2012 they were listed among the Top 10 Big Data startups.

Besides the Cassandra package they also offer a web-based management center (Datastax OpsCenter), which can make Cassandra cluster management much easier than the command line based alternatives (e.g. cassandra-cli).

To download Datastax Community Edition, go to this link. Both the Datastax Community Server and the OpsCenter Community Edition are available there. As of this writing, the Cassandra Community Server version is 1.1.2 (dsc-cassandra-1.1.2-bin.tar.gz) and the OpsCenter version is 2.1.1 (opscenter-2.1.1-free.tar.gz).

The installation is as simple as unpacking the tarballs. Then you need to configure the Cassandra instance by editing the <Cassandra install directory>/conf/cassandra.yaml file.
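For example, using the file downloaded above (the name of the extracted directory may differ slightly):

$ tar xzf dsc-cassandra-1.1.2-bin.tar.gz
$ cd dsc-cassandra-1.1.2
$ vi conf/cassandra.yaml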

A few parameters that needed to be edited:

cluster_name: 'BigHadoop Cluster'
initial_token: 0
listen_address: 10.229.30.238
seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring. You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "10.229.30.238"
rpc_address: 0.0.0.0

My configuration had two nodes; the second node had a similar cassandra.yaml file except for the listen_address and the token.

The next step is to install the OpsCenter (on one designated node) and the agents on all the nodes. This is again as simple as unpacking the tarball that we just downloaded from the Datastax site and then editing opscenterd.conf:

[webserver]
port = 8888
interface = 0.0.0.0
[agents]
use_ssl = false

Note: I did not want to use SSL between the agents and the OpsCenter so I disabled it.

To start up the OpsCenter:

$ cd <OpsCenter install directory>
$ bin/opscenter

In fact, OpsCenter is a Python twistd-based web server so you need to have Python installed as well. The Amazon AMI had Python 2.6.7 preinstalled.

$ python -V
Python 2.6.7

OpsCenter also uses iostat, which was not preinstalled on my instance, so I had to install the sysstat package, too:

$ sudo yum install sysstat

You can also install the agents manually (that is what I did) or automatically, but you have to ensure that they are installed on every node that is a member of the cluster. The agent is part of the OpsCenter tarball; it can be found under the OpsCenter/agent directory.

To configure the agent you need to edit the conf/address.yaml file:

$ cat address.yaml
stomp_interface: "10.229.30.238"
use_ssl: 0

stomp_interface is the OpsCenter interface, while use_ssl: 0 indicates that we do not use SSL for agent communications.

Note: Cassandra and OpsCenter use TCP ports that are not open by default on an AWS EC2 instance. You need to define a special security group that opens the following ports: 7000/tcp, 9160/tcp, 8888/tcp, 61210/tcp and 61621/tcp. More details about how these ports are used can be found here.

Using Cassandra

The simplest way to start using Cassandra is its command line tool called cassandra-cli.
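A sketch of a cassandra-cli session along those lines (the exact option syntax may vary slightly between Cassandra releases); it reuses the AAPL stock data from the earlier posts:

$ cassandra-cli -h 10.229.30.238 -p 9160
[default@unknown] create keyspace AAPL
    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = {replication_factor:2};
[default@unknown] use AAPL;
[default@AAPL] create column family Marketdata
    with comparator = UTF8Type
    and key_validation_class = UTF8Type
    and default_validation_class = UTF8Type;
[default@AAPL] set Marketdata['18-05-2012']['Open'] = '533.96';
[default@AAPL] set Marketdata['18-05-2012']['Close'] = '530.38';
[default@AAPL] get Marketdata['18-05-2012'];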

These steps create a keyspace called AAPL, modify the replication parameters mentioned above (replication factor and placement strategy) and create a column family called Marketdata. Then we can use the set command to insert data and get to retrieve it.

Besides the ‘traditional’ command line interface, there is also an SQL*Plus-like utility known as the Cassandra Query Language Shell (cqlsh). This is a utility written in Python that supports SQL-like queries (a kind of Hive analogue from the Hadoop world).

It supports DDL- and DML-type commands, so you can run SELECT and INSERT statements as well as CREATE KEYSPACE, CREATE TABLE, ALTER TABLE and DROP TABLE.
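For example, a minimal cqlsh session might look like the sketch below. CQL syntax differs between CQL 2 and CQL 3 (the latter had to be enabled with the -3 switch in the 1.1 release), so treat this as illustrative rather than exact commands for 1.1.2; a separate keyspace name is used to avoid clashing with the one created via cassandra-cli above:

$ cqlsh -3 10.229.30.238
cqlsh> CREATE KEYSPACE "Stocks" WITH strategy_class = 'SimpleStrategy'
         AND strategy_options:replication_factor = 2;
cqlsh> USE "Stocks";
cqlsh> CREATE TABLE marketdata (rowkey text PRIMARY KEY, open_price text, close_price text);
cqlsh> INSERT INTO marketdata (rowkey, open_price, close_price) VALUES ('18-05-2012', '533.96', '530.38');
cqlsh> SELECT * FROM marketdata WHERE rowkey = '18-05-2012';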