Category Archives: Heroku

Last time I wrote about Hadoop on Heroku, which is an add-on from Treasure Data – this time I am going to cover NoSQL on Heroku.
There are various datastore services (add-ons, in Heroku terms) available, from MongoDB (MongoHQ) to CouchDB (Cloudant) to Cassandra (Cassandra.io). This post is devoted to Cassandra.io.

Cassandra.io

Cassandra.io is a hosted and managed Cassandra ring based on Apache Cassandra that is accessible via a RESTful API. At the time of writing, the Cassandra.io client helper libraries are available in Java, Ruby and PHP, and there is also an Objective-C version in private beta. The libraries can be downloaded from GitHub. I use the Java library in my tests.

Heroku, and with it the Cassandra.io add-on, is built on Amazon Elastic Compute Cloud (EC2) and is supported in all of Amazon’s locations. Note: the Cassandra.io add-on is in public beta now, which means that only one plan, called Test, is available – this is free.

Installing the Cassandra.io add-on

To install the Cassandra.io add-on, you just need to follow the standard way of adding an add-on to an application:

The Java RESTful API library has one simple configuration file called sdk.properties. It stores very few parameters: the API URL and the version. The original sdk.properties file cloned from GitHub has the wrong version (v0.1); it needs to be changed to 1. You can verify the required configuration parameters using the heroku config command.
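As a rough illustration of how the two settings combine, here is a minimal Java sketch that reads a properties file and derives the base URL for API calls. The key names `apiUrl` and `version` and the class name are assumptions made for this example; check the actual sdk.properties in the cloned repository for the real key names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class SdkConfigSketch {
    // Hypothetical file content: the real sdk.properties may use different key names.
    static final String SAMPLE =
        "apiUrl=https://api.cassandra.io\n" +
        "version=1\n";

    // Parses properties text and joins the API URL and version into a base URL.
    static String baseUrl(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("apiUrl") + "/" + props.getProperty("version");
    }

    public static void main(String[] args) {
        System.out.println(baseUrl(SAMPLE));
    }
}
```

With the version set to 1, the resulting base URL has the same `/1/` path segment as the keyspace URL used in the steps of this post.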

Step 1./ The code creates a keyspace named AAPL using HTTP POST, url: https://api.cassandra.io/1/keyspace/AAPL/
It uses the KeySpaceAPI class with Token and AccountId as parameters for the constructor. Token is used as the username, while AccountId is the password. (Remember: these attributes can be retrieved using the heroku config command or via the Heroku Admin console.)
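To show roughly what such a call looks like on the wire, here is a sketch that builds (but does not send) the POST request with the JDK's own HTTP client classes instead of the Cassandra.io library. It assumes that the username/password pair is sent as HTTP Basic authentication, which the text above suggests but does not state; the class and method names are made up for this example.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class KeyspaceRequestSketch {
    // Builds the keyspace URL, e.g. https://api.cassandra.io/1/keyspace/AAPL/
    static String keyspaceUrl(String baseUrl, String keyspace) {
        return baseUrl + "/keyspace/" + keyspace + "/";
    }

    // Assumption: Token acts as username and AccountId as password in HTTP Basic auth.
    static String basicAuthHeader(String token, String accountId) {
        String pair = token + ":" + accountId;
        return "Basic " + Base64.getEncoder()
            .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }

    // Creates an empty-bodied POST request; nothing is sent over the network here.
    static HttpRequest createKeyspaceRequest(String baseUrl, String keyspace,
                                             String token, String accountId) {
        return HttpRequest.newBuilder(URI.create(keyspaceUrl(baseUrl, keyspace)))
            .header("Authorization", basicAuthHeader(token, accountId))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = createKeyspaceRequest(
            "https://api.cassandra.io/1", "AAPL", "my-token", "my-account-id");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request would be a matter of passing it to `java.net.http.HttpClient`, using the real Token and AccountId values from heroku config.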

Step 4./ Then the code prepares the data as name/value pairs (Open = “533.96”, Close = “530.38”, etc.), defines a rowkey (“18-05-2012”) and then uses the DataAPI postBulkData method to upload the data into Cassandra.io. The DataAPI credentials are the same as above.
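The data preparation part of this step can be sketched in plain Java as follows. The sketch only builds the row key and the name/value pairs; it deliberately does not call postBulkData, because its exact signature is not shown here. The class and helper names are illustrative, and the dd-MM-yyyy pattern is inferred from the sample row key.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;

public class StockRowSketch {
    // Pattern inferred from the sample row key "18-05-2012".
    static final DateTimeFormatter ROW_KEY_FORMAT =
        DateTimeFormatter.ofPattern("dd-MM-yyyy");

    // One row per trading day, keyed by the formatted date.
    static String rowKey(LocalDate day) {
        return day.format(ROW_KEY_FORMAT);
    }

    // Column name/value pairs for one row; insertion order is preserved.
    static Map<String, String> columns(String open, String close) {
        Map<String, String> pairs = new LinkedHashMap<>();
        pairs.put("Open", open);
        pairs.put("Close", close);
        return pairs;
    }

    public static void main(String[] args) {
        String key = rowKey(LocalDate.of(2012, 5, 18));
        Map<String, String> row = columns("533.96", "530.38");
        System.out.println(key + " -> " + row);
    }
}
```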

If you want to try out a robust, highly available Cassandra datastore without any upfront infrastructure investment and with an easy-to-use API, you can certainly have a closer look at Cassandra.io on Heroku. It takes only a few minutes to start up and the APIs offer simple REST-based data management for Java, Ruby and PHP developers.

This time I write about Heroku and the Treasure Data Hadoop solution – I found it to be a real gem in the Big Data world.

Heroku is a cloud platform as a service (PaaS) owned by Salesforce.com. It originally started with Ruby as its main programming language, but support has been extended to Java, Scala, Node.js, Python and Clojure, too. It also supports a long list of add-ons including, among others, RDBMS and NoSQL capabilities and a Hadoop-based data warehouse developed by Treasure Data.

Treasure Data Hadoop Architecture

The architecture of the Treasure Data Hadoop solution is as follows:

Heroku Toolbelt

Heroku Toolbelt is a set of command-line tools that consists of the heroku, foreman and git packages. As described on the Heroku Toolbelt website, it is “everything you need to get started using heroku”. (The heroku CLI is based on Ruby, so you need Ruby under the hood, too.) Once you have signed up for Heroku (you need a verified account, meaning that you have provided your bank details for potential service charges) and installed the Heroku Toolbelt, you can start right away.

Depending on your environment – I am using Ubuntu 12.04 LTS – you can use an alternative installation method such as:

$ sudo apt-get install git
$ gem install heroku
$ gem install foreman

Heroku and Treasure Data add-on

If you want to use Treasure Data on Heroku, you need to add the Treasure Data Hadoop add-on: you need to log in, create an application (heroku will generate a fancy name like boiling-tundra for you) and then add the add-on to the application you just created:

Treasure Data Hadoop – td commands

Now we are ready to execute td commands from heroku. td commands are used to create databases and tables, import data, run queries, drop tables, etc. Under the hood, td queries are basically HiveQL queries. (According to their website, Treasure Data plans to support Pig as well in the future.)

By default, the Treasure Data td-agent prefers JSON-formatted data, though it can process various other formats (Apache log, syslog, etc.) and you can write your own parser to process the uploaded data.

Now we are ready to run HiveQL (td query) against the dataset – this particular query lists the highest prices of AAPL stock at the top, that is, it shows the prices in descending order. (The time value is based on the UNIX epoch):
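Since the time column is a UNIX epoch value, it can be turned back into a calendar date on the client side. A minimal Java sketch, assuming the value is expressed in seconds and that UTC is the desired time zone; the class name and the sample value (chosen to match the 18-05-2012 row used earlier) are illustrative only:

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

public class EpochTimeSketch {
    // Converts seconds since 1970-01-01T00:00:00Z into a UTC calendar date.
    static LocalDate utcDate(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds)
            .atZone(ZoneOffset.UTC)
            .toLocalDate();
    }

    public static void main(String[] args) {
        // 1337299200 is midnight UTC on 18 May 2012.
        System.out.println(utcDate(1337299200L)); // prints 2012-05-18
    }
}
```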