Hadoop, bigdata, cloud computing and mobile BI



Amazon Elastic MapReduce is a service in the AWS portfolio that can be used for data processing and analytics on vast amounts of data. It is based on Hadoop (Hadoop 0.20.205 at the time of writing) and relies on other AWS services such as EC2 and S3.

The data processing applications can be implemented using various technologies such as Hive, Pig, Java (Custom Jar) and Streaming (e.g. Python or Ruby). This post will demonstrate how to use Hive on Amazon Elastic MapReduce – the sample application will calculate the average price of Apple stock in every year from 1984 till 2012. At the time of writing, the Hive version is 0.7.1. (Side note: as will be shown, AAPL started at around 25 USD as an average price in 1984, managed to get down to 18 USD in 1997 and is now around 500 USD – 496.32138, to be more precise – quite some numbers for a company that has been at Infinite Loop for decades…)
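The heart of the job is the Hive script (the hive.q file referenced below). As a reference, a minimal hive.q for this task could look like the sketch that follows – the bucket name, paths and table/column names are my own illustrative assumptions, so adjust them to your own S3 layout:

    -- external table over the raw Yahoo! Finance CSV uploaded to S3
    CREATE EXTERNAL TABLE stockprice (
      trade_date STRING, open_price FLOAT, high_price FLOAT,
      low_price FLOAT, close_price FLOAT, stock_volume BIGINT,
      adjclose_price FLOAT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    LOCATION 's3://stokcprice/apple/input/';

    -- yearly average of the closing price, written back to S3
    INSERT OVERWRITE DIRECTORY 's3://stokcprice/apple/output'
    SELECT year(trade_date), avg(close_price)
    FROM stockprice
    GROUP BY year(trade_date);

Note that INSERT OVERWRITE DIRECTORY writes plain text files with \001 (SOH) column separators, which is exactly the output format discussed at the end of the post.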

How to create Elastic MapReduce Jobs?

There are three steps to manage an EMR jobflow:

1./ Upload the script (i.e. the hive.q file) and the data to be processed onto S3. If you are unfamiliar with AWS, this is a good place to start to understand its structure and how to use it.

The test data used in the post is downloaded from the Yahoo! Finance website (historical data for the AAPL stock). Go to http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices and scroll down to the Download to Spreadsheet link. This will create a csv file (~6,950 lines) with the following columns: Date,Open,High,Low,Close,Volume,Adj Close. Remove the header (the first line) to leave only the relevant data in the csv file, as shown below.
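From a command line, stripping the header and uploading both files can be done along these lines – the table.csv file name, the bucket and the paths are assumptions on my part, and s3cmd is just one of several upload options besides the AWS console:

    # drop the header row (Date,Open,High,Low,Close,Volume,Adj Close)
    tail -n +2 table.csv > aapl.csv
    # upload the data and the Hive script to S3 (bucket/paths are illustrative)
    s3cmd put aapl.csv s3://stokcprice/apple/input/aapl.csv
    s3cmd put hive.q s3://stokcprice/apple/hive.q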

2./ Create and run the job flow from the AWS Elastic MapReduce console:

f./ If you want, you can configure debugging by defining an S3 log path and selecting “Enable Debugging” (optional). I highly recommend doing this if you are in the development phase:

g./ Set no bootstrap actions:

h./ Review the configuration before you hit the run button:

i./ Create the job flow:

j./ You can verify the job flow status as it goes from STARTING through RUNNING to SHUTDOWN.

Should any issues occur, you can check stderr, stdout and syslog from the “Debug” menu.

3./ Check the result:

After a few minutes of number crunching, the output will be generated in the s3://stokcprice/apple/output folder (e.g. a 000000 file). The file is in text format with year and average stock price columns (separated by SOH – start of heading – ASCII 001).
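To make the result human-readable, you can pull the file down and translate the SOH separators into tabs – s3cmd and the exact paths here are again my own assumptions:

    # fetch the result file and turn the \001 separators into tabs
    s3cmd get s3://stokcprice/apple/output/000000 .
    tr '\001' '\t' < 000000 | tail -3
    # the 2012 line should read something like: 2012<TAB>496.32138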


Spring for Apache Hadoop is a Spring project to support writing applications that can benefit from the integration of the Spring Framework and Hadoop. This post describes how to use Spring Data Apache Hadoop in an Amazon EC2 environment using the “Hello World” equivalent of Hadoop programming – a Wordcount application.
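The appeal of Spring Data Hadoop is that the MapReduce job is declared in an ordinary Spring application context, using the hadoop (hdp) XML namespace the project provides. Roughly – element and attribute names may differ slightly between the early milestone releases, so treat this as a sketch with illustrative paths – the Wordcount job can be wired up like this, reusing the mapper and reducer classes shipped in hadoop-examples:

    <?xml version="1.0" encoding="UTF-8"?>
    <beans xmlns="http://www.springframework.org/schema/beans"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xmlns:hdp="http://www.springframework.org/schema/hadoop"
           xsi:schemaLocation="http://www.springframework.org/schema/beans
               http://www.springframework.org/schema/beans/spring-beans.xsd
               http://www.springframework.org/schema/hadoop
               http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

        <!-- points the application at the (pseudo-distributed) cluster -->
        <hdp:configuration>
            fs.default.name=hdfs://localhost:9000
        </hdp:configuration>

        <!-- the Wordcount job, reusing the example mapper/reducer classes -->
        <hdp:job id="wordcountJob"
                 input-path="/user/ec2-user/input"
                 output-path="/user/ec2-user/output"
                 mapper="org.apache.hadoop.examples.WordCount$TokenizerMapper"
                 reducer="org.apache.hadoop.examples.WordCount$IntSumReducer"/>

        <!-- kicks the job off when the context starts; in some versions this
             is done with a plain JobRunner bean instead -->
        <hdp:job-runner id="runner" job-ref="wordcountJob" run-at-startup="true"/>
    </beans>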

– Select Launch Instance, then Classic Wizard, and click Continue. My test environment was a “Basic Amazon Linux AMI 2011.09” 32-bit, instance type Micro (t1.micro, 613 MB), security group quick-start-1 that enables ssh to be used for login. Select your existing key pair (or create a new one). Obviously you can select another AMI and instance type depending on your favourite flavour. (Should you opt for a Windows 2008 based instance, you also need to have cygwin installed as an additional Hadoop prerequisite besides Java JDK and ssh, see the “Install Apache Hadoop” section.)

2./ Download Apache Hadoop – as of writing this article, 1.0.0 is the latest stable version of Apache Hadoop, and that is what was used for testing purposes. I downloaded hadoop-1.0.0.tar.gz and copied it into the /home/ec2-user directory using the pscp command from my PC running Windows, along these lines:
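The key file and the public DNS name below are placeholders for your own instance details:

    rem on the Windows PC (pscp ships with PuTTY); use the instance's .ppk key
    pscp -i mykey.ppk hadoop-1.0.0.tar.gz ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:/home/ec2-user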

Spring Data Hadoop uses gradle as its build tool; check the build.gradle build file. The original version packaged in the tar.gz file does not compile: it complains about thrift, version 0.2.0, and jdo2-api, version 2.3-ec.

Unfortunately, there seems to be no Maven repo for thrift 0.2.0. You should download the thrift-0.2.0.jar and thrift-0.2.0.pom files, e.g. from this repo: “http://people.apache.org/~rawson/repo”, and then add them to your local Maven repo, for instance:
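Using the downloaded POM saves you from guessing the coordinates, since install:install-file reads groupId, artifactId and version from it:

    # register the downloaded artifact in the local Maven repository (~/.m2)
    mvn install:install-file -Dfile=thrift-0.2.0.jar -DpomFile=thrift-0.2.0.pom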

BigData, and particularly Hadoop/MapReduce, represents a quickly growing part of Business Intelligence and Data Analytics. In his frequently quoted article on O’Reilly Radar, Edd Dumbill gives a good introduction to the big data landscape: What is big data?

Three V-words recur when experts attempt to define what big data is all about: Volume (terabytes and petabytes of information), Velocity (data is literally streaming in at unprecedented speed) and Variety (structured and unstructured data). You can convert these V-words into a fourth one: Value. BigData promises insights about things that remained hidden until now.

The intention of this blog is to cover these technologies end to end: from cloud computing, which provides the infrastructure, through Hadoop distributions, which crunch the numbers, to mobile analytics, which offers easy access to the results of the complex algorithms and enormous computing capacity.