Hadoop, bigdata, cloud computing and mobile BI

Introduction

R is a programming language and software environment for data analysis, statistical computing and data visualization. It is highly extensible, with object-oriented features and strong graphical capabilities. At its heart, R is an interpreted language that comes with a command-line interpreter (available for Linux, Windows and Mac), and IDEs such as RStudio and JGR also support development.

R and Hadoop complement each other well and are a natural match for big data analytics and visualization. One of the best-known sets of R packages supporting Hadoop functionality is RHadoop, developed by Revolution Analytics.

To install these R packages, we first need to install the R base package. On Ubuntu 12.04 LTS we can do this by running:

$ sudo apt-get install r-base

Then we need to install the RHadoop packages with their dependencies. rmr requires Rcpp, RJSONIO, digest, functional, stringr and plyr, while rhdfs requires rJava.
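As a rough sketch of the dependency step, the snippet below lists the packages named above and reports which of them are not yet installed. The variable names are illustrative, and the CRAN install call is left commented out because it assumes network access and that every package is still available from the configured repository.

```r
# Dependencies named in the text (rmr and rhdfs respectively)
rmr_deps   <- c("Rcpp", "RJSONIO", "digest", "functional", "stringr", "plyr")
rhdfs_deps <- c("rJava")
deps <- c(rmr_deps, rhdfs_deps)

# Compare against the packages already present in the local library
to_install <- setdiff(deps, rownames(installed.packages()))

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Assumption: a CRAN mirror is reachable; uncomment to install
  # install.packages(to_install)
} else {
  message("All RHadoop dependencies are already installed.")
}
```

Checking first keeps the step safe to re-run on a machine where some of the packages are already present.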

As part of the installation, we need to reconfigure Java for the rJava package and set the HADOOP_CMD environment variable for the rhdfs package. The corresponding tar.gz archives have to be downloaded first; then we can run the R CMD INSTALL command with sudo privileges.
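The configuration can also be sketched from inside an R session. The Hadoop path and the archive file name below are assumptions for illustration only; use the location of the hadoop executable on your machine and the archive you actually downloaded. Java reconfiguration itself is done outside R, typically with `sudo R CMD javareconf`.

```r
# Assumed location of the hadoop executable; adjust for your installation
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
hadoop_cmd <- Sys.getenv("HADOOP_CMD")
message("HADOOP_CMD is set to: ", hadoop_cmd)

# Hypothetical archive name; the real file name depends on the downloaded version
archive <- "rhdfs.tar.gz"

if (file.exists(archive)) {
  # Installing from a local source archive, the in-session
  # counterpart of running R CMD INSTALL on the same file
  install.packages(archive, repos = NULL, type = "source")
} else {
  message("Archive not found, skipping installation: ", archive)
}
```

Setting the variable with `Sys.setenv` only affects the current R session; to make it permanent it would have to be exported in the shell profile or in an R startup file.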

GDP data can be downloaded from the World Bank data catalog site. The data needs to be adjusted to be suitable for the MapReduce algorithm. In the final format that we used for data analysis, the last column is the GDP of the given country in millions of USD.

Then we get the counts of how many countries have a greater and how many countries have a smaller GDP than Apple Inc.’s revenue in 2012. The result is that 55 countries had a greater GDP than Apple and 138 countries had a smaller one.

$key
GDP
1 "greater"
56 "less"
$val
[1] 55 138
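The counting logic behind this result can be sketched in plain R, without a Hadoop cluster. This is not the original job: the data frame is made-up toy data, the function names are illustrative, and the Apple revenue figure (156,508 million USD for fiscal year 2012) is an assumption stated here, not a value taken from the text.

```r
# Made-up toy data, GDP in millions USD (not World Bank figures)
gdp <- data.frame(
  code = c("AAA", "BBB", "CCC", "DDD"),
  gdp  = c(2500000, 120000, 156508, 900),
  stringsAsFactors = FALSE
)

# Assumption: Apple's fiscal 2012 revenue in millions USD
apple_revenue <- 156508

# Map step: emit a ("greater" or "less", 1) pair for each country
map_fn <- function(values) {
  key <- ifelse(values < apple_revenue, "less", "greater")
  list(key = key, val = rep(1L, length(values)))
}

# Reduce step: add up the emitted ones for a single key
reduce_fn <- function(key, vals) sum(vals)

mapped <- map_fn(gdp$gdp)
groups <- split(mapped$val, mapped$key)   # stands in for the shuffle phase
counts <- sapply(names(groups), function(k) reduce_fn(k, groups[[k]]))
print(counts)
```

With the rmr package, functions of this shape would be handed to its `mapreduce` call and would emit their pairs with `keyval`; the plain-R version only mirrors the map, group and reduce phases locally.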

The following screenshot from RStudio shows the histogram of GDPs: 15 countries have a GDP of more than 1,000 billion USD; 1 country is in the range of 14,000 to 15,000 billion USD, 1 country is in the range of 7,000 to 8,000 billion USD and 1 country is in the range of 5,000 to 6,000 billion USD.
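A histogram like this can be produced with base R's `hist` function. The sketch below uses made-up values, expressed in the units of the histogram axis, and assumes fixed bins 1,000 units wide; it is not the data or the exact call behind the screenshot.

```r
# Made-up GDP values in the histogram's axis units (not World Bank data)
gdp_values <- c(14500, 7300, 5800, 2400, 1100, 500, 200, 50)

# Fixed-width bins of 1,000 units, from 0 up to 15,000
bins <- seq(0, 15000, by = 1000)

# Compute the bin counts without drawing, so they can be inspected first
h <- hist(gdp_values, breaks = bins, plot = FALSE)

# Countries beyond the first bin, i.e. with a GDP greater than 1,000 units
above_first_bin <- sum(h$counts[-1])
print(above_first_bin)

# Draw the histogram from the precomputed bins
plot(h, main = "Histogram of GDP", xlab = "GDP")
```

Reading the counts from `h$counts` is how statements such as "1 country is in the 14,000 to 15,000 range" can be checked against the plot.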

Conclusion

If you need to combine strong data analytics and visualization features with the big data capabilities of Hadoop, it is certainly worth taking a closer look at RHadoop. It has packages to integrate R with MapReduce, HDFS and HBase, key components of the Hadoop ecosystem. For more details, please read the R and Hadoop Big Data Analytics whitepaper.