
Big Data Analytics with R and Hadoop

Copyright 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing, nor its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: November 2013

Production Reference:

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN

Cover Image by Duraid Fatouhi

About the Author

Vignesh Prajapati, from India, is a Big Data enthusiast, a Pingax (www.pingax.com) consultant, and a software professional at Enjay. He is an experienced ML data engineer who works with machine learning and Big Data technologies such as R, Hadoop, Mahout, Pig, Hive, and related Hadoop components to analyze datasets and derive informative insights through data analytics cycles.

He completed his B.E. at Gujarat Technological University in 2012 and started his career as a data engineer at Tatvic. His professional experience includes developing various data analytics algorithms for Google Analytics data sources in order to add economic value to products. To put ML into action, he implemented several analytical apps in collaboration with Google Analytics and Google Prediction API services. He also contributes to the R community by developing the RGoogleAnalytics R library as an open source Google Code project, and he writes articles on data-driven technologies.

Vignesh is not limited to a single domain; he has also developed various interactive apps using Google APIs, such as the Google Analytics API, Realtime API, Google Prediction API, Google Chart API, and Translate API, on the Java and PHP platforms. He is highly interested in the development of open source technologies.

Vignesh has also reviewed the Apache Mahout Cookbook for Packt Publishing. That book provides a fresh, scope-oriented approach to the Mahout world for beginners as well as advanced users. It is designed to make users aware of the different possible machine learning applications, strategies, and algorithms for building intelligent Big Data applications.

Acknowledgment

First and foremost, I would like to thank my loving parents and younger brother Vaibhav for standing beside me throughout my career as well as while writing this book. Without their support, it would have been impossible to achieve this knowledge sharing. As I started writing this book, I was continuously motivated by my father (Prahlad Prajapati) and regularly followed up by my mother (Dharmistha Prajapati). Thanks also to my friends for encouraging me to start writing about big technologies such as Hadoop and R.

During this writing period I went through some critical phases of my life, which were challenging for me at all times. I am grateful to Ravi Pathak, CEO and founder at Tatvic, who introduced me to this vast field of machine learning and Big Data and helped me realize my potential. And yes, I can't forget James, Wendell, and Mandar from Packt Publishing for their valuable support, motivation, and guidance to achieve these heights. Special thanks to them for bridging the communication gap on the technical and graphical sections of this book.

Thanks to Big Data and machine learning. Finally, a big thanks to God: you have given me the power to believe in myself and pursue my dreams. I could never have done this without the faith I have in you, the Almighty.

Let us go forward together into the future of Big Data analytics.

About the Reviewers

Krishnanand Khambadkone has over 20 years of overall experience. He is currently working as a senior solutions architect in the Big Data and Hadoop Practice of TCS America and is architecting and implementing Hadoop solutions for Fortune 500 clients, mainly large banking organizations. Prior to this he worked on delivering middleware and SOA solutions using the Oracle middleware stack and built and delivered software using the J2EE product stack. He is an avid evangelist and enthusiast of Big Data and Hadoop. He has written several articles and white papers on this subject, and has also presented these at conferences.

Muthusamy Manigandan is the Head of Engineering and Architecture with Ozone Media. Mani has more than 15 years of experience in designing large-scale software systems in the areas of virtualization, Distributed Version Control systems, ERP, supply chain management, Machine Learning and Recommendation Engine, behavior-based retargeting, and behavior targeting creative. Prior to joining Ozone Media, Mani handled various responsibilities at VMware, Oracle, AOL, and Manhattan Associates. At Ozone Media he is responsible for products, technology, and research initiatives. Mani can be reached at yahoo.co.uk and

Vidyasagar N V has had an interest in computer science since an early age. Some of his serious work in computers and computer networks began during his high school days. Later he went to the prestigious Institute Of Technology, Banaras Hindu University for his B.Tech. He is working as a software developer and data expert, developing and building scalable systems. He has worked with a variety of second, third, and fourth generation languages. He has also worked with flat files, indexed files, hierarchical databases, network databases, and relational databases, as well as NoSQL databases, Hadoop, and related technologies. Currently, he is working as a senior developer at Collective Inc., developing Big-Data-based structured data extraction techniques using the web and local information. He enjoys developing high-quality software and web-based solutions, and designing secure and scalable data systems.

I would like to thank my parents, Mr. N Srinivasa Rao and Mrs. Latha Rao, and my family who supported and backed me throughout my life, and friends for being friends. I would also like to thank all those people who willingly donate their time, effort, and expertise by participating in open source software projects. Thanks to Packt Publishing for selecting me as one of the technical reviewers on this wonderful book. It is my honor to be a part of this book. You can contact me at

Siddharth Tiwari has been in the industry for the past three years, working on machine learning, text analytics, Big Data management, and information search and management. Currently he is employed by EMC Corporation's Big Data management and analytics initiative and product engineering wing for their Hadoop distribution. He is a part of the TeraSort and MinuteSort world records, achieved while working with a large financial services firm. He pursued Bachelor of Technology from Uttar Pradesh Technical University with equivalent CGPA 8.


Preface

The volume of data that enterprises acquire every day is increasing exponentially. It is now possible to store these vast amounts of information on low-cost platforms such as Hadoop.

The conundrum these organizations now face is what to do with all this data and how to glean key insights from it. This is where R comes into the picture. R is a powerful tool that makes it a snap to run advanced statistical models on data, translate the derived models into colorful graphs and visualizations, and perform many other data science functions.

One key drawback of R, though, is that it is not very scalable. The core R engine can process only a very limited amount of data. As Hadoop is very popular for Big Data processing, combining R with Hadoop for scalability is the next logical step.

This book is dedicated to R and Hadoop and the intricacies of how the data analytics operations of R can be made scalable by using a platform such as Hadoop.

With this agenda in mind, this book will cater to a wide audience including data scientists, statisticians, data architects, and engineers who are looking for solutions to process and analyze vast amounts of information using R and Hadoop.

Using R with Hadoop will provide an elastic data analytics platform that will scale depending on the size of the dataset to be analyzed. Experienced programmers can then write Map/Reduce modules in R and run them using Hadoop's parallel processing Map/Reduce mechanism to identify patterns in the dataset.

Introducing R

R is an open source software package for performing statistical analysis on data. It is a programming language used by data scientists, statisticians, and others who need to analyze data statistically and glean key insights from it using mechanisms such as regression, clustering, classification, and text analysis. R is licensed under the GNU General Public License. It was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently maintained by the R Development Core Team. It can be considered a different implementation of S, which was developed by John Chambers at Bell Labs. There are some important differences, but a lot of code written in S runs unaltered under the R interpreter.

R provides a wide variety of statistical and machine learning techniques (linear and nonlinear modeling, classic statistical tests, time-series analysis, classification, clustering) as well as graphical techniques, and is highly extensible. R has various built-in as well as extended functions for statistical, machine learning, and visualization tasks such as:

- Data extraction
- Data cleaning
- Data loading
- Data transformation
- Statistical analysis
- Predictive modeling
- Data visualization

It is one of the most popular open source statistical analysis packages available today. It is cross-platform and has a large, ever-growing community of users who provide support and add new packages every day. With its growing list of packages, R can now connect with other data stores, such as MySQL, SQLite, MongoDB, and Hadoop, for data storage activities.

Understanding features of R

Let's look at some useful features of R:

- Effective programming language
- Relational database support
- Data analytics
- Data visualization
- Extension through the vast library of R packages

Studying the popularity of R

The graph provided from KD suggests that R is the most popular language for data analysis and mining.

A second graph provides details about the total number of R packages released by R users since 2005, which is one way to gauge the R user base. The growth was exponential in 2012, and it seems that 2013 is on track to beat that.

R allows data analytics to be performed through various statistical and machine learning operations, as follows:

- Regression
- Classification
- Clustering
- Recommendation
- Text mining

Introducing Big Data

Big Data deals with large and complex datasets that can be structured, semi-structured, or unstructured and that will typically not fit into memory to be processed. They have to be processed in place, which means that the computation has to be done where the data resides.

When we talk to developers, the people actually building Big Data systems and applications, we get a better idea of what they mean by Big Data. They typically mention the 3Vs model of Big Data: velocity, volume, and variety.

Velocity refers to the low-latency, real-time speed at which the analytics need to be applied. A typical example would be performing analytics on a continuous stream of data originating from a social networking site, or on an aggregation of disparate sources of data.

Volume refers to the size of the dataset. It may be in KB, MB, GB, TB, or PB, depending on the type of application that generates or receives the data.

Variety refers to the various types of data that can exist, for example, text, audio, video, and photos.

Big Data usually includes datasets whose sizes exceed what conventional systems can handle; such systems cannot process this amount of data within the time frame mandated by the business. Big Data volumes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single dataset. To meet this seemingly insurmountable challenge, entirely new platforms have emerged, called Big Data platforms.

Getting information about popular organizations that hold Big Data

Some of the popular organizations that hold Big Data are as follows:

- Facebook: It has 40 PB of data and captures 100 TB/day
- Yahoo!: It has 60 PB of data
- Twitter: It captures 8 TB/day
- eBay: It has 40 PB of data and captures 50 TB/day
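To put the figures in the list above in proportion, the following rough Python sketch relates stored volume to daily capture rate. It assumes decimal units (1 PB = 1,000 TB) and a constant capture rate, neither of which the text states, and the function name `days_to_accumulate` is hypothetical.

```python
# Assumption: decimal units, 1 PB = 1,000 TB (the text does not say which convention it uses).
TB_PER_PB = 1000


def days_to_accumulate(stored_pb, capture_tb_per_day):
    """Days needed to capture `stored_pb` petabytes at a constant daily rate."""
    return stored_pb * TB_PER_PB / capture_tb_per_day


# Figures quoted in the list above.
facebook_days = days_to_accumulate(40, 100)
ebay_days = days_to_accumulate(40, 50)
print(facebook_days, ebay_days)
```

Under these assumptions, the same 40 PB corresponds to very different accumulation times depending on the capture rate, which is one reason volume and velocity are treated as separate dimensions.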

How much data is considered Big Data differs from company to company. Though it is true that one company's Big Data is another's small data, there is something in common: the data does not fit in memory or on disk, it arrives as a rapid influx that needs to be processed, and it would benefit from distributed software stacks. For some companies, 10 TB of data would be considered Big Data, and for others 1 PB would be Big Data. So only you can determine whether your data is really Big Data. It is sufficient to say that it starts in the low terabyte range.

A question well worth asking is whether you believe you have no Big Data problem only because you are not capturing and retaining enough of your data. In some scenarios, companies literally discard data because there was no cost-effective way to store and process it. With platforms such as Hadoop, it is possible to start capturing and storing all that data.

Introducing Hadoop

Apache Hadoop is an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware. Hadoop is a top-level Apache project, initiated and led by Yahoo! and Doug Cutting. It relies on an active community of contributors from all over the world for its success.

With a significant technology investment by Yahoo!, Apache Hadoop has become an enterprise-ready cloud computing technology. It is becoming the industry's de facto framework for Big Data processing.

Hadoop changes the economics and the dynamics of large-scale computing. Its impact can be boiled down to four salient characteristics: Hadoop enables scalable, cost-effective, flexible, and fault-tolerant solutions.

Exploring Hadoop features

Apache Hadoop has two main features:

- HDFS (Hadoop Distributed File System)
- MapReduce
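As a rough illustration of the MapReduce idea named above, here is a minimal single-process sketch in Python of a word count. It does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical, and a real Hadoop job would distribute these phases across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, mimicking what a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample input.
counts = word_count(["big data with R", "R and Hadoop", "big data"])
print(counts)
```

Because each map call looks at one line only and each reduce call looks at one key only, both steps can in principle run in parallel on different machines, which is the property the framework exploits.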

Studying Hadoop components

Hadoop includes an ecosystem of other products built over the core HDFS and MapReduce layer to enable various types of operations on the platform. A few popular Hadoop components are as follows:

- Mahout: This is an extensive library of machine learning algorithms.
- Pig: Pig is a high-level language (like Perl) for analyzing large datasets, with its own syntax for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
- Hive: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large datasets stored in HDFS. It has its own SQL-like query language called Hive Query Language (HQL), which is used to issue query commands to Hadoop.
- HBase: HBase (Hadoop Database) is a distributed, column-oriented database. HBase uses HDFS for the underlying storage. It supports both batch-style computations using MapReduce and point queries (random reads).
- Sqoop: Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured relational databases. Sqoop is an abbreviation for (SQ)L to Had(oop).
- ZooKeeper: ZooKeeper is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services, which are very useful for a variety of distributed systems.
- Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop.

Understanding the reason for using R and Hadoop together

Sometimes the data resides on HDFS (in various formats). Since a lot of data analysts are very productive in R, it is natural to use R to compute with the data stored through Hadoop-related tools.

As mentioned earlier, the strength of R lies in its ability to analyze data using a rich library of packages, but it falls short when it comes to working on very large datasets. The strength of Hadoop, on the other hand, is storing and processing very large amounts of data in the TB and even PB range. Such vast datasets cannot be processed in memory, as the RAM of each machine cannot hold them. The options are either to run the analysis on limited chunks of the data, also known as sampling, or to combine the analytical power of R with the storage and processing power of Hadoop, which gives an ideal solution. Such solutions can also be achieved in the cloud using platforms such as Amazon EMR.

What this book covers

Chapter 1, Getting Ready to Use R and Hadoop, gives an introduction as well as the process of installing R and Hadoop.

Chapter 2, Writing Hadoop MapReduce Programs, covers the basics of Hadoop MapReduce and ways to execute MapReduce using Hadoop.

Chapter 3, Integrating R and Hadoop, shows the deployment and running of sample MapReduce programs for RHadoop and RHIPE through various data handling processes.

Chapter 4, Using Hadoop Streaming with R, shows how to use Hadoop Streaming with R.

Chapter 5, Learning Data Analytics with R and Hadoop, introduces the data analytics project life cycle by demonstrating it with real-world data analytics problems.

Chapter 6, Understanding Big Data Analysis with Machine Learning, covers performing Big Data analytics using machine learning techniques with RHadoop.

Chapter 7, Importing and Exporting Data from Various DBs, covers how to interface with popular relational databases to import and export data with R.

Appendix, References, provides links to additional resources on the content of all the chapters.
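Returning to the earlier point that a dataset may not fit in the RAM of one machine: one way to see why splitting the work helps is a statistic that can be computed chunk by chunk and then merged. The following Python sketch computes a mean that way. It is an in-memory illustration only, not RHadoop or RHIPE code; the function names and the chunk size are hypothetical.

```python
def partial_stats(chunk):
    """Map-like step: reduce one chunk to a (sum, count) pair."""
    return sum(chunk), len(chunk)


def combine(partials):
    """Reduce-like step: merge per-chunk (sum, count) pairs into an overall mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


def chunks(values, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


# Hypothetical data: in practice each chunk would live on a different node.
data = list(range(1, 101))
mean = combine([partial_stats(chunk) for chunk in chunks(data, 30)])
print(mean)
```

Only the small (sum, count) pairs need to be brought together, so no single step ever has to hold the whole dataset. Unlike sampling, this uses every record.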

What you need for this book

As we are going to perform Big Data analytics with R and Hadoop, you should have basic knowledge of R and Hadoop, and to follow the practical examples you will need to have R and Hadoop installed and configured. It would be helpful if you already have a large dataset and a problem definition that can be solved with data-driven technologies, such as R and Hadoop functions.

Who this book is for

This book is great for R developers who are looking for a way to perform Big Data analytics with Hadoop. They will find techniques for integrating R and Hadoop, guidance on writing Hadoop MapReduce, and tutorials for developing and running Hadoop MapReduce within R. This book is also aimed at those who know Hadoop and want to build intelligent applications over Big Data with R packages. It would be helpful if readers have basic knowledge of R.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Preparing the Map() input."

A block of code is set as follows:

    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
      <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.
      </description>
    </property>

Any command-line input or output is written as follows:

    # Setting the environment variables for running Java and Hadoop commands
    export HADOOP_HOME=/usr/local/hadoop
    export JAVA_HOME=/usr/lib/jvm/java-6-sun
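To make the property block shown in the conventions above more concrete, here is a small Python sketch that reads such a property and splits the job tracker address into host and port. It uses only the standard library. Wrapping the property in a `<configuration>` root element, the `SNIPPET` constant, and the `read_properties` helper are assumptions made for this illustration, not something the text specifies.

```python
import xml.etree.ElementTree as ET

# Assumption: the property sits inside a <configuration> root element.
SNIPPET = """
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a dict mapping each property's <name> to its <value>."""
    root = ET.fromstring(xml_text.strip())
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }


props = read_properties(SNIPPET)
# The value has the form host:port.
host, port = props["mapred.job.tracker"].split(":")
print(host, port)
```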

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Open the Password tab."

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book, what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail and mention the book title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your Packt account. If you purchased this book elsewhere, you can register to have the files e-mailed directly to you.
