Michele Usuelli

Michele works for Microsoft as a Lead Data Scientist and is the author of two books about machine learning with R. He previously worked for Revolution Analytics, a startup that developed a big data extension for R; the company was acquired by Microsoft in 2015.

R Machine Learning Essentials will be published soon. Its target audience is readers who want to get familiar with machine learning quickly; the only prerequisite is some knowledge of data analysis and/or coding concepts.

My previous article showed an example in which data analysis requires a structured framework, built with R and object-oriented programming (OOP). This article explains in more detail how to build such a framework.

Using OOP means creating new data structures and defining their methods, i.e. functions that perform a specific task on the object. Defining a new data structure requires creating a new class, and this article shows how to do that with R's S4 classes.
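As a minimal sketch of the mechanism, here is a hypothetical S4 class: the class name "Dataset", its slots, and the "describe" generic are illustrative choices, not part of the framework itself.

```r
# define a hypothetical S4 class with two slots
setClass(
  "Dataset",
  slots = c(data = "data.frame", name = "character")
)

# define a generic and a method, i.e. a function performing
# a specific task on objects of class "Dataset"
setGeneric("describe", function(object) standardGeneric("describe"))
setMethod("describe", "Dataset", function(object) {
  cat("Dataset:", object@name, "-", nrow(object@data), "rows\n")
})

# build an instance with new() and call the method on it
ds <- new("Dataset", data = mtcars, name = "Motor Trend cars")
describe(ds)
```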

Data analysis deals with different kinds of data.
For instance, supermarket sales might involve
- a transactional table, with the customer ID, the item ID, and the date of purchase
- an item table, with the item ID and its price
- a customer table, with the customer ID and the customer's demographic details (age and gender)
In this example, the data consist of tables with different structures, as sketched below.
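As a minimal sketch, the three tables might look like the following data frames; the column names and values are hypothetical.

```r
# transactional table: one row per purchase
transactions <- data.frame(
  customer.id = c(1, 1, 2),
  item.id     = c("A", "B", "A"),
  date        = as.Date(c("2015-01-10", "2015-01-10", "2015-01-12"))
)

# item table: one row per item, with its price
items <- data.frame(
  item.id = c("A", "B"),
  price   = c(2.50, 4.00)
)

# customer table: one row per customer, with demographic details
customers <- data.frame(
  customer.id = c(1, 2),
  age         = c(34, 51),
  gender      = c("F", "M")
)
```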

R can be connected with Hadoop through the rmr2 package. The core of this package is the mapreduce() function, which allows you to write custom MapReduce algorithms. The aim of this article is to show how it works and to provide an example.
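As a small sketch of how mapreduce() is used, the following job counts the occurrences of each label in a vector. The input data are made up, and the local backend is used so the snippet runs without a Hadoop cluster.

```r
library(rmr2)

# prototype locally, without a Hadoop cluster
rmr.options(backend = "local")

# hypothetical input: 1000 random labels, stored into the (local) DFS
labels <- to.dfs(sample(c("a", "b", "c"), size = 1000, replace = TRUE))

# map: emit each label as a key with a count of 1
# reduce: sum the counts for each key
counts <- mapreduce(
  input = labels,
  map = function(k, v) keyval(v, 1),
  reduce = function(k, vv) keyval(k, sum(vv))
)

# read the result back into memory
from.dfs(counts)
```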

As mentioned in the previous article, one way of dealing with some Big Data problems is to integrate R within the Hadoop ecosystem. This requires a bridge between the two environments: R must be able to handle data stored in the Hadoop Distributed File System (HDFS). In order to process the distributed data, all the algorithms must follow the MapReduce model, which makes it possible to handle the data and to parallelize the jobs. Another requirement is a unified analysis procedure, so there must be a connection between in-memory data and data stored in HDFS.
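A minimal sketch of this bridge uses the rmr2 functions to.dfs() and from.dfs() to move data between R's memory and the (distributed) file system; the local backend is assumed here, so the snippet runs without a cluster.

```r
library(rmr2)
rmr.options(backend = "local")  # simulate HDFS on the local file system

# push an in-memory R object into the file system
dfs.data <- to.dfs(mtcars)

# distributed processing via mapreduce() would happen at this point

# pull the resulting key-value pairs back into memory
in.memory <- from.dfs(dfs.data)
str(in.memory)
```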

Since R keeps its data in RAM, it can handle only datasets that fit in memory. There are packages that make it possible to work with larger volumes, but the most scalable solution is to connect R with a Big Data environment. This post introduces some Big Data concepts that are fundamental to understanding how R can work in such an environment. Some follow-up posts will then explain in detail how R can be connected with Hadoop.