How can R and Hadoop be used together?

The most common way to link R and Hadoop is to use HDFS (potentially managed by Hive or HBase) as the long-term store for all data, and to use MapReduce jobs (potentially submitted from Hive, Pig, or Oozie) to encode, enrich, and sample data sets from HDFS into R. Data analysts can then perform complex modeling exercises in R on a prepared subset of the data. Itamar Rosenn has spoken about Facebook’s use of this workflow.
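The sampling step can be sketched as follows. This is only an illustration, not how Facebook or any particular tool does it: it assumes the prepared data arrives as a stream of text lines, and it uses reservoir sampling (my choice of technique) so that a fixed-size sample small enough for R can be drawn without knowing the stream's length up front. The function name and parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(lines: Iterable[str], k: int,
                     seed: Optional[int] = None) -> List[str]:
    """Return up to k lines chosen uniformly at random from a stream.

    Memory use is O(k) regardless of how long the stream is, so the
    result can be sized to fit comfortably in an R session.
    """
    rng = random.Random(seed)
    sample: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            sample.append(line)
        else:
            # Keep the new line with probability k / (i + 1) by picking
            # a slot in [0, i] and replacing only if it falls in [0, k).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample
```

The sampled lines could then be written to a delimited file and loaded into R with a function such as read.delim.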

Hive [8], Pig [9], and HBase [10] all have or will have facilities for sampling data; some web interfaces to Hadoop, such as Hue [13], allow for directly exporting data from Hadoop into R.

On August 2, 2010, Revolution Computing issued a press release that mentioned the integration of its distribution of R with Hadoop. A few months after issuing this press release, Revolution Computing released RHadoop. RHadoop consists of the R packages rhbase and rhdfs, which allow users to read from and write to HBase and HDFS respectively from R, and rmr, which allows users to submit Hadoop Streaming jobs from R.

RHIPE allows you to submit jobs to the Hadoop MapReduce implementation from the R interpreter. Many features of Hadoop MapReduce cannot be used from this environment.

For example code that exercises both RHIPE and rmr, .

Using Hadoop Streaming, you can write your Map and Reduce functions in any language that can read from stdin and write to stdout, including R. The HadoopStreaming R package facilitates this operation, but appears to be poorly maintained.
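To make the stdin/stdout contract concrete, here is a minimal word-count sketch. It is written in Python purely for illustration; an R script run with Rscript would follow the same shape. The function names and the "map"/"reduce" command-line argument are my own conventions, not part of Hadoop. The sketch assumes the usual Streaming defaults: the mapper emits tab-separated key/value lines, and the reducer receives those lines sorted by key.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Mapper: emit one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Reducer: sum the counts for each key.

    Relies on the input being sorted by key, so that all records for
    one key are adjacent.
    """
    pairs = (line.rstrip("\n").split("\t", 1)
             for line in sorted_lines if line.strip())
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    # Hypothetical convention: the first argument selects the stage.
    stage = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in stage(sys.stdin):
        sys.stdout.write(record + "\n")
```

Saved under a hypothetical name such as wc.py, the two stages can be checked locally without a cluster by piping a text file through "python wc.py map", then sort, then "python wc.py reduce", with sort standing in for Hadoop's shuffle.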

There’s also para-r, which seems to have had similar goals to RHIPE but has since been abandoned.

There’s also some discussion of how to use R with Amazon’s Elastic MapReduce that doesn’t turn up any new information. The Segue package from J.D. Long seems to offer a lightweight mechanism for spinning up an EMR cluster from R for compute-intensive processing.

I’ve also heard that some researchers at Duke have gotten their RIOT system pushing work down from R into Hadoop, but have no real evidence to back up that claim.

The Korean company NexR spoke about RHive at Hadoop World 2011. RHive allows Hive queries to be written and launched from R.

IBM published a paper at SIGMOD 2010 on a system called Ricardo that attempts to mediate between R, Jaql, and Hadoop.
