FastR+Apache Flink

Over the past few years, R has become an important language for data analysis, data representation and visualization. R is a very expressive language that combines functional and dynamic aspects with laziness and object-oriented programming. However, the default R implementation is neither fast nor distributed, and both features are crucial for "big data" processing.

Here, the FastR-Flink compiler is presented: a compiler based on FastR, Oracle's R implementation, with support for some operations of Apache Flink, a Java/Scala framework for distributed data processing. Apache Flink constructs such as map, reduce and filter are integrated at the compiler level, so that distributed stream and batch data processing applications can be executed directly from the R programming language.
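
To make the three constructs named above concrete, here is a minimal local sketch of what map, filter and reduce do to a collection. It uses only `java.util.stream` from the Java standard library as a single-JVM analogy; it is not the Flink API and not FastR-Flink code, and the class and method names are made up for illustration.

```java
import java.util.List;

// Local, single-JVM analogy for map / filter / reduce.
// Assumption: plain java.util.stream, NOT the Flink API.
class MapFilterReduceSketch {
    // Squares every element, keeps the even squares, and sums them.
    static int sumOfEvenSquares(List<Integer> input) {
        return input.stream()
                .map(x -> x * x)           // map: transform each element
                .filter(x -> x % 2 == 0)   // filter: keep elements matching a predicate
                .reduce(0, Integer::sum);  // reduce: fold all elements into one value
    }

    public static void main(String[] args) {
        System.out.println(sumOfEvenSquares(List.of(1, 2, 3, 4, 5)));
    }
}
```

In a distributed framework the same three steps would run over partitions of the data on several workers; the sketch only shows their semantics.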

19.
OpenJDK Graal - New JIT for Java
● The Graal compiler is a JIT compiler for Java that is itself written in Java
● The Graal VM is a modification of the HotSpot VM: it replaces the client and server compilers with the Graal compiler
● Open Source: http://openjdk.java.net/projects/graal/

37.
6. FastR-Flink Execution, JIT Compilation
● Each Flink execution thread has a copy of the R code
● It builds the AST and starts interpreting it with Truffle
● After a while, execution reaches the JIT-compiled code
● The language context (RContext) is shared among the threads
● Scope variables are passed by broadcast
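
The execution model in the bullets above can be sketched with a toy example. Everything here is an assumption made for illustration: the tiny "mul k;add 1" source format, the `Worker` and `SharedContext` classes, and the call-count threshold are hypothetical, and the "compile" step merely fuses the nodes into one function. Truffle and Graal work very differently internally; the sketch only mirrors the shape of the steps (per-thread copy of the code, AST interpretation, a later switch to a compiled form, one shared context, variables supplied by broadcast).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntUnaryOperator;

class WorkerExecutionSketch {
    // Stand-in for the language context; one instance is shared by all threads.
    static final class SharedContext {
        final String languageName = "toy";
    }

    // One AST node: applies a single arithmetic step.
    interface Node { int execute(int value); }

    // Per-thread worker: owns its copy of the source and its own AST.
    static final class Worker {
        static final int COMPILE_THRESHOLD = 3; // hypothetical "hotness" threshold
        private final List<Node> ast = new ArrayList<>();
        private IntUnaryOperator compiled;      // set once the code is "hot"
        private int calls;
        final SharedContext context;

        Worker(String source, Map<String, Integer> broadcast, SharedContext context) {
            this.context = context;
            for (String step : source.split(";")) {
                String[] parts = step.trim().split(" ");
                // The operand is a broadcast variable name or an integer literal.
                int operand = broadcast.containsKey(parts[1])
                        ? broadcast.get(parts[1]) : Integer.parseInt(parts[1]);
                if (parts[0].equals("mul")) ast.add(v -> v * operand);
                else if (parts[0].equals("add")) ast.add(v -> v + operand);
                else throw new IllegalArgumentException("unknown op: " + parts[0]);
            }
        }

        boolean isCompiled() { return compiled != null; }

        int apply(int value) {
            if (compiled != null) return compiled.applyAsInt(value); // fast path
            int result = value;
            for (Node n : ast) result = n.execute(result);           // interpret the AST
            if (++calls >= COMPILE_THRESHOLD) {
                // Toy "compilation": fuse all nodes into a single function.
                IntUnaryOperator fused = IntUnaryOperator.identity();
                for (Node n : ast) fused = fused.andThen(n::execute);
                compiled = fused;
            }
            return result;
        }
    }

    // Runs the same source on several threads, each with its own Worker.
    static List<Integer> runOnThreads(String source, Map<String, Integer> broadcast,
                                      int threads, int input) throws Exception {
        SharedContext context = new SharedContext();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                futures.add(pool.submit(() -> {
                    Worker w = new Worker(source, broadcast, context);
                    int last = 0;
                    for (int i = 0; i < 5; i++) last = w.apply(input);
                    return last;
                }));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : futures) results.add(f.get());
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runOnThreads("mul k;add 1", Map.of("k", 3), 4, 2));
    }
}
```

Each worker returns the same value whether it is still interpreting or has already switched to the fused function, which is the property a JIT must preserve.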

38.
7. Data Transformation - Get the result
RVectorFlink
DataSets
Build RVectors:
The R user gets the final results in the appropriate R data representation

42.
Some Questions
● Is there any heuristic to know the ideal heap distribution?
● Is there any way to format the names of the operations for the Flink Web Interface, e.g. R function names or custom names for easier tracking?
● Is there any way to attach custom initialization code to the TaskManager?

47.
And Distributed Memory?
● It works (I will show you in a demo in a bit)
● Some performance issues concerning:
○ Multi-thread context issues.
○ Each Truffle language has a context initialized within the application
○ It will need pre-initialization within the Flink TaskManager
● This is work in progress
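
One generic way to picture the pre-initialization mentioned above is to create an expensive context once per JVM and hand the same instance to every task thread. The sketch below uses the standard Java lazy-holder idiom; `LanguageContext`, `preInitialize` and the counter are hypothetical names for illustration only. It is not a Flink TaskManager hook and not how Truffle contexts are actually created, and it sidesteps the multi-thread context issues listed above.

```java
import java.util.concurrent.atomic.AtomicInteger;

class ContextPreInitSketch {
    // Counts how often the expensive initialization actually runs.
    static final AtomicInteger INIT_COUNT = new AtomicInteger();

    // Stand-in for an expensive-to-create language context (hypothetical).
    static final class LanguageContext {
        LanguageContext() { INIT_COUNT.incrementAndGet(); }
    }

    // Holder idiom: the JVM initializes INSTANCE once, thread-safely, on first access.
    private static final class Holder {
        static final LanguageContext INSTANCE = new LanguageContext();
    }

    static LanguageContext sharedContext() { return Holder.INSTANCE; }

    // Hypothetical hook: call once when the worker process starts,
    // so that the first task does not pay the initialization cost.
    static void preInitialize() { sharedContext(); }

    public static void main(String[] args) throws InterruptedException {
        preInitialize();
        Thread[] tasks = new Thread[4];
        for (int i = 0; i < tasks.length; i++) {
            tasks[i] = new Thread(ContextPreInitSketch::sharedContext);
            tasks[i].start();
        }
        for (Thread t : tasks) t.join();
        System.out.println("initializations: " + INIT_COUNT.get());
    }
}
```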