Category Archives: Spark

I use Play Framework to build web application. I also use Akka and Spark in my projects. Currently, my programming language is Scala, even though in the past, I used C/C++, Python, and etc. But I’m still not really understand what is Typesafe Reactive Platform. In this post, I want to reveal its truth.

What is the Problem?

Every solution should try to solve one problem. So what is the problem which Typesafe Reactive Platform wants to deal with?

In the past, a large application had tens or so servers, seconds of response time, hours of offline maintenance and gigabytes of data. Old solutions emphasized on managing servers and containers. Scaling was achieved through buying more larger servers and concurrent processing via multi-threading.

But today applications are deployed on cloud-based clustering running thousands of multicore processors, dealing with thousand requests and handling petabytes. So old traditional scaling method is not suitable for new updated requirements.

What is Reactive Platform?

Because old solution is not good enough, we introduce a new solution, named Reactive Platform. This platform allows developers to build systems which are responsive resilient, elastic, and message-driven in order to deliver highly responsive user experiences with a real-time feel, backed by a elastic and resilient application stack, ready to be deployed on multicore and cloud computing architectures.

So Typesafe Reactive Platform is this kind of platform, which contains Play Framework, Akka, Scala, Activator and Spark. It is powerful tool to build modern applications that react to events, react to load, react to failure, and react to users.

Step1: Install

Step2: Configuration

DStream means discretized stream, which represents a continuous stream of data. In fact, DStream is a sequence of RDDs. Each RDD is a DStream contains data from a certain interval. Any operation applied on a DStream translates to operations on the underlying RDDs.

Window Operations

window operations allow to apply transformation over a sliding window of data. There are two parameters: windowLenght and slideInterval.

What is RDD?

RDD means Resilient Distributed Datasets(immutable, only-read), which can’t be changed, only can be transferred from other RDD. RDD contains how is it transferred from others and how to re-build the data information.

One RDD contains many partitions; each partition is a dataset fragment. One Partition will produce one task.

RDD is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDD, one is to parallelize the collection, the other is to reference a dataset in an external storage system. When you use parallelization to get RDD, it will cut the dataset into partitions. You can point out the number of partitions by following way. Default, Spark tries to set the number of partitions automatically based on your cluster.

sc.parallelize(data, 10)

Resilient

RDD records update method to make sure resilient. But the update information is too much; this cost is not less. So RDD only records coarse-grained transformation: lineage.

RDD transformations will not be actually evaluated your RDD by Spark, only until it run actions on your RDD. So you can use

myRDD.foreach(println)

to println out RDD. If you want to println all RDD on one console, you can use following.

myRDD.collect().foreach(println)

But it is not good when you have millions lines, so you can use following.

myRDD.take(n).foreach(println)

If you are running this on a cluster then println, which won’t print back to your context. You need to bring your RDD data to your seesion. In order to do this, you can force it to local array and then print it out.