Friday, 3 June 2016

How to write Spark jobs in Java for Spark Job Server

In the previous post, we learnt how to set up Spark Job Server and run Spark jobs on it.
So far, we have used Scala programs to run on the job server. Now we'll see how to
write Spark jobs in Java to run on the job server.

As in Scala, a job must implement the SparkJob trait, which declares two methods, runJob and validate.

The runJob method contains the implementation of the job. The SparkContext is
managed by the job server and is provided to the job through this method. This
relieves the developer of the boilerplate configuration management that
comes with creating a Spark job, and allows the job server to manage
and re-use contexts.

The validate method allows for an initial validation of the context and any
provided configuration. If the context and configuration are OK to run the job,
returning spark.jobserver.SparkJobValid lets the job execute. Otherwise,
returning spark.jobserver.SparkJobInvalid(reason) prevents the job from running
and provides a means to convey the reason for the failure. In this case, the call
immediately returns an HTTP/1.1 400 Bad Request status code.
validate helps prevent running jobs that would eventually fail due to missing
or wrong configuration, which saves both time and resources.

In Java, we need to extend the JavaSparkJob
class. It has the following methods, which are overridden in the program:

runJob(jsc: JavaSparkContext, jobConfig: Config)

validate(sc: SparkContext, config: Config)

invalidate(jsc: JavaSparkContext, config: Config)

The JavaSparkJob class is available in the job-server-api package. Build
the job-server-api source code and add the resulting jar to your project.
Add Spark and any other required dependencies to your pom.xml.

Let’s start with the basic WordCount example:

WordCount.java:

    package spark.jobserver;

    import java.io.Serializable;
    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Pattern;

    import org.apache.commons.lang.StringUtils;
    import org.apache.spark.SparkContext;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.PairFunction;

    import scala.Tuple2;

    import spark.jobserver.JavaSparkJob;
    import spark.jobserver.SparkJobInvalid;
    import spark.jobserver.SparkJobValid$;
    import spark.jobserver.SparkJobValidation;

    import com.typesafe.config.Config;

    // The public class name must match the file name, WordCount.java.
    public class WordCount extends JavaSparkJob implements Serializable {

        private static final long serialVersionUID = 1L;

        private static final Pattern SPACE = Pattern.compile(" ");

        static String fileName = StringUtils.EMPTY;

        // The job itself: reads the input file and counts each word.
        public Object runJob(JavaSparkContext jsc, Config config) {
            try {
                JavaRDD<String> lines = jsc.textFile(
                        config.getString("input.filename"), 1);

                // Split every line into words (Spark 1.x FlatMapFunction returns an Iterable).
                JavaRDD<String> words = lines
                        .flatMap(new FlatMapFunction<String, String>() {
                            public Iterable<String> call(String s) {
                                return Arrays.asList(SPACE.split(s));
                            }
                        });

                // Pair each word with 1, then sum the counts per word.
                JavaPairRDD<String, Integer> counts = words.mapToPair(
                        new PairFunction<String, String, Integer>() {
                            public Tuple2<String, Integer> call(String s) {
                                return new Tuple2<String, Integer>(s, 1);
                            }
                        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
                            public Integer call(Integer i1, Integer i2) {
                                return i1 + i2;
                            }
                        });

                List<Tuple2<String, Integer>> output = counts.collect();
                System.out.println(output);
                return output;
            } catch (Exception e) {
                e.printStackTrace();
                return null;
            }
        }

        // Runs before runJob: rejects the request if the input file name is missing.
        public SparkJobValidation validate(SparkContext sc, Config config) {
            // hasPath avoids the exception that getString throws for an absent key.
            if (config.hasPath("input.filename")
                    && !config.getString("input.filename").isEmpty()) {
                return SparkJobValid$.MODULE$;
            } else {
                return new SparkJobInvalid(
                        "Input parameter is missing. Please mention the filename");
            }
        }

        public String invalidate(JavaSparkContext jsc, Config config) {
            return null;
        }
    }
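The counting logic inside runJob can be checked locally before deploying anything. The following is a minimal sketch, assuming plain Java streams stand in for the RDD operations (split on a space, pair each word with 1, sum per word). The class name LocalWordCount and the sample input are hypothetical and not part of the job server API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class LocalWordCount {

    private static final Pattern SPACE = Pattern.compile(" ");

    // Mirrors flatMap -> mapToPair -> reduceByKey on an in-memory list.
    public static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
                // flatMap: one stream element per word
                .flatMap(line -> Arrays.stream(SPACE.split(line)))
                // mapToPair + reduceByKey: start each word at 1, merge by summing
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        // Sample input is made up for illustration.
        System.out.println(count(Arrays.asList("to be or", "not to be")));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```

A TreeMap is used only to make the printed order deterministic. The Spark job returns a list of tuples in no guaranteed order.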

The next step is to compile the code, build the jar, and upload it to the job server.
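The upload and submit steps go through the job server's REST API: a POST to /jars/<appName> with the jar as the body, then a POST to /jobs with appName and classPath as query parameters and the job configuration as the body. Below is a minimal sketch using the Java 11 HTTP client. It makes several assumptions that are not from this post: the server listens on localhost:8090, the app is named wordcount, the jar sits at target/wordcount.jar, the job class is spark.jobserver.WordCount, and the input file is /tmp/input.txt. The class JobServerClient is hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;

public class JobServerClient {

    // Endpoint for uploading a jar under an app name: POST /jars/<appName>
    public static URI jarUri(String hostPort, String appName) {
        return URI.create("http://" + hostPort + "/jars/" + enc(appName));
    }

    // Endpoint for starting a job: POST /jobs?appName=...&classPath=...
    public static URI jobUri(String hostPort, String appName, String classPath) {
        return URI.create("http://" + hostPort + "/jobs?appName=" + enc(appName)
                + "&classPath=" + enc(classPath));
    }

    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Uploads the jar, then submits the job; needs a running job server.
    public static void uploadAndRun(String hostPort) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        HttpRequest upload = HttpRequest.newBuilder(jarUri(hostPort, "wordcount"))
                .POST(HttpRequest.BodyPublishers.ofFile(Paths.get("target/wordcount.jar")))
                .build();
        System.out.println(client.send(upload, HttpResponse.BodyHandlers.ofString()).body());

        // The request body is the job configuration read by validate and runJob.
        HttpRequest submit = HttpRequest
                .newBuilder(jobUri(hostPort, "wordcount", "spark.jobserver.WordCount"))
                .POST(HttpRequest.BodyPublishers.ofString("input.filename = /tmp/input.txt"))
                .build();
        System.out.println(client.send(submit, HttpResponse.BodyHandlers.ofString()).body());
    }

    public static void main(String[] args) throws Exception {
        String hostPort = "localhost:8090"; // assumed default address
        System.out.println(jarUri(hostPort, "wordcount"));
        System.out.println(jobUri(hostPort, "wordcount", "spark.jobserver.WordCount"));
        if (args.length > 0) {
            uploadAndRun(args[0]); // pass host:port to actually send the requests
        }
    }
}
```

If input.filename is left out of the submitted configuration, validate rejects the job and the submit call returns 400 Bad Request, as described above. Some job server versions may also expect a content type header on the jar upload, so check the documentation for the version you run.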
