Technical blog

Friday, 2 December 2016

Is Agile an effective way to herd data scientists into the production pen, or just an excuse to avoid documentation and planning? Which components of Agile do we recommend for analytics PoCs and for full-fledged projects? Let's discuss.

Every organization starts with its business ambitions and then creates a roadmap of the technology, people, and investment needed to unlock that potential. To get there, we go through a phase of initial discussions to understand the requirements and the technical workloads, like: “I need a Linux server, a database, a recommendation engine, tools to handle the big data...”

Technical requirements are straightforward most of the time, but analytical work is vague and uncertain: we don’t know the best approach to solve the problem, or how much time it will take to reach the best solution.

If we develop it with the traditional waterfall approach, here is how it goes:

Developing a traditional analytics project:

Let’s say we need to build a recommendation engine for users. The use case seems pretty easy. A traditional analytics team would spend a long stretch building an engine that uses the entire user dataset, runs content-based recommendation (CBR) or collaborative filtering (CF), and after much effort possibly delivers a powerful engine that serves near-real-time recommendations to users. Throughout this seemingly hassle-free development, there was no interaction with business people.

Challenges in the Traditional Approach:

We developed the entire engine, but we are not sure about the correctness of the model. What if we used the wrong data, or the wrong variables? We don’t even know whether our data exploration and insights were correct. Now suppose the stakeholders reject the model and send it back with feedback because it didn’t meet their expectations. Time to rework. Wouldn’t it have been better to use Agile from the start?

An Agile approach would have played a great role here: rapid, iterative product development combined with fast customer feedback cycles.

Our problem and opportunity sit at the intersection of two trends: how do we incorporate data science and analytics, which is applied research demanding exhaustive effort on an unpredictable timeline, into an agile application? How can analytics applications do better than the traditional waterfall model? How can we craft applications for unknown, evolving data models?

What is Agile?

Agile software development focuses on the four values from the Agile Manifesto:

Individuals and interactions over processes and tools

Working software over comprehensive documentation

Customer collaboration over contract negotiation

Responding to change over following a plan

Engineering products and engineering data science are different: data science is less deterministic. It needs a great deal of creativity and thought to derive the best approach. Agile helps manage this in cycles in which the team explores, learns something about the data, shares the insights with the business team and stakeholders, aligns on the needs and approach, takes the feedback, and continues in the same direction.

How the Agile analytics approach unfolds

The main difference between the traditional and the Agile analytics approach is the iterative process: sharing learnings with stakeholders, getting rapid feedback, and learning through new business questions while describing the datasets.

A team of data scientists, business analysts, and other SMEs works with the stakeholders to discuss each question until they have:

A scope that is clear and as narrow as possible

Potential datasets and variables to be used for analysis

Questions to be answered

Data scientists provide insights on the nature and quality of the dataset, hone the questions and hypotheses, and provide a concrete list of algorithms that could viably answer those questions. These outputs turn into proofs of concept or prototypes of an analytics solution.

It is a voyage of discovery. The data-value pyramid describes this progression: start with simple records and charts, move up to reports, and only then to predictions and actions.

Every project needs an investment, and building an analytics solution is generally costlier than developing application software, because each business silo can point to a different domain or a different data source. There is high risk in the investment. Agile analytics helps minimize the risk of pursuing blind alleys: with its iterative approach and cyclic interaction with the business team, it mitigates the risk of implementing models that turn out to be garbage.

Friday, 3 June 2016

In the previous post, we learnt about setting up Spark Job Server and running Spark jobs on it. So far, we have used Scala programs on the job server. Now we’ll see how to write Spark jobs in Java to run on the job server.

As in Scala, a job must implement the SparkJob trait.
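A minimal Scala sketch, following the examples in the spark-jobserver documentation (the object name and the input.string parameter are illustrative):

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

object WordCountExample extends SparkJob {

  // The job server injects its managed SparkContext here.
  override def runJob(sc: SparkContext, jobConfig: Config): Any = {
    sc.parallelize(jobConfig.getString("input.string").split(" ").toSeq).countByValue()
  }

  // Runs before runJob; rejects the job early when the config is incomplete.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = {
    if (config.hasPath("input.string")) SparkJobValid
    else SparkJobInvalid("No input.string config param")
  }
}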

runJob method contains the implementation of the job. The SparkContext is managed by the job server and will be provided to the job through this method. This relieves the developer from the boilerplate configuration management that comes with the creation of a Spark job and allows the job server to manage and re-use contexts.

validate method allows for an initial validation of the context and any provided configuration. If the context and configuration are OK to run the job, returning spark.jobserver.SparkJobValid will let the job execute; otherwise, returning spark.jobserver.SparkJobInvalid(reason) prevents the job from running and provides a means to convey the reason for failure. In this case, the call immediately returns an HTTP/1.1 400 Bad Request status code. validate helps prevent running jobs that would eventually fail due to missing or wrong configuration, saving both time and resources.

In Java, we need to extend the JavaSparkJob class instead. It has the following methods, which can be overridden in the program:

runJob(jsc: JavaSparkContext, jobConfig: Config)

validate(sc: SparkContext, config: Config)

invalidate(jsc: JavaSparkContext, config: Config)

The JavaSparkJob class is available in the job-server-api package. Build the job-server-api source code and add the resulting jar to your project. Add Spark and other required dependencies to your pom.xml.
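For reference, a minimal sketch of the dependency section (the version number and Scala suffix are illustrative; match them to your Spark installation):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.1</version>
    <scope>provided</scope>
</dependency>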

Let’s start with the basic WordCount example:

WordCount.java:

package spark.jobserver;

import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

import com.typesafe.config.Config;

public class WordCount extends JavaSparkJob implements Serializable {

    private static final long serialVersionUID = 1L;
    private static final Pattern SPACE = Pattern.compile(" ");

    // Entry point: the job server supplies its managed JavaSparkContext.
    public Object runJob(JavaSparkContext jsc, Config config) {
        try {
            JavaRDD<String> lines = jsc.textFile(config.getString("input.filename"), 1);

            // Split each line on spaces to get the individual words.
            JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
                public Iterable<String> call(String s) {
                    return Arrays.asList(SPACE.split(s));
                }
            });

            // Map each word to (word, 1) and sum the counts per word.
            JavaPairRDD<String, Integer> counts = words.mapToPair(
                    new PairFunction<String, String, Integer>() {
                        public Tuple2<String, Integer> call(String s) {
                            return new Tuple2<String, Integer>(s, 1);
                        }
                    }).reduceByKey(new Function2<Integer, Integer, Integer>() {
                        public Integer call(Integer i1, Integer i2) {
                            return i1 + i2;
                        }
                    });

            List<Tuple2<String, Integer>> output = counts.collect();
            System.out.println(output);
            return output;
        } catch (Exception e) {
            e.printStackTrace();
            return null;
        }
    }

    // Reject the job up front when the input file name is missing or empty.
    public SparkJobValidation validate(SparkContext sc, Config config) {
        if (config.hasPath("input.filename") && !config.getString("input.filename").isEmpty()) {
            return SparkJobValid$.MODULE$;
        } else {
            return new SparkJobInvalid("Input parameter is missing. Please mention the filename");
        }
    }

    public String invalidate(JavaSparkContext jsc, Config config) {
        return null;
    }
}

The next step is to compile the code, build the jar, and upload it to the job server.
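A sketch of those two steps with curl, assuming the job server runs on the default port 8090 (the jar name, app name, and input path are placeholders):

curl --data-binary @wordcount.jar localhost:8090/jars/wordcount

curl -d "input.filename = /tmp/sample.txt" 'localhost:8090/jobs?appName=wordcount&classPath=spark.jobserver.WordCount&sync=true'

The first call uploads the jar under the app name wordcount; the second submits the job and, with sync=true, waits for the result instead of returning a job id immediately.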

Thursday, 26 May 2016

Spark Job Server provides a RESTful interface for the submission and management of Spark jobs, jars, and job contexts. It facilitates sharing jobs and RDD data in a single context, and it can run standalone jobs as well. Job history and configuration are persisted.
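For a flavour of that interface, a few of the routes it exposes (assuming the default port 8090):

curl localhost:8090/jars       # list uploaded jars

curl localhost:8090/contexts   # list running contexts

curl localhost:8090/jobs       # list jobs and their status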

Run the sbt command in the cloned repo. It will build the project and drop you into the sbt shell. If you are running sbt for the first time, the build will take a while. Then type re-start in the sbt shell to start the server:
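A minimal sketch of the whole sequence (the repository URL is the standard spark-jobserver one; adjust it if you work from a fork):

git clone https://github.com/spark-jobserver/spark-jobserver.git

cd spark-jobserver

sbt

> re-start

Once re-start finishes, the server should be listening on the default port 8090.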