"Pattern" is an open source project which takes models trained in popular analytics frameworks, such as SAS, Microstrategy, SQL Server, etc., and runs them at scale on Apache Hadoop. This machine
…

"Pattern" is an open source project which takes models trained in popular analytics frameworks, such as SAS, Microstrategy, SQL Server, etc., and runs them at scale on Apache Hadoop. This machine learning library works by translating PMML -- an established XML standard for predictive model markup -- into data workflows based on the Cascading API in Java. PMML models can be run in a pre-defined JAR file with no coding required. PMML can also be combined with other flows based on ANSI SQL (Lingual), Scala (Scalding), Clojure (Cascalog), etc. Multiple companies have collaborated to implement parallelized algorithms: Random Forest, Logistic Regression, K-Means, Hierarchical Clustering, etc., with more machine learning support being added. Benefits include greatly reduced development costs and less licensing issues at scale ?- while leveraging a combination of Apache Hadoop clusters, existing intellectual property in predictive models, and the core competencies of analytics staff. Sample code in the talk will show apps using predictive models built in SAS and R, e.g., anti-fraud classifiers. In addition, examples will show how to compare variations of models for large-scale customer experiments. Portions of this material come from the O`Reilly book "Enterprise Data Workflows with Cascading", due June 2013.

3.
Cascading – origins
API author Chris Wensel worked as a system architect at an Enterprise firm well known for many popular data products.
Wensel was following the Nutch open source project – where Hadoop started.
Observation: it would be difficult to find Java developers to write complex Enterprise apps in MapReduce – a potential blocker for leveraging new open source technology.

4.
Cascading – functional programming
Key insight: MapReduce is based on functional programming – back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007 as a new Java API to implement functional programming for large-scale data workflows:
• leverages the JVM and Java-based tools without any need to create new languages
• allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters
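To make the functional reading of MapReduce concrete, here is a minimal sketch in plain Python. It is not the Cascading API and does not run on Hadoop; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. It shows the three steps as ordinary functions over lists: a map that emits key/value pairs, a grouping step by key, and a reduce that folds each group.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


def map_phase(line):
    """Map: emit one (token, 1) pair per whitespace-separated token."""
    return [(token.lower(), 1) for token in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Reduce: fold all values for one key into a single count."""
    return reduce(lambda a, b: a + b, values, 0)


def word_count(lines):
    """Compose map, shuffle, and reduce into one pipeline."""
    pairs = chain.from_iterable(map(map_phase, lines))
    return {key: reduce_phase(values) for key, values in shuffle(pairs).items()}


if __name__ == "__main__":
    print(word_count(["a b a", "B c"]))
```

Because each step is a pure function of its input, the map and reduce calls could in principle be distributed across machines without changing the result, which is the property the slide attributes to data pipelines.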

22.
Literate Programming
by Don Knuth
Literate Programming
Univ of Chicago Press, 1992
literateprogramming.com/
“Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”

23.
Workflow Abstraction – business process
following the essence of literate programming, Cascading workflows provide statements of business process
this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale

24.
Business Process
by Edgar Codd
“A relational model of data for large shared data banks”
Communications of the ACM, 1970
dl.acm.org/citation.cfm?id=362685
rather than arguing between SQL vs. NoSQL… structured vs. unstructured data frameworks…
this approach focuses on what apps do: the process of structuring data

29.
PMML – standard
• established XML standard for predictive model markup
• organized by Data Mining Group (DMG), since 1997: http://dmg.org/
• members: IBM, SAS, Visa, NASA, Equifax, MicroStrategy, Microsoft, etc.
• PMML concepts for metadata, ensembles, etc., translate directly into Cascading tuple flows
“PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.”
wikipedia.org/wiki/Predictive_Model_Markup_Language
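As a rough illustration of what "develop on one system, deploy on another" means in practice, the sketch below reads a small PMML regression model with Python's standard XML parser and scores one record. It is not how Pattern or Cascading consume PMML. The sample document is hypothetical and simplified: real PMML files declare an XML namespace and carry DataDictionary and MiningSchema sections, which are omitted here, and the optional exponent attribute on predictors is ignored. The function names load_regression and score are also made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified PMML: namespace, DataDictionary, and MiningSchema
# are left out so the example stays short.
SAMPLE_PMML = """
<PMML version="4.1">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="age" coefficient="0.25"/>
      <NumericPredictor name="income" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_regression(pmml_text):
    """Extract the intercept and per-field coefficients from the model."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("NumericPredictor")
    }
    return intercept, coefficients


def score(model, record):
    """Apply the linear model: intercept + sum(coefficient * field value)."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


if __name__ == "__main__":
    model = load_regression(SAMPLE_PMML)
    print(score(model, {"age": 40, "income": 3.0}))
```

The point of the sketch is that the model is plain data: any runtime that can parse the markup can apply it, regardless of which tool trained it.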

43.
Experiments – comparing models
much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale
run multiple variants, then measure relative “lift”
Concurrent runtime – tag and track models
the following example compares two models trained with different machine learning algorithms
this is exaggerated: one model has an important variable intentionally omitted to help illustrate the experiment
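One way to picture "tag and track models, then measure relative lift" is the sketch below. It assumes each scored record carries a model tag, a predicted label, and the actual label, and it assumes "lift" means the relative improvement of a variant's accuracy over a control's; other definitions of lift exist, and the slides do not say which metric was used. The names accuracy_by_model and relative_lift are hypothetical.

```python
from collections import defaultdict


def accuracy_by_model(results):
    """results: iterable of (model_tag, predicted_label, actual_label)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for tag, predicted, actual in results:
        totals[tag] += 1
        hits[tag] += int(predicted == actual)
    return {tag: hits[tag] / totals[tag] for tag in totals}


def relative_lift(metric, control, variant):
    """Relative improvement of the variant over the control (an assumption)."""
    return (metric[variant] - metric[control]) / metric[control]


if __name__ == "__main__":
    # Made-up records for illustration only, not results from the talk.
    records = [
        ("control", 1, 1), ("control", 0, 1), ("control", 1, 0), ("control", 0, 0),
        ("variant", 1, 1), ("variant", 1, 1), ("variant", 0, 0), ("variant", 1, 0),
    ]
    accuracy = accuracy_by_model(records)
    print(accuracy, relative_lift(accuracy, "control", "variant"))
```

Tagging each record with the model that produced it keeps the comparison a simple group-by, which is the kind of operation that maps naturally onto a data workflow.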

48.
Two Cultures
“A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.”
Statistical Modeling: The Two Cultures
Leo Breiman, 2001
bit.ly/eUTh9L
in other words, seeing the forest for the trees…
this paper chronicled a sea change from data modeling practices (silos, manual process) to the rising use of algorithmic modeling (machine data for automation/optimization)

50.
Algorithmic Modeling
“The trick to being a scientist is to be open to using a wide variety of tools.” – Breiman
circa 2001: Random Forest, bootstrap aggregation, etc., yield dramatic increases in predictive power over earlier modeling such as Logistic Regression
major learnings from the Netflix Prize: the power of ensembles, model chaining, etc.
the problems at hand have become simply too big and too complex for ONE distribution, ONE model, ONE team…
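Bootstrap aggregation, mentioned above, can be sketched in a few lines. This is a toy illustration under stated assumptions, not Random Forest and not anything from Pattern: the base learner is a one-feature threshold "stump", the data is a list of (x, label) pairs with labels 0 or 1, and the names train_stump, bagged_stumps, and predict are hypothetical. Each model is trained on a resample drawn with replacement, and the ensemble predicts by majority vote.

```python
import random
from collections import Counter


def train_stump(sample):
    """Pick the threshold with the fewest errors; predict 1 when x >= threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum(int((x >= threshold) != bool(y)) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagged_stumps(data, n_models, seed=0):
    """Train n_models stumps, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(train_stump(sample))
    return models


def predict(models, x):
    """Majority vote across the ensemble (use an odd n_models to avoid ties)."""
    votes = Counter(int(x >= threshold) for threshold in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    data = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]
    ensemble = bagged_stumps(data, n_models=11)
    print(predict(ensemble, 0.5), predict(ensemble, 9.5))
```

Individual stumps vary with the resample they saw, but the vote smooths that variation out, which is the intuition behind the ensemble gains the slide refers to.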

58.
Anatomy of an Enterprise app
Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
[diagram labels: data sources, ETL, data prep, predictive model, end uses]
SAS for predictive models
ANSI SQL for ETL
most of the licensing costs…

59.
Anatomy of an Enterprise app
Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
[diagram labels: data sources, ETL, data prep, predictive model, end uses]
J2EE for business logic
most of the project costs…