3.
What is Spark?
Developed in 2009 at UC Berkeley AMPLab, then
open sourced in 2010, Spark has since become
one of the largest OSS communities in big data,
with over 200 contributors in 50+ organizations
spark.apache.org
“Organizations that are looking at big data challenges –
including collection, ETL, storage, exploration and analytics –
should consider Spark for its in-memory performance and
the breadth of its model. It supports advanced analytics
solutions on Hadoop clusters, including the iterative model
required for machine learning and graph analysis.”
Gartner, Advanced Analytics and Data Science (2014)

14.
A Brief History: Functional Programming for Big Data
circa late 1990s:
explosive growth of e-commerce and machine data
meant that workloads could no longer fit on a
single computer…
notable firms led the shift to horizontal scale-out
on clusters of commodity hardware, especially
for machine learning use cases at scale

40.
Unifying the Pieces: Spark SQL
// http://spark.apache.org/docs/latest/sql-programming-guide.html

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// define the schema using a case class
case class Person(name: String, age: Int)

// create an RDD of Person objects and register it as a table
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerAsTempTable("people")

// SQL statements can be run using the SQL methods provided by sqlContext
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// results of SQL queries are SchemaRDDs and support all the
// normal RDD operations…
// columns of a row in the result can be accessed by ordinal
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

58.
Because Use Cases: Stratio
Stratio Streaming: a new approach to
Spark Streaming
David Morales, Oscar Mendez
2014-06-30
spark-summit.org/2014/talk/stratio-streaming-
a-new-approach-to-spark-streaming
• Stratio Streaming is the union of a real-time
messaging bus with a complex event processing
engine using Spark Streaming
• allows the creation of streams and queries on the fly
• paired with Siddhi CEP engine and Apache Kafka
• added global features to the engine, such as
auditing and statistics

59.
Because Use Cases: Ooyala
Productionizing a 24/7 Spark Streaming
service on YARN
Issac Buenrostro, Arup Malakar
2014-06-30
spark-summit.org/2014/talk/
productionizing-a-247-spark-streaming-service-
on-yarn
• state-of-the-art ingestion pipeline, processing over
two billion video events a day
• how do you ensure 24/7 availability and fault
tolerance?
• what are the best practices for Spark Streaming and
its integration with Kafka and YARN?
• how do you monitor and instrument the various
stages of the pipeline?

60.
Because Use Cases: Spotify
Collaborative Filtering with Spark
Chris Johnson
slideshare.net/MrChrisJohnson/collaborative-filtering-
with-spark
• collab filter (ALS) for music recommendation
• Hadoop suffers from I/O overhead
• show a progression of code rewrites, converting
a Hadoop-based app into efficient use of Spark
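
The ALS approach referenced above can be sketched with Spark's MLlib
recommendation API (a minimal sketch; the input path, data format, and
hyperparameter values here are illustrative assumptions, not taken from
the talk):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// parse "user,item,rating" lines into MLlib Rating objects
// (the file path and format are hypothetical)
val ratings = sc.textFile("data/ratings.csv").map { line =>
  val Array(user, item, rating) = line.split(",")
  Rating(user.toInt, item.toInt, rating.toDouble)
}

// factorize the user-item matrix with ALS:
// rank = number of latent factors, 10 iterations,
// lambda = regularization (values chosen for illustration)
val model = ALS.train(ratings, rank = 10, iterations = 10, lambda = 0.01)

// predict how a given user would rate a given item
val score = model.predict(1, 42)
```

Because the ratings RDD stays in memory across ALS iterations, this
avoids the per-iteration disk I/O that the Hadoop version pays.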

61.
Because Use Cases: Sharethrough
Sharethrough Uses Spark Streaming to
Optimize Bidding in Real Time
Russell Cardullo, Michael Ruggier
2014-03-25
databricks.com/blog/2014/03/25/
sharethrough-and-spark-streaming.html
• the profile of a 24 x 7 streaming app is different than
an hourly batch job…
• take time to validate output against the input…
• confirm that supporting objects are being serialized…
• the output of your Spark Streaming job is only as
reliable as the queue that feeds Spark…
• monoids…