Zeppelin Tutorial

This tutorial walks you through some of the fundamental Zeppelin concepts. We will assume you have already installed Zeppelin. If not, please see here first.

The current main backend processing engine of Zeppelin is Apache Spark. If you're new to this system, you might want to start by getting an idea of how it processes data, to get the most out of Zeppelin.

Tutorial with Local File

Data Refine

Before you start the Zeppelin tutorial, you will need to download bank.zip.

First, to transform the csv format data into an RDD of Bank objects, run the following script. It also removes the header row using the filter function.

val bankText = sc.textFile("yourPath/bank/bank-full.csv")

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

// split each line, filter out header (starts with "age"), and map it into Bank case class
val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
    s => Bank(s(0).toInt,
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
)

// convert to DataFrame and create temporary table
bank.toDF().registerTempTable("bank")
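The split/filter/map pipeline above can be tried without a Spark cluster. The sketch below applies the same logic to two hypothetical lines of bank-full.csv (a quoted header row and one data row) using plain Scala collections; the sample values are illustrative, not taken from the actual file:

```scala
// Same shape as the Spark version, minus sc.textFile
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

// One header line and one made-up data line in the bank-full.csv format
val lines = Seq(
  "\"age\";\"job\";\"marital\";\"education\";\"default\";\"balance\"",
  "30;\"unemployed\";\"married\";\"primary\";\"no\";1787"
)

val banks = lines
  .map(_.split(";"))                 // split each line on ";"
  .filter(_(0) != "\"age\"")         // drop the header row
  .map(s => Bank(s(0).toInt,
                 s(1).replaceAll("\"", ""),   // strip surrounding quotes
                 s(2).replaceAll("\"", ""),
                 s(3).replaceAll("\"", ""),
                 s(5).replaceAll("\"", "").toInt))

println(banks.head)   // prints Bank(30,unemployed,married,primary,1787)
```

Note that field 4 (default) is skipped, exactly as in the script above: the Bank case class only keeps columns 0–3 and 5.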

Data Retrieval

Suppose we want to see the age distribution from bank. To do this, run:

%sql
select age, count(1) from bank where age < 30 group by age order by age

You can make an input box for setting the age condition by replacing 30 with ${maxAge=30}.
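With that substitution, the paragraph might look like the following sketch; Zeppelin renders an input box named maxAge, pre-filled with the default value 30:

```
%sql
select age, count(1) from bank where age < ${maxAge=30} group by age order by age
```

Changing the value in the input box and re-running the paragraph re-executes the query with the new age condition.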

Tutorial with Streaming Data

Data Refine

Since this tutorial is based on Twitter's sample tweet stream, you must configure authentication with a Twitter account. To do this, take a look at Twitter Credential Setup. After you get API keys, you should fill in the credential-related values (apiKey, apiSecret, accessToken, accessTokenSecret) with your API keys in the following script.

This will create an RDD of Tweet objects and register the stream data as a table:

import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.storage.StorageLevel
import scala.io.Source
import scala.collection.mutable.HashMap
import java.io.File
import org.apache.log4j.Logger
import org.apache.log4j.Level
import sys.process.stringSeqToProcess

/** Configures the Oauth Credentials for accessing Twitter */
def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: String, accessTokenSecret: String) {
  val configs = new HashMap[String, String] ++= Seq(
    "apiKey" -> apiKey, "apiSecret" -> apiSecret,
    "accessToken" -> accessToken, "accessTokenSecret" -> accessTokenSecret)
  println("Configuring Twitter OAuth")
  configs.foreach { case (key, value) =>
    if (value.trim.isEmpty) {
      throw new Exception("Error setting authentication - value for " + key + " not set")
    }
    val fullKey = "twitter4j.oauth." + key.replace("api", "consumer")
    System.setProperty(fullKey, value.trim)
    println("\tProperty " + fullKey + " set as [" + value.trim + "]")
  }
  println()
}

// Configure Twitter credentials
val apiKey = "xxxxxxxxxxxxxxxxxxxxxxxxx"
val apiSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessToken = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessTokenSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
configureTwitterCredentials(apiKey, apiSecret, accessToken, accessTokenSecret)

import org.apache.spark.streaming.twitter._
val ssc = new StreamingContext(sc, Seconds(2))
val tweets = TwitterUtils.createStream(ssc, None)
val twt = tweets.window(Seconds(60))

case class Tweet(createdAt: Long, text: String)
twt.map(status =>
  Tweet(status.getCreatedAt().getTime() / 1000, status.getText())
).foreachRDD(rdd =>
  // Below line works only in spark 1.3.0.
  // For spark 1.1.x and spark 1.2.x,
  // use rdd.registerTempTable("tweets") instead.
  rdd.toDF().registerAsTable("tweets")
)

twt.print

ssc.start()

Data Retrieval

For each of the following scripts, you will see a different result every time you click the run button, since they are based on real-time data.

You can make a user-defined function and use it in Spark SQL. Let's try it by making a function named sentiment. This function will return one of three attitudes (positive, negative, neutral) towards its parameter.

def sentiment(s: String): String = {
  val positive = Array("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
  val negative = Array("hate", "bad", "stupid", "is")

  var st = 0
  val words = s.split(" ")
  positive.foreach(p => words.foreach(w => if (p == w) st = st + 1))
  negative.foreach(p => words.foreach(w => if (p == w) st = st - 1))

  if (st > 0) "positive" else if (st < 0) "negative" else "neutral"
}

// Below line works only in spark 1.3.0.
// For spark 1.1.x and spark 1.2.x,
// use sqlc.registerFunction("sentiment", sentiment _) instead.
sqlc.udf.register("sentiment", sentiment _)
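The scoring logic can be checked outside Spark SQL, since sentiment is ordinary Scala. The sketch below restates the function on its own (no sqlc registration needed) and calls it on a few made-up example strings:

```scala
// Standalone copy of the sentiment word-count heuristic from the script above
def sentiment(s: String): String = {
  val positive = Array("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
  val negative = Array("hate", "bad", "stupid", "is")

  var st = 0
  val words = s.split(" ")
  positive.foreach(p => words.foreach(w => if (p == w) st = st + 1))
  negative.foreach(p => words.foreach(w => if (p == w) st = st - 1))

  if (st > 0) "positive" else if (st < 0) "negative" else "neutral"
}

println(sentiment("i love this one"))  // "love" and "one" match -> prints positive
println(sentiment("this is stupid"))   // "is" and "stupid" match -> prints negative
println(sentiment("hello world"))      // no matches -> prints neutral
```

Note the heuristic is intentionally crude: it just counts exact word matches against the two small word lists, so scores from the positive and negative lists can cancel out.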

To check how people feel about girls, using the sentiment function we've made above, run this: