Setting up a Twitter stream

For streaming data from Twitter you need access keys and tokens. Go to https://apps.twitter.com and create a new app to get these. After creating the app, click on “Keys and Access Tokens” and copy the following:

Consumer Key (API Key)

Consumer Secret (API Secret)

Access Token

Access Token Secret

We will use twitter4j. Build a configuration using the tokens and keys:

```scala
val cb = new ConfigurationBuilder()
cb.setDebugEnabled(true)
cb.setOAuthConsumerKey("p5vABCjRWWSXNBkypnb8ZnSzk") // replace this with your own keys
cb.setOAuthConsumerSecret("wCVFIpwWxEyOcM9lrHa9TYExbNsLGvEUgJucePPjcTx83bD1Gt") // replace this with your own keys
cb.setOAuthAccessToken("487652626-kDOFZLu8bDjFyCKUOCDa7FtHsr22WC3PMH4iuNtn") // replace this with your own keys
cb.setOAuthAccessTokenSecret("4W3LaQTAgGoW5SsHUAgp6gK9b5AKgl8hRcFnNYgvPTylU") // replace this with your own keys
```

You can now open a stream and listen for tweets with specific keywords or hashtags:
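A minimal sketch using the twitter4j streaming API might look like the following; the tracked keywords are placeholders, and the OAuth values mirror the configuration built above (replace them with your own keys):

```scala
import twitter4j._
import twitter4j.conf.ConfigurationBuilder

object TweetStream extends App {
  val cb = new ConfigurationBuilder()
  cb.setDebugEnabled(true)
  cb.setOAuthConsumerKey("your-consumer-key")
  cb.setOAuthConsumerSecret("your-consumer-secret")
  cb.setOAuthAccessToken("your-access-token")
  cb.setOAuthAccessTokenSecret("your-access-token-secret")

  val stream = new TwitterStreamFactory(cb.build()).getInstance()

  // print each matching tweet as it arrives
  stream.addListener(new StatusListener {
    override def onStatus(status: Status): Unit =
      println(s"@${status.getUser.getScreenName}: ${status.getText}")
    override def onDeletionNotice(notice: StatusDeletionNotice): Unit = ()
    override def onTrackLimitationNotice(numberOfLimitedStatuses: Int): Unit = ()
    override def onScrubGeo(userId: Long, upToStatusId: Long): Unit = ()
    override def onStallWarning(warning: StallWarning): Unit = ()
    override def onException(e: Exception): Unit = e.printStackTrace()
  })

  // track tweets containing these keywords/hashtags (placeholders)
  stream.filter(new FilterQuery().track("#bigdata", "kafka"))
}
```

The listener callbacks other than `onStatus` are required by the `StatusListener` interface; here they are no-ops.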

ZooKeeper setup

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. In our example, both Kafka and Solr will need zookeeper for their state and config management, so you need to first start zookeeper.

Download it from http://apache.org/dist/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz

Extract it and go inside the conf directory

Make a copy of zoo_sample.cfg as zoo.cfg

Run it using bin/zkServer.sh start

Verify it started successfully by running the bin/zkServer.sh status command.

Putting data in Kafka

Here are the steps to send data to Kafka:

Start the Kafka server and broker(s)

Create a topic in Kafka to which the data will be sent

Define an Avro schema for the tweets

Create a Kafka producer which will serialize tweets using the Avro schema and send them to Kafka
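The producer step above might be sketched as follows, assuming Kafka's Java client; the broker address, the topic name `tweets`, and the idea that `avroBytes` holds a tweet already encoded with the Avro schema from the next section are all assumptions for illustration:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TweetProducer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // assumed broker address
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")

  // keys are tweet ids (strings), values are Avro-encoded bytes
  val producer = new KafkaProducer[String, Array[Byte]](props)

  def send(tweetId: String, avroBytes: Array[Byte]): Unit =
    producer.send(new ProducerRecord[String, Array[Byte]]("tweets", tweetId, avroBytes))
}
```

Using the tweet id as the message key keeps all versions of the same tweet in one partition.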

Avro schema

Avro is a data serialization system. It has a JSON-like data model, but data can be represented as either JSON or in a compact binary form. It comes with a sophisticated schema description language. Let's define an Avro schema for our Tweet type:
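A schema along the following lines would cover the tweet fields we use; the namespace is a placeholder and the exact field set is illustrative, mirroring the tweet document shown in the Solr response later:

```json
{
  "namespace": "com.example.tweets",
  "type": "record",
  "name": "Tweet",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "username", "type": "string"},
    {"name": "userId", "type": "long"},
    {"name": "userScreenName", "type": "string"},
    {"name": "message", "type": "string"},
    {"name": "lang", "type": "string"},
    {"name": "favCount", "type": "int"},
    {"name": "retweetCount", "type": "int"},
    {"name": "isRetweet", "type": "boolean"},
    {"name": "createdAt", "type": "string"}
  ]
}
```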

Now we need to define a Kafka consumer which will read data from Kafka and send it to SolrWriter

Kafka Consumer

The consumer will read data from Kafka, deserialize it using the Avro schema, convert it to the Tweet type, and forward the message to a destination actor. We will keep the consumer generic so that any destination actor (Solr or Cassandra) can be passed to it.

```scala
class KafkaTweetConsumer(zkHost: String, groupId: String, topic: String, destination: ActorRef)
  extends Actor with Logging {
  ...
  def read() = try {
    ...
    destination ! tweet // destination will be either solr or cassandra
    ...
  }
}
```
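Wiring the consumer to a destination might look like this; the actor system name, the ZooKeeper address, and the stub SolrWriter body are placeholders (the real SolrWriter does the Solr indexing):

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}

// stand-in for the SolrWriter actor described above
class SolrWriter extends Actor {
  def receive = { case tweet => /* index the tweet into Solr */ }
}

object Pipeline extends App {
  val system = ActorSystem("tweet-pipeline")
  val solrWriter: ActorRef = system.actorOf(Props[SolrWriter], "solr-writer")

  // any destination actor (Solr or Cassandra) can be passed in here
  val consumer = system.actorOf(
    Props(new KafkaTweetConsumer("localhost:2181", "tweet-group", "tweets", solrWriter)),
    "kafka-consumer")
}
```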

A sample Solr query (q=*:*, rows=1) over the indexed tweets returns a response like this:

```json
{
  "responseHeader": {
    "zkConnected": true,
    "status": 0,
    "QTime": 1,
    "params": {"q": "*:*", "rows": "1", "wt": "json"}
  },
  "response": {
    "numFound": 42,
    "start": 0,
    "docs": [
      {
        "id": "923302396612182016",
        "username": "Tawanna Kessler",
        "userId": 898322458742337536,
        "userScreenName": "tawanna_kessler",
        "userDesc": "null",
        "userProfileImgUrl": "http://pbs.twimg.com/profile_images/898323854417940484/lke3BSjt_normal.jpg",
        "favCount": 0,
        "retweetCount": 183,
        "lang": "en",
        "place": "null",
        "message": "RT @craigbrownphd: Two upcoming webinars: Two new Microsoft webinars are taking place over the next week that may… https://t.co/SAb9CMmVXY…",
        "isSensitive": false,
        "isTruncated": false,
        "isFavorited": false,
        "isRetweeted": false,
        "isRetweet": true,
        "createdAt": "2017-10-26T03:07:00Z",
        "_version_": 1582267022370144256
      }
    ]
  }
}
```

Querying Solr data with Banana

Banana is a data visualization tool that uses Solr for data analysis and display. We will run it in the same container as Solr. Here's how to set it up for our tweet data:

Visualizing Cassandra data with Zeppelin

Zeppelin is a web-based notebook that can be used for interactive data analytics on Cassandra data using Spark.

Download the binary from https://zeppelin.apache.org/download.html and uncompress it.
The default port it uses is 8080, which conflicts with the Spark master web UI port, so change the port in conf/zeppelin-site.xml.
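For example, to move Zeppelin to another free port such as 8180, set the `zeppelin.server.port` property in conf/zeppelin-site.xml:

```xml
<property>
  <name>zeppelin.server.port</name>
  <value>8180</value>
</property>
```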