The first post discussed creating a machine learning model using Apache Spark’s K-means algorithm to cluster Uber data based on location. This second post will discuss using the saved K-means model with streaming data to do real-time analysis of where and when Uber cars are clustered.

The enriched data records are in JSON format. An example line is shown below:
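An enriched record might look like the following (field names and values are illustrative; the actual fields come from the Uber trip data used in the first post, with the cluster id added):

```json
{"dt":"2014-08-01 00:00:00","lat":40.729,"lon":-73.9422,"base":"B02598","cid":18}
```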

Spark Kafka Consumer Producer Code

Parsing the Data Set Records

A Scala Uber case class defines the schema corresponding to the CSV records. The parseUber function parses the comma-separated values into the Uber case class.
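A minimal sketch of the case class and parse function, assuming the CSV field order is date/time, latitude, longitude, base:

```scala
// Schema corresponding to the CSV records (field order assumed)
case class Uber(dt: String, lat: Double, lon: Double, base: String) extends Serializable

// Parse a comma-separated line into an Uber object
def parseUber(str: String): Uber = {
  val p = str.split(",")
  Uber(p(0), p(1).toDouble, p(2).toDouble, p(3))
}
```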

Loading the K-Means Model

The Spark KMeansModel class is used to load the saved K-means model fitted on the historical Uber trip data.
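Loading the model is a one-liner; the path below is illustrative and should point at the directory where the model from the first post was saved:

```scala
import org.apache.spark.ml.clustering.KMeansModel

// Load the K-means model fitted on the historical Uber trip data
// (path is an assumption for illustration)
val model = KMeansModel.load("/user/user01/data/savemodel")

// Print the fitted cluster centers
model.clusterCenters.foreach(println)
```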

The output of model.clusterCenters:

Below the cluster centers are displayed on a Google Map:

Spark Streaming Code

These are the basic steps for the Spark Streaming Consumer Producer code:

Configure Kafka Consumer Producer properties.

Initialize a Spark StreamingContext object. Using this context, create a DStream that reads messages from a topic.

Apply transformations (which create new DStreams).

Write messages from the transformed DStream to a topic.

Start receiving data and processing it. Wait for the processing to be stopped.

We will go through each of these steps with the example application code.

1. Configure Kafka Consumer Producer Properties

The first step is to set the KafkaConsumer and KafkaProducer configuration properties, which will be used later to create a DStream for receiving messages from, and sending messages to, topics. You need to set the following parameters:

Key and value deserializers: For deserializing the message.

Auto offset reset: To start reading from the earliest or latest message.

Bootstrap servers: This can be set to a dummy host:port since the broker address is not actually used by MapR Streams.
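A sketch of the consumer configuration, assuming String keys and values; the group id, topic names, and ProducerConf usage (a MapR-specific producer configuration class) are assumptions for illustration:

```scala
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer

// Topic names are illustrative MapR Streams paths
val topics = "/apps/uberstream:ubers"
val topicp = "/apps/uberstream:uberp"

val kafkaParams = Map[String, Object](
  ConsumerConfig.GROUP_ID_CONFIG -> "uber_group",
  // Key and value deserializers for the String messages
  ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
  ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
  // Start reading from the earliest message
  ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest",
  // Dummy host:port; the broker address is not used by MapR Streams
  ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "dummy:9092"
)
```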

2. Initialize a Spark StreamingContext Object

ConsumerStrategies.Subscribe, as shown below, is used to set the topics and Kafka configuration parameters. We use the KafkaUtils createDirectStream method with a StreamingContext and the consumer and location strategies to create an input stream from a MapR Streams topic. This creates a DStream that represents the stream of incoming data, where each message is a key-value pair. We use the DStream map transformation to create a DStream of the message values.
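A sketch of these steps, using the standard kafka010 integration package (the MapR package name may differ) and assuming the topics and kafkaParams values from the configuration step:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val sparkConf = new SparkConf().setAppName("UberStream")
val ssc = new StreamingContext(sparkConf, Seconds(5))

// Create a direct stream from the MapR Streams topic;
// each message is a key-value pair
val messagesDStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Set(topics), kafkaParams)
)

// Keep just the message values
val valuesDStream = messagesDStream.map(_.value())
```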

3. Apply Transformations (Which Create New DStreams)

We use the DStream foreachRDD method to apply processing to each RDD in this DStream. We parse the message values into Uber objects with a map operation on the RDD, then convert the RDD to a DataFrame, which allows us to use DataFrame and SQL operations on the streaming data.
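This step might be sketched as follows, assuming the valuesDStream and parseUber names from the earlier steps:

```scala
import org.apache.spark.sql.SparkSession

valuesDStream.foreachRDD { rdd =>
  // Only process non-empty batches
  if (!rdd.isEmpty) {
    val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._
    // Parse each CSV message into an Uber object and convert to a DataFrame
    val df = rdd.map(parseUber).toDF()
    df.show()
  }
}
```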

Here is example output from df.show:

A VectorAssembler is used to transform and return a new DataFrame with the latitude and longitude feature columns in a vector column.

The model transform method is then applied to the features, returning a DataFrame with a cluster prediction for each row.
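Together, the feature assembly and prediction steps might look like this, assuming df is the parsed DataFrame and model is the loaded K-means model from the earlier steps:

```scala
import org.apache.spark.ml.feature.VectorAssembler

// Assemble the lat/lon columns into a single "features" vector column
val featureCols = Array("lat", "lon")
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
val df2 = assembler.transform(df)

// The model adds a "prediction" column with the cluster id for each row
val categories = model.transform(df2)
categories.show()
```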

The output of categories.show is below:

The DataFrame is then registered as a table so that it can be used in SQL statements. The output of the SQL query is shown below:
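For example, assuming categories is the DataFrame with the cluster predictions, it can be registered and queried like this (the query itself is illustrative):

```scala
// Register the predictions DataFrame as a temporary view
categories.createOrReplaceTempView("uber")

// Illustrative query: number of trips per cluster
spark.sql(
  "SELECT prediction AS cid, COUNT(*) AS count FROM uber GROUP BY prediction"
).show()
```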

4. Write Messages From the Transformed DStream to a Topic

The Dataset result of the query is converted to an RDD of JSON strings; the RDD sendToKafka method is then used to send the JSON key-value messages to a topic (the key is null in this case).
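A sketch of this step; sendToKafka comes from the MapR Spark/Kafka producer package, and the clusteredResult, topicp, and producerConf names are assumptions standing in for the query result, the output topic, and the producer configuration:

```scala
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.spark.streaming.kafka.producer._

// Convert the query result Dataset to JSON strings
val jsonRDD = clusteredResult.toJSON.rdd

// Publish the JSON messages to the output topic (keys are null)
jsonRDD.sendToKafka[StringSerializer](topicp, producerConf)
```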

Software

This example runs on MapR 5.2 with Spark 2.0.1. If you are running the MapR 5.2 Sandbox, you need to upgrade Spark to 2.0.1 (MEP 2.0). For more information on upgrading, see here and here.

Summary

In this blog post, you learned how to use a Spark machine learning model in a Spark Streaming application, and how to integrate Spark Streaming with MapR Streams to consume and produce messages using the Kafka API.