Analyzing Clickstream Data With Spark

Let’s look at a concrete example with the Click-Through Rate Prediction dataset of ad impressions and clicks from the data science website Kaggle. The goal of this workflow is to create a machine learning model that, given a new ad impression, predicts whether or not there will be a click.

To build our advanced analytics workflow, let’s focus on the three main steps:

ETL

Data Exploration (for example, using SQL)

Advanced Analytics / Machine Learning
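
Before bringing in Spark, it can help to see the shape of these three steps in miniature. The following is only a sketch in plain Python, not the Spark workflow itself: the toy rows are invented for illustration, the column names (`click`, `banner_pos`, `device_type`) are assumptions about what an impression record might contain, `sqlite3` stands in for a SQL engine, and a per-device click rate stands in for a real machine learning model.

```python
import csv
import io
import sqlite3

# Toy impressions, invented for illustration; column names are assumptions.
RAW_CSV = """click,banner_pos,device_type
1,0,phone
0,0,phone
0,1,phone
1,1,tablet
1,1,tablet
0,0,tablet
"""


def etl(raw_text):
    """Step 1 (ETL): parse raw CSV text into typed rows."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [
        (int(r["click"]), int(r["banner_pos"]), r["device_type"])
        for r in reader
    ]


def explore(rows):
    """Step 2 (exploration): use SQL to get the click-through rate per device type."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE impressions "
        "(click INTEGER, banner_pos INTEGER, device_type TEXT)"
    )
    conn.executemany("INSERT INTO impressions VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT device_type, AVG(click) AS ctr, COUNT(*) AS n "
        "FROM impressions GROUP BY device_type ORDER BY device_type"
    ).fetchall()
    conn.close()
    return result


def train(rows):
    """Step 3 (modeling): a baseline that stores the observed click rate per device type."""
    totals = {}
    for click, _banner_pos, device_type in rows:
        clicks, n = totals.get(device_type, (0, 0))
        totals[device_type] = (clicks + click, n + 1)
    return {d: clicks / n for d, (clicks, n) in totals.items()}


def predict(model, device_type, threshold=0.5):
    """Predict a click when the stored rate for this device type reaches the threshold."""
    return model.get(device_type, 0.0) >= threshold


rows = etl(RAW_CSV)
ctr_by_device = explore(rows)
model = train(rows)
```

Calling `predict(model, "tablet")` then answers the question posed above for a new impression: given what we know about it, will there be a click? A real workflow would replace each stand-in with its Spark counterpart and use far more features than a single column.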

The Databricks blog has a couple of other examples, but this was the most interesting one for me.