A real-world Big Data application project – with Apache Spark and AWS EMR

Hey readers, I have been learning Data Engineering for the last few months, and I thought of sharing my learnings with you all. Recently I built a project around a real-life Big Data pipeline, and I am sharing it here so that anyone interested can go through it and practice it for self-learning.

Use Case

Nikita is a new Data Engineer at a great startup. The company currently has a high volume of Apache/application logs being copied to Amazon S3. Nikita’s team is assigned the task of processing all of these logs and exposing the data to the Analysts and Scientists of the organisation.

The Analysts and Scientists prefer consuming the data via SQL or Python. The primary objective for Nikita’s team is to make access to the data easy and optimised.

Project Details

Here I am describing the problem statements and not the solutions, because at this point I don’t want to bias your opinion with my solutions. Once we understand the problems, we will discuss how to implement the solutions.

Data Lake

We want to build a Data lake to store all our structured and unstructured data in one place.

Big Data Concepts

We need to get better at Big Data concepts by applying optimisation techniques to our data, such as:

Partitioned data/tables

Using columnar format like parquet
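To make date partitioning concrete, here is a minimal sketch of how a Hive-style partition maps to an S3 key layout. The bucket and table names are hypothetical, chosen only for illustration:

```python
from datetime import date

# Hypothetical data lake bucket name (an assumption, not from the project).
BUCKET = "s3://my-data-lake"

def partition_path(table: str, dt: date) -> str:
    """Build a Hive-style date partition path (dt=YYYY-MM-DD)."""
    return f"{BUCKET}/{table}/dt={dt.isoformat()}/"

# A query filtered on dt='2020-05-17' only scans this one prefix,
# instead of the whole table.
print(partition_path("access_logs_clean", date(2020, 5, 17)))
```

Storing each partition's files in a columnar format like Parquet further reduces the data scanned, since queries read only the columns they reference.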

Getting the Data

Since this is a real-world project, it is worth having some real-world data as well. Our data is in the Apache log format, so we need to get some log data for the project. It would also be good if we could copy the log data to S3 at intervals, to mimic a production-like scenario.
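If you don't have real access logs handy, you can generate fake ones. A minimal sketch of a generator for Apache "combined" log format lines (all IPs, paths, and user agents below are made up):

```python
import random
from datetime import datetime, timezone

# Hypothetical values, for illustration only.
IPS = ["10.0.0.1", "10.0.0.2", "192.168.1.7"]
PATHS = ["/index.html", "/login", "/api/items"]

def fake_log_line(now=None) -> str:
    """Produce one line in Apache 'combined' log format."""
    now = now or datetime.now(timezone.utc)
    ts = now.strftime("%d/%b/%Y:%H:%M:%S %z")
    ip = random.choice(IPS)
    path = random.choice(PATHS)
    status = random.choice([200, 200, 200, 404, 500])  # skew toward 200s
    size = random.randint(200, 5000)
    return (f'{ip} - - [{ts}] "GET {path} HTTP/1.1" {status} {size} '
            f'"-" "Mozilla/5.0"')

print(fake_log_line())
```

In the real setup, a small script like this would batch such lines into files and copy them to S3 on an interval (for example with the AWS CLI or boto3), mimicking the production log shipper.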

Big Data ETLs

We will create daily/hourly jobs to process newly arriving data and to build aggregate/summary tables that make queries on the data faster. The ETLs we need to create are:

Hourly/daily summary jobs that aggregate the raw tables. These tables will be exposed to the analysts for querying. Since these tables are summarised views, queries against them are faster than queries against the raw tables.

Note: all the ETLs have to be incremental, so that they can run on an hourly/daily schedule.
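"Incremental" here means each run processes only the newest slice of data rather than the full history. One common sketch of that logic, assuming the hourly partition naming from earlier (the `dt=.../hour=...` layout is my assumption):

```python
from datetime import datetime, timedelta

def hourly_partition_to_process(run_time: datetime) -> str:
    """For an hourly incremental job, process the previous full hour.

    The current hour is still receiving data, so the job scheduled at
    09:30 processes the complete 08:00-09:00 partition.
    """
    prev = run_time.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)
    return prev.strftime("dt=%Y-%m-%d/hour=%H")

print(hourly_partition_to_process(datetime(2020, 5, 17, 9, 30)))
# -> dt=2020-05-17/hour=08
```

Because each run touches exactly one partition, a failed run can simply be retried for that hour without reprocessing the whole table.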

Big Data Processing Engine and infrastructure

We need to choose a Big Data engine for our data-processing ETLs, and a cloud-based infrastructure that supports that execution engine. We will talk about the solution later in this post.

Job Scheduler

Now that we have Big Data in place, we need to schedule all our Big Data ETLs to run on an hourly/daily basis.

Exposing the Data to Analysts

Once the ETLs start running incrementally, we will have all our data organised in tables, and we will then need to expose that data to the Analysts. The Analysts and Scientists are the final consumers of the data. Since the Analysts are comfortable with SQL and Python, we have to keep that in mind while selecting the right tools.

Potential Solution

This is a very standard industry problem, and there are several ways it can be approached and solved. There are no right or wrong solutions as long as we balance the expectations and trade-offs.

Project Design

Here is the design I came up with; you could choose to replace any of the components with something you are more comfortable with.

Amazon S3 as Data Lake

Here I have used Amazon S3 as the data lake; it can store and retrieve any amount of data, at any time, from anywhere on the web. You can choose any other storage service to keep your data in one place. I have used Amazon S3 because:

It is easy to use

It stores both structured and unstructured data

It is cost effective

It allows running Big Data analytics directly on the stored data, without moving it to a separate analytics system.

Airflow as Job Scheduler

In this project, I have scheduled my Big Data ETLs through Airflow, which can be used to author, schedule, and monitor data pipelines. I chose Airflow because:

It provides nice visibility over daily runs

Easy failure and crash recovery

Lots of predefined operators and libraries for integrations
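As a sketch of what the pipeline definition could look like, here is a minimal Airflow DAG wiring the clean step before the summary step. This is a configuration sketch under assumptions: it assumes Airflow 2.x, and the DAG id, script names, and `spark-submit` commands are all hypothetical, not from the original project.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="log_pipeline_hourly",          # hypothetical name
    start_date=datetime(2020, 5, 17),
    schedule_interval="@hourly",           # run the ETLs every hour
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    # Each task submits a Spark job; the scripts are illustrative only.
    clean = BashOperator(
        task_id="clean_raw_logs",
        bash_command="spark-submit clean_logs.py --run-ts {{ ts }}",
    )
    summarize = BashOperator(
        task_id="build_summary",
        bash_command="spark-submit summarize.py --run-ts {{ ts }}",
    )
    clean >> summarize  # summary table builds only after cleaning succeeds
```

The `retries`/`retry_delay` defaults give the easy failure recovery mentioned above, and the Airflow UI provides the per-run visibility.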

Apache Spark on Amazon EMR Cluster

Here I have created an EMR cluster, which helps with Big Data processing and analysis. EMR essentially provides managed infrastructure to run Apache Hadoop, Spark, Zeppelin, Hive, etc. on Amazon Web Services.

I chose Apache Spark because :

It supports multiple languages

It supports Hive-SQL

It integrates with S3 and can read both structured and unstructured data.

It supports both streaming and batch.

Big Data ETLs

For this project, I generated log data and stored it in Amazon S3. We then need to create a clean, extracted table from the raw log data, partitioned by date, via Spark SQL or Hive, and store it in S3 in the Parquet file format. After that, we can create an aggregated summary table and store it in S3 as well.
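The heart of the clean-up ETL is turning each raw log line into named columns. A minimal sketch of that per-line parse, shown in plain Python for clarity (in the Spark job the same regex could be applied at scale, e.g. via `regexp_extract`, before writing the result partitioned by date as Parquet); the regex below is my own illustration for the Apache "combined" format:

```python
import re

# Parse an Apache "combined" log record into named fields.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line: str) -> dict:
    """Return the parsed fields, or an empty dict for a malformed line."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else {}

rec = parse_line('10.0.0.1 - - [17/May/2020:09:15:32 +0000] '
                 '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"')
print(rec["status"], rec["path"])
# -> 200 /index.html
```

Once the data is in named columns, the daily summary table is a simple aggregation over them (e.g. request counts and error rates grouped by date and path).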

Apache Zeppelin for consumption layer

Here I chose Apache Zeppelin for exposing the data to the Analysts. Apache Zeppelin is a web-based notebook where we can explore our data with built-in visualisations; it supports SQL, Scala, and Python. By default, Apache Zeppelin provides some basic charts to display the results of our processed data.