Part 2: How to create an EMR cluster with Apache Spark and Apache Zeppelin

This is Part 2 of the blog series – How to analyze Kaggle data with Apache Spark and Zeppelin. In the first part we saw how to copy Kaggle data to Amazon S3. We would now like to analyze that data on EMR, and I’m choosing Spark and Zeppelin for the task.

What is EMR?

Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.

What is Apache Spark?

Apache Spark is a unified analytics engine for large-scale data processing. Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.

Steps to create EMR Cluster

Login to AWS web console

Create EMR cluster

Go to advanced option

Software configuration

In the advanced options, we choose the software to be installed on the cluster; for this walkthrough we need Spark and Zeppelin.
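The same software selection can also be made from the AWS CLI. Below is a minimal sketch of a `create-cluster` call; the cluster name, release label, instance type, and key-pair placeholder are my assumptions, so substitute the values shown in your own console:

```shell
# Sketch: create an EMR cluster with Hadoop, Spark and Zeppelin installed.
# The --release-label value is an assumption; use one offered in your console.
aws emr create-cluster \
  --name "kaggle-analysis" \
  --release-label emr-5.20.0 \
  --applications Name=Hadoop Name=Spark Name=Zeppelin \
  --use-default-roles \
  --ec2-attributes KeyName=<key-pair-name> \
  --instance-type m4.large \
  --instance-count 3
```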

Hardware

After the software configuration, we choose the node types, i.e. master and core, along with their purchasing options.

Note: I’m using the spot purchasing option for the core nodes. Spot instances are bid for out of spare EC2 capacity and are usually much cheaper than on-demand instances.
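This hardware choice maps to the CLI’s `--instance-groups` option. A sketch of a `create-cluster` call with an on-demand master and spot core nodes follows; the cluster name, release label, instance types, counts, and bid price are all assumptions for illustration:

```shell
# Sketch: on-demand master node, spot core nodes with a maximum bid (USD/hour).
aws emr create-cluster \
  --name "kaggle-analysis" \
  --release-label emr-5.20.0 \
  --applications Name=Hadoop Name=Spark Name=Zeppelin \
  --use-default-roles \
  --ec2-attributes KeyName=<key-pair-name> \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large,BidPrice=0.10
```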

General cluster settings

Security

In this step, we provide the key pair used to log in to our EMR cluster. It’s very important that the private key (.pem file) of this key pair is downloaded to your local system; without it, you will not be able to log in to the cluster.
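If you don’t have a key pair yet, one can be created and saved locally with the AWS CLI; the key name `emr-key` below is an assumption, pick your own:

```shell
# Create a new EC2 key pair and save the private key locally.
aws ec2 create-key-pair \
  --key-name emr-key \
  --query 'KeyMaterial' \
  --output text > emr-key.pem

# SSH refuses keys with loose permissions, so lock the file down.
chmod 400 emr-key.pem
```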

The cluster will now start; it usually takes a few minutes to become ready.

Finally, our cluster is ready to use.

Login to EMR cluster

Now we can log in to our cluster from a terminal:

ssh -i <path/to/ssh-key.pem> hadoop@<ip address of master node>

Note: You can get the IP address of the master node from the Hardware section of the AWS web console.
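If you prefer the command line, the master node’s public DNS name can also be looked up with the AWS CLI; replace the cluster-ID placeholder with your own cluster’s ID from the EMR console:

```shell
# Look up the master node's public DNS name for an existing cluster.
aws emr describe-cluster \
  --cluster-id <cluster-id> \
  --query 'Cluster.MasterPublicDnsName' \
  --output text
```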

Also note: if you get a "Permission denied" error, it’s worth checking the permissions of the .pem file; they should be 400. You can use the command below to fix them.

chmod 400 <path/to/ssh-key.pem>

Use Hive and Spark on our cluster

Finally, we are ready to use our cluster via Spark and Hive.
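On a default EMR install with these applications selected, the interactive shells are already on the master node’s PATH, so once you’ve SSH’d in you can start them directly; a quick sketch:

```shell
# On the master node:
spark-shell     # Scala Spark REPL
pyspark         # Python Spark REPL
hive            # Hive CLI

# Or a quick non-interactive sanity check via Spark SQL:
spark-sql -e "SHOW DATABASES;"
```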

Access Zeppelin

Now let’s access Zeppelin via the browser. The list of all the EMR web interfaces can be found here – EMR web interfaces.

So we can access Zeppelin at –

http://<master-ip-address>:8890/

The master IP address can be found from the EMR web interface:

If the page doesn’t load, or you’re having issues accessing Zeppelin via the web interface, you will need to open a tunnel to the EMR cluster. Please refer to this blog post to tunnel into the EMR cluster.

If you tunnel in, you would have to use this command for SSH’ing into the cluster:
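The exact command comes from the linked post; as an illustration only, one common form is a plain SSH local port forward that maps Zeppelin’s port 8890 to your machine (a sketch, not necessarily the command the post uses):

```shell
# Forward local port 8890 to Zeppelin on the master node;
# Zeppelin is then reachable at http://localhost:8890/
ssh -i <path/to/ssh-key.pem> -N -L 8890:localhost:8890 hadoop@<ip address of master node>
```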