Overview

In this lab, you will use Amazon EMR to analyze Ngrams from Google Books. An n-gram is a contiguous sequence of n items from a given sequence of text or speech. For example, consider this sentence:

Some 2-grams from this sentence are "the sun", "in the" and "sets in". A sample 3-gram is "sets in the" and a sample 4-gram is "rises in the east".
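As a rough illustration, n-grams can be extracted by sliding a window of n words across the text. The sketch below assumes simple whitespace tokenization; the function name `ngrams` and the example sentence are hypothetical, chosen here only to be consistent with the n-grams listed above.

```python
def ngrams(text, n):
    """Return all contiguous n-word sequences in text, lowercased."""
    # Assumption: whitespace tokenization with basic punctuation stripped.
    words = [w.strip(".,!?").lower() for w in text.split()]
    # Slide a window of n words across the token list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Assumed example sentence (illustrative, not taken from the lab data).
sentence = "The sun rises in the east and sets in the west."

bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
fourgrams = ngrams(sentence, 4)

print(bigrams)
print(trigrams)
print(fourgrams)
```

A sentence of k words yields k - n + 1 n-grams, so longer n-grams are always fewer in number than shorter ones for the same text.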

N-grams are used to predict the probability of certain words appearing in a sequence. This can be useful for providing typing suggestions on web pages and mobile phones.
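One simple way to turn n-gram counts into a prediction is to estimate, for each word, how often each possible next word follows it. The following is a minimal sketch under that assumption, using 2-grams only; the function name `next_word_probabilities` and the tiny corpus are hypothetical and are not part of the Google Books data used in this lab.

```python
from collections import Counter, defaultdict


def next_word_probabilities(corpus):
    """Estimate P(next word | current word) from 2-gram counts."""
    words = corpus.lower().split()
    following = defaultdict(Counter)
    # Count every adjacent word pair (2-gram) in the corpus.
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    # Normalize each word's counts so they sum to 1.
    return {
        word: {nxt: count / sum(counts.values()) for nxt, count in counts.items()}
        for word, counts in following.items()
    }


# Hypothetical toy corpus for illustration only.
corpus = "the sun rises in the east and the sun sets in the west"
probs = next_word_probabilities(corpus)

# Suggest the most likely word to follow "the".
suggestion = max(probs["the"], key=probs["the"].get)
print(probs["the"])
print(suggestion)
```

A typing-suggestion feature could rank candidates by these probabilities, although a realistic system would be trained on a far larger corpus and would typically use longer n-grams with smoothing.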

The steps in this lab closely mirror the activities a data scientist performs when analyzing a new data set: loading the data, examining its attributes, and writing SQL to analyze it. You will run SQL against publicly available Ngrams data stored in Amazon S3 to explore how word usage appears in the data.

If you are prompted for a token, use the one distributed to you (or credits you have purchased).

A status bar shows the progress of the lab environment creation process. The AWS Management Console is accessible during lab resource creation, but your AWS resources may not be fully available until the process is complete.

Open your lab by clicking Open Console.

This will automatically log you into the AWS Management Console.

Please do not change the Region unless instructed.

Common login errors

Error: Federated login credentials

If you see this message:

Close the browser tab to return to your initial lab window

Wait a few seconds

Click Open Console again

You should now be able to access the AWS Management Console.

Error: You must first log out

If you see the message "You must first log out before logging into a different AWS account":

Welcome to Your First Lab!

This lab demonstrates how to launch an Amazon Elastic MapReduce (EMR) cluster for Big Data processing and use Hive with SQL-style queries to analyze data. You will create a Hadoop cluster using Amazon EMR, which will allow you to run interactive Hive queries against data stored in Amazon S3. You will use Hive to normalize the data into a more useful form, and then run queries to analyze it.

This lab is included in the quest Big Data on AWS. If you complete this lab, you'll receive credit for it when you enroll in this quest.