At course completion

After completing this course, students will be able to:

Deploy HDInsight Clusters

Authorize Users to Access Resources

Load Data into HDInsight

Troubleshoot HDInsight

Implement Batch Solutions

Design Batch ETL Solutions for Big Data with Spark

Analyze Data with Spark SQL

Analyze Data with Hive and Phoenix

Describe Stream Analytics

Implement Spark Streaming Using the DStream API

Develop Big Data Real-Time Processing Solutions with Apache Storm

Build Solutions that use Kafka and HBase

Pre-requisites

Before attending this course, students must have:

Programming experience using R, and familiarity with common R packages

Knowledge of common statistical methods and data analysis best practices

Basic knowledge of the Microsoft Windows operating system and its core functionality

Working knowledge of relational databases

Course Outline

Module 1: Getting Started with HDInsight

This module introduces Hadoop, the MapReduce paradigm, and HDInsight.

- What is Big Data?
- Introduction to Hadoop
- Working with MapReduce Function
- Introducing HDInsight

Module 3: Authorizing Users to Access Resources

This module provides an overview of non-domain-joined and domain-joined Microsoft HDInsight clusters, and covers the creation and configuration of domain-joined HDInsight clusters. It also demonstrates how to manage domain-joined clusters using the Ambari management UI and the Ranger Admin UI. The labs in this module walk through the steps to create and manage domain-joined clusters.

- Non-domain-joined clusters
- Configuring domain-joined HDInsight clusters
- Manage domain-joined HDInsight clusters

Module 4: Loading data into HDInsight

This module provides an introduction to loading data into Microsoft Azure Blob storage and Microsoft Azure Data Lake storage. At the end of this module, you will know how to use multiple tools to transfer data to an HDInsight cluster. You will also learn how to load and transform data to decrease your query run time.

- Storing data for HDInsight processing
- Using data loading tools
- Maximising value from stored data

Module 5: Troubleshooting HDInsight

In this module, you will learn how to interpret the logs associated with the various services of a Microsoft Azure HDInsight cluster so that you can troubleshoot issues with those services. You will also learn about Operations Management Suite (OMS) and its capabilities.

- Analyze HDInsight logs
- YARN logs
- Heap dumps
- Operations Management Suite (OMS)

Module 6: Implementing Batch Solutions

In this module, you will look at implementing batch solutions in Microsoft Azure HDInsight by using Hive and Pig. You will also discuss the approaches for data pipeline operationalization that are available for big data workloads on an HDInsight stack.

- Apache Hive storage
- HDInsight data queries using Hive and Pig
- Operationalize HDInsight

Module 7: Design Batch ETL solutions for big data with Spark

This module provides an overview of Apache Spark, describing its main characteristics and key features. Before you start, it’s helpful to understand the basic architecture of Apache Spark and the different components that are available. The module also explains how to design batch Extract, Transform, Load (ETL) solutions for big data with Spark on HDInsight. The final lesson includes some guidelines to improve Spark performance.

- What is Spark?
- ETL with Spark
- Spark performance

Module 8: Analyze Data with Spark SQL

This module describes how to analyze data by using Spark SQL. You will learn to explain the differences between RDDs, Datasets, and DataFrames, identify the use cases for iterative and interactive queries, and describe best practices for caching, partitioning, and persistence. You will also look at how to use Apache Zeppelin and Jupyter notebooks, carry out exploratory data analysis, and submit Spark jobs remotely to a Spark cluster.

- Implementing iterative and interactive queries
- Perform exploratory data analysis

Module 9: Analyze Data with Hive and Phoenix

In this module, you will learn about running interactive queries using Interactive Hive (also known as Hive LLAP or Live Long and Process) and Apache Phoenix. You will also learn about the various aspects of running interactive queries using Apache Phoenix on top of HBase.

- Implement interactive queries for big data with Interactive Hive
- Perform exploratory data analysis by using Hive
- Perform interactive processing by using Apache Phoenix

Module 10: Stream Analytics

The Microsoft Azure Stream Analytics service has built-in features and capabilities that make it an easy-to-use, flexible stream processing service in the cloud. This module discusses in detail the advantages of using Stream Analytics for your streaming solutions. You will also compare the features of Stream Analytics with those of other services available within the Microsoft Azure HDInsight stack, such as Apache Storm. You will learn how to deploy a Stream Analytics job, connect it to the Microsoft Azure Event Hub to ingest real-time data, and execute a Stream Analytics query to gain low-latency insights. After that, you will learn how Stream Analytics jobs can be monitored when deployed and used in production settings.

- Stream analytics
- Process streaming data from stream analytics
- Managing stream analytics jobs

Module 11: Implementing Streaming Solutions with Kafka and HBase

In this module, you will learn how to use Kafka to build streaming solutions. You will also see how to use Kafka to persist data to HDFS by using Apache HBase, and then query this data.

- Building and Deploying a Kafka Cluster
- Publishing, Consuming, and Processing data using the Kafka Cluster
- Using HBase to store and Query Data