Machine Learning With Apache Spark

This course teaches machine learning at scale using the popular Apache Spark framework.

This course is intended for data scientists and software engineers. We assume no previous knowledge of machine learning; we teach popular machine learning algorithms from scratch.

For each machine learning concept, we first discuss its foundations, applicability, and limitations. Then we explain its implementation and use, along with specific use cases. The course is about 50% lecture and 50% lab work.

This course is taught using Spark & Python.

Highlights of this course

No previous knowledge of machine learning is required

Learn not just the APIs, but also the theory behind them

Work with real-world datasets from Uber, Netflix, Walmart, Prosper, etc.

Objectives:

Learn popular machine learning algorithms, their applicability, and limitations

Practice the application of these methods in the Spark machine learning environment

Learn practical use cases and limitations of algorithms

What you will learn

ML Concepts

Regressions

Classifications

Clustering

Principal Component Analysis (PCA)

Recommendations

Duration

3 days

Audience

Data Scientists and Software Engineers

Prerequisites

If students are new to Apache Spark, we can offer one day of ‘Introduction to Spark’ training

Programming background

Familiarity with Python is a plus, but not required

No machine learning knowledge is assumed

Lab environment:

A working Spark environment will be provided for students. Students only need an SSH client and a browser.

Zero Install: There is no need to install software on students’ machines.

Detailed Course Outline:

Section 1: Machine Learning (ML) Overview

Machine Learning landscape

Machine Learning applications

Understanding ML algorithms & models

Section 2: ML in Python and Spark

Spark ML Overview

Introduction to Jupyter notebooks

Lab: Working with Jupyter + Python + Spark

Lab: Spark ML utilities

Section 3: Machine Learning Concepts

Statistics Primer

Covariance, Correlation, Covariance Matrix

Errors, Residuals

Overfitting / Underfitting

Cross-validation, bootstrapping

Confusion Matrix

ROC curve, Area Under Curve (AUC)

Lab: Basic stats
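
To give a feel for two of the evaluation concepts listed in this section, here is a minimal sketch of a confusion matrix and AUC in plain Python. It is an illustration only, not course lab material: the labs use Spark, and the function names, the sample labels, the scores, and the 0.5 threshold are all assumptions made up for this example.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def auc(y_true, scores):
    """AUC computed as the fraction of (positive, negative) pairs in which
    the positive example gets the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores, invented for illustration.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]

# Turn scores into hard predictions with an assumed 0.5 threshold.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("AUC:", auc(y_true, scores))
```

The confusion matrix depends on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is why both are typically examined together.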

Section 4: Feature Engineering (FE)

Preparing data for ML

Extracting features, enhancing data

Data cleanup

Visualizing Data

Lab: data cleanup

Lab: visualizing data

Section 5: Linear regression

Simple Linear Regression

Multiple Linear Regression

Running LR

Evaluating LR model performance

Lab

Use case: House price estimates
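
As a rough illustration of simple linear regression and of one common way to evaluate it, the sketch below fits a line by ordinary least squares and reports the root mean squared error (RMSE). It is plain Python rather than Spark, and the house sizes, prices, and function names are hypothetical values chosen for this example, not data from the course.

```python
import math


def fit_simple_lr(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(xs, ys, slope, intercept):
    """Root mean squared error of the fitted line on the given points."""
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


# Hypothetical data: house size in square feet, price in thousands.
sizes = [1000, 1500, 2000, 2500]
prices = [200, 290, 410, 500]

slope, intercept = fit_simple_lr(sizes, prices)
error = rmse(sizes, prices, slope, intercept)
print("slope:", slope, "intercept:", intercept, "RMSE:", error)
```

Multiple linear regression follows the same idea with more than one input feature; at that point a library implementation is the practical choice.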

Section 6: Logistic Regression

Understanding Logistic Regression

Calculating Logistic Regression

Evaluating model performance

Lab

Use case: credit card application, college admissions

Section 7: Classification: SVM (Support Vector Machines)

SVM concepts and theory

SVM with kernel

Lab

Use case: Customer churn data

Section 8: Classification: Decision Trees & Random Forests

Theory behind trees

Classification and Regression Trees (CART)

Random Forest concepts

Labs

Use case: predicting loan defaults, estimating election contributions

Section 9: Classification: Naive Bayes

Theory

Lab

Use case: spam filtering

Section 10: Clustering (K-Means)

Theory behind K-Means

Running K-Means algorithm

Estimating the performance

Lab

Use case: grouping cars data, grouping shopping data
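
The K-Means loop covered in this section can be sketched in a few lines of plain Python: assign each point to its nearest center, recompute each center as the mean of its points, and repeat. This is a toy illustration under stated assumptions, not the Spark implementation used in the labs: the points are invented, the function names are hypothetical, and the centers are initialized from the first k points so the run is deterministic.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=10):
    """Basic K-Means; returns (centers, wssse)."""
    # Assumption: deterministic init from the first k points.
    centers = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centers[c]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [
            mean_point(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    # Within-set sum of squared errors: lower means tighter clusters.
    wssse = sum(min(dist2(p, c) for c in centers) for p in points)
    return centers, wssse


# Hypothetical 2-D points forming two visible groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centers, wssse = kmeans(points, k=2)
print("centers:", centers)
print("WSSSE:", wssse)
```

Running the sketch for several values of k and comparing the WSSSE is one simple way to estimate clustering quality.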

Section 11: Principal Component Analysis (PCA)

Understanding PCA concepts

PCA applications

Running a PCA algorithm

Evaluating results

Lab

Use case: analyzing retail shopping data

Section 12: Recommendations (Collaborative filtering)

Recommender systems overview

Collaborative Filtering concepts

Lab

Use case: movie recommendations, music recommendations

Section 13: Performance

Best practices for scaling and optimizing Apache Spark

Memory caching

Testing and validation

Section 14: Final workshop (time permitting)

Students will analyze a couple of datasets and run ML algorithms.
This is done as a group exercise. Each group will present their findings to the class.