Lecture Notes:

Homework 3 Released

Homework 2 Due

Lecture 8 (02/09/2017): Prediction and Inference [Yu]

In this lecture we will explore the key types and challenges of inference and prediction. We will provide an overview of the categories of prediction problems and introduce some of the key machine learning tools in Python.

Feature Engineering, Over-fitting, and Cross Validation [Gonzalez]

In this lecture we will begin to do some machine learning. We will explore how simple linear techniques can be used to address complex non-linear relationships on a wide range of data types. We will start to use scikit-learn to build and visualize models in higher dimensional spaces. We will address a key challenge in machine learning, over-fitting, and discuss how cross-validation can be used to combat it.

The following interactive (html) notebooks walk through the concepts we use in lecture and are suggested reading materials.
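As a minimal sketch of the ideas above (assuming scikit-learn is installed; the data here is synthetic, not from the course), the following fits linear models to a non-linear relationship by engineering polynomial features, and uses cross-validation to expose over-fitting as model complexity grows:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear relationship: y = sin(4x) + noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.1, size=100)

# Polynomial feature engineering lets a *linear* model capture the
# non-linearity; 5-fold cross-validation estimates out-of-sample
# fit (R^2) instead of rewarding memorization of the training data.
results = {}
for degree in [1, 3, 9]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    results[degree] = cross_val_score(model, X, y, cv=5).mean()
    print(degree, results[degree])
```

A degree-1 model under-fits this data, while a moderate-degree polynomial recovers the underlying curve; cross-validated scores make that comparison without touching a held-out test set.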

Lecture 22 (03/30/2017): Spring Break

Lecture 23 (04/04/2017): Regularization and the Bias-Variance Tradeoff [Gonzalez]

In this lecture we will continue our exploration of over-fitting and derive the fundamental bias-variance tradeoff for the least squares model. We will then introduce the concept of regularization and explore the commonly used L1 and L2 regularization functions.
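A small illustrative sketch of L1 and L2 regularization (using scikit-learn's Ridge and Lasso on synthetic data; the specific alphas are arbitrary choices, not course values):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Noisy setting with many correlated-looking features, only 3 of
# which actually matter.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)
# L2 (ridge) shrinks all weights toward zero, reducing variance;
# L1 (lasso) drives many weights exactly to zero, acting as a
# form of feature selection.
ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("||w|| OLS vs ridge:", np.linalg.norm(ols.coef_),
      np.linalg.norm(ridge.coef_))
print("nonzero lasso weights:",
      int(np.sum(np.abs(lasso.coef_) > 1e-8)))
```

Comparing the coefficient vectors shows the characteristic behaviors: ridge produces uniformly smaller weights, while lasso zeroes out most of the irrelevant features.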

Finish Logistic Regression and Start K-Means [Gonzalez and Yu]

In this lecture we will finish our discussion of logistic regression and begin to explore unsupervised learning techniques. In particular, we will start with K-means and work toward the more general EM algorithm.
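A minimal K-means sketch (assuming scikit-learn; two synthetic, well-separated clusters). K-means alternates between assigning points to their nearest centroid and recomputing centroids, a special case of the alternating structure of the EM algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated Gaussian blobs in 2-D.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=0.0, size=(50, 2)),
               rng.normal(loc=5.0, size=(50, 2))])

# Lloyd's algorithm: assign points to nearest centroid (E-like
# step), then move each centroid to the mean of its points
# (M-like step), repeated until assignments stabilize.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
print(km.cluster_centers_)
```

On data this well separated, the recovered labels match the two blobs; EM generalizes the hard assignments here to soft, probabilistic ones.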

Optional Reading:

Map-Reduce, Spark, and Big Data [Gonzalez]

In this lecture we will introduce the Map-Reduce model of distributed computation and then dive into Apache Spark, a Map-Reduce system developed at Berkeley. We will discuss how these computational frameworks can be used to scale data processing.
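The Map-Reduce pattern itself can be sketched without a cluster. The following pure-Python word count mimics the map, shuffle (group-by-key), and reduce phases that a system like Spark executes across many machines:

```python
from collections import defaultdict
from itertools import chain

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: each document independently emits (word, 1) pairs,
# so this step parallelizes trivially across documents.
mapped = chain.from_iterable(
    ((word, 1) for word in doc.split()) for doc in documents)

# Shuffle phase: group all values sharing a key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine each key's values independently, which
# again parallelizes across keys.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
```

Because the map and reduce steps touch each record or key independently, the framework is free to distribute them, which is what makes the model scale.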

Homework 6 Due

Homework 7 Released

Guest Lecturer on Data Science and Ethics [Charis Thompson]

Finish Discussion on Spark and Classification

In the previous lectures we moved quickly through some important concepts in distributed data processing and classification. Because both of these ideas are critical in many data science applications, we will return to the discussion of Spark and review how the relational operators we learned earlier in the class enable scalable distributed computing. We will then return to the topic of classification and review logistic regression and how it can be made to run in a distributed computing environment. Time permitting, we will touch on deep learning as a generalization of the ideas in logistic regression.
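To make concrete why logistic regression runs well in a distributed setting, note that the gradient of its loss is a sum over data points. Each partition can therefore compute its partial gradient independently (a map) and the results are added (a reduce). A sketch with NumPy standing in for the cluster, not the course's actual implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(w, X, y):
    # Gradient of the negative log-likelihood of logistic
    # regression: a sum of per-example terms.
    return X.T @ (sigmoid(X @ w) - y)

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = (rng.rand(100) < 0.5).astype(float)
w = np.zeros(3)

# "Distributed" computation: map each partition of the rows to
# its partial gradient, then reduce by summing.
partitions = np.array_split(np.arange(100), 4)
partial = [gradient(w, X[idx], y[idx]) for idx in partitions]
distributed_grad = np.sum(partial, axis=0)

# Identical to the gradient over all of the data at once.
full_grad = gradient(w, X, y)
```

Because addition is associative and commutative, the partial sums can be combined in any order, which is exactly the contract a Map-Reduce reduce step requires.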

PCA and the Berkeley Data Science Major [Nolan and Cathryn Carson]

In this lecture we will provide an overview of dimensionality reduction and discuss the PCA method. We will conclude with a discussion from Cathryn Carson on the development and status of the Berkeley Data Science Major.
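A compact sketch of the PCA computation on synthetic data (one of several equivalent formulations; this one uses the SVD of the centered data matrix):

```python
import numpy as np

# 200 points in 3-D that mostly vary along a single direction.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 1)) @ np.array([[2.0, 1.0, 0.5]]) \
    + 0.1 * rng.normal(size=(200, 3))

# PCA: center the data; the right singular vectors of the centered
# matrix are the principal directions, ordered by the variance
# they explain (proportional to the squared singular values).
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_variance = s**2 / (len(X) - 1)

# Dimensionality reduction: project 3-D points onto the top
# principal component, yielding a 1-D representation.
X_reduced = Xc @ Vt[0]
print(explained_variance)
```

On this data the first component captures nearly all of the variance, which is the situation in which dimensionality reduction loses little information.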