Data Science on Google Cloud Platform: Machine Learning

Advanced, 6 Steps, 7h 51m, 42 Credits

This is the second of two Quests of hands-on labs derived from the exercises in the book Data Science on Google Cloud Platform by Valliappa Lakshmanan, published by O'Reilly Media, Inc. In this second Quest, covering chapter 9 through the end of the book, you extend the skills practiced in the first Quest and run full-fledged machine learning jobs with state-of-the-art tools and real-world data sets, all using Google Cloud Platform tools and services.

Data, Machine Learning

Prerequisites

This Quest assumes two prerequisites: 1) You have access to the O’Reilly book Data Science on the Google Cloud Platform, because the labs include only the exercises from the end of each chapter and do not contain the concepts or teaching from the text itself. 2) You have already completed the first Quest in this sequence, Data Science on the Google Cloud Platform, as well as all the prerequisites required for that Quest. Without these prerequisites, students will not have the skills or experience needed to succeed here.

Quest Outline

In this lab you will learn how to implement logistic regression with a machine learning library for Apache Spark, running on a Google Cloud Dataproc cluster, to develop a model from a multivariable dataset.
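As a rough illustration of what logistic regression computes, here is a minimal sketch in plain Python. It is not the Spark library or the lab's code: the function names, the toy data, and the batch gradient descent training loop are all assumptions chosen for readability, and a Spark job on Dataproc would delegate this work to the library.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability that example x belongs to the positive class."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic_regression(features, labels, learning_rate=0.1, epochs=1000):
    """Fit weights and bias with batch gradient descent on the log loss.

    Hypothetical helper for illustration; features is a list of
    equal-length numeric lists and labels is a list of 0/1 values.
    """
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = predict_proba(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        # Step against the average gradient.
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Toy multivariable data: the label equals the first feature.
toy_features = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
toy_labels = [0, 0, 1, 1]
toy_weights, toy_bias = train_logistic_regression(toy_features, toy_labels)
```

Calling predict_proba(toy_weights, toy_bias, x) on a new example returns a probability that can be thresholded, for instance at 0.5, to obtain a class.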

Deploy a Java application using Maven to process data with Cloud Dataflow. The Java application implements time-windowed aggregation to augment the raw data in order to produce consistent training and test datasets.
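The idea of time-windowed aggregation can be sketched in a few lines of plain Python. This is only a conceptual illustration, not the lab's Java pipeline or the Cloud Dataflow API: the function name, the one-hour window, and the (timestamp, value) event format are assumptions. Each event is augmented with the average of all values seen in the trailing window.

```python
from collections import deque


def sliding_window_average(events, window_seconds=3600):
    """Return (timestamp, trailing average) for each event.

    events must be (timestamp_in_seconds, value) pairs sorted by
    timestamp. The window for an event at time t covers the interval
    (t - window_seconds, t].
    """
    window = deque()
    total = 0.0
    results = []
    for timestamp, value in events:
        window.append((timestamp, value))
        total += value
        # Evict events that have fallen out of the trailing window.
        while window[0][0] <= timestamp - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        results.append((timestamp, total / len(window)))
    return results


# Hypothetical sample: values observed at 0s, 1800s, 3600s and 7300s.
sample_events = [(0, 10.0), (1800, 20.0), (3600, 30.0), (7300, 40.0)]
sample_averages = sliding_window_average(sample_events)
```

Computing the aggregate the same way for every record is what keeps the augmented training and test datasets consistent with each other.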

Learn the process for partitioning a data set into two separate parts: a training set to develop a model and a test set to evaluate its accuracy. You then use these partitions to independently evaluate predictive models in a repeatable manner.
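One common way to make such a partition repeatable is to derive each row's assignment from a hash of a stable key rather than from a random draw. The sketch below assumes that approach; the function names, the "date" key field, and the 80/20 split are hypothetical choices for illustration, not taken from the lab.

```python
import hashlib


def assign_split(key, train_fraction=0.8):
    """Deterministically map a string key to 'train' or 'test'."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Scale the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "train" if bucket < train_fraction else "test"


def partition(rows, key_field="date", train_fraction=0.8):
    """Split a list of dict rows into (train, test) by hashing key_field."""
    train, test = [], []
    for row in rows:
        if assign_split(row[key_field], train_fraction) == "train":
            train.append(row)
        else:
            test.append(row)
    return train, test


# Hypothetical rows: two records per day over thirty days.
example_rows = [
    {"date": f"2015-01-{day:02d}", "record": n}
    for day in range(1, 31)
    for n in range(2)
]
train_rows, test_rows = partition(example_rows)
```

Because the assignment depends only on the key, rerunning the split yields the same partitions, and all rows that share a key land on the same side, so the test set stays independent of the training set.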