Summary

Apache Spark is becoming the new lingua franca for distributed
computing. In this talk I'll show how many machine learning tasks can
be scaled up almost trivially using Spark. For instance, we'll see
how a semi-supervised NLP algorithm can be trained on a billion
training examples using a Spark cluster.

Description

Apache Spark is becoming the new lingua franca for distributed
computing. In this talk I'll show how many machine learning tasks can be
scaled up almost trivially using Spark.

After introducing the Spark computational model, I'll detail some useful
design principles for running Spark programs on large datasets, along
with tips for configuring a PySpark cluster effectively.
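As a taste of the kind of configuration covered, a minimal
spark-defaults.conf sketch (the specific values here are illustrative
assumptions, not recommendations from the talk):

```
# spark-defaults.conf -- illustrative values only
spark.executor.memory          8g
spark.executor.cores           4
spark.executor.memoryOverhead  2g    # headroom for the Python worker processes
spark.sql.shuffle.partitions   400   # tune to cluster size and data volume
spark.serializer               org.apache.spark.serializer.KryoSerializer
```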

The talk will include a step-by-step walkthrough of scaling up several
NLP algorithms. For instance, we'll see how a semi-supervised NLP
algorithm can be trained on a billion training examples using a PySpark
cluster.