Track: Papers in Production: Modern CS in the Real World

Location: Cyril Magnin III

Day of week: Tuesday

Which papers are making a real-world impact today? This track looks at important papers that are influencing and changing software right now. We explore topics including speech, infrastructure, self-driving cars, GANs, probabilistic data structures, and deep learning more broadly. The Papers in Production track aims to showcase research that is being used in production.

Track Host: Sid Anand

Chief Data Engineer @PayPal

Sid Anand currently serves as PayPal's Chief Data Engineer, focusing on ways to realize the value of data. Prior to joining PayPal, he held several positions, including Data Architect at Agari, Technical Lead in Search at LinkedIn, Cloud Data Architect at Netflix, VP of Engineering at Etsy, and several technical roles at eBay. Sid earned his BS and MS degrees in CS from Cornell University, where he focused on Distributed Systems. In his spare time, he is a maintainer/committer on Apache Airflow, a co-chair for QCon, and a frequent speaker at conferences. When not working, Sid spends time with his wife, Shalini, and their two kids.

Data produced and managed by Big Data systems like Apache Spark and Hive cannot be directly consumed by Deep Learning systems like TensorFlow and PyTorch. Petastorm bridges this gap by enabling direct consumption of data in Apache Parquet format by TensorFlow and PyTorch. In this talk, we describe how Petastorm facilitates tighter integration between the Big Data and Deep Learning worlds, simplifies data management and data pipelines, and speeds up model experimentation.

In this talk I will introduce wav2letter++, a fast open-source deep learning speech recognition framework. wav2letter++ is written entirely in C++, and uses the ArrayFire tensor library for maximum efficiency. I will explain the architecture and design of the wav2letter++ system and compare it to other major open-source speech recognition systems. In some cases wav2letter++ is more than 2x faster than other optimized frameworks for training end-to-end neural networks for speech recognition. I will also show that wav2letter++'s training times scale linearly to 64 GPUs, the highest that has been tested, for models with 100 million parameters. High-performance frameworks enable fast iteration, which is often a crucial factor in successful research and model tuning on new datasets and tasks.

The next generation of AI applications will continuously interact with the environment and learn from these interactions. To develop these applications, data scientists and engineers will need to seamlessly scale their work from running interactively to production clusters. In this talk, I’ll cover some major open source AI + Data Science libraries my collaborators and I at the RISELab have been working on.

At a high level, I’ll talk about my work on the following: Ray, a distributed execution framework for emerging AI applications; Tune, a scalable hyperparameter optimization framework for reinforcement learning and deep learning; RLlib, an open-source library for reinforcement learning that offers both a collection of reference algorithms and scalable primitives for composing new ones; and Modin, an open-source dataframe library for scaling pandas workflows by changing one line of code.

NERSC has successfully applied Deep Learning to a range of scientific workloads. Motivated by the volume and complexity of scientific datasets, and the computationally demanding nature of DL, we have undertaken several projects targeted at scaling DL on the largest CPU- and GPU-based systems in the world. This talk will explore 2D and 3D convolutional architectures for solving pattern classification, regression, and segmentation problems in high-energy physics, cosmology, and climate science. Our efforts have resulted in a number of first-time results: scaling Caffe to 9600 Cori/KNL nodes obtaining 15PF performance (SC’17), scaling TensorFlow to 8192 Cori/KNL nodes obtaining 3.5PF performance (SC’18), and finally, scaling TensorFlow to 4560 Summit/Volta nodes, obtaining 1EF performance (SC’18). The talk will review lessons learnt from these projects and outline future challenges for the DL community.

One of the more interesting trends in the AI field is "AI on the edge". There is a lot of value in pushing computation to the edge, because much of the time the edge is where the data comes from. But there are also real challenges. Gwen Shapira will present the paper "Towards a Solution to the Red Wedding Problem", in which the authors introduce and present a solution to one of the biggest challenges in edge computing: concurrent updates of shared data.

This paper describes how you can group data points together by looking at the number of data points around each one. Besides being an efficient way of grouping data, the algorithm is also able to flag points that are likely noise. Roland will explain the algorithm and show a possible application for autonomous vehicles.
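The abstract does not name the algorithm, so the following is only a minimal sketch of density-based clustering in the general spirit described: a point with enough neighbors nearby seeds a cluster, and a point with too few is flagged as noise. The function names, the `eps` radius, the `min_pts` threshold, and the sample points are all assumptions made for illustration, not details from the paper.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def density_cluster(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned by an earlier expansion
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # dense point: keep expanding
    return labels


# Hypothetical 2D points: two tight groups and one isolated outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (50.0, 50.0)]
labels = density_cluster(points, eps=1.5, min_pts=3)
```

The brute-force neighbor search here is quadratic in the number of points; a production system would typically use a spatial index instead.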

When we think of the "best" data structure for a workload, we classically mean the least worst: the data structure that performs least poorly given a mostly unknown data distribution. We could custom-design an optimal structure if we knew the data distribution, of course, but that involves significant programming effort. Enter machine learning, and Google's research paper "The Case for Learned Indexes" (https://ai.google/research/pubs/pub46518). By replacing a standard B-Tree with a machine-learned model, this paper demonstrates a 70% improvement in speed and a 10x improvement in space on real-world workloads. That in itself is revolutionary. Far more importantly, though, it introduces us to a whole new class of techniques to build software.
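To make the idea concrete, here is a toy sketch of an index that predicts a key's position with a model and then corrects the guess with a bounded search. It assumes a single least-squares line as the model, which is a deliberate simplification and not the architecture evaluated in the paper; the `LearnedIndex` class and its method names are hypothetical.

```python
from bisect import bisect_left


class LearnedIndex:
    """Toy index: a fitted line predicts where a key sits in a sorted array."""

    def __init__(self, keys):
        self.keys = sorted(set(keys))
        n = len(self.keys)
        if n == 0:
            raise ValueError("keys must not be empty")
        # Least-squares fit of position ~ slope * key + intercept.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Worst-case prediction error over the stored keys bounds the search.
        self.max_err = max(abs(p - self._predict(k))
                           for p, k in enumerate(self.keys))

    def _predict(self, key):
        """Predicted position, clamped to the valid index range."""
        guess = int(round(self.slope * key + self.intercept))
        return min(max(guess, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        guess = self._predict(key)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err + 1, len(self.keys))
        # Binary search only inside the window the error bound allows.
        pos = bisect_left(self.keys, key, lo, hi)
        if pos < len(self.keys) and self.keys[pos] == key:
            return pos
        return None


# Hypothetical keys for illustration.
sample_keys = [3, 8, 15, 21, 40, 41, 57, 90, 120, 121]
index = LearnedIndex(sample_keys)
found = [index.lookup(k) for k in sample_keys]
missing = index.lookup(50)
```

The recorded maximum error is what keeps lookups correct for stored keys: the model may guess wrong, but never by more than the window the search covers. The closer the key distribution is to what the model can fit, the smaller that window becomes.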

Facebook partners with humanitarian and academic organizations, as well as community-driven projects like OpenStreetMap, on a number of Data for Good efforts. Examples of the outputs are:

the High-Resolution Settlement Layer, for which we identify the locations of human-built structures from high-resolution satellite images and add population data to it in collaboration with Columbia University;

our Disaster Maps, which contain aggregated, anonymized information about network coverage and power availability, as well as human mobility in the context of natural disasters; and

our large-scale input into OpenStreetMap, for which we detect roads from high-resolution satellite images, prepare them for human review, and feed the results into OpenStreetMap.

We will present details about the methods, challenges, and community feedback involved in producing these datasets, as well as the impact they've each had over the last two years.