Selected highlights for Data Scientists

With a tsunami of data, the scale of computing resources now available, and the rapid development of easy-to-learn open source Machine Learning frameworks, data science and machine learning concepts are much easier to learn and implement today than they were a decade ago.

As a result, practitioners across all industries are using cutting-edge ML algorithms to solve tough data problems and are eager to learn new techniques.

Learning is a lifelong journey, but which skills do data scientists most need to stay current? According to Inside Big Data and KDnuggets, Python and R programming, graphs, NLP, Apache Spark and Hadoop, and unbiased modeling are some of the key areas to explore.

Below is a selection of Spark + AI Summit 2019 sessions from the Data Science track and the Python and Advanced Analytics track that will help you sharpen some of these skills.

Data Science

Relationships are one of the most predictive indicators of behavior and preferences. If you want to understand the algorithms that help identify group dynamics and better predict community behavior, Predicting Influence and Communities Using Graph Algorithms is for you. In this session, Amy Hodler and Mark Needham of Neo4j will explore how to run community detection and centrality algorithms in Apache Spark, covering best practices and examples of running graph algorithms in the Neo4j Graph Platform.
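To make the two algorithm families named in that session concrete, here is a minimal sketch in plain Python. It is not the Spark or Neo4j implementation the speakers will present. The function names (`build_adjacency`, `degree_centrality`, `label_propagation`) and the toy graph are hypothetical and chosen only for illustration. Degree centrality stands in for the centrality family, and a simple synchronous label propagation stands in for community detection.

```python
from collections import Counter, defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def degree_centrality(adj):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adj)
    if n <= 1:
        return {node: 0.0 for node in adj}
    return {node: len(neighbours) / (n - 1) for node, neighbours in adj.items()}


def label_propagation(adj, max_iters=20):
    """Synchronous label propagation (illustrative only).

    Every node starts with its own label, then repeatedly adopts the most
    common label among its neighbours. Ties go to the smallest label so the
    result is deterministic. max_iters bounds the loop because synchronous
    updates are not guaranteed to converge on every graph.
    """
    labels = {node: node for node in adj}
    for _ in range(max_iters):
        new_labels = {}
        for node, neighbours in adj.items():
            if not neighbours:
                new_labels[node] = labels[node]
                continue
            counts = Counter(labels[nb] for nb in neighbours)
            top = max(counts.values())
            new_labels[node] = min(l for l, c in counts.items() if c == top)
        if new_labels == labels:
            break
        labels = new_labels
    return labels


if __name__ == "__main__":
    # Hypothetical toy graph: two triangles joined by a single bridge edge.
    edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("c", "d"),
             ("d", "e"), ("e", "f"), ("d", "f")]
    adj = build_adjacency(edges)
    print(degree_centrality(adj))
    print(label_propagation(adj))
```

On the toy graph, the two bridge nodes get the highest centrality scores and each triangle settles on a single shared label. Production libraries typically use randomized or asynchronous updates and run distributed, so treat this only as a way to see the idea at small scale.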

Expanding on our data science toolkit, in Running R at Scale with Apache Arrow on Spark, Javier Luraschi of RStudio will introduce the Apache Arrow project and the recent developments that enable running R with Apache Arrow on Apache Spark to significantly improve performance and efficiency.

Last but not least, as we continue to implement Machine Learning systems in virtually all fields and application domains, it is crucial to prevent discrimination, privacy violations, and even accuracy problems in our systems. In Interpretable AI: Not Just For Regulators, Patrick Hall and Mark Chan of H2O.ai will talk about how to train explainable, fair, trustworthy, and accurate predictive modeling systems.

Python & Advanced Analytics

One of the biggest challenges faced by Python users is moving huge amounts of data around. In Make your PySpark Data Fly with Arrow!, Bryan Cutler of IBM will give an overview of the new Arrow Flight framework and demonstrate how to build high-performance connections with Arrow.

Managing the data science workflow, from experimentation to production, is another major challenge that data scientists and practitioners face every day. In Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale, Hari Subramanian of Uber will demonstrate how Uber’s big data platforms and data science workbench put the power of Spark in the hands of data scientists and data analysts for advanced analytics and ML/DL use cases at scale. On a related topic, in June 2018 Databricks unveiled MLflow, a new open-source framework for managing the complete Machine Learning lifecycle. Don’t miss Matei Zaharia’s keynote for the latest updates on this initiative.

Finally, in Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs, Ben Weber will explain how Zynga overcame scale challenges when working with large data sets by using Pandas UDFs to scale and automate the feature engineering process. As a result, teams can now use hundreds of propensity models in production to help personalize game experiences, and data scientists spend more of their time working with game teams to build new features.

Data Science Classes

If you are someone who learns best by doing, don’t miss our tutorial on Managing the Complete Machine Learning Lifecycle with MLflow, an 80-minute session that opens with an expert-led talk introducing MLflow, a new open source framework for managing the ML lifecycle, and continues with hands-on exercises for attendees.