Development

The radanalytics.io community has several ongoing projects with frequent
releases. These are all collected in our GitHub organization. Each project
addresses a specific concern within the OpenShift realm and provides a solid
solution for your own data-driven applications.

Presentations

The following presentations are about the technologies involved in and
related to the radanalytics.io projects. We love our community and the
passion they have for these technologies. If you know of a presentation
that would fit in here, please open a pull request and add it to the list!

2018

Why Data Scientists Love Kubernetes

Sophie Watson & William Benton

KubeCon & CloudNativeCon North America • Seattle, WA • December 2018

This talk will introduce the workflows and concerns of data scientists and machine learning engineers and demonstrate how to make Kubernetes a powerhouse for intelligent applications.

We’ll show how community projects like Kubeflow and radanalytics.io support the entire intelligent application development lifecycle. We’ll cover several key benefits of Kubernetes for a data scientist’s workflow, from experiment design to publishing results. You’ll see how well scale-out data processing frameworks like Apache Spark work in Kubernetes.

System operators will learn how Kubernetes can support data science and machine learning workflows. Application developers will learn how Kubernetes can enable intelligent applications and cross-functional collaboration. Data scientists will leave with concrete suggestions for how to use Kubernetes and open-source tools to make their work more productive.

Building an Implicit Recommendation Engine with Spark

Sophie Watson

Spark + AI Summit Europe • London, England • October 2018

Many of today’s most engaging — and commercially important — applications provide personalised experiences to users. Collaborative filtering algorithms capture the commonality between users and enable applications to make personalised recommendations quickly and efficiently.

The Alternating Least Squares (ALS) algorithm is still deemed the industry standard in collaborative filtering. In this talk Sophie will show you how to implement ALS using Apache Spark to build your own recommendation engine. Sophie will show that, by splitting the recommendation engine into multiple cooperating services, it is possible to reduce the system’s complexity and produce a robust collaborative filtering platform with support for continuous model training.

In this presentation you will learn how to build a recommendation system for the case where recorded data is explicitly given as a rating, as well as for the case where feedback is only implicit. You will walk away from this talk with the knowledge and tools needed to implement your own recommendation system using collaborative filtering and microservices.

Extending Structured Streaming Made Easy with Algebra

Erik Erlandson

Spark+AI Summit EU • London, England • October 2018

Apache Spark’s Structured Streaming library provides a powerful set of primitives for building streaming pipelines for data processing. However, it is not always obvious how to take full advantage of this power in a way that works naturally with your application’s unique business logic. If you associate algebra with solving equations while wishing you were doing something else, think again: we’ll see how we can apply the properties of operations we all understand — like addition, multiplication, and set union — to reason about our data engineering pipelines.
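The algebraic idea in the abstract can be sketched outside of Spark entirely. The toy `monoid_aggregate` helper below is a hypothetical illustration, not code from the talk: when a combining operation is associative and has an identity element (a monoid), per-partition partial aggregates can be merged in any grouping and still agree with a sequential pass, which is exactly the property a streaming aggregation relies on.

```python
# Hypothetical sketch: monoids let partitioned aggregation agree with
# sequential aggregation, the property streaming pipelines depend on.
from functools import reduce

def monoid_aggregate(partitions, combine, identity):
    """Aggregate each partition independently, then merge the partials."""
    partials = [reduce(combine, part, identity) for part in partitions]
    return reduce(combine, partials, identity)

data = [3, 1, 4, 1, 5, 9, 2, 6]
partitions = [data[:3], data[3:6], data[6:]]

# Addition is a monoid (associative, identity 0), so the split-then-merge
# result equals a single sequential sum.
total = monoid_aggregate(partitions, lambda a, b: a + b, 0)
assert total == sum(data)

# Set union is a monoid too (associative, identity set()).
sets = [[{1, 2}], [{2, 3}], [{4}]]
union = monoid_aggregate(sets, lambda a, b: a | b, set())
assert union == {1, 2, 3, 4}
```

The same reasoning carries over to Spark aggregators: as long as the merge step is associative with an identity, results are independent of how the engine partitions or windows the data.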

Apache Spark for Library Developers (Deep Dive Part 2)

Erik Erlandson & William Benton

October 2018

This is part 2 of a 2-session deep dive, which covers:

Extending data frames with custom aggregates

Exposing JVM libraries for Python

Publishing Spark libraries

As a developer, data engineer, or data scientist, you’ve seen how Apache Spark is expressive enough to let you solve problems elegantly and efficient enough to let you scale out to handle more data. However, if you’re solving the same problems again and again, you probably want to capture and distribute your solutions so that you can focus on new problems and so other people can reuse and remix them: you want to develop a library that extends Spark.

You faced a learning curve when you first started using Spark, and you’ll face a different learning curve as you start to develop reusable abstractions atop Spark. In this talk, two experienced Spark library developers will give you the background and context you’ll need to turn your code into a library that you can share with the world. We’ll cover:

Issues to consider when developing parallel algorithms with Spark

Designing generic, robust functions that operate on data frames and datasets

Extending data frames with user-defined functions (UDFs) and user-defined aggregates (UDAFs)

Best practices around caching and broadcasting, and why these are especially important for library developers

Integrating with ML pipelines

Exposing key functionality in both Python and Scala

How to test, build, and publish your library for the community

We’ll back up our advice with concrete examples from real packages built atop Spark. You’ll leave this talk informed and inspired to take your Spark proficiency to the next level and develop and publish an awesome library of your own.

From Research to Production: What they didn’t teach you in Grad School

Sophie Watson

Berlin Buzzwords • Berlin, Germany • June 2018

Academic researchers find novel solutions to thorny problems in idealized environments. A research background is excellent preparation for advancing the state of the art, but newly-minted professional data scientists can find themselves in industry with an arsenal of problem-solving techniques that are not as potent as they seemed in graduate school: data sets are larger and messier, solutions are judged by their outcomes rather than by their novelty, and products, unlike publications, require ongoing maintenance and support.

This talk will draw on the speaker’s experience bringing a mathematics research background to a team in industry. We will show both the challenges that data scientists face when entering industry from academia and the unique skills that they bring from their research background. We shall frame the discussion with a running example of cutting-edge statistical research embodied in an imperfect implementation. We’ll demonstrate iterative refinements to our implementation, showing how to take a research prototype to production code, with particular attention to real-world pitfalls that might not appear in a researcher’s daily work. Finally, we’ll show how trained researchers can turn their background into a superpower for applied teams in industry.

Early-career attendees who are considering joining industry from academia will learn how to navigate the challenges they’ll face on a mixed team and how to best use their gifts and skills in a new environment. Established practitioners will learn how to support, engage, and nurture their colleagues who are transitioning from academia. Everyone will learn how to adapt implementations and ideas from the research world for production applications.

Building Streaming Recommendation Engines on Spark

Rui Vieira

Berlin Buzzwords • Berlin, Germany • June 2018

Collaborative filtering is a well known method to implement recommendation engines. Although modern techniques, such as Alternating Least Squares (ALS), allow us to perform rating predictions with large amounts of observations, typically ALS is implemented as a distributed batch algorithm where retraining must be performed with the entirety of the data. However, when dealing with large amounts of data as a stream, batch retraining might be problematic.

In this talk Rui will guide us in building a streaming ALS implementation using Apache Spark and based on Stochastic Gradient Descent, where training can be performed using observations as they arrive.

The advantages of real-time streaming collaborative filtering will be discussed as well as the scenarios where batch ALS might be preferable.
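As a rough illustration of the streaming approach described above, here is a hypothetical, pure-Python sketch (not Rui's implementation, which uses Apache Spark) of a stochastic gradient descent update for matrix factorization applied to observations as they arrive. The rank, learning rate, and regularization constant are arbitrary choices for the example.

```python
# Hypothetical sketch: incremental SGD updates to user/item latent factors,
# one (user, item, rating) observation at a time, instead of batch retraining.
import random

def sgd_update(P, Q, user, item, rating, lr=0.05, reg=0.02):
    """One streaming update of user factors P[user] and item factors Q[item]."""
    pred = sum(pu * qi for pu, qi in zip(P[user], Q[item]))
    err = rating - pred
    for k in range(len(P[user])):
        pu, qi = P[user][k], Q[item][k]
        P[user][k] += lr * (err * qi - reg * pu)
        Q[item][k] += lr * (err * pu - reg * qi)
    return err

random.seed(0)
n_users, n_items, rank = 4, 5, 2
P = [[random.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_users)]
Q = [[random.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_items)]

# A simulated stream of rating observations, processed as they "arrive".
stream = [(0, 1, 4.0), (1, 1, 3.0), (0, 2, 5.0)] * 200
for u, i, r in stream:
    sgd_update(P, Q, u, i, r)

# After many updates, the prediction for a seen pair approaches its rating.
pred = sum(pu * qi for pu, qi in zip(P[0], Q[1]))
```

A production system would shard these updates across a cluster and fold in new users and items on the fly; the per-observation update above is only the core idea.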

Apache Spark from notebook to cloud native application

Rebecca Simmonds

Spark+AI Summit • San Francisco, CA • June 2018

Data engineering teams love Apache Spark because it’s powerful and easy to manage, but managing a shared resource for experimental analyses and queries is very different from developing production applications in contemporary cloud environments: the gap between understanding Spark and being able to deploy and manage it in production can be vast.

This session will cover a developer’s journey learning Spark and using it to develop a containerized, cloud native application with analysis and visualization components. More specifically, these topics will be covered:

Exploratory analysis in a Jupyter notebook running against an ephemeral Spark cluster

Using PySpark for loading and analyzing data from external data sources like PostgreSQL

So, whether you’re an application developer or a Spark expert, this session is for you. If you’re a developer wanting to deploy a Spark cluster into production, this session will guide you through techniques to make that transition easier and quicker. If you’re already an expert, this talk should give you some insight into how application developers work and help you coordinate with the development team.

Intelligent applications on OpenShift from prototype to production

Rebecca Simmonds and Michael McCune

Red Hat Summit • San Francisco, CA • May 2018

Today’s users demand tailored, dynamic, and constantly refined experiences. They expect intelligent applications that will learn from data and improve with longevity and popularity. Application intelligence takes many forms, including anomaly and fraud detection, product recommendations, natural-language understanding, even speech and image recognition. All of these capabilities will need to be put into production and managed alongside conventional application components.

In this session, you’ll:

See how developers can integrate intelligent features into their products without involving data scientists or machine learning engineers.

Understand how application intelligence is created and refined by cross-functional teams and how using OpenShift can accelerate this process.

Pythonic Apache Spark app patterns for the cloud

Michael McCune

DevConf.cz • Brno, Czechia • January 2018

In this presentation Michael will demonstrate how to create and deploy Python based Apache Spark applications to cloud native environments. We will explore design patterns to help you integrate your analytics and machine learning algorithms into applications which can take full advantage of cloud native platforms like OpenShift Origin. You will see code samples and live demonstrations of techniques for building and deploying Apache Spark applications written in Python. These samples and techniques will provide a solid basis that you can use to create your own intelligent applications for the cloud.

Probabilistic Structures for Scalable Computing

William Benton

DevConf.cz • Brno, Czechia • January 2018

In this talk you’ll learn about streaming algorithms and approximate data structures to characterize data sources that are too big to keep around or difficult to replay. We’ll start simple, with an algorithm for on-line mean and variance estimates of a stream of samples. Then we’ll look at Bloom filters (for approximate set membership), count-min sketch (for approximate member count in a multiset), and HyperLogLog (for approximate set cardinality). We’ll cover implementing these algorithms, using them for data analysis (and even machine learning), and provide some intuition for why they work at scale. Come with reading knowledge of Python and leave with some cool new options in your scalable data processing toolbox!

Note that the YouTube video for this talk is audio-only; the actual talk was delivered without slides due to projector malfunction.
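The first technique mentioned in the abstract, single-pass mean and variance estimation, is commonly implemented with Welford's method. The sketch below is an illustrative implementation, not code taken from the talk:

```python
# Welford's method: numerically stable, single-pass running mean and variance.
class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        """Fold one sample into the estimates; O(1) time and memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Population variance of the samples seen so far."""
        return self.m2 / self.n if self.n > 0 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.push(x)

# For these samples the mean is 5.0 and the population variance is 4.0,
# with no second pass over the data required.
```

The same fold-and-merge structure is what makes sketches like Bloom filters, count-min sketch, and HyperLogLog practical at scale: each sees every sample once and keeps only a small summary.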

Collaborative Filtering Microservices on Spark

Rui Vieira, Sophie Watson

DevConf.cz • Brno, Czechia • January 2018

The Alternating Least Squares (ALS) algorithm is still deemed the industry standard in collaborative filtering. In this talk we will focus on Apache Spark’s ALS implementation and discuss the steps we took to build a distributed recommendation engine, focusing on continuous model training and model management.

We show that, by splitting the recommendation engine into microservices, we were able to reduce the system’s complexity and produce a robust collaborative filtering platform with support for continuous model training.

At the end of this talk, you should be equipped with enough tools and ideas to implement your own collaborative algorithm and avoid some common pitfalls.

2017

Containerizing TensorFlow Applications on OpenShift

Subin Modeel

December 2017

Deep learning and GPUs have become hot topics in recent times. TensorFlow has become a popular open source project for deep learning applications. But how can we use OpenShift for TensorFlow application development?
In this presentation you will learn how to create custom container images with TensorFlow binaries, use Project Jupyter for TensorFlow model development, and deploy those models on OpenShift. You will also learn how to use continuous integration for TensorFlow applications on OpenShift.
Learn all of this through examples including MNIST handwriting recognition, applying the Inception model, neural style transfer with GPUs, and transfer learning for celebrity detection.

One-Pass Data Science in Apache Spark with Generative T-Digests

Erik Erlandson

Spark Summit EU • Dublin, Ireland • October 2017

The T-Digest has earned a reputation as a highly efficient and versatile sketching data structure; however, its applications as a fast generative model are less appreciated. Several common algorithms from machine learning use randomization of feature columns as a building block. Column randomization is an awkward and expensive operation when performed directly, but when implemented with generative T-Digests, it can be accomplished elegantly in a single pass that also parallelizes across Spark data partitions. In this talk Erik will review the principles of T-Digest sketching, and how T-Digests can be applied as generative models. He will explain how generative T-Digests can be used to implement fast randomization of columnar data, and conclude with demonstrations of T-Digest randomization applied to Variable Importance, Random Forest Clustering and Feature Reduction. Attendees will leave this talk with an understanding of T-Digest sketching, how T-Digests can be used as generative models, and insights into applying generative T-Digests to accelerate their own data science projects.

Fire in the Sky: An Introduction to Monitoring Apache Spark in the Cloud

Michael McCune

Spark Summit EU • Dublin, Ireland • October 2017

Writing intelligent cloud native applications is hard enough when things go well, but what happens when there are performance and debugging issues that arise during production? Inspecting the logs is a good start, but what if the logs don’t show the whole picture? Now you have to go deeper, examining the live performance metrics that are generated by Spark, or even deploying specialized microservices to monitor and act upon that data. Spark provides several built-in sinks for exposing metrics data about the internal state of its executors and drivers, but getting at that information when your cluster is in the cloud can be a time consuming and arduous process. In this presentation, Michael McCune will walk through the options available for gaining access to the metrics data even when a Spark cluster lives in a cloud native containerized environment. Attendees will see demonstrations of techniques that will help them to integrate a full-fledged metrics story into their deployments. Michael will also discuss the pain points and challenges around publishing this data outside of the cloud and explain how to overcome them. In this talk you will learn about: Deploying metrics sinks as microservices, Common configuration options, and Accessing metrics data through a variety of mechanisms.

Building Machine Learning Algorithms on Apache Spark

William Benton

Spark Summit EU • Dublin, Ireland • October 2017

There are many reasons why you might want to implement your own machine learning algorithms on Spark: you might want to experiment with a new idea, try and reproduce results from a recent research paper, or simply to use an existing technique that isn’t implemented in MLlib. In this talk, we’ll walk through the process of developing a new machine learning model for Spark. We’ll start with the basics, by considering how we’d design a parallel implementation of a particular unsupervised learning technique. The bulk of the talk will focus on the details you need to know to turn an algorithm design into an efficient parallel implementation on Spark: we’ll start by reviewing a simple RDD-based implementation, show some improvements, point out some pitfalls to avoid, and iteratively extend our implementation to support contemporary Spark features like ML Pipelines and structured query processing. You’ll leave this talk with everything you need to build a new machine learning technique that runs on Spark.

Analyzing Blockchain transaction graph with Spark

Jirka Kremser

OpenSlava • Bratislava, Slovakia • October 2017

Cryptocurrencies attract various groups of people: investors, people from retail,
tech enthusiasts, crypto-anarchists, and others. We are not going to focus on
anything other than the raw technology behind the blockchain, leaving aside all
the ideology and hype that come with Bitcoin.

In this presentation we will show how graph data can be processed in Spark. Blockchain binary
data is transformed into a large graph of transactions so that we can work with the graph from Spark
using the GraphX and GraphFrames libraries.
The demo shows two notebooks with multiple examples of calculating interesting features
of the transaction graph.

The GraphX-based notebook uses spark-notebook as the notebook technology, while the second
one uses GraphFrames and a Jupyter notebook. The second notebook also connects to an
existing Spark cluster that was created by the Oshinko tools.

From notebooks to cloud native: a modern path for data driven applications

Michael McCune

Strata Data • New York, NY • September 2017

The world of application development and deployment is changing rapidly with the advent of container-based orchestration platforms. Adjusting to these changes takes an open mind and a willingness to explore new techniques and methodologies. Notebook interfaces like Apache Zeppelin and Project Jupyter are excellent starting points for sketching out ideas and exploring data-driven algorithms, but where does the process lead after the notebook work has been completed? Combining the power and flexibility of notebooks with that of containers presents new opportunities to increase your productivity, such as creating processing clusters on demand, increased repeatability, and using continuous delivery techniques.

Michael McCune explains how to use notebook interfaces to create insightful data-driven demonstrations, which can then be ported directly into cloud-native applications, as he walks you through evolving an Apache Spark financial services application from a notebook to a microservice to a packaged container before finally deploying it through continuous delivery to a Kubernetes-backed platform. Along the way, Michael discusses the benefits and challenges that exist when migrating Apache Spark-based applications into containerized orchestration platforms.

The Revolution Will Be Containerized • Architecting the Intelligent Applications of Tomorrow

William Benton

Berlin Buzzwords • Berlin, Germany • June 2017

Linux containers are increasingly popular with application developers: they offer improved elasticity, fault-tolerance, and portability between different public and private clouds, along with an unbeatable development workflow. It’s hard to imagine a technology that has had more impact on application developers in the last decade than containers, with the possible exception of ubiquitous analytics. Indeed, analytics is no longer a separate workload that occasionally generates reports on things that happened yesterday; instead, it pulses beneath the rhythms of contemporary business and supports today’s most interesting and vital applications. Since applications depend on analytic capabilities, it makes good sense to deploy our data-processing frameworks alongside our applications.

In this talk, you’ll learn from our expertise deploying Apache Spark and other data-processing frameworks in Linux containers on Kubernetes. We’ll explain what containers are and why you should care about them. We’ll cover the benefits of containerizing applications, architectures for analytic applications that make sense in containers, and how to handle external data sources. You’ll also get practical advice on how to ensure security and isolation, how to achieve high performance, and how to sidestep and negotiate potential challenges. Throughout the talk, we’ll refer back to concrete lessons we’ve learned about containerized analytic jobs ranging from interactive notebooks to production applications. You’ll leave inspired and enabled to deploy high-performance analytic applications without giving up the security you need or the developer-friendly workflow you want.

Smart Scalable Feature Reduction With Random Forests

Erik Erlandson

Spark Summit • San Francisco, CA • June 2017

Modern datacenters and IoT networks generate a wide variety of telemetry that makes excellent fodder for machine learning algorithms. Combined with feature extraction and expansion techniques such as word2vec or polynomial expansion, these data yield an embarrassment of riches for learning models and the data scientists who train them. However, these extremely rich feature sets come at a cost. High-dimensional feature spaces almost always include many redundant or noisy dimensions. These low-information features waste space and computation, and reduce the quality of learning models by diluting useful features.

In this talk, Erlandson will describe how Random Forest Clustering identifies useful features in data having many low-quality features, and will demonstrate a feature reduction application using Apache Spark to analyze compute infrastructure telemetry data.

Learn the principles of how Random Forest Clustering solves feature reduction problems, and how you can apply Random Forest tools in Apache Spark to improve your model training scalability, the quality of your models, and your understanding of application domains.

Data crunching and web serving have long existed in separate worlds. Access by a web application to analysis required a long process of extract, transform, load (ETL), database work, and imports and exports, as well as network and storage assistance. The rise of containers, orchestration, and more cost-effective computing and networking has resulted in a convergence, creating the possibility of using the same hardware and, more importantly, the same clustering software for both types of workloads. In this session, we’ll discuss a high-level vision of this approach with containers, Kubernetes, web servers, and Apache Spark. We’ll show a demo of how this convergence helps data analysis move from custom R or Python scripts on an analyst’s desktop to an accessible web app, while letting the analyst simultaneously constrain the analysis to prevent statistical overreach.

Sketching Data with T-Digest In Apache Spark

Erik Erlandson

Spark Summit East • Boston, MA • February 2017

Algorithms for sketching probability distributions from large data sets are a fundamental building block of modern data science. Sketching plays a role in diverse applications ranging from visualization, optimizing data encodings, estimating quantiles, data synthesis and imputation. The T-Digest is a versatile sketching data structure. It operates on any numeric data, models tricky distribution tails with high fidelity, and most crucially it works smoothly with aggregators and map-reduce.

T-Digest is a perfect fit for Apache Spark; it is single-pass and intermediate results can be aggregated across partitions in batch jobs or aggregated across windows in streaming jobs. In this talk I will describe a native Scala implementation of the T-Digest sketching algorithm and demonstrate its use in Spark applications for visualization, quantile estimations and data synthesis.

Attendees of this talk will leave with an understanding of data sketching with T-Digest sketches, and insights about how to apply T-Digest to their own data analysis applications.
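To give a feel for the centroid idea at the heart of the T-Digest, here is a deliberately simplified, hypothetical sketch. The real algorithm sizes clusters with a quantile-dependent scale function and is far more careful at the distribution tails; this toy version just caps the centroid count by merging nearest neighbours.

```python
# Toy centroid sketch, loosely inspired by T-Digest: keep a bounded set of
# weighted centroids, merging the closest pair whenever the cap is exceeded.
import random

class TinyDigest:
    def __init__(self, max_centroids=50):
        self.max_centroids = max_centroids
        self.centroids = []  # sorted list of [mean, weight]

    def add(self, x):
        self.centroids.append([x, 1])
        self.centroids.sort(key=lambda c: c[0])
        while len(self.centroids) > self.max_centroids:
            # merge the two adjacent centroids with the smallest gap
            i = min(range(len(self.centroids) - 1),
                    key=lambda j: self.centroids[j + 1][0] - self.centroids[j][0])
            (m1, w1), (m2, w2) = self.centroids[i], self.centroids[i + 1]
            merged = [(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2]
            self.centroids[i:i + 2] = [merged]

    def quantile(self, q):
        """Estimate the q-th quantile from the cumulative centroid weights."""
        total = sum(w for _, w in self.centroids)
        target, seen = q * total, 0.0
        for mean, weight in self.centroids:
            seen += weight
            if seen >= target:
                return mean
        return self.centroids[-1][0]

random.seed(1)
digest = TinyDigest()
for _ in range(5000):
    digest.add(random.random())

# The estimated median of Uniform(0, 1) samples should land close to 0.5,
# even though only 50 centroids are retained.
```

Because each centroid is just a (mean, weight) pair, two such sketches can also be merged by combining their centroid lists, which is what makes the real T-Digest such a natural fit for map-reduce-style aggregation in Spark.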

Developers love Linux containers, which neatly package up an application and its dependencies and are easy to create and share. However, this unbeatable developer experience hides some deployment challenges for real applications: how do you wire together pieces of a multi-container application? Where do you store your persistent data if your containers are ephemeral? Do containers really contain and isolate your application, or are they merely hiding potential security vulnerabilities? Are your containers scheduled across your compute resources efficiently, or are they trampling on one another?

Container application platforms like Kubernetes provide the answers to some of these questions. We’ll draw on expertise in Linux security, distributed scheduling, and the Java Virtual Machine to dig deep on the performance and security implications of running in containers. This talk will provide a deep dive into tuning and orchestrating containerized Spark applications. You’ll leave this talk with an understanding of the relevant issues, best practices for containerizing data-processing workloads, and tips for taking advantage of the latest features and fixes in Linux Containers, the JDK, and Kubernetes. You’ll leave inspired and enabled to deploy high-performance Spark applications without giving up the security you need or the developer-friendly workflow you want.

Teaching Apache Spark Clusters to Manage Their Workers Elastically

Erik Erlandson, Trevor Mckay

Spark Summit East • Boston, MA • February 2017

Devops engineers have applied a great deal of creativity and energy to invent tools that automate infrastructure management, in the service of deploying capable and functional applications. For data-driven applications running on Apache Spark, the details of instantiating and managing the backing Spark cluster can be a distraction from focusing on the application logic. In the spirit of devops, automating Spark cluster management tasks allows engineers to focus their attention on application code that provides value to end-users.

Using OpenShift Origin as a laboratory, we implemented a platform where Apache Spark applications create their own clusters and then dynamically manage their own scale via host-platform APIs. This makes it possible to launch a fully elastic Spark application with little more than the click of a button.

We will present a live demo of turn-key deployment for elastic Apache Spark applications, and share what we’ve learned about developing Spark applications that manage their own resources dynamically with platform APIs.

The audience for this talk will be anyone looking for ways to streamline their Apache Spark cluster management, reduce the workload for Spark application deployment, or create self-scaling elastic applications. Attendees can expect to learn about leveraging APIs in the Kubernetes ecosystem that enable application deployments to manipulate their own scale elastically.

Big Data In Production: Bare Metal to OpenShift

William Benton

DevConf.cz • Brno, Czechia • January 2017

Apache Spark is one of the most exciting open-source data-processing frameworks today. It features a range of useful capabilities and an unusually developer-friendly programming model. However, the ease of getting a simple Spark application running can hide some of the challenges you might face while going from a proof of concept to a real-world application. This talk will distill our experiences as early adopters of Spark in production, present a case study where using Spark effectively provided huge benefits over legacy solutions, explain why we migrated from a dedicated Spark cluster to OpenShift, and provide concrete advice regarding:

how to evaluate predictive models and make sense of the analytic components of insightful applications, and

integrating Spark into microservice applications on OpenShift

This talk assumes some familiarity with Apache Spark but will provide context for attendees who are new to Spark. You’ll learn from a seasoned Red Hat engineer with over three years of experience running Spark in production and contributing to the Spark community.

Insightful Apps with Apache Spark and OpenShift

William Benton, Michael McCune

DevConf.cz • Brno, Czechia • January 2017

Nearly all of today’s most exciting applications are insightful applications: they employ machine learning and large-scale data processing to improve with longevity and popularity. It’s an easy bet that the important applications of tomorrow will be insightful as well. It’s also an easy bet that you’ll want to be deploying tomorrow’s applications on a contemporary container platform with a great developer workflow like OpenShift.

Insightful applications pose some new challenges for developers, but this hands-on workshop will show you how to navigate them confidently. You’ll learn how to develop an insightful application on OpenShift with Apache Spark from the ground up. We’ll cover:

architectures for analytic applications and microservices;

a crash course in Apache Spark, some data science techniques, and OpenShift;

how to deploy Apache Spark as part of an OpenShift application; and

building a data-driven application from the ground up.

This workshop is largely self-contained: the only prerequisite is some familiarity with Python. Learn from the experience of Red Hat emerging technology engineers who are focused on bringing data-driven application development to OpenShift!

Building My Own Little World with Open Data

Steven Pousty

linux.conf.au • Hobart, Australia • January 2017

Everybody cares about a place (they live there, they grew up there, they had a great vacation there, it’s in the news…). With the rise of open data, big data tooling, and new visualisation technology, we can now build applications that give people new ways to explore beyond “where is the closest Starbucks”. I have collected Open Data from my home town (Santa Cruz, CA) and compiled it into the beginnings of a visualization and analysis platform. The goal of this talk is to show the process of collecting open data from disparate sources, some of the caveats of putting them together, general lessons learned, and some fun visualizations. I want to move past thinking about sources for open data and move on to tools and lessons so you can get cracking! I want to show how we can enable people to gather open data and turn it into open knowledge. Data sources will be from Government (e.g. United States Geological Survey) and Non-Government sources (e.g. Audubon Society eBird Data), while some of the tools covered will be Apache Spark, PostGIS, Leaflet, and various others.

Building Cloud Native Apache Spark Applications with OpenShift

Michael McCune

January 2017

Apache Spark based applications are often composed of many separate, interconnected components that are a good match for an orchestrated containerized platform like OpenShift, which is built on Kubernetes. But with the increased flexibility afforded by these technologies comes a new set of challenges for building rich data-centric applications. Mike starts off with how to build Apache Spark application pipelines and then walks through a demo of building one on OpenShift. He also gives some great insights into the road ahead for Apache Spark on OpenShift.

2016

Building Apache Spark Application Pipelines for the Kubernetes Ecosystem

Michael McCune

Apache: Big Data • Seville, Spain • December 2016

Apache Spark based applications are often composed of many separate, interconnected components that are a good match for an orchestrated containerized platform like Kubernetes. But with the increased flexibility afforded by these technologies comes a new set of challenges for building rich data-centric applications.

In this presentation we will discuss techniques for building multi-component Apache Spark based applications that can be easily deployed and managed on a Kubernetes infrastructure. Building on experiences learned while developing and deploying cloud native applications on an OpenShift platform, we will explore common issues that arise during the engineering process and demonstrate workflows for easing the maintenance factors associated with complex installations.

Converging Big Data and Application Infrastructure

Steve Pousty

Big Data Spain • Madrid, Spain • December 2016

For most of my lifetime in the computing world, data crunching and web serving were two very separate worlds. If a web app wanted access to the analysis, there was a long process of ETL, database work, imports and exports, and bribing various network and storage people for the resources you needed. With the rise of containers, orchestration, cheap computing and networking, and over 10 years of people tackling large problems at new scales, we have finally come to a convergence. It is now possible to use the same hardware and, more importantly, the same clustering software for both types of workloads. I am going to lay out how this can look with containers, Kubernetes, web servers, and Apache Spark. This can be considered the germ of what we can look to build in the future. I will demo this in action and show that it is now achievable for mere mortals such as myself. Finally, I will close with some thought experiments on what this can enable for the future. I know this is a keynote, but I am hoping we can make it interactive with discussion and experience sharing!

Running Apache Spark Natively on Kubernetes with OpenShift

Erik Erlandson

November 2016

Apache Spark can be made natively aware of Kubernetes by implementing a Spark scheduler back-end that runs Spark application Drivers and Executors in Kubernetes pods. In this talk, Erik will explain the design of a native Kubernetes scheduler back-end in Spark and demonstrate a Spark application submission with OpenShift.
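For context, native submission with a Kubernetes scheduler back-end looks roughly like the sketch below, which uses the flags later merged upstream into Spark; the API server URL, namespace, and container image name are illustrative placeholders, not values from the talk:

```shell
# Illustrative sketch: submit a Spark application against a Kubernetes master.
# The Driver and Executors each run in their own pods.
spark-submit \
  --master k8s://https://openshift-api.example.com:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.namespace=my-project \
  --conf spark.kubernetes.container.image=example/spark:latest \
  local:///opt/spark/examples/jars/spark-examples.jar
```

The `k8s://` master URL tells `spark-submit` to talk to the Kubernetes API server rather than a standalone or YARN cluster manager, which is the core of the scheduler back-end design Erik describes.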

Containerized Spark on Kubernetes

William Benton

Spark Summit EU • Brussels, Belgium • October 2016

Consider two recent trends in application development: more and more applications are taking advantage of architectures involving containerized microservices in order to enable improved elasticity, fault-tolerance, and scalability — whether in the public cloud or on-premise. In addition, analytic capabilities and scalable data processing have increasingly become a basic requirement for contemporary applications. The confluence of these trends suggests that there are a lot of good reasons to want to manage Spark with a container orchestration platform, but it’s not quite as simple as packaging up a standalone cluster in containers. This talk will present our team’s experiences migrating a production Spark cluster from a multi-tenant Mesos cluster to a shared compute resource managed by Kubernetes. We’ll explain the motivation behind microservices and containers and identify the architectures that make sense for containerized applications that depend on Spark. We’ll pay special attention to practical concerns of running Spark in containers, including networking, access control, persistent storage, and multitenancy. You’ll leave this talk with a better understanding of why you might want to run Spark in containers and some concrete ideas for how to get started doing it.

Big Data and Apache Spark on OpenShift Pt. II

The first meeting of the Big Data Special Interest Group expanded on a previous Commons session entitled Big Data and Apache Spark on OpenShift (Part 1), which kicked off the Big Data SIG.

In the previous session, Red Hat’s Will Benton gave us a vocabulary for talking about data-driven applications and outlined some example architectures for building data-driven applications with microservices. In this SIG session, he gave an introduction to using Apache Spark on OpenShift and walked through an example data-driven application.

Big Data and Apache Spark on OpenShift Pt. I

William Benton

July 2016

In this introductory Big Data briefing session, Red Hat’s Will Benton gave an overview of Big Data architecture and concepts to help level the playing field. This video gives a better understanding of what a data-intensive application should actually look like on a modern container orchestration platform, and helped kick off the OpenShift Commons Big Data SIG.

In this recording, you’ll learn about the anatomy of data-intensive applications, how they come to life, and what they have to accomplish. We walked through a few applications and explored their responsibilities, saw how they use data, discussed the trade-offs they must negotiate, and pointed to some example architectures that make sense for realizing data-intensive applications on OpenShift.

Analyzing Log Data With Apache Spark

William Benton

Spark Summit • San Francisco, CA • June 2016

Contemporary applications and infrastructure software leave behind a tremendous volume of metric and log data. This aggregated “digital exhaust” is inscrutable to humans and difficult for computers to analyze, since it is vast, complex, and not explicitly structured. This session will introduce the log processing domain and provide practical advice for analyzing log data with Apache Spark, including:

best practices for tuning Spark, training models against structured data, and ingesting data from external sources like Elasticsearch; and

a few relatively painless ways to visualize your results.

You’ll have a better understanding of the unique challenges posed by infrastructure log data after this session. You’ll also learn the most important lessons from our efforts both to develop analytic capabilities for an open-source log aggregation service and to evaluate these at enterprise scale.

2015

Diagnosing Open-Source Community Health with Spark

William Benton

Spark Summit • San Francisco, CA • June 2015

Successful companies use analytic measures to identify and reward their best projects and contributors. Successful open source developers often make similar decisions when they evaluate whether or not to reward a project or community by investing their time. This talk will show how Spark enables a data-driven understanding of the dynamics of open source communities, using operational data from the Fedora Project as an example. With thousands of contributors and millions of users, Fedora is one of the world’s largest open-source communities. Notably, Fedora also has completely open infrastructure: every event related to the project’s daily operation is logged to a public messaging bus, and historical event data are available in bulk. We’ll demonstrate best practices for using Spark SQL to ingest bulk data with rich, nested structure, using ML pipelines to make sense of software community data, and keeping insights current by processing streaming updates.

2014

Analyzing endurance-sports activity data with Spark

William Benton

Spark Summit • San Francisco, CA • July 2014

Spark’s support for efficient execution and rapid interactive prototyping enables novel approaches to understanding data-rich domains that have historically been underserved by analytical techniques. One such field is endurance sports, where athletes are faced with GPS and elevation traces as well as samples from heart rate, cadence, temperature, and wattage sensors. These data streams can be somewhat comprehensible at any given moment, when looking at a small window of samples on one’s watch or cycle computer, but are overwhelming in the aggregate.

In this talk, I’ll present my recent efforts using Spark and MLlib to mine my personal cycling training data for deeper insights and to help me design workouts that meet particular fitness goals. This work incorporates analysis of geographic and time-series data, computational geometry, visualization, and domain knowledge of exercise physiology. I’ll show how Spark made this work possible, demonstrate some novel techniques for analyzing fitness data, and discuss how these approaches could be applied to make sense of data from an entire community of cyclists.