Traffic accidents in the UK, 1979-2004.

Whether you are a journalist, a researcher or a data geek, working with large data sets usually starts with laborious chores: setting up infrastructure, configuring an environment, learning unfamiliar tools and coding complicated apps. With DC/OS you can start crunching those numbers within minutes.

Let’s start with a concrete problem: analyzing road safety data from Great Britain, 1979-2004. While the data set might seem small, some of the analysis might require distributed processing, so we want an environment that allows our processing jobs to scale horizontally. To achieve this, we’ll run DC/OS on top of a cluster of virtual machines. We’ll be using AWS EC2 in this scenario, but the same solution can be ported to other public and private clouds.

DC/OS sets up a cluster and deploys the pre-configured services needed to complete the task at hand. You don’t have to fully understand the complexity of the infrastructure or how to set it up; DC/OS creates the necessary abstractions for you. Once setup is complete, you will have a running cluster with an interactive research notebook (a container running a Jupyter Python notebook with Apache Spark) and a distributed file system (HDFS), ready to tackle any large-scale data processing task.
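To give a feel for the kind of job that scales horizontally, here is a minimal sketch in plain Python of counting accidents per year by splitting the rows into partitions, counting each partition independently and merging the partial results. This is the same shape of computation a Spark job in the notebook would spread across executors; it is not the notebook's actual code. The sample rows and the field names `year` and `severity` are made up for illustration and are not taken from the real data set.

```python
from collections import Counter
from functools import reduce

# Made-up sample rows for illustration only; not taken from the real data set.
# The field names "year" and "severity" are assumptions.
SAMPLE_ROWS = [
    {"year": 1979, "severity": "slight"},
    {"year": 1979, "severity": "serious"},
    {"year": 1980, "severity": "slight"},
    {"year": 1980, "severity": "fatal"},
    {"year": 1980, "severity": "slight"},
]


def split_into_partitions(rows, n):
    """Deal rows round-robin into n partitions, the way a cluster spreads data over workers."""
    return [rows[i::n] for i in range(n)]


def count_by_year(partition):
    """Map step: count accidents per year within a single partition."""
    return Counter(row["year"] for row in partition)


def merge_counts(left, right):
    """Reduce step: partial counts combine by simple addition."""
    return left + right


def accidents_per_year(rows, n_partitions=2):
    """Count accidents per year by combining independent per-partition counts."""
    partitions = split_into_partitions(rows, n_partitions)
    # Each partition is counted independently, so this step could run in parallel.
    partial_counts = [count_by_year(p) for p in partitions]
    return dict(reduce(merge_counts, partial_counts, Counter()))


if __name__ == "__main__":
    print(accidents_per_year(SAMPLE_ROWS))
```

Because the partial counts merge by addition, the result does not depend on how many partitions the rows are split into, which is what lets this kind of job scale out by adding workers.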

Run the notebook. First you will notice new tasks in Mesos; these are the Spark executors:

Your Jupyter notebook will look like this:

As you’ve seen in this post, you can start containerized services in minutes. DC/OS gives you a complete environment and lets you focus on your problem rather than on routine deployment or service configuration adjustments.