Spark

Spark is one of the most popular open-source distributed computation engines and offers a scalable, flexible framework for processing huge amounts of data efficiently. The recent 2.0 release milestone brought a number of significant improvements, including Datasets (an improved version of DataFrames), more support for SparkR, and a lot more. One of the great things about Spark is that it’s relatively autonomous and doesn’t require a lot of extra infrastructure to work. While Spark’s latest release is 2.1.0 at the time of publishing, we’ll use 2.0.1 as the example throughout this post.

Jupyter

Jupyter notebooks are an interactive way to code that enables rapid prototyping and exploration. Jupyter essentially connects a browser-based frontend, the Jupyter Server, to an interactive REPL underneath that can process snippets of code. The advantage to the user is being able to write code in small chunks that can be run independently but share the same namespace, which greatly facilitates testing or trying multiple approaches in a modular fashion. The platform supports a number of kernels (the things that actually run the code) besides the out-of-the-box Python, but connecting Jupyter to Spark is a little trickier. Enter Apache Toree, a project meant to solve this problem by acting as a middleman between a running Spark cluster and other applications.
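To make the "independent chunks, shared namespace" idea concrete, here is a toy sketch in plain Python. It is only an illustration of the concept, not how a Jupyter kernel is actually implemented; the names run_cell and shared_namespace are made up for this example.

```python
# Toy model of notebook cells: each "cell" is a snippet of source code
# executed against one shared namespace (a plain dict).
# Hypothetical names; this is not Jupyter's real kernel machinery.
shared_namespace = {}


def run_cell(source, namespace=shared_namespace):
    """Execute one snippet of code in the shared namespace."""
    exec(source, namespace)


# Each call stands in for running one cell. Later cells can see
# names defined by earlier ones, and any cell can be re-run alone.
run_cell("values = [1, 2, 3]")
run_cell("total = sum(values)")
run_cell("total = total * 2")

print(shared_namespace["total"])
```

Because every snippet reads from and writes to the same dictionary, a single step can be re-run or replaced without starting the whole program over.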

In this post I’ll describe how we go from a clean Ubuntu installation to being able to run Spark 2.0 code on Jupyter.

We all know how important keeping track of your open-source licensing is for the average startup. While most people think of open-source licenses as all being the same, there are meaningful differences that could have potentially serious legal implications for your code base. From permissive licenses like MIT or BSD to so-called “reciprocal” or “copyleft” licenses, keeping track of the alphabet soup of dependencies in your source code can be a pain.

Christian Moscardi is Director of Technology at The Data Incubator. This post originally appeared on his blog.

Anyone who’s ever tried to write a nontrivial application on Google App Engine has encountered at least seven* design decisions that have led to serious head-scratching moments. One of those happened to me about a month ago, while integrating Chef into our course at The Data Incubator. Our goal was to allow for one-click spinning up (on DigitalOcean’s cloud) and monitoring of our Fellows’ course machines, already under Chef management.

* No basis in fact – there are probably more than seven. It should be noted that the Google Cloud Platform is going to greatly improve this situation by allowing you to deploy Docker containers – woohoo!

A First Look

Chef servers have an HTTP API. Seems like it’d be an easy integration, right? While GAE doesn’t let you do many things (including making SMTP connections), one thing you, thankfully, can do with relative ease is make HTTP requests (although everyone’s favorite Python HTTP library, requests, is a total nightmare – but that’s for another blog post). This was going to be a quick job – we’d spend a couple days coding, write some tests, and have one-click deployment, right? Right? As you probably guessed, that timeline was anything but right.

Here at The Data Incubator, our Fellows deploy their own fully functional, public-facing web app to showcase their data science skills to employers. This not only gives them valuable experience dynamically fetching and displaying data, but also encourages them to think about end user interaction. To demo the process, we decided to marry together some of our favorite technologies:

The goal is to create some distant ancestor of Google Finance: a form capable of accepting a stock ticker as input and producing a plot of the daily close price. Here’s the finished product. So how do we get there?
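As a rough sketch of the data side of such an app, the snippet below cleans up a ticker typed into a form and turns price data into a date-ordered series of closing prices ready for plotting. It uses only the Python standard library. The function names and the CSV layout (columns named date and close, ISO-formatted dates) are assumptions made for illustration, not the format of any particular data provider or the code behind the finished product.

```python
import csv
import io


def normalize_ticker(raw):
    """Clean up form input, e.g. ' aapl ' becomes 'AAPL'."""
    ticker = raw.strip().upper()
    # Allow letters, digits and dots (as in 'BRK.B'); reject anything else.
    if not ticker or not ticker.replace(".", "").isalnum():
        raise ValueError("not a valid ticker: %r" % raw)
    return ticker


def daily_closes(csv_text):
    """Return (date, close) pairs sorted by date.

    Assumes a header row with 'date' and 'close' columns and ISO dates
    (YYYY-MM-DD), which sort correctly as plain strings.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    points = [(row["date"], float(row["close"])) for row in reader]
    return sorted(points)


# Made-up sample data, deliberately out of order.
SAMPLE_CSV = "date,close\n2015-01-05,10.5\n2015-01-02,10.0\n"

ticker = normalize_ticker(" aapl ")
series = daily_closes(SAMPLE_CSV)
print(ticker, series)
```

The resulting list of pairs can be handed to whatever plotting library the app uses, with the dates on the x-axis and the closing prices on the y-axis.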