February 28, 2015

Introduction

Confession: I’ve never actually used Numba. Well, that’s not
quite true; I’ve indirectly used Numba thousands of times because Blaze
auto-generates numba ufuncs.
Still I’ve never used it for a particular problem. I usually define problems
with large array operations and compile those down. Numba takes a different
approach and translates Python for loops to efficient LLVM code.
This is all lower in the hardware stack than where I usually think.

But when I was looking for applications to motivate recent work in
nearest-neighbor communications in
dask, a friend pointed me
towards the Ising model, a simple
physical system that is both easy to code up and has nice macro-scale
properties. I took this as an example to play with Numba. This post details
my experience.

Ising Model

Disclaimer: I am not a physicist

The Ising model represents a regular grid of points where each point has two
possible states, spin up and spin down. States like to have the same spin as
their immediate neighbors so when a spin-down state is surrounded by more
spin-up states it will switch to spin-up and vice versa. Also, due to random
fluctuations, points might switch spins, even if this switch is not favorable.
In pseudocode an Ising update step might look like the following

for every point in the grid:
    energy = my spin * sum of all of the spins (+1 or -1) of neighboring points
    if energy is improved by switching:
        switch
    else if we're particularly unlucky:
        switch anyway

For this kind of algorithm, imperative for-loopy code is probably the
clearest. You can do this with high-level array operations too (e.g.
NumPy/Blaze/Theano), but it’s a mess.

Numba code

Here is my solution to the problem with Numba

import numba
import numpy as np
from math import exp, log, e, sqrt

kT = 2 / log(1 + sqrt(2), e)

@numba.jit(nopython=True)
def _update(x, i, j):
    n, m = x.shape
    dE = 2 * x[i, j] * (x[(i-1) % n, (j-1) % m]
                      + x[(i-1) % n, j]
                      + x[(i-1) % n, (j+1) % m]
                      + x[i, (j-1) % m]
                      + x[i, (j+1) % m]
                      + x[(i+1) % n, (j-1) % m]
                      + x[(i+1) % n, j]
                      + x[(i+1) % n, (j+1) % m])
    if dE <= 0 or exp(-dE / kT) > np.random.random():
        x[i, j] *= -1

@numba.jit(nopython=True)
def update(x):
    n, m = x.shape
    for i in range(n):
        for j in range(0, m, 2):  # Even columns first to avoid overlap
            _update(x, i, j)
    for i in range(n):
        for j in range(1, m, 2):  # Odd columns second to avoid overlap
            _update(x, i, j)

if __name__ == '__main__':
    x = np.random.randint(2, size=(1000, 1000)).astype('i1')
    x[x == 0] = -1
    for i in range(100):
        update(x)

My old beater laptop executes one update step on a 1000x1000 grid in 50ms.
Without Numba this takes 12s. This wasn’t a canned demo by an expert user /
numba developer; this was just my out-of-the-box experience.

Thoughts

I really like the following:

I can remove @numba.jit and use the Python debugger

I can assert that I’m only using LLVM with nopython=True

I can manage data with NumPy (or dask.array) separately from managing
computation with Numba

I ran into some issues and learned some things too:

random is only present in the developer preview builds of Numba
(conda install -c numba numba). It will be officially released in the
0.18 version this March.

You don’t have to provide type signature strings. I tried providing these
at first, but I didn’t know the syntax and so repeatedly failed to write down
the type signature correctly. It turns out that the cost of not writing one
down is that Numba will JIT-compile whenever it sees a new signature. For my
application this is essentially free.

February 26, 2015

Registration is now open for the workshop "HBP Hippocamp CA1: Collaborative and Integrative Modeling of Hippocampal Area CA1", to be held March 31st - April 1st, 2015 at UCL School of Pharmacy in London, supported by the Human Brain Project (www.humanbrainproject.eu).

In short, the aims of the workshop are two-fold. First, to engage the larger community of experimentalists and modelers working on hippocampus, and highlight existing modeling efforts and strategic datasets for modeling hippocampal area CA1. Second, to define and bootstrap an inclusive community-driven model and data-integration process to achieve open pre-competitive reference models of area CA1 (and, ultimately, the rest of the hippocampus), which are well documented, validated, and released at regular intervals (supported in part by IT infrastructure funded by HBP). Involvement from the community interested in characterization and modeling of the hippocampus is highly encouraged. To keep the meeting focused on the task, participation will be limited to ~30 people, so registration is required.

February 24, 2015

Donald Knuth famously quipped that "premature optimization is the root of all evil." The reasons are straightforward: optimized code tends to be much more difficult to read and debug than simpler implementations of the same algorithm, and optimizing too early leads to greater costs down the road. In the Python world, there is another cost to optimization: optimized code often is written in a compiled language like Fortran or C, and this leads to barriers to its development, use, and deployment.

Too often, tutorials about optimizing Python use trivial or toy examples which may not map well to the real world. I've certainly been guilty of this myself. Here, I'm going to take a different route: in this post I will outline the process of understanding, implementing, and optimizing a non-trivial algorithm in Python, in this case the Non-uniform Fast Fourier Transform (NUFFT). Along the way, we'll dig into the process of optimizing Python code, and see how a relatively straightforward pure Python implementation, with a little help from Numba, can be made to nearly match the performance of a highly-optimized Fortran implementation of the same algorithm.

Why a Python Implementation?

First, I want to answer the inevitable question: why spend the time to make a Python implementation of an algorithm that's already out there in Fortran? The reason is that I've found in my research and teaching that pure-Python implementations of algorithms are far more valuable than C or Fortran implementations, even if they might be a bit slower. This is for a number of reasons:

Pure-Python code is easier to read, understand, and contribute to. Good Python implementations are much higher-level than C or Fortran, and abstract-away loop indices, bit twiddling, workspace arrays, and other sources of code clutter. A typical student reading good Python code can immediately understand and modify the algorithm, while the same student would be lost trying to understand typical optimized Fortran code.

Pure-python packages are much easier to install than Python-wrapped C or Fortran code. This is especially true on non-Linux systems. Fortran in particular can require some installation prerequisites that are non-trivial for many users. In practice, I've seen people give up on better tools when there is an installation barrier for those tools.

Pure-python code often works for many data types. Because of the way it is written, pure Python code is often automatically applicable to single or double precision, and perhaps even to extensions to complex numbers. For compiled packages, supporting and compiling for all possible types can be a burden.

Pure-python is easier to use at scale. Because it does not require complicated installation, pure Python packages can be much easier to install on cloud VMs and/or shared clusters for computation at scale. If you can easily pip-install a pure-Python package on a VM, then services like AWS and TravisCI are much easier to set up.

Certainly code speed will overcome these considerations if the performance gap is great enough, but I've found that for many applications a pure Python package, cleverly designed and optimized, can be made fast enough that these larger considerations win-out. The challenge is making the Python fast. We'll explore this below.

Background: The Non-Uniform Fast Fourier Transform

The Fast Fourier Transform (FFT) is perhaps the most important and fundamental of modern numerical algorithms. It provides a fast, \(O[N\log N]\) method of computing the discrete Fourier transform: \[
Y_k^\pm = \sum_{n=0}^{N-1} y_n e^{\pm i k n / N}
\] You can read more about the FFT in my previous post on the subject.

One important limitation of the FFT is that it requires that input data be evenly-spaced: that is, we can think of the values \(y_n\) as samples of a function \(y_n = y(x_n)\) where \(x_n = x_0 + n\Delta x\) is a regular grid of points. But what about when your grid is not uniform? That is, what if you want to compute this result: \[
Y_k^\pm = \sum_{j=1}^N y(x_j) e^{\pm i k x_j}
\] where \(y(x)\) is evaluated at an arbitrary set of points \(x_j\)? In this case, the FFT is no longer directly applicable, and you're stuck using a much slower \(O[N^2]\) direct summation.

Stuck, that is, until the NUFFT came along.

The NUFFT is an algorithm which converts the non-uniform transform into an approximate uniform transform, not with error-prone interpolation, but instead using a clever "gridding" operation motivated by the convolution theorem. If you'd like to read about the algorithm in detail, the Courant Institute's NUFFT page has a nice set of resources.

Below we'll take a look at implementing this algorithm in Python.

Direct Non-Uniform Fourier Transform

When developing optimized code, it is important to start with something easy to make sure you're on the right track. Here we'll start with a straightforward direct version of the non-uniform Fourier transform. We'll allow non-uniform inputs \(x_j\), but compute the output on a grid of \(M\) evenly-spaced frequencies in the range \(-M/2 \le f/\delta f < M/2\). This is what the NUFFT group calls the Type-1 NUFFT.

First we'll implement nufftfreqs(), which returns the frequency grid for a given \(M\), and nudft() which computes the non-uniform discrete Fourier transform using a slow direct method. The arguments for the latter include iflag, which is a positive or negative number indicating the desired sign of the exponent:

Again, I can't emphasize this enough: when writing fast code, start with a slow-and-simple version of the code which you know gives the correct result, and then optimize from there.

Comparing to the Fortran NUFFT

We can double-check that this is producing the desired result by comparing to the Fortran NUFFT implementation, using Python wrappers written by Dan Foreman-Mackey, available at http://github.com/dfm/python-nufft/:

The results match! A quick check shows that, as we might expect, the Fortran algorithm is orders of magnitude faster:

In [3]:

%timeit nudft(x, y, 1000)
%timeit nufft_fortran(x, y, 1000)

10 loops, best of 3: 47.9 ms per loop
1000 loops, best of 3: 265 µs per loop

On top of this, for \(N\) points and \(N\) frequencies, the Fortran NUFFT will scale as \(O[N\log N]\), while our simple implementation will scale as \(O[N^2]\), making the difference even greater as \(N\) increases! Let's see if we can do better.

NUFFT with Python

Here we'll attempt a pure-Python version of the fast, FFT-based NUFFT. We'll follow the basics of the algorithm presented on the NUFFT page, using NumPy broadcasting tricks to push loops into the compiled layer of NumPy. For later convenience, we'll start by defining a utility to compute the grid parameters as detailed in the NUFFT paper.

The good news is that our Python implementation works; the bad news is that it remains several orders of magnitude slower than the Fortran result!

Let's make it faster.

Making Code Faster: Line Profiling

We know that our Python function is slow, but we'd like to determine where this speed bottleneck lies. One convenient way to do this is with the line_profiler utility, a Python/IPython addon which can be installed using $ pip install line_profiler Once it's installed, we can load the line profiler extension into the IPython notebook using the %load_ext magic function:

In [7]:

%load_ext line_profiler

With the line profiler loaded, the %lprun magic function is now available, which we can use to profile our function line-by-line. In order to display these results here, we'll save them to file and then use %cat to view the file. The lines wrap in the notebook, making it difficult to read, so you may wish to view the raw file instead.

The output shows us where, line-by-line, the algorithm is spending the most time. We see that nearly 99% of the execution time is being spent in the single for loop at the center of our code. The loop is so expensive that even the FFT computation is just a trivial piece of the cost! This is actually pretty typical: due to dynamic typing, loops are generally very slow in Python.

One of the surest strategies for speeding-up your code is to use broadcasting tricks in NumPy to remove these kinds of large loops: you can read one of my course lectures on the subject here. We'll do this next.

NUFFT with NumPy Broadcasting

Let's rewrite the above implementation and use broadcasting tricks to elliminate the loops. Because of the structure of this problem, the approach is a bit complicated here, but it turns out that we can take advantage here of the little-known at() method of NumPy's ufunc (available since NumPy 1.8). Briefly,

>>> np.add.at(x, i, y)

is similar to

>>> x[i] += y

but works as desired even if the indices i have duplicate entries.
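A tiny example makes the difference concrete:

```python
import numpy as np

i = np.array([0, 0, 1, 3])   # note the duplicated index 0
y = np.array([1., 2., 3., 4.])

x1 = np.zeros(4)
x1[i] += y                   # buffered: only the last update at index 0
                             # survives, giving [2., 3., 0., 4.]

x2 = np.zeros(4)
np.add.at(x2, i, y)          # unbuffered: both additions at index 0
                             # accumulate, giving [3., 3., 0., 4.]
```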

Using this, we can adjust our implementation as follows:

In [9]:

def nufft_numpy(x, y, M, df=1.0, iflag=1, eps=1E-15):
    """Fast Non-Uniform Fourier Transform"""
    Msp, Mr, tau = _compute_grid_params(M, eps)
    N = len(x)

    # Construct the convolved grid ftau:
    # this replaces the loop used above
    ftau = np.zeros(Mr, dtype=y.dtype)
    hx = 2 * np.pi / Mr
    xmod = (x * df) % (2 * np.pi)
    m = 1 + (xmod // hx).astype(int)
    mm = np.arange(-Msp, Msp)
    mpmm = m + mm[:, np.newaxis]
    spread = y * np.exp(-0.25 * (xmod - hx * mpmm) ** 2 / tau)
    np.add.at(ftau, mpmm % Mr, spread)

    # Compute the FFT on the convolved grid
    if iflag < 0:
        Ftau = (1 / Mr) * np.fft.fft(ftau)
    else:
        Ftau = np.fft.ifft(ftau)
    Ftau = np.concatenate([Ftau[-(M // 2):], Ftau[:M // 2 + M % 2]])

    # Deconvolve the grid using convolution theorem
    k = nufftfreqs(M)
    return (1 / N) * np.sqrt(np.pi / tau) * np.exp(tau * k ** 2) * Ftau

It worked! We gained around a factor of 4 speedup in replacing the Python loop with the np.add.at() call. Still, though, we're sitting at about a factor of 10 slower than the Fortran version. The problem is that the np.add.at() call here requires construction of some very large and costly temporary arrays. If we want a faster execution time, we need to further optimize that main loop, and we can't do this with NumPy alone.

Optimization with Numba

When NumPy broadcasting tricks aren't enough, there are a few options: you can write Fortran or C code directly, you can use Cython, Weave, or other tools as a bridge to include compiled snippets in your script, or you can use a tool like Numba to speed-up your loops without ever leaving Python.

Numba is a slick tool which runs Python functions through an LLVM just-in-time (JIT) compiler, leading to orders-of-magnitude faster code for certain operations. In this case, we need to optimize what amounts to a nested for-loop, so Numba fits the bill perfectly. For clarity, we'll pull-out the grid construction code that we want to optimize, and write it as follows:

In [11]:

import numba

# nopython=True means an error will be raised
# if fast compilation is not possible.
@numba.jit(nopython=True)
def build_grid(x, c, tau, Msp, ftau):
    Mr = ftau.shape[0]
    hx = 2 * np.pi / Mr
    for i in range(x.shape[0]):
        xi = x[i] % (2 * np.pi)
        m = 1 + int(xi // hx)
        for mm in range(-Msp, Msp):
            ftau[(m + mm) % Mr] += c[i] * np.exp(-0.25 * (xi - hx * (m + mm)) ** 2 / tau)
    return ftau

def nufft_numba(x, c, M, df=1.0, eps=1E-15, iflag=1):
    """Fast Non-Uniform Fourier Transform with Numba"""
    Msp, Mr, tau = _compute_grid_params(M, eps)
    N = len(x)

    # Construct the convolved grid
    ftau = build_grid(x * df, c, tau, Msp, np.zeros(Mr, dtype=c.dtype))

    # Compute the FFT on the convolved grid
    if iflag < 0:
        Ftau = (1 / Mr) * np.fft.fft(ftau)
    else:
        Ftau = np.fft.ifft(ftau)
    Ftau = np.concatenate([Ftau[-(M // 2):], Ftau[:M // 2 + M % 2]])

    # Deconvolve the grid using convolution theorem
    k = nufftfreqs(M)
    return (1 / N) * np.sqrt(np.pi / tau) * np.exp(tau * k ** 2) * Ftau

Much better! We're now within about a factor of 3 of the Fortran speed, and we're still writing pure Python!

Having plucked all the low-hanging fruit, any further optimization will now be very low-level: that is, thinking about things like reduction of the number of exp() evaluations through application of mathematical identities. This type of careful logic is one reason the Fortran implementation is so fast, and many of these low-level strategies are discussed in the NUFFT paper linked above.

To gain some more speed, we can follow their advice and optimize the expressions at this level by precomputing expensive expressions and recombining these expressions later: This obfuscates the logic of the algorithm a bit, but it does lead to some faster execution. Here is an example of this:
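One way to realize this idea is to expand the squared exponent so the Gaussian factors split into three pieces, only one of which depends on the inner-loop index (a sketch; build_grid_fast and precompute_E3 are illustrative names, and the @numba.jit decorator is omitted here but would be applied exactly as for build_grid above):

```python
import numpy as np

def precompute_E3(Msp, Mr, tau):
    """E3[m] = exp(-(hx*m)^2 / (4*tau)): depends only on the grid,
    so it is computed once up front."""
    hx = 2 * np.pi / Mr
    return np.exp(-0.25 * (hx * np.arange(Msp + 1)) ** 2 / tau)

def build_grid_fast(x, c, tau, Msp, ftau, E3):
    """Gaussian gridding using the identity
       exp(-(d - hx*mm)^2 / (4*tau)) = E1 * E2**mm * E3[|mm|]
    so only two exp() calls are needed per nonuniform point."""
    Mr = ftau.shape[0]
    hx = 2 * np.pi / Mr
    for i in range(x.shape[0]):
        xi = x[i] % (2 * np.pi)
        m = 1 + int(xi // hx)
        d = xi - hx * m                     # offset from grid point m
        E1 = np.exp(-0.25 * d ** 2 / tau)   # one exp() per point...
        E2 = np.exp(0.5 * d * hx / tau)     # ...plus one for the cross term
        for mm in range(-Msp, Msp):
            ftau[(m + mm) % Mr] += c[i] * E1 * E2 ** mm * E3[abs(mm)]
    return ftau
```

A further refinement replaces the E2 ** mm power with a running product carried through the inner loop, which removes the last transcendental-ish operation from the hot path.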

This is looking good! With a bit of effort we are now within about 25% of the Fortran speed, and we retain all the advantages of having pure Python code!

Final Timing Comparison

For good measure, let's take a look at the scaling with \(M\) for all the fast algorithms we created. We'll compute the times for a range of input sizes for each algorithm. Be aware that the following code will take several minutes to run!

As we see, all the algorithms scale as \(\sim O[N\log N]\) in the large \(N\) limit, albeit with very different constants of proportionality. Our final optimized Numba implementation nearly matches the Fortran version as \(N\) grows large, and because it is written in pure Python, it retains all the advantages of pure Python code listed above. For that benefit, I think the cost of a ~25% slow-down is well worth it!

Conclusion

I hope you've enjoyed this exploration of how to write fast numerical code in pure Python. As you think about writing efficient implementations of useful algorithms, I invite you to consider the points I listed above: in particular, how difficult will it be for your users to install, read, modify, and contribute to your code? In the long run, this may be much more important than shaving a few milliseconds off the execution time. Writing a fast implementation of a useful algorithm is an excellent and useful pursuit, but we should be careful to not forget the costs that come along with such optimization.

If you're interested in using the pure-Python NUFFT implementation, I've adapted much of the above code in a repository at http://github.com/jakevdp/nufftpy/. It contains a packaged and unit-tested version of some of the above code, as well as a (hopefully) growing compendium of related routines.

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

February 23, 2015

Short version

Thank you to everyone who has already submitted, retweeted, and spread the word about our book, Elegant SciPy! We are still looking for code submissions meeting these criteria:
– Submissions must use NumPy, SciPy, or a closely related library in a non-trivial way.
– Submissions must be (re)licensed as BSD, MIT, public domain, or something similarly liberal. (This is easy if you are the author.)
– Code should be satisfying in some way, such as speed, conciseness, broad applicability…
– Preferably, nominate someone else’s code that impressed you.
– Include a scientific application on real data.

Long version

A big thank you to everyone who has submitted code, retweeted, posted on mailing lists, and otherwise spread the word about the book! We’re still pretty far from a book-length collection, though. I was also a bit vague about the kinds of submissions we wanted, so I’ll elaborate a bit on each of the above points:

NumPy and SciPy use

Some excellent submissions did not use the SciPy library, but rather did amazing things with the Python standard library. I should have mentioned that the book will be specifically focused on the NumPy and SciPy libraries. That’s just the scope of the book that O’Reilly has contracted us to write. Therefore, although we might try to fit examples of great Python uses into a chapter, they are not suitable to be the centerpieces.

We will make some exceptions, for example for very closely related libraries such as pandas and scikit-learn. But, generally, the scope is SciPy the library.

Licensing

This one’s pretty obvious. We can’t use your submission if it’s under a restrictive license. And we don’t want to publish GPL-licensed code, which could force our readers to GPL-license their own code when using it. For more on this, see Choose A License, as well as Jake Vanderplas’s excellent blog post encouraging the use of the BSD license for scientific code.

Note that the author of a piece of code is free to relicense as they choose, so if you think some GPL code is perfect and might be amenable to relicensing, do let us know about it!

Submitting someone else’s code

I suspect that a lot of people are shy about submitting their own code. Two things should alleviate this. First, you can now submit via email, so you don’t have to be public about the self-promotion. (Not that there’s anything wrong with that, but I know I sometimes struggle with it.) And second, I want to explicitly state that we prefer it if you submit others’ code. This is not to discourage self-promotion, but to drive code quality even higher. It’s a high bar to convince ourselves that our code is worthy of being called elegant, but it takes another level entirely to find someone else’s code elegant! (The usual reaction when reading other people’s code is more like, “and what the #$&^ is going on here????“) So, try to think about times you saw great code during a code review, reading a blog post, or while grokking and fiddling with someone else’s library.

Data

Beautiful code is kind of a goal unto itself, but we really want to demonstrate how useful SciPy is in real scientific analysis. Therefore, although cute examples on synthetic data can illustrate quite well what a piece of code does, it would be extremely useful for us if you can point to examples with real scientific data behind them.

Today we take a break from ND-Arrays and show how task scheduling can attack
other collections like the simple list of Python objects.

Unstructured Data

Often, before we have an ndarray or a table/DataFrame, we have unstructured
data like log files. In this case tuned subsets of the language (e.g.
numpy/pandas) aren’t sufficient; we need the full Python language.

This workflow grows complex for most users when you introduce many processes.
To resolve this we build our normal tricks into a new dask.Bag collection.

Bag

In the same way that dask.array mimics NumPy operations (e.g. matrix
multiply, slicing), dask.bag mimics functional operations like map,
filter, reduce found in the standard library as well as many of the
streaming functions found in toolz.
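As a point of reference, here is what such a pipeline looks like single-threaded with just the standard library (no dask required); bag's job is to run this shape of computation across cores:

```python
from functools import reduce

data = range(10)

# filter -> map -> reduce: the style of pipeline that dask.bag parallelizes
evens = filter(lambda x: x % 2 == 0, data)
squares = map(lambda x: x ** 2, evens)
total = reduce(lambda a, b: a + b, squares)  # 0 + 4 + 16 + 36 + 64 = 120
```

The dask.bag spelling is nearly identical (something like db.from_sequence(range(10)).filter(...).map(...) followed by a reduction), but the exact method names vary by version, so check the dask docs.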

We use all of our cores and stream through memory on each core. We use
multiprocessing but could get fancier with some work.

Related Work

There are a lot of much larger, much more powerful systems that have a similar
API, notably Spark and
DPark. If you have Big Data and need to
use very many machines then you should stop reading this and go install them.

I mostly made dask.bag because

It was very easy given the work already done on dask.array

I often only need multiprocessing + a heavy machine

I wanted something trivially pip installable that didn’t use the JVM

But again, if you have Big Data, then this isn’t for you.

Design

As before, a Bag is just a dict holding tasks, along with a little metadata.

Manipulations of bags create task dependency graphs. We eventually execute
these graphs in parallel.
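To make that concrete, here is a toy version of the dict-of-tasks model: a hand-written graph summing two partitions, plus a naive recursive evaluator. This is an illustrative sketch, not dask's actual scheduler:

```python
# A task graph is just a dict: keys name results, values are either
# data or tuples of (function, *arguments), where arguments may be keys.
dsk = {('b', 0): list(range(5)),
       ('b', 1): list(range(5, 10)),
       ('c', 0): (sum, ('b', 0)),
       ('c', 1): (sum, ('b', 1)),
       'total': (sum, [('c', 0), ('c', 1)])}

def _resolve(dsk, arg):
    """Turn an argument into data: recurse into lists and graph keys."""
    if isinstance(arg, list):
        return [_resolve(dsk, a) for a in arg]
    if isinstance(arg, (tuple, str)) and arg in dsk:
        return get(dsk, arg)
    return arg

def get(dsk, key):
    """Naive single-threaded scheduler: evaluate one key of the graph."""
    task = dsk[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        return func(*[_resolve(dsk, a) for a in args])
    return task
```

A real scheduler walks the same graph but runs independent tasks in parallel and frees intermediate results as soon as they are no longer needed.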

Execution

We repurpose the threaded scheduler we used for arrays to support
multiprocessing to provide parallelism even on pure Python code. We’re
careful to avoid unnecessary data transfer. None of the operations listed above
requires significant communication. Notably we don’t have any concept of
shuffling or scatter/gather.

>>> list(b.map(lambda x: x * 10))  # This works!
[0, 10, 20, 30, 40, 0, 10, 20, 30, 40, 0, 10, 20, 30, 40]

>>> list(b.map(lambda x: x / 0))  # This errs gracefully!
ZeroDivisionError: Exception in remote Process
    integer division or modulo by zero
    Traceback:
        ...

These tricks remove the need for user expertise.

Productive Sweet Spot

I think that there is a productive sweet spot in the following configuration

Pure Python functions

Streaming/lazy data

Multiprocessing

A single large machine or a few machines in an informal cluster

This setup is common and it’s capable of handling terabyte scale workflows.
In my brief experience people rarely take this route. They use single-threaded
in-memory Python until it breaks, and then seek out Big Data Infrastructure
like Hadoop/Spark at relatively high productivity overhead.

Your workstation can scale bigger than you think.

Example

Here is about a gigabyte of
network flow data,
recording which computers made connections to which other computers on the
UC-Berkeley campus in 1996.

Easier/simpler for Python programmers to hack on.
The implementation is 350 lines including comments.

Again, this is really just a toy experiment to show that the dask model isn’t
just about arrays. I absolutely do not want to throw Dask in the ring with
Spark.

Conclusion

However I do want to stress the importance of single-machine parallelism.
Dask.bag targets this application well and leverages common hardware in a
simple way that is both natural and accessible to most Python programmers.

A skilled developer could extend this to work in a distributed memory context.
The logic to create the task dependency graphs is separate from the scheduler.

Special thanks to Erik Welch for finely crafting
the dask optimization passes that keep the data flowing smoothly.

Stack and Concatenate

We often store large arrays on disk in many different files. We
want to stack or concatenate these arrays together into one logical array.
Dask solves this problem with the stack and concatenate functions, which
stitch many arrays together into a single array, either creating a new
dimension with stack or along an existing dimension with concatenate.

Stack

We stack many existing dask arrays into a new array, creating a new dimension
as we go.

>>> import dask.array as da
>>> arrays = [from_array(np.ones((4, 4)), blockshape=(2, 2))
...           for i in range(3)]  # A small stack of dask arrays

>>> da.stack(arrays, axis=0).shape
(3, 4, 4)

>>> da.stack(arrays, axis=1).shape
(4, 3, 4)

>>> da.stack(arrays, axis=2).shape
(4, 4, 3)

This creates a new dimension with length equal to the number of slices

Concatenate

We concatenate existing arrays into a new array, extending them along an
existing dimension

>>> import dask.array as da
>>> arrays = [from_array(np.ones((4, 4)), blockshape=(2, 2))
...           for i in range(3)]  # small stack of dask arrays

>>> da.concatenate(arrays, axis=0).shape
(12, 4)

>>> da.concatenate(arrays, axis=1).shape
(4, 12)

I suspect that the temperature scale is in Kelvin. It looks like the random
day is taken during Northern Summer. Another fun one: let’s look at the
difference between the temperatures at 00:00 and at 12:00

>>> imshow(x[::4].mean(axis=0) - x[2::4].mean(axis=0), cmap='RdBu_r')

Even though this looks and feels like NumPy we’re actually operating off of
disk using blocked algorithms. We execute these operations using only a small
amount of memory. If these operations were computationally intense (they
aren’t) then we would also benefit from multiple cores.

What just happened

To be totally clear the following steps just occurred:

Open up a bunch of netCDF files and locate a temperature variable
within each file. This is cheap.

For each of those temperature variables create a da.Array object,
specifying how we want to block up that variable. This is also cheap.

Make a new da.Array by concatenating all of our da.Arrays for each
day. This, like the other steps, is just book-keeping. We haven’t loaded
data or computed anything yet.

Write numpy-style code x[::4].mean(axis=0) - x[2::4].mean(axis=0).
This creates yet another da.Array with a more complex task graph. It
takes a few hundred milliseconds to create this dictionary.

Call imshow on our da.Array object

imshow calls np.array on its input; this starts the multi-core task
scheduler

A flurry of chunks fly out of all the netCDF files. These chunks meet
various NumPy functions and create new chunks. Well organized magic occurs
and an np.ndarray emerges.

Matplotlib makes a pretty graph

Problems that Popped Up

The threaded scheduler is introducing significant overhead in its planning.
For this workflow the single-threaded naive scheduler is actually significantly
faster. We’ll have to find better solutions to reduce scheduling overhead.

Conclusion

I hope that this shows off how dask.array can be useful when dealing with
collections of on-disk arrays. As always I’m very happy to hear how we can
make this project more useful for your work. If you have large n-dimensional
datasets I’d love to hear about what you do and how dask.array can help. I
can be reached either in the comments below or at blaze-dev@continuum.io.

Acknowledgements

First, other projects can already do this. In particular if this seemed useful
for your work then you should probably also know about
Biggus,
produced by the UK Met office, which has been around for much longer than
dask.array and is used in production.

February 12, 2015

The Continuum Analytics team will be attending the Strata + Hadoop World Conference from February 17-20.
We’re excited to be showcasing the story of Anaconda with the power of Python at Strata, and we look forward to meeting those attending.

February 11, 2015

tl;dr: into now handles data on remote machines, including HDFS and the Hive
Metastore (kinda).

Motivation

Last week I wrote about
into, a library to migrate data between
formats. We saw that a network of pairwise data conversions can robustly
migrate data, eliminating some of the frustration of data science.

This frustration compounds when data lives on other computers or distributed
file systems like HDFS. Moving data from your local machine into something
like the Hive metastore often requires several steps.

scp data to cluster

hadoop fs -cp data to HDFS

CREATE TABLE in Hive/Impala to register data with metastore

Write SQL queries

While each of these steps may be relatively straightforward, their combination
can be daunting to the casual analyst.

Remote data and into

So we took this as a case study and extended the into network appropriately.
We integrate the following libraries and protocols

Because we’re connected to the network, lots of other things work too.

>>> df = into(pd.DataFrame, 'ssh://hostname:myfile.json', **auth)

Note that we’re not calling any code on the remote machine so fancy conversions
always happen locally.

If you’d like to use ssh generally you might want to take a look at
Paramiko which is really doing all of the heavy
lifting here.

HDFS

WebHDFS is a web interface to the Hadoop File System. It is surprisingly
high-performance (I often erroneously think of HTTP as slow) but isn’t always
turned on in every instance. If it is, then you should be able to transfer
data in and out easily, just as we did for SSH.

But Hive is also a bit finicky.
Blaze uses the
PyHive sqlalchemy dialect to query Hive
tables; unfortunately, the way Hive works means we need to create them by hand. Hive
is different from most databases in that it doesn’t have an internal format.
Instead, it represents tables as directories of CSV files (or other things).
This distinction mucks up into’s approach a bit but things work ok in normal
situations.

Lessons Learned

We had to add a couple new ideas to into to expand out to these systems.

Type Modifiers

First, we needed a way to refer to different variants of the same format of
file. For example, for CSV files we now have the following variants

A local CSV file
A CSV file accessible through HDFS
A CSV file accessible through SSH
A directory of CSV files
A directory of CSV files on HDFS
...

And the same for JSON, text, etc. Into decides what conversion functions to
run based on the type of the data, so in principle we need subclasses for all
combinations of format and location. Yuck.

To solve this problem we create functions SSH, HDFS, and Directory that
generate subclasses; we call these type modifiers. So SSH(CSV) is a new type
that acts both like a CSV file and like the hidden _SSH superclass.
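As a rough sketch of the idea (the class and function names below are simplified stand-ins, not into's actual internals), a type modifier can be built by creating a subclass dynamically:

```python
class _SSH(object):
    """Hidden mixin holding remote-connection details (sketch only)."""
    def __init__(self, *args, **kwargs):
        self.auth = kwargs.pop('auth', None)
        super(_SSH, self).__init__(*args, **kwargs)

def SSH(cls):
    """Type modifier: build a subclass acting like `cls` and like _SSH."""
    return type('SSH(%s)' % cls.__name__, (_SSH, cls), {})

class CSV(object):
    def __init__(self, path):
        self.path = path

# SSH(CSV) behaves like a CSV file but also carries connection details
remote = SSH(CSV)('myfile.csv', auth={'hostname': 'example.com'})
```

Because the result is a real subclass, dispatch on CSV still matches SSH(CSV), which is what lets conversion functions written for plain CSV files keep working.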

Note that users don't usually see these types (unless they want to be
explicit); they usually interact with URI strings.

Temporary files

Second, we need a way to route through temporary files. E.g. consider the
following route:

SSH(CSV) -> CSV -> pd.DataFrame

Both steps of this path are easy given paramiko and pandas. However we
don’t want the intermediate CSV file to hang around (users would hate us if we
slowly filled up their /tmp folder.) We need to clean it up when we’re done.

To solve this problem, we introduce a new type modifier, Temp, that drops
itself on garbage collection (drop is another magic function in into, see
docs). This lets us tie the
Python garbage collector to persistent data outside of the Python process.
It’s not fool-proof, but it is pretty effective.

SSH(CSV) -> Temp(CSV) -> pd.DataFrame

This is also a good example of how we build type modifiers. You can safely
ignore the following code:
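The original snippet is omitted here; as a hedged sketch (simplified names, not into's actual code), a Temp-style modifier might tie file cleanup to garbage collection like this:

```python
import os
import tempfile

class _Temp(object):
    """Hidden mixin: remove the underlying file when garbage collected."""
    def __del__(self):
        try:
            os.remove(self.path)
        except OSError:
            pass

def Temp(cls):
    """Type modifier: build a subclass acting like `cls` and like _Temp."""
    return type('Temp(%s)' % cls.__name__, (_Temp, cls), {})

class CSV(object):
    def __init__(self, path):
        self.path = path

fd, fname = tempfile.mkstemp(suffix='.csv')
os.close(fd)
t = Temp(CSV)(fname)   # acts like a CSV file...
del t                  # ...but cleans itself up when collected
```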

I won’t be surprised if this approach concerns a few people but I’ve found it to
be effective so far.

Keyword Arguments

The number of possible keyword arguments to a single into call is increasing.
We don’t have a good mechanism to help users discover the valid options for
their situation. Docstrings are hard here because the options depend on the
source and target inputs. For the moment we’re solving this with online
documentation for each complicated format but
there is probably a better solution out there.

Help!

The new behavior around ssh://, hdfs://, and hive:// is error
prone and could really use play-testing. I strongly welcome feedback and
error reporting here. You could
file an issue
or e-mail blaze-dev@continuum.io.

Other

I didn’t mention anything about S3 and RedShift support that was also
recently merged. This is because I think Phil Cloud might write a separate
blogpost about it. We did this work in parallel in an effort to hash out how
best to solve the problems above. I think it worked decently well.

Also, we’ve added an into command line interface. It works just like the
into function with strings, except that we’ve reversed the order of the
arguments to be more like cp. An example is below:

February 09, 2015

While not represented in our blog activity, 2014 turned out to be a very busy year for NumFOCUS. We sponsored more projects, gained new board members, and added programs. It has been hard to keep up with all the activity within the organization, but perhaps much harder for those outside. Let me go through a few of the highlights from the year to get folks caught up.

First and foremost, we held our first election process. This has allowed us to welcome Lorena Barba, Brian Granger, Stefan Karpinski and Cindee Madison to the board of directors. In an effort to grow the number of people helping out, we have also created two new ways for people to contribute: Advisor Council and Vice President roles. The inaugural advisor council consists of Fernando Perez and Travis Oliphant. They will help guide the organization as it grows in both scope and operations. The new Vice Presidents are Sheila Miguez, James Powell and Ben Zaitlen, all committed community members who have helped in various ways to get us started. We are always looking for more volunteers to help build our organization, but with these added team members we hope to continue to deliver high quality services to our community.

Over the year we held four PyData conferences, supported one John Hunter Technology Fellow, led Women in Technology bootcamps, helped between 20 and 30 conference attendees travel to important conferences, and more. As the new year starts, we will be doing many of these things plus working with the Training Up program. We see the need not only to sponsor projects but to create important programs that bring the community together. As always, we are looking for more volunteers for these events, so please do get in touch.

Perhaps one of the most exciting things that we did this year is expand our language scope from a Python-centric organization to include several non-Python projects. ROpenSci and Julia are now part of our lineup. Joining us on the education front are Software Carpentry and sister organization Data Carpentry. We also finished signing the Python projects that we had been working with for a while, including SymPy, IPython, and AstroPy. While fiscal sponsorship doesn’t make sense for every project, it is a service we happily provide to help projects sustain and grow in natural ways.

In the coming year, we plan to be a great deal more vocal about our activities. Personally, two major goals I have are to expand our reach to Europe and to start making headway on project governance recommendations. But that is another story.

This is an excellent application of new long-read technology to further
illuminate the characteristics of several medium-to-high complexity
microbial communities. The methods are expert, the results are
fascinating, and the discussion is well done.

Diversity of closely-related strains & rare organisms accounts for a major
portion of the communities.

The portion of the results that is most novel and most fascinating is
the extensive analysis of rare sequences and the disparity in
observations from Illumina (assemblies) and Moleculo (long reads and
assemblies). The basic results are, on first examination,
counter-intuitive: many long-read sequences are obtained from abundant
organisms that simply don't show up in Illumina short-read assemblies.
The statement is made that this is because of strain variation in the
community, i.e. that Illumina assemblies are fragmented due to strain
variation and this blocks the observation of the majority of the
community. This is to some extent borne out by the low mapping
percentages (which is the primary evidence offered by the authors),
and also matches our own observations.

we try to make distinctions between solidly tested and supported
code, and experimental code; see issue #471 for our decisions
on this front (basically, we allow a certain amount of fragility in
the sandbox/ directory).

the khmer-recipes project is where we put little recipes that we
want to make available, so these would be useful parts of data analysis
pipelines but not entire pipelines;

February 05, 2015

The Continuum Analytics team believes in engaging with our community and fostering relationships throughout the tech world.
To keep you all in the loop, we’re updating our blog each month with a list of events where you can find us attending, hosting, or participating!

February 04, 2015

It’s official! Harriet Dashnow, Stéfan van der Walt, and I will be writing an O’Reilly book about the SciPy library and the surrounding ecosystem. The book is called Elegant SciPy, and is intended to teach SciPy to fledgling Pythonistas, guided by the most elegant SciPy code examples we can find.

So, if you recently came across scientific Python code that made you go “Wow!” with its elegance, simplicity, cleverness, or power, please point us to it! As an example, have a look at Vighnesh Birodkar’s code to build a region adjacency graph from an n-dimensional image, which I highlighted previously here.

Each chapter will have one or two snippets that we will work towards. Each of these will be credited as “written by/nominated by”, and needs to be published under a permissive license such as MIT, BSD, or public domain to be considered for inclusion. We would especially like nominations in the following categories:

statistics (using scipy.stats)

image processing and computer vision

Fourier transforms

sparse matrix operations

eigendecomposition and linear algebra

optimization

streaming data analysis

spatial and geophysical data analysis

We’ll also consider other parts of the SciPy library and ecosystem.

We invite you to submit code snippets for inclusion in the book. We’d also appreciate a small description of about one paragraph explaining what the code is used for and why you think it’s elegant, even though this is often self-evident. =)

February 03, 2015

Motivation

We spend a lot of time migrating data from common interchange formats, like
CSV, to efficient computation formats like an array, a database or binary
store. Worse, many don’t migrate data to efficient formats because they don’t
know how or can’t manage the particular migration process for their tools.

Your choice of data format is important. It strongly impacts performance (10x
is a good rule of thumb) and who can easily use and interpret your data.

When advocating for Blaze I often say
“Blaze can help you query your data in a variety of formats.” This assumes
that you’re able to actually get it into that format.

Enter the into project

The into function efficiently migrates data between formats.
These formats include both in-memory data structures and on-disk or remote formats.

How does it work?

This is challenging. Robust and efficient conversion between any pair of
formats is fraught with special cases and bizarre libraries. The common
solution is to convert through a common format like a DataFrame, or streaming
in-memory lists, dicts, etc. (see dat)
or through a serialization format like
ProtoBuf or
Thrift. These are excellent options and often
what you want. Sometimes however this can be slow, particularly when dealing
with live computational systems or with finicky storage solutions.

Consider for example, migrating between a numpy.recarray and a
pandas.DataFrame. We can migrate this data very quickly in place. The bytes
of data don’t need to change, only the metadata surrounding them. We don’t
need to serialize to an interchange format or translate to intermediate
pure Python objects.

Consider migrating data from a CSV file to a PostgreSQL database. Using
Python iterators through SQLAlchemy we rarely exceed 2000 records per
second. However, using direct CSV loaders native to
PostgreSQL we can achieve speeds greater than 50000 records per second. This
is the difference between an overnight job and a cup of coffee. However this
requires that we’re flexible enough to use special code in special situations.

Expert pairwise interactions are often an order of magnitude faster than
generic solutions.

Into is a network of these pairwise migrations. We visualize that network
below:

Into's decentralized migration scheme. Complex but powerful

Each node is a data format. Each directed edge is a function that transforms
data between two formats. A single call to into may traverse multiple edges
and multiple intermediate formats. For example, when we migrate a CSV file to
a Mongo database we might take the following route:

Load in to a DataFrame (pandas.read_csv)

Convert to np.recarray (DataFrame.to_records)

Then to a Python Iterator (np.ndarray.tolist)

Finally to Mongo (pymongo.Collection.insert)

Alternatively we could write a special function that uses MongoDB’s native CSV
loader and shortcut this entire process with a direct edge CSV -> Mongo.

To find the most efficient route we weight the edges of this network with
relative costs (measured ad hoc). We use networkx to find the shortest path
before we start the migration. If for some reason an edge fails (raises
NotImplementedError) we can reroute automatically. In this way we’re both
efficient and robust to failure.
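The routing idea can be sketched with a plain Dijkstra search over a cost-weighted graph (into itself uses networkx; the formats and costs below are purely illustrative). If a conversion raises NotImplementedError, into drops that edge and searches again:

```python
import heapq

def shortest_path(graph, source, target):
    """Dijkstra over a {node: {neighbor: cost}} adjacency dict."""
    queue = [(0.0, source, [source])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nbr, c in graph.get(node, {}).items():
            if nbr not in seen:
                heapq.heappush(queue, (cost + c, nbr, path + [nbr]))
    raise ValueError("no route from %s to %s" % (source, target))

# Hypothetical edge costs: the direct CSV -> Mongo edge is expensive,
# so the cheapest route goes through several intermediate formats.
graph = {'CSV': {'DataFrame': 1.0, 'Mongo': 10.0},
         'DataFrame': {'recarray': 0.1},
         'recarray': {'Iterator': 0.5},
         'Iterator': {'Mongo': 2.0}}

route = shortest_path(graph, 'CSV', 'Mongo')
```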

Note that we color some nodes red. These nodes can be larger than memory.
When we migrate between two red nodes (both the input and output may be larger
than memory) then we limit our path to the red subgraph to ensure that we don’t
blow up mid-migration. One format to note is chunks(...) like
chunks(DataFrame) which is an iterable of in-memory DataFrames. This
convenient meta-format allows us to use compact data structures like numpy
arrays and pandas DataFrames on large data while keeping only a few tens of
megabytes in memory at a time.
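The chunks(...) idea is easy to sketch in plain Python (a toy analogue using lists; the real meta-format wraps DataFrames and similar compact structures):

```python
def chunks(seq, size):
    """Yield successive fixed-size, in-memory pieces of a larger sequence."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Only one piece needs to exist in memory at a time when consumed lazily
pieces = list(chunks(list(range(10)), 4))
```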

The networked approach allows developers to write specialized code for special
situations and know that this code will only be used in the right situation.
This approach allows us to handle a very complex problem in an isolated and
separable manner. The central dispatching system keeps us sane.

History

I wrote about into long ago in
connection to Blaze. I then promptly shut up about it. This was because the
old implementation (before the network approach) was difficult to
extend/maintain and wasn’t ready for prime-time.

I am very happy with this network. Unexpected applications very often just
work and into is now ready for prime-time. It’s also available independently
from Blaze, both via conda and via pip. The
major dependencies are NumPy, Pandas, and NetworkX so it’s relatively
lightweight for most people who read my blog. If you want to take advantage of
some of the higher performing formats, like HDF5, you’ll need to install
those libraries as well (pro-tip, use conda).

Or just give it a shot without reading anything. My hope is that the interface
is simple enough (just one function!) that users can pick it up naturally. If
you run into issues then I’d love to hear about them at
blaze-dev@continuum.io

Michael and I will be working ~9-5pm Eastern, Thu/Fri/Mon/Tues/Wed,
Feb 19-25th (so weekend excepted), and people are welcome to drop in
anytime. We are planning to focus on khmer, screed, khmer-protocols, and khmer-recipes; other lab projects
(like paper pipelines or workshop materials) are fair game, too.
We'll have a list of issues posted for people who are looking for
a small task.

We don't have any particular physical location reserved, but you're
welcome to join us at Michigan State University to hang out in person.
We also plan to be fully online-enabled within the 9-5pm EDT period,
and will have a Google Hangout running that anyone can join. We can
also set up one-to-one video conferences and screen sharing with
people who need help. (We will not be working outside of those
hours: family, sanity, etc.)

February 01, 2015

We are pleased to announce that the Laboratory for Data Intensive
Biology at UC Davis has joined the Software Carpentry Foundation
as an Affiliate Member for three years, starting in January 2015.

"We've been long-term supporters of Software Carpentry, and Affiliate
status lets us support the Software Carpentry Foundation in a tangible
way," said Dr. C. Titus Brown, the lab director. "This status also
gives us the opportunity to include Software Carpentry as part of a
larger biological data science training program at UC Davis."

Software Carpentry is a volunteer organisation whose goal is to make
scientists more productive, and their work more reliable, by teaching
them basic computing skills. Founded in 1998, it runs short, intensive
workshops that cover program design, version control, testing, and
task automation. It has taught over 10,000 learners, and primarily
uses a two-day workshop format.

In October 2014, a non-profit Software Carpentry Foundation was
created to act as a governing body for the project, with Dr. Greg
Wilson as the founding Executive Director. Affiliate status with the
SCF provides preferential access to instructor training as well as
waiving per-workshop fees.

The Laboratory for Data Intensive Biology (DIB) is newly started with
Dr. Brown's move to UC Davis in January 2015. The lab is in the
Department of Population Health and Reproduction in the School of
Veterinary Medicine, and is part of both the Data Science Initiative
and the Genome Center at UC Davis. DIB does fundamental and applied
research on making use of the increasing volume and variety of
biological data, and Dr. Brown currently has funding from the NIH, the
USDA, and the Moore Foundation.

For some reason, this genre seems to be a standard for blog posts. The author
just can't believe that X (a language or operating system) could be so
stupid as to do Y.

It is a little puzzling, because I guess most people reading someone saying
how stupid they are, do not say "ah yes, thank you" but "f#@& you asshole".
So, if we do write in that style, we make it less likely the problem will be
fixed. I suppose that's obvious to the person writing, so there must be some
other thing that is attractive about labeling the other person / system / idea
as a loser.

We technical types don't like to think about soft issues like tone. Like the
drunk under the lamp-post
we prefer technical problems, and ignore other important problems such as the
way we are making our readers feel.

When I am reading a blog post of the "I can't believe it" type, I feel like
someone with bad breath is shouting in my face, and my first and strong
reaction is to move away and avoid that person whenever I see them. For
example, I suppose I won't click on a link to blog post by Armin Ronacher
again, because I predict it will make me feel bad. I am sure that is a shame
for me, because I have every reason to think Mr Ronacher is competent,
intelligent and well-informed. It is just I don't want to feel that way. I
find it harder to work when I feel that way. I am less gentle with my friends
when I feel that way.

Remember that Python usage survey that went around the interwebs late last year? Well, the results are finally out and I’ve visualized them below for your perusal. This survey has been running for two years now (2013-2014), so where we have data for both years, I’ve charted the results so we can see the changes in Python usage over time.

I’ll note that a big focus of this survey is to find out if Python users are transitioning over to Python 3, and if they aren’t, then why they aren’t making that transition. If you’re on the fence about switching from Python 2 to 3, there are some great articles out there about the key differences between the two versions and the many, many, …many advantages that Python 3 offers over Python 2.

If you want to check out the underlying data yourself, head on over to Bruno Cauet’s page.

The first big question we all want to know is: How many Python users have actually used Python 2 and 3?

As expected, nearly every Python user has used Python 2 at some point in their career. We also see good news for Python 3 over the past year: Python 3 usage increased 12 percentage points in 2014, up to nearly 3/4 of the surveyed Python users.

Of course, “writing code with Python 3 once” doesn’t mean that they actually use it regularly. The question below gets at that more directly.

Here we see yet more good news for Python 3: As much as 10% more Python users are primarily using Python 3 than in 2013, now accounting for 1/3 of all Python users. It seems the transition to Python 3 will be slow but steady for the next few years.

The transition we see above may be caused by project managers at work. What version do Python users go to when working on their personal projects?

Here we see a closer divide between the two versions: Nearly half of all users will start their own projects with Python 3, whereas the other half still ardently remains pro-Python 2.

Let’s break these usage patterns down by the specific version now.

Somehow, Python v. 2.5 and 2.6 are still in use in some places, but 2.7 still dominates the Python 2 landscape. We also see telltale signs that Python 3 is becoming a mature language in its own right, with users stuck in the older 3.2 and 3.3 versions.

To summarize so far: Over 2/3 of all Python users are still using Python 2, with the majority of them sitting at 2.7. This tells us that Python is still very much a divided language, with a large portion of the user base unwilling to upgrade to the latest version despite the fact that Python 2.7 is nearly 5 years old now. (That’s ancient in programming language years!)

A common complaint of the ardent Python 2 users is that Python 3 was a huge mistake. How does that turn out in this survey?

Surprisingly, it seems the complainers are a minority: Only 12% of the Python users surveyed think Python 3 was a mistake. 1/3 of Python users think Python 3 is great (probably the ones who made the switch!), whereas over half of Python users think the Python developers could’ve made the Python 2 -> 3 transition more fluid.

So, most Python users don’t think Python 3 was a mistake, but 2 in 3 of them still haven’t made the switch. That leaves us to wonder: Why haven’t the Python 2 users made the switch yet?

Package dependencies are — by far — the most-cited reasons for refusing to switch to Python 3. Many of us rely on specific packages — such as numpy, scipy, pandas, etc. — for our day-to-day work. If the packages haven’t been upgraded to Python 3 yet, then why should we? Lucky for us, there’s a web site dedicated to tracking the packages that are Python 3 ready. Are all of your packages on there? Then maybe it’s time to make the switch.

Another common reason for refusing to switch to Python 3 is that there’s no incentive. We’re comfortable with Python 2, we know its ins and outs, and everything works fine. Why bother upgrading? If you fall in this category, I’ll point you to this article again that compares the key differences between Python 2 and 3. And don’t forget about the 2 pounds of reasons why Python 3 is a huge upgrade over Python 2.

The last two major reasons for refusing to switch to Python 3 are a little harder to address. If you have a large legacy code base or your manager simply refuses to make the upgrade, then it’s up to you to convince management (or yourself) that the upgrade is worth the work. The links I included in the above paragraph are a good start.

Finally, the survey wanted to look at cross-compatibility between Python 2 and 3.

Here we start to see more good news for the future of Python 3: We saw a significant increase in the number of users who have ported their code from Python 2 to 3.

It’s not very well known that Python has packages for converting code from Python 2 to 3 and Python 3 to 2. When the survey asked users whether they’d used either of these packages even once, well, see for yourself…

It seems like the Python devs need to do a better job of raising awareness of these code conversion packages.

Because Python 2 and 3 really aren’t that different, it’s not necessary to write code for just one version. As the famous taco shell commercial goes: “Por que no los dos?” (Why not both?)

To that end, the survey also polled users on whether they write Python 2- and 3- compatible code.

Apparently only 1 in 3 Python developers bother to write multi-version compatible code, and there hasn’t been much of a change since 2013.
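One common way to write code that runs unchanged on both versions is the standard-library __future__ imports (a general Python feature, not something from this survey):

```python
from __future__ import print_function, division

# With these imports, Python 2 adopts Python 3 semantics:
# print is a function, and / performs true division.
result = 7 / 2
print(result)
```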

I look forward to seeing how the Python 3 transition progresses in 2015. If you have any more resources for helping (and convincing!) Python developers to make the transition, please share them in the comments below.

January 29, 2015

This post is based on a tutorial given in a machine learning course at University of Bremen. It summarizes some recommendations on how to get started with machine learning on a new problem. This includes

ways of visualizing your data

choosing a machine learning method suitable for the problem at hand

identifying and dealing with over- and underfitting

dealing with large (read: not very small) datasets

pros and cons of different loss functions.

The post is based on "Advice for applying Machine Learning" from Andrew Ng. The purpose of this notebook is to illustrate the ideas in an interactive way. Some of the recommendations are debatable. Take them as suggestions, not as strict rules.

In [1]:

import time
import numpy as np
np.random.seed(0)

In [2]:

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:


# Modified from http://scikit-learn.org/stable/auto_examples/plot_learning_curve.html
from sklearn.learning_curve import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    """Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.
    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects.
    """
    plt.figure()
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=1, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    # Shade one standard deviation around the mean scores
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    plt.legend(loc="best")
    plt.grid("on")
    if ylim:
        plt.ylim(ylim)
    plt.title(title)

Notice that we generate a dataset for binary classification consisting of 1000 datapoints and 20 feature dimensions. We have used the DataFrame class from pandas to encapsulate the data and the classes into one joint data structure. Let's take a look at the first 5 datapoints:
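The generating cell is not shown above; a sketch along these lines (the exact parameters are assumptions, not the tutorial's original values) produces such a DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# 1000 datapoints, 20 feature dimensions, binary target
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=2, n_redundant=2,
                           random_state=0)

# Encapsulate features and class labels in one joint structure
df = pd.DataFrame(np.hstack((X, y[:, np.newaxis])),
                  columns=list(range(20)) + ["class"])
```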

In [5]:

df[:5]

Out[5]:

          0         1         2         3         4         5         6         7         8         9  ...        11        12        13        14        15        16        17        18        19  class
0 -1.063780  0.676409  1.069356 -0.217580  0.460215 -0.399167 -0.079188  1.209385 -0.785315 -0.172186  ... -0.993119  0.306935  0.064058 -1.054233 -0.527496 -0.074183 -0.355628  1.057214 -0.902592      0
1  0.070848 -1.695281  2.449449 -0.530494 -0.932962  2.865204  2.435729 -1.618500  1.300717  0.348402  ...  0.225324  0.605563 -0.192101 -0.068027  0.971681 -1.792048  0.017083 -0.375669 -0.623236      1
2  0.940284 -0.492146  0.677956 -0.227754  1.401753  1.231653 -0.777464  0.015616  1.331713  1.084773  ... -0.050120  0.948386 -0.173428 -0.477672  0.760896  1.001158 -0.069464  1.359046 -1.189590      1
3 -0.299517  0.759890  0.182803 -1.550233  0.338218  0.363241 -2.100525 -0.438068 -0.166393 -0.340835  ...  1.178724  2.831480  0.142414 -0.202819  2.405715  0.313305  0.404356 -0.287546 -2.847803      1
4 -2.630627  0.231034  0.042463  0.478851  1.546742  1.637956 -1.532072 -0.734445  0.465855  0.473836  ... -1.061194 -0.888880  1.238409 -0.572829 -1.275339  1.003007 -0.477128  0.098536  0.527804      0

5 rows × 21 columns

It is hard to get any clue of the problem by looking at the raw feature values directly, even on this low-dimensional example. Thus, there is a wealth of ways of providing more accessible views of your data; a small subset of these is discussed in the next section.

The first step when approaching a new problem should nearly always be visualization, i.e., looking at your data.

Seaborn is a great package for statistical data visualization. We use some of its functions to explore the data.

The first step is to generate scatter plots and histograms using pairplot. The two colors correspond to the two classes, and we use a subset of the features and only the first 50 datapoints to keep things simple.

In [6]:

_ = sns.pairplot(df[:50], vars=[8, 11, 12, 14, 19], hue="class", size=1.5)

Based on the histograms, we can see that some features are more helpful for distinguishing the two classes than others. In particular, features 11 and 14 seem to be informative. The scatterplot of those two features shows that the classes are almost linearly separable in this 2D space. A further thing to note is that features 12 and 19 are highly anti-correlated. We can explore correlations more systematically using corrplot:

In [7]:

plt.figure(figsize=(12, 10))
_ = sns.corrplot(df, annot=False)

We can see our observations from before confirmed here: features 11 and 14 are strongly correlated with the class (they are informative). They are also strongly correlated with each other (via the class). Furthermore, feature 12 is highly anti-correlated with feature 19, which in turn is correlated with feature 14. We thus have some redundant features. This can be problematic for some classifiers, e.g., naive Bayes, which assumes that the features are independent given the class. The remaining features are mostly noise; they are neither correlated with each other nor with the class.
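The kind of pairwise correlation that corrplot visualizes can also be computed directly with numpy (toy data here, standing in for the features discussed above):

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.randn(500)
b = -a + 0.05 * rng.randn(500)   # strongly anti-correlated with a
c = rng.randn(500)               # pure noise, uncorrelated with both

# 3x3 matrix of pairwise Pearson correlations between a, b, and c
corr = np.corrcoef(np.vstack([a, b, c]))
```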

Notice that data visualization becomes more challenging if you have more feature dimensions and fewer datapoints. We give an example for visualizing high-dimensional data later.

Once we have visually explored the data, we can start applying machine learning to it. Given the wealth of methods for machine learning, it is often not easy to decide which method to try first. This simple cheat-sheet (credit goes to Andreas Müller and the sklearn-team) can help to select an appropriate ML method for your problem (see http://dlib.net/ml_guide.svg for an alternative cheat sheet).