Data science often has a similar workflow: acquire, ingest/clean,
store/manage, data wrangling, visual analysis, modeling, story-telling. For many of
those stages, python has nice tools.

Christian Staudt calls it an ecosystem. Well, if you
make a diagram showing the various tools, it starts to look like one of those
biology diagrams showing which kinds of animals eat which other kinds of
animals. Likewise, python libraries each have their function and their
specialized niche.

Numpy. The fundamental package for numeric computing in
python. N-dimensional arrays. Numpy arrays are different to python lists:
they’re laid out in memory in a much more efficient and compact
way.
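To make the memory point concrete, here is a small sketch of my own (not from
the talk). It assumes CPython and a fixed 64-bit integer dtype; the variable
names are made up for the illustration. A list holds pointers to separate int
objects, whereas the array is one contiguous buffer of raw values.

```python
import sys

import numpy as np

n = 1000
values = list(range(n))               # python list of n separate int objects
array = np.arange(n, dtype=np.int64)  # one contiguous block of 64-bit ints

# Rough size of the list: the list object itself plus every int it points to.
# (Assumption: CPython, where sys.getsizeof reports per-object sizes.)
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Size of the array's data buffer: exactly 8 bytes per element.
array_bytes = array.nbytes

print("list :", list_bytes, "bytes")
print("array:", array_bytes, "bytes")
print("contiguous:", array.flags["C_CONTIGUOUS"])
```

The exact list size differs per python version, but it should come out several
times larger than the array.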

Essential for understanding numpy: “lose your loops”. Don’t loop over arrays
with regular python operations, but use numpy methods. That pushes
everything down into highly efficient compiled code, which can gain you an
order of magnitude in performance.
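A minimal sketch of what “lose your loops” could look like in practice. This
is my own illustration, not the example from the talk; the function names are
hypothetical. Both functions compute the same sum of squares, but the second
one hands the whole array to numpy in a single call.

```python
import numpy as np


def sum_of_squares_loop(values):
    """Plain python loop: every element passes through the interpreter."""
    total = 0.0
    for value in values:
        total += value * value
    return total


def sum_of_squares_numpy(values):
    """Vectorized: one call, the loop runs inside numpy's compiled code."""
    return float(np.dot(values, values))


data = np.arange(1000, dtype=np.float64)
loop_result = sum_of_squares_loop(data)
numpy_result = sum_of_squares_numpy(data)
print(loop_result, numpy_result)
```

Timing both versions (for instance with ipython’s `%timeit`) on a larger array
is a quick way to see the difference on your own machine.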

He showed a quick example. It pays off to experiment with pandas and to
explicitly

Dask. Dask can combine many different pandas dataframes into one. Handy for
distributed computing. (Update: Dask can do much more than I originally
wrote, like parallel numpy. See Thursday’s keynote about Dask)

Scikit-learn. The starting point for machine
learning in Python. It has a modular approach: estimators, transformers,
pipelines.
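To give an idea of how those three pieces fit together, here is a small sketch
of my own (not shown in the talk). The data is synthetic and the step names
“scale” and “classify” are arbitrary labels I picked. A transformer
(`StandardScaler`) and an estimator (`LogisticRegression`) are chained into a
`Pipeline`, which then behaves like a single estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features, label is 1 when their sum is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Transformer + estimator chained together; fit() runs both steps in order.
model = Pipeline([
    ("scale", StandardScaler()),
    ("classify", LogisticRegression()),
])
model.fit(X, y)

predictions = model.predict(X)
accuracy = model.score(X, y)  # accuracy on the training data itself
print("training accuracy:", accuracy)
```

In a real project you would evaluate on held-out data instead of the training
set; this only shows the estimator/transformer/pipeline structure.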

Statsmodels. More or less “the
python replacement for R”. Statistical models, tests, etc.

Statsmodels and scikit-learn have many models in common. Scikit-learn
focuses on machine learning and is more pythonic. Statsmodels focuses on
hardcore statistical analysis and is more approachable for those coming
from R.

ipython. Powerful interactive python shell. He mentioned an extension for
doing parallel calculations. And there’s the ipython extension that ships
with “rpy2” (formerly known as “rmagic”), which lets you use R at the same
time from within your ipython shell.

jupyter notebooks. One of his favourite tools for data science
projects. Interactive notebooks where you can combine code, documentation
and visualizations. It starts to look like Donald Knuth’s literate
programming.