Act 1

Dear XXXX,

Thank you for coming in to interview with the team last week. Everyone
enjoyed speaking with you, but unfortunately it was decided that your
background and experience are not an ideal fit for the position. Please be
assured that this decision was arrived at after careful and thorough deliberation.

Again, we appreciate your time, and wish you the best of success in your job search.

I don’t tend to get too sniffy about the quality of discourse on the
Internet. I have some appreciation for even the most pointless, uninformed
flamewars. (And maybe my take on Web site comments is a subject for another post.) But
there’s an increasingly popular genre of article and blog post that’s starting to annoy
me a little. You’ve likely read them—they have titles like: “Python is Eating R’s Lunch,” “Why
Python is Going to Take Over Data Science,” “Why Python is a Pain in the Ass and
Will Never Beat R,” “Why Everyone Will Live on the Moon and Code in Julia in 5
Years,” etc.

And that’s all okay. Go on the Internet and bitch about languages you don’t like, or tell
everyone why your preferred one is awesome. That’s what
the Internet’s here for. And Lord knows I’ve done it myself.

Introduction

I want to spend some time messing about with iterators in Julia. I think they
not only provide a familiar and useful entry point into Julia’s type system and dispatch
model, but are also interesting in their own right. Clever use of iterators can
help simplify complicated loops, express their intent more clearly, and improve
memory usage.

A word of warning about the code here. Much of it isn’t idiomatic Julia, and I wouldn’t
necessarily recommend using this style in a serious project. I also can’t speak
to its performance vis-à-vis more obvious Julian alternatives. In some cases,
the style of the code examples below may help reduce memory usage, but
performance is not my main concern. (This may be the first blog post about Julia
unconcerned with speed.) Instead, I’m just interested in different ways of
expressing iteration problems.

For anyone who’d like to play along at home, there’s an IJulia notebook of
this material on GitHub, which can be viewed on nbviewer here.

The Iterator Protocol

Update 9/10/2013 New posts are going up on the blog, but I’m going to keep this post at the top for a while. Consider the site in beta for the moment, and please use the comment section of this post to report any issues. If you’re using IE to try and view the site, I’m sorry. But I’m not that sorry.

Update 9/3/2013 Things should be working reasonably well. A few kinks to work out, and I have to migrate the former site’s comments, but the current site is pretty much ready to go.

This is the new home for my blog, Slender Means. It’s still a work in progress: I’m finishing up the design and fixing weird links and typos left over from the WordPress-to-Pelican migration.

The code for Chapter 8 has been sitting around for a long time now. Let’s blow the dust off and check it out. One thing before we start: explaining PCA well is kinda hard. If any experts reading this feel I’ve described something imprecisely (and have a better description), I’m very open to suggestions.

Introduction

Chapter 8 is about Principal Components Analysis (PCA), which the authors perform on a data set of price time series for 24 stocks. In very broad terms, PCA is about projecting many real-life, observed variables onto a smaller number of “abstract” variables, the principal components. The principal components are chosen to best preserve the variation and correlation of the original variables. For example, if we have 100 variables in our data that are all highly correlated, we can project them down to just a few principal components—i.e., the high correlation between them can be imagined as coming from a single underlying factor that drives all of them, with other, less important factors driving their differences. When variables aren’t highly correlated, more principal components are needed to describe them well.
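To make this concrete, here’s a minimal sketch of the idea using scikit-learn’s PCA on made-up data (24 correlated series driven by one common factor); the data and names here are mine, not the chapter’s:

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up stand-in for the chapter's 24 stock series: returns driven
# mostly by a single common factor, plus idiosyncratic noise.
rng = np.random.default_rng(42)
common = rng.normal(size=(250, 1))                       # one underlying factor
returns = common @ rng.normal(size=(1, 24)) + 0.2 * rng.normal(size=(250, 24))

pca = PCA().fit(returns)
# With highly correlated columns, the first component soaks up most of the variance.
print(pca.explained_variance_ratio_[:3])
```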

This isn’t a very thoughtful post. But the conversation was becoming
sort of a shootout and my thoughts (half-formed as they are) were a bit
longer than a tweet. Essentially, I think the Python performance
shootouts—PyPy, Numba, Cython—are missing the point.

The point, I think, is that loops are a crutch. A triply-nested for loop in
Julia that increments a counter takes 8 lines of code (one to initialize the
counter, three for statements, one increment statement, and three end statements).
Only one of those lines tells me what the code does.
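To make that concrete, here’s the same kind of computation written both ways. I’m using Python and itertools purely for illustration (Julia has analogous tools); the point isn’t specific to either language:

```python
from itertools import product

data = range(10)

# Imperative version: most of these lines are loop bookkeeping.
count = 0
for i in data:
    for j in data:
        for k in data:
            if i + j + k == 10:
                count += 1

# Iterator version: a single expression that states what is being counted.
count = sum(1 for i, j, k in product(data, repeat=3) if i + j + k == 10)
```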

But most scientific programmers learned to code in imperative languages
and that style of thinking and coding has become natural. I’ve often
seen comments like this:

Which I think simply equates readability with familiarity. That isn’t
wrong, but it isn’t the whole story.

Introduction

Chapter 7 of Machine Learning for Hackers is about numerical
optimization. The authors organize the chapter around two examples of
optimization. The first is a straightforward least-squares problem like
the ones we’ve already encountered doing linear regressions, and it’s amenable
to standard iterative algorithms (e.g. gradient descent). The second is
a problem with a discrete search space that isn’t clearly differentiable, so
it lends itself to a stochastic/heuristic optimization technique (though
we’ll see the optimization problem is basically artificial). The first
problem gives us a chance to play around with SciPy’s optimization
routines. The second has us hand-coding a Metropolis algorithm;
this doesn’t show off much new Python, but it’s fun nonetheless.
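For anyone who hasn’t seen it, here’s the general shape of a Metropolis-style search in a few lines. This is a generic sketch with my own names and a made-up toy problem, not the chapter’s code:

```python
import numpy as np

def metropolis(objective, propose, x0, n_steps=10_000, temperature=1.0, seed=0):
    # Generic Metropolis-style search: `propose` draws a random neighbor of x,
    # `objective` is the score being minimized.
    rng = np.random.default_rng(seed)
    x, fx = x0, objective(x0)
    for _ in range(n_steps):
        cand = propose(x, rng)
        f_cand = objective(cand)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if f_cand <= fx or rng.random() < np.exp((fx - f_cand) / temperature):
            x, fx = cand, f_cand
    return x, fx

# Toy usage: a random walk over the integers looking for the minimum of (x - 7)**2.
best, score = metropolis(lambda x: (x - 7) ** 2,
                         lambda x, rng: x + rng.integers(-2, 3),
                         x0=0)
print(best, score)
```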

The notebook for this chapter is at the GitHub repo here, or you
can view it online via nbviewer here.

Ridge regression by least-squares

In chapter 6 we estimated LASSO regressions, which added an L1
penalty on the parameters to the OLS loss-function. The ridge regression
works the same way, but applies an L2 penalty to the parameters. The
ridge regression is a somewhat more straightforward optimization
problem, since the L2 norm we use gives us a differentiable loss function.
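That means we can hand the penalized loss straight to a gradient-based optimizer. Here’s a minimal sketch of that idea, with invented data and names rather than the chapter’s:

```python
import numpy as np
from scipy.optimize import minimize

def ridge_error(beta, X, y, lam):
    # OLS squared-error loss plus an L2 penalty on the coefficients.
    resid = y - X @ beta
    return resid @ resid + lam * (beta @ beta)

# Made-up data standing in for the chapter's.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

fit = minimize(ridge_error, x0=np.zeros(3), args=(X, y, 1.0), method="BFGS")
print(fit.x)
```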

In my opinion, Chapter 6 is the most important chapter in Machine
Learning for Hackers. It introduces the fundamental problem of machine
learning: overfitting and the bias-variance tradeoff. And it
demonstrates the two key tools for dealing with it: regularization and cross-validation.

It’s also a fun chapter to write in Python, because it lets me play with
the fantastic scikit-learn library. scikit-learn is loaded with
hi-tech machine learning models, along with convenient “pipeline”-type
functions that facilitate the process of cross-validating and selecting
hyperparameters for models. Best of all, it’s very well
documented.
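For a flavor of what those pipeline functions look like, here’s a sketch using the current scikit-learn API with invented sine data in the spirit of the chapter’s toy example (described below); it isn’t the book’s code:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Invented noisy sine data.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))[:, np.newaxis]
y = np.sin(2 * np.pi * x.ravel()) + rng.normal(scale=0.2, size=100)

# Cross-validate the regularization strength of a polynomial ridge regression.
pipe = make_pipeline(PolynomialFeatures(degree=12), Ridge())
grid = GridSearchCV(pipe, {"ridge__alpha": np.logspace(-6, 2, 30)}, cv=10)
grid.fit(x, y)
print(grid.best_params_)
```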

Fitting a sine wave with polynomial regression

The chapter starts out with a useful toy example—trying to fit a curve
to data generated by a sine function over the interval [0, 1] with added
Gaussian noise. The natural way to fit nonlinear data like this is with
a polynomial, so that the output y is a function of powers
of the input x. But there are two problems with this.

First, we can generate highly correlated regressors by taking powers of
x, leading to noisy parameter estimates. The inputs x are evenly
spaced numbers on the interval [0, 1]. So x and x ...
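To see how severe the collinearity is, here’s a quick check (my own illustration, not the book’s code):

```python
import numpy as np

# Powers of evenly spaced x on [0, 1] are nearly collinear.
x = np.linspace(0, 1, 101)
print(np.corrcoef(x, x**2)[0, 1])   # roughly 0.97
```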

Introduction

Chapter 5 of Machine Learning for Hackers is a relatively simple
exercise in running linear regressions. Therefore, this post will be
short, and I’ll only discuss the more interesting regression example,
which nicely shows how patsy formulas handle categorical variables.

Linear regression with categorical independent variables

In Chapter 5, the authors construct several linear regressions, the last
of which is a multivariate regression describing the number of page
views of top-viewed web sites. The regression is pretty straightforward,
but includes two categorical variables: HasAdvertising, which takes
values True or False; and InEnglish, which takes values Yes,
No and NA (missing).

If we include these variables in the formula, then patsy/statsmodels will
automatically generate the necessary dummy variables. For
HasAdvertising, we get a dummy variable equal to one when the
value is True. For InEnglish, which takes three values, we get two
separate dummy variables, one for Yes and one for No, with the missing
value serving as the baseline.
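Here’s a toy sketch of what that looks like. The data frame and its values are invented (and I store missing InEnglish values as the string "NA" so patsy treats them as a third category); it isn’t the book’s data:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented stand-in for the top-sites data.
df = pd.DataFrame({
    "log_pageviews":  [10.2, 9.8, 11.1, 8.7, 9.5, 10.0, 9.1, 10.6],
    "log_visitors":   [8.1, 7.9, 9.0, 7.0, 7.7, 8.2, 7.3, 8.6],
    "HasAdvertising": [True, False, True, False, True, False, True, False],
    "InEnglish":      ["Yes", "No", "NA", "Yes", "No", "NA", "Yes", "No"],
})

# patsy expands the categorical terms into dummy columns automatically:
# HasAdvertising[T.True], InEnglish[T.No], InEnglish[T.Yes] (with "NA" as baseline).
fit = smf.ols("log_pageviews ~ log_visitors + HasAdvertising + InEnglish", data=df).fit()
print(fit.params)
```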

Introduction

I’m not going to write much about this chapter. In my opinion, the payoff-to-effort ratio for this project is pretty low. The algorithm for ranking e-mails is straightforward but, I think, seriously flawed. Most of the code in the chapter (and there’s a lot of it) revolves around parsing the text in the files. It’s a good exercise in thinking through feature extraction, but it doesn’t introduce many new ML concepts. And from my perspective, there’s not much opportunity to show off any Python goodness. Still, I’ll hit a couple of points that are new and interesting.

The complete code is at the GitHub repo here, and you can read the notebook via nbviewer here.

1. Vectorized string methods in pandas. Back in Chapter 1, I groused about the lack of vectorized functions for operations on strings or dates in pandas. If it wasn’t a numpy ufunc, you had to use the pandas map() method. That’s changed a lot over the summer, and since pandas 0.9.0 we can call vectorized string methods.
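A quick illustration of the .str accessor (hypothetical addresses, not the chapter’s data):

```python
import pandas as pd

# Hypothetical sender addresses; the chapter's come from parsed e-mail files.
senders = pd.Series(["alice@example.com", "BOB@WORK.ORG", "carol@stats.example.com"])

# Vectorized string methods hang off the .str accessor (pandas >= 0.9.0),
# replacing things like senders.map(lambda s: s.lower()).
domains = senders.str.lower().str.split("@").str.get(1)
print(domains)
```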

For example, here’s the code in my chapter for the program that identifies e-mails that ...