Classifying Wikipedia Changes

The generous competitor's code performs logistic regression classification for multiple classes with stochastic gradient ascent.
It is also well suited for online learning because it uses the hashing trick to one-hot encode boolean, string, and categorical features.

To better understand these methods and tricks, I apply some of them here to a multilabel problem that I chose mostly for its easy access to a constant stream of training data:

All recent changes on Wikipedia are tracked on this special page
where we can see a number of interesting features such as the length of the change, the contributor's username,
the title of the changed article, and the contributor's comment for a given change.

Using the Wikipedia API to look at this stream of changes we can also see how contributors classify their changes as bot, minor, and new.
Multiple label assignments are possible, so that one contribution may be classified as both bot and new.
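As a rough sketch of what reading these labels could look like, the snippet below builds a query URL for the MediaWiki `recentchanges` list and turns one returned change record into a label vector and a raw feature dictionary. The endpoint, the chosen `rcprop` fields, and the helper names (`recent_changes_url`, `labels`, `features`) are my assumptions for illustration and may well differ from the code actually used for this post.

```python
from urllib.parse import urlencode

# Assumed endpoint: the English Wikipedia API.
API_URL = "https://en.wikipedia.org/w/api.php"


def recent_changes_url(limit=50):
    """Build a query URL for the most recent changes (no request is sent here)."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|user|comment|sizes|flags",
        "rclimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def is_set(change, flag):
    """A flag counts as set if it is present as "" (older JSON format) or True."""
    value = change.get(flag, False)
    return value == "" or value is True


def labels(change):
    """Multilabel target in the order [bot, minor, new]."""
    return [int(is_set(change, name)) for name in ("bot", "minor", "new")]


def features(change):
    """Raw features: signed length of the change plus three strings."""
    return {
        "length": change.get("newlen", 0) - change.get("oldlen", 0),
        "user": change.get("user", ""),
        "title": change.get("title", ""),
        "comment": change.get("comment", ""),
    }
```

Keeping the parsing separate from the network call means the label and feature logic can be tried on a hand-written record before listening to the live stream.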

Here I will listen to this stream of changes, extract four features (length of change, comment string, username, and article title), and train three logistic regression classifiers (one for each class) to predict the likelihood of a change belonging to each one of them.
The training is done with the stochastic gradient ascent method.
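To make the training step concrete, here is a minimal sketch of one stochastic gradient ascent update for three independent logistic regression classifiers. The learning rate `alpha`, the dictionary-based weight storage, and the function names are assumptions of mine, not taken from the Kaggle code.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, x):
    """P(label = 1 | x) for one classifier; w and x map feature index -> value."""
    return sigmoid(sum(w.get(j, 0.0) * v for j, v in x.items()))


def sga_step(weights, x, y, alpha=0.1):
    """One gradient ascent step on the log-likelihood for every class.

    weights: list of per-class weight dicts, x: feature dict,
    y: list of 0/1 labels (one per class), alpha: assumed learning rate.
    """
    for k, w in enumerate(weights):
        error = y[k] - predict(w, x)  # gradient of the log-likelihood w.r.t. z
        for j, v in x.items():
            w[j] = w.get(j, 0.0) + alpha * error * v


# Usage: three classes (bot, minor, new), one observation with bias x_0 = 1.
weights = [{}, {}, {}]
x = {0: 1.0, 1: 0.5}
sga_step(weights, x, y=[1, 0, 1])
```

Because each observation only touches the weights of the features it contains, a single update stays cheap even when the feature space is very large.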

One caveat: I am a complete novice when it comes to most of this, so please take everything that follows with a grain of salt. On the same note, I would be forever grateful for any feedback, especially of the critical kind, so that I can learn and improve.

By convention, the bias of the model is encoded with feature $x_0 = 1$ for all observations;
the only part of the $w_0 x_0$ term that changes during training is the weight $w_0$.
The length of the article change is tracked with numerical feature $x_1$, which equals the change in character count
(so $x_1$ is positive for text addition and negative for text removal).

As in the Kaggle code that our code is mostly based upon, string features are one-hot encoded using the
hashing trick:

The string features extracted for each observed article change are the username, a parsed version of the comment, and the title of the article.
Since this is an online learning problem there is no way of knowing how many unique usernames, comment strings, and article titles
are going to be observed.

With the hashing trick we decide ab initio that D_sparse-many unique values across these three features are sufficient to care about:
Our one-hot encoded feature space has dimension D_sparse and can be represented as a D_sparse-dimensional vector filled
with 0's and 1's (feature not present / present respectively).

The hash in hashing trick comes from the fact that we use a hash function to convert strings to integers.
Suppose now that we chose D_sparse = 3 and our hash function produces
hash("georg") = 0, hash("georgwalther") = 2, and hash("walther") = 3 for three observed usernames.

For username georg we get feature vector $[1, 0, 0]$ and for username georgwalther we get $[0, 0, 1]$.
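The hash function maps username walther outside our 3-dimensional feature space. To close this loop we use not only
the hash function but also the modulus: the feature index is hash(value) mod D_sparse, so walther lands at index 3 mod 3 = 0 (congruence modulo D_sparse is an equivalence relation, so all hash values with the same remainder share one index).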
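The wrap-around can be sketched in a few lines. The lookup table `TOY_HASH` simply hard-codes the three hash values from the example above; `stable_index` is a hypothetical alternative based on MD5, which I use here because Python's built-in `hash` for strings is randomized between interpreter runs by default.

```python
import hashlib

D_SPARSE = 3

# Toy hash values copied from the example in the text (not a real hash function).
TOY_HASH = {"georg": 0, "georgwalther": 2, "walther": 3}


def one_hot(hash_value, dim):
    """One-hot vector with a 1 at position hash_value mod dim."""
    vec = [0] * dim
    vec[hash_value % dim] = 1
    return vec


def stable_index(value, dim):
    """Hypothetical reproducible index: MD5 digest of the string, taken mod dim."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % dim


for name in ("georg", "georgwalther", "walther"):
    print(name, one_hot(TOY_HASH[name], D_SPARSE))
```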

This illustrates one downside of using the hashing trick, since we will now map usernames georg and walther to the same feature vector $[1, 0, 0]$.
We are therefore best advised to choose a big D_sparse to avoid mapping different feature values to the same one-hot-encoded feature - but probably not too big, to conserve memory.

For each article change observation we only map three string features into this D_sparse-dimensional one-hot-encoded feature space - out of D_sparse-many vector elements there will only ever be three ones among (D_sparse-3) zeros (if we do not map to the same vector index multiple times).
We will therefore use sparse encoding for these feature vectors (hence the sparse in D_sparse).
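A sparse encoding can be as simple as a list of the active indices, as in the sketch below. Prefixing each value with its feature name before hashing (so that a username and an article title with the same text get different keys) and the MD5-based index are assumptions on my part.

```python
import hashlib


def hashed_index(name, value, dim):
    """Index of one string feature; the 'name=value' key format is an assumption."""
    key = f"{name}={value}".encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % dim


def sparse_encode(strings, dim):
    """Return the sorted indices of the ones instead of a dim-long 0/1 vector."""
    return sorted({hashed_index(n, v, dim) for n, v in strings.items()})


def sparse_dot(weights, indices):
    """w . x for a one-hot-encoded x: just add up the weights at the active indices."""
    return sum(weights[i] for i in indices)


D_SPARSE = 2 ** 20
change = {"user": "georg", "title": "Python", "comment": "typo"}
active = sparse_encode(change, D_SPARSE)
print(active)  # at most three indices, however large D_SPARSE is
```

With this representation, both prediction and the weight update touch at most three sparse weights per observation, independent of D_sparse.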

As we can see, we crunched through 106,401 article changes during our ten-minute online training.

It would be fairly hard to understand the link between the D_sparse-dimensional one-hot-encoded feature space and
the observed / predicted classes.
However, we can still look at the influence that the length of the article change has on our classification problem:

print(w[0])
print(w[1])
print(w[2])

Here we can see that the weight of the length of change for class 0 (bot) is -1.12, for class 1 (minor) is -0.97, and for class 2 (new) is 2.11.

Intuitively this makes sense since many added characters (big positive change) should make classification as a minor change
less likely and classification as a new article more likely:
For an observed positive character count change $C$, $2.11 C$ will place us further to the right, and $-0.97 C$ further to the left along the $x$-axis of the sigmoid function:
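The effect can be checked numerically with the two weights quoted above. This sketch looks at the length term in isolation: it ignores the bias and all hashed string features, and the values of `C` are arbitrary illustrative numbers, since the scale of the length feature used in training is not stated here.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


W_MINOR = -0.97  # weight of the length feature for class 1 (minor)
W_NEW = 2.11     # weight of the length feature for class 2 (new)

# Illustrative positive character count changes (arbitrary scale, assumed).
for C in (0.0, 0.5, 1.0, 2.0):
    p_new = sigmoid(W_NEW * C)
    p_minor = sigmoid(W_MINOR * C)
    print(f"C={C}: new={p_new:.3f}, minor={p_minor:.3f}")
```

As `C` grows, the value for new moves above 0.5 towards 1, while the value for minor moves below 0.5 towards 0.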