Authorship identification using machine learning

The problem

The internet is great! We have all this anonymity and the freedom that it grants us. However, some people abuse this freedom. They use the anonymity to get away with spreading fear and terror.

There are also some less depressing uses of this freedom, such as using throwaway accounts on Reddit to tell embarrassing stories for others’ amusement without any real-world repercussions. The philosophical debate surrounding privacy vs. security is not one I will go into here.

One such case of a throwaway used for amusement is /u/36055512, who told stories of a former workplace filled with ex-cons and shady deals. He is a great storyteller. Two years ago he got to the end of all his stories and just stopped posting. This is not uncommon for throwaways, as their owners do not want the next set of stories they tell to give away who they are.

But I wanted more. More great stories making you laugh and cry at all his escapades. So I decided to try and find him.

Throwaway accounts are made not to have any traceable information in them, so how could one track a user? By the one thing that remains the same between all the accounts: the writer.

Authorship identification (also called authorship attribution) has been used to find works that Shakespeare collaborated on. It is a fascinating field that tracks writing style, which is generally assumed to be unique to every writer.

The Plan

Now I could study for weeks on end and become an expert in the field, but ain’t nobody got time for that. So I went for the next best thing, teaching the computer to be an expert in it.

Machine learning (at least the parts useful to this problem) generally consists of finding a function. In our case, a function that can tell us if two Reddit accounts belong to the same writer. Or, more mathematically, a function F of the type F(A, B) -> Boolean, where A and B are two accounts on Reddit.

So how do we define this function that we need? We use data, specifically input-output pairs. A potential function is applied to all the inputs, and if its outputs match those in our data then we have good reason to believe that this function will do what we need it to do.
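As a rough sketch of that idea, the snippet below scores two candidate functions against a handful of input-output pairs. The data and both candidates are made up purely for illustration (they are not from my actual dataset), and a real candidate would be a trained model rather than a one-line rule.

```python
def accuracy(candidate, examples):
    """Fraction of labelled pairs on which the candidate gives the expected output."""
    correct = sum(1 for (a, b), expected in examples if candidate(a, b) == expected)
    return correct / len(examples)


# Hypothetical toy data: ((text_a, text_b), same_writer)
EXAMPLES = [
    (("ain't nobody got time", "ain't nobody got time"), True),
    (("lo and behold", "lo and behold"), True),
    (("ain't nobody got time", "lo and behold"), False),
    (("lo and behold", "nay, needed"), False),
]


def always_different(a, b):
    """A poor candidate: claims no two texts share a writer."""
    return False


def exact_match(a, b):
    """A candidate that happens to fit this toy data perfectly."""
    return a == b


print(accuracy(always_different, EXAMPLES))  # 0.5
print(accuracy(exact_match, EXAMPLES))       # 1.0
```

The candidate whose outputs agree with the data most often is the one we keep.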

Getting Data

So I needed data to be able to find my function. I used the Reddit API to grab comments from 2000 accounts that happened to be on the front page at the time. I made the assumption that no two accounts were written by the same person. Each account’s comments are converted into a sequence of vectors. These sequences of vectors are then given as input to our function.
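One common way to turn text into a sequence of vectors is one-hot encoding each character. The sketch below assumes that scheme, an alphabet of Python's printable ASCII characters, and a cap of 500 characters per account. The function name and alphabet are illustrative choices, not necessarily what the real code does.

```python
import string

# Assumed alphabet: the 100 printable ASCII characters.
ALPHABET = string.printable
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
MAX_CHARS = 500  # assumed cap on characters kept per account


def encode_comments(comments, max_chars=MAX_CHARS):
    """Join an account's comments and one-hot encode the first max_chars characters."""
    text = " ".join(comments)[:max_chars]
    vectors = []
    for ch in text:
        vec = [0] * len(ALPHABET)
        idx = CHAR_INDEX.get(ch)
        if idx is not None:  # characters outside the alphabet stay all-zero
            vec[idx] = 1
        vectors.append(vec)
    return vectors


sequence = encode_comments(["hi", "there"])
print(len(sequence), len(sequence[0]))  # 8 100
```

Each character becomes a vector with a single 1 in it, so an account ends up as a list of at most 500 such vectors.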

Building the function

So giving two different accounts to our function F should give us false, and giving the same account twice to F should give us true.
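A minimal sketch of how such labelled pairs could be assembled is shown below. The `build_pairs` helper, the toy account names, and the choice to sample as many negative pairs as positive ones are all assumptions made for illustration.

```python
import itertools
import random


def build_pairs(accounts, seed=0):
    """Return ((text_a, text_b), same_writer) examples from a dict of account -> text."""
    rng = random.Random(seed)
    names = sorted(accounts)
    # Same account twice: the expected output is True.
    positives = [((accounts[n], accounts[n]), True) for n in names]
    # Two different accounts: the expected output is False.
    all_negatives = [
        ((accounts[a], accounts[b]), False)
        for a, b in itertools.combinations(names, 2)
    ]
    # Assumption: keep the two classes balanced by sampling the negatives.
    negatives = rng.sample(all_negatives, min(len(positives), len(all_negatives)))
    pairs = positives + negatives
    rng.shuffle(pairs)
    return pairs


# Hypothetical accounts, not real Reddit users.
toy_accounts = {"alice": "text one", "bob": "text two", "carol": "text three"}
pairs = build_pairs(toy_accounts)
print(len(pairs))  # 6
```

Without some balancing, different-account pairs would vastly outnumber same-account pairs, and a function that always answers false would look deceptively accurate.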
Neural networks are described as function approximators and suit our problem perfectly. As inputs we want to give the neural network the comments from two accounts and have it output a boolean (encoded as a 0 or 1). Comments can be seen as a sequence of characters, and recurrent neural networks (RNNs) do well on sequences. We want to process both users’ comments in the same way, so we reuse the same RNN for both accounts. The RNN’s outputs must then be compared and a boolean given as output.
This mirrors the process an expert might follow: first identifying the key characteristics of each piece of text and then comparing them to see if they are similar. Something along the lines of:

Diagram of Neural Network Architecture

Due to a limitation in Keras (the library used to build the neural network), only the first 500 characters from the comments of each account were used.

To see the final data gathering and neural network training code, go to GitHub.
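To make the shared-RNN idea concrete, here is a toy forward pass in plain Python. It is not the Keras model: the weights are random and untrained, the sizes are tiny, and every name in it is made up for this sketch. The one thing it is meant to show is that the same `encode` function (and so the same weights) processes both accounts before a single output unit compares the two results.

```python
import math
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # assumed toy alphabet
HIDDEN = 8                                # assumed size of the RNN state

rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Untrained weights; training would adjust these to fit the labelled pairs.
W_IN = random_matrix(HIDDEN, len(ALPHABET))   # one-hot character -> state
W_REC = random_matrix(HIDDEN, HIDDEN)         # previous state -> state
W_OUT = [rng.uniform(-0.5, 0.5) for _ in range(2 * HIDDEN)]


def encode(text):
    """Run the shared RNN over the text and return its final state."""
    state = [0.0] * HIDDEN
    for ch in text.lower():
        idx = ALPHABET.find(ch)
        if idx < 0:
            continue  # skip characters outside the toy alphabet
        # A one-hot input times W_IN simply selects column idx of W_IN.
        state = [
            math.tanh(W_IN[i][idx] + sum(W_REC[i][j] * state[j] for j in range(HIDDEN)))
            for i in range(HIDDEN)
        ]
    return state


def same_author_probability(text_a, text_b):
    """Compare the two encodings with one sigmoid output unit."""
    features = encode(text_a) + encode(text_b)  # concatenate both states
    score = sum(w * x for w, x in zip(W_OUT, features))
    return 1.0 / (1.0 + math.exp(-score))


p = same_author_probability("lo and behold", "nay needed")
print(0.0 < p < 1.0)  # True
```

Rounding the probability at 0.5 would give the 0 or 1 that stands in for the boolean.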

Training

At first training did not go well: the neural network did not do better than random. I feared that there was no correlation between my inputs and outputs, making finding a function impossible. Then some wise words from Andrew Ng came to me: I should try a bigger network architecture. So I added a hidden layer to my neural network just before the output layer. And lo and behold, the training accuracy went up, all the way to 80%, and it performed similarly on the validation set. I was excited. If this was a test, my AI would have gotten a distinction.

The search

So now that I had this AI, I started thinking about applying it. I knew that my target (/u/36055512) liked the stories of a user called /u/Bytewave. Scanning all the accounts that ever left a comment on Bytewave’s posts seemed like a good approach. This ended up being 6075 users, fewer than I expected. With 80% accuracy, roughly 20% of these users, or 1215 of them, would be falsely identified as my target. And suddenly 80% is not that great.

So I wanted, nay needed, a better accuracy. Increasing the network size worked for me last time, so let’s try increasing it some more. So I added another hidden layer behind the one I already added, and increased the size of my RNN for good measure. After another 2 days of retraining, my accuracy was once again at 80%. How can that be?
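The back-of-the-envelope arithmetic can be written out as below. It assumes the 20% error rate applies evenly to every scanned account and that essentially all of them are not the target. The function name is just for this sketch.

```python
def expected_false_positives(candidates, accuracy):
    """Accounts expected to be wrongly flagged if errors are spread evenly."""
    return round(candidates * (1 - accuracy))


print(expected_false_positives(6075, 0.80))  # 1215
```

Even if the real target is among the flagged accounts, he would be hidden among more than a thousand others.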

I thought long and hard about it, probably longer than I should have. And then it dawned on me: The information was simply not there. I chose 500 characters quite arbitrarily and it turns out that 500 characters is not enough to uniquely identify someone. So more information about a user is needed.

I decided to apply my neural network even though I expected to identify far too many people for it to really be useful. And I found 2317 people that could be my target. This is quite the discrepancy from the 1215 that I was expecting, so why was performance so terrible? The performance on the validation set was good, so what could it be? My best hypothesis is that the people who comment on Bytewave’s stories all write similarly; they are, after all, from the same social sphere. So technically there was a drift between my training set and the production set. For some more information on putting machine learning in production, check out this post.

The Hunt Continues

So for the future I want to train on a larger set of users (to better capture the variety of writing styles) and use a larger sample of comments from each user (to have more data to identify someone with).