Authorship Attribution with Python

Recently I’ve been reading a great book called Building Machine Learning Systems with Python. The book has two authors: Willi Richert and Luis Pedro Coelho. As is often the case for books with multiple authors, the individual chapters have a different literary feel to them. The following meta-idea occurred to me:

Can the tools and techniques from the book be used to identify who wrote each chapter?

Authorship Attribution

A person’s writing style is an example of a behavioral biometric. The words people use and the way they structure their sentences are distinctive, and can often be used to identify the author of a particular work. This is a widely studied problem, with hundreds of academic papers on the subject.

There are two high-level ways to attack the chapter attribution problem:

Supervised learning: One approach would be to gather ground truth from external sources. For example, find works for each author from other publications, blogs, etc. These samples would be used to learn a model for each author’s writing style. Determining who wrote each chapter would be a binary classification problem.

Unsupervised learning: A second approach is unsupervised, meaning that the analysis is conducted without ground truth. In this method, the chapters are analysed to find two subsets, each of which appears to have been written by a single person.

On this page I will consider the unsupervised problem. There are three steps:

Preparing and loading the data

Feature extraction: We will experiment with a few different feature sets. Even though the focus is on the unsupervised problem, the feature extraction code can also be used for supervised learning.

Classification: We will use clustering to find natural groupings in the data. Since we have several feature sets, we will use ensemble learning: learn multiple models, each built using different features, that vote to determine who wrote each chapter.

Getting started

Firstly, you will need to have the following Python libraries installed: NumPy, SciPy, scikit-learn, and NLTK. Secondly, you will need the raw text of a book (you can use any book with two or more authors). Convert it to text (e.g. using PDFMiner), and remove anything that isn’t body text (e.g. chapter and section headings, tables, code snippets, etc.). Finally, divide the book into chapter files named chapter01.txt, chapter02.txt, etc. Run the following code to import the libraries and load the text:
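A minimal sketch of the loading step, assuming all chapter files sit in a single folder. The function name `load_chapters` and the folder name in the usage note are hypothetical and not taken from the original code.

```python
from pathlib import Path


def load_chapters(folder):
    """Return the text of every chapterNN.txt file in `folder`, in chapter order."""
    # Zero-padded names (chapter01.txt, chapter02.txt, ...) sort correctly as strings.
    paths = sorted(Path(folder).glob("chapter*.txt"))
    return [p.read_text(encoding="utf-8") for p in paths]
```

With the files in a folder called, say, book, calling load_chapters("book") gives a list with one string per chapter, which is the shape the later feature extraction code expects for `chapters`.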

Feature Extraction

There are dozens of possible features for authorship attribution that have been proposed in the literature. Good features for this problem (1) are able to capture the distinctive aspects of someone’s writing style, and (2) are consistent even when the author is writing on different subjects. We will experiment with a few different approaches.

Lexical and punctuation features

Lexical features:

The average number of words per sentence

Sentence length variation

Lexical diversity, which is a measure of the richness of the author’s vocabulary
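The three lexical features can be sketched as follows. This is an illustration under stated assumptions, not the original code: sentences and words are split with simple regular expressions so the example runs on the standard library alone (NLTK's tokenizers would be the more robust choice), and lexical diversity is taken to be the ratio of unique words to total words. The name `lexical_features` is hypothetical.

```python
import re
import statistics


def lexical_features(text):
    """Return (mean words per sentence, std of sentence length, lexical diversity)."""
    # Assumption: a sentence ends with ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a word is a run of letters or apostrophes, lower-cased.
    words_per_sentence = [len(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    words = re.findall(r"[a-z']+", text.lower())
    mean_len = statistics.mean(words_per_sentence)
    std_len = statistics.pstdev(words_per_sentence)
    # Assumption: lexical diversity = unique words / total words.
    diversity = len(set(words)) / len(words)
    return mean_len, std_len, diversity
```

Applying the function to each chapter gives one three-element feature vector per chapter. Because the three values live on different scales, some form of normalization is probably needed before clustering.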

Bag of Words features

Our second feature set is Bag of Words, which represents the frequencies of different words in each chapter. This feature vector is commonly used for text classification. However, unlike text classification, we need to use topic-independent keywords (aka “function words”), since each author is writing on a variety of subjects. Our vocabulary will be the most common words across all chapters (e.g. words like ‘a’, ‘is’, ‘the’, etc.). The idea is that the authors use these common words in a distinctive, but consistent, manner.

In the following code, we use NLTK to find the most common words in the book, and scikit-learn to create the feature vectors for each chapter:
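A rough sketch of the idea, with one substitution stated plainly: to keep the example free of external dependencies it uses `collections.Counter` for both steps, where the text uses NLTK and scikit-learn (`nltk.FreqDist` and `CountVectorizer` with a fixed `vocabulary` play the same roles). The names `tokenize` and `bag_of_words` and the default vocabulary size are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case word tokens (assumption: letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(chapters, vocab_size=10):
    """Return (vocabulary, per-chapter relative frequency of each vocabulary word)."""
    chapter_tokens = [tokenize(ch) for ch in chapters]
    # Vocabulary: the most common words across the whole book.
    totals = Counter(tok for toks in chapter_tokens for tok in toks)
    vocab = [word for word, _ in totals.most_common(vocab_size)]
    vectors = []
    for toks in chapter_tokens:
        counts = Counter(toks)
        # Normalise by chapter length so long chapters do not dominate.
        vectors.append([counts[w] / len(toks) for w in vocab])
    return vocab, vectors
```

Each row of `vectors` is the feature vector for one chapter, with one column per vocabulary word.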

Syntactic features

For our final feature set, we extract syntactic features of the text. Part-of-speech (POS) tagging assigns each token to a lexical category (e.g. noun). NLTK has a function for POS labeling, and our feature vector is composed of frequencies for the most common POS tags:

import nltk
import numpy as np

# `chapters` is the list of chapter texts loaded earlier (one string per chapter)

# get part of speech for each token in each chapter
def token_to_pos(ch):
    tokens = nltk.word_tokenize(ch)
    return [p[1] for p in nltk.pos_tag(tokens)]

chapters_pos = [token_to_pos(ch) for ch in chapters]

# count frequencies for common POS types
pos_list = ['NN', 'NNP', 'DT', 'IN', 'JJ', 'NNS']
fvs_syntax = np.array([[ch.count(pos) for pos in pos_list]
                       for ch in chapters_pos]).astype(np.float64)

# normalise by dividing each row by number of tokens in the chapter
fvs_syntax /= np.c_[np.array([len(ch) for ch in chapters_pos])]

Classification

Our goal in the modeling stage is to find two groups, or “clusters”, in the feature space, with each group being the chapters written by an author. To find the clusters we use scikit-learn’s implementation of k-means with k=2:
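A minimal sketch of the clustering step, assuming each feature set is a 2-D array with one row per chapter. The function name `cluster_chapters` and the choices of `n_init` and `random_state` are illustrative assumptions and may differ from the original code.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_chapters(feature_vectors, seed=0):
    """Assign each chapter (row) to one of two clusters; returns an array of 0/1 labels."""
    X = np.asarray(feature_vectors, dtype=np.float64)
    # n_init restarts k-means from several random initialisations and keeps the best run.
    km = KMeans(n_clusters=2, n_init=10, random_state=seed)
    return km.fit_predict(X)
```

The labels 0 and 1 are arbitrary: k-means only says which chapters belong together, not which author each cluster corresponds to. Fixing `random_state` makes a single run repeatable, but it does not make the clusters themselves any better separated.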

Results and Conclusions

I will make the assumption that Luis Pedro Coelho wrote chapter 10, as it is on computer vision (Luis is the author of a popular computer vision library called mahotas, which I use quite a bit in other projects). Using this fixed data point, we can assign a name to each cluster, and subsequently an author to each chapter. Here are the results for each feature set:

Chapter        01  02  03  04  05  06  07  08  09  10  11  12
Lexical        WR  LC  LC  WR  LC  WR  LC  WR  WR  LC  WR  LC
Punctuation    LC  LC  LC  WR  WR  LC  LC  LC  WR  LC  WR  LC
Bag of Words   WR  LC  WR  LC  WR  WR  LC  LC  WR  LC  WR  LC
Syntactic      WR  LC  WR  LC  WR  WR  LC  LC  WR  LC  WR  WR

(WR = Willi Richert, LC = Luis Pedro Coelho)

After counting up the votes, two chapters are a tie (“Clustering – Finding Related Posts” and “Topic Modeling”). Here are the chapters with a majority win:

Willi Richert:

Getting Started with Python Machine Learning

Classification – Detecting Poor Answers

Classification II – Sentiment Analysis

Classification III – Music Genre Classification

Dimensionality Reduction

Luis Pedro Coelho:

Learning How to Classify with Real-world Examples

Regression – Recommendations

Regression – Recommendations Improved

Computer Vision – Pattern Recognition

Big(ger) Data
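The tally can be reproduced from the per-feature-set results. The sketch below copies the assignments from the results table and takes a simple majority per chapter. The helper name `majority_vote` is hypothetical, and the original code may count the votes differently.

```python
from collections import Counter

# Cluster assignments copied from the results table, chapters 01 to 12.
votes = {
    "Lexical":      "WR LC LC WR LC WR LC WR WR LC WR LC".split(),
    "Punctuation":  "LC LC LC WR WR LC LC LC WR LC WR LC".split(),
    "Bag of Words": "WR LC WR LC WR WR LC LC WR LC WR LC".split(),
    "Syntactic":    "WR LC WR LC WR WR LC LC WR LC WR WR".split(),
}


def majority_vote(votes):
    """Return one winner per chapter, or "tie" when the top two counts are equal."""
    winners = []
    for chapter_votes in zip(*votes.values()):
        ranked = Counter(chapter_votes).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            winners.append("tie")
        else:
            winners.append(ranked[0][0])
    return winners


winners = majority_vote(votes)
# Chapters 03 and 04 come out as ties; every other chapter has a 3-1 or 4-0 majority.
```

With four voters a 2-2 split is always possible, so an odd number of feature sets would rule out ties at the cost of adding or dropping one.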

How confident am I in these results? Not very. Overall, the problem was much harder than I anticipated:

Selecting good features for unsupervised learning is difficult – it is like feeling your way around in the dark. It is likely that several of the features I’ve used are not informative.

The chapters are unlikely to be “pure”. The authors may have collaborated on some sections, read over and modified each other’s work, and the whole book was probably sterilized by copy editors. All of these add noise to the data.

The results are not stable. For example, if I make a minor change to the code (e.g. change the normalization method), or even run k-means again (which has randomness in its initialization), the clusters change. This indicates that clusters are not well separated in the feature space. In fact, it is this instability that motivated me to use ensemble learning: as long as some of the models are performing better than chance, the hope is that the results of voting will be consistent.

Code can be found here. Now I will try to contact the authors, so stay tuned!

Update

I’ve heard back from Willi Richert, and all of the guesses were correct! He also had some good ideas on how to improve the classifier. I’ll give one hint: maybe some sections of a chapter are more distinctive than others? He gave me the answer for the chapters that tied, but I’ll keep them secret and pass the challenge on to you. Can you break the tie? (discuss at: http://twotoreal.com)