A literature professor has developed software using Google's
PageRank algorithm that has identified Jane Austen and Walter Scott as
the most influential authors of the 1800s.

Matthew Jockers of the University of Nebraska analysed 3,592
digitised novels published in the UK, Ireland and the US between 1780 and 1900
using a combination of Google's algorithm, machine learning and a series of techniques used in
computational text analysis including stylometry, corpus
linguistics and network analysis.

After ensuring the authors were split roughly evenly by gender,
Jockers used his software to extract thematic data, including
the frequency of specific words or groups of words.
Network software was then used to categorise and rank this data --
Jockers began with a network consisting of 12,902,464 rows and
three columns, with a source book allotted to the first column, a
target book to the second and the third used to calculate the distance
between the two (i.e. how many similarities they share according to
the thematic data). After narrowing the data down to 6,447,640 rows,
the information was imported into the network analysis software
Gephi, and PageRank was used to help identify those novels which had
the most links to future tomes, as well as the strongest links to
those tomes.
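
As a rough illustration of that pipeline (not Jockers's actual code, which is not reproduced here), the Python sketch below builds a source/target/distance table for a made-up four-book corpus, keeps only the rows where the source predates the target, and ranks the books with a hand-written PageRank loop rather than Gephi. The book names, years, theme frequencies, the Euclidean distance, the 1/(1 + distance) link weight and the 0.85 damping factor are all assumptions chosen for the example. One detail does line up with the figures above: 3,592 squared is 12,902,464, which is consistent with one row per ordered pair of books.

```python
from itertools import product
from math import sqrt

# Hypothetical toy corpus: title -> (publication year, relative theme frequencies).
BOOKS = {
    "early_a": (1811, [0.60, 0.30, 0.10]),
    "early_b": (1814, [0.10, 0.30, 0.60]),
    "later_c": (1850, [0.50, 0.40, 0.10]),
    "later_d": (1880, [0.55, 0.30, 0.15]),
}


def distance(u, v):
    """Euclidean distance between two frequency vectors (an assumed metric)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def edge_table(books):
    """One (source, target, distance) row per ordered pair: n * n rows for n books."""
    return [
        (s, t, distance(books[s][1], books[t][1]))
        for s, t in product(books, repeat=2)
    ]


def forward_rows(rows, books):
    """Keep only rows whose source was published before its target."""
    return [(s, t, d) for s, t, d in rows if books[s][0] < books[t][0]]


def pagerank(nodes, links, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration; links maps (from, to) -> weight."""
    n = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for (src, _), weight in links.items():
        out_weight[src] += weight
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is shared equally by all nodes.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0.0)
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for (src, dst), weight in links.items():
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank


rows = edge_table(BOOKS)
kept = forward_rows(rows, BOOKS)
# Assumption: each later book "links" back to the earlier books it resembles,
# and a smaller thematic distance means a stronger link.
links = {(t, s): 1.0 / (1.0 + d) for s, t, d in kept}
ranks = pagerank(list(BOOKS), links)

for title in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{title}: {ranks[title]:.3f}")
```

In this toy corpus the earliest book collects the highest score, because every later book links back to it and the closest matches pass on the largest share of their own rank. That mirrors the article's point that books with many strong links to later novels rise to the top, and it also shows why authors at the start of the timeframe are favoured.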

Network analysis allowed Jockers to visualise the thematic
distances between each novel. "Networks are constructed out of
nodes (books) and edges (distances). When plotted, nodes with less
similarity (i.e. with larger distances between them) will spread
out further in the network," he explains in a paper detailing his methodology. By generating different
visual models of the network it was possible for Jockers to witness
the ebb and flow of certain popular themes evolving over the
century. The above photo depicts the links between the network's
nodes according to author gender.

"This visualisation reveals that works by female authors
(coloured light gray) and male authors (black) are more
stylistically and thematically homogeneous within their respective
gender classes," writes Jockers. "As a result of this similarity in
'signals', female-authored books cluster together on the south side
of the main network, while male-authored books are drawn together
in the north."

The visualisation also revealed a few interesting anomalies, such as
the fact that Harriet Beecher Stowe's 1852 Uncle Tom's Cabin shares
more similarities with novels written by male authors than with those
written by female authors.

Ultimately, Austen and Scott both came out on top, with Jockers
referring to them as "the literary equivalent of Homo erectus or,
if you prefer, Adam and Eve". However, since both writers were
active towards the beginning of Jockers's chosen timeframe, it was
impossible to get a good view of who influenced them. Widening the
timeframe would provide more details as to the source of the two
writers' appeal.

Although Jockers admits there's plenty more work to be done,
including widening the criteria used to pinpoint the most
influential texts, applying his method could lead to some
interesting finds -- perhaps a single, hidden 15th-century text is
actually responsible for the bodies of romanticised fiction produced
by Austen and Scott.

Quoting Mark Twain, Jockers points out that influence is an
integral part of literary history: "All ideas are second hand,
consciously and unconsciously drawn from a million outside
sources", before acknowledging T. S. Eliot's more optimistic, less
plagiaristic, opinion of creativity: "The historical sense compels
a man to write not merely with his own generation in his bones, but
with a feeling that the whole of the literature … has a
simultaneous existence".

It's indisputable that literary influence is an important part
of the written word's evolution. Without Edgar Allan Poe's stream of
consciousness in the gothic short story The Tell-Tale Heart, Virginia
Woolf might never have written Mrs Dalloway, and William Faulkner's
The Sound and the Fury could have remained the author's own private,
embattled internal monologue.
Being able to categorise and rank the links between these different
genres through the decades and centuries will give linguistics
and literature professors a new way of comparing texts,
their history and their relative importance. We tend to define
texts as being of great importance according to popular opinion at
the time of, and immediately after, publication, and according to
contemporary subjective readings of the content. Jockers is boldly
challenging this, and suggesting an entirely novel route that could
prove just as worthy.

Image: Elijah Meeks

Edited by Olivia Solon

Comments

I'd like to know why that image is credited to a stock photo site, considering I created it.

Elijah Meeks

Aug 20th 2012

In reply to Elijah Meeks

Apologies Elijah, we changed the image at the last minute. We will credit correctly now.

Liat Clark

Aug 20th 2012

I must say that is one strange image, but I like it :) I am going to post it on my site.