The Origins of the Text Analysis Project at the ITPS

12/3/2018

by Gary Berton

When I arrived in 2011 to form the Institute for Thomas Paine Studies at Iona College, the first stop I made was the Computer Science Department. My son worked at a tech company that developed software based on Support Vector Machines, a technique that can separate writing features and help scholars distinguish the authorship of texts. For over a century, a major theme of Paine studies has been the dispute over the authorship of works often attributed to Paine, and scholars have held many conflicting opinions. I hoped to use a modern computer science approach to help sort out these debates. Initially I was interested in the Declaration of Independence, the Junius Letters, and pseudonymous works by Paine that scholars had not previously recognized. I also wanted to correct the historical record of works erroneously attributed to Paine. No progress could be made in Paine studies unless we could sort these issues out with some certainty.

Dr. Robert Schiaffino was Chair of the Iona College Department of Computer Science at the time, and he put me in touch with Dr. Smiljana Petrovic and later Dr. Lubomir Ivanov. Dr. Petrovic spearheaded the development of a methodology that combines some seventeen individual stylistic features accepted in the growing science of authorship attribution. The methodology weights each feature according to its effectiveness and calculates the results with four separate but simultaneous approaches. The resulting software package can identify the author of an unknown document from a group of possible candidates. The features include the use of function words, prepositions, n-grams and n-words, and root words, all habits of which an author is unaware when writing. In other words, these features reveal an author’s stylistic fingerprint.
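
To make the idea of a stylistic fingerprint concrete, here is a minimal Python sketch. It is not the ITPS software: it uses only one of the feature types mentioned above (function words), a single distance measure rather than four approaches, and a tiny word list chosen purely for illustration. The names `FUNCTION_WORDS`, `profile`, `distance`, and `attribute` are all hypothetical.

```python
from collections import Counter
import math
import re

# Hypothetical, very short function-word list; a real study would use many more.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "which", "but", "by", "upon"]


def profile(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def distance(p, q, weights=None):
    """Weighted Euclidean distance between two frequency profiles."""
    weights = weights or [1.0] * len(p)
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, p, q)))


def attribute(unknown_text, author_files, weights=None):
    """Return the candidate whose combined writings sit closest to the unknown text.

    author_files maps an author name to a list of that author's texts.
    """
    target = profile(unknown_text)
    scores = {
        author: distance(target, profile(" ".join(texts)), weights)
        for author, texts in author_files.items()
    }
    return min(scores, key=scores.get), scores
```

Calling `attribute(text, author_files)` returns the nearest candidate along with every candidate's distance, so a close call between two authors stays visible. The optional `weights` argument stands in, very loosely, for the idea of weighting features by effectiveness.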

The fun started when we had to create author files: a file of works for each author that establishes a baseline of how that author used these features. In effect, we took each author’s fingerprint. We employed a leave-one-out method, removing an individual work from the author file and testing it against the rest of the author’s oeuvre. Right away we discovered that certain works typically attributed to a writer were not that writer’s at all. For example, I was floored when I saw that “African Slavery in America,” a 1774 abolitionist piece commonly attributed to Paine, was not his after all. And so the creation of each individual author file was a challenge that took years.
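
The leave-one-out check described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the project's actual procedure: it compares function-word frequencies only, uses plain Euclidean distance, and the names `FUNCTION_WORDS`, `profile`, `nearest_author`, `leave_one_out`, and `suspicious` are hypothetical.

```python
from collections import Counter
import math
import re

# Hypothetical, very short function-word list used only for illustration.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "which", "but", "by", "upon"]


def profile(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def nearest_author(text, author_files):
    """Return the author whose combined works are closest to the given text."""
    target = profile(text)
    return min(
        author_files,
        key=lambda a: math.dist(target, profile(" ".join(author_files[a]))),
    )


def leave_one_out(author_files):
    """Hold out each work in turn and attribute it using everything else.

    Returns a list of (listed_author, work_index, predicted_author) tuples.
    """
    results = []
    for author, works in author_files.items():
        for i, held_out in enumerate(works):
            rest = {a: list(w) for a, w in author_files.items()}
            rest[author] = works[:i] + works[i + 1:]
            if not rest[author]:
                continue  # nothing left to compare against for this author
            results.append((author, i, nearest_author(held_out, rest)))
    return results


def suspicious(results):
    """Keep only the works that were attributed to someone other than the listed author."""
    return [(a, i, p) for a, i, p in results if p != a]
```

A work that lands in the `suspicious` list is only a candidate for closer reading. The sketch flags a mismatch, but it cannot by itself settle who wrote the piece.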

I started by testing the validity of the works contained in collections of Paine’s writings, and I am just now completing that project. I quickly discovered that many works commonly attributed to Paine were not his. After testing, I examined them closely and found language at odds with Paine’s philosophy or known major works. For example, if a piece that did not pass the software analysis also had a phrase praising Jesus, I could exclude it from his canon, since Paine was not a Christian and never employed biblical language except to refute it. I developed a method of testing with three components: passing the computer test (Calculation), conforming to Paine’s politics and philosophy (Content), and Paine being physically able to publish the piece at the time and place it appeared (Context). These “3 C’s” are the basis for judging the authorship of a piece of writing. In my next post I’ll explore how to use the results of testing to determine authorship, and how we overcame problems with this process as they emerged.