Text Analysis Project at the ITPS, Part II – The Test Results

12/6/2018

By Gary Berton

The process of testing documents for authorship is based on the author files discussed in my previous post: collections of the works of individual authors from which we derived their stylistic fingerprints. As our text analysis methodology advanced, I noticed that as the number of author files increased, so did the number of different trait packages available for testing stylistic fingerprints, and that this affected the results.

The combination of authors used in a test influences the outcome, which is logical because all the computer does is find the closest fit between the document and the works of the authors presented to it. When the combination of authors changes, the results change: slightly when the true author is present, more dramatically when the author is not. So it was necessary to account for this. For example, even if an author matches over 40% of the sixty-eight markers in a particular test (seventeen features with four classifiers each), this does not guarantee that the work can be attributed to that author.
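To make the marker arithmetic concrete, here is a minimal sketch, not the project's actual software: it assumes each of the seventeen features is scored by each of the four classifiers, yielding sixty-eight markers, and that each marker simply names the author whose file it most closely matches. All names and numbers in the example are illustrative.

```python
# Hypothetical illustration of the 68-marker scheme: 17 stylistic features,
# each scored by 4 classifiers, gives 17 * 4 = 68 markers per test.
from collections import Counter

N_FEATURES, N_CLASSIFIERS = 17, 4
N_MARKERS = N_FEATURES * N_CLASSIFIERS  # 68

def match_percentages(marker_winners):
    """marker_winners: a list of 68 author names, one per marker,
    naming the author that marker most closely matched.
    Returns each author's share of the markers, as a percentage."""
    counts = Counter(marker_winners)
    return {author: 100 * n / N_MARKERS for author, n in counts.items()}

# Illustrative data: one author wins 30 of the 68 markers (about 44%),
# which clears the 40% mark but, as noted above, is not by itself proof.
winners = ["Paine"] * 30 + ["Rittenhouse"] * 20 + ["Jefferson"] * 18
shares = match_percentages(winners)
```

The point of the sketch is only that a "40% match" means the leading author's share of all sixty-eight markers, not a score from any single classifier.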

We came to realize that a 40% threshold is operative: after thousands of tests, especially those where the author is known and we are simply confirming results, a 40% marker-matching threshold for the leading candidate is typically decisive (though, as noted above, not always). A result below 40%, with more than one author close to but not exceeding that mark, is inconclusive, indicating that the author of the document may not be present in that particular test. A single author achieving 40% with no other individual close to that mark is a strong indication that the author has been found. However, we still had to account for the way results shift with the combination of authors. The solution is a rigorous system of different packages of authors, testing the document against each package in turn. There will often be one or two inconclusive results in a particular test series, but if, across twenty or so packages, at least three-quarters of the results exceed 40% for the same author, then that author is the likely candidate.
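The package-voting rule described above can be sketched in a few lines. This is an illustrative reconstruction, not the project's software: the 5-point margin standing in for "no other author close to that mark" is my assumption, as are all names and numbers in the example.

```python
# Sketch of the decision rule: a single package test is conclusive for an
# author if that author exceeds 40% of markers and no rival comes close;
# a document is attributed when at least three-quarters of the package
# tests are conclusive for the same author.
THRESHOLD = 40.0   # percent of markers the leader must exceed
MARGIN = 5.0       # assumed gap a rival must stay below ("no other close")

def conclusive_author(shares):
    """shares: {author: percent of markers matched} for one package test.
    Returns the winning author, or None if the test is inconclusive."""
    ranked = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
    leader, top = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if top > THRESHOLD and runner_up < top - MARGIN:
        return leader
    return None

def attribute(package_results, agreement=0.75):
    """package_results: one share dict per author package (twenty or so).
    Returns the attributed author, or None if no author is conclusive
    in at least three-quarters of the packages."""
    wins = [conclusive_author(s) for s in package_results]
    best = max(set(wins) - {None}, key=wins.count, default=None)
    if best is not None and wins.count(best) >= agreement * len(package_results):
        return best
    return None

# Illustrative series: 16 of 20 packages conclusive for the same author,
# 4 inconclusive, which clears the three-quarters bar.
series = [{"Paine": 45.0, "Jefferson": 20.0}] * 16 \
       + [{"Paine": 38.0, "Jefferson": 36.0}] * 4
```

A usage note: with this sketch, `attribute(series)` succeeds because sixteen of twenty packages (80%) are conclusive for one author, while the four sub-threshold, closely contested packages are simply discounted as inconclusive rather than counted against him.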

As always, Content and Context remain equally important, and in rare cases they trump the Calculation portion of the computer software package (these “3 C’s” are discussed in my previous post). There are, for example, instances when an author passes the Calculation, and even the Content test, but was not in the same country at the time the piece was written. This occurred just after the publication of Thomas Paine’s Rights of Man, as American commentators debated the text, often mimicking it and copying whole sentences without quotation marks or acknowledgment of Paine. These writings test positive for Paine’s authorship, but there was no possible way he could have participated in these debates in American newspapers while in France.

On the other hand, there are instances when Content and Context are strong, and even the handwriting matches, but the Calculation rejects the authorship. For example, Paine has typically been credited with writing the short piece “Plans for Equipping the Army” because it was attached to a letter in his handwriting. But our software failed to show any signs of Paine’s stylistic features in twenty-two tests, with David Rittenhouse winning almost all of them. Our conclusion: Paine worked closely with Rittenhouse, lived nearby, and associated with him continuously during that period, so it is likely that Paine copied Rittenhouse’s notes on the subject and sent them to a government official for consideration.

A comprehensive and multi-disciplinary approach to author attribution is essential. Historical, philosophical, literary, and forensic tools are all used to make decisions on the likelihood of authorship. In my next post, I’ll discuss applying the Text Analysis Project’s tools to testing in languages other than English.