New state-of-the-art software is being released in various domains, much of which can help in stylometric analysis. I have finally decided to bite the bullet and move from Matlab to R, the open source statistical software.

The best permutation and nonparametric combination test software is now on R: http://caughey.mit.edu/software

This allows you to compare samples against a baseline without worrying about whether your data follows a normal distribution, or whether you have more variables than samples, and so on. Devin Caughey has written some very nice papers on this, and his software is now available for R.

With the release of the Stylo R package, I have well and truly moved over to R: https://sites.google.com/site/computationalstylistics/home

This is a superb stylometry package with some of the latest developments in stylometric analysis, such as Burrows' Delta, the Bootstrap Consensus Tree, Rolling Delta and so on. These guys know their stuff and have written a great program.

Two more pieces of software complete the analysis puzzle. The first is the state-of-the-art Stanford Parser from the Stanford NLP Group: https://nlp.stanford.edu/software/lex-parser.shtml

And with the advent of syntactic ngrams from Google and others, Dr. Grigori Sidorov has an interesting site with some great ideas along these lines, software to produce them, and some great papers he has written. He has done interesting work on syntactic ngrams, which he calls sngrams. His site, with the software in Python: http://www.cic.ipn.mx/~sidorov/

Also worth mentioning is the authorship software Toccata by Richard Forsyth, along with his other programs. I bought Beagle from him in the eighties and still have fond memories of it. All his new software is in Python: https://www.richardsandesforsyth.net/software.html

That's a round-up of the software, so let's put it together slowly.

The Problem:

A 374-word ransom note was found at the scene of the murder, or accidental homicide, of JonBenet Ramsey.
The FBI, the police and lead investigator James Kolar agree the note was part of the "staging" of the crime scene. A staged ransom note tries to portray what it is not. The writer was aware that the handwriting would be extensively analysed afterwards. This alone means that handwriting analysis (physically comparing writing) would be useless in a court of law, because a lot of effort would have been made to fake and randomise the appearance of the note, and it could never be "beyond reasonable doubt."

Linguistic Analysis:

Linguistic analysis is an option, and it has progressed in leaps and bounds over the last few years (Koppel, Eder, Rybicki, Hoover et al). It has long been known that people tend to write with their own "style", and function words, for example "at", "by", "be", "but" and "can", provide linguistic fingerprints: people are unaware of these tiny words, and they are not context sensitive, which makes them good markers in many cases. By themselves they are not enough, however, and so the search is on for more markers and more software to separate the signal from the noise.

WritePrint, which is embedded in JStylo (earlier post), creates about 800 different variables and used to be considered the gold standard. Another clever method, used with success in a stylometry competition by the team of Koppel, Akiva and Dagan, uses "unstable" words as markers: http://onlinelibrary.wiley.com/doi/10.1002/asi.20428/abstract

The JonBenet Ramsey Ransom Note:

Looking at the JonBenet ransom note, using content words would fail. Pronouns probably need to be ignored, and content words cannot be used, because all ransom notes bear similarities along these lines. One ransom note would be linked to another if you used the word frequencies of "you", "money" and "die", for example.
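As a toy illustration of a function-word profile, here is a minimal sketch in Python. The word list is a tiny sample chosen for demonstration; real studies use standard lists of a hundred or more function words.

```python
from collections import Counter
import re

# A small illustrative set of English function words; a real analysis
# would use a standard list of 100+ such words.
FUNCTION_WORDS = {"at", "by", "be", "but", "can", "the", "a", "and", "of", "to", "in"}

def function_word_profile(text):
    """Relative frequency of each function word in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    return {w: counts[w] / total for w in FUNCTION_WORDS}

profile = function_word_profile("But you can be sure the money is at the drop by noon.")
print(round(profile["the"], 3))  # 2 of 13 tokens -> 0.154
```

Profiles like this, one per text, are what the distance measures and classifiers below actually compare.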
Since the JonBenet note is staged or faked (she was dead when the note was written, and the note purported to be from a "faction"), it is likely that the writing contains red herrings intended to attribute it to a radical group. Any spelling mistakes, hyphens, strange letter formations and so on would be obvious and probably useless as markers, because the writer knew the note would be analysed; keeping in mind the dynamics of staging, you would expect conscious errors and red herrings.

What we need to do is look for unconscious style markers and text structure: things that are written out of habit. Just as the handwriting experts noted that the last part of the note was the most fluid, it is likely that the last part also has the most unconscious markers, the writer concentrating on staging the note at the beginning, with habit taking over as the writing becomes more "free flowing" towards the end.

It is also likely that if the crime was covered up by the parents after the son accidentally hit JonBenet on the head with a torch in a fit of rage for snatching some pineapple from him during a midnight snack, as per the CBS show (which seems to line up the evidence as the most likely scenario), then both parents were involved to some extent, one dictating some text or ideas, the other writing. People write differently from how they talk, and use different parts of the brain to process written and verbal language, so one of the parents would dominate the unconscious writing style unless the letter was being quoted verbatim (unlikely).

Parts-Of-Speech Analysis:

The idea is to take away the words, leaving the lexical structure of the ransom note.
This is easily done with the Stanford Parser, and also the Stanford Tagger, both in Java; I have also used the MontyLingua tagger, written in Python. What a POS tagger does is replace words with parts-of-speech lexical categories such as verbs, nouns, pronouns, determiners and so on. The most widely used tag set is the Penn Treebank, which has 36 tags:
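To show what "replacing words with tags" means, here is a toy lookup-based tagger. It is for illustration only, with a hand-made mini-lexicon; real work uses a trained model such as the Stanford Tagger, which also disambiguates words that can take several tags.

```python
# Toy lexicon mapping words to Penn Treebank POS tags:
# DT = determiner, NN = singular noun, NNS = plural noun,
# IN = preposition, JJ = adjective, VBD = past-tense verb.
TOY_LEXICON = {
    "the": "DT", "boy": "NN", "with": "IN", "brown": "JJ",
    "eyes": "NNS", "ate": "VBD", "cake": "NN",
}

def tag(sentence):
    """Replace each word with its Penn Treebank POS tag (defaulting to NN)."""
    return [TOY_LEXICON.get(w.lower(), "NN") for w in sentence.split()]

print(tag("The boy with the brown eyes ate the cake"))
# ['DT', 'NN', 'IN', 'DT', 'JJ', 'NNS', 'VBD', 'DT', 'NN']
```

The tag sequence, with the words thrown away, is the raw material for the frequency counts and clustering below.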

Using NO words, only parts of speech, the POS structure of one of Patsy's notes is similar to the ransom note, while the other ransom notes get binned together as similar, and the two Christmas notes get put together too. Using a clustering algorithm, where the closest, most similar samples are clumped together, this dendrogram is produced from the twenty most frequent POS tags:

This lumps Patsy with the ransom note, her other notes cluster with John's, and the real kidnapping notes from Wiles and Mackle sit on the outskirts of Patsy and the JonBenet ransom note.

Now, asking the software to classify who wrote the note, or more accurately who is the closest match, using one of the best classifiers with a proven track record in authorship work, the SVM, Patsy is determined to be the author. Using one of the most recent and powerful measures of distance, i.e. closeness of match, the Burrows Delta, which is included in the package along with modifications such as Eder's Delta and Argamon's Delta, the output is again Patsy as the author.

Is there a way to get more linguistic structure out of the writing, i.e. more information than POS tags can give us? Yes there is. This brings us to:

Syntactic Ngrams

Part 1 - Parsing Text To Create A Dependency Tree:

Recall that POS tags (above) give us lexical structure: a word is replaced with a verb or noun tag. But this tells us nothing about the syntactic dependency tree structure: what the subject and object of the sentence are, which word is at the head (root) of the tree, and so on. We are now going to extract syntactic information. This is very different from POS tags. http://demo.ark.cs.cmu.edu/parse/about.html

What we extract with syntactic parsing is the tree structure of a sentence: which word is the object, which word is dependent on another, building a tree structure that is non-linear. This means the words in a sentence are not listed by the parser in the order they are written, but in the order assessed to be syntactically correct according to a dependency tree. The critical take-away point is that syntactic structure is NON-LINEAR, meaning the order of the sentence from the parser is different from how it was written.
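Before diving into the parser, the Burrows Delta distance used for the attribution above is simple enough to sketch. This is a toy pure-Python version: Stylo's implementation adds options like feature culling, and the Eder and Argamon variants change the weighting.

```python
import statistics

def burrows_delta(profile_a, profile_b, corpus_profiles):
    """Burrows' Delta: mean absolute difference of z-scored feature
    frequencies between two texts. Each profile maps a feature (e.g. a
    frequent word or POS tag) to its relative frequency; the corpus
    profiles supply the per-feature mean and standard deviation."""
    features = sorted(profile_a)
    total = 0.0
    for f in features:
        values = [p[f] for p in corpus_profiles]
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values) or 1.0  # guard against zero spread
        total += abs((profile_a[f] - mu) / sigma - (profile_b[f] - mu) / sigma)
    return total / len(features)

# The candidate whose known texts give the smallest Delta to the
# disputed text is the closest stylistic match.
```

Classification by Delta is then just "pick the candidate with the smallest distance", which is why a stable, well-behaved distance matters so much.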
The state-of-the-art Stanford Parser has an accuracy of about 97% and reveals the syntactic structure of text without words, as a first step!

An example of the parser output for the sentence:

The boy with the brown eyes ate the cake.

det(boy-2, The-1)
nsubj(ate-7, boy-2)
case(eyes-6, with-3)
det(eyes-6, the-4)
amod(eyes-6, brown-5)
nmod(boy-2, eyes-6)
root(ROOT-0, ate-7)
det(cake-9, the-8)
dobj(ate-7, cake-9)

Root is at the top of the tree; nmod marks a noun modifier, and brown at -5 (the 5th word) is dependent on eyes at -6. There are around 50 tags from the dependency parser, such as determiners, noun subjects and so on.

Onwards now to:

Part 2 - Ngrams

Ngrams have been used for a long time and are one of the most reliable indicators of authorship (Sidorov 2014). Ngrams can be characters or words. You can think of an ngram as a sliding window. Using the above sentence again, which comes from a Google PowerPoint presentation about their ngrams:

The boy with the brown eyes ate the cake.
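Going back to the parser output for a moment: a short sketch (assuming Stanford-style triples like those above) shows how the tree order differs from the written order. This makes the non-linearity point concrete.

```python
import re

# Stanford-style dependency triples for
# "The boy with the brown eyes ate the cake."
TRIPLES = """\
det(boy-2, The-1)
nsubj(ate-7, boy-2)
case(eyes-6, with-3)
det(eyes-6, the-4)
amod(eyes-6, brown-5)
nmod(boy-2, eyes-6)
root(ROOT-0, ate-7)
det(cake-9, the-8)
dobj(ate-7, cake-9)"""

def parse_triples(text):
    """Turn 'rel(head-i, dep-j)' lines into (relation, head, dependent) tuples."""
    pattern = re.compile(r"(\w+)\(([^-]+)-\d+, ([^-]+)-\d+\)")
    return [m.groups() for m in map(pattern.match, text.splitlines())]

def tree_order(triples):
    """Depth-first walk from ROOT: the order words come out of the tree,
    which differs from their linear order in the sentence."""
    children = {}
    for rel, head, dep in triples:
        children.setdefault(head, []).append(dep)
    order = []
    def walk(node):
        order.append(node)
        for child in children.get(node, []):
            walk(child)
    walk("ROOT")
    return order[1:]  # drop the artificial ROOT node

print(tree_order(parse_triples(TRIPLES)))
# ['ate', 'boy', 'The', 'eyes', 'with', 'the', 'brown', 'cake', 'the']
```

The verb comes out first because it is the root of the tree, even though it is the 7th word as written: that reordering is exactly what "non-linear" means here.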
A bigram, or 2-unit ngram, is a 2-word sliding window: "The boy", "boy with", "with the", "the brown" and so on. A trigram is a 3-unit window (words in our example) and goes like this: "The boy with", "boy with the", "with the brown", "the brown eyes", "brown eyes ate" and so on. Two- to five-unit ngrams are the most useful in authorship work (Sidorov).

Part 3 - Syntactic Ngrams

The final piece of this puzzle is the syntactic ngram. Google has used ngrams to index several million books and 320 billion ngrams with its Ngram Viewer: https://books.google.com/ngrams

This is a simplistic interface, though, and can only be used for frequencies; more sophisticated analysis is possible by downloading the Google ngram data.

Notice a problem in the last trigram string above: "brown eyes ate". This is obviously misleading and won't help with the text analysis of that sentence, i.e. the subject is missing. You never get this output when you use syntactic ngrams, so they are far more powerful, contain more information and are more relevant to the text being analysed! And once again, the beauty of syntactic ngrams is that they are non-linear: they contain structure information in a different order, according to the parser tree. As mentioned, this example is from a Google presentation explaining the purpose of their Ngram Viewer.

But there is more power in these little guys yet! Thanks to Dr. Grigori Sidorov, we can produce mixed syntactic ngrams, which he calls sngrams: you can mix the syntactic tags from the parser with POS tags (above), or with words or lemmas. You now have mixed sngrams, or sngrams with relations, which he calls snrgrams.
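The plain, linear sliding window described above takes only a couple of lines, which is worth seeing before contrasting it with the syntactic variety:

```python
def ngrams(tokens, n):
    """Slide a window of size n across the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "The boy with the brown eyes ate the cake".split()
print(ngrams(words, 2)[:3])  # [('The', 'boy'), ('boy', 'with'), ('with', 'the')]
print(ngrams(words, 3)[-1])  # ('ate', 'the', 'cake')
```

Note that the window only ever sees adjacent words, which is precisely why it produces misleading strings like "brown eyes ate".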

The take-away point is that text goes into the Stanford Parser, the output from that goes into the sngram software, and the output from that is sngrams, or snrgrams if you mixed them, of various sizes, i.e. bigrams, trigrams and so on. Long story short: these snrgrams have been shown to be the most powerful use of ngrams in various applications! http://www.g-sidorov.org/Sidorov2014_IJCLA.pdf

The JonBenet ransom note is coded as a 2-unit snrgram (bigram) with syntactic tags and POS tags. We are using the power of syntactic tags and syntactic POS tags, containing more linguistic structure information than ever.
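As a rough sketch of the idea (not Sidorov's actual software; the edges come from the example parse above and the POS tags are hand-assigned Penn Treebank codes), a 2-unit snrgram can be read off each head-to-dependent edge, mixing the relation tag with the POS tags:

```python
# POS tags for "The boy with the brown eyes ate the cake"
# (hand-assigned here for illustration, not real tagger output).
POS = {"The": "DT", "boy": "NN", "with": "IN", "the": "DT", "brown": "JJ",
       "eyes": "NNS", "ate": "VBD", "cake": "NN"}

# (relation, head, dependent) edges from the dependency parse above.
EDGES = [
    ("det", "boy", "The"), ("nsubj", "ate", "boy"), ("case", "eyes", "with"),
    ("det", "eyes", "the"), ("amod", "eyes", "brown"), ("nmod", "boy", "eyes"),
    ("det", "cake", "the"), ("dobj", "ate", "cake"),
]

def snrgrams(edges, pos):
    """2-unit syntactic ngrams over POS tags, keeping the relation tag."""
    return [f"{pos[head]} {rel} {pos[dep]}" for rel, head, dep in edges]

print(snrgrams(EDGES, POS)[:3])
# ['NN det DT', 'VBD nsubj NN', 'NNS case IN']
```

Each snrgram follows a tree edge rather than adjacent words, so "VBD nsubj NN" (verb with its noun subject) can never degenerate into something like "brown eyes ate". The frequencies of these units are then fed to the classifier exactly as word ngrams would be.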

In other words, this was the most stable output of any analysis I have done over a whole range of settings, showing that the sngrams capture the relevant syntactic information despite the lack of words!

As a final note, I should mention that I also used the sngrams as input into JStylo, the authorship attribution software from Drexel University, with the same Enron corpus etc. from my earlier post, and just like the results above, the sngrams increased the likelihood of Patsy being the author.

Let me know if you have any questions. A project I have in the pipeline is to use sngrams for lie detection in written statements. Coming soon!

2 comments

Great article. It's a shame that this type of AI-based analysis has not been established in forensics. If done properly, the result could be as reliable as that of DNA analysis (if not more so, since we are certain of the source of our data).

It would be interesting to get some p-values into the results in order to assess their statistical significance (the probability that the conclusions were not reached by chance). If you get significant p-values, you should publish your work in an academic journal.

Hi Asimos, thanks for your comments. Regarding p-values, I am using NPC TEST, nonparametric combination software (http://static.gest.unipd.it/~salmaso/NPC_TEST.htm), and also the R script version by Devin Caughey of MIT. It makes no assumptions about the distributions and allows analysis of small samples. I haven't used it for this analysis, but I am gearing up for major new posts. I haven't updated for a long time, as I've been getting up to speed with various statistical procedures and graphics. I will have all-new material shortly. Thanks for your support and suggestions; I take on board what you say about p-values. Cheers.

Deception Detection

Analysing verbal and non-verbal deception in politics, crime statements and anywhere else it occurs, using word analysis, statistical software, and proven non-verbal cues and tells to indicate when the truth is stretched or outright falsified.