Overview

My goal was to build a visualization that shows the breakdown of the parts of speech within a user-provided passage of text, specifically a classic novel. This midterm covers the first step, showing the total counts of the parts of speech in a novel, in a project that I'd like to continue building out over the rest of the semester. It was built for a general audience as a first look at how complex a classic novel is when broken down into its words. The question I ask with this project is: does the increased use of certain parts of speech correlate with the readability of classic literature?

Background

Schools teach certain novels based on their thematic relevance, but what interests me is the complexity of these novels from a grammar standpoint. The first noted study of classic literature readability was in 1880 by Lucius Adelno Sherman, who noticed that the average sentence length in novels had fallen from 50 words per sentence in the Pre-Elizabethan era to 23 words per sentence in his time (DuBay, 2007). Although the studies in the article looked at vocabulary, I am interested in whether the usage of parts of speech correlates with readability.

A common measure of readability is the Flesch-Kincaid test, which generates a score estimating the readability of a text from its word, sentence, and syllable counts. Syllables are an interesting factor here that doesn't appear in any of the other tests from the DuBay reading, but I couldn't determine whether there is a way to correlate syllables to parts of speech.
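For reference, the two standard Flesch formulas can be sketched in a few lines of JavaScript (the function names here are mine, not part of this project's code):

```javascript
// Flesch Reading Ease: higher scores mean easier text.
function fleschReadingEase(words, sentences, syllables) {
  return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words);
}

// Flesch-Kincaid Grade Level: an approximate US school grade.
function fleschKincaidGrade(words, sentences, syllables) {
  return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59;
}
```

Both formulas depend only on the ratios words-per-sentence and syllables-per-word, which is why syllable counting matters so much to the score.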

Creating the Visualization: Goal and Interaction Model

The goals of the visualization were:

Count the number of times the parts of speech occurred

Compare the times a part of speech occurred to others

Show links between parts of speech, indicating which pairs are strongly connected based on the probability that one part of speech follows another in the text

The last goal was a stretch goal and is actually being covered later in the semester by Dan Shiffman's Programming A to Z class, which looks at text analysis methods. I'm combining this midterm with the requirements of that class in order to produce a visualization in which the user provides the data. Here is the interaction model:

User enters the title of the novel they want to analyze

User enters the text of the novel to analyze, either by copy-pasting it into a text field or by uploading a text file (.txt)

The visualization is generated

The user hovers over each bubble to get more information about the data.
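Before the visualization is generated, the title and file input can be checked with a small helper. This is a hypothetical sketch (the function name and messages are mine, not the project's actual code):

```javascript
// Sketch: validate the title and uploaded filename before generating the
// visualization. Returns an error message string, or null if input is valid.
function validateInput(title, filename) {
  if (!title || !title.trim()) return "Please enter a novel title.";
  if (!filename || !filename.toLowerCase().endsWith(".txt")) {
    return "Please upload a plain-text (.txt) file.";
  }
  return null;
}
```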

Creating the Visualization: Data

I came across Project Gutenberg as I was searching for literary APIs - the site provides free ebooks of many classic novels with very few limitations for its use as the copyrights for many of the books have expired. As each novel is well over a thousand words, they make a very good data set. I can also compare multiple works by the same author to get data about that author.

- Get the user input
  - Check that the user has put in a valid file type
  - Check that the user has entered a title and a file
- Generate the data
  - Split the text file into an array of strings (remove punctuation and tokenize)
  - Loop through each word to find its part of speech
  - Store the count of each part of speech in a dictionary and push the key to an array
  - Find the general type of part of speech (adj, adv, noun, verb) for each word
- Create the visualization
  - Pass in the arrays of data
  - Generate the bubbles and draw the SVG element
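The generate-the-data steps above can be sketched as follows. Here `tagWord` is a placeholder I'm supplying for a real POS tagger (e.g. a library such as RiTa), not this project's actual tagging code:

```javascript
// Sketch of the tokenize-and-count steps; `tagWord` stands in for a tagger.
function tokenize(text) {
  // Remove punctuation (keeping apostrophes) and split on whitespace.
  return text
    .toLowerCase()
    .replace(/[^a-z'\s]/g, " ")
    .split(/\s+/)
    .filter(Boolean);
}

function countPartsOfSpeech(text, tagWord) {
  const counts = {}; // part of speech -> occurrence count
  const keys = [];   // insertion-ordered list of parts of speech seen
  for (const word of tokenize(text)) {
    const pos = tagWord(word);
    if (!(pos in counts)) {
      counts[pos] = 0;
      keys.push(pos);
    }
    counts[pos] += 1;
  }
  return { counts, keys };
}
```

The `counts` dictionary and `keys` array are exactly the two structures the pipeline then passes into the bubble-drawing step.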

Data Collected - A Sample Case Study

Since Pride and Prejudice is my favorite classic, I decided to use a small sample of Jane Austen's work to see what data I could generate and whether there are any patterns in her writing that could lead to understanding its complexity. First I used Shiffman's Flesch Index Calculator to generate the following table.

She uses quite a lot of prepositions or subordinating conjunctions (after, although, provided that) and singular/mass nouns, which may add syllables per word and so raise the Flesch-Kincaid grade level. However, each novel as a data set is massive in terms of word count, so that may be the primary driving factor in the complexity.

Creating the Visualization: Next Steps

I built this visualization mostly for myself (and also partially because my previous concepts were unrealizable dreams), but I think it would be interesting to people who would like to analyze text complexity or how novels are built (kind of an engineering take on literature).

I would like to increase the complexity of the graph by using a pack hierarchy to group together related parts of speech. I collected the simple part of speech breakdown groups but wasn't fully able to map the different parts of speech correctly to each of these groups with the dictionaries that I built.
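One way that mapping could work, assuming the tagger emits Penn Treebank-style tags (NN, NNS, VBD, JJR, and so on), is a prefix check, with the result nested into the `{name, children}` shape that d3's `d3.hierarchy()` / `d3.pack()` expect. This is a sketch of one possible approach, not the dictionaries I built:

```javascript
// Sketch: collapse Penn Treebank-style tags into the four general groups.
function generalGroup(tag) {
  if (tag.startsWith("NN")) return "noun";
  if (tag.startsWith("VB")) return "verb";
  if (tag.startsWith("JJ")) return "adjective";
  if (tag.startsWith("RB")) return "adverb";
  return "other";
}

// Nest per-tag counts into the shape d3.hierarchy()/d3.pack() consume:
// { name: "root", children: [{ name: "noun", children: [...] }, ...] }
function toHierarchy(counts) {
  const groups = {};
  for (const tag in counts) {
    const g = generalGroup(tag);
    (groups[g] = groups[g] || []).push({ name: tag, value: counts[tag] });
  }
  return {
    name: "root",
    children: Object.keys(groups).map(g => ({ name: g, children: groups[g] })),
  };
}
```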

The next step I wanted to accomplish was to build a force-directed graph showing which parts of speech most commonly appear together, with a link created if one part of speech appears before or after another. This seems to be related to Markov chains, which I discussed with Dan Shiffman briefly and will be covering later in the semester. I hope to build on this project when I cover that material.
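As a preview of that step, the link weights could come from a simple bigram count over the tag sequence, normalized per tag, which is the transition matrix of a first-order Markov chain. A minimal sketch (the function name is mine):

```javascript
// Sketch: estimate P(next part of speech | current part of speech)
// from an ordered sequence of tags, e.g. ["DT", "NN", "VB", ...].
function transitionProbabilities(tags) {
  const counts = {};
  for (let i = 0; i < tags.length - 1; i++) {
    const a = tags[i], b = tags[i + 1];
    counts[a] = counts[a] || {};
    counts[a][b] = (counts[a][b] || 0) + 1;
  }
  // Normalize each row so the outgoing probabilities sum to 1.
  const probs = {};
  for (const a in counts) {
    const total = Object.values(counts[a]).reduce((s, n) => s + n, 0);
    probs[a] = {};
    for (const b in counts[a]) probs[a][b] = counts[a][b] / total;
  }
  return probs;
}
```

Each nonzero entry `probs[a][b]` would become a weighted link between the `a` and `b` nodes in the force-directed graph.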