Spooky Author Identification: EDA

The content of this blog is based on some exploratory data analysis performed on the corpora provided for the “Spooky Author Identification” challenge at Kaggle [1]. The corpora include excerpts/ sentences from some of the scariest writers of all time.

The Spooky Challenge

A Halloween-based challenge [1] with the following goal: predict which author wrote a sentence of a possible spooky story, choosing between Edgar Allan Poe, HP Lovecraft and Mary Wollstonecraft Shelley.

The Beginning of the Journey: the Spooky Data

We are given a CSV file, train.csv, containing some information about the authors. The information consists of a set of sentences written by the different authors (EAP, HPL, MWS). Each entry (line) in the file is an observation providing the following information:

an id, a unique id for the excerpt/ sentence (as a string)

the text, the excerpt/ sentence (as a string),

the author, the author of the excerpt/ sentence (as a string)

a categorical feature that can assume three possible values

EAP for Edgar Allan Poe,

HPL for HP Lovecraft,

MWS for Mary Wollstonecraft Shelley

Author: Pier Lorenzo Paracchini

He is a generalist with a passion for people, data and technology. He has a Master of Science in Electronic Engineering from the Politecnico Di Milano and works as an enthusiast developer with a data scientist twist in the software innovation sector in Statoil. His journey in data science and machine learning started in 2014.


# loading the data using readr package
spooky_data <- readr::read_csv(file = "./../../../data/train.csv",
                               col_types = "ccc",
                               locale = readr::locale("en"),
                               na = c("", "NA"))

# readr::read_csv does not transform string into factor
# being the author feature categorical by nature
# it is transformed into a factor
spooky_data$author <- as.factor(spooky_data$author)

Avoid the madness!

It is forbidden to use all of the provided spooky data for finding our way through the unique spookiness of each author. We still want to evaluate how our intuition generalizes on an unseen excerpt/ sentence, right? For this reason the given training data is split in two parts (using stratified random sampling):

an actual training dataset (70% of the excerpts/ sentences), used for

exploration and insight creation, and

training the classification model

a test dataset (the remaining 30% of the excerpts/ sentences), used for

evaluation of the accuracy of our classification model.

# setting the seed for reproducibility
set.seed(19711004)
trainIndex <- caret::createDataPartition(spooky_data$author, p = 0.7, list = FALSE, times = 1)
spooky_training <- spooky_data[trainIndex, ]
spooky_testing <- spooky_data[-trainIndex, ]
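To make the idea of a stratified split more concrete, here is a minimal base R sketch that samples 70% of the rows within each author and then compares the class proportions of the two parts. It is not what caret does internally, only an illustration of the principle; the `toy` data frame and its values are made up for the example.

```r
# Toy data standing in for spooky_data (hypothetical values).
set.seed(42)
toy <- data.frame(
  id = sprintf("id%03d", 1:100),
  author = factor(rep(c("EAP", "HPL", "MWS"), times = c(40, 30, 30)))
)

# Stratified sampling: draw 70% of the row indices within each author.
rows_by_author <- split(seq_len(nrow(toy)), toy$author)
train_index <- unlist(lapply(rows_by_author, function(rows) {
  sample(rows, size = round(0.7 * length(rows)))
}), use.names = FALSE)

toy_training <- toy[train_index, ]
toy_testing <- toy[-train_index, ]

# The author proportions should be (almost) the same in both parts.
prop_training <- prop.table(table(toy_training$author))
prop_testing <- prop.table(table(toy_testing$author))
print(round(rbind(training = prop_training, testing = prop_testing), 2))
```

If the proportions differ noticeably between the two parts, the split was probably not stratified by author.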

Moving our first steps: from darkness into the light

Before we start building any model, we need to understand the data, build intuitions about the information contained in the data and identify a way to use those intuitions to build a great predictive model.

Is the provided data useable?

Question: Does each observation have an id? An excerpt/ sentence associated to it? An author?

missingValueSummary <- colSums(is.na(spooky_training))

As we can see from the table below, there are no missing values in the dataset.

| id | text | author |
| --- | --- | --- |
| 0 | 0 | 0 |

Some initial facts about the excerpts/ sentences

Below we can see, as an example, some of the observations (and their excerpts/ sentences) available in our dataset:

| id | text | author |
| --- | --- | --- |
| id26305 | This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall. | EAP |
| id22965 | A youth passed in solitude, my best years spent under your gentle and feminine fosterage, has so refined the groundwork of my character that I cannot overcome an intense distaste to the usual brutality exercised on board ship: I have never believed it to be necessary, and when I heard of a mariner equally noted for his kindliness of heart and the respect and obedience paid to him by his crew, I felt myself peculiarly fortunate in being able to secure his services. | MWS |
| id19764 | Herbert West needed fresh bodies because his life work was the reanimation of the dead. | HPL |
| id10125 | For many prodigies and signs had taken place, and far and wide, over sea and land, the black wings of the Pestilence were spread abroad. | EAP |

Question: How many excerpts/ sentences are available by author?

no_excerpts_by_author <- spooky_training %>%
  dplyr::group_by(author) %>%
  dplyr::summarise(n = n())

ggplot(data = no_excerpts_by_author, mapping = aes(x = author, y = n, fill = author)) +
  geom_col(show.legend = FALSE) +
  ylab(label = "number of excerpts") +
  theme_dark(base_size = 10)

Question: How long (# of chars) are the excerpts/ sentences by author?

spooky_training$len <- nchar(spooky_training$text)

ggplot(data = spooky_training, mapping = aes(x = len, fill = author)) +
  geom_histogram(binwidth = 50) +
  facet_grid(. ~ author) +
  xlab("# of chars") +
  theme_dark(base_size = 10)

ggplot(data = spooky_training, mapping = aes(x = 1, y = len)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1) +
  facet_grid(. ~ author) +
  xlab(NULL) +
  ylab("# of chars") +
  theme_dark(base_size = 10)

There are some excerpts that are very long. As we can see from the boxplot above, there are a few outliers for each author; a possible explanation is that the sentence segmentation had a few hiccups (see details below).

| Author | Min (# chars) | Mean (# chars) | Median (# chars) | Max (# chars) |
| --- | --- | --- | --- | --- |
| EAP | 21 | 141.92 | 115 | 1533 |
| HPL | 21 | 157.61 | 144 | 900 |
| MWS | 21 | 150.5 | 129 | 4663 |
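A summary like the one above can be computed with a few lines of R. The sketch below uses base R only and a tiny made-up `toy` data frame in place of `spooky_training`, so the numbers it prints are not the ones in the table.

```r
# Toy stand-in for spooky_training (hypothetical excerpts).
toy <- data.frame(
  author = c("EAP", "EAP", "HPL", "HPL", "MWS"),
  text = c("Nevermore.", "The raven sat.", "Cthulhu waits.", "It sleeps.", "I live."),
  stringsAsFactors = FALSE
)
toy$len <- nchar(toy$text)

# One summary row per author: min, mean, median and max length in chars.
len_summary <- do.call(rbind, lapply(split(toy$len, toy$author), function(len) {
  data.frame(min = min(len), mean = round(mean(len), 2),
             median = median(len), max = max(len))
}))
print(len_summary)
```

With dplyr attached, the same result could be obtained with `group_by(author)` followed by `summarise()`.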

For example, Mary Wollstonecraft Shelley (MWS) has an excerpt of around 4600 characters:

“Diotima approached the fountain seated herself on a mossy mound near it and her disciples placed themselves on the grass near her Without noticing me who sat close under her she continued her discourse addressing as it happened one or other of her listeners but before I attempt to repeat her words I will describe the chief of these whom she appeared to wish principally to impress One was a woman of about years of age in the full enjoyment of the most exquisite beauty her golden hair floated in ringlets on her shoulders her hazle eyes were shaded by heavy lids and her mouth the lips apart seemed to breathe sensibility But she appeared thoughtful unhappy her cheek was pale she seemed as if accustomed to suffer and as if the lessons she now heard were the only words of wisdom to which she had ever listened The youth beside her had a far different aspect his form was emaciated nearly to a shadow his features were handsome but thin worn his eyes glistened as if animating the visage of decay his forehead was expansive but there was a doubt perplexity in his looks that seemed to say that although he had sought wisdom he had got entangled in some mysterious mazes from which he in vain endeavoured to extricate himself As Diotima spoke his colour went came with quick changes the flexible muscles of his countenance shewed every impression that his mind received he seemed one who in life had studied hard but whose feeble frame sunk beneath the weight of the mere exertion of life the spark of intelligence burned with uncommon strength within him but that of life seemed ever on the eve of fading At present I shall not describe any other of this groupe but with deep attention try to recall in my memory some of the words of Diotima they were words of fire but their path is faintly marked on my recollection It requires a just hand, said she continuing her discourse, to weigh divide the good from evil On the earth they are inextricably entangled and if you would cast away what 
there appears an evil a multitude of beneficial causes or effects cling to it mock your labour When I was on earth and have walked in a solitary country during the silence of night have beheld the multitude of stars, the soft radiance of the moon reflected on the sea, which was studded by lovely islands When I have felt the soft breeze steal across my cheek as the words of love it has soothed cherished me then my mind seemed almost to quit the body that confined it to the earth with a quick mental sense to mingle with the scene that I hardly saw I felt Then I have exclaimed, oh world how beautiful thou art Oh brightest universe behold thy worshiper spirit of beauty of sympathy which pervades all things, now lifts my soul as with wings, how have you animated the light the breezes Deep inexplicable spirit give me words to express my adoration; my mind is hurried away but with language I cannot tell how I feel thy loveliness Silence or the song of the nightingale the momentary apparition of some bird that flies quietly past all seems animated with thee more than all the deep sky studded with worlds” If the winds roared tore the sea and the dreadful lightnings seemed falling around me still love was mingled with the sacred terror I felt; the majesty of loveliness was deeply impressed on me So also I have felt when I have seen a lovely countenance or heard solemn music or the eloquence of divine wisdom flowing from the lips of one of its worshippers a lovely animal or even the graceful undulations of trees inanimate objects have excited in me the same deep feeling of love beauty; a feeling which while it made me alive eager to seek the cause animator of the scene, yet satisfied me by its very depth as if I had already found the solution to my enquires sic as if in feeling myself a part of the great whole I had found the truth secret of the universe But when retired in my cell I have studied contemplated the various motions and actions in the world the weight of evil has 
confounded me If I thought of the creation I saw an eternal chain of evil linked one to the other from the great whale who in the sea swallows destroys multitudes the smaller fish that live on him also torment him to madness to the cat whose pleasure it is to torment her prey I saw the whole creation filled with pain each creature seems to exist through the misery of another death havoc is the watchword of the animated world And Man also even in Athens the most civilized spot on the earth what a multitude of mean passions envy, malice a restless desire to depreciate all that was great and good did I see And in the dominions of the great being I saw man reduced?”

Thinking Point: “What do we want to do with those excerpts/ outliers?”
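Whatever we decide, it helps to list the candidates first. The sketch below flags, per author, the lengths that fall beyond the boxplot whiskers (the usual 1.5 * IQR rule) using base R's `boxplot.stats()`. The `toy` lengths are invented for the example, and flagging an excerpt is not the same as deciding to drop it.

```r
# Toy excerpt lengths per author (hypothetical values).
toy <- data.frame(
  author = rep(c("EAP", "MWS"), each = 6),
  len = c(90, 110, 115, 120, 140, 1533,
          100, 120, 129, 135, 150, 4663)
)

# boxplot.stats() applies the 1.5 * IQR whisker rule; $out holds the values
# beyond the whiskers. ave() applies the check within each author and returns
# the result in the original row order (as 0/1, hence the comparison with 1).
is_outlier <- ave(toy$len, toy$author, FUN = function(len) {
  len %in% boxplot.stats(len)$out
}) == 1

print(toy[is_outlier, ])
```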

Some more facts about the excerpts/ sentences using the bag-of-words

The data is transformed into a tidy format (unigrams only) in order to use the tidy tools to perform some basic and essential NLP operations.
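The code for this step is not shown here. In a tidy workflow it is typically done with `tidytext::unnest_tokens()`, but that is an assumption on our side. As a package-free illustration of the "one token per row" layout, here is a base R sketch on a made-up `toy` data frame; the tokenizer is deliberately naive (lowercase, split on anything that is not a letter or an apostrophe).

```r
# Toy stand-in for spooky_training (hypothetical excerpts).
toy <- data.frame(
  id = c("id1", "id2"),
  author = c("EAP", "HPL"),
  text = c("The raven sat, and the raven spoke.",
           "Herbert West needed fresh bodies."),
  stringsAsFactors = FALSE
)

# Naive unigram tokenizer: lowercase, then split on non-letter characters.
tokens <- strsplit(tolower(toy$text), "[^[:alpha:]']+")

# Tidy layout: one row per (id, author, word).
tidy_words <- data.frame(
  id = rep(toy$id, lengths(tokens)),
  author = rep(toy$author, lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)
tidy_words <- tidy_words[tidy_words$word != "", ]

# Word counts by author, keeping only the words an author actually used.
word_counts <- as.data.frame(
  table(author = tidy_words$author, word = tidy_words$word),
  responseName = "n", stringsAsFactors = FALSE
)
word_counts <- word_counts[word_counts$n > 0, ]
print(word_counts[order(-word_counts$n), ])
```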

From this initial visualization we can see that the authors quite often use the same set of words, like the, and, of. These words do not give any information about the vocabulary actually used by each author; they are common words that represent just noise when working with unigrams, and they are usually called stopwords.

If the stopwords are removed, using the list of stopwords provided by the tidytext package, it is possible to see that each author does use some words more frequently than others, and that these words differ from author to author (the author vocabulary footprint).
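Removing stopwords boils down to filtering out the rows whose word appears in a stopword list. The sketch below uses a tiny hand-written list (`toy_stopwords`, hypothetical) instead of the tidytext one, so that it runs without extra packages; with tidytext available, `dplyr::anti_join(tidy_words, tidytext::stop_words, by = "word")` would be the usual equivalent.

```r
# Toy tidy data: one row per (author, word); values are hypothetical.
tidy_words <- data.frame(
  author = c("EAP", "EAP", "EAP", "EAP", "HPL", "HPL", "HPL"),
  word = c("the", "raven", "and", "raven", "the", "of", "cthulhu"),
  stringsAsFactors = FALSE
)

# A tiny, hand-written stopword list used only for this illustration.
toy_stopwords <- c("the", "and", "of")

# Keep only the rows whose word is not a stopword.
content_words <- tidy_words[!tidy_words$word %in% toy_stopwords, ]

# Word counts by author after the removal.
print(table(content_words$author, content_words$word))
```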

A comparison cloud can be used to compare the different authors. From the R documentation:

‘Let p_{i,j} be the rate at which word i occurs in document j, and p_j be the average across documents (∑_i p_{i,j}/ndocs). The size of each word is mapped to its maximum deviation ( max_i(p_{i,j}-p_j) ), and its angular position is determined by the document where that maximum occurs.’
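To see what this description means in practice, here is a small worked example in base R. It follows our reading of the quoted text (for each word, the average rate is taken across documents, and the largest deviation from that average decides the size and the "owning" document); it is a sketch of the idea, not the plotting function itself, and the `counts` matrix is made up.

```r
# Rows are words, columns are documents (here: authors); toy counts.
counts <- matrix(
  c(8, 1, 1,
    1, 6, 1,
    1, 3, 8),
  nrow = 3, byrow = TRUE,
  dimnames = list(word = c("raven", "cthulhu", "love"),
                  author = c("EAP", "HPL", "MWS"))
)

# Rate at which each word occurs in each document.
rates <- sweep(counts, 2, colSums(counts), "/")

# Average rate of each word across documents.
mean_rates <- rowMeans(rates)

# Deviation of each document from the word's average rate.
deviation <- rates - mean_rates

# Size: the maximum deviation; owner: the document where it occurs.
size <- apply(deviation, 1, max)
owner <- colnames(deviation)[apply(deviation, 1, which.max)]

print(data.frame(word = rownames(counts), size = round(size, 3), owner = owner))
```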

From the plot above we can see that for the EAP and HPL corpora we need circa 7500 words to cover 90% of the word instances, while for the MWS corpus circa 5000 words are needed.
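The coverage figure can be derived directly from the word counts: sort the words by decreasing count, accumulate the counts, and find the first position where the cumulative share reaches 90%. A minimal base R sketch, with made-up counts for a single corpus:

```r
# Toy word counts for one author (hypothetical values).
word_counts <- c(the = 50, of = 30, raven = 12, night = 4, dungeon = 2, pestilence = 2)

# Most frequent words first.
sorted_counts <- sort(word_counts, decreasing = TRUE)

# Share of all word instances covered by the top-k words, for every k.
coverage <- cumsum(sorted_counts) / sum(sorted_counts)

# Smallest number of distinct words needed to cover at least 90% of instances.
words_needed <- which(coverage >= 0.9)[1]
print(words_needed)
```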

Question: Is there any commonality between the dictionaries used by the authors?

Are the authors using the same words? A commonality cloud can be used to answer this specific question: it emphasises the similarities between authors and plots a cloud showing the words they have in common. It shows only those words that are used by all authors, with their combined frequency across authors.
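The data behind such a cloud is simple to compute: keep the words that appear for every author and add up their counts. The base R sketch below shows only that computation (not the plot), on a made-up `word_counts` data frame.

```r
# Toy word counts by author (hypothetical values).
word_counts <- data.frame(
  author = c("EAP", "EAP", "EAP", "HPL", "HPL", "MWS", "MWS", "MWS"),
  word = c("night", "raven", "heart", "night", "cthulhu", "night", "heart", "love"),
  n = c(5, 8, 3, 4, 6, 2, 7, 9),
  stringsAsFactors = FALSE
)

# Words used by each author, then the intersection across all authors.
words_by_author <- split(word_counts$word, word_counts$author)
common_words <- Reduce(intersect, words_by_author)

# Combined count across authors, restricted to the common words.
combined <- tapply(word_counts$n, word_counts$word, sum)[common_words]
print(combined)
```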

Then we need to spread the author (key) and the word frequency (value) across multiple columns (note how NAs are introduced for words not used by an author) …

word_freqs <- word_freqs %>%
  tidyr::spread(author, word_freq)

| word | EAP | HPL | MWS |
| --- | --- | --- | --- |
| à | 0.0001179 | NA | NA |
| a.d | NA | 0.0000454 | NA |
| a.m | 0.0000589 | 0.0001362 | NA |
| ab | 0.0000196 | NA | NA |
| aback | 0.0000393 | NA | NA |

Let's start by plotting the word frequencies (log scale), comparing two authors at a time, and see how the words distribute on the plane. Words that are close to the line (y = x) have similar frequencies in both sets of texts, while words that are far from the line are found more in one set of texts than in the other.

As we can see in the plots below, there are some words close to the line, but most of the words are scattered around it, showing a difference between the frequencies.

# Removing incomplete cases - not all words are common for the authors
# when spreading words to all authors - some will get NAs (if not used
# by an author)
word_freqs_EAP_vs_HPL <- word_freqs %>%
  dplyr::select(word, EAP, HPL) %>%
  dplyr::filter(!is.na(EAP) & !is.na(HPL))

ggplot(data = word_freqs_EAP_vs_HPL, mapping = aes(x = EAP, y = HPL, color = abs(EAP - HPL))) +
  geom_abline(color = "red", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  theme(legend.position = "none") +
  labs(y = "HP Lovecraft", x = "Edgar Allan Poe")

# Removing incomplete cases - not all words are common for the authors
# when spreading words to all authors - some will get NAs (if not used
# by an author)
word_freqs_EAP_vs_MWS <- word_freqs %>%
  dplyr::select(word, EAP, MWS) %>%
  dplyr::filter(!is.na(EAP) & !is.na(MWS))

ggplot(data = word_freqs_EAP_vs_MWS, mapping = aes(x = EAP, y = MWS, color = abs(EAP - MWS))) +
  geom_abline(color = "red", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  theme(legend.position = "none") +
  labs(y = "Mary Wollstonecraft Shelley", x = "Edgar Allan Poe")

# Removing incomplete cases - not all words are common for the authors
# when spreading words to all authors - some will get NAs (if not used
# by an author)
word_freqs_HPL_vs_MWS <- word_freqs %>%
  dplyr::select(word, HPL, MWS) %>%
  dplyr::filter(!is.na(HPL) & !is.na(MWS))

ggplot(data = word_freqs_HPL_vs_MWS, mapping = aes(x = HPL, y = MWS, color = abs(HPL - MWS))) +
  geom_abline(color = "red", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  theme(legend.position = "none") +
  labs(y = "Mary Wollstonecraft Shelley", x = "HP Lovecraft")

In order to quantify how similar/ different these sets of word frequencies by author are, we can calculate a correlation (Pearson, for linearity) between the sets. There is a correlation of around 0.48 to 0.5 between the different authors (see plot below).
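One way to compute such pairwise correlations is base R's `cor()` on the wide table of frequencies, using only the words that both authors of a pair actually use. Whether this matches the exact call behind the reported values is an assumption; the `word_freqs` values below are invented, so the printed correlations are not the ones quoted above.

```r
# Toy wide table of word frequencies (hypothetical values); NA means the
# author never used the word.
word_freqs <- data.frame(
  word = c("night", "heart", "raven", "love", "time"),
  EAP = c(0.004, 0.002, 0.003, NA, 0.005),
  HPL = c(0.005, 0.001, NA, NA, 0.004),
  MWS = c(0.002, 0.006, NA, 0.007, 0.005),
  stringsAsFactors = FALSE
)

# Pearson correlation for each pair of authors; for every pair, only the
# words with a frequency for both authors are used.
cor_matrix <- cor(word_freqs[, c("EAP", "HPL", "MWS")],
                  use = "pairwise.complete.obs", method = "pearson")
print(round(cor_matrix, 2))
```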
