Sara J Kerr

PhD Digital Arts and Humanities

“Curiouser and curiouser!” cried Alice (she was so much surprised, that for the moment she quite forgot how to speak good English). – Lewis Carroll, Alice’s Adventures in Wonderland

It seems rather strange to think that, just under eight months ago, I had not written any computer code (I’m not including little bits of BASIC from the ’80s), and yet lines of code or the blinking cursor of Terminal no longer instil a sense of rising panic. Although programming has a very steep learning curve, it is relatively easy to gain a basic understanding, and, with this, the confidence to experiment.

R has rapidly become my favourite programming language, so I was interested to follow a link from Scott Weingart’s blog post ‘Not Enough Perspectives Pt. 1’ to Matthew Jockers’ new R package ‘Syuzhet’. As this is an area I hope to research as part of my PhD, I decided to give it a try, using the ‘Introduction to the Syuzhet Package’ (Jockers, 2015) as a guide. I used a short text from Jane Austen’s juvenilia – ‘LETTER the FOURTH From a YOUNG LADY rather impertinent to her friend’. I removed the speech marks from the text, as these cause problems with the code.

The code:

# Experiment based on 'Introduction to the Syuzhet Package', Jockers, 20-2-2015
# http://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html

# Having installed the syuzhet package from CRAN, load it using library:
library(syuzhet)

# Input a text; for longer texts use get_text_as_string()
# This text is from Austen's juvenilia, from Project Gutenberg
example_text <- "We dined yesterday with Mr Evelyn where we were introduced to a very
agreable looking Girl his Cousin…[I haven't included the whole text I used – it can be viewed here]
This was an answer I did not expect – I was quite silenced, and never felt so awkward in my Life – -."

# Use get_sentences() to create a character vector of sentences
s_v <- get_sentences(example_text)

# Check that all is well!
class(s_v)
str(s_v)
head(s_v)

# Use get_sentiment() to assess the sentiment of each sentence. This function
# takes the character vector and one of four possible extraction methods
sentiment_vector <- get_sentiment(s_v, method = "bing")
sentiment_vector

Visualising the sentiment vector as a line graph shows the fluctuations within the text:

Visualising the emotions within the text:
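The two visualisations described above can be reproduced with a few lines of R. This is a minimal sketch, assuming the syuzhet package is installed; the short made-up text here simply stands in for the Austen letter.

```r
# Sketch of the visualisations above, assuming the syuzhet package is
# installed; a tiny made-up text stands in for the Austen letter.
library(syuzhet)

s_v <- get_sentences("I loved the morning. The storm was dreadful. All ended happily.")
sentiment_vector <- get_sentiment(s_v, method = "bing")

# Line graph showing the fluctuations in sentiment across the text
plot(sentiment_vector, type = "l",
     main = "Sentiment through the text",
     xlab = "Sentence", ylab = "Sentiment score")

# Emotion categories via the NRC lexicon, summed and shown as a bar chart
nrc_data <- get_nrc_sentiment(s_v)
barplot(colSums(nrc_data), las = 2, main = "Emotions in the text")
```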

My Thoughts

I have only started to explore this package and have applied it to a very short passage (44 sentences). While this shows what Syuzhet can do in a general way, it does not demonstrate its full capabilities. In addition, as I haven’t fully read up on the package and the thinking behind it, my analysis may well be plagued with errors.

However, these are my thoughts so far. Running a brief trial using three of the available methods highlights some of the difficulties of sentiment analysis. All three identified the same sentence as the most ‘negative’:

[1] “I dare say not Ma’am, and have no doubt but that any\nsufferings you may have experienced could arise only from the cruelties\nof Relations or the Errors of Freinds.”

However, each of the methods identified a different sentence as the most ‘positive’:

bing – [1] “Perfect Felicity is not the property of Mortals, and no one has a right\nto expect uninterrupted Happiness.”

afinn – [1] “I was extremely pleased with her\nappearance, for added to the charms of an engaging face, her manner and\nvoice had something peculiarly interesting in them.”

nrc – [1] “I recovered myself however in a few moments and\nlooking at her with all the affection I could, My dear Miss Grenville\nsaid I, you appear extremely young – and may probably stand in need of\nsome one’s advice whose regard for you, joined to superior Age, perhaps\nsuperior Judgement might authorise her to give it.”

This is something Jockers discusses further in his blog post ‘My Sentiments (Exactly?)‘, highlighting that sentiment analysis is difficult for humans, as well as machines:

This human coding business is nuanced. Some sentences are tricky. But it’s not the sarcasm or the irony or the metaphor that is tricky. The really hard sentences are the ones that are equal parts positive and negative sentiment. – Matthew Jockers

However, he also points out:

One thing I learned was that tricky sentences, such as the one above, are usually surrounded by other sentences that are less tricky. – Matthew Jockers

It seems that a combination of close and ‘distant’ reading, what Mueller calls ‘scaled reading’, is likely to be of most use if analysis at the sentence level is desired. Having only a relatively recent and limited experience of programming in R, I have found using the Syuzhet package very straightforward and am looking forward to using it again very soon.

UPDATE: 3rd April 2015

There is a great deal of academic discussion surrounding the methods discussed here. As I read further I will add another post exploring the core points and including a reading list.

Exploring texts as networks allows connections within and between texts to be visualised in a simplified manner, enabling researchers to gain insight into complex relationships, which may otherwise remain hidden.

Network analysis of texts is at the intersection of the humanities and science, and therefore has the potential to be truly interdisciplinary. This annotated bibliography covers a period of approximately ten years, and aims to highlight texts which apply network theories to texts, and the methods used to extract networks.

Drieger’s article explores the use of semantic network analysis to gain insights into texts. Building “partially” on text mining, Drieger gives a clear and detailed overview of the field. The article outlines the assumptions and theories used, and emphasises the importance of a combined quantitative and qualitative approach allowing a researcher to “visually explore unknown text sources” (15). Although Drieger presents a detailed description and evaluation of the method, the specific tools utilised have been omitted.

This article outlines a method for the extraction of social networks from 19th Century novels and serials. The authors’ method relies upon the extraction of conversations from a corpus of 60 novels and serials, spanning the work of 31 authors. The aim is to test the “validity of some core theories about social interaction” (139), specifically those of Bakhtin, Moretti and Eagleton. Elson et al. argue that their method disproves these theories; however, they do not fully investigate the limitations of their chosen method.

This article explores the use of statistical measures from corpus linguistics in order to visualise a domain via its collocational network. Using the British National Corpus as a reference text, the authors statistically compare this to domain-specific corpora drawn from science and engineering. Gillam and Ahmad identify a number of differences between ‘general language’ and domain-specific language, especially in lower frequency words, using the concept of ‘weirdness’ to explore this.

The article, published in a peer reviewed journal, presents an unusual method of creating a network text analysis. Hunter provides a detailed overview of network text analysis, including a comparison of existing methods for generating and analysing text networks. His method proposes the use of morphology to filter words, which are mapped onto concepts “on the basis of [Indo-European] etymology” (356). Hunter’s article is comprehensive; however, it neglects to consider the validity of a method reliant upon a theoretical language.

Kok and Domingos’ article explores a method for extracting semantic networks from large scale corpora using “a scalable, unsupervised, and domain-independent system” (625). The article, which has complex mathematical content, provides an overview of existing tools and current research in the field. They trial their Semantic Network Extractor alongside a number of existing tools, using a large scale web corpus, and conclude that their method is a promising approach.

This blog post provides an example of network analysis which traces meaning circulation in the first ten chapters of Nabokov’s Lolita. Long uses his analysis to explore the ‘lexical connections’, using betweenness centrality highlighted by node size. The embedded videos, which illustrate the Gephi networks, are effective, and Long concludes that this method indicates a “connection between form and function, style and plot”.

This peer reviewed article presents an application of social network theory to three mythological texts, Beowulf, Iliad and Táin Bó Cuailnge. MacCarron and Kenna aim to “place mythological narratives on the spectrum from the real to the imaginary” (1). Characters are extracted from each text and linked via a ‘friendly’ or ‘hostile’ relationship. The resulting social network is then compared to real and fictional networks. The methodology is clearly explained, although the mathematics is complex. The authors conclude that there are indications that all three texts have a level of historicity, including Táin Bó Cuailnge.

This article offers a “non-technical” (276) exploration of the use of collocational networks to visualise changes in a series of financial reports. Magnusson and Vanharanta use Mutual Information and frequency, tools borrowed from corpus linguistics, to create their network manually. The article illustrates how a comparison of collocational networks can aid the identification of changes within a corpus.

This chapter demonstrates the use of visual diagrams to indicate relationships between narrative elements and the physical world. Moretti explores the use of traditional geographical maps, before moving on to consider stylised ‘geometric’ maps, to “abstract [details] … from the narrative flow” (53). He uses a series of ‘Village’ stories (Mitford, Galt and Auerbach) to demonstrate how this method of exploring a text enables the researcher to see changes over time, as well as conceptual elements which may not be apparent via close reading.

This article presents a method for the extraction and visualisation of a text network. Paranyushkin outlines the problem of subjectivity in the creation of semantic networks and aims to “avoid as much subjective and cultural influence as possible” (3) in his own method. A clear and detailed methodology, which includes raw data and commands, means that the reader can follow and replicate the research.

This is an edited version of the last of three blog posts, previously published on ‘Austen, Morgan and Me’.

In the second week of the Coursera R Programming course, things were getting decidedly tough, with the focus on creating functions. The shift from basic calculations to writing functions is a very steep learning curve.

I had made the mistake of reading a ‘warning’ post which highlighted the difficulties of creating the assignment functions, but also linked to some very useful materials (some of which I had already studied). Initially, this post made me panic, as it emphasised the number of people who dropped out of the MOOC, as well as the fact that the estimated number of hours required was optimistic, even for a skilled programmer.

As seems to be my default at the moment, I talked this problem through with the dog during our afternoon walk – she is a fantastic listener! I asked her how I would have felt if I hadn’t read the post. I wouldn’t have been worried. What would I do if/when I got stuck? Use the message boards and the internet to solve the problems. So, aside from completing the very helpful tutorial, I decided to act as though I had not read the post. [Having completed the CS620c Structured Programming Course in Java, I have found out that programmers call this ‘Rubber Duck Debugging’!]

The assignment for the course had 3 parts and a function to be written for each. The first part took a while, not as long as I had feared though, and the message boards came to my rescue with a final issue I needed to solve.

The second part of the assignment was actually a bit easier as it built on the first, and I was feeling a bit more confident having completed the first task. Again, my sticking point was one aspect, I asked for help on the boards and tried some solutions, to no avail. So, I decided to sleep on it and go back to the start. I rewrote the function line by line, testing as I went and adding markdown comments. This time it came together much better and I had only a fairly simple tweak (which only took 3 trials to get right) before it was ready for submission. I have to say I was very pleased with myself.

The final part was somewhat tricky. I managed to get myself to a position where a correlation was being returned, but only a single result rather than multiples. I felt like I was going round in circles, tweaking the code and retrying it. It turned out that the problem was with my subsetting. When I printed my code it looked like everything was finally starting to work, but when I ran the example outputs it came back as NULL! A few moments of panic before I realised that I had simply forgotten to return the result within the function. So, finally the assignment was completed and submitted.

I think this also goes to show the importance of mindset when approaching difficult tasks. As my background is literature/history and Education, I haven’t really had a great deal of experience using maths and computing. I know I can learn things; like many people who grew up in the 1980s, most of my computer skills are self-taught. But it is that nagging voice that puts me off, causing doubt to creep in. Well, quite frankly, it can get lost!

My initial foray into the world of programming seemed to go fairly well, at least, I managed to get my head around the basics and haven’t run screaming from my laptop. There were a few areas which I found more tricky than others, but I think that some of that is because I don’t have a maths or computer background, so I have a pretty steep learning curve.

An Overview

My understanding is that R is a program that you can use to sort, group and analyse data from a straightforward level right up to complex modelling.

There are several types of ‘object’ in R: vectors, lists, matrices, factors and data frames. These objects have attributes: class (type: character, numeric, integer, logical, complex), names, dimensions, length etc. These objects hold the data. Functions are used to manipulate the data; they seem to be written functionname(object).

Naming vectors and other items carefully will make the data much easier to explore, so this needs some thought as well as careful recording – the environment window of R Studio lists values, their type and their content which is very useful. I feel that I am starting to grasp what R is capable of, and why I might use it, but my practical knowledge is not yet up to where it needs to be.

Next Steps

Having completed Coursera’s Data Scientist’s Toolbox course, which introduced me to R, R Studio, Git and GitHub (all available free on the internet), the next step is the R Programming course. Both courses are part of the data science specialism run by Johns Hopkins University.

To start, I have installed a free tutorial called swirl, which goes through the basics (in a similar manner to Tryr and DataCamp) using R Studio.

Although I had covered a large proportion of the initial SWIRL content in my previous post, it has been helpful to go over the basic functions. This time round, I feel a bit more confident in the use of terminology and the structure of arguments and functions.

Missing Values

R uses NA to indicate missing values and will return NA as the result of any calculation with NA as one of the variables. To identify the NA results in a data set you use the is.na() function. Care has to be taken using logical expressions if NA is a possible value, as they can return odd results. NaN, standing for ‘Not a Number’, indicates an undefined numerical result, such as 0/0.

The is.na() function can be used to remove NA values: bad <- is.na(x) and then x[!bad]. To remove NA from multiple objects or from a data frame: use complete.cases().
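The NA-handling functions above can be sketched in base R with a small made-up vector and data frame:

```r
# Sketch of the NA-handling functions described above (base R only).
x <- c(1, 2, NA, 4, NA, 6)

sum(x)                # NA, because NA propagates through calculations
sum(x, na.rm = TRUE)  # 13

bad <- is.na(x)       # logical vector marking the missing values
x[!bad]               # 1 2 4 6 -- only the non-missing values

# complete.cases() does the same job across a whole data frame
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
df[complete.cases(df), ]  # keeps only the rows with no NA in any column
```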

Subsetting Vectors

To select elements from a vector use square brackets with a vector index – [vector index]. There are several types of vector index: logical vectors, positive integers, negative integers, character strings.

! gives the negation of a logical expression, so if is.na() can be used to identify results which are NA, !is.na() can be used to identify results which are not NA. By subsetting the data and creating a vector which includes only the non-NA items, we can avoid the possible problems.

Positive integers can be used to select specific elements within the vector, e.g. the 3rd and 5th: x[c(3,5)]. Negative integers can be used to select all elements except specific ones, e.g. all except the 3rd and 5th: x[c(-3,-5)], which can also be written x[-c(3,5)].

To subset using names you need to remember to use quotation marks inside the square brackets, combined with c(): x[c("name1", "name2")].
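The subsetting styles above can be sketched with a small named vector (the names and values are made up):

```r
# Sketch of the vector-subsetting styles described above.
x <- c(10, 20, 30, 40, 50)
names(x) <- c("a", "b", "c", "d", "e")

x[c(3, 5)]     # 3rd and 5th elements: 30, 50
x[-c(3, 5)]    # everything except the 3rd and 5th: 10, 20, 40
x[c("a", "d")] # by name -- note the quotation marks: 10, 40
```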

Matrices and Data Frames

The dim() function tells us the dimensions of an object or can be used to set the dimensions – for example to change a vector into columns and rows, and therefore change it into a matrix.

To subset from a list or data frame you use double square brackets [[ ]]; these can only be used to select a single element. A $ is used to extract elements by name.

To subset a matrix use x[row, column] – this will return a vector by default, to return a matrix use x[row, column, drop = FALSE].
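A quick sketch of the matrix and data frame subsetting described above, using made-up values:

```r
# Sketch of [[ ]], $ and drop = FALSE, as described above.
m <- matrix(1:6, nrow = 2)   # filled column by column: row 1 is 1 3 5
m[1, 2]                      # single element: 3
m[1, ]                       # first row, returned as a vector by default
m[1, , drop = FALSE]         # first row, kept as a 1-row matrix

df <- data.frame(name = c("Emma", "Persuasion"), year = c(1815, 1817))
df[["year"]]                 # the year column: 1815 1817
df$year                      # the same column, extracted by name with $
```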

The Working Directory

It is important to know which directory you are working in, and therefore where your information will be saved. To find out your current directory, type getwd(). To set the working directory, use the Session tab → Set Working Directory → Choose Directory in R Studio, or use the setwd() function.

Any file you want R to read will need to be in this directory. To access a csv file (comma separated values) you use the command read.csv(“filename.csv”). To check the contents of a directory: dir().

First in a series of 3 blog posts, previously blogged on ‘Austen, Morgan and Me‘

As I have been delving deeper into the technicalities of a corpus-based approach to literature, it has become increasingly evident that I will need to get my head around some fairly complex statistical analysis. The more I read about this type of analysis, the more references to R I found. To be able to fully understand the articles in this field, I will need to get to grips with statistics as well as at least one of the computer methods used to produce this type of analysis. In addition, as I am a bit of a control freak, I want to be able to carry out my own statistical analysis (as far as is possible) and this seems to mean learning how to program using R.

I downloaded R and RStudio, following the instructions on the Coursera ‘Data Scientist’s Toolbox’ course, and found two sites which allowed me to go through the basics: Datacamp and Tryr (being from the West Country I love the pirate references on Tryr!), I have also signed up for the Coursera course in R programming. This is quite an exciting prospect as it is a world away from my areas of expertise. Although it is hard work, I think it will be worth it to be able to run my own algorithms and to know exactly how the results are achieved, rather than having to rely upon someone else. Below are my notes from following the tutorials on Datacamp and Tryr:

The Basics

Expressions: a simple instruction written after the prompt (>); it could be a line of text (written in quotation marks) or a simple mathematical equation (2+3). In this way, R can be used as a simple calculator. The response is written on the next line, indicated by [1].

Logical (Boolean) values: expressions which return TRUE or FALSE. T and F are shorthand.

Variables: these allow you to store a value or object which can be accessed later, e.g. a value for x. When x is typed, R replaces it with the assigned value. Values can be assigned by writing either x = 4 or x <- 4.

Data types:

Numerics – decimal values

Integers – natural numbers

Logical – Boolean values

Characters – text or string

Checking a variable type: type class(variable_name) – this allows you to make sure that you are working with the right type of variable for the calculation you are trying to carry out. Functions: similar to a spreadsheet, you can use functions, e.g. sum(x,y,z). The values need to be in parentheses.

sum – adds the given values

rep – repeats the value (used with the argument times)

sqrt – square root

Help: to get help for a particular function, type help(function_name). example(function_name) gives examples of the function in use. For simple and short scripts, it is easy to type the commands as required. However, for longer and more complex commands, it is possible to save the commands as a plain text file (‘x.R’) which can be executed later. To run the script you type source("x.R").

Vectors

A vector is a list of values (a one-dimensional array). Vectors can hold numeric, logical or character data, but a single vector cannot hold values of different types. A vector is created using c(x,y,z) – c meaning combine. To name the elements of a vector use names(vector_a) <- c(item_names). Alternatively, create a vector with the item names and then assign that vector, e.g. item_names_vector <- c("a","b","c") then names(vector_a) <- item_names_vector. The names can be used to access the values or to change them: vector_a["a"] would bring up the value associated with "a"; to change the value: vector_a["a"] <- 42.

For example, vectors could be created for authors and their works to identify where specific pieces of data have originated, so I could have an ‘Austen’ vector which included each of the novels.

To add vectors and assign the result: total_vector <- vector_a + vector_b. To add the values within a vector: sum(vector_a).

To select elements of a vector: vector_a[1] – the number (the array index) indicates the value at that position; in R, array indices start at 1. To select multiple elements: vector_a[c(1,3)]; or a consecutive set of elements: vector_a[c(2:5)] – this is called a sequence vector. You can use [] to assign a new value within a vector or to add new values, e.g. vector_a[2] <- "biscuit" would change the second value in the vector to "biscuit". To set a range of values: vector_a[4:6] <- c(x,y,z). Another way of selecting a series of elements is the seq function: seq(3,9) would return all the numbers from 3 to 9; it is more flexible, however, as a third argument sets an increment other than 1.

To get the average of a vector: mean(vector_a); to get the average of selected elements: mean(vector_a[c(1,2)]).
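The operations above can be sketched with two small made-up vectors:

```r
# Sketch of the vector operations described above.
vector_a <- c(2, 4, 6, 8, 10)
vector_b <- c(1, 1, 1, 1, 1)

total_vector <- vector_a + vector_b  # element-wise addition: 3 5 7 9 11
sum(vector_a)                        # 30
vector_a[c(2:5)]                     # 4 6 8 10 -- a sequence of elements
seq(3, 9, by = 2)                    # 3 5 7 9 -- an increment other than 1
mean(vector_a[c(1, 2)])              # 3 -- average of selected elements
```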

Comparing in R:

< less than

> greater than

>= greater than or equal to

== equal to

!= not equal to

To select by comparison: selection_vector <- vector_a > 3 (this returns TRUE/FALSE for each item). This can then be used to pick out only those items above 3: above_3 <- vector_a[selection_vector].

NA values: if a value isn’t known it can be replaced with NA – R recognises this and will return NA for calculations. You can instruct R to ignore NA, e.g. sum(vector_a, na.rm=TRUE) – the default is FALSE. To see the values in a vector: print(vector_a).
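Selection by comparison can be sketched with a made-up vector:

```r
# Sketch of selecting by comparison, as described above.
vector_a <- c(1, 5, 2, 8, 3)

selection_vector <- vector_a > 3   # FALSE TRUE FALSE TRUE FALSE
above_3 <- vector_a[selection_vector]
above_3                            # 5 8

sum(c(1, 2, NA), na.rm = TRUE)     # 3 -- the NA is ignored
</test>
```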

Simple Graphs

Bar Graphs: the barplot function draws a bar chart with a single vector’s values – barplot(vector_a). If names have been assigned to the vector values, these will be displayed as labels. The mean of the vector can be worked out with mean(vector_a) and added to the barplot with abline(h=mean(vector_a)) – v would create a vertical line. The median can be calculated and added in the same way. To work out the standard deviation: sd(vector_a). By naming the mean and sd values you can add lines to a barplot indicating the mean and one standard deviation above and below.

Scatter Plots: the plot function takes two vectors, one for the x-axis and one for the y-axis: plot(vector_a, vector_b) – the first vector is the x-axis.
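Both plot types above can be sketched in base R with made-up values:

```r
# Sketch of a bar chart with mean and standard-deviation lines, as above.
vector_a <- c(4, 7, 5, 9, 6)
names(vector_a) <- c("a", "b", "c", "d", "e")

barplot(vector_a)
m <- mean(vector_a)          # 6.2
s <- sd(vector_a)
abline(h = m)                # horizontal line at the mean
abline(h = m + s, lty = 2)   # one standard deviation above
abline(h = m - s, lty = 2)   # one standard deviation below

plot(vector_a, c(2, 4, 3, 6, 5))  # scatter plot: the first vector is the x-axis
```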

Matrices

A matrix is a collection of elements (of the same type) arranged in rows and columns (a two-dimensional array). The matrix function has three key arguments: the elements to be arranged, byrow (whether the information is filled by row, TRUE, or by column, FALSE), and nrow (how many rows are used), e.g. matrix(1:15, byrow=TRUE, nrow=3). To name columns: colnames(matrix_a) <- c("a","b","c"); to name rows: rownames(matrix_a) <- c("a","b","c"). To calculate the total for a row: rowSums(matrix_a); for a column: colSums(matrix_a). To add columns or rows: cbind merges matrices and/or vectors by column, e.g. matrix_c <- cbind(matrix_a, matrix_b, vector_a); and rbind merges matrices and/or vectors by row. To select elements from a matrix: matrix_a[row,column], e.g. matrix_a[1,2] would select the element on the first row, second column. To select a whole row: matrix_a[row,]; a whole column: matrix_a[,column]. To change a vector into a matrix, set its dimensions with dim, e.g. dim(vector_a) <- c(rows,columns).
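These matrix operations can be sketched with a small made-up matrix; note that matrix() fills by row when byrow=TRUE:

```r
# Sketch of the matrix operations described above.
matrix_a <- matrix(1:6, byrow = TRUE, nrow = 2)  # row 1 is 1 2 3, row 2 is 4 5 6
colnames(matrix_a) <- c("a", "b", "c")
rownames(matrix_a) <- c("r1", "r2")

rowSums(matrix_a)          # r1 = 6, r2 = 15
colSums(matrix_a)          # a = 5, b = 7, c = 9
matrix_a[1, 2]             # element at row 1, column 2: 2
matrix_a[, "c"]            # a whole column, selected by name

matrix_b <- rbind(matrix_a, r3 = c(7, 8, 9))  # merge in a new row
```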

Matrix Plotting

To create a contour map: contour(matrix_a). To create a 3D perspective plot: persp(matrix_a). To alter the vertical expansion use the expand argument, e.g. persp(matrix_a, expand=0.2). To create a heat map: image(matrix_a).

Factors

Factors are a statistical data type used to store categorical variables – variables with a fixed number of categories, e.g. novels by Jane Austen: Austen <- c("sense", "pride", "emma", "mansfield", "abbey", "persuasion"). To categorise the vector, use factor: JANovels <- factor(Austen). This creates levels from the unique values – they become integer references, and the underlying integers can be viewed using as.integer(JANovels).

If you create a plot to explore aspects of the factor, you can use different characters for each level by using pch – e.g. plot(vector_a, vector_b, pch=as.integer(JANovels)). A legend can be added using legend and the levels function, e.g. legend("topright", levels(JANovels), pch=1:length(levels(JANovels))).

If the variable has a natural order (it is an ordinal variable, e.g. high, medium, low) the order of the levels can be set when creating the factor: factor(vector_a, ordered=TRUE, levels=c("low", "medium", "high")). The summary function, when used with a factor, will give you an overview.
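Factors and ordered factors can be sketched as follows (the vectors are made up, with repeats so the levels are visible):

```r
# Sketch of creating factors and ordered factors, as described above.
Austen <- c("sense", "pride", "emma", "pride", "emma")
JANovels <- factor(Austen)
levels(JANovels)         # "emma" "pride" "sense" -- unique values, alphabetical
as.integer(JANovels)     # the underlying integer codes

# An ordinal variable: the level order is set explicitly
sizes <- factor(c("low", "high", "medium"),
                ordered = TRUE,
                levels = c("low", "medium", "high"))
sizes[1] < sizes[3]      # TRUE -- ordering comparisons now work
summary(sizes)           # counts per level
```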

Data Frames

A data frame is a bit like an Excel spreadsheet: it connects linked pieces of data into columns, with rows for the values. This means that additions to a column for one data item prompt additions to the others, keeping the whole in sync. Unlike a matrix, where all the entries need to be of the same data type, a data frame allows a variety of data types. To create a data frame: frame_a <- data.frame(vector_a, vector_b, factor_a). To access a column: frame_a[[2]] or frame_a[["vector_b"]] – both would return the same information; although the first method is shorter, the second is clearer. An alternative method is frame_a$vector_b.
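Building and accessing a data frame can be sketched with made-up values:

```r
# Sketch of building and accessing a data frame, as described above.
titles <- c("Emma", "Persuasion", "Sanditon")
years  <- c(1815, 1817, 1817)
frame_a <- data.frame(title = titles, year = years)

frame_a[[2]]          # the second column: 1815 1817 1817
frame_a[["year"]]     # the same column, selected by name
frame_a$year          # the same again, via $
```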

Loading Data Frames

R has the capability to load external files, e.g. .csv (comma separated values) and .txt files. To load a CSV file into a data frame: read.csv("file.csv"). For files that use separators other than commas, e.g. a text file using tabs, you use the read.table function: read.table("file.txt", sep="\t") – this would read a TXT file where the values are separated by tabs. The header argument can indicate that the first line is the column header: read.table("file.txt", sep="\t", header=TRUE).
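A self-contained sketch of read.table, writing a tiny tab-separated file first so there is something to read:

```r
# Sketch of reading a tab-separated file, as described above; a tiny
# temporary file is written first so the example is self-contained.
tmp <- tempfile(fileext = ".txt")
writeLines(c("title\tyear", "Emma\t1815", "Persuasion\t1817"), tmp)

df <- read.table(tmp, sep = "\t", header = TRUE)
df$year   # 1815 1817
```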

Merging Data Frames

To merge two data frames where they have a common column: merge(x=frame_a, y=frame_b).

Exploring Data Frames

The head() and tail() functions allow you to see the top and bottom sections of your data frame: head(frame_a). The str() function tells you the number of observations, the number of variables, the variables’ names and types, and the first observations: str(frame_a) – this is a useful way of getting an overview of a new data set. To create a subset of a data frame: subset_a <- subset(frame_a, vector_a > n). To order the information in a data frame by a particular column, use the function order(): decrease <- order(frame_a$vector_a, decreasing=TRUE). To create a new data frame using this ordered information: frame_b <- frame_a[decrease,].
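The subset() and order() steps above can be sketched with a small made-up data frame; note that subset() takes the condition directly as its second argument:

```r
# Sketch of subset() and order(), as described above.
frame_a <- data.frame(title = c("Emma", "Persuasion", "Sanditon"),
                      year = c(1815, 1817, 1817))

str(frame_a)                              # overview: observations and variables
subset_a <- subset(frame_a, year > 1815)  # keeps Persuasion and Sanditon

decrease <- order(frame_a$year, decreasing = TRUE)
frame_b <- frame_a[decrease, ]            # rows reordered, latest year first
```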

Lists

To create a list you use the list() function: list_a <- list(vector_a, matrix_b, frame_c). To name the items while creating the list: list_a <- list(vectorname=vector_a, matrixname=matrix_b, framename=frame_c).

Statistical Tests

To test for correlation: cor.test(vector_a,vector_b); or for subsets of a frame: cor.test(frame_a$a, frame_a$b). This will provide the p-value and other information.

To see whether an estimate can be made for a likely result, if we have data for a but incomplete data for b, a linear model can be used: estimate <- lm(response ~ predictor), e.g. estimate <- lm(frame_a$b ~ frame_a$a).
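The two tests above can be sketched with a tiny made-up data frame where b rises roughly in step with a:

```r
# Sketch of cor.test() and lm(), using a small made-up data frame.
frame_a <- data.frame(a = c(1, 2, 3, 4, 5),
                      b = c(2.1, 3.9, 6.2, 8.1, 9.8))

ct <- cor.test(frame_a$a, frame_a$b)
ct$p.value                            # small p-value: strong linear association

estimate <- lm(b ~ a, data = frame_a) # response ~ predictor
coef(estimate)                        # intercept and slope
predict(estimate, data.frame(a = 6))  # estimate b for a new value of a
```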

ggplot2

ggplot2 is a graphics package. Once it is installed you can get help: help(package="ggplot2"). To use a package: library(ggplot2). This package makes it simple to create more attractive plots, using colour, without some of the complexities of base graphics.

As Armistice Day approaches, I thought I would write a post about my own experience of digital history and why I feel that digitisation projects are so important.

It is 100 years since the start of World War 1, and there are no longer any living combatants. As the years pass, it is increasingly likely that we will become distanced from the events and the people involved.

Digitised collections from national archives provide us with the opportunity to discover more about those involved in WW1 and to view them as more than just a series of names.

Library and Archives Canada (LAC) is currently undertaking a project to digitise the service records of about 640,000 members of the Canadian Expeditionary Force (CEF). The archives have already digitised about 620,000 attestation forms and 13,500 service records.

The first of the key steps to digitization involves a review of each file for its content, as some include objects such as badges or mementos. Service files may contain documents as varied as casualty or medal forms, pay books, passports, and, in some cases, personal photos and correspondence. Items that cannot be scanned will be retrieved, photographed, and placed aside so they can be reintegrated with the proper file before final storage. Staples and bindings, such as glue, must be carefully removed from each sheet of paper before being boxed alphabetically and transported for scanning at a minimum of 300 dots per inch (dpi), depending on the amount of details in the document, at a one to one ratio.

Once digitized, images will be associated to metadata (the keywords that allow users to search through an electronic databank, such as the member’s given name, last name or regimental number). The images will be compressed to a lower resolution so that searches on the Web can be performed faster, and uploaded to the CEF databank. Batches of electronic files will be made available as they are ready, with the first set expected to be added to the Soldiers of the First World War section in 2014. After digitization, the paper files will be re-boxed according to new standards designed to ensure their long-term conservation, and stored in LAC’s state-of-the-art preservation facilities in Gatineau. Thereafter, there will be limited access to the original documents. – Library and Archives Canada

The purpose of the project is to allow free, public access to the records – something that currently costs C$20 per order – and to preserve the fragile originals.

The UK National Archives have digitised about 5% of their records, some in collaboration with commercial partners.

Two WW1 Soldiers

Reginald G Eldridge (Great Uncle Rex)

Rex in Uniform

Rex was living in Canada at the outbreak of WW1. He joined the Canadian Expeditionary Force in September 1914, attesting in Quebec before travelling to the UK.

Rex’s Attestation Form – LAC

Rex’s Attestation Form – LAC

Rex was initially based on Salisbury Plain, for training and while the Canadian military leaders organised the troops. While he was based there, my grandmother (who lived in the UK) wrote to his colonel asking whether Rex could have leave for Christmas.

Unfortunately, the request was not approved.

Rex, alongside thousands of Canadian troops, was sent to France in early 1915. He was gassed in battle, but fortunately survived, although this affected his health for the rest of his life.

William H Fegan (Grandpa)

Grandpa Pre-Military Service 1916/1917

My grandfather, like many veterans, spoke very little about his military service. Documents from the UK National Archives have revealed a lot of information, which has helped create a more complete picture.

Grandpa joined the 16th (Res.) Battalion, London Regiment, Queen’s Westminster Rifles on 8th January 1917, when he was 17 years and 11 months old. He spent about 2 months training before returning home to await call-up.

On 1st February 1918, aged 19, he travelled from Southampton to Le Havre and was assigned to the Corps Reinforcement Camp. On 20th February 1918 he was sent to the Western Front.

On 28th March 1918 he went over the top at Arras. In no man’s land he was blown up by a shell. On the casualty form, squeezed in above another line of writing are the words “wounded in action” and the date 29th March 1918.

When he was 88, Grandpa told my uncle:

“When I came round, I couldn’t see a soul, either British or German. There wasn’t anything moving on the landscape, so I didn’t know which way to go. I could walk, but must have gone in and out of consciousness several times. I eventually fell into a railway cutting where there was a British dressing station. They gave me a shot of something, but wanted to kill me later on, I found out, because I would not stop singing at the top of my voice.” W H Fegan

His injuries included a large hole in his side; most of his shoulder blade was shot away, as was his right elbow.

Casualty Form – UK National Archives

Casualty Form – UK National Archives

After spending almost a year in hospital, on 18th January 1919 he was judged, under paragraph 392 (XVI) King’s Regulations, “No longer physically fit for War Service”. His entire military career had lasted 2 years 11 months, with just 69 days in active service.

Digitised collections allow anyone to access documents which, at one time, were mostly the preserve of academics and the curation staff. Being able to view these documents helps to bring historical events closer, making their participants and their lives come alive.

Why use network models?

One area of data visualisation, which I have been exploring in another online course (Social Network Analysis from the University of Michigan, via Coursera), is network modelling. Network models are designed to simplify complex networks in a manner which allows the reader to view patterns and relationships which may be hidden in a mass of data. In addition, they have the benefit that their properties can be derived mathematically – which means behaviour and outcomes can be predicted. By exploring models in this way, we can draw conclusions about a particular network (also known as a graph).

Networks are made up of nodes (points or dots) and edges (lines). Nodes are said to have a degree (the number of connections with other nodes); in a network where the links are directed (link to or from a node), this is further defined as in-degree and out-degree. In a simple network (without self-edges or multiple-edges between nodes) the maximum degree of any node is (n-1) – the total number of nodes minus the chosen node.
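As a quick illustration of degree, the sketch below (my own example, with invented node names and edges, not data from the post) counts the connections of each node in a small undirected network:

```python
from collections import Counter

# A small, made-up undirected network as a list of edges
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]

# Each undirected edge adds one degree to both of its endpoints
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

n = len(degree)  # 5 nodes, so no node can have a degree above n - 1 = 4
```

Here node A has degree 3 and node E has degree 1; in a directed network, keeping two separate counters for edge sources and edge targets would give the out-degree and in-degree.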

Random Graphs

To explore a network, we need to see whether it is behaving in a particular way, and therefore need to consider whether this behaviour could be viewed as random. The Erdös-Rényi random graph makes several assumptions: nodes connect at random, the network is undirected, and the model is governed by a single key parameter – the probability (p) that any two nodes share an edge.

The degree distribution of a random graph assumes that each possible connection between two nodes occurs independently (i.e. it is not affected by any other node being connected) and with the same probability – this gives a binomial distribution. The binomial distribution tells us the probability that a particular node will have a specific degree (k).

Binomial Distribution = P(k) = C(n−1, k) p^k (1−p)^(n−1−k)

where:

n is defined as the number of nodes

k is defined as the degree

p is defined as the probability

We can also work out:

the average degree: ⟨k⟩ = p(n−1)

the variance: σ² = p(1−p)(n−1)

the standard deviation: σ = √(p(1−p)(n−1))
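These quantities are easy to compute directly. The sketch below (my own illustration, with arbitrary example values for n and p) uses only Python’s standard library:

```python
import math

def binomial_degree_prob(n, p, k):
    """Probability that a node in an Erdös-Rényi graph of n nodes has degree k.
    Each node has n - 1 potential edges, each present with probability p."""
    return math.comb(n - 1, k) * p**k * (1 - p)**(n - 1 - k)

n, p = 100, 0.06                   # arbitrary example values

avg_degree = p * (n - 1)           # average degree: p(n-1)
variance = p * (1 - p) * (n - 1)   # variance: p(1-p)(n-1)
std_dev = math.sqrt(variance)      # standard deviation

# Sanity check: the probabilities over all possible degrees (0 to n-1) sum to 1
total = sum(binomial_degree_prob(n, p, k) for k in range(n))
```

With these values the average degree is 5.94, which sits at the peak of the binomial distribution, as expected.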

Use of NetLogo and Gephi to analyse a social network

If we have a network we wish to explore, we can compare it to one or more models to see whether it is similar, and what the similarities and differences suggest. NetLogo is a free piece of software which enables you to run network models. Below, I have a screenshot from the Erdös-Rényi random model:

Erdös-Renyi

In this model, with 100 nodes, we can see that the nodes are all linked into one giant component and the average degree is 6 edges per node. The random model does not have hubs (nodes with considerably higher degree than other nodes).

I took a Facebook network (my husband’s, as I haven’t got an active Facebook account) and created a visualisation. The Facebook network data was extracted using NetGet (a version of Netvizz with reduced functionality). While Facebook allows the extraction of data like this relatively easily, other social media sites, for example Twitter, do not; for extracting data from a wider range of sites, this may help (Windows only). To ensure anonymity, I have only reproduced the image of the network, and not the interactive version.

Facebook Social Network

To carry out the visualisation, I used Gephi, a free piece of software which allows the user to create data visualisations. In the example below, I used the Force Atlas 2 layout algorithm, coloured the nodes according to gender (red are male, blue are female), and altered the size of the nodes to reflect the degree (the number of edges which attach to a node).

The giant component (the large central mass of nodes and edges) is a network of work colleagues. The satellite groups represent networks from previous jobs, and groups of friends from different locations and periods of time.

Unsurprisingly, the work network has nodes with higher degree than the personal networks. What is interesting is that in many of the groups there is an individual who acts as a central node, linking the group members together.

The layout suggests a globe or world map, which in itself is interesting as this could be viewed as a personal, social media ‘world’.

In the Facebook model, there are 142 nodes and the average degree is 8; however, the giant component only makes up 42% of the total, and there are several hubs of various sizes. This tells us that a Facebook network does not follow the pattern of the Erdös-Rényi model and therefore is not random. As this is a social network, the nodes are likely to be created over time rather than all at once. The presence of the hubs suggests that some of the nodes (contacts) attract more edges than others – a power-law distribution.
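The comparison can be made concrete with a small simulation. This sketch (my own illustration, using only Python’s standard library rather than NetLogo) generates an Erdös-Rényi graph with the same number of nodes and roughly the same average degree as the Facebook network, then measures how much of it falls into the giant component:

```python
import random
from collections import defaultdict

def erdos_renyi(n, p, seed=1):
    """Build a random graph: each of the n(n-1)/2 node pairs is linked with probability p."""
    rng = random.Random(seed)
    adj = defaultdict(set)
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def giant_component_fraction(adj, n):
    """Fraction of all n nodes contained in the largest connected component."""
    seen, largest = set(), 0
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:                       # depth-first search from each unseen node
            node = stack.pop()
            size += 1
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        largest = max(largest, size)
    return largest / n

# Match the Facebook network: 142 nodes, average degree 8, so p = 8 / 141
n = 142
graph = erdos_renyi(n, 8 / (n - 1))
fraction = giant_component_fraction(graph, n)
avg_degree = sum(len(neighbours) for neighbours in graph.values()) / n
```

With an average degree of 8, the random graph’s giant component contains almost every node, so a real network where it covers only 42% is clearly not behaving randomly.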

Visualising the network makes it much easier to see the links and patterns, and while it would be possible to garner this information from the raw data, it would take a considerable length of time.

The battle lines are drawn: in one corner we have Stanley Fish, the “saving remnant” who “insists on the distinction between the true and the false”, and in the other Franco Moretti, who carries out research “without a single direct textual reading” and doesn’t care if his idea is “particularly popular” (48). Two extremes, dividing the Humanities departments of the world – or so many of the comments accompanying Fish’s opinion piece would have us believe.

Once past the bombast of the opinion piece, the battleground is clearly an exaggeration. A closer look at the literature reveals numerous concerns and possible misunderstandings. Trumpener, an advocate for close reading, feels there is little need for these new ways of reading:

We can change our parameters and our questions simply by reading more: more widely, more deeply, more eclectically, more comparatively. (Trumpener 171)

In “Paratext and Genre System”, she emphasises the importance of comparison within literary analysis; however, her fears that digital techniques will discard the traditional methods of exploration appear to be unfounded.

Unsworth calls comparison one of the “scholarly primitives” and illustrates how digital tools can aid the comparison of a text in ways that were not possible in the past. Moretti himself is not advocating eschewing comparison either. In fact, his explanation of distant reading has comparison at its heart:

Distant reading: where distance… is a condition of knowledge: it allows you to focus on units that are much smaller or much larger than the text: devices, themes, tropes – or genres and systems…If we want to understand the system in its entirety, we must accept losing something. (Moretti 48-49)

It is a matter of scale: if we want to understand world literature (which is what Moretti is discussing when he uses the term), it is not possible to read everything, even if there were sufficient time. How to cope with a body of literature on this scale is a problem:

That’s the point: world literature is not an object, it’s a problem, and a problem that asks for a new critical method: and no one has ever found a method by just reading more texts…they need a leap, a wager – a hypothesis, to get started. (Moretti 46)

As digital and online editions make access to an ever-wider, and more numerous, range of texts possible, it is hardly surprising that new methods are needed to work at this scale. The concept of the canon, and close reading in this context, is being challenged:

close reading… necessarily depends on an extremely small canon…you invest so much in individual texts only if you think that very few of them really matter…And if you want to look beyond the canon…close reading will not do it. (Moretti 48)

However, this is not moving away from close reading in all its forms; it is simply using tools to explore what to read in more detail. Mueller (“Digital Shakespeare”) points out that:

In scholarly or professional reading…You need to find what to read, which consists mostly of deciding what not to read. When you have found what to read you need to figure out how to get it. (Mueller 290)

Digital tools are commonly used in this process, and not just by those who ‘do’ Digital Humanities.

As Underwood notes (“Theorizing”), some regularly used close reading tools, for example the full-text search, have complex algorithms at their heart. In effect, ‘traditionalists’ may actually already be ‘doing’ Digital Humanities. The danger here is that without an understanding of how the tools work, findings can be skewed:

The deeper problem is that by sorting sources in order of relevance to your query, it also tends to filter out all the alternative theses you didn’t bring. Search is a form of data mining, but a strangely focused form that only shows you what you already know to expect. (Underwood 3)

The workings of the ‘machine’ need to be understood, at least at some level, for its use to be effective. A search engine is not neutral, but it can be incredibly useful. Kirschenbaum (“The Remaking of Reading”), Mueller (“Digital Shakespeare”) and Drouin emphasise that the rise of the machine, a common trope in criticisms of reading in the digital humanities, is not intended to replace human analysis, but rather to support and assist in that analysis. Mueller (“Scalable Reading”) calls this “Digitally Assisted Text Analysis”.

Kirschenbaum (“What is Digital Humanities?”) identifies part of the problem as the difference between digital humanities in reality and the “construct”; “the construct…function[s] as a space of contest for competing agendas” (6), and this is what we see with the concept of reading. It is a false dichotomy:

Those who continue to use the traditional methods of literary analysis are not “worshipping a false god” (Mueller, “Digital Shakespeare” 290). As Steven Pinker states:

Those ways do deserve respect, and there can be no replacement for the varieties of close reading, thick description, and deep immersion that erudite scholars can apply to individual works. But must these be the only paths to understanding? (Pinker)

Given the broad consensus amongst digital humanists that traditional methods are needed in conjunction with the digital, the question arises: why does the argument exist? Underwood (“The Imaginary Conflicts Disciplines Create”) admits:

One thing I’ve never understood about humanities disciplines is our insistence on staging methodology as ethical struggle. (Underwood)

The desire to create conflict over reading, and, for some, a reluctance to make use of new technologies, seems rather strange. Could part of the problem be the ‘construct’ of Digital Humanities? The term itself could be viewed as divisive, as it implies an ‘other’, non-digital humanities:

Fish probably is not right when he sees the problem of the Digital Humanities as a “we/plural/text and author detesting” ethos challenging an “I/singular/text and author fetishizing” ethos. (Mueller 'Stanley Fish and the Digital Humanities')

However, he would find it difficult to make such a claim if the division did not exist. Thirty or more years into the existence of Digital Humanities, in one guise or another, could it be time for a change?

Data can be beautiful. The tools we have at our disposal mean that we can visualise and explore data in ways that simply were not possible a decade ago.

Hans Rosling, the Swedish statistician who founded Gapminder, demonstrates some of the possibilities in this video:

Visualising Data

The Gapminder world statistics cover a huge range of different indicators; these can be analysed using the site’s own tools or exported for use elsewhere. As part of an online course (Data To Insight from the University of Auckland, via FutureLearn), I have been learning how to explore this huge data set using a piece of free software called iNZight.

Summary data – children per woman by region 1952-2012

The difficulty with data, especially when there is a huge amount of it, is that sheer volume can obscure trends. This is where visualisation can help.

In the visualisation below, I used the Gapminder statistics to look at changes to the number of children born per woman, for each leap year between 1952 and 2012, split by region.

A view of the summary data (on the left), for just two regions, gives an indication of the difficulty seeing clear patterns by viewing the numbers alone.

Visualisation – children per woman by region 1952-2012

The iNZight software allows you to view the data as a series of dot plots (one dot representing each country for which there is data) with a box plot underneath showing the spread (the dark area under each of the coloured sections).

By stacking the plots for each leap year on top of each other, changes are easy to spot and trends easy to identify. By placing the multiple plots for each region next to each other, comparisons are easy to make.

Even at a glance, the key trend – that the number of children per woman has dropped since 1952 – is visible, both at a regional level and between regions.

However, this is not the full picture; to explore the data further, and perhaps start drawing some conclusions about it, we need to dig deeper. For example, if we suspect that changes in infant mortality may be part of the cause, we can colour the countries by these rates.

Children per woman coloured to indicate rates of infant mortality

This shows a gradual move in many regions from the pinks and blues which indicate high infant mortality, to the greens and browns which are lower.

The iNZight software is designed to be easy to use; in fact, it is taught in schools in New Zealand. The same analysis can be carried out using the animated tools on the Gapminder World site – this is pretty impressive and has the added benefit of being interactive, so individual data points can be clicked to gain more information.

This type of analysis could be done by looking closely at the numerical data; however, the tools available mean that we can test our hypotheses with the click of a button, allowing time for more in-depth exploration, or for the exploration of larger quantities of data.

The increased use of free and interactive online tools means that almost anyone with an internet connection can carry out detailed analyses. Tools are no longer limited to those with the funding or the programming expertise necessary to create them. This change is one of the core benefits of increased digitisation: more democratic access to analytical tools. However, for the tools to be of any use, and for this democratisation to truly take place, access to statistical information is also necessary, and in some fields, this is the front line between open access and proprietary control.

Crowdsourcing

The term ‘crowdsourcing’ was first used by Jeff Howe in a Wired Magazine article, the word itself a portmanteau of outsourcing and crowd. Initially, the term focused on the business world and had connotations of profit, outsourcing jobs and cheap labour.

The term has been increasingly repurposed by cultural heritage and citizen science; as Ridge explains, the type of involvement is very different from the business model. Her definition describes:

an emerging form of engagement with cultural heritage that contributes towards a shared, significant goal or research interest by asking the public to undertake tasks that cannot be done automatically, in an environment where the activities, goals (or both) provide inherent rewards for participation. Mia Ridge

The key words in this definition are (for me): ‘emerging’, ‘shared’, and ‘inherent rewards’. Crowdsourcing in the humanities is an area which is still developing, still identifying what works and what motivates the participants. The crowd need to buy into the goals of the project in order for it to be successful.

Ancient Lives Project

The Ancient Lives Project is one of the earliest examples of a crowdsourced humanities project. It is part of the Zooniverse group, which began with a series of science-based projects. As such, it combines experts from Classics, Astrophysics and Computer Science.

One of the reasons that I was interested in this particular crowdsourcing project was because I had studied the Zenon papyri from this period (the Greco-Roman period in Egypt) as part of my undergraduate degree. We never saw the original texts, or even images of them, our only contact with the papyri was through a series of typewritten transcripts. The Ancient Lives project allows interested members of the public to be more hands on – measuring and transcribing papyrus fragments using online tools and clear online images.

Image of a papyrus fragment made available in the initial press release for Ancient Lives

In the 1890s two Oxford undergraduates discovered large quantities of papyrus in a series of rubbish tips in the ancient city of Oxyrhynchus. After ten years of excavation, they returned to Oxford with 1000 boxes of papyri – over a million papyrus fragments ranging in size from a postage stamp to a newspaper. For the next century, they were studied by small teams of experts, resulting in the transcription, translation and publication of a mere 1% of the text fragments.

Dr Obbink, from the Classics department at Oxford University, decided that using crowdsourcing for transcription could help the project work on the remaining 99%. In July 2011 the project was launched. The project goals state that:

Ancient Lives combines human computing with machine intelligence in order to expedite the process of identifying known texts, contextualizing unknown texts, bringing together fragments for textual reconstruction, and cataloguing fragments in a more expeditious digital way. The overall goal is to rapidly transform image data from papyri into meaningful information that scholars can use to study Greek literature and Greco-Roman Egypt; information that once took generations to produce.Ancient Lives

These goals, and the way they are phrased, suggest that the target crowd are relatively well educated.

“It’s very exciting that we can use today’s modern tools of astrophysics to get specific information about everyday life in ancient times,” Fortson said. “But it’s really the help from volunteers that will make the difference”.University of Minnesota

The project was launched with a press release and a series of articles calling for ‘Armchair archaeologists’ (Wired). Over the years there has been a steady stream of articles about the project and the texts discovered, including revelations of match-fixing in Greek wrestling (Ancient Origins), and ancient career advice (Forbes).

transcription and correction – using an online keyboard to transcribe the fragments

A papyrus fragment before transcription

The user highlights the centre of each letter, then selects the matching letter. The transcription keyboard provides the typewritten letter, but also shows written examples to aid the transcriber.

The fragments a user has viewed are saved to their ‘Lightbox’ allowing them to return to them, or to create a collection of favourites.

Papyrus fragment after transcription

Users are encouraged to use the ‘Talk’ tool to discuss a particular fragment with other users and follow the progress of the transcriptions.

Users can complete as many or as few transcriptions as they wish.

Who are the crowd?

The project is open to anyone; however, the message boards and the comments on the blog indicate several academics, undergraduate and postgraduate students, and people with Greek language skills. Although the project was initially geared towards transcribing and measuring, some of the comments from users have led to insights that the project team did not expect.

Users from all over the world willing to help us, both amateurs and professional scholars, have left hundreds of comments, which have revealed a useful source of information on the material uploaded online. Ancient Lives Blog

This shows that several members of the crowd could be considered experts rather than amateurs.

Keeping the crowd active

From the start of the project, the associated blog has played a key role in maintaining a relationship with the crowd. In the first two years of the blog, the posts seem more personal. The language in these early blog posts helps create a sense of community (Ancient Lives blog):

“indomitable…warriors”

“strenuous…web users”

“steadfast, indefatigable web users”

The posts cover a range of tutorials and support for the users. Later posts cover a much broader range of archaeological topics, suggesting the project is confident that it has a core group of transcribers.

For the volunteers there are a number of rewards: the ability to be part of an Egyptian archaeological project, to engage with the academic community, to improve knowledge of the Greek alphabet, to be ‘hands on’ with history. However, there are also more tangible benefits:

Will Internet users get credit for the work they’ve done? “Absolutely,” Lintott said, “as with all Zooniverse projects, we’ll take great care to attribute credit correctly.” NBC News Blog

Unlike Old Weather or Transcribe Bentham, Ancient Lives does not use badges, internal competition or a ‘papyrus-ometer’. However, there is some element of gamification – in fact one user referred to the transcription as a ‘computer game’.

Accuracy

Ridge highlights accuracy (either unintended mistakes or deliberate sabotage) as one of the concerns for teams organising crowdsourcing projects. When the project was originally launched, the intention was for each fragment to be viewed and transcribed 5 times; due to the popularity of the site, this was increased to 70–100 times. This seems to have had the desired effect: