Tracking Human Migration Through Archives and Digital Curation

Personalizing the Data by Scott Harkless

Data can seem impersonal, and that is something I have been struggling with in telling the stories of soldiers like Henry Mueller. Much of our data is derived from the vast indexes and reports of the pension administration, and a lot of it comes from government forms filled out by tired and overwrought veterans injured in service to their country. Other data, such as names, locations, and injuries, we take from affidavits written by friends, personal stories, letters, and other kinds of documents that put a human face on people such as Henry Mueller.

There is a letter from an Emma Stark Humphrey of the Woman’s Relief Corps in which she touches upon the complexity of Henry Mueller. In this letter she states that Henry Mueller was her guest for some three weeks. In that time she seems to have been quite taken with him, saying she would “look at him in astonishment at his knowledge of people and animals, and in fact of languages, literature, history and religion and it seemed wicked that man should have to beg for a chance to earn his own living. And such a life and such a brain was sacrificed for our flag and our government turned a deaf ear”. She goes on to express a significant concern after his courteous stay, his fruitless job searches, and his infirmities: “I have after wondered what became of him, and feared in his despondent moment he would commit suicide”.

It’s troubling to see exactly how prevalent suicide is amongst veterans. According to a report released in August of 2016 by the Department of Veterans Affairs, an average of 20 United States veterans take their own lives each day. Historically, the rate becomes more difficult to discern. Although the federal government established homes and hospitals for veterans after the American Civil War, the measures suggested in the above report were far beyond the administration of the day. Veteran suicide prevention hotlines, same-day mental health care, and predictive modeling of suicide rates did not exist then. Perhaps, though, by understanding this problem historically we may be able to improve our predictive modeling. How would we go about quantifying this data so that the promise of big data initiatives may be used to shed new light on depression, PTSD, injuries, poverty, and any number of the other problems that may lead to this tragedy?

The initial step would be to transform the documents so that they are machine readable. In my previous writings I’ve discussed some of the problems of OCR. However, whether we are able to use the nigh-fickle magic of OCR or must transcribe by hand with the dedication and reverence of Benedictine monks, the works could be rendered into machine-readable text for us to work with. From here we are faced with a much more difficult task: identifying writings that show not only a direct fear of suicide but also, potentially, depression and other undiagnosed maladies.

Perhaps what we could do is develop a glossary of terms related to this subject, or indeed to nearly any other emotion we hope to discover, and then use a Python script to search for them. The excellent tutorial “Using Gazetteers to Extract Sets of Keywords from Free-Flowing Texts” gives step-by-step instructions for using Python to pull every mention of a certain set of words from a text file. First one would have to assemble a gazetteer of related words. In this case one could use words drawn from the report, from the Wikipedia entry for suicide and related terms, or even from a reliable thesaurus of words relating to suicide or depression. The difficulty lies in narrowing down such a list for documents written before the age of psychology. The tutorial gives the steps below to complete this.

“Load the list of keywords that you’ve created in gazetteer.txt and save them each to a Python list

Load the texts from texts.txt and save each one to another Python list

Then for each biographical entry, remove the unwanted punctuation (periods, commas, etc)

Then check for the presence of one or more of the keywords from your list. If it finds a match, store it while it checks for other matches. If it finds no match, move on to the next entry

Finally, output the results in a format that can be easily transferred back to the CSV file.”
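A minimal sketch of those five steps might look like the following. This is not the tutorial’s own code. The function names (load_lines, normalize, find_keywords, match_entries, to_csv) are my own placeholders, the three gazetteer terms are only illustrative, and short in-memory strings stand in for gazetteer.txt and texts.txt so that the example runs on its own. The first sample entry is the line from Emma Stark Humphrey’s letter quoted above; the second is a made-up control sentence.

```python
import csv
import io
import re


def load_lines(raw_text):
    """Steps 1 and 2: turn file contents into a list of non-empty lines."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


def normalize(text):
    """Step 3: lowercase, replace punctuation with spaces, collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(no_punct.split())


def find_keywords(entry, keywords):
    """Step 4: return the keywords found as whole words or phrases in one entry."""
    # Padding with spaces keeps 'fear' from matching inside 'fearless'.
    padded = " " + normalize(entry) + " "
    found = []
    for keyword in keywords:
        term = normalize(keyword)
        if term and " " + term + " " in padded:
            found.append(keyword)
    return found


def match_entries(entries, keywords):
    """Step 4, continued: keep only the entries with at least one match."""
    results = []
    for entry in entries:
        matches = find_keywords(entry, keywords)
        if matches:
            results.append((entry, matches))
    return results


def to_csv(results):
    """Step 5: write the matches as CSV text, one row per matching entry."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["entry", "keywords"])
    for entry, matches in results:
        writer.writerow([entry, ";".join(matches)])
    return buffer.getvalue()


# Hypothetical stand-ins for the contents of gazetteer.txt and texts.txt.
# With real files one would read them with open(...).read() instead.
gazetteer_text = "suicide\ndespondent\nmelancholy\n"
texts_text = (
    "I have after wondered what became of him, and feared in his "
    "despondent moment he would commit suicide\n"
    "He was a courteous guest for some three weeks.\n"
)

keywords = load_lines(gazetteer_text)
entries = load_lines(texts_text)
print(to_csv(match_entries(entries, keywords)))
```

Because this sketch only matches whole words, a gazetteer entry such as “fear” would not catch “feared” in the letter. A real gazetteer would need to list such variants, or the script would need some form of stemming, which is part of why narrowing down the word list is the hard part.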

That is just one of several methods one may use to derive data from these files en masse. Other languages, such as R, allow one to perform complex data analytics on the data; however, I shall have to write more on that subject when I’ve learned more about R. One could also create an exhibit presenting the documents in context with Omeka or another online exhibiting tool. This may allow some of the very personal stories in these documents to be told, while gathering this data may allow researchers to glean some information on this important subject through analysis of thousands of pensioners’ records.