Cleaning Text with Python

All of us early modern Europeanists owe the Early English Books Online project a debt of gratitude. Tens of thousands of books published in England before the nineteenth century, all of them scanned, and, in the past few years, downloadable. Thanks to the Text Creation Partnership, some 60,000 of these 125,000 books, mostly those published before 1700, have been transcribed into full-text versions. Next year, 2020, everyone with an internet connection will have access to all 60,000. For now, those without an institutional subscription will have to make do with only 25,000 or so. Life is hard.

No surprise, scholars have been using this resource for years, but only recently have the digital humanities matured to the point where we can deal with this mass of text on a larger scale, using it for more than just individual keyword searches. If you want to download what’s publicly available, you should visit the Visualizing English Print project. But as VEP explains, the hand-transcribed texts have their issues. So they’ve created ‘SimpleText’ versions of the TCP documents – no more outdated XML markup for us! And they’ve also created processed versions that clean up some of the most common errors in the corpus.

VEP is a great service. But I want more. So I decided to learn Python and write my own code (in a Jupyter notebook) to clean these EEBO TCP texts on my own terms. Some of my corrections replicate what VEP has done, but my code also goes further and makes additional changes. I’ll spare you the details here, but I go into an obscene amount of detail in the Jupyter notebook, explaining the various errors I’ve encountered and how I went about fixing them. The code isn’t perfect, but it does a pretty good job so far, if only through repetitive brute force. And it’s really helped me learn some basic Python along the way.
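To give a flavor of what “repetitive brute force” can look like, here’s a minimal sketch. It is not the notebook’s actual code: the function name `clean_text`, the rule names, and the three patterns are all made up for illustration. The idea is simply to run a list of find-and-replace rules over the text one after another, and keep a tally of how many times each rule fired.

```python
import re
from collections import Counter

# Hypothetical rules for illustration only; the notebook's real checks differ.
# Each rule is (label, compiled pattern, replacement).
RULES = [
    ("long_s", re.compile(r"ſ"), "s"),               # long s -> modern s
    ("vv_for_w", re.compile(r"\bvv"), "w"),          # word-initial "vv" printed for "w"
    ("multiple_spaces", re.compile(r" {2,}"), " "),  # collapse runs of spaces
]


def clean_text(text, rules=RULES):
    """Apply each rule in order; return the cleaned text and per-rule counts."""
    counts = Counter()
    for name, pattern, replacement in rules:
        # subn returns the new string and the number of substitutions made
        text, n = pattern.subn(replacement, text)
        counts[name] += n
    return text, counts


sample = "The  vvarres of Cæſar vvere long."
cleaned, counts = clean_text(sample)
print(cleaned)
for name, n in counts.items():
    print(f"{name}: {n}")
```

Because the rules run in sequence, order matters: a later rule sees the output of the earlier ones. That is part of why this kind of cleaning ends up being so repetitive.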

Though it won’t make too much sense until you go through the notebook, here’s a summary of the variety of errors the notebook checked for in the TCP’s 1640 edition of the Duke of Rohan’s Compleat Captain (commentaries on Caesar), and how many of each it found and corrected:

If you need a sample of the specific changes made:

And this is only the beginning.

So if you’re Python-curious and wonder what all the fuss is about, you can check out my GitHub repository: https://github.com/ostwaldj/eebo_tcp_clean_text. But be warned – for it to work, you’ll need to know a tiny bit of Python, and have Python 3+ as well as Jupyter notebooks (preferably via Anaconda) already installed. Once you have Python/Jupyter installed, you should be able to just download the repo, unzip it, open the Jupyter notebook, and change the file path to match your machine. After that it should be ready to go, at least on my sample Rohan text. For those with just a little bit of Python knowledge, it should be easy to hack the code, e.g. to expand it to cover additional types of errors or changes.
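If the path business is unfamiliar, here’s a rough sketch of the kind of thing you’d be editing. The folder and file name below are placeholders I’ve invented for this example, not the repo’s real layout, and `read_tcp_text` is a hypothetical helper rather than a function from the notebook.

```python
from pathlib import Path


def read_tcp_text(path):
    """Read a plain-text file as UTF-8, with a clear error if the path is wrong."""
    path = Path(path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(
            f"No file at {path}; edit the path to match your machine."
        )
    return path.read_text(encoding="utf-8")


# Hypothetical location and file name; point this at wherever you unzipped the repo.
TEXT_PATH = Path.home() / "Downloads" / "eebo_tcp_clean_text" / "rohan_sample.txt"

if TEXT_PATH.is_file():
    raw_text = read_tcp_text(TEXT_PATH)
    print(f"Loaded {len(raw_text):,} characters")
else:
    print(f"Nothing found at {TEXT_PATH}; update TEXT_PATH first.")
```

Building the path with `pathlib` rather than typing out a string means the same line works on Windows, macOS, and Linux without fiddling with slashes.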

Hopefully, in the future, I’ll have time to set it up with MyBinder, so it can be run by anyone in a web browser.