Spencer Roberts

edited by Fred Gibbs

published: 2013-04-01

retired: 2017-07-05

difficulty: Low

This lesson has been retired

What does this mean?

The Programming Historian editors do their best to maintain lessons as minor issues inevitably arise. However, since publication, changes to either the underlying technologies or principles used by this lesson have been substantial, to the point where the editors have decided not to further update it. The lesson may still prove a useful learning tool and a snapshot into the techniques of digital history when it was published, but we cannot guarantee all elements will continue to work as intended.


Lesson Goals

In Counting Frequencies you learned how to count the frequency of specific
words in a list using Python. In this lesson, we will build on that
topic by showing you how to retrieve webpage items from a Zotero library, save
local copies of their content, and count the frequencies of the words they
contain. It may be beneficial to review the previous lesson before we begin.

Files Needed For This Lesson

obo.py

If you do not have this file, you can
download programming-historian-3.zip from the previous lesson.

Modifying the obo.py Module

Before we begin, we need to adjust obo.py so that the module can work
with HTML files from any source. Its stripTags function was originally
designed for Old Bailey Online content only, so it must be updated as
follows. We will remove the line that instructs the program to begin at
the end of the Old Bailey header, and instead tell it to begin at the
start of the page. Open the obo.py file in your text editor and follow
the instructions below:

```python
def stripTags(pageContents):
    # remove the following line
    # startLoc = pageContents.find("<hr/><h2>")

    # modify the following line
    # pageContents = pageContents[startLoc:]

    # so that it looks like this
    pageContents = pageContents[0:]

    inside = 0
    text = ''

    for char in pageContents:
        if char == '<':
            inside = 1
        elif (inside == 1 and char == '>'):
            inside = 0
        elif inside == 1:
            continue
        else:
            text += char

    return text
```

Remember to save your changes before we continue.
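If you would like to check the behaviour of the modified function before moving on, a quick standalone test (written here in Python 3 syntax for convenience) confirms that it now keeps everything outside angle brackets, starting from the very beginning of the page:

```python
def stripTags(pageContents):
    # keep the whole page; copy every character that is not inside a tag
    pageContents = pageContents[0:]
    inside = 0
    text = ''
    for char in pageContents:
        if char == '<':
            inside = 1
        elif inside == 1 and char == '>':
            inside = 0
        elif inside == 1:
            continue
        else:
            text += char
    return text

print(stripTags('<p>Hello, <b>world</b></p>'))  # Hello, world
```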

Get Items from Zotero and Save Local Copy

After we have modified the obo.py file, we can create a program that
requests the top two items from a collection within a Zotero library,
retrieves their associated URLs, reads the web pages, and saves a local
copy of each. This particular program will only work on webpage-type
items with HTML content (for instance, entering the URLs of JSTOR or
Google Books pages will not result in an analysis of the actual
content).

First, create a new .py file and save it in your programming-historian
directory. Make sure your copy of the obo.py file is in the same
location. Once you have saved your file, we can begin by importing the
libraries and program data we will need to run this program:
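The import block that originally followed is missing from this copy of the lesson. Assuming the libZotero wrapper introduced in the lesson on the Zotero API (the module name may differ in your installation), it would look roughly like this:

```python
# urllib2 fetches the web pages, obo supplies our text-processing
# helpers, and zotero comes from the libZotero wrapper used in the
# Zotero API lesson
import urllib2
import obo
from libZotero import zotero
```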

Next, we need to tell our program where to find the items we will be using in
our analysis. Using the sample Zotero library from which we retrieved items in
the lesson on the Zotero API, or using your personal library, we will pull
the first two top-level items from either the library or from a specific
collection within the library. (To find your collection key, hover over the RSS
button on that collection’s page and use the second alphanumeric sequence in
the URL. If you are trying to connect to an individual user library, you must
change the word 'group' to 'user', replace the six-digit number
with your user ID, and insert your own API key.)

```python
# links to Zotero library
zlib = zotero.Library('group', '155975', '<null>', 'f4Bfk3OTYb7bukNwfcKXKNLG')

# specifies subcollection - leave blank to use whole library
collectionKey = 'I253KRDT'

# retrieves top two items from library
items = zlib.fetchItemsTop({'limit': 2, 'collectionKey': collectionKey, 'content': 'json,bib,coins'})
```

Now we can instruct our program to retrieve the URL from each of our
items, create a filename from that URL, and save a local copy of the
HTML on the page.

```python
# retrieves url from each item, creates a filename from the url, saves a local copy
for item in items:
    url = item.get('url')
    filename = url.split('/')[-1] + '.html'  # splits url at last /
    filename = filename.split('=')[-1]  # splits url at last =
    filename = filename.replace('.html.html', '.html')  # removes double .html
    print 'Saving local copy of ' + filename
    response = urllib2.urlopen(url)
    webContent = response.read()
    f = open(filename, 'w')
    f.write(webContent)
    f.close()
```

Running this portion of the program will result in the following:

Saving local copy of PastsFutures.html
Saving local copy of 29.html
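The filename logic can be tried on its own. The URLs below are hypothetical stand-ins, but the helper mirrors the lesson's three steps exactly: take the last path segment, keep anything after a final '=', and collapse a doubled '.html':

```python
def make_filename(url):
    # mirrors the lesson's logic for turning an item URL into a filename
    filename = url.split('/')[-1] + '.html'   # last path segment
    filename = filename.split('=')[-1]        # keep value after a final '='
    return filename.replace('.html.html', '.html')  # remove doubled extension

print(make_filename('http://example.org/essays/PastsFutures.html'))  # PastsFutures.html
print(make_filename('http://example.org/article.php?id=29'))         # 29.html
```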

Get Item URLs from Zotero and Count Frequencies

Now that we’ve retrieved our items and created local HTML files, we can
use the next portion of our program to retrieve the URLs, read the web
pages, create a list of words, count their frequencies, and display
them. Most of this should be familiar to you from the Counting Frequencies lesson.

```python
# retrieves url from each item, creates a filename from the url
for item in items:
    itemTitle = item.get('title')
    url = item.get('url')
    filename = url.split('/')[-1] + '.html'  # splits url at last /
    filename = filename.split('=')[-1]  # splits url at last =
    filename = filename.replace('.html.html', '.html')  # removes double .html
    print '\n' + itemTitle + '\nFilename: ' + filename + '\nWord Frequencies\n'
    response = urllib2.urlopen(url)
    html = response.read()
```

This section of code grabs the URL from our items, removes the
unnecessary portions, and creates and prints a filename. For the items
in our sample collection, the output looks something like this:

The Pasts and Futures of Digital History
Filename: PastsFutures.html
Word Frequencies
History and the Web, From the Illustrated Newspaper to Cyberspace: Visual Technologies and Interaction in the Nineteenth and Twenty-First Centuries
Filename: 29.html
Word Frequencies

Now we can go ahead and create our list of words and their frequencies.
Enter the following:
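The code block that originally followed here is missing from this copy of the lesson. Based on the Counting Frequencies lesson it builds on, the loop body would continue roughly as follows, using the helper functions from obo.py (stripNonAlphaNum, removeStopwords, wordListToFreqDict, and sortFreqDict are assumed to be defined there, as in that lesson):

```python
    # continue inside the for loop from the previous block
    text = obo.stripTags(html).lower()
    fullwordlist = obo.stripNonAlphaNum(text)
    wordlist = obo.removeStopwords(fullwordlist, obo.stopwords)
    dictionary = obo.wordListToFreqDict(wordlist)
    sorteddict = obo.sortFreqDict(dictionary)
    for s in sorteddict:
        print str(s)
```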

About the author

Spencer Roberts is Research Assistant and former Digital History Research Fellow at the Roy Rosenzweig Center for History and New Media, and a Ph.D. graduate student at George Mason University in the Department of History.