This lesson was written using Python v. 2.x. Code may not be compatible with newer versions of Python. “Python Introduction and Installation” provides instructions for how you can install Python 2.x alongside newer versions.

This lesson is part of a series. You might want to check out the previous lesson.

Lesson Goals

Your list is now clean enough that you can begin analyzing its contents
in meaningful ways. Counting the frequency of specific words in the list
can provide illustrative data. Python has an easy way to count
frequencies, but it requires the use of a new type of variable: the
dictionary. Before you begin working with a dictionary, consider the
processes used to calculate frequencies in a list.

Files Needed For This Lesson

obo.py

If you do not have this file, you can download a zip file containing
all of the code from the previous lessons in this series.

Frequencies

Now we want to count the frequency of each word in our list. You’ve
already seen that it is easy to process a list by using a for loop. Try
saving and executing the following example. Recall that += tells the
program to append something to the end of an existing variable.

# count-list-items-1.py

wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'

wordlist = wordstring.split()

wordfreq = []
for w in wordlist:
    wordfreq.append(wordlist.count(w))

print("String\n" + wordstring + "\n")
print("List\n" + str(wordlist) + "\n")
print("Frequencies\n" + str(wordfreq) + "\n")
# list() makes the pairs display under Python 3 too, where zip returns an iterator
print("Pairs\n" + str(list(zip(wordlist, wordfreq))))

Here, we start with a string and split it into a list, as we’ve done
before. We then create an (initially empty) list called wordfreq, go
through each word in the wordlist, and count the number of times that
word appears in the whole list. We then add each word’s count to our
wordfreq list. Using the zip operation, we are able to match the first
word of the word list with the first number in the frequency list, the
second word and second frequency, and so on. We end up with a list of
word and frequency pairs. The str function converts any object to a
string so that it can be printed.

It will pay to study the above code until you understand it before
moving on.
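
To see zip in isolation, here is a minimal sketch on two short lists. The names words and counts are illustrative and are not part of the lesson's files. The call to list() is an addition so that the pairs also display under Python 3, where zip returns an iterator rather than a list.

```python
# zip pairs items by position: first with first, second with second, and so on.
words = ['it', 'was', 'the', 'best']
counts = [4, 4, 4, 1]

pairs = list(zip(words, counts))
print(pairs)
# -> [('it', 4), ('was', 4), ('the', 4), ('best', 1)]
```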

Python also includes a very convenient tool called a list
comprehension, which can be used to do the same thing as the for loop
more economically.

# count-list-items-1.py

wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'

wordlist = wordstring.split()

wordfreq = [wordlist.count(w) for w in wordlist]  # a list comprehension

print("String\n" + wordstring + "\n")
print("List\n" + str(wordlist) + "\n")
print("Frequencies\n" + str(wordfreq) + "\n")
# list() makes the pairs display under Python 3 too, where zip returns an iterator
print("Pairs\n" + str(list(zip(wordlist, wordfreq))))

If you study this list comprehension carefully, you will discover that
it does exactly the same thing as the for loop in the previous example,
but in a condensed manner. Either method will work fine, so use the
version that you are most comfortable with.

Generally it is wise to use code you understand rather than code that runs quickest.

At this point we have a list of pairs, where each pair contains a word
and its frequency. This list is a bit redundant. If ‘the’ occurs 500
times, then this list contains five hundred copies of the pair (‘the’,
500). The list is also ordered by the words in the original text, rather
than listing the words in order from most to least frequent. We can
solve both problems by converting it into a dictionary, then printing
out the dictionary in order from the most to the least commonly
occurring item.
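
The redundancy described above can be checked directly on the sample sentence. This is only a sketch: it uses set, which this lesson does not otherwise cover, purely to count the distinct pairs.

```python
wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'
wordlist = wordstring.split()

# One (word, frequency) pair for every word in the text, repeats included.
pairs = list(zip(wordlist, [wordlist.count(w) for w in wordlist]))

print(len(pairs))                # one pair per word in the text
# -> 24
print(pairs.count(('the', 4)))   # 'the' occurs 4 times, so its pair appears 4 times
# -> 4
print(len(set(pairs)))           # number of distinct pairs
# -> 10
```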

Python Dictionaries

Both strings and lists are sequentially ordered, which means that you
can access their contents by using an index, a number that starts at 0.
If you have a list containing strings, you can use a pair of indexes to
access first a particular string in the list, and then a particular
character within that string. Study the examples below.
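
A minimal sketch of indexing, assuming an illustrative string named s and a list named m (both names are chosen here for demonstration only):

```python
s = 'hello world'
print(s[0])      # first character of the string
# -> h
print(s[1])      # second character
# -> e

m = ['hello', 'world']
print(m[0])      # first string in the list
# -> hello
print(m[1])      # second string in the list
# -> world
print(m[0][1])   # second character of the first string
# -> e
print(m[1][0])   # first character of the second string
# -> w
```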

To keep track of frequencies, we’re going to use another type of Python
object, a dictionary. In Python 2.x the dictionary is an unordered
collection of objects (newer versions of Python remember the order in
which items were added, but a position still cannot be used for lookup).
That means that you can’t use an index to retrieve elements
from it. You can, however, look them up by using a key (hence the name
“dictionary”). Study the following example.
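
A minimal sketch of a dictionary named d, in which the key 'hello' is associated with the value 0. The second entry, 'world': 1, is an assumption added for illustration. The keys are passed through sorted() here only so that the printed order is the same on every version of Python.

```python
# Curly braces define the dictionary; each entry is written key: value.
d = {'world': 1, 'hello': 0}

# Square brackets look up the value stored under a key.
print(d['hello'])
# -> 0
print(d['world'])
# -> 1

# keys() gives the keys defined in the dictionary.
print(sorted(d.keys()))
# -> ['hello', 'world']
```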

Dictionaries might be a bit confusing to a new programmer. Try to think
of a Python dictionary as being like a language dictionary. If you don’t
know (or remember) exactly how “bijection” differs from “surjection”, you
can look the two terms up in the Oxford English Dictionary. The same
principle applies when you print(d['hello']), except that, rather than
printing a literary definition, it prints the value associated with the
key ‘hello’, as defined by you when you created the dictionary named d.
In this case, that value is 0.

Note that you use curly braces to define a dictionary, but square
brackets to access things within it. The keys operation returns a list
of keys that are defined in the dictionary.

Word-Frequency Pairs

Building on what we have so far, we want a function that can convert a
list of words into a dictionary of word-frequency pairs. The only new
command that we will need is dict, which makes a dictionary from a list
of pairs. Copy the following and add it to the obo.py module.

# Given a list of words, return a dictionary of
# word-frequency pairs.

def wordListToFreqDict(wordlist):
    wordfreq = [wordlist.count(p) for p in wordlist]
    return dict(zip(wordlist, wordfreq))

We are also going to want a function that can sort a dictionary of
word-frequency pairs by descending frequency. Copy this and add it to
the obo.py module, too.

# Sort a dictionary of word-frequency pairs in
# order of descending frequency.

def sortFreqDict(freqdict):
    aux = [(freqdict[key], key) for key in freqdict]
    aux.sort()
    aux.reverse()
    return aux
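
To check the two functions together before using them on a web page, here is a sketch that runs them on the sample sentence from earlier in this lesson. The functions are repeated in the block so that it runs on its own; the variable names freqdict and sortedpairs are illustrative.

```python
def wordListToFreqDict(wordlist):
    wordfreq = [wordlist.count(p) for p in wordlist]
    return dict(zip(wordlist, wordfreq))

def sortFreqDict(freqdict):
    aux = [(freqdict[key], key) for key in freqdict]
    aux.sort()
    aux.reverse()
    return aux

wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'
wordlist = wordstring.split()

# Each distinct word now appears once, as a key, with its count as the value.
freqdict = wordListToFreqDict(wordlist)
print(freqdict['times'])
# -> 2

# The sorted result is a list of (frequency, word) pairs, most frequent first.
# Words with equal frequency come out in reverse alphabetical order.
sortedpairs = sortFreqDict(freqdict)
print(sortedpairs[:4])
# -> [(4, 'was'), (4, 'the'), (4, 'of'), (4, 'it')]
```

Note that sortFreqDict puts the frequency first in each pair, because sorting a list of pairs compares the first element before the second.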

We can now write a program which takes a URL and returns word-frequency
pairs for the web page, sorted in order of descending frequency. Copy
the following program into Komodo Edit, save it as html-to-freq.py and
execute it. Study the program and its output carefully before
continuing.

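The most frequent words in the output are likely to be function words
such as ‘the’, ‘of’ and ‘and’. These words are usually the most common in
any English-language text, so they don’t tell us much that is
distinctive about Bowsey’s trial. In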
general, we are more interested in finding the words that will help us
differentiate this text from texts that are about different subjects. So
we’re going to filter out the common function words. Words that are
ignored like this are known as stop words. We’re going to use the
following list, adapted from one posted online by computer scientists
at Glasgow. Copy it and put it at the beginning of the obo.py
library that you are building.
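
The idea of filtering can be sketched with a very small stop list. Everything named here is an assumption for illustration: sample_stopwords is a four-word stand-in and not the Glasgow-derived list, and removeStopwords is a hypothetical helper name.

```python
# A tiny illustrative stop list; NOT the full list the lesson refers to.
sample_stopwords = ['it', 'was', 'the', 'of']

# Hypothetical helper: keep only the words that are not in the stop list.
def removeStopwords(wordlist, stopwords):
    return [w for w in wordlist if w not in stopwords]

wordlist = 'it was the best of times it was the worst of times'.split()
print(removeStopwords(wordlist, sample_stopwords))
# -> ['best', 'times', 'worst', 'times']
```

Filtering before counting means that the frequency dictionary contains only the words that are more likely to be distinctive of the text.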

Code Syncing

To follow along with future lessons it is important that you have the
right files and programs in your “programming-historian” directory. At
the end of each lesson in this series you can download the “programming-historian” zip
file to make sure you have the correct code.