Questions of Style

Programming is as much an art as a science. The undisputed “bible”
of programming, a 2,500-page multivolume work by Donald Knuth, is called
The Art of Computer Programming. Many books have
been written on Literate Programming, recognizing
that humans, not just computers, must read and understand programs. Here
we pick up on some issues of programming style that have important
ramifications for the readability of your code, including code layout,
procedural versus declarative style, and the use of loop
variables.

Python Coding Style

When writing programs you make many subtle choices about names,
spacing, comments, and so on. When you look at code written by other
people, needless differences in style make it harder to interpret the
code. Therefore, the designers of the Python language have published a
style guide for Python code, available at http://www.python.org/dev/peps/pep-0008/. The
underlying value presented in the style guide is
consistency, for the purpose of maximizing the
readability of code. We briefly review some of its key recommendations
here, and refer readers to the full guide for detailed discussion with
examples.

Code layout should use four spaces per indentation level. When you
write Python code in a file, avoid tabs for indentation, since
different text editors can interpret them differently and corrupt the
indentation. Lines should be less than 80 characters long; if
necessary, you can break a line inside parentheses, brackets, or
braces, because Python detects that the line continues onto the next
one, as in the following examples:
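As a minimal sketch of such line breaks (the sample data and the names words, ha_words, and cv_word_pairs are illustrative assumptions, not taken from the style guide):

```python
import re

# Hypothetical sample data; any list of strings would do.
words = ['kaapo', 'kapua', 'tuari', 'rovi', 'viapau']

# A list literal broken inside square brackets.
ha_words = ['aaahhhh', 'ah', 'ahah', 'ahahah', 'ahh',
            'ahhh', 'ahhhh', 'ha', 'hah', 'haha']

# A list comprehension broken inside square brackets.
cv_word_pairs = [(cv, w) for w in words
                 for cv in re.findall(r'[ptksvr][aeiou]', w)]

# A condition broken inside parentheses.
if (len(words) > 4 and len(words[2]) == 5 and
        words[2][0] in 'ptksvr' and words[2][1] in 'aeiou'):
    print(cv_word_pairs[:3])
```

In each case the open bracket or parenthesis tells Python that the statement is not yet finished, so no backslash is needed at the end of the line.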

Note

Typing spaces instead of tabs soon becomes a chore. Many
programming editors have built-in support for Python, and can
automatically indent code and highlight any syntax errors (including
indentation errors). For a list of Python-aware editors, please see
http://wiki.python.org/moin/PythonEditors.

Procedural Versus Declarative Style

We have just seen how the same task can be performed in
different ways, with implications for efficiency. Another factor
influencing program development is programming
style. Consider the following program to compute the
average length of words in the Brown Corpus:
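A minimal sketch of such a program follows. To keep it self-contained, a short hand-written token list stands in for the corpus (an assumption made here); with NLTK and its data installed, tokens could instead come from nltk.corpus.brown.words(categories='news').

```python
# Stand-in for the corpus tokens (illustrative sample, not real corpus output).
tokens = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday']

count = 0   # number of tokens seen so far
total = 0   # combined length of all tokens seen so far
for token in tokens:
    count += 1
    total += len(token)

average = total / count
print(average)
```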

In this program we use the variable count to keep track of the number of tokens seen, and
total to store the combined length
of all words. This is a low-level style, not far removed from machine
code, the primitive operations performed by the computer’s CPU. The
two variables are just like a CPU’s registers, accumulating values at
many intermediate stages, values that are meaningless until the end.
We say that this program is written in a
procedural style, dictating the machine
operations step by step. Now consider the following program that
computes the same thing:
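A sketch of the more compact version, again using a small illustrative token list in place of the corpus (an assumption made here for self-containment):

```python
# Stand-in for the corpus tokens (illustrative sample).
tokens = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday']

total = sum(len(t) for t in tokens)   # sum of the token lengths
average = total / len(tokens)         # average word length
print(average)
```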

The first line uses a generator expression to sum the token
lengths, while the second line computes the average as before. Each
line of code performs a complete, meaningful task, which can be
understood in terms of high-level properties like: “total is the sum of the lengths of the
tokens.” Implementation details are left to the Python interpreter.
The second program uses a built-in function, and constitutes
programming at a more abstract level; the resulting code is more
declarative. Let’s look at an extreme example:
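The following sketch shows what such an extreme procedural program might look like: it builds a sorted list of the distinct tokens using only explicit counters and comparisons. The sample tokens list is an illustrative assumption.

```python
# Illustrative sample input.
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']

word_list = []
i = 0
while i < len(tokens):
    j = 0
    # Advance j to the position where tokens[i] belongs in sorted order.
    while j < len(word_list) and word_list[j] <= tokens[i]:
        j += 1
    # Insert the token unless an identical entry already sits just before j.
    if j == 0 or tokens[i] != word_list[j - 1]:
        word_list.insert(j, tokens[i])
    i += 1

print(word_list)
```

The purpose of this code is hard to see at a glance, even though every individual step is simple.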

The equivalent declarative version uses familiar built-in
functions, and its purpose is instantly recognizable:

>>> word_list = sorted(set(tokens))

Another case where a loop counter seems to be necessary is for
printing a counter with each line of output. Instead, we can use
enumerate(), which processes a
sequence s and produces a tuple of
the form (i, s[i]) for each item in
s, starting with (0, s[0]). Here we enumerate the keys of the
frequency distribution, and capture the integer-string pair in the
variables rank and word. We print rank+1 so that the counting appears to start
from 1, as required when producing
a list of ranked items.
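A sketch of this use of enumerate() follows. To stay self-contained it uses collections.Counter as a stand-in for an NLTK frequency distribution, a tiny sample token list, and an arbitrary cutoff of 50% cumulative frequency; all three are assumptions made for illustration.

```python
from collections import Counter

# Illustrative sample; Counter stands in for nltk.FreqDist here.
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'cat']
fd = Counter(tokens)
n = sum(fd.values())

# Words ordered from most to least frequent.
ranked_words = [word for word, _ in fd.most_common()]

cumulative = 0.0
lines = []
for rank, word in enumerate(ranked_words):
    cumulative += fd[word] / n
    # rank starts at 0, so print rank + 1 for a human-readable ranking.
    lines.append("%3d %6.2f%% %s" % (rank + 1, cumulative * 100, word))
    if cumulative > 0.5:
        break

print("\n".join(lines))
```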

Note that our first solution found the first word having the
longest length, while the second solution found
all of the longest words (which is usually what
we would want). Although there’s a theoretical efficiency difference
between the two solutions, the main overhead is reading the data into
main memory; once it’s there, a second pass through the data is
effectively instantaneous. We also need to balance our concerns about
program efficiency with programmer efficiency. A fast but cryptic
solution will be harder to understand and maintain.
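The contrast between the two approaches to finding the longest word can be sketched as follows. The sample text and the variable names are assumptions made for illustration.

```python
# Illustrative sample text.
text = ['a', 'quick', 'brown', 'fox', 'jumps']

# Loop-variable solution: remembers only the first word of maximal length.
longest = ''
for word in text:
    if len(word) > len(longest):
        longest = word

# Two-pass solution: find the maximal length, then collect every word
# that has that length.
maxlen = max(len(word) for word in text)
all_longest = [word for word in text if len(word) == maxlen]

print(longest)
print(all_longest)
```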

Some Legitimate Uses for Counters

There are cases where we still want to use loop variables in a
list comprehension. For example, we need to use a loop variable to
extract successive overlapping n-grams from a list:
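A minimal sketch, assuming a sample sentence sent and n = 3:

```python
# Illustrative sample sentence.
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
n = 3

# The last valid start index is len(sent) - n, hence the "+ 1" in range().
ngrams = [sent[i:i + n] for i in range(len(sent) - n + 1)]

for gram in ngrams:
    print(gram)
```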

It is quite tricky to get the range of the loop variable right.
Since this is a common operation in NLP, NLTK supports it with the
functions bigrams(text) and trigrams(text), and a general-purpose
ngrams(text, n).

Loop variables are also useful for building multidimensional
structures. For example, to build an array with
m rows and n columns, where
each cell is a set, we could use a nested list comprehension:
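A minimal sketch, assuming m = 3 and n = 7 and an arbitrary value added to one cell:

```python
m, n = 3, 7

# The inner comprehension builds one row of n fresh sets;
# the outer comprehension repeats that m times.
array = [[set() for i in range(n)] for j in range(m)]

# Each cell is an independent set, so this changes only one cell.
array[2][5].add('Alice')
print(array[2][5], array[0][5])
```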

Observe that the loop variables i and j
are not used anywhere in the resulting object; they are just needed
for a syntactically correct for
statement. As another example of this usage, observe that the
expression ['very' for i in
range(3)] produces a list containing three instances of
'very', with no integers in
sight.

Note that it would be incorrect to build this structure using list
multiplication, which repeats references to a single object instead of
creating independent copies, as discussed earlier in this section.
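The problem can be seen in a short sketch, assuming the same dimensions m = 3 and n = 7:

```python
m, n = 3, 7

# Multiplication repeats references: there is only ONE set and ONE row here.
array = [[set()] * n] * m

array[2][5].add(7)

# The value shows up in every cell, because all cells are the same object.
print(array[0][0])
print(array[0] is array[1])
```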