In the tutorial I presented at PyCon 2006 (called Text & Data
Processing), I was surprised at the reaction to some techniques I
used that I had thought were common knowledge. But many of the
attendees were unaware of these tools that experienced Python
programmers use without thinking.

Many of you will have seen some of these techniques and idioms
before. Hopefully you'll learn a few techniques that you haven't
seen before and maybe something new about the ones you have already
seen.

A PEP is a design document providing information to the Python
community, or describing a new feature for Python or its processes
or environment.

The Python community has its own standards for what source code
should look like, codified in PEP 8. These standards are different
from those of other communities, like C, C++, C#, Java,
VisualBasic, etc.

Because indentation and whitespace are so important in Python, the
Style Guide for Python Code approaches a standard. It would be
wise to adhere to the guide! Most open-source projects and
(hopefully) in-house projects follow the style guide quite
closely.

But try to avoid the __private form. I never use it.
Trust me. If you use it, you WILL regret it later.

Explanation:

People coming from a C++/Java background are especially prone to
overusing/misusing this "feature". But __private names don't
work the same way as in Java or C++. They just trigger a name
mangling whose purpose is to prevent accidental namespace
collisions in subclasses: MyClass.__private just becomes
MyClass._MyClass__private. (Note that even this breaks down
for subclasses with the same name as the superclass,
e.g. subclasses in different modules.) It is possible to
access __private names from outside their class, just
inconvenient and fragile (it adds a dependency on the exact name
of the superclass).
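
For example, here's a quick sketch of the mangling in action (the
class and attribute names are made up):

class MyClass(object):
    def __init__(self):
        self.__private = 42    # stored as self._MyClass__private

obj = MyClass()
obj._MyClass__private          # 42 -- accessible from outside, but fragile
obj.__private                  # raises AttributeError (no mangling out here)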

The problem is that the author of a class may legitimately think
"this attribute/method name should be private, only accessible
from within this class definition" and use the __private
convention. But later on, a user of that class may make a
subclass that legitimately needs access to that name. So either
the superclass has to be modified (which may be difficult or
impossible), or the subclass code has to use manually mangled
names (which is ugly and fragile at best).

There's a concept in Python: "we're all consenting adults here".
If you use the __private form, who are you protecting the
attribute from? It's the responsibility of subclasses to use
attributes from superclasses properly, and it's the
responsibility of superclasses to document their attributes
properly.

It's better to use the single-leading-underscore convention,
_internal. This isn't name mangled at all; it just
indicates to others to "be careful with this, it's an internal
implementation detail; don't touch it if you don't fully
understand it". It's only a convention though.

That's because this automatic concatenation is a feature of the
Python parser/compiler, not the interpreter. You must use the "+"
operator to concatenate strings at run time.

text = ('Long strings can be made up '
        'of several shorter strings.')

The parentheses allow implicit line continuation.

Multiline strings use triple quotes:

"""Triple
double
quotes"""

'''\
Triple
single
quotes\
'''

In the last example above (triple single quotes), note how the
backslashes are used to escape the newlines. This eliminates extra
newlines, while keeping the text and quotes nicely left-justified.
The backslashes must be at the end of their lines.

Whitespace & indentations are useful visual indicators of the
program flow. The indentation of the second "Good" line above
shows the reader that something's going on, whereas the lack of
indentation in "Bad" hides the "if" statement.

Multiple statements on one line are a cardinal sin. In Python,
readability counts.

But most importantly: know when to be inconsistent -- sometimes
the style guide just doesn't apply. When in doubt, use your
best judgment. Look at other examples and decide what looks
best. And don't hesitate to ask!

Two good reasons to break a particular rule:

When applying the rule would make the code less readable,
even for someone who is used to reading code that follows
the rules.

To be consistent with surrounding code that also breaks it
(maybe for historic reasons) -- although this is also an
opportunity to clean up someone else's mess (in true XP
style).

We want to join all the strings together into one large string.
How we build that string matters, especially when the number of
substrings is large.

Don't do this:

result = ''
for s in colors:
    result += s

This is very inefficient.

It has terrible memory usage and performance patterns. The
"summation" will compute, store, and then throw away each
intermediate step.

Instead, do this:

result = ''.join(colors)

The join() string method does all the copying in one pass.

When you're only dealing with a few dozen or a few hundred strings,
it won't make much difference. But get in the habit of building
strings efficiently, because with thousands of strings, or inside
loops, it will make a difference.

To make a nicely grammatical sentence, we want commas between all
but the last pair of values, where we want the word "or". The
slice syntax does the job. The "slice until -1" ([:-1]) gives
all but the last value, which we join with comma-space.

Of course, this code doesn't handle the corner cases: lists of
length 0 or 1.
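
For example, a quick interactive sketch with a made-up list of
colors:

>>> colors = ['red', 'green', 'blue', 'yellow']
>>> print ', '.join(colors[:-1]) + ' or ' + colors[-1]
red, green, blue or yellow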

The setdefault dictionary method returns the default value, but
we ignore it here. We're taking advantage of setdefault's side
effect, that it sets the dictionary value only if there is no value
already.
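
A small sketch of the setdefault idiom (the data here is made up):

equipment = {}
for name, item in [('knight', 'sword'), ('archer', 'bow'),
                   ('knight', 'shield')]:
    equipment.setdefault(name, []).append(item)

# equipment == {'knight': ['sword', 'shield'], 'archer': ['bow']}
# (the dictionary's key order may vary)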

You should be careful with defaultdict though. You cannot get
KeyError exceptions from properly initialized defaultdict
instances. You have to use a "key in dict" conditional if you need
to check for the existence of a specific key.
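
A comparable sketch with defaultdict (assuming Python 2.5 or later,
where collections.defaultdict is available):

from collections import defaultdict

equipment = defaultdict(list)
equipment['knight'].append('sword')   # a missing key silently gets a new []
equipment['knight'].append('shield')
'archer' in equipment                 # False -- test membership explicitly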

Note that the order of the results of .keys() and .values() is
different from the order of items when constructing the dictionary.
The order going in is different from the order coming out. This is
because a dictionary is inherently unordered. However, the order
is guaranteed to be consistent (in other words, the order of keys
will correspond to the order of values), as long as the dictionary
isn't changed between calls.
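
For example (the key order shown is only illustrative; it depends on
the dictionary's internals):

d = {'red': 1, 'green': 2, 'blue': 3}
d.keys()     # e.g. ['blue', 'green', 'red'] -- not the construction order
d.values()   # e.g. [3, 2, 1] -- but always in the same order as d.keys()
d.items() == zip(d.keys(), d.values())   # True, as long as d isn't modified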

We need to use a list wrapper to print the result because
enumerate is a lazy function: it generates one item, a pair, at
a time, only when required. A for loop is one place that
requires one result at a time. enumerate is an example of a
generator, which we'll cover in greater detail later. print
does not take one result at a time -- we want the entire result, so
we have to explicitly convert the generator into a list when we
print it.
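
For example, with a small made-up list:

>>> colors = ['red', 'green', 'blue']
>>> enumerate(colors)
<enumerate object at 0x...>
>>> print list(enumerate(colors))
[(0, 'red'), (1, 'green'), (2, 'blue')]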

The problem here is that the default value of a_list, an empty
list, is evaluated at function definition time. So every time you
call the function, you get the same default value. Try it
several times:

>>> print bad_append('one')
['one']

>>> print bad_append('two')
['one', 'two']

Lists are mutable objects; you can change their contents. The
correct way to get a default list (or dictionary, or set) is to
create it at run time instead, inside the function:
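
Here's a sketch of that correct pattern, using a hypothetical
good_append as the counterpart to bad_append above:

def good_append(new_item, a_list=None):
    if a_list is None:
        a_list = []            # a fresh list is created on every call
    a_list.append(new_item)
    return a_list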

Although if you don't know C, that's not very helpful. Basically,
you provide a template or format and interpolation values.

In this example, the template contains two conversion
specifications: "%s" means "insert a string here", and "%i" means
"convert an integer to a string and insert here". "%s" is
particularly useful because it uses Python's built-in str()
function to convert any object to a string.

The interpolation values must match the template; we have two
values here, a tuple.
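
For example (the template and values are made up):

>>> template = '%s costs %i dollars.'
>>> print template % ('A sandwich', 5)
A sandwich costs 5 dollars.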

If you haven't done it already, go to python.org, download the HTML
documentation (in a .zip file or a tarball), and install it on your
machine. There's nothing like having the definitive resource at
your fingertips.

The locals() function returns a dictionary of all
locally-available names.

This is very powerful. With this, you can do all the string
formatting you want without having to worry about matching the
interpolation values to the template.
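
For example (the variable names here are just for illustration):

>>> name = 'Python'
>>> year = 1991
>>> print '%(name)s first appeared in %(year)i.' % locals()
Python first appeared in 1991.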

But power can be dangerous. ("With great power comes great
responsibility.") If you use the locals() form with an
externally-supplied template string, you expose your entire local
namespace to the caller. This is just something to keep in mind.

To examine your local namespace:

>>> from pprint import pprint
>>> pprint(locals())

pprint is a very useful module. If you don't know it already,
try playing with it. It makes debugging your data structures much
easier!

List comprehensions ("listcomps" for short) are syntax shortcuts
for this general pattern:

The traditional way, with for and if statements:

new_list = []
for item in a_list:
    if condition(item):
        new_list.append(fn(item))

As a list comprehension:

new_list = [fn(item) for item in a_list
            if condition(item)]

Listcomps are clear & concise, up to a point. You can have
multiple for-loops and if-conditions in a listcomp, but
beyond two or three total, or if the conditions are complex, I
suggest that regular for loops should be used. Applying the
Zen of Python, choose the more readable way.

We can use the sum function to quickly do the work for us, by
building the appropriate sequence.

As a list comprehension:

total = sum([num * num for num in range(1, 101)])

As a generator expression:

total = sum(num * num for num in xrange(1, 101))

Generator expressions ("genexps") are just like list
comprehensions, except that where listcomps are greedy, generator
expressions are lazy. Listcomps compute the entire result list all
at once, as a list. Generator expressions compute one value at a
time, when needed, as individual values. This is especially useful
for long sequences where the computed list is just an intermediate
step and not the final result.

In this case, we're only interested in the sum; we don't need the
intermediate list of squares. We use xrange for the same
reason: it lazily produces values, one at a time.

For example, if we were summing the squares of several billion
integers, we'd run out of memory with list comprehensions, but
generator expressions have no problem. This does take time,
though!

total = sum(num * num
            for num in xrange(1, 1000000000))

The difference in syntax is that listcomps have square brackets,
but generator expressions don't. Generator expressions sometimes
do require enclosing parentheses though, so you should always use
them.

Rule of thumb:

Use a list comprehension when a computed list is the desired end
result.

Use a generator expression when the computed list is just an
intermediate step.

Here's a recent example I saw at work.

We needed a dictionary mapping month numbers (both as strings and as
integers) to month codes for futures contracts. It can be done in
one logical line of code (sketched after the explanation below).

The way this works is as follows:

The dict() built-in takes a list of key/value pairs
(2-tuples).

We have the month codes as a string (each month code is a single
letter, and a string is just a sequence of characters). We
enumerate over this string to get both the month code and its
index.

The month numbers start at 1, but Python starts indexing at 0, so
the month number is one more than the index.

We want to look up months both as strings and as integers. We
can use the int() and str() functions to do this for us,
and loop over them.
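
Putting that together, here's a sketch of the one-liner (assuming
the standard futures-contract month codes, 'F' for January through
'Z' for December):

month_codes = dict((fn(i + 1), code)
                   for i, code in enumerate('FGHJKMNQUVXZ')
                   for fn in (int, str))

# month_codes[1] == 'F', month_codes['1'] == 'F',
# month_codes[12] == 'Z', month_codes['12'] == 'Z'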

(Note that the list is sorted in-place: the original list is
sorted, and the sort method does not return the list or a
copy.)

But what if you have a list of data that you need to sort, and the
natural sort order (first column, then second column, and so on)
isn't the order you want? For example, you may need to sort on the
second column first, then the fourth column.

The first line creates a list containing tuples: copies of the sort
terms in priority order, followed by the complete data record.

The second line does a native Python sort, which is very fast and
efficient.

The third line retrieves the last item from each tuple in the
sorted list. Remember, this last item is the complete data record.
We're throwing away the sort terms, which have done their job and
are no longer needed.
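
Here's a minimal sketch of those three lines, assuming records
indexed by column and a sort on the second column, then the fourth:

# Decorate: the sort terms come first, the complete record last.
auxiliary = [(record[1], record[3], record) for record in data]
# Sort: a plain, fast, native sort on the decorated tuples.
auxiliary.sort()
# Undecorate: keep only the complete record from each tuple.
data = [record for (second, fourth, record) in auxiliary]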

This is a tradeoff of space and complexity against time. Much
simpler and faster, but we do need to duplicate the original list.

The yield keyword turns a function into a generator. When you
call a generator function, instead of running the code immediately
Python returns a generator object, which is an iterator; it has a
next method. for loops just call the next method on
the iterator, until a StopIteration exception is raised. You
can raise StopIteration explicitly, or implicitly by falling
off the end of the generator code as above.

Generators can simplify sequence/iterator handling, because we
don't need to build concrete lists; just compute one value at a
time. The generator function maintains state.
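
Here's a minimal generator sketch (the function name and values are
made up):

def count_up_to(n):
    """Yield the integers 0 through n-1, one at a time."""
    i = 0
    while i < n:
        yield i        # execution pauses here until the next value is requested
        i += 1

for number in count_up_to(3):
    print number       # prints 0, then 1, then 2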

This is how a for loop really works. Python looks at the
sequence supplied after the in keyword. If it's a simple
container (such as a list, tuple, dictionary, set, or user-defined
container) Python converts it into an iterator. If it's already an
iterator, Python uses it directly.

Then Python repeatedly calls the iterator's next method,
assigns the return value to the loop counter (i in this case),
and executes the indented code. This is repeated over and over,
until StopIteration is raised, or a break statement is
executed in the code.
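
Here's the same machinery done by hand in the interactive
interpreter (Python 2's iterator protocol):

>>> iterator = iter(['red', 'green', 'blue'])
>>> iterator.next()
'red'
>>> iterator.next()
'green'
>>> iterator.next()
'blue'
>>> iterator.next()
Traceback (most recent call last):
  ...
StopIteration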

A for loop can have an else clause, whose code is executed
after the iterator runs dry, but not after a break
statement is executed. This distinction allows for some elegant
uses. else clauses are not always or often used on for
loops, but they can come in handy. Sometimes an else clause
perfectly expresses the logic you need.

For example, if we need to check that a condition holds on some
item, any item, in a sequence:

for item in sequence:
    if condition(item):
        break
else:
    raise Exception('Condition not satisfied.')

You can wrap exception-prone code in a try/except block to
catch the errors, and you will probably end up with a solution
that's much more general than if you had tried to anticipate every
possibility.
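
For example, a small sketch of this approach (the function name and
fallback value are made up):

def to_int(text, default=0):
    """Convert text to an integer, falling back on a default value."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return default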

LUKE: But how will I know why explicit imports are better than
the wild-card form?

YODA: Know you will when your code you try to read six months
from now.

Wild-card imports are from the dark side of Python.

Never!

The from module import * wild-card style leads to namespace
pollution. You'll get things in your local namespace that you
didn't expect to get. You may see imported names obscuring
module-defined local names. You won't be able to figure out where
certain names come from. Although a convenient shortcut, this
should not be in production code.

Moral: don't use wild-card imports!

Instead,

Reference names through their module (fully qualified identifiers):

import module
module.name

Or import a long module using a shorter name (alias):

import long_module_name as mod
mod.name

Or explicitly import just the names you need:

from module import name
name

Note that this form doesn't lend itself to use in the interactive
interpreter, where you may want to edit and "reload()" a module.

When imported, a module's __name__ attribute is set to the
module's file name, without ".py". So the code guarded by the
if statement above will not run when imported. When executed
as a script though, the __name__ attribute is set to
"__main__", and the script code will run.

Except for special cases, you shouldn't put any major executable
code at the top level of a module. Put code in functions, classes,
and methods, and guard it with if __name__ == '__main__'.
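
A minimal sketch of that layout (the names are hypothetical):

def main():
    print 'Running as a script.'

if __name__ == '__main__':
    main()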