Data Science from Scratch

A Crash Course in Python

People are still crazy about Python after twenty-five years, which I find hard to believe.

Michael Palin

All new employees at DataSciencester are required to go through new employee orientation, the most interesting part of which is a crash course in Python.

This is not a comprehensive Python tutorial but instead is intended to highlight the parts of the language that will be most important to us (some of which are often not the focus of Python tutorials).

The Basics

Getting Python

You can download Python from python.org.
But if you don’t already have Python, I recommend instead installing the Anaconda distribution, which already includes most of the libraries that you need to do data science.

As I write this, the latest version of Python is 3.4.
At DataSciencester, however, we use old, reliable Python 2.7.
Python 3 is not backward-compatible with Python 2,
and many important libraries only work well with 2.7.
The data science community is still firmly stuck on 2.7, which means we will be, too.
Make sure to get that version.

If you don’t get Anaconda, make sure to install
pip, which is a Python package manager that allows you to easily install third-party packages
(some of which we’ll need). It’s also worth getting IPython,
which is a much nicer Python shell to work with.

(If you installed Anaconda then it should have come with pip and IPython.)

Just run:

pip install ipython

and then search the Internet for solutions to whatever cryptic error messages that causes.

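The Zen of Python

Python has a somewhat Zen description of its design principles, which you can find inside the interpreter itself by typing import this. One of the most discussed of these is that there should be one, and preferably only one, obvious way to do it.

Code written in accordance with this "obvious" way (which may not be obvious at all to a newcomer) is often described as "Pythonic." Although this is not a book about Python, we will occasionally contrast Pythonic and non-Pythonic ways of accomplishing the same things, and we will generally favor Pythonic solutions to our problems.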

Whitespace Formatting

Many languages use curly braces to delimit blocks of code. Python uses indentation:

for i in [1, 2, 3, 4, 5]:
    print i                     # first line in "for i" block
    for j in [1, 2, 3, 4, 5]:
        print j                 # first line in "for j" block
        print i + j             # last line in "for j" block
    print i                     # last line in "for i" block
print "done looping"

This makes Python code very readable, but it also means that you have to be very careful with your formatting. Whitespace is ignored inside parentheses and brackets, which can be helpful for long-winded computations:
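As a small illustration (the variable names are made up for this example), a long sum or a nested list can be spread over several lines as long as it sits inside parentheses or brackets:

```python
# whitespace, including newlines, is ignored inside parentheses
long_winded_computation = (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 +
                           13 + 14 + 15 + 16 + 17 + 18 + 19 + 20)

# the same holds inside brackets, which can make nested lists easier to read
list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

easier_to_read_list_of_lists = [[1, 2, 3],
                                [4, 5, 6],
                                [7, 8, 9]]
```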

You can also use a backslash to indicate that a statement
continues onto the next line, although we’ll rarely do this:

two_plus_three = 2 + \
                 3

One consequence of whitespace formatting is that it can be hard to copy and paste code into the Python shell. For example, if you tried to paste the code:

for i in [1, 2, 3, 4, 5]:

    # notice the blank line
    print i

into the ordinary Python shell, you would get a:

IndentationError: expected an indented block

because the interpreter thinks the blank line signals the end of the for loop’s block.

IPython has a magic function %paste, which correctly pastes whatever is on your clipboard, whitespace and all. This alone is a good reason to use IPython.

Modules

Certain features of Python are not loaded by default. These include both features that are part of the language and third-party features that you download yourself. In order to use these features, you’ll need to import the modules that contain them.

One approach is to simply import the module itself:

import re
my_regex = re.compile("[0-9]+", re.I)

Here re is the module containing functions and constants for working with regular expressions. After this type of import you can only access those functions by prefixing them with the module name and a dot, as in re.compile.

If you already had a different re in your code you could use an alias:

import re as regex
my_regex = regex.compile("[0-9]+", regex.I)

You might also do this if your module has an unwieldy name or if you’re going to be typing it a lot. For example, when visualizing data with matplotlib, a standard convention is:

import matplotlib.pyplot as plt

If you need a few specific values from a module, you can import them explicitly and use them without qualification:
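For instance, a sketch using the standard library's collections module (the variable names lookup and my_counter are only illustrative):

```python
# import just the names we need; no "collections." prefix is required afterward
from collections import defaultdict, Counter

lookup = defaultdict(int)   # a dict whose missing keys default to int(), i.e. 0
my_counter = Counter()      # an empty counter
```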

Exceptions

Although in many languages exceptions are considered bad, in Python there is no shame in using them to make your code cleaner, and we will occasionally do so.
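As a rough illustration, here is one way an exception might be handled with try and except. The helper safe_divide is a hypothetical name invented for this sketch:

```python
def safe_divide(numerator, denominator):
    """hypothetical helper: returns None rather than crashing on division by zero"""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        # dividing by zero raises ZeroDivisionError, which we catch here
        return None

result = safe_divide(1, 0)   # None, and the program keeps running
```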

Lists

Probably the most fundamental data structure in Python is the list. A list is simply an ordered collection. (It is similar to what in other languages might be called an array, but with some added functionality.)

Python has an in operator to check for list membership. This check involves examining the elements of the list one at a time, which means that you probably shouldn’t use it unless you know your list is pretty small (or unless you don’t care how long the check takes).
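A minimal sketch of creating lists and testing membership with in (the list contents are arbitrary examples):

```python
integer_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]   # elements need not share a type

list_length = len(integer_list)   # 3
list_sum = sum(integer_list)      # 6

# each membership test scans the list from the front
has_one = 1 in integer_list       # True
has_zero = 0 in integer_list      # False
```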

It is often convenient to unpack lists if you know how many elements they contain:

x, y = [1, 2] # now x is 1, y is 2

although you will get a ValueError if you don’t have the same number of elements on both sides.

It’s common to use an underscore for a value you’re going to throw away:

_, y = [1, 2] # now y == 2, didn't care about the first element

Tuples

Tuples are lists' immutable cousins. Pretty much anything you can do to a list that doesn’t involve modifying it, you can do to a tuple. You specify a tuple by using parentheses (or nothing) instead of square brackets:
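A short sketch of the difference (all names here are illustrative). Trying to modify a tuple raises a TypeError, which we catch so the snippet keeps running:

```python
my_list = [1, 2]
my_tuple = (1, 2)
other_tuple = 3, 4       # the parentheses are optional

my_list[1] = 3           # fine: my_list is now [1, 3]

try:
    my_tuple[1] = 3      # tuples do not support item assignment
except TypeError:
    tuple_error = "cannot modify a tuple"
```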

Dictionaries

Dictionary keys must be immutable; in particular, you cannot use lists as keys. If you need a multipart key, you should use a tuple or figure out a way to turn the key into a string.
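To make this concrete, here is a sketch with a made-up dictionary called labels that is keyed by coordinate pairs:

```python
labels = {}
labels[(0, 0)] = "origin"          # a tuple works as a multipart key

try:
    labels[[0, 0]] = "origin"      # a list is mutable, so it cannot be a key
except TypeError:
    list_key_failed = True

# alternatively, turn the multipart key into a string
string_key = "%d,%d" % (0, 0)      # "0,0"
labels[string_key] = "origin"
```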

defaultdict

Imagine that you’re trying to count the words in a document. An obvious approach is to create a dictionary in which the keys are words and the values are counts. As you check each word, you can increment its count if it’s already in the dictionary and add it to the dictionary if it’s not:
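Here is a sketch of a few ways to do that with a plain dict. The word list document is made up for illustration, and the three variants are just common options, not an exhaustive list:

```python
document = ["data", "science", "from", "scratch", "data"]  # made-up words

# approach 1: check for the key before incrementing
word_counts = {}
for word in document:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1

# approach 2: just try the increment and handle the missing key
word_counts_try = {}
for word in document:
    try:
        word_counts_try[word] += 1
    except KeyError:
        word_counts_try[word] = 1

# approach 3: use get, which returns a default for missing keys
word_counts_get = {}
for word in document:
    previous_count = word_counts_get.get(word, 0)
    word_counts_get[word] = previous_count + 1
```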

Every one of these is slightly unwieldy, which is why defaultdict is useful. A defaultdict is like a regular dictionary, except that when you try to look up a key it doesn’t contain, it first adds a value for it using a zero-argument function you provided when you created it. In order to use defaultdicts, you have to import them from collections:
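A minimal sketch of the same word count with a defaultdict (document is again a made-up word list):

```python
from collections import defaultdict

document = ["data", "science", "from", "scratch", "data"]  # made-up words

word_counts = defaultdict(int)    # int() produces 0
for word in document:
    word_counts[word] += 1        # missing words start at 0 automatically

# the zero-argument function can be anything, for example list
dd_list = defaultdict(list)       # list() produces an empty list
dd_list[2].append(1)              # now dd_list contains {2: [1]}
```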

Sets

We’ll use sets for two main reasons. The first is that in is a very fast operation on sets. If we have a large collection of items that we want to use for a membership test, a set is more appropriate than a list:
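For example, with a hypothetical (and here very short) list of stopwords:

```python
stopwords_list = ["a", "an", "at", "yet", "you"]   # imagine many more words
stopwords_set = set(stopwords_list)

in_list = "zip" in stopwords_list   # False, but has to check every element
in_set = "zip" in stopwords_set     # False, found by a fast hash lookup
```

With only five words the difference is negligible; it starts to matter when the collection is large or the test runs many times.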

Truthiness

Python lets you use any value where it expects a Boolean. The following are all "Falsy":

False

None

[] (an empty list)

{} (an empty dict)

""

set()

0

0.0

Pretty much anything else gets treated as True. This allows you to easily use if statements to test for empty lists or empty strings or empty dictionaries or so on. It also sometimes causes tricky bugs if you’re not expecting this behavior:
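The sketch below shows both sides. The function names are invented for this illustration, and the fallback value 10 is arbitrary:

```python
def describe(values):
    """an empty list is falsy, so a bare 'if values' tests for emptiness"""
    if values:
        return "has %d items" % len(values)
    return "empty"

def buggy_default(x):
    """intended to fall back to 10 only when x is None,
    but it also replaces a legitimate 0, because 0 is falsy"""
    return x or 10

def fixed_default(x):
    """compares against None explicitly, so 0 survives"""
    return 10 if x is None else x
```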

Sorting

By default, sort (and sorted) sort a list from smallest to largest based on naively comparing the elements to one another.

If you want elements sorted from largest to smallest, you can specify a reverse=True parameter. And instead of comparing the elements themselves, you can compare the results of a function that you specify with key:
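A short sketch of these options (the numbers and the word_counts dictionary are made up):

```python
x = [4, 1, 2, 3]
y = sorted(x)    # y is [1, 2, 3, 4]; x is unchanged
x.sort()         # now x itself is [1, 2, 3, 4]

# sort by absolute value, from largest to smallest
by_abs = sorted([-4, 1, -2, 3], key=abs, reverse=True)   # [-4, 3, -2, 1]

# sort (word, count) pairs from highest count to lowest
word_counts = {"data": 3, "science": 1, "python": 2}
wc = sorted(word_counts.items(),
            key=lambda pair: pair[1],   # compare on the count
            reverse=True)
```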

Generators and Iterators

A problem with lists is that they can easily grow very big. range(1000000) creates an actual list of 1 million elements. If you only need to deal with them one at a time, this can be a huge source of inefficiency (or of running out of memory). If you potentially only need the first few values, then calculating them all is a waste.

A generator is something that you can iterate over (for us, usually using for) but whose values are produced only as needed (lazily).

One way to create generators is with functions and the yield keyword:

def lazy_range(n):
    """a lazy version of range"""
    i = 0
    while i < n:
        yield i
        i += 1

The following loop will consume the yielded values one at a time until none are left:

for i in lazy_range(10):
    do_something_with(i)

(Python 2 actually comes with a lazy_range function called xrange, and in Python 3, range itself is lazy.) This means you could even create an infinite sequence, although you probably shouldn’t iterate over it without using some kind of break logic.
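One possible version of such an infinite sequence, together with the kind of break logic it needs (the cutoff of 5 is arbitrary):

```python
def natural_numbers():
    """yields 1, 2, 3, ... forever"""
    n = 1
    while True:
        yield n
        n += 1

first_five = []
for n in natural_numbers():
    if n > 5:
        break              # without this, the loop would never end
    first_five.append(n)
```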

Tip

The flip side of laziness is that you can only iterate through a generator once. If you need to iterate through something multiple times, you’ll need to either recreate the generator each time or use a list.
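A quick sketch of what "only once" means in practice. It repeats a lazy_range generator function so that the snippet runs on its own:

```python
def lazy_range(n):
    """a lazy version of range"""
    i = 0
    while i < n:
        yield i
        i += 1

numbers = lazy_range(3)
first_pass = list(numbers)     # [0, 1, 2]
second_pass = list(numbers)    # []: the generator is already used up

# to iterate again, recreate the generator (or keep the values in a list)
third_pass = list(lazy_range(3))
```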

A second way to create generators is by using for comprehensions wrapped in parentheses:

lazy_evens_below_20 = (i for i in lazy_range(20) if i % 2 == 0)

Recall also that every dict has an items() method that returns a list of its key-value pairs. More frequently we’ll use the iteritems() method, which lazily yields the key-value pairs one at a time as we iterate over it.

Randomness

As we learn data science, we will frequently need to generate random numbers, which we can do with the random module:
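Here is a sketch of some commonly used functions from the standard random module. The seed value, the names, and the ranges are arbitrary choices for illustration:

```python
import random

# random.random() produces numbers uniformly between 0 and 1
four_uniform_randoms = [random.random() for _ in range(4)]

# setting the seed makes the pseudorandom results reproducible
random.seed(10)
first = random.random()
random.seed(10)
second = random.random()       # the same value as first

pick = random.randrange(10)    # chosen randomly from 0, 1, ..., 9

up_to_ten = list(range(10))
random.shuffle(up_to_ten)      # reorders the list in place

best_friend = random.choice(["Alice", "Bob", "Charlie"])   # one element
lottery_numbers = random.sample(range(60), 6)              # six distinct elements
```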

Regular Expressions

Regular expressions provide a way of searching text. They are incredibly useful but also fairly complicated, so much so that there are entire books written about them. We will explain their details the few times we encounter them; here are a few examples of how to use them in Python:
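A handful of illustrative checks using functions from the standard re module (the strings are arbitrary). Every entry in the list should be truthy:

```python
import re

checks = [
    not re.match("a", "cat"),                # "cat" doesn't start with "a"
    re.search("a", "cat"),                   # "cat" has an "a" in it
    not re.search("c", "dog"),               # "dog" doesn't have a "c" in it
    3 == len(re.split("[ab]", "carbs")),     # split on a or b to ["c", "r", "s"]
    "R-D-" == re.sub("[0-9]", "-", "R2D2")   # replace digits with dashes
]

all_true = all(checks)
```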

Object-Oriented Programming

Like many languages, Python allows you to define classes that encapsulate data and the functions that operate on them. We’ll use them sometimes to make our code cleaner and simpler. It’s probably simplest to explain them by constructing a heavily annotated example.

Imagine we didn’t have the built-in Python set. Then we might want to create our own Set class.

What behavior should our class have? Given an instance of Set, we’ll need to be able to add items to it, remove items from it, and check whether it contains a certain value. We’ll create all of these as member functions, which means we’ll access them with a dot after a Set object:

# by convention, we give classes PascalCase names
class Set:

    # these are the member functions
    # every one takes a first parameter "self" (another convention)
    # that refers to the particular Set object being used

    def __init__(self, values=None):
        """This is the constructor.
        It gets called when you create a new Set.
        You would use it like
        s1 = Set()          # empty set
        s2 = Set([1,2,2,3]) # initialize with values"""
        self.dict = {}  # each instance of Set has its own dict property
                        # which is what we'll use to track memberships
        if values is not None:
            for value in values:
                self.add(value)

    def __repr__(self):
        """this is the string representation of a Set object
        if you type it at the Python prompt or pass it to str()"""
        return "Set: " + str(self.dict.keys())

    # we'll represent membership by being a key in self.dict with value True
    def add(self, value):
        self.dict[value] = True

    # value is in the Set if it's a key in the dictionary
    def contains(self, value):
        return value in self.dict

    def remove(self, value):
        del self.dict[value]
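A usage sketch for the class above. So that the snippet runs on its own, it starts with a condensed copy of Set (without the comments and __repr__):

```python
class Set:
    """condensed copy of the Set class above"""
    def __init__(self, values=None):
        self.dict = {}
        if values is not None:
            for value in values:
                self.add(value)
    def add(self, value):
        self.dict[value] = True
    def contains(self, value):
        return value in self.dict
    def remove(self, value):
        del self.dict[value]

s = Set([1, 2, 2, 3])         # the duplicate 2 is stored only once
s.add(4)
has_four = s.contains(4)      # True
s.remove(3)
has_three = s.contains(3)     # False
```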

args and kwargs

When we define a function whose parameters are written as *args and **kwargs, args is a tuple of its unnamed arguments and kwargs is a dict of its named arguments. It works the other way too, if you want to use a list (or tuple) and dict to supply arguments to a function:
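A sketch of both directions. The function names magic and other_way_magic, and the argument values, are made up for illustration:

```python
def magic(*args, **kwargs):
    """hypothetical function that just hands back what it received"""
    return args, kwargs

unnamed, named = magic(1, 2, key="word", key2="word2")
# unnamed is the tuple (1, 2)
# named is the dict {"key": "word", "key2": "word2"}

def other_way_magic(x, y, z):
    return x + y + z

x_y_list = [1, 2]
z_dict = {"z": 3}
total = other_way_magic(*x_y_list, **z_dict)   # same as other_way_magic(1, 2, z=3)
```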

Joel Grus is a software engineer at Google. Before that he worked as a data scientist at multiple startups. He lives in Seattle, where he regularly attends data science happy hours. He blogs infrequently at joelgrus.com.