Python is the programming language of choice for many scientists to a large degree because it offers a great deal of power to analyze and model scientific data with relatively little overhead in terms of learning, installation or development time. It is a language you can pick up in a weekend and use for the rest of your life.

The Python Tutorial is a great place to start getting a feel for the language. To complement this material, I taught a Python Short Course years ago to a group of computational chemists at a time when I was worried the field was moving too much in the direction of using canned software rather than developing one's own methods. I wanted to focus on what working scientists needed to be more productive: parsing output of other programs, building simple models, experimenting with object oriented programming, extending the language with C, and simple GUIs.

I'm trying to do something very similar here, to cut to the chase and focus on what scientists need. In the last year or so, the IPython Project has put together a notebook interface that I have found incredibly valuable. A large number of people have released very good IPython Notebooks that I have taken a huge amount of pleasure reading through. Some ones that I particularly like include:

I find IPython notebooks an easy way both to get important work done in my everyday job and to communicate what I've done, how I've done it, and why it matters to my coworkers. I find myself endlessly sweeping the IPython subreddit hoping someone will post a new notebook. In the interest of putting more notebooks out into the wild for other people to use and enjoy, I thought I would try to recreate some of what I was trying to get across in the original Python Short Course, updated by 15 years of Python, Numpy, Scipy, Matplotlib, and IPython development, as well as my own experience in using Python almost every day of this time.

There are two branches of current releases in Python: the older-syntax Python 2, and the newer-syntax Python 3. This schizophrenia is largely intentional: when it became clear that some non-backwards-compatible changes to the language were necessary, the Python dev-team decided to go through a five-year (or so) transition, during which the new language features would be introduced and the old language was still actively maintained, to make such a transition as easy as possible. We're now (2013) past the halfway point, and, IMHO, at the first time when I'm considering making the change to Python 3.

Nonetheless, I'm going to write these notes with Python 2 in mind, since this is the version of the language that I use in my day-to-day job, and am most comfortable with. If these notes are important and are valuable to people, I'll be happy to rewrite the notes using Python 3.

With this in mind, these notes assume you have a Python distribution that includes:

IPython, with the additional libraries required for the notebook interface.

A good, easy to install option that supports Mac, Windows, and Linux, and that has all of these packages (and much more) is the Enthought Python Distribution, also known as EPD, which appears to be changing its name to Enthought Canopy. Enthought is a commercial company that supports a lot of very good work in scientific Python development and application. You can either purchase a license to use EPD, or download and install the free version.

Here are some other alternatives, should you not want to use EPD:

Linux Most distributions have an installation manager. Redhat has yum, Ubuntu has apt-get. To my knowledge, all of these packages should be available through those installers.

Mac I use Macports, which has up-to-date versions of all of these packages.

Cloud This notebook is currently not running on the IPython notebook viewer, but will be shortly, which will allow the notebook to be viewed but not interactively. I'm keeping an eye on Wakari, from Continuum Analytics, which is a cloud-based IPython notebook. Wakari appears to support free accounts as well. Continuum is a company started by some of the core Enthought Numpy/Scipy people focusing on big data.

Continuum also supports a bundled, multiplatform Python package called Anaconda that I'll also keep an eye on.

This is a quick introduction to Python. There are lots of other places to learn the language more thoroughly. I have collected a list of useful links, including ones to other learning resources, at the end of this notebook. If you want a little more depth, the Python Tutorial is a great place to start, as is Zed Shaw's Learn Python the Hard Way.

Briefly, notebooks have code cells (that are generally followed by result cells) and text cells. The text cells are the stuff that you're reading now. The code cells start with "In []:" with some number generally in the brackets. If you put your cursor in the code cell and hit Shift-Enter, the code will run in the Python interpreter and the result will print out in the output cell. You can then change things around and see whether you understand what's going on. If you need to know more, see the IPython notebook documentation or the IPython tutorial.

Many of the things I used to use a calculator for, I now use Python for:

In [1]:

2+2

Out[1]:

4

In [2]:

(50-5*6)/4

Out[2]:

5.0

(If you're typing this into an IPython notebook, or otherwise using a notebook file, you hit Shift-Enter to evaluate a cell.)

There are some gotchas compared to using a normal calculator.

In [3]:

7/3

Out[3]:

2.3333333333333335

In Python 2, integer division, like C or Fortran integer division, truncates the remainder and returns an integer, so 7/3 would give 2. In Python 3, which produced the output shown above, the / operator returns a floating point number. You can get a sneak preview of this feature in Python 2 by importing the division feature from the __future__ module:

from __future__ import division

Alternatively, you can convert one of the integers to a floating point number, in which case the division function returns another floating point number.

In [4]:

7/3.

Out[4]:

2.3333333333333335

In [5]:

7/float(3)

Out[5]:

2.3333333333333335

In the last few lines, we have sped by a lot of things that we should stop for a moment and explore a little more fully. We've seen, however briefly, two different data types: integers, also known as whole numbers to the non-programming world, and floating point numbers, also known (incorrectly) as decimal numbers to the rest of the world.
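To make the distinction between the two data types and the two kinds of division concrete, here is a minimal sketch. It assumes a Python 3 interpreter (which matches the outputs shown above); the variable names are mine, chosen only for illustration.

```python
# Division behaviour in Python 3 (an assumption: the outputs above match Python 3).
true_quotient = 7 / 3      # "true" division always returns a float
floor_quotient = 7 // 3    # floor division returns an integer for integer operands
remainder = 7 % 3          # the remainder that floor division discards

print(true_quotient, floor_quotient, remainder)

# type() reports the data type of any object
print(type(7), type(7.0), type(7 / 3))
```

The // operator is the one to reach for when you actually want the old integer-division behavior.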

We've also seen the first instance of an import statement. Python has a huge number of libraries included with the distribution. To keep things simple, most of these variables and functions are not accessible from a normal Python interactive session. Instead, you have to import the name. For example, there is a math module containing many useful functions. To access, say, the square root function, you can either first

from math import sqrt

and then

In [6]:

sqrt(81)

Out[6]:

9.0

or you can simply import the math library itself

In [7]:

import math
math.sqrt(81)

Out[7]:

9.0

You can define variables using the equals (=) sign:

In [8]:

width = 20
length = 30
area = length*width
area

Out[8]:

600

If you try to access a variable that you haven't yet defined, you get an error:
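The error in question is a NameError. A small sketch, using a hypothetical variable name (volume) that has not been assigned anywhere, and catching the error so that the cell keeps running:

```python
try:
    volume  # hypothetical name: nothing has been assigned to it yet
except NameError as error:
    message = str(error)
    print("NameError:", message)
```

In an interactive session you would normally just see the traceback; the try/except here is only so the example can report the message without stopping.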

You can name a variable almost anything you want. The name needs to start with an alphabetical character or "_", and can contain alphanumeric characters plus underscores ("_"). Certain words, however, are reserved for the language:
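Rather than memorizing the reserved words, you can ask the interpreter for them through the standard keyword module. The helper is_usable_name below is a hypothetical function of my own, not part of any library; it is only a sketch of the naming rules just described.

```python
import keyword

# The reserved words of the interpreter you are running
print(keyword.kwlist)

def is_usable_name(name):
    """Return True if *name* is a valid identifier and not a reserved word."""
    return name.isidentifier() and not keyword.iskeyword(name)

print(is_usable_name("_width2"), is_usable_name("for"), is_usable_name("2wide"))
```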

Python lists, like C, but unlike Fortran, use 0 as the index of the first element of a list. Thus, in this example, the 0 element is "Sunday", 1 is "Monday", and so on. If you need to access the nth element from the end of the list, you can use a negative index. For example, the -1 element of a list is the last element:
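The list being discussed is not shown in this section, so the following is a sketch that assumes it was defined in the obvious way, as a list of seven day names starting with Sunday:

```python
# Assumed definition of the list referred to in the text
days_of_the_week = ["Sunday", "Monday", "Tuesday", "Wednesday",
                    "Thursday", "Friday", "Saturday"]

first_day = days_of_the_week[0]        # indexing starts at 0
second_day = days_of_the_week[1]
last_day = days_of_the_week[-1]        # negative indices count from the end
second_to_last = days_of_the_week[-2]

print(first_day, second_day, last_day, second_to_last)
```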

The range() command is a convenient way to make sequential lists of numbers:

In [26]:

list(range(10))

Out[26]:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Note that range(n) starts at 0 and gives the sequential list of integers less than n. If you want to start at a different number, use range(start,stop)

In [27]:

list(range(2,8))

Out[27]:

[2, 3, 4, 5, 6, 7]

The lists created above with range have a step of 1 between elements. You can also give a fixed step size via a third argument:

In [28]:

evens = list(range(0,20,2))
evens

Out[28]:

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

In [29]:

evens[3]

Out[29]:

6

Lists do not have to hold the same data type. For example,

In [30]:

["Today",7,99.3,""]

Out[30]:

['Today', 7, 99.3, '']

However, it's good (but not essential) to use lists for similar objects that are somehow logically connected. If you want to group different data types together into a composite data object, it's best to use tuples, which we will learn about below.

You can find out how long a list is using the len() command:

In [31]:

help(len)

Help on built-in function len in module builtins:
len(...)
len(object) -> integer
Return the number of items of a sequence or mapping.
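The help text is a little abstract, so here is a short sketch of len() in use. It rebuilds the evens list from above so that the cell stands on its own.

```python
evens = list(range(0, 20, 2))

list_length = len(evens)         # number of items in the list
string_length = len("Sunday")    # len() works on any sequence, including strings
empty_length = len([])           # an empty list has length 0

print(list_length, string_length, empty_length)
```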

One of the most useful things you can do with lists is to iterate through them, i.e. to go through each element one at a time. To do this in Python, we use the for statement:

In [33]:

for day in days_of_the_week:
    print(day)

Sunday
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday

This code snippet goes through each element of the list called days_of_the_week and assigns it to the variable day. It then executes everything in the indented block (in this case only one line of code, the print statement) using those variable assignments. When the program has gone through every element of the list, it exits the block.

(Almost) every programming language defines blocks of code in some way. In Fortran, one uses END statements (ENDDO, ENDIF, etc.) to define code blocks. In C, C++, and Perl, one uses curly braces {} to define these blocks.

Python uses a colon (":"), followed by indentation level to define code blocks. Everything at a higher level of indentation is taken to be in the same block. In the above example the block was only a single line, but we could have had longer blocks as well:

In [34]:

for day in days_of_the_week:
    statement = "Today is " + day
    print(statement)

Today is Sunday
Today is Monday
Today is Tuesday
Today is Wednesday
Today is Thursday
Today is Friday
Today is Saturday

The range() command is particularly useful with the for statement to execute loops of a specified length:

In [35]:

for i in range(20):
    print("The square of ",i," is ",i*i)

The square of 0 is 0
The square of 1 is 1
The square of 2 is 4
The square of 3 is 9
The square of 4 is 16
The square of 5 is 25
The square of 6 is 36
The square of 7 is 49
The square of 8 is 64
The square of 9 is 81
The square of 10 is 100
The square of 11 is 121
The square of 12 is 144
The square of 13 is 169
The square of 14 is 196
The square of 15 is 225
The square of 16 is 256
The square of 17 is 289
The square of 18 is 324
The square of 19 is 361

Lists and strings have something in common that you might not suspect: they can both be treated as sequences. You already know that you can iterate through the elements of a list. You can also iterate through the letters in a string:

In [36]:

for letter in "Sunday":
    print(letter)

S
u
n
d
a
y

This is only occasionally useful. Slightly more useful is the slicing operation, which you can also use on any sequence. We already know that we can use indexing to get the first element of a list:

In [37]:

days_of_the_week[0]

Out[37]:

'Sunday'

If we want the list containing the first two elements of a list, we can do this via

In [38]:

days_of_the_week[0:2]

Out[38]:

['Sunday', 'Monday']

or simply

In [39]:

days_of_the_week[:2]

Out[39]:

['Sunday', 'Monday']

If we want the last items of the list, we can do this with negative slicing:

In [40]:

days_of_the_week[-2:]

Out[40]:

['Friday', 'Saturday']

which is somewhat logically consistent with negative indices accessing the last elements of the list.

You can do:

In [41]:

workdays = days_of_the_week[1:6]
print(workdays)

['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

Since strings are sequences, you can also do this to them:

In [42]:

day = "Sunday"
abbreviation = day[:3]
print(abbreviation)

Sun

If we really want to get fancy, we can pass a third element into the slice, which specifies a step length (just like a third argument to the range() function specifies the step):

Note that in this example I was even able to omit the second argument, so that the slice started at 2, went to the end of the list, and took every second element, to generate the list of even numbers less than 40.
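The cell being described is not shown here, so this is a sketch of what such a stepped slice looks like, assuming a list of the integers from 0 to 39 (the names numbers, every_third and reversed_numbers are mine):

```python
numbers = list(range(0, 40))

evens = numbers[2::2]             # start at index 2, run to the end, step by 2
every_third = numbers[::3]        # start and stop both omitted
reversed_numbers = numbers[::-1]  # a negative step walks the list backwards

print(evens)
```

The same three-part slice works on strings and any other sequence.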

We have now learned a few data types: integers, floating point numbers, and strings, as well as lists, a container that can hold any data type. We have learned to print things out, and to iterate over items in lists. We will now learn about boolean variables that can be either True or False.

We invariably need some concept of conditions in programming to control branching behavior, to allow a program to react differently to different situations. If it's Monday, I'll go to work, but if it's Sunday, I'll sleep in. To do this in Python, we use a combination of boolean variables, which evaluate to either True or False, and if statements, that control branching based on boolean values.

For example:

In [44]:

if day == "Sunday":
    print("Sleep in")
else:
    print("Go to work")

Sleep in

(Quick quiz: why did the snippet print "Sleep in" here? What is the variable "day" set to?)

Let's take the snippet apart to see what happened. First, note the statement

In [45]:

day=="Sunday"

Out[45]:

True

If we evaluate it by itself, as we just did, we see that it returns a boolean value, True. The "==" operator performs equality testing. If the two items are equal, it returns True, otherwise it returns False. In this case, it is comparing two values, the string "Sunday", and whatever is stored in the variable "day", which, in this case, is also the string "Sunday" (set in the abbreviation example above). Since the two strings are equal to each other, the truth test has the true value.

The if statement that contains the truth test is followed by a code block (a colon followed by an indented block of code). If the boolean is true, it executes the code in that block. Since it is true in the above example, that code is executed, which is why we see "Sleep in".

The first block of code is followed by an else statement, which is executed if nothing else in the above if statement is true. Since the test was true, the else block is skipped here; had "day" held any other string, we would have seen "Go to work" instead.

You can compare any data types in Python:

In [46]:

1==2

Out[46]:

False

In [47]:

50==2*25

Out[47]:

True

In [48]:

3<3.14159

Out[48]:

True

In [49]:

1==1.0

Out[49]:

True

In [50]:

1!=0

Out[50]:

True

In [51]:

1<=2

Out[51]:

True

In [52]:

1>=1

Out[52]:

True

We see a few other boolean operators here, all of which should be self-explanatory: less than, equality, non-equality, and so on.

Particularly interesting is the 1 == 1.0 test, which is true, since even though the two objects are different data types (integer and floating point number), they have the same value. There is another boolean operator is, that tests whether two objects are the same object:

In [53]:

1 is 1.0

Out[53]:

False
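The difference between == and is shows up most clearly with mutable objects such as lists. A small sketch (the names a, b and c are arbitrary):

```python
a = [1, 2, 3]
b = a            # b is another name for the same list object
c = list(a)      # c is a new list with equal contents

print(a == c, a is c)   # equal values, but different objects
print(a == b, a is b)   # the very same object

b.append(4)             # modifying b also changes a, since they are one object
print(a, c)
```

In everyday code, == is almost always the test you want; is is mostly used for comparisons against None.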

We can do boolean tests on lists as well:

In [54]:

[1,2,3]==[1,2,4]

Out[54]:

False

In [55]:

[1,2,3]<[1,2,4]

Out[55]:

True

Finally, note that you can also string multiple comparisons together, which can result in very intuitive tests:

In [56]:

hours = 5
0 < hours < 24

Out[56]:

True

If statements can have elif parts ("else if"), in addition to if/else parts. For example:
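A minimal sketch of an if/elif/else chain, wrapped in a hypothetical helper function plan_for (my own name) so that it can be tried with different days:

```python
def plan_for(day):
    """Return a plan for the given day name (hypothetical helper)."""
    if day == "Sunday":
        return "Sleep in"
    elif day == "Saturday":
        return "Do chores"
    else:
        return "Go to work"

for day in ["Sunday", "Saturday", "Wednesday"]:
    print(day, "->", plan_for(day))
```

The tests are tried in order, and only the block belonging to the first true test is executed.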

Of course we can combine if statements with for loops, to make a snippet that is almost interesting:

In [58]:

for day in days_of_the_week:
    statement = "Today is " + day
    print(statement)
    if day == "Sunday":
        print(" Sleep in")
    elif day == "Saturday":
        print(" Do chores")
    else:
        print(" Go to work")

Today is Sunday
Sleep in
Today is Monday
Go to work
Today is Tuesday
Go to work
Today is Wednesday
Go to work
Today is Thursday
Go to work
Today is Friday
Go to work
Today is Saturday
Do chores

This is something of an advanced topic, but ordinary data types have boolean values associated with them, and, indeed, in early versions of Python there was not a separate boolean object. Essentially, anything that was a 0 value (the integer or floating point 0, an empty string "", or an empty list []) was False, and everything else was true. You can see the boolean value of any data object using the bool() function.
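Here is a short sketch that runs bool() over a handful of sample values (the particular values are my own choice):

```python
values = [0, 1, 0.0, 2.5, "", "0", [], [0], None]

for value in values:
    print(repr(value), "->", bool(value))

truthiness = [bool(value) for value in values]
```

Note that the string "0" and the list [0] are both true: they are non-empty, even though what they contain looks like zero.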

The Fibonacci sequence is a sequence in math that starts with 0 and 1, and then each successive entry is the sum of the previous two. Thus, the sequence goes 0,1,1,2,3,5,8,13,21,34,55,89,...

A very common exercise in programming books is to compute the Fibonacci sequence up to some number n. First I'll show the code, then I'll discuss what it is doing.

In [62]:

n = 10
sequence = [0,1]
for i in range(2,n): # This is going to be a problem if we ever set n <= 2!
    sequence.append(sequence[i-1]+sequence[i-2])
print(sequence)

[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

Let's go through this line by line. First, we define the variable n, and set it to the integer 10. n is the length of the sequence we're going to form, and should probably have a better variable name. We then create a variable called sequence, and initialize it to the list with the integers 0 and 1 in it, the first two elements of the Fibonacci sequence. We have to create these elements "by hand", since the iterative part of the sequence requires two previous elements.

We then have a for loop over the list of integers from 2 (the next element of the list) to n (the length of the sequence). After the colon, we see a hash tag "#", and then a comment that if we had set n to some number less than 2 we would have a problem. Comments in Python start with #, and are good ways to make notes to yourself or to a user of your code explaining why you did what you did. Better than the comment here would be to test to make sure the value of n is valid, and to complain if it isn't; we'll try this later.

In the body of the loop, we append to the list an integer equal to the sum of the two previous elements of the list.

After exiting the loop (ending the indentation) we then print out the whole list. That's it!

We might want to use the Fibonacci snippet with different sequence lengths. We could cut and paste the code into another cell, changing the value of n, but it's easier and more useful to make a function out of the code. We do this with the def statement in Python:

In [63]:

def fibonacci(sequence_length):
    "Return the Fibonacci sequence of length *sequence_length*"
    sequence = [0,1]
    if sequence_length < 1:
        print("Fibonacci sequence only defined for length 1 or greater")
        return
    if 0 < sequence_length < 3:
        return sequence[:sequence_length]
    for i in range(2,sequence_length):
        sequence.append(sequence[i-1]+sequence[i-2])
    return sequence

We can now call fibonacci() for different sequence_lengths:

In [64]:

fibonacci(2)

Out[64]:

[0, 1]

In [65]:

fibonacci(12)

Out[65]:

[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

We've introduced several new features here. First, note that the function itself is defined as a code block (a colon followed by an indented block). This is the standard way that Python delimits things. Next, note that the first line of the function is a single string. This is called a docstring, and is a special kind of comment that is often available to people using the function through the python command line:

In [66]:

help(fibonacci)

Help on function fibonacci in module __main__:
fibonacci(sequence_length)
Return the Fibonacci sequence of length *sequence_length*

If you define a docstring for all of your functions, it makes it easier for other people to use them, since they can get help on the arguments and return values of the function.

Next, note that rather than putting a comment in about what input values lead to errors, we have some testing of these values, followed by a warning if the value is invalid, and some conditional code to handle special cases.

Functions can also call themselves, something that is often called recursion. We're going to experiment with recursion by computing the factorial function. The factorial is defined for a positive integer n as

$$ n! = n(n-1)(n-2)\cdots 1 $$

First, note that we don't need to write a function at all, since this is a function built into the standard math library. Let's use the help function to find out about it:

In [67]:

from math import factorial
help(factorial)

Help on built-in function factorial in module math:
factorial(...)
factorial(x) -> Integral
Find x!. Raise a ValueError if x is negative or non-integral.

This is clearly what we want.

In [68]:

factorial(20)

Out[68]:

2432902008176640000

However, if we did want to write a function ourselves, we could do so recursively by noting that $n! = n(n-1)!$.
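The recursive definition of the factorial, $n! = n\,(n-1)!$ with $0! = 1! = 1$, translates almost directly into code. This is a sketch under my own naming (fact) and with an input check that I have added; it is not the library implementation.

```python
from math import factorial

def fact(n):
    """Return n! for a non-negative integer n, computed recursively."""
    if n < 0:
        raise ValueError("factorial is not defined for negative numbers")
    if n <= 1:              # base case: 0! = 1! = 1
        return 1
    return n * fact(n - 1)  # recursive case: n! = n * (n-1)!

print(fact(20), fact(20) == factorial(20))
```

Each call hands a smaller problem to the next one until the base case is reached, so very large n will eventually hit Python's recursion limit; for real work, the built-in version is the better choice.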

Tuples are useful anytime you want to group different pieces of data together in an object, but don't want to create a full-fledged class (see below) for them. For example, let's say you want the Cartesian coordinates of some objects in your program. Tuples are a good way to do this:

In [75]:

('Bob',0.0,21.0)

Out[75]:

('Bob', 0.0, 21.0)

Again, it's not a necessary distinction, but one way to distinguish tuples and lists is that tuples are a collection of different things, here a name, and x and y coordinates, whereas a list is a collection of similar things, like if we wanted a list of those coordinates:

In [76]:

positions=[('Bob',0.0,21.0),('Cat',2.5,13.1),('Dog',33.0,1.2)]

Tuples can be used when functions return more than one value. Say we wanted to compute the smallest x- and y-coordinates of the above list of objects. We could write:

In [77]:

def minmax(objects):
    minx = 1e20 # These are set to really big numbers
    miny = 1e20
    for obj in objects:
        name,x,y = obj
        if x < minx:
            minx = x
        if y < miny:
            miny = y
    return minx,miny

x,y = minmax(positions)
print(x,y)

0.0 1.2

Here we did two things with tuples you haven't seen before. First, we unpacked an object into a set of named variables using tuple assignment:

>>> name,x,y = obj

We also returned multiple values (minx,miny), which were then assigned to two other variables (x,y), again by tuple assignment. This makes what would have been complicated code in C++ rather simple.

Tuple assignment is also a convenient way to swap variables:

In [78]:

x,y = 1,2
y,x = x,y
x,y

Out[78]:

(2, 1)

Dictionaries are an object called "mappings" or "associative arrays" in other languages. Whereas a list associates an integer index with a set of objects:

In [79]:

mylist=[1,2,9,21]

The index in a dictionary is called the key, and the corresponding dictionary entry is the value. A dictionary can use (almost) anything as the key. Whereas lists are formed with square brackets [], dictionaries use curly brackets {}:

In [80]:

ages = {"Rick":46,"Bob":86,"Fred":21}
print("Rick's age is ",ages["Rick"])

Rick's age is 46

There's also a convenient way to create dictionaries without having to quote the keys.
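The cell that followed is not shown here; the shortcut is most likely the built-in dict() constructor called with keyword arguments, so this sketch assumes that form:

```python
# Keys are written as keyword arguments, so no quotes are needed
ages = dict(Rick=46, Bob=86, Fred=21)

print("Rick's age is", ages["Rick"])
print(ages == {"Rick": 46, "Bob": 86, "Fred": 21})
```

This form only works when the keys are strings that are also valid Python names; a key such as "Mary Ann" still needs the curly-bracket syntax.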

We can generally understand trends in data by using a plotting program to chart it. Python has a wonderful plotting library called Matplotlib. The IPython notebook interface we are using for these notes has that functionality built in.

As an example, we have looked at two different functions, the Fibonacci function, and the factorial function, both of which grow faster than polynomially. Which one grows the fastest? Let's plot them. First, let's generate the Fibonacci sequence of length 20:

The factorial function grows much faster. In fact, you can't even see the Fibonacci sequence. It's not entirely surprising: a function where we multiply by n each iteration is bound to grow faster than one where we add (roughly) n each iteration.
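The same conclusion can be checked numerically without a plot. This sketch rebuilds both sequences inline (so it does not depend on earlier cells) and compares the ratio of successive terms; the variable names are mine.

```python
from math import factorial

n = 20
fib = [0, 1]
for i in range(2, n):
    fib.append(fib[i - 1] + fib[i - 2])
factorials = [factorial(i) for i in range(n)]

for i in (5, 10, 15, 19):
    print(i, fib[i], factorials[i])

# Ratio of successive terms: settles near 1.618 for Fibonacci, keeps growing for factorial
fib_ratio = fib[19] / fib[18]
factorial_ratio = factorials[19] / factorials[18]
print(fib_ratio, factorial_ratio)
```

The Fibonacci ratio settles down to a constant (the golden ratio), which is exponential growth, while the factorial ratio is n itself and so keeps increasing.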

Let's plot these on a semilog plot so we can see them both a little more clearly:

There are many more things you can do with Matplotlib. We'll be looking at some of them in the sections to come. In the meantime, if you want an idea of the different things you can do, look at the Matplotlib Gallery. Rob Johansson's IPython notebook Introduction to Matplotlib is also particularly good.

There is, of course, much more to the language than I've covered here. I've tried to keep this brief enough so that you can jump in and start using Python to simplify your life and work. My own experience in learning new things is that the information doesn't "stick" unless you try and use it for something in real life.

Tim Peters, one of the earliest and most prolific Python contributors, wrote the "Zen of Python", which can be accessed via the "import this" command:

In [88]:

importthis

The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

No matter how experienced a programmer you are, these are words to meditate on.

Numpy contains core routines for doing fast vector, matrix, and linear algebra-type operations in Python. Scipy contains additional routines for optimization, special functions, and so on. Both contain modules written in C and Fortran so that they're as fast as possible. Together, they give Python roughly the same capability that the Matlab program offers. (In fact, if you're an experienced Matlab user, there is a guide to Numpy for Matlab users just for you.)

Fundamental to both Numpy and Scipy is the ability to work with vectors and matrices. You can create vectors from lists using the array command:

In [89]:

array([1,2,3,4,5,6])

Out[89]:

array([1, 2, 3, 4, 5, 6])

You can pass in a second argument to array that gives the numeric type. There are a number of types listed here that your matrix can be. Some of these are aliased to single character codes. The most common ones are 'd' (double precision floating point number), 'D' (double precision complex number), and 'i' (int32). Thus,

In [90]:

array([1,2,3,4,5,6],'d')

Out[90]:

array([ 1., 2., 3., 4., 5., 6.])

In [91]:

array([1,2,3,4,5,6],'D')

Out[91]:

array([ 1.+0.j, 2.+0.j, 3.+0.j, 4.+0.j, 5.+0.j, 6.+0.j])

In [92]:

array([1,2,3,4,5,6],'i')

Out[92]:

array([1, 2, 3, 4, 5, 6], dtype=int32)

To build matrices, you can use the array command with lists of lists:

In [93]:

array([[0,1],[1,0]],'d')

Out[93]:

array([[ 0., 1.],
[ 1., 0.]])

You can also form empty (zero) matrices of arbitrary shape (including vectors, which Numpy treats as vectors with one row), using the zeros command:

In [94]:

zeros((3,3),'d')

Out[94]:

array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]])

The first argument is a tuple containing the shape of the matrix, and the second is the data type argument, which follows the same conventions as in the array command. Thus, you can make row vectors:
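The cells that followed are not shown here, so this is a sketch of the three shapes one typically wants. It assumes NumPy is imported explicitly as np rather than relying on the notebook's preloaded names.

```python
import numpy as np

vector = np.zeros(3, 'd')        # 1-D array of length 3
row = np.zeros((1, 3), 'd')      # explicit 1x3 row vector
column = np.zeros((3, 1), 'd')   # explicit 3x1 column vector

print(vector.shape, row.shape, column.shape)
```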

If you provide a third argument, it takes that as the number of points in the space. If you don't provide the argument, it gives a length 50 linear space.

In [100]:

linspace(0,1,11)

Out[100]:

array([ 0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

linspace is an easy way to make coordinates for plotting. Functions in the numpy library (all of which are imported into IPython notebook) can act on an entire vector (or even a matrix) of points at once. Thus,
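A minimal sketch of this vectorized behavior, again assuming NumPy is imported explicitly as np:

```python
import numpy as np

x = np.linspace(0, 2 * np.pi)   # no third argument, so 50 points
y = np.sin(x)                   # sin is applied to every element at once

print(len(x), y.shape)
print(y.max(), y.min())
```

No explicit loop is needed; the function returns an array of the same shape as its input.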

Now that we have these tools in our toolbox, we can start to do some cool stuff with it. Many of the equations we want to solve in Physics involve differential equations. We want to be able to compute the derivative of functions:

$$ y' = \frac{y(x+h)-y(x)}{h} $$

by discretizing the function $y(x)$ on an evenly spaced set of points $x_0, x_1, \dots, x_n$, yielding $y_0, y_1, \dots, y_n$. Using the discretization, we can approximate the derivative by

$$ y_i' \approx \frac{y_{i+1}-y_{i-1}}{x_{i+1}-x_{i-1}} $$

We can write a derivative function in Python via

In [113]:

def nderiv(y,x):
    "Finite difference derivative of the function y(x)"
    n = len(y)
    d = zeros(n,'d') # assume double
    # Use centered differences for the interior points, one-sided differences for the ends
    for i in range(1,n-1):
        d[i] = (y[i+1]-y[i-1])/(x[i+1]-x[i-1])
    d[0] = (y[1]-y[0])/(x[1]-x[0])
    d[n-1] = (y[n-1]-y[n-2])/(x[n-1]-x[n-2])
    return d

Let's see whether this works for our sin example from above:

In [114]:

x = linspace(0,2*pi)
dsin = nderiv(sin(x),x)
plot(x,dsin,label='numerical')
plot(x,cos(x),label='analytical')
title("Comparison of numerical and analytical derivatives of sin(x)")
legend()
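As a cross-check that does not rely on a hand-written loop, NumPy ships a gradient function that uses centered differences in the interior of the grid. The sketch below, with NumPy imported explicitly as np, measures how far the numerical derivative of sin(x) is from cos(x) on the same 50-point grid.

```python
import numpy as np

x = np.linspace(0, 2 * np.pi)      # 50 evenly spaced points
y = np.sin(x)

dydx = np.gradient(y, x)           # centered differences in the interior
interior_error = np.max(np.abs(dydx[1:-1] - np.cos(x[1:-1])))

print("largest interior error:", interior_error)
```

For a centered difference the error shrinks like the square of the grid spacing, so doubling the number of points should cut it by roughly a factor of four.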

for $\psi(x)$ when $V(x)=\frac{1}{2}m\omega^2x^2$ is the harmonic oscillator potential. We're going to use the standard trick to transform the differential equation into a matrix equation by multiplying both sides by $\psi^*(x)$ and integrating over $x$. This yields

We will again use the finite difference approximation. The finite difference formula for the second derivative is

$$ y_i'' \approx \frac{y_{i+1}-2y_i+y_{i-1}}{h^2}, \qquad h = x_{i+1}-x_i $$

We can think of the first term in the Schrodinger equation as the overlap of the wave function $\psi(x)$ with the second derivative of the wave function $\frac{\partial^2}{\partial x^2}\psi(x)$. Given the above expression for the second derivative, we can see that if we take the overlap of the states $y_1,\dots,y_n$ with the second derivative, we will only have three points where the overlap is nonzero, at $y_{i-1}$, $y_i$, and $y_{i+1}$. In matrix form, this leads to the tridiagonal Laplacian matrix, which has -2's along the main diagonal, and 1's along the diagonals above and below the main diagonal.
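The matrix just described can be sketched in a few lines. The function name laplacian_matrix is my own stand-in, and the division by the square of the grid spacing is my assumption about how the matrix is scaled so that it approximates a second derivative on an evenly spaced grid.

```python
import numpy as np

def laplacian_matrix(x):
    """Tridiagonal finite-difference second-derivative matrix on an evenly spaced grid x."""
    h = x[1] - x[0]
    n = len(x)
    main = -2.0 * np.ones(n)   # -2 on the main diagonal
    off = np.ones(n - 1)       # 1 on the diagonals just above and below
    return (np.diag(main) + np.diag(off, 1) + np.diag(off, -1)) / h**2

x = np.linspace(0, 1, 6)
lap = laplacian_matrix(x)

# Applied to y = x^2, the interior rows should return the exact second derivative, 2
second_derivative = lap.dot(x**2)
print(second_derivative[1:-1])
```

The first and last rows are missing a neighbor, so the values at the two end points are not meaningful; this amounts to assuming the function vanishes just outside the grid.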

The second term leads to a diagonal matrix with $V(x_i)$ on the diagonal elements. Putting all of these pieces together, we get:

x = linspace(-3,3)
m = 1.0
ohm = 1.0
T = (-0.5/m)*Laplacian(x)
V = 0.5*(ohm**2)*(x**2)
H = T + diag(V)
E,U = eigh(H)
h = x[1]-x[0]

# Plot the Harmonic potential
plot(x,V,color='k')

for i in range(4):
    # For each of the first few solutions, plot the energy level:
    axhline(y=E[i],color='k',ls=":")
    # as well as the eigenfunction, displaced by the energy level so they don't
    # all pile up on each other:
    plot(x,-U[:,i]/sqrt(h)+E[i])
title("Eigenfunctions of the Quantum Harmonic Oscillator")
xlabel("Displacement (bohr)")
ylabel("Energy (hartree)")

Out[116]:

<matplotlib.text.Text at 0x5695d90>

We've made a couple of hacks here to get the orbitals the way we want them. First, I inserted a -1 factor before the wave functions, to fix the phase of the lowest state. The phase (sign) of a quantum wave function doesn't hold any information, only the square of the wave function does, so this doesn't really change anything.

But the eigenfunctions as we generate them aren't properly normalized. The reason is that finite difference isn't a real basis in the quantum mechanical sense. It's a basis of Dirac δ functions at each point; we interpret the space between the points as being "filled" by the wave function, but the finite difference basis only has the solution being at the points themselves. We can fix this by dividing the eigenfunctions of our finite difference Hamiltonian by the square root of the spacing, and this gives properly normalized functions.
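The normalization argument can be checked numerically. This sketch rebuilds the Hamiltonian inline with m = ω = 1 on the same 50-point grid, using np.linalg.eigh and my own construction of the tridiagonal matrix (an assumption about how the Laplacian is built), and compares the two norms.

```python
import numpy as np

x = np.linspace(-3, 3)                 # same 50-point grid as above
h = x[1] - x[0]
n = len(x)

# Tridiagonal second-derivative matrix (assumed form), kinetic and potential terms
lap = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
       + np.diag(np.ones(n - 1), -1)) / h**2
H = -0.5 * lap + np.diag(0.5 * x**2)
E, U = np.linalg.eigh(H)

raw_norm = np.sum(U[:, 0]**2)          # eigh returns unit-length column vectors
psi = U[:, 0] / np.sqrt(h)             # rescale by the square root of the spacing
grid_norm = np.sum(psi**2) * h         # Riemann-sum estimate of the integral of psi^2

print(raw_norm, grid_norm, E[0])
```

The eigenvectors come back with a sum of squares equal to 1; after the rescaling it is the integral over x, approximated on the grid, that equals 1.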

plot(x,ho_evec(x,0,1,1),label="Analytic")
plot(x,-U[:,0]/sqrt(h),label="Numeric")
xlabel('x (bohr)')
ylabel(r'$\psi(x)$')
title("Comparison of numeric and analytic solutions to the Harmonic Oscillator")
legend()

Out[118]:

<matplotlib.legend.Legend at 0x59b4e90>

The agreement is almost exact.

We can use the subplot command to put multiple comparisons in different panes on a single plot:

Other than phase errors (which I've corrected with a little hack: can you find it?), the agreement is pretty good, although it gets worse the higher in energy we get, in part because we used only 50 points.

The Scipy module has many more special functions:

In [120]:

from scipy.special import airy,jn,eval_chebyt,eval_legendre

subplot(2,2,1)
x = linspace(-1,1)
Ai,Aip,Bi,Bip = airy(x)
plot(x,Ai)
plot(x,Aip)
plot(x,Bi)
plot(x,Bip)
title("Airy functions")

subplot(2,2,2)
x = linspace(0,10)
for i in range(4):
    plot(x,jn(i,x))
title("Bessel functions")

subplot(2,2,3)
x = linspace(-1,1)
for i in range(6):
    plot(x,eval_chebyt(i,x))
title("Chebyshev polynomials of the first kind")

subplot(2,2,4)
x = linspace(-1,1)
for i in range(6):
    plot(x,eval_legendre(i,x))
title("Legendre polynomials")

Out[120]:

<matplotlib.text.Text at 0x69dbe10>

As well as Jacobi, Laguerre, Hermite polynomials, Hypergeometric functions, and many others. There's a full listing at the Scipy Special Functions Page.

There's a section below on parsing CSV data. We'll steal the parser from that. For an explanation, skip ahead to that section. Otherwise, just assume that this is a way to parse that text into a numpy array that we can plot and do other analyses with.

Since we expect the data to have an exponential decay, we can plot it using a semi-log plot.

In [124]:

title("Raw Data")
xlabel("Distance")
semilogy(data[:,0],data[:,1],'bo')

Out[124]:

[<matplotlib.lines.Line2D at 0x6501f10>]

For a pure exponential decay like this, we can fit the log of the data to a straight line. The above plot suggests this is a good approximation. Given a function
$$ y = Ae^{-ax} $$
taking the logarithm of both sides gives
$$ \log(y) = \log(A) - ax$$
Thus, if we fit the log of the data versus x, we should get a straight line with slope $-a$, and an intercept $\log(A)$ that gives the constant $A$.

There's a numpy function called polyfit that will fit data to a polynomial form. We'll use this to fit to a straight line (a polynomial of order 1)
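As a sketch of how this works, the following fits a straight line to the log of some synthetic, noise-free data. The values of `A_true` and `a_true` are made up for illustration and stand in for the measured data used in this section.

```python
import numpy as np

# Synthetic exponential decay (assumed values, not the notebook's data)
A_true, a_true = 2.5, 0.7
x = np.linspace(0, 5, 20)
y = A_true*np.exp(-a_true*x)

# polyfit returns coefficients from highest order to lowest:
# for a line that is [slope, intercept]
slope, intercept = np.polyfit(x, np.log(y), 1)
a_fit = -slope               # the slope of log(y) vs x is -a
A_fit = np.exp(intercept)    # the intercept is log(A)
print(a_fit, A_fit)
```

With real, noisy data the recovered values will only approximate the underlying constants, and the small-y points tend to dominate the error in the log fit.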

This data looks more Gaussian than exponential. If we wanted to, we could use polyfit for this as well, but let's use the curve_fit function from Scipy, which can fit to arbitrary functions. You can learn more using help(curve_fit).
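Here is a minimal sketch of fitting a Gaussian with `curve_fit`. The data is synthetic and noise-free, and the parameter values and initial guess `p0` are assumptions chosen for illustration rather than taken from the data in this section.

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss(x, A, a, x0):
    """Gaussian of height A, width parameter a, centered at x0."""
    return A*np.exp(-a*(x - x0)**2)

# Synthetic data standing in for a measurement (assumed parameters)
x = np.linspace(-5, 5, 101)
y = gauss(x, 10.0, 0.5, 1.0)

# curve_fit takes the model, the data, and an initial guess for the parameters;
# it returns the best-fit parameters and their covariance matrix
params, cov = curve_fit(gauss, x, y, p0=(8.0, 0.4, 0.8))
A_fit, a_fit, x0_fit = params
print(A_fit, a_fit, x0_fit)
```

For nonlinear models like this one, a reasonable initial guess matters: a poor `p0` can leave the fit stuck far from the best parameters.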

Many methods in scientific computing rely on Monte Carlo integration, where a sequence of (pseudo) random numbers is used to approximate the integral of a function. Python has good random number generators in the standard library. The random() function gives pseudorandom numbers uniformly distributed between 0 and 1:

random() uses the Mersenne Twister algorithm, which is a highly regarded pseudorandom number generator. There are also functions to generate random integers, to randomly shuffle a list, and functions to pick random numbers from a particular distribution, like the normal distribution:
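A short sketch of these standard library functions follows. The seed is set only so that repeated runs give the same numbers; the variable names are made up for illustration.

```python
import random

random.seed(0)                 # optional: makes the run repeatable

u = random.random()            # uniform float in [0, 1)
k = random.randint(1, 6)       # random integer, both end points included
deck = list(range(10))
random.shuffle(deck)           # shuffles the list in place
g = random.gauss(0.0, 1.0)     # normal distribution: mean 0, std dev 1

print(u, k, deck, g)
```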

It is generally more efficient to generate a list of random numbers all at once, particularly if you're drawing from a non-uniform distribution. Numpy has functions to generate vectors and matrices of particular types of random distributions.

In [132]:

plot(rand(100))

Out[132]:

[<matplotlib.lines.Line2D at 0x794fe90>]

One of the first programs I ever wrote was a program to compute $\pi$ by taking random numbers as x and y coordinates, and counting how many of them were in the unit circle. For example:

In [133]:

npts = 5000
xs = 2*rand(npts)-1
ys = 2*rand(npts)-1
r = xs**2+ys**2
ninside = (r<1).sum()
figsize(6,6) # make the figure square
title("Approximation to pi = %f" % (4*ninside/float(npts)))
plot(xs[r<1],ys[r<1],'b.')
plot(xs[r>1],ys[r>1],'r.')
figsize(8,6) # change the figsize back to 4x3 for the rest of the notebook

The idea behind the program is that the ratio of the area of the unit circle to the area of the square that circumscribes it is $\pi/4$, so by counting the fraction of the random points in the square that are inside the circle, we get increasingly good estimates to $\pi$.

The above code uses some higher level Numpy tricks to compute the radius of each point in a single line, to count how many radii are below one in a single line, and to filter the x,y points based on their radii. To be honest, I rarely write code like this: I find some of these Numpy tricks a little too cute to remember them, and I'm more likely to use a list comprehension (see below) to filter the points I want, since I can remember that.
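For comparison, here is a sketch of the same estimate written with plain Python lists and a list comprehension instead of Numpy boolean indexing. It uses the standard library `random` module, and the seed is there only to make the run repeatable.

```python
import random

random.seed(1)
npts = 5000

# Random points in the square [-1, 1] x [-1, 1]
points = [(2*random.random() - 1, 2*random.random() - 1) for _ in range(npts)]

# Keep only the points that fall inside the unit circle
inside = [(x, y) for (x, y) in points if x**2 + y**2 < 1]

pi_estimate = 4*len(inside)/float(npts)
print(pi_estimate)
```

This is slower than the Numpy version for large numbers of points, but the filtering step reads almost like the sentence that describes it.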

As methods of computing $\pi$ go, this is among the worst. A much better method is to use Leibniz's expansion of arctan(1):

$$\frac{\pi}{4} = \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}$$

In [134]:

n = 100
total = 0
for k in range(n):
    total += pow(-1,k)/(2*k+1.0)
print(4*total)

3.1315929035585537

If you're interested in a great method, check out Ramanujan's method. This converges so fast you really need arbitrary precision math to display enough decimal places. You can do this with the Python decimal module, if you're interested.
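As a sketch of what that might look like, the following sums the first few terms of Ramanujan's series for $1/\pi$, $\frac{1}{\pi} = \frac{2\sqrt{2}}{9801}\sum_k \frac{(4k)!\,(1103+26390k)}{(k!)^4\,396^{4k}}$, using the decimal module. The function name `ramanujan_pi` and its default arguments are my own choices for illustration.

```python
from decimal import Decimal, localcontext
from math import factorial

def ramanujan_pi(nterms=5, digits=30):
    """Approximate pi from the first nterms of Ramanujan's 1/pi series."""
    with localcontext() as ctx:
        ctx.prec = digits + 10      # working precision plus guard digits
        total = Decimal(0)
        for k in range(nterms):
            numerator = Decimal(factorial(4*k)*(1103 + 26390*k))
            denominator = Decimal(factorial(k)**4 * 396**(4*k))
            total += numerator/denominator
        inv_pi = Decimal(2).sqrt()*2/9801*total
        return 1/inv_pi

# Each term adds roughly eight correct digits, so the trailing digits
# printed here go beyond what five terms can resolve.
print(ramanujan_pi())
```

Compare this with the Leibniz sum above, where 100 terms give only two correct decimal places.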

As more and more of our day-to-day work is being done on and through computers, we increasingly have output that one program writes, often in a text file, that we need to analyze in one way or another, and potentially feed that output into another file.

This output actually came from a geometry optimization of a Silicon cluster using the NWChem quantum chemistry suite. At every step the program computes the energy of the molecular geometry, and then changes the geometry to minimize the computed forces, until the energy converges. I obtained this output via the unix command

% grep @ nwchem.out

since NWChem is nice enough to precede the lines that you need to monitor job progress with the '@' symbol.

We could do the entire analysis in Python; I'll show how to do this later on, but first let's focus on turning this output into a usable Python object that we can plot.

First, note that the data is entered into a multi-line string. When Python sees three quote marks """ or ''' it treats everything following as part of a single string, including newlines, tabs, and anything else, until it sees the same three quote marks (""" has to be followed by another """, and ''' has to be followed by another ''') again. This is a convenient way to quickly dump data into Python, and it also reinforces the important idea that you don't have to open a file and deal with it one line at a time. You can read everything in, and deal with it as one big chunk.

The first thing we'll do, though, is to split the big string into a list of strings, since each line corresponds to a separate piece of data. We will use the splitlines() function on the big myout string to break it into a new element every time it sees a newline (\n) character:

Splitting is a big concept in text processing. We used splitlines() here, and we will use the more general split() function below to split each line into whitespace-delimited words.
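Here is a small sketch of both functions. The table in `text` is made up for illustration and is much simpler than the NWChem output discussed in this section.

```python
# A made-up table: a title line, a line of dashes, then two rows of data
text = """\
step   energy    gmax
----   ------    ----
0      -10.5     0.30
1      -10.9     0.05
"""

lines = text.splitlines()    # one string per line, newlines removed
words = lines[2].split()     # split the first data row on whitespace

print(len(lines))
print(words)
```

Note that `split()` with no arguments treats any run of spaces or tabs as a single separator, so the column alignment in the text doesn't matter.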

We now want to do three things:

Skip over the lines that don't carry any information

Break apart each line that does carry information and grab the pieces we want

Turn the resulting data into something that we can plot.

For this data, we really only want the Energy column, the Gmax column (which contains the maximum gradient at each step), and perhaps the Walltime column.

Since the data is now in a list of lines, we can iterate over it:

In [140]:

for line in lines[2:]:
    # do something with each line
    words = line.split()

Let's examine what we just did: first, we used a for loop to iterate over each line. However, we skipped the first two (the lines[2:] only takes the lines starting from index 2), since lines[0] contained the title information, and lines[1] contained underscores.

We then split each line into chunks (which we're calling "words", even though in most cases they're numbers) using the string split() command. Here's what split does:

In [141]:

#import string
help("".split)

Help on built-in function split:
split(...)
S.split([sep[, maxsplit]]) -> list of strings
Return a list of the words in S, using sep as the
delimiter string. If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are
removed from the result.

Here we're implicitly passing in the first argument (S, in the help text) by calling a method .split() on a string object. In this instance, we're not passing in a sep character, which means that the function splits on whitespace. Let's see what that does to one of our lines:

This is fine for printing things out, but if we want to do something with the data, either make a calculation with it or pass it into a plotting function, we need to convert the strings into regular floating point numbers. We can use the float() command for this. We also need to save it in some form. I'll do this as follows:

In [144]:

data = []
for line in lines[2:]:
    # do something with each line
    words = line.split()
    energy = float(words[2])
    gmax = float(words[4])
    time = float(words[8])
    data.append((energy,gmax,time))
data = array(data)

We now have our data in a numpy array, so we can choose columns to print:

I would write the code a little more succinctly if I were doing this for myself, but this is essentially a snippet I use repeatedly.

Suppose our data was in CSV (comma separated values) format, a format widely used by spreadsheet programs such as Microsoft Excel, and increasingly used as a data interchange format in big data applications. How would we parse that?

There are two significant changes over what we did earlier. First, I'm passing the comma character ',' into the split function, so that it breaks to a new word every time it sees a comma. Next, to simplify things a bit, I'm using the map() command to repeatedly apply a single function (float()) to a list, and to return the output as a list (in Python 3, map() returns an iterator, so wrap the call in list() to get a list).
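A minimal sketch of these two changes, using a made-up three-column CSV string rather than the data from this section:

```python
# Made-up CSV text: a header line followed by two rows of numbers
csv_text = """\
energy,gmax,walltime
-10.5,0.30,1.2
-10.9,0.05,2.5
"""

rows = []
for line in csv_text.splitlines()[1:]:    # skip the header line
    words = line.split(',')               # split on commas, not whitespace
    rows.append(list(map(float, words)))  # list() is needed in Python 3

print(rows)
```

For anything more complicated than this, such as quoted fields that contain commas, the standard library `csv` module is the safer choice.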

Hartrees (what most quantum chemistry programs use by default) are really stupid units. We really want this in kcal/mol or eV or something we use. So let's quickly replot this in terms of eV above the minimum energy, which will give us a much more useful plot:

This gives us the output in a form that we can think about: 4 eV is a fairly substantial energy change (chemical bonds are roughly this magnitude of energy), and most of the energy decrease was obtained in the first geometry iteration.

We mentioned earlier that we don't have to rely on grep to pull out the relevant lines for us. Python strings have a lot of useful methods we can use for this. Among them is the startswith method. For example:
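Here is a sketch on a few made-up lines of program output; the contents of `output` are hypothetical and only mimic the idea of marking the interesting lines with '@'.

```python
# Hypothetical program output: only some lines start with '@'
output = """\
Starting optimization
@ Step   Energy
@ 0      -10.5
some other diagnostic line
@ 1      -10.9
"""

at_lines = [line for line in output.splitlines() if line.startswith("@")]
for line in at_lines:
    print(line)
```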

and we've successfully grabbed all of the lines that begin with the @ symbol.

The real value in a language like Python is that it makes it easy to take additional steps to analyze data in this fashion, which means you are thinking more about your data, and are more likely to see important patterns.

Strings are a big deal in most modern languages, and hopefully the previous sections helped underscore how versatile Python's string processing techniques are. We will continue this topic in this chapter.

We can print out lines in Python using the print command.

In [151]:

print("I have 3 errands to run")

I have 3 errands to run

In IPython we don't even need the print command, since it will display the last expression not assigned to a variable.

In [152]:

"I have 3 errands to run"

Out[152]:

'I have 3 errands to run'

print even converts some arguments to strings for us:

In [153]:

a,b,c = 1,2,3
print("The variables are ",a,b,c)

The variables are 1 2 3

As versatile as this is, you typically need more freedom over the data you print out. For example, what if we want to print a bunch of data to exactly 4 decimal places? We can do this using formatted strings.

Formatted strings share a syntax with the C printf statement. We make a string that has some funny format characters in it, and then pass a bunch of variables into the string that fill out those characters in different ways.

We use a percent sign in two different ways here. First, the format character itself starts with a percent sign. %d or %i are for integers, %f is for floats, %e is for numbers in exponential formats. All of these can take a number immediately after the percent that specifies the total number of spaces used to print the value. Formats with a decimal can take an additional number after a dot . to specify the number of decimal places to print.

The other use of the percent sign is after the string, to pipe a set of variables in. You can pass in multiple variables (if your formatting string supports it) by putting a tuple after the percent.
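Here is a minimal sketch of both uses of the percent sign; the variable names and values are made up for illustration.

```python
n = 3
x = 3.14159265

s1 = "I have %d errands to run" % n        # one variable, no tuple needed
s2 = "%.4f" % x                            # float with 4 decimal places
s3 = "%10.4f|%-8d|%e" % (x, n, x)          # several variables in a tuple

print(s1)
print(s2)
print(s3)
```

In the last string, `%10.4f` pads the number to a total width of 10 characters, and the minus sign in `%-8d` left-justifies the integer within its 8 characters.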

It's worth noting that more flexible string formatting methods are available, such as the format() method of strings, but I prefer this system due to its simplicity and its similarity to C formatting strings.

Recall we discussed multiline strings. We can put format characters in these as well, and fill them with the percent sign as before.

In [156]:

form_letter = """\
%s
Dear %s,
We regret to inform you that your product did not
ship today due to %s.
We hope to remedy this as soon as possible.
From,
Your Supplier"""
print(form_letter % ("July 1, 2013","Valued Customer Bob","alien attack"))

July 1, 2013
Dear Valued Customer Bob,
We regret to inform you that your product did not
ship today due to alien attack.
We hope to remedy this as soon as possible.
From,
Your Supplier

The problem with a long block of text like this is that it's often hard to keep track of what all of the variables are supposed to stand for. There's an alternate format where you can pass a dictionary into the formatted string, and give a little bit more information to the formatted string itself. This method looks like:

In [157]:

form_letter = """\
%(date)s
Dear %(customer)s,
We regret to inform you that your product did not
ship today due to %(lame_excuse)s.
We hope to remedy this as soon as possible.
From,
Your Supplier"""
print(form_letter % {"date":"July 1, 2013","customer":"Valued Customer Bob","lame_excuse":"alien attack"})

July 1, 2013
Dear Valued Customer Bob,
We regret to inform you that your product did not
ship today due to alien attack.
We hope to remedy this as soon as possible.
From,
Your Supplier

By providing a little bit more information, you're less likely to make mistakes, like referring to your customer as "alien attack".

As a scientist, you're less likely to be sending bulk mailings to a bunch of customers. But these are great methods for generating and submitting lots of similar runs, say scanning a bunch of different structures to find the optimal configuration for something.

For example, you can use the following template for NWChem input files:
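What follows is a sketch of the idea rather than a tested production input. The directives in `nwchem_template` follow common NWChem input conventions, but the template, the parameter names, and the linear water geometry are all assumptions made for illustration. The inputs are collected in a list of strings instead of being written to files.

```python
# Hypothetical template: an illustrative assumption, not a validated input
nwchem_template = """\
start %(jobname)s
title "%(title)s"
charge %(charge)d

geometry units angstroms
%(geometry)s
end

basis
  * library %(basis)s
end

task scf %(jobtype)s
"""

# A deliberately poor (linear) starting geometry for water; only the
# O-H distance changes from one input to the next.
inputs = []
for i, r in enumerate([0.8, 0.9, 1.0, 1.1]):
    geometry = "O 0.0 0.0 0.0\nH 0.0 0.0 %.2f\nH 0.0 0.0 %.2f" % (r, -r)
    params = {
        "jobname": "h2o-%d" % i,
        "title": "Water, O-H distance %.2f" % r,
        "charge": 0,
        "geometry": geometry,
        "basis": "6-31G**",
        "jobtype": "optimize",
    }
    inputs.append(nwchem_template % params)

print(inputs[0])
```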

This is a very bad geometry for a water molecule, and it would be silly to run so many geometry optimizations of structures that are guaranteed to converge to the same single geometry, but you get the idea of how you can run vast numbers of simulations with a technique like this.

We used the enumerate function to loop over both the indices and the items of a sequence, which is valuable when you want a clean way of getting both. enumerate is roughly equivalent to:
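A sketch of that rough equivalent is below. The name `my_enumerate` is made up, and unlike the built-in (which produces its pairs lazily, one at a time) this version builds the whole list at once.

```python
def my_enumerate(seq):
    """Return a list of (index, item) pairs for a sequence."""
    pairs = []
    for i in range(len(seq)):
        pairs.append((i, seq[i]))
    return pairs

for i, letter in my_enumerate("abc"):
    print(i, letter)
```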

What the keyword argument construction does is to take any additional keyword arguments (i.e. arguments specified by name, like "endpoint=False"), and stick them into a dictionary called "kwargs" (you can call it anything you like, but it has to be preceded by two stars). You can then grab items out of the dictionary using the get command, which also lets you specify a default value. I realize it takes a little getting used to, but it is a common construction in Python code, and you should be able to recognize it.
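Here is a sketch of this construction. The function `my_linspace` and its keyword names `npts` and `endpoint` are hypothetical, chosen to echo Numpy's linspace; the sketch assumes at least two points are requested.

```python
def my_linspace(start, end, **kwargs):
    """Evenly spaced points from start to end, as a plain Python list."""
    npts = kwargs.get("npts", 50)            # default: 50 points
    endpoint = kwargs.get("endpoint", True)  # default: include the end point
    divisions = npts - 1 if endpoint else npts
    step = (end - start)/float(divisions)
    return [start + i*step for i in range(npts)]

print(my_linspace(0, 1, npts=5))
print(my_linspace(0, 1, npts=5, endpoint=False))
```

One drawback of this style is that a misspelled keyword (say, `endpoints=False`) is silently ignored, since nothing ever looks it up in the dictionary.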

There's an analogous *args that dumps any additional arguments into a tuple called "args". Think about the range function: it can take one (the endpoint), two (starting and ending points), or three (starting, ending, and step) arguments. How would we define this?
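One possible answer, as a sketch: the name `my_range` is made up, and this version only handles positive steps, unlike the built-in range.

```python
def my_range(*args):
    """A simplified range() that returns a list; positive steps only."""
    start, step = 0, 1
    if len(args) == 1:
        end = args[0]
    elif len(args) == 2:
        start, end = args
    elif len(args) == 3:
        start, end, step = args
    else:
        raise Exception("Unable to parse arguments")
    if step <= 0:
        raise Exception("This sketch only handles positive steps")
    values = []
    i = 0
    while True:
        v = start + i*step
        if v >= end:
            break            # leave the loop once we reach the end point
        values.append(v)
        i += 1
    return values

print(my_range(5))
print(my_range(1, 10, 3))
```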

Note that we have defined a few new things you haven't seen before: a break statement, that allows us to exit a for loop if some conditions are met, and an exception statement, that causes the interpreter to exit with an error message. For example:

List comprehensions are a streamlined way to make lists. They look something like a list definition, with some logic thrown in. For example:

In [171]:

evens1 = [2*i for i in range(10)]
print(evens1)

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

You can also put some boolean testing into the construct:

In [172]:

odds = [i for i in range(20) if i%2==1]
odds

Out[172]:

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

Here i%2 is the remainder when i is divided by 2, so that i%2==1 is true if the number is odd. Even though this is a relatively new addition to the language, it is now fairly common since it's so convenient.

Iterators are a way of making virtual sequence objects. Consider if we had the nested loop structure:

for i in range(1000000):
    for j in range(1000000):

Inside the main loop, we make a list of 1,000,000 integers, just to loop over them one at a time. We don't need any of the additional things that a list gives us, like slicing or random access; we just need to go through the numbers one at a time. And we're making 1,000,000 of them.

Iterators are a way around this. For example, the xrange function is the iterator version of range. (In Python 3, range itself behaves this way and xrange no longer exists.) This simply makes a counter that is looped through in sequence, so that the analogous loop structure would look like:

for i in xrange(1000000):
    for j in xrange(1000000):

Even though we've only added two characters, we've dramatically sped up the code, because we're not making 1,000,000 big lists.
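To see what "making a counter" means, here is a sketch of a hand-written lazy counter using a generator function, which is one common way to build an iterator in Python. The name `count_up_to` is made up for illustration.

```python
def count_up_to(n):
    """Yield 0, 1, ..., n-1 one at a time, without building a list."""
    i = 0
    while i < n:
        yield i      # hand back one value, then pause until asked again
        i += 1

total = 0
for i in count_up_to(1000):
    total += i
print(total)
```

Only one integer exists at any moment, no matter how large n is, which is where the savings over a full list come from.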

A factory function is a function that returns a function. They have the fancy name lexical closure, which makes you sound really intelligent in front of your CS friends. But, despite the arcane names, factory functions can play a very practical role.

Suppose you want the Gaussian function centered at 0.5, with height 99 and width 1.0. You could write a general function.

In [176]:

def gauss(x,A,a,x0):
    return A*exp(-a*(x-x0)**2)

But what if you need a function with only one argument, like f(x) rather than f(x,y,z,...)? You can do this with Factory Functions:
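A minimal sketch of such a factory function is below. The name `gauss_maker` is my own choice for illustration; it uses `exp` from the standard math module so that it stands alone.

```python
from math import exp

def gauss_maker(A, a, x0):
    """Return a one-argument Gaussian with A, a, and x0 baked in."""
    def f(x):
        return A*exp(-a*(x - x0)**2)
    return f

# Height 99, width parameter 1.0, centered at 0.5
g = gauss_maker(99.0, 1.0, 0.5)
print(g(0.5))   # the value at the center, equal to the height A
```

The inner function `f` remembers the values of `A`, `a`, and `x0` from the call that created it, so `g` can be passed anywhere a function of a single variable is expected, such as an integrator or a root finder.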