Beginning Python for Bioinformatics

Bioinformatics, the use of computers in biological research, is the newest
wrinkle on one of the oldest pursuits--trying to uncover the secret of life.
While we may not know all of life's secrets, computers are at the very least
helping us understand many of the biological processes that take place
inside living things. In fact, the use of computers in biological research
has grown to such a degree that computer programming has become an important,
almost essential, skill for today's biologists.

The purpose of this article is to introduce Python as a useful and viable
development language for the computer programming needs of the bioinformatics
community. In this introduction, we'll identify some of the advantages of using
Python for bioinformatics. Then we'll create and demonstrate examples of
working code to get you started. In subsequent articles we'll explore some
significant bioinformatics projects that make use of Python.

A Bit of Background

Because scientists have long relied on the open availability of each other's
research results, it was only natural that they would turn to Open Source
software when it came time to apply computer processes to the study of
biological processes. One of the first Open Source languages to gain popularity
among biologists was Perl. Perl gained a foothold in bioinformatics based on
its strong text processing facilities, which were ideally suited to analyzing
early sequence data. To its credit, Perl has a history of successful use in
bioinformatics and is still a very useful tool for biological research.

In comparison to Perl, Python is a relative newcomer to bioinformatics,
but it is steadily gaining in popularity. A few of the reasons for this
popularity are:

Readability of Python code

Ability to develop applications quickly

Powerful standard library of functionality

Scalability from very small to very large programs

The Python language was designed to be as simple and accessible as possible,
without giving up any of the power needed to develop sophisticated
applications. Python's clean, consistent syntax leaves it free from the
subtleties and nuances that can make other languages difficult to
learn and programs written in those languages difficult to comprehend.

Python's dynamic nature adds to its accessibility. For example, Python
doesn't require you to declare variables before you use them, and the same
variable can refer to objects of different types over the course of its
existence. Python can also be used interactively, allowing you to
familiarize yourself with the language or any Python modules in an interactive
session where each command produces immediate results.
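
As a minimal sketch of this dynamic behavior, the lines below can be typed
one at a time into an interactive session. The example assumes Python 3
syntax, and the variable name is purely illustrative.

```python
# No declaration is needed; a name comes into existence when it is assigned.
sequence = "ACCTTGGCACCT"        # the name refers to a string
first_type = type(sequence).__name__
print(first_type)                # str

sequence = len(sequence)         # the same name now refers to an integer
second_type = type(sequence).__name__
print(second_type)               # int
```

Rebinding one name to objects of different types is legal, although most
programs read more clearly when a name keeps a single meaning.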

Python also has excellent support for the object-oriented style of
programming. We'll show an example of this capability at the end of this
article, but the basic idea is that object-orientation often provides a better
way to organize the data and functionality within your programs. As the data
and analytical techniques used in bioinformatics have become more complex, the
value of object-oriented language features has risen.

In addition, Python integrates well with systems written in other languages,
such as C, C++, Java and Fortran. One of the main benefits of C is speed. When
a programmer needs an algorithm to run as fast as possible, they can code it in
C or C++ and make it available to Python as an extension module. To the
programmer using them, such extension modules are indistinguishable from pure
Python modules. Similar
utilities exist that make the large body of scientific algorithms coded in
Fortran accessible to Python programs.

Java has become popular as a cross-platform and Web development language.
The Python interpreter is now available in two variations: one version
written in C, and the other version, known as Jython, written in Java. Jython
allows Java programmers to write programs using the Python syntax and dynamic
language features, and it allows Python programmers to use existing code
developed in Java. These are just a few examples of the many ways Python is
able to leverage and extend existing code written in other languages.

So while Perl is better established in the bioinformatics community, many
biologists and bioinformaticians are also turning to Python as it gains in
popularity. To get a better sense of what Python has to offer, we'll look at
examples of Python code that highlight some of its features. But first, we need
to cover some of the basic biology that we'll touch on in the examples.

A Bit of Biology

One of the goals of molecular biology is to understand the processes that
take place within the cells of living organisms. One such process is the
creation of proteins, some of the most basic raw materials of all living
things. Almost every process within a living creature makes use of, or is
influenced by, these large, complex molecular structures. There are thousands
of different proteins and we have barely begun to understand them in any
detail. One thing we do know is that the creation of proteins is determined by
the information encoded within the genetic material in each cell, called DNA.

DNA is a linear structure made up of a sequence of molecules called
nucleotides or bases. Four nucleotides appear in DNA: adenine, cytosine,
guanine, and thymine. These nucleotides are usually represented by their
initials, A, C, G and T. DNA is actually composed of two strands of these
nucleotides wound around each other in the famous double helix shape.

The sequence of a single strand of DNA can be represented as a sequence of
alphabetical characters identifying each base in the sequence, such as
ACCTTGGCACCT. Due to their chemical attractions, the nucleotides on the two
strands always appear in pairs, also called base pairs, such that adenine (A)
always pairs up with thymine (T), and cytosine (C) always pairs up with
guanine (G). Because of this
base-pairing characteristic, we can easily determine the complementary, or
opposite, strand of any single-stranded DNA sequence.
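
The base-pairing rule translates directly into a lookup table. The following
is a minimal sketch, assuming Python 3 and uppercase input containing only
A, C, G and T; the names COMPLEMENT and complement are illustrative choices,
not part of any library.

```python
# Each base maps to the base it pairs with on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(dna):
    """Return the complementary strand, base by base, in the same order."""
    return "".join(COMPLEMENT[base] for base in dna)

print(complement("ACCTTGGCACCT"))  # TGGAACCGTGGA
```

Any character outside the four-letter alphabet raises a KeyError here, which
is a deliberate simplification for this sketch.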

A simplified view of how DNA determines the creation of a protein goes
something like this. A section of DNA called a gene contains the encoded
information about the protein to create. Through the process of transcription,
the two DNA strands along a gene separate, and the gene is copied. This
single-stranded copy is called RNA, or, more precisely, messenger RNA. It is
identical to the original gene sequence, except that the nucleotide uracil (U)
appears in place of thymine (T).
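
Under the simplified view described above, transcription amounts to a single
text substitution. This is a sketch only, assuming Python 3 and an uppercase
DNA string; the function name transcribe is hypothetical.

```python
def transcribe(dna):
    """Return the messenger RNA text for a gene: every T becomes a U."""
    return dna.replace("T", "U")

print(transcribe("ACCTTGGCACCT"))  # ACCUUGGCACCU
```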

Once formed, the messenger RNA moves to a structure in the cell known as a
ribosome. The ribosome moves along the messenger RNA, reading its sequence
three nucleotides at a time. Each group of three nucleotides, called a codon,
determines which of 20 amino acids gets assembled by the ribosome into a
protein. Like DNA and RNA, proteins are linear structures that can be
represented by a text string. Where DNA and RNA use a four-character alphabet,
proteins require a 20-character alphabet to represent each amino acid in a
protein sequence.
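
Reading a messenger RNA three nucleotides at a time can be sketched with
string slicing. The example below assumes Python 3, and its codon table is
deliberately partial: it lists only four of the 64 possible codons, using the
conventional one-letter amino acid codes and "*" as a stop marker. The names
CODON_TABLE, codons and translate are illustrative.

```python
# Partial table for illustration only; the full genetic code has 64 codons.
CODON_TABLE = {
    "AUG": "M",  # methionine
    "GCC": "A",  # alanine
    "UUU": "F",  # phenylalanine
    "UAA": "*",  # stop signal
}

def codons(mrna):
    """Split an mRNA string into complete three-letter codons."""
    usable = len(mrna) - len(mrna) % 3   # ignore a trailing partial codon
    return [mrna[i:i + 3] for i in range(0, usable, 3)]

def translate(mrna):
    """Build a protein string, stopping at the first stop codon."""
    protein = []
    for codon in codons(mrna):
        amino_acid = CODON_TABLE.get(codon, "X")  # "X" marks a codon not in this table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(codons("AUGGCCUUUUAA"))     # ['AUG', 'GCC', 'UUU', 'UAA']
print(translate("AUGGCCUUUUAA"))  # MAF
```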