The Comprehensive Introduction To Your Genome With the SciPy Stack

Bioinformatics is an interdisciplinary field that develops methods and software tools for analyzing and understanding biological data.

More simply stated, you can simply think of it as data science for biology.

Among the many types of biological data, genomics data is one of the most widely analyzed. Especially with the rapid advancement of next-generation DNA sequencing (NGS) technologies, the volume of genomics data has been growing exponentially. According to Stephens, Zachary D et al., genomics data acquisition is on the exabyte-per-year scale.

SciPy gathers many Python modules for scientific computing, which is ideal for many bioinformatic needs.

In this post, I demo an example of analyzing a GFF3 file for the human genome with the SciPy Stack. Generic Feature Format Version 3 (GFF3) is the current standard text file format for storing genomic features. In particular, in this post you will learn how to use the SciPy stack to answer the following questions about the human genome:

How much of the genome is incomplete?

How many genes are there in the genome?

How long is a typical gene?

What does gene distribution among chromosomes look like?

The latest GFF3 file for the human genome can be downloaded from here. The README file that comes in this directory provides a brief description of this data format, and a more thorough specification is found here.

We will use Pandas, a major component of the SciPy stack providing fast, flexible, and expressive data structures, to manipulate and understand the GFF3 file.

Setup

First things first, let’s setup a virtual environment with the SciPy stack installed. This process can be time-consuming if built from the source manually, as the stack involves many packages – some of which depends on external FORTRAN or C code. Here, I recommend using Miniconda, which makes the setup very easy.

The -b flag on the bash line tells it to execute in batch mode. After the commands above are used to successfully install Miniconda, start a new virtual environment for genomics, and then install the SciPy stack.

Note that we have only specified the 3 packages we are going to use in this post. If you want all the packages listed in the SciPy stack, simply append them to the end of the conda create command.

If you are unsure of the exact name of a package, try conda search. Let’s activate the virtual environment and start IPython.

source activate venv/
ipython

IPython is a significantly more powerful replacement to the default Python interpreter interface, so whatever you used to do in the default python interpreter can also be done in IPython. I highly recommend every Python programmer, who hasn’t been using IPython yet, to give it a try.

Download the Annotation File

It is about 37 MB, a very small file compared to the information content of a human genome, which is about 3 GB in plain text. That’s because the GFF3 file only contains the annotation of the sequences, while the sequence data is usually stored in another file format called FASTA. If you are interested, you can download FASTA here, but we won’t use the sequence data in this tutorial.

The prefixed ! tells IPython that this is a shell command instead of a Python command. However, IPython can also process some frequently used shell commands like ls, pwd, rm, mkdir, rmdir even without a prefixed !.

Taking a look at the head of the GFF file, you will see many metadata/pragmas/directives lines starting with ## or #!.

According to the README, ## means the metadata is stable, while #! means it’s experimental.

Later on you will also see ###, which is another directive with yet more subtle meaning based on the specification.

Human readable comments are supposed to be after a single #. For simplicity, we will treat all lines starting with # as comments, and simply ignore them during our analysis.

The first line indicates that the version of GFF format used in this file is 3.

Following that are summaries of all sequence regions. As we will see later, such information can also be found in the body part of the file.

The lines starting with #! shows information about the particular build of the genome, GRCh38.p7, that this annotation file applies to.

Genome Reference Consortium (GCR) is an international consortium, which oversees updates and improvements to several reference genome assemblies, including those for human, mouse, zebrafish, and chicken.

This is the annotation of the first chromosome with a seqid of 1, which starts from the first base to the 24,895,622nd base.

In other words, the first chromosome is about 25 million bases long.

Our analysis won’t need information from the three columns with a value of . (i.e. score, strand, and phase), so we can simply ignore them for now.

The last attributes column says Chromosome 1 also has three alias names, namely CM000663.2, chr1, and NC_000001.11. That’s basically what a GFF3 file looks like, but we won’t inspect them line by line, so it’s time to load the whole file into Pandas.

Pandas is good fit for dealing with GFF3 format because it is a tab-delimited file, and Pandas has very good support for reading CSV-like files.

Note one exception to the tab-delimited format is when the GFF3 contains ##FASTA.

According to the specification, ##FASTA indicates the end of an annotation portion, which will be followed with one or more sequences in FASTA (a non tab-delimited) format. But this is not the case for the GFF3 file we’re going to analyze.

The last line above loads the entire GFF3 file with pandas.read_csv method.

Since it is not a standard CSV file, we need to customize the call a bit.

First, we inform Pandas of the unavailability of header information in the GFF3 with header=None, and then we specify the exact name for each column with names=col_names.

If the names argument is not specified, Pandas will use incremental numbers as names for each column.

sep='\t' tells Pandas the columns are tab-separated instead of comma-separated. The value to sep= can actually be a regular expression (regex). This becomes handy if the file at hand uses different separators for each column (oh yeah, that happens). comment='#' means lines starting with # are considered comments and will be ignored.

compression='gzip' tells Pandas that the input file is a gzip-compressed.

In addition, pandas.read_csv has a rich set of parameters that allows different kinds of CSV-like file formats to be read.

The type of the returned value is a DataFrame, which is the most important data structure in Pandas, used for representing 2D data.

Pandas also has a Series and Panel data structure for 1D and 3D data, respectively. Please refer to the documentation for an introduction to Pandas’ data structures.

df.seqid is one way to access column data from a dataframe. Another way is df['seqid'], which is more general syntax, because if the column name is a Python reserved keyword (e.g. class) or contains a . or space character, the first way (df.seqid) won’t work.

The output shows that there are 194 unique seqids, which include Chromosomes 1 to 22, X, Y, and mitochondrion (MT) DNA as well as 169 others seqids.

The seqids starting with KI and GL are DNA sequences – or scaffolds – in the genome that have not been successfully assembled into the chromosomes.

For those who are unfamiliar with genomics, this is important.

Although the first human genome draft came out more than 15 years ago, the current human genome is still incomplete. The difficulty in assembling these sequences is largely due to complex repetitive regions in the genome.

Next, let’s take a look at the source column.

The README says that the source is a free text qualifier intended to describe the algorithm or operating procedure that generated this feature.

If you inspect the value evaluated from the expression df.source == 'GRCh38', it’s a series of True and False values for each entry with the same index as df. Passing it to df[] will only return those entries where their corresponding values are True.

There are 194 keys in df[] for which df.source == 'GRCh38'.

As we’ve seen previously, there are also 194 unique values in the seqid column, meaning each entry in gdfcorresponds to a particular seqid.

Then we randomly select 10 entries with the sample method to take closer look.

You can see that the unassembled sequences are of type supercontig while the others are of chromosome. To compute the fraction of genome that’s incomplete, we first need to know the length of the entire genome, which is the sum of the lengths of all seqids.

In the above snippet first, we made a copy of gdf with .copy(). Otherwise, the original gdf is just a slice ofdf, and modifying it directly would result in SettingWithCopyWarning (see here for more details).

We then calculate the length of each entry and add it back to gdf as a new column named “length”. The total length turns out to be about 3.1 billion, and the fraction of unassembled sequences is about 0.37%.

Here is how the slicing works in the last two commands.

First, we create a list of strings that covers all seqids of well assembled sequences, which are all chromosomes and mitochondria. We then use the isin method to filter all entries whose seqid are in the chrs list.

A minus sign (-) is added to the beginning of the index to reverse the selection, because we actually want everything that is not in the list (i.e. we want the unassembled ones starting with KI and GL)…

Note: Since the assembled and unassembled sequences are distinguished by the type column, the last line can alternatively be rewritten as follows to obtain the same results.

gdf[(gdf['type'] == 'supercontig')].length.sum() / gdf.length.sum()

How Many Genes Are There?

Here we focus on the entries from source ensembl, havana and ensembl_havana since they’re where the majority of the annotation entries belong.

They are formatted as semicolon-separated list of tag-value pairs. The information we are most interested in is gene name, gene ID and description, and we will extract them with regular expression (regex).

In the regex Name=(?P<gene_name>.+?); , +? is used instead of + because we want it to be non-greedy and let the search stop at the first semicolon; otherwise, the result will match up to the last semicolon.

Also, the regex is first compiled with re.compile instead of being used directly as in re.search for better performance because we will apply it to thousands of attribute strings.

extract_gene_name serves as a helper function to be used in pd.apply, which is the method to use when a function needs to be applied on every entry of a dataframe or series.

In this particular case, we want to extract the gene name for every entry in ndf.attributes, and add the names back to ndf in a new column called gene_name.

The regex for RE_GENE_ID is a bit more specific since we know that every gene_id must start with ENSG, where ENS means ensembl and G means gene.

For entries that don’t have any description, we will return an empty string. After everything is extracted, we won’t use the attributes column anymore, so let’s drop it to keep things nice and clean with the method .drop:

The lengths of the two extreme cases are five orders of magnitude apart (2.3 million vs. 8), which is enormous and which can be an indication of the level of diversity of life.

A single gene can be translated to many different proteins via a process called alternative splicing, something we haven’t explored. Such information is also inside the GFF3 file, but outside the scope of this post.

Gene Distribution Among Chromosomes

The last thing I’d like to discuss is gene distribution among chromosomes, which also serves as an example for introducing the .merge method for combining two DataFrames. Intuitively, longer chromosomes likely host more genes. Let’s see if that is true.

We borrowed the chrs variable from the previous section, and used it to filter out the unassembled sequences. Based on the output, the largest Chromosome 1 indeed has the most genes. While Chromosome Y has the smallest number of genes, it is not the smallest chromosome.

Note that there seem to be no genes in the mitochondrion (MT), which is not true.

A bit more filtering on the first DataFrame df returned by pd.read_csv shows that all MT genes are from source insdc (which were filtered out before when generating edf where we only considered sources of havana, ensembl, or ensembl_havana).

Since chr_gene_counts is still a Series object, which doesn’t support a merge operation, we need to convert it to a DataFrame object first with .to_frame.

.reset_index() converts the original index (i.e. seqid) into a new column and resets current index as 0-based incremental numbers.

The output from cdf.head(2) shows what it looks like. Next, we used the .merge method to combine the two DataFrame on the seqid column (on='seqid').

After merging gdf and cdf, the MT entry is still missing. This is because, by default, .merge operates an inner join, while left join, right join, or outer join are available by tuning the how parameter.

The related DataFrame.join method, uses merge internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns (the default behavior for merge). If you are joining on index, you may wish to use DataFrame.join to save yourself some typing.

Basically, .merge is more general-purpose and used by .join.

Finally, we are ready to calculate the correlation between chromosome length and gene_count.

As seen in image above, even though it is a positive correlation overall, it does not hold for all chromosomes. In particular, for Chromosome 17, 16, 15, 14, 13, the correlation is actually negative, meaning the number of genes on the chromosome decreases as the chromosome size increases.

Findings and Future Research

That ends our tutorial on the manipulation of an annotation file for human genome in GFF3 format with the SciPy stack. The tools we’ve mainly used include IPython, Pandas, and matplotlib. During the tutorial, not only have we learned some of the most common and useful operations in Pandas, we also answered some very interesting questions about our genome. In summary:

About 0.37% of the human genome is still incomplete even though the first draft came out over 15 years ago.

There are about 42,000 genes in the human genome based on this particular GFF3 file we used.

The length of a gene can range from a few dozen to over two million bases.

Genes are not evenly distributed among the chromosomes. Overall, the larger the chromosome, the more genes it hosts, but for a subset of the chromosomes, the correlation can be negative.

The GFF3 file is very rich in annotation information, and we have just scratched the surface. If you are interested in further exploration, here are a few questions you can play with:

How many transcripts does a gene typically have? What percentage of genes have more than 1 transcript?

How many isoforms does a transcript typically have?

How many exons, CDS, and UTRs does a transcript typically have? What sizes are they?

Is it possible to categorize the genes based on their function as described in the description column?