Biologist, Bioinformatician

Author: Connor Skennerton

NCBI’s assembly database is a great one-stop shop for genomic data and annotations, but it’s actually kind of difficult to download data if you only know the accession number of an assembly. The documentation says that the assembly database is integrated with entrez-direct, a great set of command line utilities for accessing NCBI data from the command line. Most of the databases have an option to download data based on the ID, so I thought that something like the following would work:

efetch -id GCA_000398025.1 -db assembly -format fasta

Alas, it does not. However, the directory structure of the assembly FTP site nicely mirrors the accession numbers, so for the example above the path looks something like this: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/398/025/

Below is a little awk script that I wrote to generate the FTP path for input into wget. The input for this script is a file containing the accession in the first column and, optionally, a second column containing the assembly name. The full FTP path contains both of these. I had hoped to get around a missing assembly name by using wildcards, but no: you need both the accession and the assembly name for this snippet to work.
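The path construction itself is worth spelling out. Here is the same logic sketched in Python rather than awk (the function name is my own, and the assembly name ASM39802v1 in the usage example is illustrative, not verified against NCBI):

```python
def assembly_ftp_dir(accession, assembly_name):
    """Build the NCBI assembly FTP directory for an accession,
    e.g. GCA_000398025.1 -> .../all/GCA/000/398/025/GCA_000398025.1_<name>/
    """
    prefix, rest = accession.split('_', 1)   # 'GCA', '000398025.1'
    digits = rest.split('.', 1)[0]           # '000398025'
    # the nine digits are split into groups of three to form the directory tree
    triples = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    return ('ftp://ftp.ncbi.nlm.nih.gov/genomes/all/'
            + '/'.join([prefix] + triples)
            + '/' + accession + '_' + assembly_name + '/')

# e.g. assembly_ftp_dir('GCA_000398025.1', 'ASM39802v1')
```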

I love looking at KEGG maps and using them to understand an organism’s metabolism, but they have their limitations. For starters, you’re obviously stuck with how they are drawn, which in most cases includes many variations on a particular pathway. Secondly, the tools for mapping your own genes onto a pathway are limited to one organism at a time.

What I really wanted was a way of quickly comparing tens of genomes across a common set of pathways, in this case amino acid synthesis pathways. Failing to find a suitable way to map many genomes onto KEGG’s pathway images, I decided to roll my own, combining manually drawn pathways with code to color in the presence/absence of each reaction in each organism.

My solution starts by creating a template SVG image in Illustrator. This is the laborious part, as I went through and redrew the amino acid pathways using information from both KEGG and MetaCyc. For each enzyme I positioned a grid of boxes, where each box represents an individual genome, that I would eventually color in based on the presence/absence of that reaction in that organism.

The TCA cycle and parts of the amino acid synthesis pathways. Each reaction is labelled with the EC number and contains a grid of 100 boxes.

Now the trick is to group the parts of the enzyme together and give that group a name in Illustrator. First, the grid of 100 boxes was grouped per row, and then I grouped those 10 rows together (this is important and I’ll explain why below). Then the boxes, the arrow and the text label all got grouped together. Next, name both the larger group and the group of boxes in Illustrator. You can do this easily by double clicking the group to enter “isolation mode”, where everything that isn’t in the group gets faded, and then changing the name in the layers panel of Illustrator.

When in isolation mode, the layers panel gives easy access to changing the name of the group. In this case I’m calling the group after the EC number.

Now comes the critical part of the naming – the group containing only the 100 boxes must start with “box-”, and whatever follows will be the identifier that gets used for coloring the boxes.

Each enzyme has a sub-group containing the matrix of 100 boxes. This group’s name must start with “box-”; whatever follows will be used to look up that reaction for each organism.

In the case above I’m using the EC number again as my identifier. It is whatever is written after the “box-” in the name of the group that matters, which means that you could use any kind of label in the figure. For simplicity I’ve kept both the visual label of the enzyme and the name of the group as the EC number.

Now, the reason why this is important is that we are going to save our figure as an SVG file. SVG is an XML-based file format, which means that it is simply text and thus easy to modify with some scripting. Let’s have a look at how each enzyme gets encoded in SVG:

SVG of part of one enzyme. The “g” tag in SVG is a group. You can see that the names for the whole enzyme and for the group of 100 boxes are retained from Illustrator.

The names that were given to the groups in Illustrator have been retained in the SVG file as the id attributes, and it is this text that we’ll process using the script. The other thing to note is how the boxes are arranged, in this case as groups of 10 – these are the rows that were made originally.

Coloring the boxes programmatically

Now for the scripted part. You’ll need the code available here, which will color in the boxes based on a couple of text files. Run the script as follows:

The ec_numbers.csv file must contain three columns and must have a header row labelled organism, ec, lineage. The organism column is the name of the organism; the ec column must contain the identifiers that match up to the names used in the SVG file. For example, if there is a box in the SVG file called “box-1.1.1.1” then the corresponding annotation in the ec column should be 1.1.1.1. However, there isn’t a requirement that this column contain EC numbers, or even that the identifiers all come from the same source. You could combine EC numbers, KO numbers and Pfam families in this column; they just have to match what’s written in the SVG file. The lineage column determines how the genomes are ordered within the 100 boxes. I use the full taxonomy string of the organism so that similar organisms are grouped together. However, this column doesn’t require a taxonomy string and is simply used for lexicographic sorting, so you could put anything in there to sort organisms however you want.

The organism_colors.csv file should contain two columns with a header row labelled organism and color. The organism column is the organism name that was used in the ec_numbers.csv file. The color column should contain a valid css color.
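The script linked above isn’t reproduced here, but its core is simple enough to sketch. Assuming the SVG structure described above (groups named “box-…” with one rect per organism, the rects appearing in the same order as the lineage-sorted organisms), a minimal Python sketch of the coloring step might look like this (function and argument names are my own):

```python
import xml.etree.ElementTree as ET

def color_boxes(svg_text, organisms, annotations, colors):
    """Color the presence/absence boxes of an SVG pathway template.

    organisms:   list of organism names, already sorted by lineage
    annotations: {organism: set of identifiers present in that genome}
    colors:      {organism: css color string}
    """
    root = ET.fromstring(svg_text)
    for group in root.iter():
        ident = group.get('id', '')
        if not ident.startswith('box-'):
            continue
        # the text after "box-" is the identifier to look up
        ident = ident[len('box-'):]
        # assume the rects appear in the same order as the organisms
        rects = [el for el in group.iter() if el.tag.endswith('rect')]
        for org, rect in zip(organisms, rects):
            if ident in annotations.get(org, set()):
                rect.set('fill', colors[org])
    return ET.tostring(root, encoding='unicode')
```

The real script also has to read the two CSV files and sort the organisms by the lineage column before the zip above makes sense.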

Here is a function that will take a character string in R and return an expression for fancy formatting in plots that properly italicizes scientific names. The syntax for doing this is truly quite horrible, but this is how R does it.

scientific_name_formatter <- function(raw_name) {
  # strsplit returns a list but we are passing in only
  # one name so we just take the first element of the list
  words = strsplit(raw_name, ' ', fixed = TRUE)[[1]]
  # some sort of acronym or bin name, leave it alone
  if (length(words) < 2) {
    return(raw_name)
  } else if (length(words) > 2) {
    if (words[1] == 'Candidatus') {
      # for Candidatus names, only the Candidatus part is italicised;
      # abbreviate it to 'Ca.' for brevity
      unitalic <- paste(words[2:length(words)], collapse = ' ')
      return(bquote(paste(italic(Ca.) ~ .(unitalic))))
    } else if (grepl('^[A-Z]+$', words[1])) {
      # if the first word is in all caps then it is an abbreviation
      # so we don't want to italicize anything
      return(raw_name)
    } else {
      # assume that everything after the second word is a strain name,
      # which should not get italicised
      unitalic <- paste(words[3:length(words)], collapse = ' ')
      return(bquote(paste(italic(.(words[1])) ~ italic(.(words[2])) ~ .(unitalic))))
    }
  } else {
    return(bquote(paste(italic(.(words[1])) ~ italic(.(words[2])))))
  }
}

I use it like the following in ggplot by setting the labels in scale_y_discrete.

Unix has so many great ways to perform text manipulation, but one niche that hasn’t been filled is splitting a tabular file into pieces based on the contents of certain columns. There are two commands, split and csplit, that play a similar role: split can split a file into a certain number of bytes or lines; csplit uses a regular expression to determine where to split the file. Often for my purposes neither of these tools is a good fit, and what I really want is an equivalent to the “group by” clause in SQL databases. Group by groups rows based on certain grouping columns so that you can then perform summaries on each group; split_by splits out a file based on the grouping columns for further processing on each chunk.

For example, say I’ve got a delimited text file containing a mix of categorical and numerical data like the following:

A    B    C    D      E      F
AU   BR   447  223.2  55958  US
AU   FR   348  16.8   58484  AU
US   UK   12   53.30  129    PG

I want to split this file into smaller files based on what’s in column A, which in this example would make two files: one for the 2 lines containing “AU” and one for the 1 line containing “US”. Neither split nor csplit really handles this case and the only option is to use an awk, perl or other script. (It’s also possible to do this in multiple passes with grep.)

After writing similar awk one-liners to do this sort of thing I decided to make it a bit more reusable:

This will produce two files for the example above called “test_split_AU.tsv” and “test_split_US.tsv”. Only the cols argument is required, to say which of the columns to split on; if you want to split on multiple columns, pass a comma-separated list: -vcols=1,2. The output file names are generated from the contents of those columns, so if there are any special characters in there that might mess up a file name, you’re in trouble. The prefix and suffix variables default to nothing but are useful for finding the files later by giving them all a common prefix. The nice thing about this is that all of the usual awk variables are still available, so switching between a csv and a tsv file can be achieved using the FS parameter on the command line.
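For readers who would rather not keep an awk snippet around, the same group-by split is easy to sketch in Python. This version (the function name and defaults are my own, loosely mirroring the awk variables) writes one file per group and returns the keys it saw:

```python
import csv
from collections import defaultdict

def split_by(path, cols, prefix='', suffix='.tsv', delimiter='\t'):
    """Split a delimited file into one file per unique combination
    of the values in `cols` (0-based column indices)."""
    groups = defaultdict(list)
    with open(path, newline='') as fh:
        for row in csv.reader(fh, delimiter=delimiter):
            key = '_'.join(row[c] for c in cols)
            groups[key].append(row)
    for key, rows in groups.items():
        # like the awk version, special characters in the key end up
        # in the file name, so beware
        with open(prefix + key + suffix, 'w', newline='') as out:
            csv.writer(out, delimiter=delimiter).writerows(rows)
    return sorted(groups)
```

For the example above, `split_by('test.tsv', [0], prefix='test_split_')` produces test_split_AU.tsv and test_split_US.tsv.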

The Entrez Direct toolkit is great for programmatic access to all of NCBI’s resources. This little snippet below finds all of the RefSeq representative genomes for a given NCBI taxonomy ID, makes a little summary of the genomes downloaded and uses wget to download the GenBank files from the Assembly FTP. Change the initial query in the first call to esearch to change which genomes are downloaded.

I feel like I’m on a life-long quest to make all of my phylogenetic tree figures completely programmatically. The best tool I’ve found for making them is the ete library for the Python programming language. I’ve already figured out how to get trees drawn in the style that I like, but there was still one thing left to do: making organism names italicize correctly.

I work with microorganisms, where the convention is for the genus and species names to be italicized but the strain name to be in regular font. For example: Methanoregula formicica SMSP, DSM 22288. There are exceptions to this as well; if an organism is not in pure culture then the name Candidatus is prepended and the genus and species names remain in regular format. Unfortunately, ete doesn’t have an interface for this kind of name formatting: you can have a name italic, bold or regular, but you can’t mix and match. Happily though, it does contain an interface for dropping down into the lower-level drawing engine (pyQt4), which enables you to do all sorts of custom things.

Below is my solution to making fancy formatted organism names. I think it might be a little hacky but seems to work. The organism name should be in the “name” attribute of the leaf for this to work. (See the example below)

Here are the formatting rules:

If the name has fewer than 2 words: do nothing. This catches internal names or accession numbers that are not valid scientific names.

If the first word of the name is “Candidatus”: abbreviate it to “Ca.” and italicize just that part.

If the first word is an all-caps abbreviation: leave the whole name in regular font.

Otherwise: italicize the first two words (genus and species) and leave anything after them (the strain name) in regular font.
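The drawing code itself is tangled up with pyQt4, but the rules can be separated out. As a rough sketch (my own refactoring, not the code from the post), here is a function that turns a name into (text, italic) segments, each of which can then be rendered with the appropriate font:

```python
def name_segments(raw_name):
    """Split an organism name into (text, italic) segments
    according to the formatting rules above."""
    words = raw_name.split(' ')
    if len(words) < 2:
        # internal name or accession number; leave untouched
        return [(raw_name, False)]
    if words[0] == 'Candidatus':
        # only the abbreviated 'Ca.' is italicised
        return [('Ca.', True), (' ' + ' '.join(words[1:]), False)]
    if words[0].isupper():
        # an all-caps abbreviation, not a scientific name
        return [(raw_name, False)]
    # genus and species are italicised ...
    segments = [(' '.join(words[:2]), True)]
    if len(words) > 2:
        # ... and everything after them is the strain name
        segments.append((' ' + ' '.join(words[2:]), False))
    return segments
```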

So. You’ve got yourself a nice new genome sequence and you want to know what kind of metabolism it has. There is a good chance that you have some idea already — you think it’s a nitrogen fixer or a sulfate reducer etc. — based on other analyses and now it’s time to strengthen your paper with a bit of genomic evidence.

Getting an initial annotation

The vast majority of the genes in the genome are going to be hypothetical proteins; of the rest, a sizable chunk are going to be genes with a general sort of annotation like “ABC transporter” (which says nothing about what it’s transporting), and the rest are going to be metabolic genes that you probably care about. The first thing that you want to do is to throw your genome into one of the automatic annotation pipelines (prokka, rast, img, ncbi, kaas) to get an initial annotation. You can then search through these to look for pathways of interest.

However, don’t just straight out believe what these pipelines say: they will definitely miss some genes and report annotations that aren’t correct. But they are a good start, since in most cases you will only be interested in a portion of the genes.

The most important thing to remember is to be a scientist. Have a sceptical mindset and interrogate your data. You can’t look at every gene (well, you can, but you probably don’t have time), so my rule of thumb is that you should manually check all of the genes in all of the pathways that you are going to write about in your paper.

My Workflow

1. Compare the sequence of your unknown gene against Uniprot.

You can also use the NCBI blast database or IMG or any other database you wish; however, I like Uniprot because it ranks its annotations with an overall confidence score, so you know what kind of experiments have informed the annotation. Uniprot will show some starred proteins that are biochemically verified – this is a good way to figure out if your gene is annotated with the correct function.

Remember that most genomes are automatically annotated, which means that even if all of the top blast hits are annotated as the same thing, it doesn’t mean that it’s correct. Comparing your gene against genes that have been biochemically verified is important.

Also remember that blast is a local aligner, so don’t just look at the e-value; look at the percentage of the query (your gene) and the subject (the gene in the database) that is matched. If you have a very long gene and only part of it matches then you should definitely investigate!

2. Find conserved domains using Interpro

Sometimes (read: often) blast matches will not be conclusive or will return only a general annotation. In these situations it’s best to look at the protein sequence another way. Looking at the domains that make up the protein using Interpro may help figure out what it does. This is also a good way to compare your protein to a biochemically analysed protein: if your gene has extra domains, different domains, or fewer domains then it may well be that the blast annotation was incorrect.

3. Look at surrounding genes

A lot of enzymes are encoded by multiple genes which all exist in one operon. If you want to say that your bug has some enzyme, you should make sure that it contains all of the subunits for that enzyme. This can be an essential step for enzymes that evolved from a common ancestor, whose individual subunits can be difficult to distinguish. A good example here is NADH dehydrogenase and the evolutionarily related hydrogenases, which can easily be distinguished by their operon structure, but whose individual subunits often get mis-annotated.

4. Make an alignment (optional)

The last thing you might need to do is make an alignment to see if your putative enzyme has all (or some) of the same conserved residues as other biochemically characterised proteins. If the alignment looks terrible then perhaps you’ve got something novel.

ete3 has support for phyloxml, which I use with the Archaeopteryx tree viewer for a lot of my day-to-day phylogenetics visualisation. My main reason for using phyloxml is convenience, as I have a script that will easily add the proper organism name onto the tree, and I think that Archaeopteryx is a really good basic tree viewer. I wanted to draw a tree from phyloxml in ete using my own style and have the proper organism name rendered. In my phyloxml file this is coded in as the scientific name for each leaf (see below for a phyloxml snippet), so all I needed to do was make this the node name when rendering the tree.

I found that the interface for phyloxml is not the same as for newick formatted trees, and unfortunately the documentation for phyloxml in ete3 is a bit lacking, as there isn’t a complete listing of methods for each class. After much messing around, looking at the source code of ete3 and examining Python objects using the builtin dir function, I was able to get what I wanted.

It turns out that for each node/leaf I needed to access the phyloxml_clade attribute, which has an attribute taxonomy, which implements an iterable interface (I think it’s probably a list), which I could then use to access the scientific name and set the name of the leaf for printing. It’s a little convoluted, but easy when you know how.

from ete3 import Phyloxml
project = Phyloxml()
# parse the phyloxml file (the file name here is an example)
project.build_from_file("my_tree.xml")
# iterate through the trees in the phyloxml file
for tree in project.get_phylogeny():
    # go through the nodes in the tree
    for node in tree:
        # assign the node name from the data in the phyloxml file
        node.name = node.phyloxml_clade.taxonomy[0].get_scientific_name()
    tree.show()

I recently had to clean up some data from the supplementary material of Pereira et al. 2011, which is a very nice table of manually annotated genes in sulfate reducing bacteria. The only problem is that the table is designed for maximum human readability, which made it a real pain to parse out the data. I decided to use R and the “hadleyverse” packages to clean up the table and make things work better for downstream analyses. This isn’t part of my normal workflow (I’m more of a Python guy), but after doing this analysis in R I’d have to say that I’m a convert.

Getting the data

Download the supplementary material from the paper linked above if you want to play along at home. The data is a nicely formatted Excel workbook containing eight spreadsheets with the locus identifiers for a number of genes important in sulfate reducing prokaryotes. While this data is nice and visually appealing, it is not tidy, and it’s difficult to get the information I want out of it.

I want the locus identifiers for the genes of interest so that I can download them from NCBI and use them as a blast database.

Cleaning the data

Some things are just easier to do in Excel before tidying the data in R. Here is what I did:

Removed the empty columns and rows at the beginning. This actually isn’t difficult to do in R, but doing it in Excel makes inputting the data more painless because then R will pick up the column names.

Removed the rows that contain just the taxonomic information.

For some of the sheets (‘Hases’, for example) I removed rows at the beginning that gave hierarchy to the columns. These are mostly unnecessary and make it difficult to parse the Excel sheet, as readxl does not currently handle merged cells and because the boundaries of this hierarchy are coded visually using cell borders in Excel.

For some reason there were single quotation marks in the Archaeoglobus fulgidus DsrK locus identifiers, which I removed.

Open up an R session and load the following libraries (assuming that you already have them installed)

Let’s look at what our table looks like (note the ‘organism’ column is not shown for brevity):

knitr::kable(d[[1]][,-1])

|locus        |SAT   |AprA  |QmoA |DsrA  |DsrC  |H-Ppi |FdxI                                                  |FdxII       |
|:------------|:-----|:-----|:----|:-----|:-----|:-----|:-----------------------------------------------------|:-----------|
|AF           |1667  |1670  |0663 |0423  |2228  |NA    |00427; 1010; 0355; 0923; 2142; 0166; 1700; 0156; 0464 |0167        |
|Arcpr_       |1264  |1261  |1260 |0139  |1726  |NA    |0142; 0625; 0483; 0712; 1058                          |NA          |
|Cmaq_        |0274  |0273  |NA   |0853  |0856  |0949  |0549; 1711                                            |NA          |
|DaesDRAFT_   |2031  |2029  |2028 |2438  |0796  |NA    |1729                                                  |0903        |
|Dde_         |2265  |1110  |1111 |0526  |0762  |NA    |3775                                                  |0286        |
|Ddes_        |0454  |2129  |2127 |2275  |1917  |NA    |889                                                   |NA          |
|DMR_         |39470 |05400 |05410|03600 |15890 |NA    |39570; 18760                                          |13970       |
|DESPIG_      |02241 |02773 |02771|NA    |02353 |NA    |00991                                                 |NA          |
|Desal_       |0228  |0230  |0231 |0787  |0984  |NA    |1299                                                  |0241; 2850  |
|DFW101DRAFT_ |0832  |1162  |1163 |3451  |2958  |NA    |0847                                                  |0729        |
|DVU          |1295  |0847  |0848 |0402  |2776  |NA    |3276                                                  |NA          |
|Dbac_        |3196  |3198  |3199 |0279  |2958  |NA    |0275                                                  |2977        |
|Dalk_        |2445  |1569  |1568 |4301  |4140  |3373  |4380; 2230; 2714                                      |2374        |
|HRM2_        |31180 |04510 |04500|42400 |22050 |NA    |26720; 10680; 01580; 39570                            |40690       |
|Dole_        |1002  |0998  |0999 |2669  |0463  |2820  |1168                                                  |2655        |
|Dret_        |1968  |1966  |1965 |0244  |1739  |NA    |0240                                                  |0154; 0169  |
|DthioDRAFT_  |1410  |1407  |1406 |2272  |2675  |NA    |2268                                                  |NA          |
|DP           |1472  |1105  |1106 |0797  |0997  |NA    |2755; 0775                                            |1865        |
|DaAHT2_      |0293  |1471  |1470 |0296  |2041  |NA    |1668                                                  |2532; 2287  |
|Sfum_        |1046  |1048  |1287 |4042  |4045  |3037  |4046                                                  |2933; 3217  |
|Dtox_        |3579  |3577  |3576 |0079  |0077  |3931  |0074; 0532; 1221; 1608; 3208                          |1637        |
|Dred_        |0635  |0637  |0638 |3187  |3197  |2985  |3200; 2937; 0466                                      |2203        |
|Daud_        |1076  |NA    |1884 |2201  |2190  |0308  |2193; 1963                                            |1080        |
|Adeg_        |1712  |1080  |1079 |2094  |0035  |NA    |0032                                                  |1625        |
|THEYE_       |A1835 |A1832 |A1831|A1994 |A0003 |NA    |A0789                                                 |NA          |

In the data, the columns for each gene are really values, not variables; they should be converted into individual rows. To do this, use the gather function from tidyr. Here I specify the names of the new columns: gene.identifier, which will contain the name of the gene, and locus.identifier, which will contain the information for that gene. I’m also setting na.rm, which will drop genes that were not found in the organism. After the gather function is applied, all of the data frames in the list have the same columns, which means that they can all be concatenated into one big data frame. To do this I’m using dplyr::bind_rows.

The other untidy aspect of the dataset is that there are multiple locus identifiers for some of the genes (presumably because there are multiple copies in the genome). We next need to split them out into separate observations (rows). The str_split function from stringr will split a string based on a regular expression and return a list of values. I then pass this to the unnest function, which will “flatten” the list into multiple rows as required.

Now for the final clean-up. The original data separated the locus prefix and the locus identifier; now I want to combine them back together using a couple of calls to the mutate function, which modifies a column. First, in the previous command I converted the locus.identifier column to characters, which had the unwanted side effect of introducing decimal places into the strings. Passing the locus.identifier column to the sub function will remove the unwanted text. The next mutate command combines the locus and locus.identifier columns into one, and finally I choose which columns I want in the final data frame using the select function.
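Since the R commands are spread across the prose above, here is the whole gather → split → unnest → combine sequence condensed into one runnable sketch, written with Python’s pandas rather than the tidyverse (the two-row data frame and its column values are illustrative only):

```python
import pandas as pd

# a tiny stand-in for one of the spreadsheets
d = pd.DataFrame({
    'organism': ['Archaeoglobus fulgidus', 'Desulfovibrio vulgaris'],
    'locus': ['AF', 'DVU'],
    'SAT': ['1667', '1295'],
    'FdxI': ['00427; 1010', '3276'],
})

# gather: one row per (organism, gene), dropping missing genes
tidy = d.melt(id_vars=['organism', 'locus'],
              var_name='gene.identifier',
              value_name='locus.identifier').dropna()

# split multi-copy genes on '; ' and flatten into separate rows
tidy['locus.identifier'] = tidy['locus.identifier'].str.split('; ')
tidy = tidy.explode('locus.identifier')

# combine the locus prefix and identifier back into one column
tidy['locus.identifier'] = tidy['locus'] + tidy['locus.identifier']
tidy = tidy[['organism', 'gene.identifier', 'locus.identifier']]
```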

I do a lot of work in phylogenetics, which means that for just about every paper I’ve written I’ve had at least one figure that is a phylogenetic tree. Making pretty looking trees for a publication is tedious and my previous workflow involved using ARB for actually drawing the tree and producing an initial file in postscript, and then loading that into Adobe Illustrator to make everything beautiful.

The problem with this is that it is not an automated process so any time I need to change the tree I need to redo all of the ‘beautifying’ manually. Recently I coded up an alternative approach using the excellent ete package for python to draw trees exactly how I want.

One of the nicest things about drawing trees in ARB is that you can collapse clades into wedges. Unfortunately, while ete does allow you to collapse clades, it doesn’t provide a way to show the collapsed node as a wedge; the only options are a square or a circle. But there is the option to create a custom face, which is exactly what I did. Below is a function to create ARB-style wedges:

from PyQt4 import QtCore
from PyQt4.QtGui import (QGraphicsRectItem, QGraphicsPolygonItem,
                         QGraphicsSimpleTextItem, QPolygonF, QPen,
                         QBrush, QColor)

def polygon_name_face(node, width, height, width_percent):
    """create a wedge shaped face in the style of ARB
    Args:
        node (TreeNode): the collapsed node
        width (int): size in pixels for the width of the wedge
        height (int): size in pixels for the height of the wedge
        width_percent (float): change the angle of the point of the wedge.
            This must be a number between 0 and 1
    Returns:
        QGraphicsRectItem: The Qt graphics item of the polygon
    """
    points = [
        (0.0, 0.0),                       # top left point
        (width, 0.0),                     # top right point
        (width * width_percent, height),  # bottom right point
        (0.0, height),                    # bottom left point
        (0.0, 0.0)                        # back to the beginning
    ]
    shape = QPolygonF()
    for i in points:
        shape << QtCore.QPointF(*i)
    # Create a main master item that will contain all other elements;
    # items can be standard QGraphicsItems
    masterItem = QGraphicsRectItem(0, 0, width, height)
    # Keep a link within the item to access node info
    masterItem.node = node
    # I don't want a border around the masterItem
    masterItem.setPen(QPen(QtCore.Qt.NoPen))
    polygon = QGraphicsPolygonItem(shape, masterItem)
    # Make the wedge grey in color
    polygon.setBrush(QBrush(QColor('#D3D3D3')))
    # Print the name of the node
    text = QGraphicsSimpleTextItem(node.name)
    text.setParentItem(polygon)
    # Center text according to masterItem size
    tw = text.boundingRect().width()
    th = text.boundingRect().height()
    center = masterItem.boundingRect().center()
    text.setPos(center.x() - tw/2, center.y() - th/2)
    polygon.setPos(0, masterItem.boundingRect().y()/1.5)
    return masterItem

And then to actually use it in a script, set up the tree style. I like to mark internal nodes with bootstrap support above 70% with a grey circle and above 90% with a black circle. Below is the layout function that I use to add in the groups.

def master_ly(node):
    style = NodeStyle()
    style['shape'] = 'circle'
    if node.support >= .90:
        style['size'] = 5
        style['fgcolor'] = 'black'
    elif node.support >= .70:
        style['size'] = 5
        style['fgcolor'] = 'grey'
    else:
        style['size'] = 0
    # grouping_nodes is a set, defined elsewhere, of nodes to collapse
    if node in grouping_nodes:
        style['draw_descendants'] = False
        # Create an ItemFace. The first argument must be the pointer to
        # the constructor function that returns a QGraphicsItem; it
        # will be used to draw the Face. The remaining arguments are
        # arbitrary and are forwarded to the constructor function –
        # in this case the width, height, and width_percent for
        # the wedge.
        F = faces.DynamicItemFace(polygon_name_face, 60, 30, 0.25)
        faces.add_face_to_node(F, node, 0)
    node.set_style(style)

There are improvements to be made with the way I’m drawing the wedge. First, there isn’t any border between the top of the wedge and the next leaf — you can see the name “aaaaaaaaad” is a bit cramped. Second, ARB has a nice feature which changes the wedge dimensions based on the number of grouped leaves which I haven’t yet implemented.