How to use common bioinformatic tools to compare two Neandertal sequences

I really want the following paper that was just published in Nature, “Neanderthals in central Asia and Siberia.” The paper seems really interesting. Why? It uses sequence comparisons to extend how far east Neandertals are known to have ventured. From the abstract:

“we determined mitochondrial DNA (mtDNA) sequences from hominid remains found in Uzbekistan and in the Altai region of southern Siberia. Here we show that the DNA sequences from these fossils fall within the European Neanderthal mtDNA variation. Thus, the geographic range of Neanderthals is likely to have extended at least 2,000 km further to the east than commonly assumed.”

If anyone has a copy, please send it my way. I’d appreciate it a lot.

In exchange, I’ll preemptively offer you a tutorial on using public sequence databases to compare Neandertals. This relates to the above paper. Specifically, I’m gonna outline how similar two sequences of mitochondrial DNA, extracted from Neandertals from Spain and Italy, are.

If you’re interested in the intersection of bioinformatics, genomics, and paleoanthropology but haven’t really wondered how to apply these disciplines together, this should whet your appetite. All the tools I will be using are freely accessible to everyone. After this tutorial, you can basically begin to do some of the comparative research that Svante Pääbo et al. did… and you don’t really need any background, other than knowing that DNA is made up of nucleotides and that one can compare two or more sets of nucleotides to trace relationships.

If you’re an instructor, feel free to use or adapt this in one of your classes or lectures. It is a quick and easy way to introduce physical anthropology students to tools many molecular biologists have been using to compare species and genomes. So, what are we waiting for? Fire up a new browser window or tab, and follow along. I’m gonna keep it simple.

Perhaps the best bioinformatic resource out there is the NCBI’s GenBank. Lots of information can be harvested from the free, public databases housed there. Today, we’re gonna focus on the Entrez Nucleotide database, so click the following link to jump on over there: http://www.ncbi.nlm.nih.gov/

At the top of the page you’ll see a text box. Locate where it says “Search” and, from the pull-down menu to the right, select “Nucleotide.” Type or paste “Homo neanderthalensis” into the empty text box. Hit the ‘Go’ button.

You’ll find yourself at a landing page which says there are 1,335 Neandertal nucleotide sequences in GenBank at this time, of which only 9 are core nucleotide records. The remainder are genome survey sequence records, which are way beyond our needs. We’ll just stick with the 9 core nucleotide records for now. So, if you click the 9 core nucleotide link, you should be at this rather foreign summary page.

You ask, “What the hell am I looking at? What does DQ859014 stand for?” You’re looking at GenBank’s records for Neandertal sequences, and things like DQ859014 are accession numbers, a fancy term that basically means GenBank’s Dewey Decimal System. Each sequence that gets submitted to GenBank gets a unique catalogue number called the accession number.

For today’s purposes, I want us to be looking at the top two records, DQ859014 and DQ836132, which are both partial control region sequences of mitochondrial DNA from Neandertals found in Spain and Italy.

If you click on the first, DQ859014, you’ll find yourself at an information sheet with a lot of data displayed. The most important things I look for when I see this page are the title of the publication, the authors, the date, and then the sequence, which is all the way at the bottom. For DQ859014, here is the sequence that was submitted in the most current revision:

These 300 or so A’s, T’s, C’s, and G’s represent the order of nucleotides, or bases, that make up this sequence of Neandertal mitochondrial DNA. But the format in which that sequence is provided to us isn’t too useful. A more versatile and widely used format is FASTA, which many other databases and tools accept.
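To make the difference between the two layouts concrete, here is a rough Python sketch of the conversion. It assumes you paste in only the numbered sequence lines from the bottom of a GenBank record. The function name `origin_to_fasta` is mine, and the bases in the toy input are made up for illustration; they are not the real DQ859014 sequence.

```python
import re
import textwrap


def origin_to_fasta(origin_text, header, width=70):
    """Turn the numbered, space-separated sequence lines of a GenBank
    record into FASTA text (a '>' header line, then plain bases)."""
    # Keep only letters: this drops the position numbers, the spaces
    # between blocks of 10 bases, and the line breaks.
    sequence = re.sub(r"[^A-Za-z]", "", origin_text).upper()
    # Re-wrap the bases into fixed-width lines.
    lines = textwrap.wrap(sequence, width)
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Toy input in GenBank layout (made-up bases, for illustration only).
toy_origin = """
        1 acgtacgtac ggttaaccgg ttaaccggtt
       31 aacc
"""

if __name__ == "__main__":
    print(origin_to_fasta(toy_origin, "toy_record made-up example"))
```

The whole trick is that FASTA keeps a one-line description plus the raw bases and throws away everything else.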

No worries though. Using GenBank, we can easily convert this sequence to the FASTA format by scrolling to the top of the information sheet for DQ859014. Under the Search prompt, at the top, you should see “Display.” Select that pull-down menu and look for FASTA. GenBank will automatically restructure the sequence data into the more concise FASTA format. Here’s what you should be seeing for DQ859014 in FASTA format:

See how the FASTA format cuts out both the spaces between every 10 nucleotides as well as the 1, 61, 121 markers? That’s much easier to work with for what I have planned. But before we jump to the next step, let’s not forget about our other sequence we wanna compare, DQ836132. If you repeat the steps to convert DQ836132 to FASTA, just like we did for DQ859014… you should get this output:

Cool, we now have our two sequences from Spain and Italy to compare. Let’s see how similar these two sets are by using a tool called LALIGN, which compares these two nucleotide sequences.
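A quick aside before we get to LALIGN: if you’d rather script the FASTA download than click through the Display menu, NCBI also offers a web service called E-utilities. Here is a minimal Python sketch, assuming the `efetch` endpoint with the parameter values shown (`db=nuccore`, `rettype=fasta`, `retmode=text`). The helper names `fasta_url` and `fetch_fasta` are my own and not part of any NCBI library.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of NCBI's efetch service (part of E-utilities).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession):
    """Build the efetch URL that asks for one record as FASTA text."""
    params = {
        "db": "nuccore",      # the nucleotide database
        "id": accession,      # e.g. "DQ859014"
        "rettype": "fasta",   # return the record in FASTA format
        "retmode": "text",    # as plain text
    }
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accession):
    """Download the FASTA text for an accession (needs network access)."""
    with urlopen(fasta_url(accession)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URLs; call fetch_fasta() to actually download.
    for acc in ("DQ859014", "DQ836132"):
        print(fasta_url(acc))
```

Pasting either printed URL into a browser should show the same FASTA text as the Display menu, and `fetch_fasta("DQ859014")` does the download from a script. Now, back to LALIGN.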

I didn’t bother to title either query sequence. It is up to you if you’d like to do that. Instead, I just copied the entire FASTA sequence from the Spanish Neandertal and pasted it in the first sequence query box, then did the same for the Italian one but pasted it in the second query box. I then pressed the “Run lalign” button and didn’t futz with any other settings.

The results LALIGN spews out are the best local alignments between the two sequences. In our example, DQ859014 and DQ836132 are 97.0% similar across 303 overlapping nucleotides. That’s a pretty remarkable similarity between two Neandertals from different locations… and especially remarkable since the samples were sequenced by different labs. Now, we haven’t checked that the two sequences cover exactly the same stretch of the mitochondrial genome, but since they align so well over about 300 bases, it is very probable they came from the same region.
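If you’re curious what a local alignment is doing under the hood, here is a bare-bones Python sketch of the classic Smith-Waterman approach, followed by the percent-identity arithmetic. To be clear, this is not LALIGN’s own code: LALIGN reports several non-overlapping alignments and uses its own scoring settings, while the match, mismatch, and gap values below are arbitrary assumptions of mine. The two toy sequences are also made up.

```python
def local_align(a, b, match=5, mismatch=-4, gap=-8):
    """Smith-Waterman local alignment with a simple per-base gap penalty.
    Returns (aligned_a, aligned_b, score) for one best-scoring alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + step,  # align two bases
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), best


def percent_identity(aligned_a, aligned_b):
    """Share of alignment columns where both sequences have the same base."""
    if not aligned_a:
        return 0.0
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


if __name__ == "__main__":
    # Made-up toy sequences, not real Neandertal data.
    top, bottom, best_score = local_align("TTACGTACGTAA", "GGACGTTCGTCC")
    print(top)
    print(bottom)
    print(best_score, percent_identity(top, bottom))
```

The percent figure is just identical columns divided by the length of the aligned stretch, which is the same kind of number behind the 97.0% in 303 nucleotides above.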

In this mini-experiment, we saw how closely related Neandertals from Spain and Italy were, at least in 300 or so base pairs of their mtDNA. We did all that in about 5 minutes, without any prior bioinformatic knowledge. Pretty sweet, right? Anyways, I hope you enjoyed that. I can write up more of these types of bioinformatics and anthropology tutorials, if you’d like me to. Just let me know.