at OpenHelix

Tag Archives: conversion

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

I did this tip over two years ago and am revisiting it today with a bit more information, hosted on SciVee (so it’s shareable) and brought up to date. I’ve been updating our Galaxy tutorial, and that tip has been one of the most tweeted, shared and visited tips we’ve done (not the most, just one of them), so I thought now would be a good time to revisit it. This tip goes through the Galaxy tool to “liftover” genome coordinates between assemblies and genomes. You might also wish to visit a few other tools and places where you can convert genome coordinates between genome assemblies, such as the UCSC Genome Browser LiftOver utility (access that link from the “utilities” menu on the front page; it uses chain conversion files), FlyBase (for the D. melanogaster genome), Maker (an annotation tool from GMOD that includes an assembly conversion tool), the Ensembl Assembly Converter, and I’m sure there are others. Have any to report? As the comment below informs us, there is also NCBI’s new remapping service, which maps between assemblies (within species) and between RefSeq sequences and assemblies.

A word about methodology: as mentioned in the first paragraph, UCSC Genome Browser’s liftover tool uses chain conversion files. I am unsure of the methodology used at Galaxy, though I’m assuming it’s similar. I have an inquiry in and will update this page when I know the answer.

Indeed it is. I received a nice answer from the Galaxy support team:

The liftOver program and the underlying mapping file comes from UCSC and is based on their “Chain/Net” comparative genome algorithms.

The data represents the syntenic genome regions for the two reference genomes involved. Genes with similar annotation, between closely related species, found within these syntenic regions have a good likelihood of being orthologs, but gene function is not considered by the algorithm and would have to be evaluated independently to confirm orthology.
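To make the chain idea a bit more concrete, here is a toy sketch in Python of how a coordinate can be carried from one assembly to another through a set of ungapped aligned blocks. This is only an illustration under simplifying assumptions: the `Block` type, the `lift_position` function and the example blocks are all hypothetical, and the sketch ignores strand, multiple chromosomes and chain scoring. It is not UCSC's file format or Galaxy's implementation.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Block(NamedTuple):
    """One ungapped aligned block between an old and a new assembly (toy model)."""
    old_start: int  # 0-based start of the block on the old assembly
    new_start: int  # matching start on the new assembly
    size: int       # length of the ungapped block


def lift_position(blocks: list[Block], pos: int) -> Optional[int]:
    """Map a 0-based position from the old assembly to the new one.

    `blocks` must be sorted by old_start and must not overlap.
    Returns None when the position falls in a gap between blocks.
    """
    starts = [b.old_start for b in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    block = blocks[i]
    offset = pos - block.old_start
    if offset >= block.size:
        return None  # past the end of the block: unaligned region
    return block.new_start + offset


# Hypothetical blocks for a single chromosome (made-up numbers)
BLOCKS = [Block(0, 0, 1000), Block(1200, 1000, 500), Block(2000, 1800, 300)]

print(lift_position(BLOCKS, 1250))  # 1050
print(lift_position(BLOCKS, 1100))  # None (falls in a gap)
```

Positions that land in a gap come back as `None`, which is roughly the situation where a real conversion tool reports a region as unmapped.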

From my HUM-MOLGEN mailing list newsletter today I spotted an interesting comparison. We get a lot of questions about how to convert IDs or how to best move from one data source to another. We’ve done some explorations of that in the past (MatchMiner is one example). This is not the sort of sexy thing that gets published in the literature in general, but a really nice thing for the informal literature system of the newsletter/blogosphere/etc world.

Diego Forero, an editor of HUM-MOLGEN, has assembled a comparison of several tools: Babelomics, Clone/ID converter, DAVID, g:Profiler and MatchMiner. He started with a list of 100 Ensembl IDs and tested them on each of the tools to get the HUGO official nomenclature. (He does note that plenty of other conversions are also possible, among Ensembl, HGNC, EntrezGene, RefSeq and UniGene, but Ensembl to HGNC was the test performed.) There was a second test on Affymetrix IDs to HUGO symbols too. The references for the tools are also provided.

The data is available on Scribd and you can download it yourself. You can access the IDs and test other tools too. Here is a sample of the outcome:

Babelomics did the best in this test. Now, I have a separate question: are they right? Just because a program provides an ID doesn’t mean it gave the right one. This is a problem I’ve seen over and over in this field. In my experience most stuff needs to be checked by humans. I remember one meeting where someone was describing a new tool that represented splice variants. We were all impressed, it sounded great, and then I raised my hand to ask: “But are they right?” and the tool developer said, “I don’t know.”

Still, it is a useful exercise to compare these tools. And it is a great list to bookmark. But keep that in mind.
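If you want to try this kind of comparison yourself, one rough way to flag the IDs that most need a human look is to check where the tools disagree with each other. The sketch below assumes you have already collected each tool's output into a dictionary; the tool names, input IDs and symbols are placeholders made up for illustration, not data from the HUM-MOLGEN comparison. Agreement between tools still doesn't prove the answer is right, of course.

```python
def compare_tools(results: dict[str, dict[str, str]], ids: list[str]):
    """Summarize ID-conversion results from several tools.

    results maps tool name -> {input ID -> converted symbol}.
    Returns (coverage, conflicts): the fraction of input IDs each tool
    converted, and the input IDs where tools returned different symbols.
    """
    coverage = {
        tool: sum(1 for i in ids if mapping.get(i)) / len(ids)
        for tool, mapping in results.items()
    }
    conflicts = {}
    for i in ids:
        answers = {
            tool: mapping[i] for tool, mapping in results.items() if mapping.get(i)
        }
        if len(set(answers.values())) > 1:
            conflicts[i] = answers  # candidates for manual checking
    return coverage, conflicts


# Placeholder data: hypothetical tools, IDs and symbols
ids = ["ID_A", "ID_B", "ID_C", "ID_D"]
results = {
    "tool_1": {"ID_A": "SYM1", "ID_B": "SYM2", "ID_C": "SYM3"},
    "tool_2": {"ID_A": "SYM1", "ID_B": "SYM9"},
}

coverage, conflicts = compare_tools(results, ids)
print(coverage)   # {'tool_1': 0.75, 'tool_2': 0.5}
print(conflicts)  # {'ID_B': {'tool_1': 'SYM2', 'tool_2': 'SYM9'}}
```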

There are a lot of sequence file formats. FASTA comes to mind. GenBank is another. Clustal, EMBL, GCG, and the list goes on. I’d say FASTA is one of the most commonly used or accepted, but I could be wrong. Still, many databases and software programs have their own format that they accept and generate, and some of them will accept or generate files in several formats. It can get a bit confusing. So, you’ve got a sequence file in PAUP format but you need it in FASTA? Don’t even know what format it is? Or what the formats look like or what information they contain?
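For the "don't even know what format it is" case, a first guess can often be made from the first non-blank line of the file. The function below is a minimal sketch of that idea, relying on the usual opening markers: `>` for FASTA, `LOCUS` for GenBank, `CLUSTAL` for Clustal alignments, `ID` for EMBL, and `#NEXUS` for the NEXUS files that PAUP typically works with. The function name is my own, it does not cover GCG or every variant, and it is no substitute for a real parser or conversion tool.

```python
def guess_sequence_format(text: str) -> str:
    """Guess a sequence file format from its first non-blank line.

    Returns one of 'fasta', 'genbank', 'clustal', 'nexus', 'embl',
    or 'unknown'. This is a heuristic only.
    """
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines
        if stripped.startswith(">"):
            return "fasta"
        if stripped.startswith("LOCUS"):
            return "genbank"
        if stripped.upper().startswith("CLUSTAL"):
            return "clustal"
        if stripped.upper().startswith("#NEXUS"):
            return "nexus"
        if stripped.startswith("ID "):
            return "embl"
        return "unknown"  # first real line matched none of the markers
    return "unknown"  # empty input


print(guess_sequence_format(">seq1 example\nACGTACGT"))  # fasta
print(guess_sequence_format("#NEXUS\nbegin data;"))       # nexus
```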

MatchMiner translates one type of gene ID into another type – essentially the genetic equivalent of Swahili-to-German translating software. In this tip I’ll show you how to do a translation on a list of genes, or a ‘Batch Lookup’.