at OpenHelix

Tag Archives: multiple sequence alignment

This week’s highlighted question is from the Bioinformatics subreddit. And I’m using it because it made me laugh. I dare someone to name their alignment tool “Fancy Pants”. That said, it did provide links to a number of different tools that people might find useful, depending on the kind of thing you might want to do to make your stuff look niiiiiiiiice.

Hey all- my colleague and I are looking for an alignment tool (software, website, etc) that makes beautiful looking alignments for publications. ClustalW is what everyone keeps suggesting and its nice but its not niiiiiiiiice. If anyone has any favorites, please fill me in, Thanks!

Some of the tools are things we’ve talked about before. Jalview and AliView, for example. But there are some others too. Have a look at the chatter and check ‘em out yourself if you need some alignments. Or, if you have other tools, suggest them.

{By the way, I am familiar with the tool called FancyGene. But it’s different.}

This week’s highlighted discussion is the problem of pan-genome graphs, which are ways to represent the variation we find in genomes instead of a linear reference sequence view. I was really struggling with these concepts until I heard a talk at the #TRICON meeting recently. David Haussler had some really helpful visuals. I don’t have an audio link to the talk I heard, but I found a similar one. I think it’s a concept people need to consider, because these are going to be coming to us in the near future.

Biostars is a site for asking, answering and discussing bioinformatics questions and issues. We are members of the community and find it very useful. Often questions and answers arise at Biostars that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those items or discussions here in this thread. You can ask questions in this thread, or you can always join in at Biostars.

There’s much talk on my twittersphere about this piece in MIT Technology Review: Rebooting the Human Genome. It talks about how the current reference genome concept misses so much of the human variation that we need to capture as we sequence more and more people’s personal genome data.

But there’s also some confusion. I don’t think the concept of the graphs was really well described in there. In an earlier thread here we talked about it a little, but I wasn’t able to find the talk I’d heard about this, which had been helpful to me. But I found a similar one, and maybe this will help people to get the idea of the graphs instead of just the current linear view we have of the reference genome.

You can watch the whole thing, of course. But the part about the graph ideas comes into this talk at around 52 minutes.

So the idea is that we have to be able to account for the “bubbles” that don’t match a linear reference string. Some bubbles will be alterations, some insertions, some deletions, some inversions, but we can capture all of them with graph representations that go beyond our current tools. Every one of those alternative paths is valid, and we need to know and see this variation better.
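To make the “bubble” idea a little more concrete, here is a minimal sketch in Python. It is not how any of the real pan-genome tools store their graphs; the class name, node numbers, and sequences are all made up for illustration. Each node carries a short piece of sequence, and a bubble is simply a place where two paths split and later rejoin. Inversions are left out, since they need a graph that can also be read on the reverse strand.

```python
from collections import defaultdict


class VariationGraph:
    """Toy sequence graph (hypothetical, for illustration only).

    Nodes carry short DNA strings; edges say which node may follow which.
    """

    def __init__(self):
        self.seq = {}                        # node id -> sequence label
        self.out_edges = defaultdict(list)   # node id -> successor node ids

    def add_node(self, node_id, sequence):
        self.seq[node_id] = sequence

    def add_edge(self, src, dst):
        self.out_edges[src].append(dst)

    def paths(self, start, end):
        """Yield every path from start to end. Assumes the graph has no cycles."""
        stack = [[start]]
        while stack:
            path = stack.pop()
            last = path[-1]
            if last == end:
                yield path
                continue
            for nxt in self.out_edges[last]:
                stack.append(path + [nxt])

    def spell(self, path):
        """Concatenate the node sequences along a path into one haplotype string."""
        return "".join(self.seq[node] for node in path)


def build_example():
    """Build a graph with one substitution bubble and one insertion bubble."""
    g = VariationGraph()
    g.add_node(1, "ACGT")  # shared left flank
    g.add_node(2, "A")     # one allele of the substitution bubble
    g.add_node(3, "G")     # the other allele of the substitution bubble
    g.add_node(4, "TTC")   # shared middle segment
    g.add_node(5, "GGA")   # insertion carried by only some genomes
    g.add_node(6, "CA")    # shared right flank
    for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]:
        g.add_edge(src, dst)
    return g


if __name__ == "__main__":
    graph = build_example()
    for haplotype in sorted(graph.spell(p) for p in graph.paths(1, 6)):
        print(haplotype)
```

Walking the graph from node 1 to node 6 spells out four different sequences, and none of them is more “correct” than the others. A single linear reference would have to pick just one.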

Anyway, I’m posting because I think it’s important to be aware of. And I think that even researchers in the field aren’t that familiar with the ideas yet.

This paper was also helpful to me to understand the concepts, but unfortunately is not open access: Building a pan-genome reference for a population. doi: 10.1089/cmb.2014.0146 http://www.ncbi.nlm.nih.gov/pubmed/25565268

If anyone else has good introductions to the representations of these variant graph concepts I’d like to see them.

The multiple sequence alignment editing question recently on our What’s the Answer? feature was popular. We have covered MSA editors in the past, and we include a bit on Jalview in our Clustal tutorial, but I hadn’t revisited them lately. In preparation for that post I specifically looked over at the Jalview site, and I realized that they have recently provided a number of training videos to help people use their tools. So this week’s tip of the week will highlight them.

At the Jalview site, they give this brief description of the features:

Jalview is a free program for multiple sequence alignment editing, visualisation and analysis. Use it to view and edit sequence alignments, analyse them with phylogenetic trees and principal components analysis (PCA) plots and explore molecular structures and annotation.

On the Jalview online training YouTube channel, they have a number of videos. Some are general overviews, some cover specific tasks. For a general overview of what it does, this intro video will help you to decide if it’s a tool that would help you:

If you are ready to try it out, there are some handy tips in this video with more details about actually using the features of the software. It covers basic navigation, understanding the interface layout, working on editing, and good tips for accomplishing things efficiently.

For more of the philosophy and foundations of Jalview, check out their paper (linked below). And check out their other videos to go further.

As last year, we were impressed with the students (there were 117 total, with about a 50/50 gender ratio). English is the 3rd or 4th language for most of them; Moroccan Arabic, French, or various African languages are their languages of choice. Yet they were attentive and asked very perceptive and fascinating questions. They were also very enthusiastic learners. It was a delight to teach them.

The workshop students

We’d like to thank Mohammed Bourdi at NIH, who spent large amounts of time and financial resources to organize this (and last year’s) workshop. We hope to repeat and expand these for next year and perhaps years to come. We will be looking for sponsors.

Several questions were asked at the workshop. We’d like to reiterate the answers here and seek some answers from our readers:

* Another question concerned finding a ‘dynamic programming’ (optimal solution) multiple sequence alignment tool as opposed to a heuristic one. The issue is the size of the search space for a dynamic programming solution: the table that has to be filled grows exponentially with the number of sequences, so an exact solution is too computationally intensive for more than a handful of sequences. This slide set might help with the understanding, particularly slides 1-5 and 17-22. That said, the student might want to check out MSAProps and this list at Wikipedia.

Do our readers have any other guidance on this?
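For anyone who wants to see why the exact approach does not scale, here is a small Python sketch. The first function is the classic pairwise dynamic programming score (Needleman-Wunsch) with scoring values picked arbitrarily for illustration. The second, a helper invented for this post, just counts how many table cells the same idea would need if it were stretched to N sequences at once. Neither is taken from any of the tools mentioned here.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of two sequences (Needleman-Wunsch).

    The scoring values are illustrative defaults, not a recommended scheme.
    Only two rows of the table are kept, since just the score is returned.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


def exact_msa_cells(n_sequences, length):
    """Cells in the n-dimensional table an exact DP alignment would fill,
    assuming every sequence has the same length."""
    return (length + 1) ** n_sequences


if __name__ == "__main__":
    print("pairwise score:", nw_score("GATTACA", "GATCACA"))
    for n in (2, 3, 5, 10):
        print(n, "sequences of length 100:", exact_msa_cells(n, 100), "cells")
```

Two sequences of length 100 need about ten thousand cells. Ten sequences of the same length need on the order of 10^20, and each cell has to look back at 2^N - 1 neighbors. That is why the popular tools fall back on heuristics such as progressive alignment.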

Teaching moment

* Another student asked if we know how to find DC-area internships in biological sciences. Another student (mathematician from Mali) was looking for something in the US in bioinformatics. Any ideas of programs to bring African biology students to the US or Canada?

If our Moroccan students (or anyone else) have any additional questions, please feel free to ask them here!

And a side note. Last year I had all of 3 hours to tour Fes. This year I took advantage of my trip. Mary and I spent a few days in Fes and Marrakech. My family joined us in Marrakech, and later my family and I toured for 8 days, visiting the Atlas mountains, the Sahara and Fes. Needless to say, it was a trip of a lifetime. Morocco is a fascinating and beautiful place. I look forward to visiting again.

BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.

I want to track the evolution of several domains, and for doing so, I need to align and cluster 1000′s of sequences. is it possible? and what is the best software to use for that? Eventually I want to understand which is the most “basal” sequence that might lead me to the most ancient protein containing this sequence.

But there are a couple of other options as well, as with most bioinformatics solutions! This includes a hot-off-the-press lead on the new Clustal version (Clustal Omega). Check out the others over there.

I have a vague memory of reading about COBALT a while back, but at the time it was an executable file to download and I think I put it away as “to do.” Well, a couple days ago I was over at the NCBI BLAST site for something (tip of the week?), and noticed there was a “new” flash for COBALT. So, COBALT is now integrated as a web-tool on the NCBI site. The short description of what COBALT is, from the site:

COBALT is a multiple sequence alignment tool that finds a collection of pairwise constraints derived from conserved domain database, protein motif database, and sequence similarity, using RPS-BLAST, BLASTP, and PHI-BLAST.
Pairwise constraints are then incorporated into a progressive multiple alignment.

I haven’t tried it out yet or compared it to other multiple sequence alignment tools, but I thought I’d point it out to those who haven’t yet noticed it.