Many of you will be aware of the proposed web blackout in response to the Stop Online Piracy Act, which is currently going through the U.S. House of Representatives (you can read the BBC’s explanation of the Act here). If this Act is enforced, it will have far-reaching consequences for the overall freedom of the internet. Editors of the English Wikipedia have taken the decision to close the English Wikipedia for 24 hours, starting at 0500 hrs on Wednesday 18th January. To respect this protest, we will also be making our Wikipedia content unavailable during this time.

You’ll still be able to access all the non-Wikipedia content – that is, all the covariance models and HMMs describing families, domain graphics, full and seed alignments, as well as our species trees.

As you have surely noticed, the highly anticipated new Pfam paper is out as part of the 2012 NAR database issue! We were delighted to be listed as a featured article. The paper covers the new release 26.0 (more on this from Rob soon) and presents some novel analysis that may be of interest to Pfam addicts like you. We discuss our use of family-specific bit score gathering thresholds (GAs) quite extensively, hoping to bring clarity to an issue that seems to have been a source of confusion in the past (a.k.a. stop sending us tickets asking what GAs are and how to use them! :-)). We also extend and update the analysis of DUF families that was presented in a previous publication, hoping to push more people into the de-DUF a DUF game. So, enjoy reading the paper and send us comments and suggestions. Your support and advice are, as always, invaluable to us!
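For anyone who wants the idea behind GAs in a few lines, here is a minimal Python sketch: a hit is kept only if its bit score reaches the gathering threshold curated for that particular family. The family names, scores, thresholds and the function name `significant_hits` are all made up for illustration, and the sketch uses a single threshold per family for simplicity, so treat it as a cartoon rather than the actual Pfam procedure.

```python
# Hypothetical per-family gathering thresholds (bit scores).
# Real GAs are curated per family; these values are invented.
GATHERING_THRESHOLDS = {
    "FAM_A": 25.0,
    "FAM_B": 20.5,
}


def significant_hits(hits, thresholds):
    """Keep hits whose bit score meets the threshold of their own family.

    hits: iterable of (family_id, bit_score) pairs.
    thresholds: dict mapping family_id to its gathering threshold.
    Hits to families without a known threshold are dropped.
    """
    return [
        (family, score)
        for family, score in hits
        if family in thresholds and score >= thresholds[family]
    ]


# Toy hits: the same score of 22.0 passes for FAM_B but not for FAM_A,
# which is the point of family-specific thresholds.
toy_hits = [("FAM_A", 22.0), ("FAM_B", 22.0), ("FAM_A", 30.1)]
kept = significant_hits(toy_hits, GATHERING_THRESHOLDS)
```

The design choice worth noticing is that there is no single global cutoff: the same bit score can be significant for one family and not for another.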

We are pleased to announce the arrival of the Rfam Track Hub for the popular UCSC Genome Browser. Rfam data has been available in the Ensembl browser for some time, with links back to the Rfam annotation, and now this same functionality is available for the UCSC Genome Browser.

The hub file is available on our ftp site, and by following the instructions at the UCSC Genome Browser Custom Hub page, you can visualise Rfam annotations for the majority of species for which genomes are provided by the UCSC Genome Browser. Clicking on a match will give you exact start and stop positions, as well as links to the Rfam annotation page here at the Sanger. At the moment, bit scores or E-values for a given match aren’t yet available directly through the UCSC Genome Browser, though we’re working on it. Happy browsing!

Rfam types for Genome annotation

Xfam (in the forms of Sarah and Rob) attended the NIH Genome Annotation Workshop last week, and it was a great insight into the trials and tribulations of coming up with common standards that everyone’s happy with. It was also nice to hear that Rfam is being used extensively to annotate ncRNA features. However, there’s been some confusion amongst annotators when converting between Rfam types (such as CD-Box) and the ncRNA_classes required by INSDC under the ncRNA feature key. The ncRNA feature key is intended to describe non-coding RNAs that aren’t ribosomal or transfer RNAs; these use the rRNA and tRNA feature keys respectively.

To use the ncRNA feature key, annotators are required to supply an appropriate ncRNA_class, and this is where confusion arises, as there’s no perfect overlap between the Rfam entry types and the ncRNA classes. To reduce this confusion, here at Rfam we’ve put together a handy translation guide to make it easy to know what ncRNA class you should apply if you are using an Rfam family to annotate a genome. There are also some cases where an INSDC type is more specific than the Rfam type; for example, we don’t have a specific telomerase RNA type, whereas there is an ncRNA_class called telomerase_RNA. Therefore any annotation to RF00025 can use the telomerase_RNA ncRNA_class category. You can find our table of Rfam types and their INSDC equivalents here.

You can also find out all you ever wanted to know about the feature tables used for genome annotation here, and here.
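As a rough illustration of how such a translation guide might be used in an annotation script, here is a minimal Python sketch. Only the RF00025 to telomerase_RNA case comes from the text above; the other table entries, the exact Rfam type strings and the function name `ncrna_class_for` are assumptions for illustration and should be checked against the real table before use.

```python
# Hypothetical excerpt of an Rfam-type -> INSDC ncRNA_class table.
# These entries are illustrative assumptions, not the official guide.
TYPE_TO_CLASS = {
    "Gene; snRNA; snoRNA; CD-box;": "snoRNA",
    "Gene; miRNA;": "miRNA",
    "Gene; ribozyme;": "ribozyme",
}

# Family-level overrides for cases where the INSDC class is more
# specific than the Rfam type. RF00025 is the example from the post.
FAMILY_OVERRIDES = {
    "RF00025": "telomerase_RNA",
}


def ncrna_class_for(rfam_acc, rfam_type):
    """Pick an ncRNA_class for a hit to the given Rfam family.

    A family-specific override wins; otherwise the Rfam type is looked
    up in the table, falling back to the generic class "other".
    """
    if rfam_acc in FAMILY_OVERRIDES:
        return FAMILY_OVERRIDES[rfam_acc]
    return TYPE_TO_CLASS.get(rfam_type, "other")


# Example lookups (RF99999 is a made-up accession).
telomerase = ncrna_class_for("RF00025", "Gene;")
cd_box = ncrna_class_for("RF99999", "Gene; snRNA; snoRNA; CD-box;")
```

Checking the family accession before the type is what lets a more specific INSDC class take precedence over the broader Rfam type.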

Well, it should have been out about 6 months ago, but finally the long-awaited Pfam release 25.0 is here! Release 25.0 contains a total of 12273 families, with 384 new families and 21 families killed since the last release. Pfam 25.0 is based on UniProt release 2010_05. Those of you who follow Pfam closely will be familiar with the fact that the sequence coverage (the number of sequences in Pfamseq containing at least one Pfam match) has hovered at or just below 75%. Despite the addition of only a modest number of new families in this release, the sequence coverage has risen: 76.69% of all proteins in Pfamseq now contain a match to at least one Pfam domain. The residue coverage is 53.86%; that is, 53.86% of all residues in the sequence database fall within Pfam domains.
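To make the two coverage figures concrete, here is a small Python sketch of how they can be computed. The data layout (a length per sequence plus a list of match intervals), the toy values and the function name `coverage` are assumptions for illustration, as is the choice to count overlapping matches only once; this is not the Pfam production pipeline.

```python
def coverage(seq_lengths, matches):
    """Return (sequence coverage %, residue coverage %).

    seq_lengths: dict mapping sequence id to its length in residues.
    matches: dict mapping sequence id to a list of (start, end) match
             coordinates, 1-based and inclusive.
    Residues covered by more than one match are counted once.
    """
    covered_seqs = 0
    covered_residues = 0
    for seq_id in seq_lengths:
        residues = set()
        for start, end in matches.get(seq_id, []):
            residues.update(range(start, end + 1))
        if residues:
            covered_seqs += 1
        covered_residues += len(residues)
    seq_cov = 100.0 * covered_seqs / len(seq_lengths)
    res_cov = 100.0 * covered_residues / sum(seq_lengths.values())
    return seq_cov, res_cov


# Toy database of four sequences; seqD has no match at all.
toy_lengths = {"seqA": 100, "seqB": 200, "seqC": 100, "seqD": 100}
toy_matches = {
    "seqA": [(1, 50)],
    "seqB": [(11, 110), (101, 150)],  # two overlapping matches
    "seqC": [(1, 100)],
}
seq_cov, res_cov = coverage(toy_lengths, toy_matches)
```

In the toy data three of the four sequences have a match, while only the residues inside the matched regions count towards residue coverage, which is why the second figure is always the lower of the two.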

In this publication we discuss the success of the relationship between Wikipedia and Rfam. This includes a fun analysis of the degree of vandalism the RNA pages have received with respect to the number of useful edits. We also discuss the new clans that explicitly link families that share an evolutionary relationship yet are too divergent to be sensibly aligned, the latest “decimal” release and our future plans.

We have been very sad to see a few people leave the group recently. Rob Finn has been the dedicated and hard-working project leader of Pfam for many years. In fact, as a summer student he is credited with preparing most of the families for Pfam 2.0 [1]! We’re expecting to see great things from him at his new post at HHMI’s Janelia Farm. We’ve also seen Jaina Mistry get married and move to another city; fortunately for us, she’s still working part-time for Pfam remotely. Jen Daub, after her whirlwind trip around the world, will also be working part-time on the Rfam project from her luxurious new abode in France.

This means we have a number of opportunities for bright and enthusiastic people. We are looking to recruit a new Project Leader to lead the Pfam group. This is an exciting opportunity for a motivated, enthusiastic and experienced computational biologist, and is an influential position working with a high-profile bioinformatics resource. We anticipate the candidate will lead the next phase of database development that will include community annotation and the incorporation of new developments based on the HMMER3 software. We would expect the successful candidate to have their own research ideas and be able to deliver research outputs with the group.

We are also looking for two Computational Biologists to join the group. The successful candidates will ideally have an MSc in bioinformatics or equivalent experience and a strong background in molecular biology, biochemistry, genetics or similar.

We would also like to take the opportunity to welcome Professor John Burke from the University of Vermont. John is taking a one-year sabbatical with Rfam to learn about all things bioinformatic. He is already an expert on all things to do with ribozymes and RNA structure, so we expect some major improvements in Rfam in these areas.

Last but not least, we have Chris Boursnell, a refugee from the banking world, who is working with us and the fine Recode database to improve our coverage of frame-shift elements.

The annual Xfam consortium meeting was held on the 10-11th May 2010 and we have the photographic evidence to prove it.

We spent the two days listening to talks from everyone about the latest developments. We were particularly interested to hear about new developments in HMMER3 and INFERNAL – fundamental pieces of software that Xfam rely on. Nucleotide-enabled HMMER3 is in development and will be great for Rfam, hopefully replacing the current BLAST pre-filters. We also had updates on how the HMMER software scales using multithreading and/or MPI.

We also had a number of wide-ranging discussions. Erik Sonnhammer unfortunately wasn’t present this time, so the usual discussion on Stockholm alignment format was avoided. However, we had a fulsome discussion of Pfam family nomenclature. It was generally agreed that although there were rules followed for Pfam short names, no one else in the world understood them. So we will endeavour to add a new section to our documentation about it. We also discussed how much information is actually required before a DUF (domain of unknown function) is renamed to something more meaningful.

We were blessed because the Icelandic ash cloud didn’t intervene. But one of our number did leave their passport in a car bound for Oxford, delaying their journey home. We would like to thank all the members of the Pfam and Rfam consortia for coming, and also our other EBI attendees.