Tuesday, 7 April 2015

What's new in InterPro releases 50.0 and 51.0

Faster InterPro member database processing:

InterPro releases 50.0 and 51.0 brought some important developments from an InterPro production point of view, which we thought would be worth sharing. Release 50.0 incorporated a new version of PIRSF, which has been migrated to use the HMMER3.1b analysis algorithm. This version of HMMER runs approximately one thousand times faster than the version previously used by PIRSF (HMMER2.0), helping to ensure that InterPro can continue to calculate UniProtKB match data in a timely manner. In a related development, InterPro release 51.0 debuted a sequence database pre-filtering heuristic that reduces the time it takes to calculate matches against the HAMAP database. The heuristic is based on HMMER3.0, but the analysis still uses the core HAMAP algorithm, and everything is implemented within the InterProScan software. This again speeds up our protein match generation process and helps to safeguard against future data growth. At the start of 2014, PIRSF and HAMAP were identified as the slowest databases to calculate matches against, but after work from both the database maintainers and the InterPro team, this is no longer the case.
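For readers curious about the general shape of a pre-filtering step like the one described above, here is a minimal sketch in Python. It is not the InterProScan implementation: the function name `prefilter_then_match` and the two callables `cheap_screen` and `full_match` are hypothetical stand-ins for a fast screening step and a slower full analysis, and the toy checks in the usage example have nothing to do with real HMMER or HAMAP scoring.

```python
from typing import Callable, Iterable, List, Tuple


def prefilter_then_match(
    sequences: Iterable[Tuple[str, str]],
    cheap_screen: Callable[[str], bool],
    full_match: Callable[[str], bool],
) -> List[str]:
    """Return the IDs of sequences that pass both the screen and the full analysis.

    `cheap_screen` and `full_match` are hypothetical callables: the first is
    assumed to be fast and permissive, the second slow and authoritative.
    """
    hits = []
    for seq_id, residues in sequences:
        # Skip the expensive analysis for sequences the fast screen rejects.
        if not cheap_screen(residues):
            continue
        # Only surviving candidates reach the slow, authoritative step.
        if full_match(residues):
            hits.append(seq_id)
    return hits


if __name__ == "__main__":
    # Toy data and toy predicates, purely for illustration.
    toy_sequences = [("a", "MKK"), ("b", "AAA"), ("c", "MAG")]
    result = prefilter_then_match(
        toy_sequences,
        cheap_screen=lambda s: s.startswith("M"),
        full_match=lambda s: s.endswith("K"),
    )
    print(result)
```

The time saved depends on how many sequences the screen rejects. The screen also has to be permissive enough that it does not discard sequences the full analysis would have matched.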

A leaner UniProtKB:
At the same time, the number of proteins in UniProtKB has decreased significantly: some 47 million sequences from highly redundant bacterial proteomes have been deleted (for details, see here, described halfway down the page).

Faster and fitter InterPro production:
The majority of these developments have taken place under the hood, so it is unlikely that you will have been aware of our fitter and faster production system. What we hope you will notice, however, are more regular InterPro releases and more frequent member database updates in future, as these and other optimisations come into effect.