Wednesday, 26 November 2014

In the pipeline – streamlined InterPro production

You may have noticed that InterPro has had fewer releases than usual this year. It is not that we haven’t been working as hard as ever, integrating member database signatures into InterPro entries and adding Gene Ontology terms - we have! But a number of things have been going on behind the scenes that we thought you might like to know about.

Sequence growth

InterPro release 1.0, back in 2000, was built using a version of Swiss-Prot/TrEMBL that contained just over 300 thousand sequences. Our current InterPro release (49.0) is built using over 77 million Swiss-Prot/TrEMBL sequences. That is a massive amount of sequence growth - and even more remarkable is the fact that almost half of these sequences have been added in the last year.

A new InterPro production pipeline

As you might imagine, processing this number of sequences can cause all kinds of problems for computational pipelines that were developed when sequence data volumes were orders of magnitude smaller. To make sure that we can handle the kind of data volume growth we have been seeing - and expect to see in the future - we have been busy rebuilding our production pipeline. The new system is built entirely on InterProScan, which, for a variety of complicated historical reasons, the previous version was not. This change helps streamline the production process, removes a number of bottlenecks, and generally makes many things associated with data production a lot less complicated.

Further pipeline developments and a new data centre

To put these changes in place, we have had to focus a lot of our efforts on pipeline development, with knock-on effects on our release schedule. As a consequence, while we have maintained our usual rate of database integrations, these have been squeezed into slightly fewer InterPro releases. And, as a further complication, we have also recently moved all of our data (in the form of hard drives on the back of a truck - no, really!) to a new data centre, as part of EMBL-EBI’s consolidation of its Web infrastructure. This has delayed our release schedule further still. However, we believe that we are now much better placed to calculate and provide match data for our users. We think we are also better prepared for future data production challenges, as the number of protein sequences reaches 100 million and beyond.