UProC - tools for ultra-fast protein sequence classification

With rapidly increasing volumes of biological sequence data, the functional
analysis of new sequences in terms of similarities to known protein families
challenges classical bioinformatics. The ultrafast protein classification
(UProC) toolbox implements a novel algorithm ("Mosaic Matching") for
large-scale sequence analysis and is now available as an open source C
library. UProC is up to three orders of magnitude faster than
profile-based methods and achieved up to 80% higher sensitivity on unassembled
short reads (100 bp) from simulated metagenomes. UProC does not depend on a
multiple alignment of family-specific sequences. Therefore, in addition to the
protein domain classification according to the Pfam database, UProC can, in
principle, also detect KEGG Orthologs.
We provide a precompiled database for KEGG Ortholog classification which we applied to the prediction of functional repertoires from short reads (see below).

In the Downloads section below you will find the links to the corresponding
database files that we have precompiled for import into UProC.

These databases need to be imported using uproc-import. Even though they
are compressed with gzip, you don't have to decompress them manually, as
uproc-import will take care of this. After importing you can delete the
downloaded file if you wish.

If you have problems importing a database, verify that zlib (de-)compression is
available by running uproc-import -V (capital V). If it says zlib: no,
either decompress the database manually or install the zlib library and header
files and recompile UProC.
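
The zlib check can also be scripted. The following is a minimal sketch that assumes only what is stated above, namely that uproc-import -V prints a line containing "zlib: no" when compression support is missing. Both function names are hypothetical helpers and not part of UProC, and nothing is assumed about the wording of the output when zlib is available.

```python
import subprocess


def reports_missing_zlib(version_output: str) -> bool:
    """Return True if any line of the version text contains 'zlib: no'."""
    return any(
        "zlib: no" in line.lower()
        for line in version_output.splitlines()
    )


def uproc_import_version(executable: str = "uproc-import") -> str:
    """Run `<executable> -V` and return everything it printed.

    The executable name is an assumption; pass a full path if
    uproc-import is not on your PATH.
    """
    result = subprocess.run(
        [executable, "-V"], capture_output=True, text=True, check=False
    )
    # The version text may be written to stdout or stderr; keep both.
    return result.stdout + result.stderr
```

Calling reports_missing_zlib(uproc_import_version()) before an import tells you whether you have to decompress the downloaded database by hand.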

Note

To avoid severe performance problems, make sure you have enough
main memory (RAM) to load the whole database. This is usually a bit more
than twice the size of the downloaded file.
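
To judge in advance whether a database will fit, the rule of thumb above can be turned into a small check. This is only a sketch: the factor of 2.2 is our assumed reading of "a bit more than twice", the function names are hypothetical, and total_ram_bytes relies on sysconf names that are available on Linux and similar POSIX systems but not on Windows.

```python
import os


def estimated_ram_needed(download_bytes: int, factor: float = 2.2) -> int:
    """Estimate the RAM needed to load a database from its download size.

    factor=2.2 is an assumption standing in for "a bit more than twice".
    """
    return int(download_bytes * factor)


def fits_in_ram(download_bytes: int, ram_bytes: int, factor: float = 2.2) -> bool:
    """Return True if the estimated requirement does not exceed ram_bytes."""
    return estimated_ram_needed(download_bytes, factor) <= ram_bytes


def total_ram_bytes() -> int:
    """Physical RAM in bytes on systems that expose these sysconf names."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
```

For a downloaded file, fits_in_ram(os.path.getsize(path), total_ram_bytes()) gives a rough answer. It ignores memory already used by other programs, so treat a narrow pass with caution.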

You can find the installation instructions in the README.rst file
contained in the software packages or rendered as HTML here.

Does UProC require additional software or programs for installation?

Whether you have to install additional software or libraries depends very
much on the operating system and the particular installation. For the
pre-compiled Windows binaries (see below) you don't need to install any
additional software. When compiling UProC from scratch on a Linux PC, the
requirements depend on the particular environment. Within a typical
developer environment the configuration scripts should run without
problems. In other cases additional developer tools have to be installed,
which should be easy on most of the common Linux distributions. We have
seen the following examples on an Ubuntu 12.04 LTS distribution (the
installation command is given in brackets):

gcc, make etc. (sudo apt-get install build-essential)

zlib header files (sudo apt-get install zlib1g-dev)

Does UProC run on a Windows PC or Laptop?

Yes, we successfully tested the following options and versions:

Compilation within the Cygwin environment requires installing Cygwin
and possibly several additional developer components within the
environment. A shortcoming of this variant is that you also have
to run the compiled UProC programs within the Cygwin environment.
We therefore recommend trying the second option, namely the
precompiled binaries for Windows (most probably the 64-bit
version), which you can obtain from the UProC homepage.

Using the precompiled binaries (see the Downloads section).
We successfully tested the 64-bit binaries on several machines:

Does UProC run on Mac OS X?

Yes, but we faced problems with slow file access on hard drives, which
severely degrades UProC performance. Currently, the working solution we
have found to provide sufficient speed on Mac OS X requires a solid state
disk (SSD) for storing the database. We expect that a RAM disk might also
be a possible solution. You may test UProC with a conventional hard disk
on OS X, but in our experience it can be extremely slow. In the following
we sketch how we installed UProC on a MacBook with OS X Mavericks, 8 GB
RAM and an SSD.

We have also built two binary packages for Mac OS X, both compiled without
OpenMP, that you might try in order to simplify installation. Just click on
a package file to install it (the binaries are then installed to
/usr/local/bin). If you have an SSD, use the version with mmap; otherwise
install the version without mmap. Afterwards you can use the UProC
programs, for example uproc-import and uproc-dna (see the section on
Windows binaries above).

When do I need the SEG program?

The SEG low complexity filtering program (part of the BLAST suite) is not
needed if you import one of the pre-compiled databases that are available
on the UProC homepage. However, the SEG program is highly recommended if
you want to compile your own protein database. In that case, you have to
provide a multi-FASTA file of labeled protein sequences in which the
protein family label, in general a numerical identifier, is placed in the
FASTA comment line preceding the corresponding amino acid sequence. This
multi-FASTA file should be processed with SEG to create a smaller and
better database for UProC. You can run SEG with default parameters and you
should use the masking ('X') option, e.g.
seg my_db_protein_sequences.fasta -x > my_xmasked_db_protein_sequences.fasta
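
After masking, it can be useful to see how much of the database SEG replaced with 'X'. The following sketch parses a multi-FASTA file and reports the masked fraction. The function names are hypothetical, the comment lines are passed through unchanged because the exact layout of the family label is not specified here, and any 'X' that was already present in the input as an unknown residue is counted as well.

```python
def read_fasta(lines):
    """Yield (comment_line, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def masked_fraction(lines):
    """Return the fraction of residues equal to 'X' over all records."""
    total = masked = 0
    for _header, sequence in read_fasta(lines):
        total += len(sequence)
        masked += sequence.upper().count("X")
    return masked / total if total else 0.0
```

Because both functions accept any iterable of lines, an open file handle for the masked FASTA file can be passed directly to masked_fraction.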

We tested UProC on a large file from the Human Microbiome Project (SRS017007)
containing about 13 Gigabases of 100 bp short reads. We used UProC in short
read mode (uproc-dna -s) with multithreading enabled. If the number of
physical cores differed from the number of logical cores (in brackets), we
passed the higher number to the -t option. Runtime was measured as total
wall clock time including all I/O processing.
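
The thread choice and the size of the test file can be written out as simple arithmetic. This is a sketch with hypothetical function names. The read count is only a rough estimate derived from the figures above (about 13 Gigabases at 100 bp per read), and no runtime is implied.

```python
def pick_thread_count(physical_cores: int, logical_cores: int) -> int:
    """Return the value to use for the -t option: the higher of the two counts."""
    return max(physical_cores, logical_cores)


def approx_read_count(total_bases: int, read_length: int) -> int:
    """Estimate the number of reads from total bases and a fixed read length."""
    return total_bases // read_length


def reads_per_second(read_count: int, wall_clock_seconds: float) -> float:
    """Throughput based on total wall clock time, including all I/O."""
    return read_count / wall_clock_seconds
```

With these figures, approx_read_count(13_000_000_000, 100) gives about 130 million reads. Note that Python's os.cpu_count() reports logical cores, so on a machine with hyperthreading it already corresponds to the higher number.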

Because Pfam27 needs much more RAM than Pfam24, we could only use it on a
subset of the available computers. We also tested the UProC binaries with
Pfam24 on an 8 GB notebook running Windows 8.1, which was successful for
smaller FASTA files but failed for the large HMP file above due to limited
memory.