Iain Melvin, Jason Weston, William Stafford Noble, Christina Leslie

Abstract

Virtually every molecular biologist has searched a protein or DNA
sequence database to find sequences that are evolutionarily related to
a given query. Pairwise sequence comparison methods (i.e., measures of
similarity between query and target sequences) provide the engine for
sequence database search and have been the subject of 30 years of
computational research. For the difficult problem of detecting remote
evolutionary relationships between protein sequences, the most
successful pairwise comparison methods involve building local models
(e.g., profile hidden Markov models) of protein sequences. However,
recent work in massive data domains like web search and natural
language processing demonstrates the advantage of exploiting the global
structure of the data space. Motivated by this work, we present a
large-scale algorithm called PROTEMBED, which learns an embedding of
protein sequences into a low-dimensional semantic
space. Evolutionarily related proteins are embedded in close
proximity, and additional pieces of evidence, such as 3D structural
similarity or class labels, can be incorporated into the learning
process. We find that PROTEMBED achieves superior accuracy to widely
used pairwise sequence methods like PSI-BLAST and HHSearch for remote
homology detection; it also outperforms our previous RANKPROP
algorithm, which incorporates global structure in the form of a
protein similarity network. Finally, the PROTEMBED embedding space can
be visualized, both at the global level and local to a given query,
yielding intuition about the structure of protein sequence space.
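
As a rough illustration of what "embedded in close proximity" means for search, the sketch below ranks targets by their distance to a query in an embedding space. It is not the PROTEMBED implementation: the Euclidean distance, the function name rank_by_distance, the protein names, and the 2-D coordinates are all assumptions made up for this example.

```python
import math


def rank_by_distance(query, targets):
    """Return target names sorted from nearest to farthest from the query.

    Assumption: Euclidean distance is used here for illustration only;
    the abstract does not state which distance PROTEMBED uses.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return sorted(targets, key=lambda name: dist(query, targets[name]))


# Hypothetical 2-D coordinates, invented purely for illustration.
toy_embedding = {
    "protein_A": (0.1, 0.2),
    "protein_B": (0.9, 0.8),
    "protein_C": (0.3, 0.1),
}
query_vec = (0.0, 0.0)

ranking = rank_by_distance(query_vec, toy_embedding)
print(ranking)
```

In a setup like this, the top of the ranking would be read as the candidates most likely to be related to the query.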

Requirements

ProtEmbed uses the Torch5 Machine Learning Toolbox; see the Torch5 install instructions. We recommend building Torch from source on a Linux machine.

You will also need the Torch5 "sparselab" library for handling sparse matrices. "sparselab" is part of the torch5-contrib project: sparselab on SourceForge. The easiest way to install a package in Torch is to copy the package directory into the "dev" directory of Torch before you run cmake to build Torch.