Neighborhood Correlation is a novel homology identification method
based on the observation that gene duplication and domain insertion
result in different topological structures in the sequence similarity
network. For details of Neighborhood Correlation, please refer to the
publication:

In the PLoS paper above, we applied Neighborhood Correlation to all
full-length mouse and human amino acid sequences in SwissProt Version
50.9. In an empirical validation of pairwise homology identification
performance on twenty manually curated families, we show that
Neighborhood Correlation achieves high sensitivity and specificity in
both single domain and complex multidomain families. It outperforms
traditional methods that combine sequence similarity with additional
criteria based on alignment length.

To examine the performance on individual families and sequences in
the mouse and human data set, please see
the Neighborhood Correlation Browser. The
Browser allows exploratory analysis of the neighborhood structure
of the protein sequence similarity network. The user may

select a protein sequence of interest by keyword search,

visit one of our twenty curated families, and

browse the protein sequences in our initial dataset.

Download Neighborhood Correlation

We make available an open-source (GPL) implementation of Neighborhood
Correlation to demonstrate our algorithms and to facilitate novel
analysis of additional data sets.

Neighborhood Correlation Version 2.1

Version 2.1 further improves the performance of Neighborhood
Correlation, in two ways:

To produce Neighborhood Correlation scores for all pairs of
sequences, previous versions iterated over all N^2 pairs, for N
input sequences. This version progressively iterates through the
neighborhood of each query sequence, resulting in N * M pairwise
calculations, where M is the number of sequences in the
neighborhood of each query sequence, plus the number of sequences
in the neighborhoods of those sequences. For large datasets, this
optimization is extremely beneficial.
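The neighborhood-driven iteration described above can be sketched as
follows. This is only an illustration and not the code shipped in
Version 2.1: the function names (build_neighborhoods, candidate_pairs)
and the dictionary-based score representation are assumptions made for
this example. It enumerates, for each query sequence, the sequences in
its neighborhood plus those in the neighborhoods of its neighbors,
rather than all N^2 pairs.

```python
from collections import defaultdict


def build_neighborhoods(scores):
    """Map each sequence to the set of sequences it has a BLAST score with.

    `scores` is a dict keyed by (x, y) tuples; the relation is treated
    as symmetric, so each hit is recorded in both directions.
    """
    nbrs = defaultdict(set)
    for x, y in scores:
        nbrs[x].add(y)
        nbrs[y].add(x)
    return nbrs


def candidate_pairs(nbrs):
    """Yield each unordered pair (x, y) reachable within two hops of x.

    Because the neighbor relation is symmetric, two-hop reachability is
    symmetric as well, so emitting only x < y lists every pair once.
    """
    for x in sorted(nbrs):
        reach = set(nbrs[x])
        for n in nbrs[x]:
            reach |= nbrs[n]       # neighborhoods of x's neighbors
        reach.discard(x)
        for y in sorted(reach):
            if x < y:
                yield (x, y)


scores = {("a", "b"): 50.0, ("b", "c"): 40.0, ("d", "e"): 30.0}
pairs = list(candidate_pairs(build_neighborhoods(scores)))
```

In this toy input, five sequences give ten possible unordered pairs,
but only four fall within the two-hop reach and need to be scored.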

Neighborhood Correlation first makes the input BLAST scores
symmetric: BIT-score(x,y) = max( BIT-score(x,y), BIT-score(y,x)).
This version improves the efficiency of this calculation through
use of a compiled C function.
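The symmetrization rule above can be illustrated in a few lines of
Python. This is a plain-Python sketch for clarity, not the compiled C
function used in Version 2.1; the function name symmetrize and the
dictionary keyed by (x, y) tuples are assumptions of this example.

```python
def symmetrize(scores):
    """Return scores where (x, y) and (y, x) both hold the larger bit score.

    If only one direction was reported by BLAST, that score is used
    for both directions.
    """
    sym = {}
    for (x, y), s in scores.items():
        best = max(s, scores.get((y, x), s))
        sym[(x, y)] = best
        sym[(y, x)] = best
    return sym


raw = {("a", "b"): 100.0, ("b", "a"): 120.0, ("a", "c"): 55.0}
sym = symmetrize(raw)
```

Here both ("a", "b") and ("b", "a") end up with the larger of the two
reported scores, and the one-directional hit ("a", "c") is mirrored.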

BUG FIX: Version 2.0 was released with the LOG_10 transformation of
the input inadvertently disabled. Version 2.1 restores the correct
behavior by using LOG_10(BIT-score) for all internal
calculations.
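As a small illustration of the transformation named above, the sketch
below applies log10 to every bit score before any further calculation.
The function name log10_scores and the check for non-positive scores
are assumptions of this example, not a description of the released
code.

```python
import math


def log10_scores(scores):
    """Return a new dict with each positive bit score replaced by its log10."""
    transformed = {}
    for pair, score in scores.items():
        if score <= 0:
            # log10 is undefined here; this sketch rejects such input.
            raise ValueError("bit score must be positive: %r" % (pair,))
        transformed[pair] = math.log10(score)
    return transformed


logged = log10_scores({("a", "b"): 100.0, ("a", "c"): 1000.0})
```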

Neighborhood Correlation Version 2.0

Version 2.0 is a complete rewrite of the Neighborhood Correlation
implementation. It is meant to replace Version 1.0 (previously
referred to as the "reference implementation"). Version 2.0 has been
optimized to accommodate large datasets through fast computation and
greatly reduced memory usage.

This implementation adds a dependency on
the Numpy numerical package. It
also requires that a C compiler be available on the system. We believe
it to be platform independent, and have tested it on Linux and MacOS.

Performance is greatly improved over Version 1.0. As a rough
guide, the set of Mouse and Human sequences used in our analysis
included 26,197 sequences. From this, all-against-all BLAST yielded
approximately 4.8 million pairwise relations. For this dataset,
Neighborhood Correlation Version 2.0 can be expected to consume
approximately 125MB of memory. Running time for this dataset is
approximately 45 minutes on an Intel Pentium D at 3.2GHz. Version 1.0
required more than 1GB of memory and 16 hours of running time.

If you are working with a small dataset (1-2 million BLAST scores) and
don't care to install Numpy, give Version 1.0 a try. The input and
output are equivalent, save the following: Version 1.0 reported NC
scores for pairs that satisfied the condition (NC(x,y) ≥ nc_thresh ||
BLAST score (x,y) exists). Now, this has been simplified to only
(NC(x,y) ≥ nc_thresh).
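The difference between the two reporting conditions can be made
concrete with a short sketch. The function names and the dictionary
and set arguments below are hypothetical and chosen only for
illustration; nc_thresh is the threshold named above.

```python
def report_v1(nc_scores, blast_pairs, nc_thresh):
    """Version 1.0 rule: keep a pair if NC >= nc_thresh OR a BLAST score exists."""
    return {pair: nc for pair, nc in nc_scores.items()
            if nc >= nc_thresh or pair in blast_pairs}


def report_v2(nc_scores, nc_thresh):
    """Version 2.0 rule: keep a pair only if NC >= nc_thresh."""
    return {pair: nc for pair, nc in nc_scores.items() if nc >= nc_thresh}


nc_scores = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "c"): 0.1}
blast_pairs = {("a", "b"), ("a", "c")}   # pairs that have a BLAST score

old_output = report_v1(nc_scores, blast_pairs, nc_thresh=0.5)
new_output = report_v2(nc_scores, nc_thresh=0.5)
```

With these toy values, the pair ("a", "c") appears only in the
Version 1.0 style output, because it has a BLAST score but falls below
the threshold.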

Neighborhood Correlation Version 1.0

This is the original "reference implementation" used to demonstrate
the algorithms in the PLoS publication.
We have focused upon an intuitive
implementation with readable code. This program requires only a basic
Python installation, and has no additional dependencies. It has been
tested with Python version 2.5 on a Linux computer. It has no
OS-specific requirements and should work on any complete Python
installation.

Supplemental Data

We predicted mouse and human homologs using Neighborhood Correlation.
Those predictions, our manually curated validation set, and a
reference implementation are available here.

The PLoS Computational Biology study was carried out on all
full-length mouse and human amino acid sequences in SwissProt Version 50.9
(11,553 mouse protein sequences and 14,644 human protein sequences).
Data used in the study and predictions made using our method are
available here:

FASTA sequences for all 26,197
human and mouse sequences used in the study.

Homologous Family Benchmark: Panther 7.0
identifiers for all sequences in each family of our manually curated
benchmark. The Panther dataset is newer than the SwissProt dataset
used in the original PLoS paper, and contains family members that were
not in SwissProt at the time. This is our most current annotation
set. (Updated 23 Aug 2011)

Recent Publications

This work extends the
identification of homologous pairs to classification of entire protein
families. We investigated the structure of the homology network and
that inferred by Neighborhood Correlation. Of principal interest is
the ability to evaluate a classification in the absence of
hand-curated data, by considering intrinsic measures of that network.
We demonstrated a strategy that reduces noise in and restores
structure to artificial networks with simulated noise, as well as to
the yeast genome homology network. We further evaluated this approach
on a hand-curated set of multidomain sequences in mouse and human and
demonstrated that classification using the rewired network delivers
dramatic improvement in Precision and Recall, compared with current
methods.

Contact

For assistance with, or questions about, any of the material on this
page, please contact Jacob
Joseph or Dannie Durand. We are
always pleased to hear about new analyses.

Funding

This material is based upon work supported by the National Science
Foundation (NSF) under Grant No. DBI-0641313, the National Institutes
of Health (NIH) under Grant No. 1 K22 HG 02451-01, and a David and
Lucile Packard Foundation fellowship. Any opinions, findings and
conclusions or recommendations expressed in this material are those of
the author and do not necessarily reflect the views of NSF, NIH, or
the Packard Foundation.