The newdist software package
provides implementations of three similarity measures of different
expressive power to quantify genomic relationships on the basis of
the gene connection model.

All core methods are implemented in Python, while some
helper scripts are written in Bash. Some of
the core methods output integer linear programs
(ILPs) in a format suitable for IBM's CPLEX
solver.
newdist has the following system
requirements:

Python 2.7

NetworkX ≥ 1.10

IBM ILOG CPLEX Optimization Studio

GNU/Unix (optional)

Biopython ≥ 1.6 (optional)

In the gene connection model, genomes are compared based on
their bipartite gene connection graph: Given
two genomes S and T, a
gene connection graph G(S, T)
of S and
T is a bipartite graph with one vertex for
each gene of S and one vertex for each gene of
T. An edge between two vertices, one from
S and one from T,
indicates that there is some connection between the two genes
represented by these vertices. The methods in this
software package take gene connection graphs as input. These graphs
can be constructed from BLAST tables with the provided script
pairwise_similarities.py.
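
As an illustration of the data structure only, the following minimal sketch stores a gene connection graph as two adjacency maps, one per side of the bipartition. The function name build_gene_connection_graph, the gene labels, and the input layout (two gene lists plus a list of connected pairs) are assumptions made for this example; they are not the format read or written by pairwise_similarities.py.

```python
def build_gene_connection_graph(genes_s, genes_t, connections):
    """Return two adjacency maps for a bipartite gene connection graph.

    genes_s, genes_t: iterables of gene identifiers of genomes S and T.
    connections: iterable of (s_gene, t_gene) pairs, one per edge.
    All names and the input layout are hypothetical.
    """
    # Every gene gets a vertex, even if it has no connection at all.
    s_to_t = {g: set() for g in genes_s}
    t_to_s = {g: set() for g in genes_t}
    for s_gene, t_gene in connections:
        # Edges only run between the two sides, never within one genome.
        s_to_t[s_gene].add(t_gene)
        t_to_s[t_gene].add(s_gene)
    return s_to_t, t_to_s


genes_s = ["s1", "s2", "s3"]
genes_t = ["t1", "t2", "t3"]
connections = [("s1", "t1"), ("s2", "t1"), ("s2", "t3")]
s_to_t, t_to_s = build_gene_connection_graph(genes_s, genes_t, connections)
```

In this toy input, s3 and t2 remain isolated vertices, which matches the definition above: the graph has one vertex per gene regardless of whether the gene is connected.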

The provided methods facilitate the inter-species comparison
of gene orders between two genomes by means of counting
conserved adjacencies, which are defined as
follows: Given an integer θ ≥ 1, a pair of
index positions (i, i') with i' ≤ i + θ
in a string is a
(θ-)adjacency.
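
To make the definition concrete, the sketch below enumerates all θ-adjacencies of a string of a given length. It assumes 0-based index positions and additionally assumes i < i', which the text above does not state explicitly. The function name theta_adjacencies is hypothetical and is not part of newdist.

```python
def theta_adjacencies(length, theta):
    """Yield all index pairs (i, i2) with i < i2 <= i + theta.

    Positions are 0-based and run from 0 to length - 1.
    The strict inequality i < i2 is an assumption of this sketch.
    """
    if theta < 1:
        raise ValueError("theta must be at least 1")
    for i in range(length):
        # The partner index may be at most theta positions to the right
        # and must stay inside the string.
        last = min(i + theta, length - 1)
        for i2 in range(i + 1, last + 1):
            yield (i, i2)


adjs_theta1 = list(theta_adjacencies(4, 1))
adjs_theta2 = list(theta_adjacencies(4, 2))
```

With θ = 1 only directly neighboring positions are paired, while θ = 2 additionally pairs positions that have one position between them.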
Further, a pair of adjacencies between two genomes
S and T is
conserved if

their corresponding genes are connected in their gene
connection graph G(S, T)
and

their corresponding genes' relative orientation is
identical.
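
The two conditions can be sketched as a small predicate. This is one possible reading, not the package's implementation: it assumes genes are given as (name, sign) tuples with sign +1 or -1, that the gene connection graph is a set of (S gene name, T gene name) pairs, that the genes of the two adjacencies may correspond either directly or crosswise, and that "relative orientation" means the product of the two signs within each adjacency. The function name is_conserved and all data are hypothetical.

```python
def is_conserved(genome_s, genome_t, adj_s, adj_t, connected):
    """Check the two conditions for a pair of adjacencies (assumed reading).

    genome_s, genome_t: lists of (name, sign) tuples, sign in {+1, -1}.
    adj_s, adj_t: index pairs (i, i2) and (j, j2) into the genomes.
    connected: set of (s_name, t_name) pairs, i.e. the graph's edges.
    """
    (i, i2), (j, j2) = adj_s, adj_t
    (a, sign_a), (b, sign_b) = genome_s[i], genome_s[i2]
    (c, sign_c), (d, sign_d) = genome_t[j], genome_t[j2]

    # Condition 1: the genes are connected, either directly or crosswise.
    direct = (a, c) in connected and (b, d) in connected
    crossed = (a, d) in connected and (b, c) in connected

    # Condition 2 (assumed reading): the two genes of each adjacency have
    # the same relative orientation in both genomes.
    same_orientation = (sign_a * sign_b) == (sign_c * sign_d)
    return (direct or crossed) and same_orientation


genome_s = [("a", 1), ("b", 1), ("c", -1)]
genome_t = [("x", 1), ("y", 1), ("z", 1)]
connected = {("a", "x"), ("b", "y"), ("c", "z")}

first_pair = is_conserved(genome_s, genome_t, (0, 1), (0, 1), connected)
second_pair = is_conserved(genome_s, genome_t, (1, 2), (1, 2), connected)
```

In this toy example the first pair satisfies both conditions, whereas the second pair is connected but differs in relative orientation because gene c is reversed in S.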

The different similarity measures provided by
newdist are expressed by the following
three problem statements:

Solving Problem 2 is NP-hard even for 1-adjacencies.
Therefore, we provide a third, intermediate measure, which is more
efficient to compute in practice while still producing one-to-one
correspondences between gene extremities. It is defined as the size
of the largest subset of non-conflicting conserved
adjacencies found in a pair of genomes, where two
conserved adjacencies are called
conflicting if their intervals in
either genome overlap.

Problem 3 (adjacency matching
model). Given two genomes S
and T and a gene connection graph
G(S, T), let C be the set
of conserved adjacencies between S and
T. Compute the size
|C*| of a maximum
cardinality set of non-conflicting conserved adjacencies
C* ⊆ C.
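
The objective of Problem 3 can be illustrated with an exhaustive search on a toy instance. The sketch below is not the method used by newdist and is only feasible for a handful of adjacencies, since it tries subsets in order of decreasing size. It assumes each conserved adjacency is given as a pair of index intervals, ((i, i'), (j, j')), with start < end, and it assumes that two intervals overlap only if they share more than a single endpoint. All function names are hypothetical.

```python
from itertools import combinations


def intervals_overlap(a, b):
    """True if intervals a and b share more than a single endpoint.

    Treating a shared endpoint as non-overlapping is an assumption.
    """
    return max(a[0], b[0]) < min(a[1], b[1])


def conflicting(adj_a, adj_b):
    """Two conserved adjacencies conflict if they overlap in S or in T."""
    return (intervals_overlap(adj_a[0], adj_b[0]) or
            intervals_overlap(adj_a[1], adj_b[1]))


def max_nonconflicting(conserved):
    """Return a largest conflict-free subset by brute force (toy sizes only)."""
    for size in range(len(conserved), 0, -1):
        for subset in combinations(conserved, size):
            pairs = combinations(subset, 2)
            if all(not conflicting(x, y) for x, y in pairs):
                return list(subset)
    return []


# Each entry: (interval in S, interval in T); the data are made up.
conserved = [((0, 1), (0, 1)), ((1, 2), (1, 2)), ((0, 2), (0, 2))]
best = max_nonconflicting(conserved)
```

Here the third adjacency spans both of the others in each genome, so it conflicts with them, and the largest conflict-free subset consists of the first two adjacencies.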

newdist provides the following
core scripts:

enumerate_adjs.py - script to solve
Problem 1 and the first part of Problem 3 (see paper);