Subgraph generation does not require checks for duplicates

The subgraph
enumeration algorithm I described the other day in part 1 works
and it's easy to understand, but it isn't fast. It generates duplicate
subgraphs and at every pass it considers all bonds, including those
which are already in the subgraph.

Here's another simple subgraph enumeration algorithm which doesn't
have these problems. A molecule has N atoms and M bonds. Therefore,
there are 2^M-1 ways to select at least one bond from the
list of bonds. Each of these is a valid and unique subgraph, although
it might not be connected. Select only those which are connected and
you'll have the set of all connected subgraphs containing at least 1
bond. To that add the N subgraphs containing only one atom.

The result is an algorithm which generates all connected subgraphs,
and which does not need duplication testing. It does, however, need
connectedness testing, which the earlier algorithm did not.
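
To make the idea concrete, here's a minimal sketch of the brute-force
approach. It's an illustration only: it assumes atoms are plain
integers and bonds are pairs of atom indices rather than toolkit
objects, and the names is_connected and brute_force_subgraphs are made
up for this example.

```python
from itertools import combinations

def is_connected(bonds):
    """Return True if the bonds (pairs of atom indices) form one connected piece."""
    bonds = list(bonds)
    reached = set(bonds[0])      # atoms reachable from the first bond
    remaining = bonds[1:]        # bonds not yet attached to 'reached'
    changed = True
    while changed and remaining:
        changed = False
        for bond in list(remaining):
            if bond[0] in reached or bond[1] in reached:
                reached.update(bond)
                remaining.remove(bond)
                changed = True
    return not remaining

def brute_force_subgraphs(num_atoms, bonds):
    """Yield (atoms, bonds) frozenset pairs for every connected subgraph."""
    # The N subgraphs containing only one atom
    for atom in range(num_atoms):
        yield frozenset([atom]), frozenset()
    # The 2**M - 1 non-empty bond selections; keep only the connected ones
    for size in range(1, len(bonds) + 1):
        for selection in combinations(bonds, size):
            if is_connected(selection):
                atoms = frozenset(a for bond in selection for a in bond)
                yield atoms, frozenset(selection)

# Cyclopropane (C1CC1): 3 atoms and 3 bonds
triangle = [(0, 1), (1, 2), (0, 2)]
subgraphs = list(brute_force_subgraphs(3, triangle))
print(len(subgraphs))
```

For cyclopropane every bond selection happens to be connected, so this
gives 3 + (2^3-1) = 10 subgraphs. Because it visits all 2^M-1
selections, it's only practical when M is small.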

Subgraph enumeration without checks for duplicates or connectivity

What if we're careful? Another way to think of the algorithm is to
think of all the subgraphs which have bond b1 plus those
which don't have b1. This is a divide-and-conquer
strategy. I know that any subgraph which has b1 must be
connected to that bond, so I only need to look at the other bonds
which come from the terminal atoms. And I know that subgraphs which
don't have b1 must either have b2 or not have
b2.

I think of this in similar terms to my first algorithm with seeds and
ways to grow a seed. There are a set of seeds, which are:

subgraphs which include b1,

subgraphs which include b2 but not b1,

subgraphs which include b3 but not b1 or b2,

subgraphs which include b4 but not b1...3,
...

subgraphs which include bM but not b1...(M-1).

and the way to grow a seed is to select bonds which are connected to a
seed but which haven't already been considered for inclusion in the
seed. I call one of these bonds an "extension", because it extends a
subgraph.

The tricky part is to figure out how to select those bonds. After
several attempts (and do bear in mind that I went through a lot of
attempts and iterations to figure out this algorithm) I decided on the
following:

1. Start with a single edge, and the set of edges which have already
been considered. (If this is the graph made from bi then
the set of edges which never again need consideration is
b1..i.)

2. Look at the set of all extensions which are 1 away from the initial
bond. (This is the same as the bonds connected to the atoms which are
at the ends of the initial bond.) If there are n extensions then there
are 2^n-1 combinations containing at least one extension. Make the
corresponding 2^n-1 new subgraphs. (If you only want subgraphs up to
size k then do the check here.) This stage uses all the possibilities
of incorporating the 1-away edges into the graph, which means they do
not need to be considered in subsequent stages.

3. For each of these 1-away subgraphs, find all of the extensions
which are 2 away from the initial bond. (This is the same as the
bonds connected to the atoms which were newly added during the
previous step. Remember that there's no need to check the bonds
b1...i nor the bonds checked in step 2; all possibilities involving
them have already been handled.) Generate the 2^n-1 combinations of
extensions and use them to make all of the 2-away graphs.

4. Use the 2-away graphs to make the 3-away graphs,

... and so on.

Keep iterating until no more expansions are possible, either because
there are no more bonds to consider or because an expansion would be
too large. In my code I ignore subgraphs with more than k atoms
by skipping expansions which would exceed that limit.
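
The steps above can be sketched in a few dozen lines. This is a
simplified illustration, not the implementation described in the rest
of this essay: it assumes atoms are integers and bonds are (atom, atom)
tuples, it uses itertools.combinations to produce the 2^n-1
combinations, and the names unconsidered_bonds and sketch_enumerate
are hypothetical.

```python
from itertools import combinations

def unconsidered_bonds(bonds_of, considered, new_atoms):
    """Bonds touching the newly added atoms which haven't been considered yet."""
    found = set()
    for atom in new_atoms:
        for bond in bonds_of[atom]:
            if bond not in considered:
                found.add(bond)
    return sorted(found)

def sketch_enumerate(num_atoms, bonds, k):
    """Yield (atoms, bonds) frozenset pairs for connected subgraphs with <= k atoms."""
    bonds_of = dict((atom, []) for atom in range(num_atoms))
    for bond in bonds:
        bonds_of[bond[0]].append(bond)
        bonds_of[bond[1]].append(bond)
    if k < 1:
        return
    for atom in range(num_atoms):
        yield frozenset([atom]), frozenset()
    if k == 1:
        return
    # Step 1: seed i starts with bond i; bonds 0..i never need consideration again
    seeds = []
    considered = set()
    for bond in bonds:
        considered.add(bond)
        atoms = frozenset(bond)
        yield atoms, frozenset([bond])
        extensions = unconsidered_bonds(bonds_of, considered, atoms)
        if extensions:
            seeds.append((frozenset(considered), atoms, frozenset([bond]), extensions))
    # Steps 2, 3, 4, ...: grow each seed by every non-empty combination of extensions
    while seeds:
        considered, atoms, sub_bonds, extensions = seeds.pop()
        new_considered = considered | set(extensions)
        for size in range(1, len(extensions) + 1):
            for combo in combinations(extensions, size):
                new_atoms = frozenset(a for b in combo for a in b) - atoms
                if len(atoms) + len(new_atoms) > k:
                    continue  # this expansion would be too large
                grown_atoms = atoms | new_atoms
                grown_bonds = sub_bonds | frozenset(combo)
                yield grown_atoms, grown_bonds
                if not new_atoms:
                    continue  # only ring closures were added; nothing new to grow from
                next_extensions = unconsidered_bonds(bonds_of, new_considered, new_atoms)
                if next_extensions:
                    seeds.append((new_considered, grown_atoms, grown_bonds, next_extensions))

# Cyclobutane (C1CCC1): 4 atoms and 4 bonds
square = [(0, 1), (1, 2), (2, 3), (0, 3)]
count = len(list(sketch_enumerate(4, square, 4)))
print(count)  # 4 atoms + 4 bonds + 4 bond pairs + 4 bond triples + the ring = 17
```

Note that there's no duplicate check and no connectivity check
anywhere in the sketch. Both properties come from the bookkeeping in
the "considered" set.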

Implementation: getting things set up

The core implementation is simple. It starts with support for
subgraphs of size 0.

For size 1 it returns singleton subgraphs, where a Subgraph class is
identical to the "Seed" from last time; it contains "atoms" and
"bonds" as frozensets.

    # Generate all the subgraphs of size 1
    for atom in mol.GetAtoms():
        yield Subgraph(frozenset([atom]), frozenset())
    # If that's all you want then that's all you'll get
    if k == 1:
        return

For size 2 I just iterate over all of the bonds, and get the atoms at
the end of each bond. I'll use this to make the seeds for future
extensions.

    # Generate the initial seeds. Seed_i starts with bond_i and knows
    # that bond_0 .. bond_i will not need to be considered during any
    # growth of the seed.
    # For each seed I also keep track of the possible ways to extend the seed.
    seeds = []
    considered = set()
    for bond in mol.GetBonds():
        considered.add(bond)
        subgraph = Subgraph(frozenset([bond.GetBgn(), bond.GetEnd()]),
                            frozenset([bond]))
        yield subgraph
        extensions = find_extensions(considered, subgraph.atoms, subgraph.atoms)
        if extensions:
            seeds.append( (considered.copy(), subgraph, extensions) )

As you can see, a seed has:

The set of bonds which have already been considered

The subgraph itself

The possible ways to extend the subgraph into a new subgraph

Some of the atoms in a subgraph were added during the previous
iteration. The subgraph can only grow from bonds which are connected
to those new atoms and which weren't previously considered. The
"find_extensions" function (below) returns a list of all possible
extensions, where an extension is represented as the 2-tuple (bond,
to_atom) and to_atom is None if and only if both atoms of the bond are
new_atoms. This can happen in C1CCC1 in the expansion of CCCC to C1CCC1
since the final extension is the ring bond which connects two atoms
which are already in the previous subgraph.

I use the term "internal extension" when the new bond connects two
atoms which are already in the subgraph. I have to be careful because
internal extensions will appear twice; once for each atom. I use a set
so I don't get duplicates, and at the end add those back into the list
of extensions.

    def find_extensions(considered, new_atoms, all_atoms):
        extensions = []
        internal_extensions = set()
        for atom in new_atoms:
            for outgoing_bond in atom.GetBonds():
                if outgoing_bond in considered:
                    continue
                other_atom = outgoing_bond.GetNbr(atom)
                if other_atom in all_atoms:
                    # This is an unconsidered bond going to
                    # another atom in the same graph. This will
                    # come up twice, so prevent duplicates.
                    internal_extensions.add(outgoing_bond)
                else:
                    extensions.append( (outgoing_bond, other_atom) )
        # Add the (unique) internal extensions to the list of extensions
        extensions.extend((ext, None) for ext in internal_extensions)
        return extensions

Implementation: growing subgraph "seeds"

For larger subgraphs I do depth-first search by getting the last
element of the "seeds" stack. (If I switch to a collections.deque and
pop from the front then this becomes a breadth-first search.)
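
The difference between the two orders comes down to which end of the
container the next item is taken from. Here's a small sketch of that
idea using a toy tree instead of seeds; traversal_order and the tree
are invented for the example.

```python
from collections import deque

def traversal_order(children, root, breadth_first=False):
    """Visit every node reachable from root and return the visit order."""
    pending = deque([root])
    order = []
    while pending:
        # pop() takes the newest item (depth-first);
        # popleft() takes the oldest item (breadth-first)
        node = pending.popleft() if breadth_first else pending.pop()
        order.append(node)
        pending.extend(children.get(node, []))
    return order

tree = {"a": ["b", "c"], "b": ["d"], "c": ["e"]}
print(traversal_order(tree, "a"))                      # ['a', 'c', 'e', 'b', 'd']
print(traversal_order(tree, "a", breadth_first=True))  # ['a', 'b', 'c', 'd', 'e']
```

Either order produces the same set of subgraphs. The depth-first
version tends to keep fewer pending seeds in memory at once.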

    while seeds:
        considered, subgraph, extensions = seeds.pop()
        # I'm going to handle all 2**n-1 ways to expand using these
        # sets of bonds, so there's no need to consider them during
        # any of the future expansions.
        new_considered = considered.copy()
        new_considered.update(ext[0] for ext in extensions)
        for new_atoms, new_subgraph in all_subgraph_extensions(subgraph, extensions, k):
            assert len(new_subgraph.atoms) <= k
            yield new_subgraph
            # If no new atoms were added, and I've already examined
            # all of the ways to expand from the old atoms, then
            # there's no other way to expand and I'm done.
            if not new_atoms:
                continue
            # Start from the new atoms to find bonds which can be used
            # for future extensions.
            new_extensions = find_extensions(new_considered, new_atoms, new_subgraph.atoms)
            if new_extensions:
                seeds.append( (new_considered, new_subgraph, new_extensions) )

That really is all there is to the main algorithm, although the
function "all_subgraph_extensions" does take some explaining.

Implementation: making all possible extensions from a subgraph

The all_subgraph_extensions function generates the new subgraphs,
which are extensions of the input subgraph. It goes through all
2^n-1 combinations, excepting those which add too many atoms
to the subgraph, and merges each combination with the input.

    from itertools import chain

    def all_subgraph_extensions(subgraph, extensions, k):
        # Generate up to 2**len(extensions)-1 new subgraphs which are
        # the possible extensions of the old subgraph. None of the new
        # subgraphs will have more than k atoms.
        assert len(subgraph.atoms) <= k
        assert extensions
        new_atoms_limit = k - len(subgraph.atoms)
        # For each possible extension which is small enough
        for new_atoms, combination in all_combinations(extensions, new_atoms_limit):
            # Make the new subgraph
            atoms = frozenset(chain(subgraph.atoms, new_atoms))
            assert len(atoms) == len(subgraph.atoms) + len(new_atoms), "duplicate atom?"
            bonds = frozenset(chain(subgraph.bonds, (ext[0] for ext in combination)))
            # Also yield the new atoms so they can be used in the seed
            yield new_atoms, Subgraph(atoms, bonds)

I need to generate all the combinations. For that I use a recursive
function.

The first item will always be the empty list, which isn't a valid
extension, so I'll always throw it away by advancing the iterator
once. I'll also throw away any combinations which add too many atoms.

    def all_combinations(extensions, limit):
        # Generate all (2**len(extensions))-1 ways to combine the
        # extensions such that there is at least one extension in the
        # combination and no combination has more than 'limit' atoms.
        # Yield combinations as (set of new atoms, list of extensions)
        n = len(extensions)
        assert n >= 1
        it = _all_combinations(extensions, n-1, 0)
        next(it) # the first contains no extensions; ignore it
        for combination in it:
            atoms = set(ext[1] for ext in combination if ext[1] is not None)
            if len(atoms) > limit:
                continue
            yield atoms, combination

I include the list of new atoms in the yield statement since
eventually they will be included as part of the new seed.
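
The recursive helper called in the listing above isn't shown. As a
rough idea of what such a function can look like, here's a sketch. The
name recursive_combinations and its signature are hypothetical; it is
not the helper used in the listing above and it takes different
arguments. What it shares is the property that the first item
generated is the empty list.

```python
def recursive_combinations(items, i=0):
    """Yield every sub-list of items[i:], starting with the empty list."""
    if i == len(items):
        yield []
        return
    # First the combinations which leave out items[i] ...
    for rest in recursive_combinations(items, i + 1):
        yield rest
    # ... then the combinations which include it.
    for rest in recursive_combinations(items, i + 1):
        yield [items[i]] + rest

it = recursive_combinations(["b1", "b2"])
first = next(it)      # the empty combination, which gets thrown away
others = list(it)     # the 2**2 - 1 = 3 non-empty combinations
print(first, others)  # [] [['b2'], ['b1'], ['b1', 'b2']]
```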

Implementation: the code (this is not the final version!)

The canonical SMARTS generator and the self-tests are essentially
unchanged from last time so I won't describe them. You can download
the entire file as slower_dfs_subgraph_enumeration.py.
(Why is this "slower"? Because in a bit I'll show a somewhat faster
version of the same algorithm.)

Cross-comparison testing

While this algorithm looks simple, it took me several days to
develop. I was glad to have the simple algorithm which I could use to
cross test because I kept finding cases I got wrong. I wrote a test
program called "cross_test.py" which generates SMARTS counts for both
algorithms and if they differ it gives a nice description of where
they differ.
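
The heart of such a comparison can be quite small. The sketch below is
not cross_test.py; it only assumes that each algorithm produces a list
of canonical SMARTS strings, and describe_differences is a made-up
name.

```python
from collections import Counter

def describe_differences(expected, observed):
    """Compare two lists of SMARTS strings and report count mismatches."""
    expected_counts = Counter(expected)
    observed_counts = Counter(observed)
    lines = []
    for smarts in sorted(set(expected_counts) | set(observed_counts)):
        if expected_counts[smarts] != observed_counts[smarts]:
            lines.append("%s: expected %d, got %d"
                         % (smarts, expected_counts[smarts], observed_counts[smarts]))
    return lines

# Pretend output from a reference algorithm and from a new, buggy one
report = describe_differences(["C", "C", "CC"], ["C", "CC", "CC"])
for line in report:
    print(line)
```

An empty report means the two algorithms agree on how many times each
SMARTS pattern was generated.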

Faster through better handling of "internal" and "external" extensions?

There are two types of extensions. One connects the subgraph to itself
and the other expands it to include a new atom. I call these
"internal" and "external" extensions. In my code I merged them into a
single "extension" 2-element tuple containing bond and optional
to_atom. (to_atom is None for internal extensions.)

This worked pretty well, but it precludes certain optimizations. For
example, I don't have to worry about internal extensions making
the subgraph too large because internal extensions will never increase
the atom count. For another example, if the subgraph can only grow by
one atom and I have one external extension and two internal
extensions, then there's no need to include the counting overhead to
ensure that the three extensions will not grow too large.

If I track the internal and external extensions separately then I can
be a bit more clever about generating the new subgraphs. I'll change
"find_extensions" to return the two kinds separately, but the real
big changes are in all_subgraph_extensions. I need to handle three
different cases: if only internal extensions are present, if only
external extensions are present, or if both types are present.
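
As an illustration of what tracking the two kinds separately can look
like, here's a sketch. It assumes atoms are integers, bonds are (atom,
atom) tuples, and bonds_of maps each atom to its bonds; none of that,
nor the name find_split_extensions, comes from the real code.

```python
def find_split_extensions(bonds_of, considered, new_atoms, all_atoms):
    """Return (internal bonds, external (bond, to_atom) pairs) for the new atoms."""
    internal = set()   # a set, because each internal bond is seen from both ends
    external = []
    for atom in new_atoms:
        for bond in bonds_of[atom]:
            if bond in considered:
                continue
            other_atom = bond[1] if bond[0] == atom else bond[0]
            if other_atom in all_atoms:
                internal.add(bond)
            else:
                external.append((bond, other_atom))
    return sorted(internal), external

# A four-membered ring 0-1-2-3 with atom 4 hanging off atom 3.
# The subgraph so far has bonds (0,1), (1,2) and (0,3); atoms 2 and 3 are new.
bonds_of = {
    2: [(1, 2), (2, 3)],
    3: [(0, 3), (2, 3), (3, 4)],
}
considered = set([(0, 1), (1, 2), (0, 3)])
internal, external = find_split_extensions(bonds_of, considered, [2, 3], set([0, 1, 2, 3]))
print(internal)  # [(2, 3)]  the ring closure
print(external)  # [((3, 4), 4)]  the bond which adds atom 4
```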

Implementation: only internal extensions

Handling only internal extensions is the easiest: enumerate all
combinations, none of which add any atoms.
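
A sketch of this case, under the assumption that a subgraph is just a
pair of frozensets and that internal extensions are plain bond tuples
(internal_only_extensions is a hypothetical name, and it uses
itertools.combinations rather than the all_combinations function):

```python
from itertools import combinations

def internal_only_extensions(atoms, bonds, internal_bonds):
    """Yield (new_atoms, (atoms, bonds)) for each non-empty choice of internal bonds."""
    no_new_atoms = frozenset()   # internal extensions never add atoms
    for size in range(1, len(internal_bonds) + 1):
        for chosen in combinations(internal_bonds, size):
            yield no_new_atoms, (atoms, bonds | frozenset(chosen))

atoms = frozenset([0, 1, 2, 3])
bonds = frozenset([(0, 1), (1, 2), (0, 3)])
results = list(internal_only_extensions(atoms, bonds, [(2, 3), (1, 3)]))
print(len(results))  # 2**2 - 1 = 3 new subgraphs, all with the same 4 atoms
```

Since the atom count never changes, there's no size limit to check.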

Implementation: only external extensions

If there are only external extensions then it's also pretty easy:

    if not internal_extensions:
        # Only external extensions
        # If we're at the limit then it's not possible to extend
        if limit == 0:
            return
        # We can extend by at least one atom.
        it = limited_external_combinations(external_extensions, limit)
        next(it) # the first contains no extensions; ignore it
        for new_atoms, external_ext in it:
            # Make the new subgraphs
            atoms = frozenset(chain(subgraph.atoms, new_atoms))
            bonds = frozenset(chain(subgraph.bonds, (ext[0] for ext in external_ext)))
            yield new_atoms, Subgraph(atoms, bonds)
        return

Implementation: both internal and external extensions

Finally, if there's at least one of each extension type then I need to
generate the cross-product of all internal and all external
extension combinations. That's easy with itertools.product.
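
Here's a sketch of how the cross-product might be put together. It's an
assumption-laden illustration: internal extensions are any hashable
items, external extensions are (bond, to_atom) pairs, and the helper
names all_subsets and mixed_combinations are invented. The one pairing
to skip is the first, which has no internal and no external extension.

```python
from itertools import chain, combinations, product

def all_subsets(items):
    """Every subset of items as a tuple, starting with the empty tuple."""
    return chain.from_iterable(combinations(items, n) for n in range(len(items) + 1))

def mixed_combinations(internal, external, limit):
    """Yield (new_atoms, internal subset, external subset) with at most 'limit' new atoms."""
    # Only the external extensions can add atoms, so filter them first
    external_choices = []
    for subset in all_subsets(external):
        new_atoms = set(to_atom for (bond, to_atom) in subset)
        if len(new_atoms) <= limit:
            external_choices.append((new_atoms, subset))
    pairs = product(all_subsets(internal), external_choices)
    next(pairs)  # the first pair is (no internal, no external); ignore it
    for internal_subset, (new_atoms, external_subset) in pairs:
        yield new_atoms, internal_subset, external_subset

combos = list(mixed_combinations(["ring bond"], [("bond A", 5), ("bond B", 6)], 1))
print(len(combos))  # 2 internal choices * 3 external choices - 1 = 5
```

Filtering the external choices before taking the product means the
size check is done once per external combination instead of once per
pair.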

Implementation: all external extension combinations

The all_combinations function is as before. The new function
limited_external_combinations is a variation designed for external
extensions. It keeps track of the set of atoms used by the given
extension combination and doesn't return extension combinations which
are too large.
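
One way to write such a function is to carry the set of atoms along
during the recursion and stop descending as soon as the limit would be
exceeded. The sketch below shows that idea only. Its name, signature,
and the use of (bond, to_atom) tuples with strings for bonds are
assumptions for the example.

```python
def sketch_limited_combinations(external_extensions, limit, i=0, atoms=frozenset()):
    """Yield (new_atoms, extensions) pairs; the first one is the empty combination."""
    if i == len(external_extensions):
        yield atoms, []
        return
    # Combinations which skip extension i
    for result in sketch_limited_combinations(external_extensions, limit, i + 1, atoms):
        yield result
    # Combinations which use extension i, but only if they stay within the limit
    bond, to_atom = external_extensions[i]
    grown = atoms | frozenset([to_atom])
    if len(grown) <= limit:
        for new_atoms, rest in sketch_limited_combinations(
                external_extensions, limit, i + 1, grown):
            yield new_atoms, [external_extensions[i]] + rest

# Bonds "b1" and "b3" both lead to atom 5; bond "b2" leads to atom 6
extensions = [("b1", 5), ("b2", 6), ("b3", 5)]
it = sketch_limited_combinations(extensions, 1)
empty = next(it)             # (frozenset(), []), which the caller ignores
within_limit = list(it)
print(len(within_limit))     # 4: [b3], [b2], [b1] and [b1, b3]
```

Because whole branches are pruned, combinations which are too large
are never built in the first place.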

Performance

My goal was to make the subgraph enumeration algorithm faster. Have I
managed that? I wrote a program to measure the performance of the
different algorithms. The new algorithm is about 3.5 times faster than
the first one, and paying careful attention to the extensions gave me
7% better performance.

I think this algorithm uses the best approach, but there are many ways
to further speed it up. For example: using the integers from GetIdx()
might be better than storing the atom/bond objects directly in the
set; I could use an array of flags rather than a set; I could
hard-code the enumeration for the first 10 extensions rather than
depend on Python's slow stack functions; and I could rewrite the code
to work in Pyrex or C++.

But this is fast enough. At 50 structures per second it would take my
laptop about 12 days to process a 50 million compound database. More
likely I would pop over to Amazon, rent time on 100 machines, and have
it done in a few hours.

Granted, I could have done the same with 3.5 times more computers, but
having a clever algorithm makes me feel good. Besides, distributed
computing improves throughput, not response time. If I want to
generate the subgraphs as part of a query then it's better to have the
processing take 1/60th of a second than 1/15th.
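
The back-of-the-envelope numbers for the database run are easy to
check. The only inputs are the figures quoted above: 50 structures per
second, 50 million compounds, and 100 rented machines.

```python
compounds = 50e6     # size of the database
rate = 50.0          # structures per second on one machine
machines = 100

laptop_days = compounds / rate / (24 * 60 * 60)
cluster_hours = compounds / rate / machines / (60 * 60)

print("one laptop: %.1f days" % laptop_days)                  # about 11.6 days
print("%d machines: %.1f hours" % (machines, cluster_hours))  # about 2.8 hours
```

This assumes the work divides evenly across the machines and ignores
any startup or data transfer time.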

Comments and Feedback

If you liked this essay, found a problem with the code, or have
something else to add then go ahead and leave
a comment. I would especially like to hear from people who have
done non-trivial work with subgraph enumerations and in fingerprint
filter generation.