Abstract

Background

PubChem is a free and open public resource for the biological activities of small
molecules. With many tens of millions of both chemical structures and biological test
results, PubChem is a sizeable system with an uneven degree of available information.
Some chemical structures in PubChem include a great deal of biological annotation,
while others have little to none. To help users, PubChem pre-computes "neighboring"
relationships to relate similar chemical structures, which may have similar biological
function. In this work, we introduce a "Similar Conformers" neighboring relationship
to identify compounds with similar 3-D shape and similar 3-D orientation of functional
groups typically used to define pharmacophore features.

Results

The first two diverse 3-D conformers of 26.1 million PubChem Compound records were
compared to each other, using a shape Tanimoto (ST) of 0.8 or greater and a color
Tanimoto (CT) of 0.5 or greater, yielding 8.16 billion conformer neighbor pairs and
6.62 billion compound neighbor pairs, with an average of 253 "Similar Conformers"
compound neighbors per compound. Comparing the 3-D neighboring relationship to the
corresponding 2-D neighboring relationship ("Similar Compounds") for molecules such
as caffeine, aspirin, and morphine, one finds unique sets of related chemical structures,
providing additional significant biological annotation. The PubChem 3-D neighboring
relationship is also shown to group a set of non-steroidal anti-inflammatory
drugs (NSAIDs), despite their limited PubChem 2-D similarity.
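
The neighboring criterion described above can be illustrated with a short sketch. This is not the PubChem implementation: the `ConformerPair` record, the function names, and the choice to count each compound pair once regardless of order are assumptions made for the example. Only the ST and CT thresholds come from the text.

```python
from typing import Iterable, NamedTuple, Set, Tuple

ST_MIN = 0.8  # shape Tanimoto threshold stated in the text
CT_MIN = 0.5  # color Tanimoto threshold stated in the text


class ConformerPair(NamedTuple):
    """Hypothetical record for one scored conformer-to-conformer comparison."""
    cid_a: int   # compound identifier of the first conformer
    conf_a: int  # conformer index within the first compound
    cid_b: int   # compound identifier of the second conformer
    conf_b: int  # conformer index within the second compound
    st: float    # shape Tanimoto of the superposition
    ct: float    # color Tanimoto of the superposition


def is_neighbor(st: float, ct: float) -> bool:
    """A conformer pair is a neighbor only if both thresholds are met."""
    return st >= ST_MIN and ct >= CT_MIN


def compound_neighbor_pairs(pairs: Iterable[ConformerPair]) -> Set[Tuple[int, int]]:
    """Collapse conformer neighbor pairs into unique compound neighbor pairs.

    Assumption: two compounds are neighbors if any pair of their conformers
    is a neighbor, and each compound pair is counted once (unordered).
    """
    result: Set[Tuple[int, int]] = set()
    for p in pairs:
        if p.cid_a != p.cid_b and is_neighbor(p.st, p.ct):
            result.add((min(p.cid_a, p.cid_b), max(p.cid_a, p.cid_b)))
    return result
```

Because several conformer pairs can map to the same compound pair, the compound pair count can never exceed the conformer pair count, which is consistent with the 8.16 billion and 6.62 billion figures reported above.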

In a study of 4,218 chemical structures of biomedical interest, consisting of many
known drugs, using more diverse conformers per compound results in more 3-D compound
neighbors per compound; however, the compound neighbor lists per conformer
also increasingly resemble each other, being 38% identical at three conformers and
68% at ten conformers. Perhaps surprisingly, the average count of conformer neighbors
per conformer increases rather slowly as a function of the number of diverse conformers
considered, with only a 70% increase for a ten-fold growth in conformers per compound
(a 68-fold increase in the conformer pairs considered).

Neighboring 3-D conformers on this scale, if implemented naively, is intractable
on a modest-sized compute cluster. The methodology developed in this work relies
on a series of filters that avoid 3-D superposition optimization whenever
it can be determined that two conformers cannot possibly be neighbors. Most filters
are based on volume constraints derived from the Tanimoto equation, which rule out
incompatible conformers; others consider a preliminary superposition between conformers
using reference shapes.
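
As an illustration of how a volume constraint can rule out a pair before any superposition is attempted, consider the shape Tanimoto written as ST = V_AB / (V_A + V_B - V_AB), where V_AB is the overlap volume. Since V_AB can never exceed the smaller of the two volumes, ST is bounded above by min(V_A, V_B) / max(V_A, V_B). The sketch below applies that single bound; it is one plausible filter of this kind, not necessarily one of the filters used in this work, and the function names are hypothetical.

```python
def max_shape_tanimoto(vol_a: float, vol_b: float) -> float:
    """Upper bound on the shape Tanimoto from the two self-volumes alone.

    The overlap volume is at most min(vol_a, vol_b); substituting that into
    ST = V_AB / (V_A + V_B - V_AB) gives min(vol_a, vol_b) / max(vol_a, vol_b).
    """
    if vol_a <= 0 or vol_b <= 0:
        raise ValueError("volumes must be positive")
    return min(vol_a, vol_b) / max(vol_a, vol_b)


def can_skip_superposition(vol_a: float, vol_b: float, st_min: float = 0.8) -> bool:
    """True if no superposition could reach st_min, so the pair can be skipped."""
    return max_shape_tanimoto(vol_a, vol_b) < st_min
```

With the ST threshold of 0.8, a conformer of volume 100 and one of volume 130 (in any consistent unit) can be rejected immediately, because the best possible ST is 100/130, roughly 0.77. A pair with volumes 100 and 120 passes this filter and would still need further checks.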

Conclusion

The "Similar Conformers" 3-D neighboring relationship locates similar small molecules
of biological interest that may go unnoticed when using traditional 2-D chemical structure
graph-based methods, making it complementary to such methodologies. The computational
cost of applying 3-D similarity methodology at the scale of the PubChem contents is a
considerable obstacle. Using a series of efficient filters, an effective
throughput rate of more than 150,000 conformers per second per processor core was
achieved, more than two orders of magnitude faster than without filtering.