The Daylight Clustering Package is a suite of programs providing four clustering algorithms,
each of which starts with a set of structures or reactions in SMILES™
and produces clustering results in formats suitable for further analysis.
The Clustering Package can handle very large sets of compounds (millions of structures)
for applications such as database analysis or vendor compound selection
(Jarvis-Patrick, k-modes, and sphere exclusion),
as well as smaller sets for applications such as determining structure-activity
relationships within lead optimization projects or document analysis (scaffold-directed).
In the latter case, compounds can be clustered in a way that guarantees that each
resulting cluster can be represented by a common structural scaffold.

Highlights

The modular nature of the package ensures that
users have control over the four stages of the clustering process:
descriptor assignment, similarity measure, clustering algorithm,
and post-clustering analysis and storage.
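
The four stages can be illustrated with a minimal Python sketch. It is not the package's own code or interface: the fingerprints are hand-written sets of bit positions, the similarity measure is assumed to be Tanimoto, and the clustering step is a simplified sphere-exclusion variant that takes the first unassigned compound as each cluster centre. All names (tanimoto, sphere_exclusion, cpd_a and so on) are hypothetical.

```python
from typing import Dict, FrozenSet, List

Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Stage 2: Tanimoto similarity between two sets of 'on' bit positions."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def sphere_exclusion(fps: Dict[str, Fingerprint], threshold: float) -> List[List[str]]:
    """Stage 3: simplified sphere exclusion (assumption: first remaining
    compound becomes the centre; everything within the threshold joins it)."""
    remaining = list(fps)
    clusters: List[List[str]] = []
    while remaining:
        centre = remaining[0]
        members = [name for name in remaining
                   if tanimoto(fps[centre], fps[name]) >= threshold]
        clusters.append(members)
        remaining = [name for name in remaining if name not in members]
    return clusters


# Stage 1: descriptor assignment (toy fingerprints, not real Daylight output).
fingerprints: Dict[str, Fingerprint] = {
    "cpd_a": frozenset({1, 2, 3, 4}),
    "cpd_b": frozenset({1, 2, 3, 5}),
    "cpd_c": frozenset({10, 11, 12}),
    "cpd_d": frozenset({10, 11, 12, 13}),
    "cpd_e": frozenset({20, 21}),
}

clusters = sphere_exclusion(fingerprints, threshold=0.5)

# Stage 4: post-clustering analysis, here just a size report per cluster.
for index, members in enumerate(clusters, start=1):
    print(f"cluster_{index:03d}: {len(members)} member(s): {', '.join(members)}")
```

Because each stage is a separate function or data structure, any one of them can be swapped (a different descriptor, similarity measure, or algorithm) without touching the others, which is the kind of control the modular design is meant to give.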

The clustering algorithms offered for use with large datasets are
fast and scale well, so users can
explore the effect of changing the descriptors and similarity measure
on the clustering outcome.

The scaffold-directed algorithm offered for small datasets guarantees
that each cluster has a common substructure
that meets a minimum coverage requirement.

The programs in the clustering package use dynamic memory allocation:
computable problem size is limited only by available virtual memory
and CPU speed, so large datasets can be clustered.

Most of the clustering algorithms offered allow
the clustering to be updated dynamically as further
compounds are added to, say, a corporate collection.
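
One way such an update could work is sketched below in Python. This is an illustration under stated assumptions, not the package's documented behaviour: each existing cluster is assumed to be represented by a single centre fingerprint, similarity is assumed to be Tanimoto, and a new compound either joins the most similar centre or starts a cluster of its own. The function and variable names are hypothetical.

```python
from typing import Dict, FrozenSet

Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity between two sets of 'on' bit positions."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def assign_incrementally(centres: Dict[str, Fingerprint], new_name: str,
                         new_fp: Fingerprint, threshold: float) -> str:
    """Return the name of the cluster the new compound ends up in.

    The compound joins the most similar existing centre if that similarity
    reaches the threshold; otherwise it becomes the centre of a new cluster.
    Existing assignments are never recomputed.
    """
    best_name, best_sim = None, -1.0
    for centre_name, centre_fp in centres.items():
        sim = tanimoto(new_fp, centre_fp)
        if sim > best_sim:
            best_name, best_sim = centre_name, sim
    if best_name is not None and best_sim >= threshold:
        return best_name
    centres[new_name] = new_fp  # the new compound founds its own cluster
    return new_name


# Toy centres from an earlier clustering run (hypothetical data).
centres: Dict[str, Fingerprint] = {
    "centre_1": frozenset({1, 2, 3}),
    "centre_2": frozenset({7, 8, 9}),
}

first = assign_incrementally(centres, "new_1", frozenset({1, 2, 3, 4}), 0.6)
second = assign_incrementally(centres, "new_2", frozenset({20, 21}), 0.6)
print(first, second, len(centres))
```

The design choice here is that only the new compound is compared against the stored centres, so adding to a large collection costs one pass over the centres rather than a full re-clustering.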

The analysis program output can be tailored for
easy storage of the results in Thor/Merlin or in DayCart®.

Prototype compounds from each cluster can easily
be chosen and combined to make up a diverse representative set
from the whole collection.
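
A prototype can be chosen in several ways. The Python sketch below assumes one common choice, the medoid: the member with the highest total similarity to the other members of its cluster. The source does not state which rule the package uses, and the fingerprints, the Tanimoto measure, and all names are hypothetical.

```python
from typing import Dict, FrozenSet, List

Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity between two sets of 'on' bit positions."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def pick_prototype(members: List[str], fps: Dict[str, Fingerprint]) -> str:
    """Return the medoid: the member most similar, in total, to the others."""
    def total_similarity(candidate: str) -> float:
        return sum(tanimoto(fps[candidate], fps[other])
                   for other in members if other != candidate)
    return max(members, key=total_similarity)


# Toy data: two clusters from some earlier clustering run.
fps: Dict[str, Fingerprint] = {
    "x": frozenset({1, 2, 3}),
    "y": frozenset({1, 2, 3, 4}),
    "z": frozenset({1, 2, 3, 4, 5}),
    "w": frozenset({30, 31}),
}
clusters: List[List[str]] = [["x", "y", "z"], ["w"]]

# One prototype per cluster gives a small, diverse representative set.
representative_set = [pick_prototype(members, fps) for members in clusters]
print(representative_set)
```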

Unusual ("outlier") compounds that form small
or single-member clusters can easily be identified.
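
Given a table of compound-to-cluster assignments, finding such outliers is a matter of counting cluster sizes. The Python sketch below assumes the assignments are available as a simple mapping; the cluster names, compound names, and the max_size cutoff are hypothetical and not taken from the package's output format.

```python
from collections import Counter
from typing import Dict, List


def find_outliers(assignments: Dict[str, str], max_size: int = 1) -> List[str]:
    """Return compounds whose cluster has at most max_size members."""
    sizes = Counter(assignments.values())
    return sorted(name for name, cluster in assignments.items()
                  if sizes[cluster] <= max_size)


# Toy assignments: compound name -> cluster name.
assignments: Dict[str, str] = {
    "cpd_a": "cluster_001",
    "cpd_b": "cluster_001",
    "cpd_c": "cluster_001",
    "cpd_d": "cluster_002",
    "cpd_e": "cluster_002",
    "cpd_f": "cluster_003",
}

singletons = find_outliers(assignments)              # single-member clusters only
small = find_outliers(assignments, max_size=2)       # also include pairs
print(singletons, small)
```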

Because each compound is assigned to a uniquely named
class, sets of similar compounds can be rapidly retrieved using
standard database or spreadsheet tools.
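
As an illustration of retrieval with a standard database tool, the Python sketch below loads a few assignments into an in-memory SQLite table and fetches every compound in one class. The table layout, column names, and cluster names are assumptions made for the example, not the package's output format.

```python
import sqlite3
from typing import List

# Toy clustering results: (SMILES, cluster name). Hypothetical data.
rows = [
    ("CCO", "cluster_001"),
    ("CCCO", "cluster_001"),
    ("c1ccccc1", "cluster_002"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compounds (smiles TEXT, cluster_name TEXT)")
conn.executemany("INSERT INTO compounds VALUES (?, ?)", rows)
# An index on the class name keeps lookups fast as the table grows.
conn.execute("CREATE INDEX idx_cluster ON compounds (cluster_name)")


def compounds_in_cluster(connection: sqlite3.Connection, cluster_name: str) -> List[str]:
    """Return the SMILES of every compound assigned to the named cluster."""
    cursor = connection.execute(
        "SELECT smiles FROM compounds WHERE cluster_name = ? ORDER BY smiles",
        (cluster_name,),
    )
    return [row[0] for row in cursor]


similar = compounds_in_cluster(conn, "cluster_001")
print(similar)
```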