1. Intended Audience

This document should be read by everyone who intends to use the
Clustering Package programs. It will also be useful for people using
data generated by these programs.

This Guide is intended to provide sufficient background and information
about why, when, and how the programs in the Clustering Package are
designed to be used. Background information provided here is
introductory; readers are referred to the References
section of this document for more comprehensive information about
structural clustering. Program syntax is not described in detail
here; see the program manual pages for this information.

2. Introduction to the Daylight Clustering Package

The Daylight Clustering Package is a set of programs which provide
general-purpose non-parametric clustering of objects described by an
appropriate string of binary values. Commonly these will be Daylight
fingerprints (see the
Daylight Theory Manual);
however, any set of binary descriptors that fulfills the requirements of
Daylight fingerprints will do. A utility program,
filter_fingertalk(1),
is provided to convert users' descriptors into a format that can be
read by Daylight applications.

2.1. Clustering Package contents

The programs supplied in this Package are batch programs which run in all
Daylight-supported environments except Windows. All functions needed to
cluster chemical structures are provided, including data preparation,
clustering parameter selection, clustering, and result post-processing. The
Clustering Package can thus be used independently of any other Daylight
program; however, for large data sets the use of a supporting database
system such as Thor/Merlin or DayCart is recommended.

The Daylight Clustering Package consists of the following programs:

smi2tdt -- create file of Thor datatrees from SMILES

fingerprint -- characterize substructural content

nearneighbors -- find nearest neighbors by comparing
fingerprints

mergeneighbors -- combine nearest neighbors lists

jpscan -- tabulate J-P clustering results with varying
parameters

jarpat -- Jarvis-Patrick clustering

spherex -- sphere exclusion selection

kmodes -- k-modes clustering

showclusters -- process cluster data for textual display

listclusters -- sort and reformat clustering data for
further processing

sdcluster -- scaffold-directed clustering

2.2 Applicability

The Daylight Clustering Package provides a set of general methods
for clustering small organic compounds. However, since the aim of
clustering is to group together those objects that are more similar
to each other, according to the descriptor(s), than to the rest of
the set, it is imperative that the descriptors used vary
appropriately across the set.

Jarvis-Patrick and k-modes clustering are most effective on data sets
containing roughly 250 to 2,500,000 objects. Smaller data sets, in the
thousands of objects, are often better handled with a selection method
such as sphere exclusion (spherex) or with scaffold-directed clustering.
Please note that usage of the clustering programs on very large data sets
is limited by CPU time and memory space. The Clustering Package programs
do not impose any hard limits; clustering 10,000,000 structures is
possible, for example, but would require capable hardware and a
considerable length of time.

2.3 Uses of Structural Clustering

The default descriptor generator provided in the Cluster Package is
the fingerprint(1)
program. Daylight fingerprints are structure-based, representing to a
first approximation the substructures present in a molecule (see
below).

Structural clustering is used for a wide variety of purposes,
including to:

Automatically group compounds into structurally related
families. The result is a set of clusters. This is at least a
partial goal of almost all structural clustering efforts.

Determine the degree of clustering of structures in a large set
for the purpose of characterizing that set of structures. The
results are the clustering statistics.

Select a limited number of compounds to represent all
structural classes in the total set for purposes such as
screening for bioactivity. The result will be a set of
representative compounds.

Locate unique or unusual compounds in a set. The result is a
list of compounds that don't cluster.

Provide a very fast and compact form of similarity searching
(i.e. given one structure, find all members of its cluster). The
results are the cluster ID's associated with each structure.

Discover relationships between objects in a particular descriptor
space, as in data mining and related disciplines.

However, users may, if they wish, generate their own binary
fingerprints, or import them from another vendor's software. Indeed,
the objects represented need not be chemicals; for instance, one
could describe a set of screens by the activity/inactivity of a
standard set of compounds. The Cluster Package places requirements
only on the format of the fingerprint, not its content or meaning.

3. Methodology

The clustering methods used in the Daylight Clustering Package
perform the following (conceptual) procedures: structural
characterization, measurement of structural similarity, clustering,
and computation of cluster statistics. These steps are described in
the subsections below.

Algorithms for each of these steps are available in the graph-theory
and statistical literature. Unfortunately, much of the conventional
methodology lends itself poorly to the problem of chemical structure
clustering. Algorithms were developed or adapted from the literature
for each of these steps to achieve the general structural clustering
method used in the Clustering Package.

3.1 Structural Characterization

The goal of structural characterization is to represent molecular
connectivity as completely and compactly as possible in a way that
lends itself to an intuitive measure of similarity.

One popular method of structural characterization is to use a fixed
set of structural keys such as those used in substructure search
screens. This approach was rejected as a general method because it is
overly biased by key selection. A certain amount of bias can be
tolerated when the keys are used for substructure search, as long as the
user is willing to tweak the keys to match the problem, but such bias is
a serious disadvantage for exploratory applications of cluster analysis.
However, should users wish, binary keys of this type can be
converted to Daylight format using the
filter_fingertalk(1)
program.

Another approach is to use parametric measures of molecular
connectivity (e.g. the chi-index or information content). Such
metrics are poor candidates for cluster analysis primarily because
so little information is retained. Also, parametric measures are
poorly suited to the problem: comparison of chemical structures is
fundamentally a non-parametric operation. Currently Daylight does not
provide conversion tools for continuous data.

The default structural characterization used in the Clustering
Package is based on a binary structural fingerprint, derived as
follows.

Starting with each atom, traverse all paths, branches, and
ring closures up to a certain depth (typically 8). For each
substructure, derive a hash-like number from unique,
relatively-prime, order-dependent contributions of each atom and
bond type. Critical properties of this number are that it is
reproducible (each substructure produces a single number) and that
its value is not correlated with the graph it represents (a linear
congruential generator is used to ensure this).

Map each resulting number into a large range (typically 2K-64K)
to produce a redundant, large-scale, binary representation of
the substructural elements. The resultant "fingerprint"
contains a large amount of information at a low density.

Iteratively "fold" the fingerprint by OR-ing the
fingerprint in half until the bit-density reaches a minimum
required value or until the fingerprint reaches a minimum
allowable length. The resulting fingerprint now has a high
information density with a minimal (and controllable)
information loss.
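
To make the folding step concrete, here is a minimal Python sketch of
the idea (an illustration only, not Daylight's implementation). The
fingerprint is modeled as a set of on-bit positions; the sizes used
(2048 initial bits, minimum length 64, minimum density 0.3) are the
defaults mentioned in section 4.3.

    # Illustrative sketch of fingerprint folding (not Daylight's implementation).
    # A fingerprint is modeled as a Python set of "on" bit positions.

    def fold_fingerprint(on_bits, size, min_density=0.3, min_size=64):
        """Fold by OR-ing the two halves together until the bit density
        reaches min_density or the length drops to min_size."""
        bits = set(on_bits)
        while size > min_size and len(bits) / size < min_density:
            half = size // 2
            # OR the upper half onto the lower half.
            bits = {b if b < half else b - half for b in bits}
            size = half
        return bits, size

    # Example: a sparse 2048-bit fingerprint with 40 bits set folds down
    # until it is dense enough or reaches the 64-bit minimum length.
    import random
    random.seed(0)
    folded_bits, folded_size = fold_fingerprint(random.sample(range(2048), 40), 2048)
    print(folded_size, len(folded_bits) / folded_size)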

Fingerprints thus generated approximate a complete characterization
of substructural content. The quality of the approximation is
determined by their information content and density, both of which
are easily controlled. Furthermore, the information content is not
biased by the generation method: fingerprints work equally well on
reagents, drugs, dyes, and insecticides without any "tweaking". See
the Fingerprint section of the
Theory Manual
for further discussion.

3.2 Structural similarity

Structural similarity is the basis of the clustering used in the
Cluster Package, i.e., "similar" things should cluster
together. An appropriate measure of structural similarity is thus
critical to this clustering method.

In recent years the Sheffield group has shown the value of binary
similarity coefficients other than the traditional Tanimoto
coefficient (see Holliday (2002)). These are all functions of the
counts a, b, c, and d described below. All the usual mathematical
expressions are available and are described fully in
expression(5).

Starting with version 4.9, the clustering package includes the ability to
specify arbitrary similarity measures for the nearest neighbors list
generation, spherex and kmodes clustering, and statistics computations.
Similarities can be expressed in terms of the bits on in common, c, the
bits off in common, d, and the bits on uniquely in either of the two
fingerprints being compared, a or b.
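
As an illustration of these counts, the following minimal Python sketch
computes a, b, c, and d for two fingerprints (modeled as sets of on-bit
positions) and evaluates the widely used Tanimoto coefficient,
c/(a+b+c). This is an illustration only; the actual expression syntax
accepted by the programs is defined in expression(5).

    # Illustrative sketch (not the expression(5) syntax): compute the counts
    # a, b, c, d for two fingerprints of the same length, plus a Tanimoto value.

    def bit_counts(fp1, fp2, size):
        """fp1 and fp2 are sets of on-bit positions in fingerprints of 'size' bits."""
        c = len(fp1 & fp2)          # bits on in common
        a = len(fp1 - fp2)          # bits on only in fp1
        b = len(fp2 - fp1)          # bits on only in fp2
        d = size - (a + b + c)      # bits off in common
        return a, b, c, d

    def tanimoto(fp1, fp2, size):
        a, b, c, _ = bit_counts(fp1, fp2, size)
        return c / (a + b + c) if (a + b + c) else 1.0

    # Example with two small, hypothetical 16-bit fingerprints:
    print(tanimoto({1, 3, 5, 8}, {1, 3, 9}, 16))   # 2 bits in common of 5 -> 0.4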

3.3 Clustering

3.3.1 Jarvis-Patrick

A very large number of clustering methods have been applied to the
chemical clustering problem. The test of a clustering algorithm's
effectiveness is whether it automatically classifies items of
interest into groups that make sense. Although this is a subjective
standard, it's clear that most of the clustering methods that have
been used fail this test in the general case.

Chemical structure clustering places extremely tough demands on a
clustering algorithm. For the sake of illustration, a few of these
special requirements are described below.

Many clustering algorithms are biased towards
finding globular clusters. Such algorithms are not suitable for
chemical clustering, where long "stringy" clusters are
the rule, not the exception.

To be effective for clustering chemical
structures, a clustering algorithm must be self-scaling, since
it is expected to find both straggly, diverse clusters (e.g.
penicillins) and tight ones (e.g. PCB's)

A chemical clustering method must be
adjustable in some way that allows control of the tightness
required to cluster (e.g., do the estradiols form a separate
cluster within the steroid cluster or not?)

Willett published several thorough analyses and comparisons of
various clustering algorithms to which the user is referred for more
information (see References). One of his conclusions was that the
Jarvis-Patrick method performed best, but that it was
computationally prohibitive for large data sets.

The method, as published (Jarvis (1973)), works like this:

For each item, find its J nearest neighbors. This requires
O(N^2) CPU time, but needs to be done only once.

Two structures cluster together if:
(a) They are in each other's list of J nearest neighbors, and
(b) At least K of their J nearest neighbors are in common.
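
The following minimal Python sketch illustrates the published method
(it is not the jarpat implementation). Given precomputed nearest-neighbor
lists, two items are joined when they appear in each other's lists and
share at least K neighbors; clusters are the transitive closure of that
relation, and items that never satisfy it come out as singletons.

    # Illustrative sketch of published Jarvis-Patrick clustering (not jarpat).
    # 'neighbors' maps each item id to the list of its J nearest-neighbor ids
    # (assumed here not to include the item itself).

    def jarvis_patrick(neighbors, k):
        parent = {i: i for i in neighbors}

        def find(x):                       # union-find with path compression
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        def union(x, y):
            parent[find(x)] = find(y)

        for i in neighbors:
            for j in neighbors[i]:
                if j not in neighbors or j <= i:
                    continue
                mutual = i in neighbors[j]                     # requirement (a)
                shared = len(set(neighbors[i]) & set(neighbors[j]))
                if mutual and shared >= k:                     # requirement (b)
                    union(i, j)

        clusters = {}
        for i in neighbors:
            clusters.setdefault(find(i), []).append(i)
        return list(clusters.values())

    # Tiny example: J = 2 neighbor lists for five items, clustered with K = 1.
    nn = {1: [2, 3], 2: [1, 3], 3: [1, 2], 4: [5, 1], 5: [4, 1]}
    print(jarvis_patrick(nn, k=1))         # [[1, 2, 3], [4, 5]]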

This method is implemented in the Clustering Package as the programs
nearneighbors and jarpat. Removing clustering
requirement (a) usually results in improved clustering due to a more
exhaustive search but at a high cost in speed. Partially relaxing
this requirement approximates the more exhaustive search and runs
even faster than the published method. jarpat provides all
three methods.

The Jarvis-Patrick algorithm appears to be an ideal method for
clustering chemical structures:

The same results are produced regardless of input order

It's a non-parametric method

Cluster resolution can be adjusted (J,K) to
match a particular need

Auto scaling is built into the method

It will find tight clusters embedded in loose ones

It is not biased towards globular clusters

The clustering step is very fast

Overhead requirements are relatively low

The main disadvantage of Jarvis-Patrick clustering is that it requires a
list of nearest neighbors for each item, which is computationally
expensive to generate; this cost is effectively addressed by the fast
search methods provided by nearneighbors.

3.3.2 Singletons in Jarvis-Patrick

Singletons can be thought of either as structures which are not
assigned to a cluster or as clusters with one member. Using
reasonable rules, most sets of structures contain singletons, often
10-30% of the total. A common misconception is that singletons
somehow represent a failure of the clustering process. In fact, one
of the most important roles of clustering is to identify unique or
unusual structures, which almost invariably appear as singletons.

For some applications, however, a large percentage of singletons is
a disadvantage. For instance, consider the situation where 1,000
structures are needed to represent a database of 10,000 structures
(perhaps there are resources for only 1,000 screening tests). For
the purposes of this example, assume that a "natural"
clustering resulted in 9,000 structures being assigned to 500
clusters leaving 1,000 singletons. Selecting 500 centroids, we now
have 1,500 structures which represent the 10,000 structures in the
database, which is too many. What shall we do? There are four ways
of approaching this problem, roughly in decreasing order of
desirability:

Live with the list of 1,500 produced by clustering. This sounds
flippant, but is often the best solution. Attempting to force
singletons into unnatural clusters will inevitably merge natural
clusters, which reduces the value of the cluster representatives.
Consider selecting 1,000 of the 1,500 based on non-structural
properties, e.g. availability, price, etc.

Re-cluster the singletons using more lenient clustering
parameters. The jarpat option -SINGLETON_FILE
causes singleton structures to be written to a separate file for
this purpose. Advantages of this approach are that the original
"natural" clusters are unaffected, it can be pretty
effective if the singleton count needs to be reduced by about half (or less),
and if you later decide to do the remaining 500, you haven't messed up the
sampling. Disadvantages are that singletons will be under-sampled and you
will need to rerun nearneighbors on the singletons.

"Rescue" singletons using the jarpat option
-RESCUE_SIZE <L> . Singleton rescue works as
follows: clustering is done using J and K
parameters as per usual; the neighbors of each singleton are
then examined to see which cluster they belong to, and the
singleton is added to the cluster containing the plurality of
its nearest neighbors if at least L of J.
Advantages of this approach are that it's fast, automatic,
adjustable, it's in the spirit of the Jarvis-Patrick method, and
it seems intuitively correct to many people (i.e. "rescue
to nearest cluster").

Redo the clustering, relaxing clustering parameters such that
fewer than 1,000 clusters result, including singletons. This
approach does produce the best single classification of
acceptable size, but be aware that many "natural"
clusters will have been merged.
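
The following minimal Python sketch illustrates the rescue idea
described in the third option above (an illustration only, not the
jarpat -RESCUE_SIZE implementation).

    # Illustrative sketch of singleton rescue (not jarpat's implementation).
    # 'assignment' maps item -> cluster id, or None for singletons;
    # 'neighbors' maps item -> list of its J nearest-neighbor ids.
    from collections import Counter

    def rescue_singletons(assignment, neighbors, rescue_size):
        rescued = dict(assignment)
        for item, cluster in assignment.items():
            if cluster is not None:
                continue                                    # already clustered
            tally = Counter(assignment[n] for n in neighbors[item]
                            if assignment.get(n) is not None)
            if not tally:
                continue                                    # no clustered neighbors
            best_cluster, count = tally.most_common(1)[0]
            if count >= rescue_size:                        # at least L of the J neighbors
                rescued[item] = best_cluster
        return rescued

In this sketch, rescue decisions are based on the original clustering
only, so rescued singletons do not themselves pull further singletons
into clusters.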

3.3.3 K-modes

The k-modes clustering algorithm is a variation on the commonly-used
k-means non-hierarchical method that is suitable for non-parametric
data (Huang (1998), Chaturvedi (2001)). The method works as follows:

The user selects a number of clusters (K) and provides a set of
K initial modals to act as starting seeds.

An initial classification pass occurs:

Each item in the input dataset is assigned to the cluster such that
the distance from that item to the modal for that cluster is minimized.

As each item is added the modal for the cluster is recomputed to
include the new item.

Zero or more relocation passes occur:

Each item in the input dataset is checked against the neighboring
modals. If it is closer to another cluster modal, it is moved from
the current cluster to the new cluster.

The modals for both clusters are recomputed.

Typically relocation passes occur until no items relocate over an entire
pass. When complete, each of the K clusters is output with its member
items and the final modal fingerprint for that cluster.
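
The following Python sketch illustrates the flow of the method. It is an
illustration only, not the kmodes program; in particular it assumes that
the modal is the bitwise majority of the cluster members and that the
distance is the Hamming distance between fixed-width fingerprints
(modeled here as bit tuples).

    # Illustrative k-modes sketch (assumptions: the modal is the bitwise majority
    # of the cluster members; distance is the Hamming distance). Not the kmodes program.

    def hamming(fp1, fp2):
        return sum(b1 != b2 for b1, b2 in zip(fp1, fp2))

    def majority_modal(members):
        n = len(members)
        return tuple(int(sum(col) * 2 > n) for col in zip(*members))

    def k_modes(items, seeds, max_passes=10):
        clusters = [[] for _ in seeds]           # member fingerprints per cluster
        modals = [tuple(s) for s in seeds]

        # Initial classification pass: assign each item to the nearest modal,
        # recomputing that modal as the item is added.
        for fp in items:
            best = min(range(len(modals)), key=lambda m: hamming(fp, modals[m]))
            clusters[best].append(fp)
            modals[best] = majority_modal(clusters[best])

        # Relocation passes: move items to a closer modal until nothing moves.
        for _ in range(max_passes):
            moved = 0
            for k, members in enumerate(clusters):
                for fp in list(members):
                    best = min(range(len(modals)), key=lambda m: hamming(fp, modals[m]))
                    if best != k:
                        members.remove(fp)
                        clusters[best].append(fp)
                        if members:
                            modals[k] = majority_modal(members)
                        modals[best] = majority_modal(clusters[best])
                        moved += 1
            if moved == 0:
                break
        return clusters, modals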

k-modes is much less resource intensive than Jarvis-Patrick; the
order of the algorithm is O(K*N), which is typically much less than
the O(N^2) of Jarvis-Patrick. K-modes is a good choice for extremely
large datasets and in situations where one wants to select a specific
number of sample items from a dataset (K). Also, k-modes is a
flexible method for simple partitioning and classification operations
to select weighted subsets of a dataset rapidly. Note that, unlike
Jarvis-Patrick, K-modes does depend on the input order of the dataset.

3.3.4 Sphere Exclusion

Select an item from the set of items not yet selected or excluded.

Exclude from further consideration all items which are within a
given similarity/distance threshold of the selected item.

Repeat until all items have been selected or excluded.

Each item selected and any items excluded based on that selection
are output as a cluster for further processing. The selected item is
marked in the cluster.

Several variations of the method used for the item selection are
available (Taylor (1995), Butina (1999)). In the classic algorithm,
items are selected in the first step at random. More recently, a
number of directed selections have been developed (Lee (2002)) which
attempt to systematically cover the space of the dataset. The impact
of the directed selection is twofold. First, the algorithm is more
efficient due to locality of reference in the computation. Second,
the space is more evenly covered than with the random method.
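
The following minimal Python sketch illustrates the classic variant with
random selection (an illustration only, not the spherex program); the
use of the Tanimoto coefficient as the similarity measure is an
assumption for the example.

    # Illustrative sketch of classic sphere exclusion with random selection
    # (not the spherex program). 'fingerprints' maps id -> set of on-bit positions.
    import random

    def sphere_exclusion(fingerprints, threshold, seed=0):
        """Returns a list of clusters; the first member of each is the selected item."""
        def tanimoto(f1, f2):
            union = len(f1 | f2)
            return len(f1 & f2) / union if union else 1.0

        rng = random.Random(seed)
        remaining = set(fingerprints)
        clusters = []
        while remaining:
            picked = rng.choice(sorted(remaining))      # classic: random selection
            members = [picked]
            for other in sorted(remaining - {picked}):
                if tanimoto(fingerprints[picked], fingerprints[other]) >= threshold:
                    members.append(other)               # excluded by this selection
            remaining -= set(members)
            clusters.append(members)
        return clusters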

3.3.5 Scaffold-Directed Clustering

Scaffold-directed clustering is distinct from the other three clustering
methods discussed above. The algorithm does not use externally supplied
fingerprints but rather specifically generates a set of fingerprints
for internal use only. These fingerprints are more complex and represent
longer paths by default than typical Daylight fingerprints. In addition, unlike the
traditional Daylight fingerprints, they are non-hashed and easily interpreted.
The agglomerative algorithm uses the internal fingerprints and a single
coverage parameter to control the clustering process and thereby generate
clusters which can be characterized by relatively large scaffolds shared
by all members of a cluster. The final scaffolds are represented by SMARTS
that may contain more than one fragment.

3.4 Cluster Statistics

Jarvis-Patrick, k-modes, and spherex clustering algorithms directly produce cluster
assignments. Useful clustering statistics are number of clusters,
number and percentage of structures clustered, and the frequency
distribution of clusters by cluster size. Such statistics are useful
for selecting clustering parameters and characterizing the set of
input structures.

Since Jarvis-Patrick, k-modes, and spherex are not progressive
methods, the dendrogram type of output is not relevant.

For purposes such as selecting representative structures, an
intra-cluster statistic is needed to characterize the relationship
of a single member to its cluster. The most frequently used
statistics of this type in parametric clustering are distance from
the arithmetic mean and standard error (distance/standard
deviation). These statistics are not meaningful when working with
non-parametric data: since it makes no sense to average bitwise
data, the notion of a "mean" is meaningless. For the same
reason, the variance of a cluster is not available (variance is the
mean of the squares of distances from the mean). A quantity
equivalent to the variance is available in the case of k-modes where
the modal acts as the "center" of the cluster.

Although we cannot determine the amount of variance unexplained by
the mean for all the non-parametric methods, we can
determine the amount of variance of the cluster that is unexplained
by each member. This turns out to be a very useful statistic; if
n is the number of members of a cluster and Cij is
the similarity coefficient between members i and j,
the statistic for member i is a function of its similarity
coefficients Cij to every member j of the cluster (see the
illustrative sketch below).
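
As an illustration only, the following Python sketch computes one
plausible form of the statistic for member i, the mean of the squared
complements (1 - Cij) taken over all members j of the cluster. This
particular form is an assumption consistent with the description above,
not necessarily the exact expression used by listclusters and
showclusters.

    # Illustrative sketch only: one plausible per-member "unexplained variance"
    # statistic, the mean squared (1 - Cij) over all members j of the cluster.
    # The exact formula used by the programs may differ.

    def unexplained_variance(member, cluster, similarity):
        """cluster is a list of members; similarity(i, j) returns the coefficient Cij."""
        n = len(cluster)
        return sum((1.0 - similarity(member, j)) ** 2 for j in cluster) / n

    def pick_centroid(cluster, similarity):
        # The member with the lowest statistic explains the most cluster variance.
        return min(cluster, key=lambda i: unexplained_variance(i, cluster, similarity))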

The cluster member with the lowest such statistic is the one which
explains the most variance of the cluster. In geometric terms, this
would be the member nearest the centroid, and should be a
representative structure of the group. (This member is, perhaps
loosely, called the cluster "centroid" in some of the program
output.) Note that a "parent" structure is unlikely to be
a cluster centroid. Parent structures are, by definition,
the least substituted structures in a group and as such are extreme
members which explain very little (or none) of the within-cluster
variance. This statistic is computed by the post-processing programs
listclusters and showclusters.

Like the other three clustering methods, Scaffold-directed clustering
directly produces cluster assignments. Useful clustering statistics that are
generated include the number of clusters and number of singletons as well as
the number of compounds within a cluster and the minimum scaffold coverage.
In addition, the SMARTS representation of the scaffold for each cluster is provided
along with the cluster members.

4. Cluster Package Usage

The following sections provide general usage instructions for
programs in the Cluster Package. See the man pages for the
individual programs for authoritative program syntax reference.

4.1 Relationships of Cluster Package programs

The programs in the Cluster Package are designed to work together both
as a stand-alone package and in conjunction with other Daylight programs.
The following figure illustrates data flow between these programs.

Start with either a database or with structures in SMILES. Use
either sthorman or smi2tdt to generate a file of
thor datatrees. The fingerprint program is used for
structural characterization.

For Jarvis-Patrick clustering, the nearneighbors program is
used to generate the lists of nearest neighbors for a datafile. The
mergeneighbors program can be used to process and merge
intermediate nearest neighbors lists. jpscan is used to
determine clustering parameters, and jarpat is used to
generate the cluster results.

For k-modes or Sphere Exclusion clustering, the programs kmodes
and spherex, respectively, are used. Finally, the results of
any of the three clustering methods are post-processed with listclusters
or showclusters.

Information flow between these programs is done via Thor datatrees.
This format is rigorously defined, suitable for use in many
environments, allows individual programs to be used separately, and
is well-suited for input to most post-processing tasks. Datatypes
used include SMILES ($SMI), fingerprint (FP), near neighbors (NN),
cluster (CL), and datatypes which record computed property
generation ($FPG, $NNG, $CLG). Alternatively, input and output can
be in SMILES format which provides a simpler format for stand-alone
applications with minimal data handling needs.

4.2 Preparing Data for Clustering (smi2tdt)

The starting point for structural cluster analysis is a list of
structures to be clustered. The simplest usable format that this
list can take is a SMILES file, i.e. a sequential file containing
one SMILES per line. Since the Clustering Package programs require
Thor datatrees as input, the program smi2tdt is provided to
convert SMILES format files to Thor datatree format. After
clustering is completed, the program listclusters provides a
means of returning the data to SMILES format if desired.

4.3 Structural Characterization (fingerprint)

The fingerprint program reads a Thor datatree file containing
SMILES ($SMI) rooted datatrees and adds a fingerprint (FP)
dataitem to each such datatree. Fixed-size or folded fingerprints
can be generated by specifying program options which include the
initial size (default is 2048), the minimum size (default is 64),
and minimum density (default is 0.3).

This is the same program that is used to generate fingerprints for
Merlin. The default option values are set up for use in the sort of
tasks that Merlin does, e.g., substructure screening and similarity
searching. Fingerprints generated with the default option values
contain a large amount of information at a moderate-to-high density
and are suitable (if not optimal) for clustering. If available,
existing fingerprints can be extracted from a database (via sthorman)
rather than regenerating them with fingerprint.

Although the default fingerprints should be suitable for most
databases of small organic molecules, users are encouraged to try
various fingerprint parameterizations on their own data sets.
Experiments indicate that the nearneighbors program works
acceptably with much higher than normal fingerprint densities (e.g.
generated with minimum density 0.4 or even 0.5), particularly with very
large and/or highly-clustered databases. Using higher density fingerprints
reduces the disk, memory, and time requirements of nearneighbors.
K-modes requires the use of fixed-width fingerprints in order to
calculate the modal. Experiments indicate that fixed widths of 1024 or
2048 bits are fine for "drug-like" molecules. Better
discrimination is found if the path length range for the fingerprint
is increased to 0/8 from the
default 0/7.

There is normally no need to store "specialty"
fingerprints in a Thor database if they are used only to generate
cluster data. Putting them in a database doesn't make them more
useful for clustering and there is a potential for confusion. In any
event, it is strongly recommended that a recognizable run ID
be assigned to fingerprints generated with non-default parameters.
For instance, using the run ID "D42" for fingerprints
generated with minimum density 0.42 makes it less likely that they
will be confused with standard fingerprints.

Note that all subsequent steps are performed on the fingerprints;
the SMILES merely acts as a name. Users who have fingerprints
generated in other ways, say a modal fingerprint from a combinatorial
library, can take advantage of this to cluster compound libraries.

4.4 Clustering

4.4.1 Jarvis-Patrick (nearneighbors, mergeneighbors, jpscan, jarpat)

The Jarvis-Patrick clustering method is implemented in the
Clustering Package as the programs nearneighbors and jarpat.
The near-neighbor search is the slow step and is typically done only
once. Clustering with jarpat is relatively fast but requires
that appropriate clustering parameters are supplied. The program
jpscan is provided to assist in selection of clustering
parameters.

nearneighbors reads a Thor datatree file containing
fingerprint data, copying its input to output, adding a "Nearest
Neighbors" (NN) dataitem after each selected fingerprint.
This program uses a bunch of computational optimizations to beat
O(N^2) for most chemical data sets, but it is still
CPU-intensive. The optimizations work best on highly clustered data;
when running on very large data sets (e.g. 100,000+ structures),
allow for the possibility of a very long running time (e.g. days). It
is fortunate that this step only has to be done once. Several
parameters are available, the most important being the length of the
nearest neighbor lists to be generated (default value is 16) and the
maximum list size including ties (default value is 24). Smaller values
save time and space but limit the range of clustering parameters
that can be used with the results. Create lists which are as long as
the largest set of neighbors to be examined, but no larger. The
default value is adequate for most applications; it's rarely useful
to increase it, but it can be profitably decreased when long lists
aren't useful (very small data sets) or are too expensive (very
large data sets).
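
To make the content of the neighbor lists concrete, here is a
deliberately naive O(N^2) Python sketch of nearest-neighbor list
generation using Tanimoto similarity (an illustration only;
nearneighbors itself uses optimizations and an output format not
shown here).

    # Deliberately naive O(N^2) sketch of nearest-neighbor list generation
    # (nearneighbors uses computational optimizations not shown here).

    def tanimoto(f1, f2):
        union = len(f1 | f2)
        return len(f1 & f2) / union if union else 1.0

    def neighbor_lists(fingerprints, list_length=16):
        """fingerprints: dict of id -> set of on-bit positions.
        Returns id -> list of (neighbor id, similarity), best first."""
        lists = {}
        for i, fp_i in fingerprints.items():
            scored = sorted(((tanimoto(fp_i, fp_j), j)
                             for j, fp_j in fingerprints.items() if j != i),
                            reverse=True)
            lists[i] = [(j, s) for s, j in scored[:list_length]]
        return lists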

nearneighbors can take advantage of multiple CPUs on some
multiprocessing machines. This option (-NUM_PROCESSES) controls the
number of child processes which get spawned on these machines. Using
multiple processes decreases the overall processing time linearly
with the number of CPUs used. mergeneighbors allows nearneighbors
lists generated on the same input fingerprint files to be merged.
This is extremely useful for processing of large databases. nearneighbors
can be stopped and restarted at will and the intermediate lists
can be easily merged.

jpscan and jarpat both perform Jarvis-Patrick
clustering based on nearest neighbors (NN) data. Both programs use
two Jarvis-Patrick clustering parameters: the number of neighbors to
examine and the number required to be in common. jpscan repeatedly
clusters data using all possible parameter combinations up to a
given limit (typically set to the list length, default is 16) and
outputs tables of statistics intended to help in selecting a pair of
parameters appropriate to the problem at hand. jarpat requires
that the parameters be specified and outputs the clustering results.
It is advisable to run jpscan and examine its output before
running jarpat. Both programs also allow control of the way
the clustering search is done (described in section 3.3.1): as
published (the default), an exhaustive search (only useful for very
small data sets), and a faster search which approximates the
exhaustive search (recommended). In addition, both programs allow
the user to control how ties in similarity are handled during the
clustering.

jarpat provides two (nonexclusive) methods for dealing with
singletons: rescuing singletons and writing them out to a separate
file. If singleton rescue is used (option -RESCUE_SIZE), rescued
singletons will appear in clusters to which they are rescued. If a
singleton file is generated (option -SINGLETON_FILE), it may be fed
back to nearneighbors and then re-clustered.

jarpat provides an additional processing option which is not
part of the original Jarvis-Patrick algorithm. This option
(-NN_BEST_THRESHOLD) allows the preprocessing of the neighbors lists
as follows: the best neighbor (excluding itself) for each structure
is compared with the threshold value. If the best neighbor has a
similarity lower than the specified threshold, then the structure is
marked as a singleton and is excluded from the clustering. This is a
useful way to discover very tight clusters within a dataset.
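
A minimal Python sketch of this pre-filter (an illustration only, not
the jarpat implementation), assuming neighbor lists of (id, similarity)
pairs sorted best-first:

    # Illustrative sketch of the best-neighbor threshold pre-filter
    # (not the jarpat -NN_BEST_THRESHOLD implementation).

    def split_by_best_threshold(neighbor_lists, threshold):
        """neighbor_lists: id -> list of (neighbor id, similarity), best first.
        Items whose best neighbor falls below the threshold become singletons."""
        keep, singletons = {}, []
        for item, neighbors in neighbor_lists.items():
            if neighbors and neighbors[0][1] >= threshold:
                keep[item] = neighbors
            else:
                singletons.append(item)
        return keep, singletons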

Note: nearneighbors, jpscan, and jarpat
read the dataset into a fixed-size memory area (default provides
room for 10,000 items) and will reallocate memory as needed to make
room for more as they are read in. This memory reallocation can
degrade performance on some systems. These programs provide an
option to specify the number of structures to be expected - it is
recommended that you use it.

4.4.2 K-modes (kmodes)

kmodes reads a Thor datatree file containing fingerprint
data, and adds a cluster datatype (CL<>) with the results of
the computation. In addition, it will write the modal fingerprints
for each of the K clusters into the first datatree for that
cluster written to output. The modal fingerprints are written with
the FPM<> datatype.

Within a single run, kmodes operates on a single size of
fingerprints and modals. Hence, all fingerprints and modal seeds
must be of the same size on input. The program detects the size from
the input dataset.

The option "-MODES <val>" controls the initial
number of clusters to select (K). If during the processing a
cluster falls below the value of the "-MIN_MODESIZE"
option, the cluster is eliminated. Thus, the output will contain
at most K clusters.

The passes are controlled with the "-MAX_PASSES" and
the "-MIN_RELOC" options. The default behavior is for
relocation passes to occur until no items move during a pass. This
behavior can be altered by limiting the number of relocation passes
with MAX_PASSES, or by stopping once a sufficiently small number of items
relocate during a pass (expressed as a percentage of the entire dataset,
using the MIN_RELOC option). If MAX_PASSES is zero, the program performs a
single partitioning, writes its output, and quits.

The "-MOVE" option controls the recomputation of the
modals during the initial partitioning step. The default of TRUE
means that, for each item added to the cluster, the modal is
recomputed. When FALSE, the modals are not recomputed as members are
added to the cluster. This option only applies to the initial
partitioning pass.

The initial modal seeds can be chosen from three sources. The
default is to take the first K items in the dataset as the initial
modals. By setting the "-RANDOM" option to TRUE, the
initial modals will be selected as random members of the input
dataset. Finally, the seeds can be selected from an alternate
datafile with the "-SEED_FILE" option. In this case the
first K fingerprints in the file will be selected as seeds.

Note: kmodes reads the dataset into a fixed-size
memory area (default provides room for 10,000 items) and will
reallocate memory as needed to make room for more as they are read
in. This memory reallocation can degrade performance on some
systems. These programs provide an option to specify the number of
structures to be expected - it is recommended that you use
it.

4.4.3 Sphere Exclusion (spherex)

spherex reads a Thor datatree file containing fingerprint
data, and adds a cluster datatype (CL<>) with the results of
the computation. It also will, for each cluster, mark the selected
item (as opposed to the excluded items) with a CLT<> dataitem.

Within a single run, spherex operates on a single size of
fingerprints and modals. Hence, all fingerprints and modal seeds
must be of the same size on input. The program detects the size from
the input dataset.

The program selects compounds from the dataset in two general modes.
The default is to select compounds in order of ascending number of
bits set. The second mode, controlled with the "-RANDOM"
option, selects compounds from the dataset at random.

For each item selected, the neighbors around it are excluded based
on one of two criteria. The default criterion uses a similarity/distance
threshold defined with the "-THRESHOLD", "-EXPRESSION",
and "-COMPARISON" options. Please note that the "-COMPARISON"
option is only used in conjunction with the "-EXPRESSION"
option. The second exclusion criterion uses a fixed number of items
based on their ranked similarity/distance, controlled with the
"-SPHEREX_RANK" option.

Note: spherex reads the dataset into a fixed-size
memory area (default provides room for 10,000 items) and will
reallocate memory as needed to make room for more as they are read
in. This memory reallocation can degrade performance on some
systems. These programs provide an option to specify the number of
structures to be expected - it is recommended that you use
it.

4.4.4 Scaffold-Directed (sdcluster)

Unlike the other three clustering methods, scaffold-directed clustering
is implemented as a stand-alone program using an internally generated
set of fingerprints. sdcluster does not accept input TDT files; however,
it does generate output TDTs which are compatible with listclusters and
showclusters. The input file must be a SMILES file that may include
space-separated names.

4.5 Post-processing and Analysis (showclusters, listclusters)

showclusters and listclusters read cluster (CL) and
fingerprint (FP) dataitems in a Thor datatree (e.g. those written by
jarpat). showclusters produces summaries and tables
suitable for textual display or printing. listclusters reformats
cluster data in a way suitable for processing by other programs.
Both programs are able to sort structures by cluster and compute the
intra-cluster statistics described in section 3.4.

Cluster results to be passed on to any other program should be
processed by listclusters first. Aside from computing
intra-cluster statistics and removing temporary data items, listclusters
sorts and renumbers the clusters in a more useful, less
arbitrary manner than is done by jarpat. By default, listclusters
writes its output in Thor datatree format, but SMILES-formatted
output can also be generated.

Although showclusters does exactly the same sorts and
statistical computations as listclusters, it offers a number
of summary displays and output formatting options specific to
textual presentation. showclusters' output uses only
printable ASCII (and newline) and is suitable for use in virtually
any environment.

To load cluster results into a Thor database and explore them with
Merlin, proceed as follows:

Add the CL datatype to your datatypes database (or use one
supplied with any Thor database). Be sure the POOL (_P)
datafield in the datatype is set to "*;*;;*".

Load the cluster data into a Thor database using sthorman.

Open the database with xvmerlin. Create column(s) for
cluster data and its fields as desired. All normal operations
will now be available (select, sort, search, depict, save, etc.)

Merlin's function as an exploratory data analysis tool is well
suited for many clustering applications. For instance, to select
structures for testing, one might: create columns for "Availability",
"Cost", and "Cluster ID"; remove all rows which
have unacceptable "Availability"; sort by "Cost";
sort by "Cluster ID" (which leaves items in the same cluster
sorted by cost); remove repeats in the "Cluster ID" column
(leaving only the lowest-cost, available item in each cluster); and
display the results.

Aside from its data analysis functions, Merlin can serve as a
gateway to other programs. The "Save as text..." item in
the "File" menu will write selected structures and columns
to a file in tab-delimited form which is suitable for input to
DayCart as well as programs such as Excel, SAS, Word, and others.

6. Maintaining Clustered Databases

The conventional method for updating nearest neighbors lists with
new structures is to form a combined list, rerun nearneighbors,
and re-cluster. This approach is acceptable for moderate to large
databases, but it's a real limitation for very large ones.

Beginning with Release 4.34, the nearneighbors program
provides the ability to update nearest neighbors lists with new
structures much more quickly than starting over with all structures.
To do this, use the -UPDATE_FILE (update) option to specify
a file which contains previous nearneighbors output. The
main advantage of this approach is speed: the full N-squared
comparisons don't need to be done. Disadvantages are that you need
to save the intermediate nearneighbors output (for large
databases this is typically a very large file) and you must keep
track of what is new since the previous run. Also understand that an
"update" run uses just as much memory as a "from
scratch" run and that all the nearest neighbors lists
may be changed. These disadvantages do not weigh heavily when you
are working with really large databases, where a full nearneighbors
run might take days.

The -UPDATE_FILE option to nearneighbors subsumes
the obsolete program newneighbors which was part of the
original Clustering Package.

Appendix 1. Examples

Starting from a small (less than 10000 structures) .smi file
containing SMILES and CAS Numbers, perform a cluster analysis and
display results (Note: both old- and new-style options work):

Prepare a .tdt file to be used to load the above results in a Thor
database.

$ listclusters -x cl.tdt > clusterdata.tdt

Appendix 2. References

D. Butina (1999) "Unsupervised Data Base Clustering Based
on Daylight's Fingerprint and Tanimoto Similarity: A Fast and
Automated Way To Cluster Small and Large Data Sets" Journal
of Chemical Information and Computer Sciences, 39: 747-750

Appendix 3. Clustering Package Release Notes

All programs in the Clustering Package were revamped for Release
4.34. An additional display, "Size of largest cluster",
was added to the output of jpscan. The -u (update)
feature was added to program nearneighbors. Corrected a bug
in programs showclusters and listclusters which
resulted in incorrectly-computed "unexplained variance"
statistics for some clusters based on variable-length fingerprints.
Fixed bug in jpscan resulting in a floating point exception
when computing average cluster size of zero (only happened with
pathological distributions). Changed jarpat program so it
explicitly exits with return value 0 on success so the return value
can be checked in shell scripts.

Beginning with version 4.41, nearneighbors supports
multiprocessing on Sun Solaris and Irix. The speed of nearneighbors
has been increased by approximately 4-fold because of improvements
in the fingerprint toolkit. The utility mergeneighbors has
been added. jpscan now provides statistics about rescued
singletons using the -RESCUE_SIZE option, and both jarpat
and jpscan support the `best threshold' option. Finally,
nearneighbors, mergeneighbors, jarpat, and
jpscan use the new-style options (e.g. -NEIGHBORS versus
-n).

In version 4.91, the kmodes and spherex methods were added.

Also in version 4.91, the options -EXPRESSION and -COMPARISON, used
to define the similarity/distance measure, were added. The
Jarvis-Patrick programs now generate and use variable-length
nearest-neighbors lists. These lists are used to keep track of tied
values at the end of a neighbors list. The option -MAX_NEIGHBORS
controls the maximum size of a variable-length list, including the
extra tied neighbors. The additional tied neighbors can be included
or excluded from consideration during the clustering step. The
option -JP_USE_ALL_TIES in jarpat and jpscan controls this.