The QCL’s “Open Access” library is a resource where members can deposit their compounds and extracts, and access members can request these compounds and extracts for screening on a range of biological targets. The QCL can now supply researchers in the life sciences with unique small molecules at prices well below those of commercial vendors. At present there are almost 20,000 pure compounds that can be accessed for screening.

The QCL is often asked to provide smaller subsets of the Open Access library, as screening the whole library is not always possible. The key question addressed by QFAB was how best to select these subsets whilst ensuring the maximum outcome from the screening.

SOLUTION PROVIDED

The diverse subset selections were performed using a clustering-based method where a specific number of representative compounds are chosen from the entire data set.

The first step consists in selecting initial seeds using a maximum dissimilarity algorithm (MaxMin[1]). This enables seeds to be spread across the chemical space. The algorithm begins by randomly choosing one molecule as the first seed. The molecule maximally distant from the first seed is selected as the next seed. The molecule maximally distant from both current seeds is selected after that. The process repeats itself until there is a sufficient number of seeds.
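The seed selection described above can be sketched as follows. This is a minimal illustration, not the implementation used in the project: the function name `maxmin_seeds`, the representation of fingerprints as Python sets of "on" bit positions, and the toy data are all assumptions made for the example.

```python
import random

def jaccard_distance(a, b):
    """1 - Tanimoto similarity for fingerprints stored as sets of on-bits."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0

def maxmin_seeds(items, n_seeds, dist, rng=None):
    """Pick n_seeds indices spread across the data set (MaxMin-style)."""
    if not 0 < n_seeds <= len(items):
        raise ValueError("n_seeds must be between 1 and len(items)")
    rng = rng or random.Random(0)          # fixed seed: an assumption, for repeatability
    first = rng.randrange(len(items))      # the first seed is chosen at random
    seeds = [first]
    # min_dist[i] = distance from molecule i to its nearest seed so far
    min_dist = [dist(items[i], items[first]) for i in range(len(items))]
    min_dist[first] = -1.0                 # never pick an existing seed again
    while len(seeds) < n_seeds:
        # next seed: the molecule whose nearest seed is farthest away
        nxt = max(range(len(items)), key=lambda i: min_dist[i])
        seeds.append(nxt)
        min_dist[nxt] = -1.0
        for i in range(len(items)):
            if min_dist[i] >= 0.0:
                min_dist[i] = min(min_dist[i], dist(items[i], items[nxt]))
    return seeds

# Hypothetical toy library: two structurally unrelated groups of three molecules.
library = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2}, {10, 11, 12}, {10, 11}, {10, 11, 12, 13}]
seeds = maxmin_seeds(library, 2, jaccard_distance)
```

With two seeds requested, the second seed always falls in the group that does not contain the first, because every molecule in the other group is at the maximum distance of 1.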

In the second step, the clusters around the seeds are allowed to converge to stable clusters using the k-means clustering algorithm. After selection of the initial seeds, the non-selected molecules are assigned to the nearest seed to determine cluster membership. For each cluster, the new cluster centre is determined by identifying the molecule with the smallest mean distance to the rest of the cluster members. Non-centre molecules are then re-assigned to the nearest centre, forming new clusters. The process is repeated 100 times to ensure that the clusters are stable. The final subset is made by selecting the cluster centre molecules.
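The refinement loop described above can be sketched as below. Because each new centre is an actual molecule rather than an averaged point, the update resembles a medoid update. The names `assign` and `refine_centres`, the set-based fingerprints, and the early exit when the centres stop changing are assumptions of this sketch; the text only states that the process is repeated 100 times.

```python
def jaccard_distance(a, b):
    """1 - Tanimoto similarity for fingerprints stored as sets of on-bits."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0

def assign(items, centres, dist):
    """Map each centre index to the indices of the molecules nearest to it."""
    clusters = {c: [c] for c in centres}   # every centre belongs to its own cluster
    centre_set = set(centres)
    for i in range(len(items)):
        if i in centre_set:
            continue
        nearest = min(centres, key=lambda c: dist(items[i], items[c]))
        clusters[nearest].append(i)
    return clusters

def refine_centres(items, seeds, dist, n_iter=100):
    """Alternate assignment and centre update for at most n_iter rounds."""
    centres = list(seeds)
    for _ in range(n_iter):
        clusters = assign(items, centres, dist)
        # New centre: the member with the smallest summed distance to the other
        # members (same ranking as the smallest mean distance within a cluster).
        new_centres = [
            min(members, key=lambda m: sum(dist(items[m], items[o]) for o in members))
            for members in clusters.values()
        ]
        if set(new_centres) == set(centres):   # stable: stop early (assumption)
            break
        centres = new_centres
    return centres, assign(items, centres, dist)

# Hypothetical toy library and starting seeds (indices into the list).
library = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2}, {10, 11, 12}, {10, 11}, {10, 11, 12, 13}]
centres, clusters = refine_centres(library, [2, 5], jaccard_distance)
```

In this toy case the centres move from the peripheral seeds (indices 2 and 5) to the most central molecule of each group (indices 0 and 3), and the returned centres form the selected subset.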

Both MaxMin and k-means methods used the Tanimoto coefficient of the structural fingerprints to assess the similarity/dissimilarity between compounds.
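For reference, the Tanimoto coefficient of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below stores each fingerprint as a Python integer bit mask. That representation, the function names, and the convention of returning 1.0 for two empty fingerprints are assumptions; the fingerprint type used in the project is not stated here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as integer bit masks."""
    common = bin(fp_a & fp_b).count("1")   # bits set in both fingerprints
    either = bin(fp_a | fp_b).count("1")   # bits set in at least one fingerprint
    if either == 0:
        return 1.0                         # assumed convention for two empty fingerprints
    return common / either

def tanimoto_distance(fp_a, fp_b):
    """Dissimilarity used for seed selection and clustering: 1 - similarity."""
    return 1.0 - tanimoto(fp_a, fp_b)

# 3 shared bits out of 5 bits set in either fingerprint gives 0.6
similarity = tanimoto(0b101101, 0b100111)
```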

Figure 1 shows the cumulative proportion of the Open Access library represented by clusters of a given size or smaller. For example, in the subset selection of 5,000 compounds, 80% of the full library is represented by clusters of size 15 or smaller, and there are only a few occasional clusters of 35 to 80 compounds. In contrast, the 1,000-compound subset selection is made from many large clusters: 10% of the full library is represented by clusters with more than 140 members.
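A curve of this kind can be computed from the list of cluster sizes alone. The sketch below is illustrative only: the function name `cumulative_proportion` and the cluster sizes are made up for the example and are not the project's data.

```python
from collections import Counter

def cumulative_proportion(cluster_sizes):
    """Return (size, fraction of all molecules in clusters of that size or smaller)."""
    total = sum(cluster_sizes)
    counts = Counter(cluster_sizes)        # how many clusters have each size
    running = 0
    curve = []
    for size in sorted(counts):
        running += size * counts[size]     # molecules held by clusters of this size
        curve.append((size, running / total))
    return curve

# Hypothetical example: five clusters holding ten molecules in total.
curve = cumulative_proportion([1, 1, 2, 2, 4])
# curve == [(1, 0.2), (2, 0.6), (4, 1.0)]
```

Reading the curve at a given size gives the share of the library covered by clusters no larger than that size, which is how the 80% and 10% figures above are read from Figure 1.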

The 5,000-compound subset selection can be assumed to represent the Open Access library well.

The 1,000-compound subset selection captures the diversity of the Open Access library, with relatively many “extreme” compounds present, but under-explores zones of high density.