A highly efficient multi-core algorithm for clustering extremely large datasets.

Kraus JM, Kestler HA - BMC Bioinformatics (2010)

Bottom Line:
We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java-based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network-based parallelization. Most desktop computers and even notebooks provide at least dual-core processors.

Background: In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms rely largely on network communication protocols and therefore require multiple connected computers. One answer to this problem is to utilize the intrinsic capabilities of current multi-core hardware to distribute the tasks among the different cores of one computer.

Results: We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorical SNP data. Our new shared-memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java-based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network-based parallelization.

Conclusions: Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that, using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer.

Figure 3: Example for cluster evaluation via the MCA index. On the left, two possible partitionings of the data set are shown, i.e. P = {A1, A2, A3} and Q = {B1, B2, B3}. The bipartite matching graph is given on the right. Each edge is annotated with the number of intersecting data points in both partitionings. The solid lines mark the maximal matching edges. In this example the MCA index is 0.71.

Mentions:
For the evaluation of the experiments, we use a measure that is based on the pairwise similarity between set partitions and can be interpreted as the mean proportion of samples being consistent over the different clusterings [32,33]. Because this index behaves linearly in the number of data points, it offers better interpretability in terms of the proportion of samples moving between clusters. There is no such intuitive interpretability for quadratic validity measures like the Rand or Jaccard index [36,37]. The concept is illustrated in Figure 3. In the left part of Figure 3 two partitionings P and Q are compared. The correspondence or similarity between two clusters Ci and Dj is given by the size of the intersection set |Ci ∩ Dj|. The idea of the maximum cluster assignment (MCA) index is to find a bijective mapping π: {1 ... k} → {1 ... k} that maps each cluster from one clustering P to its corresponding cluster in Q in such a way that the sum over all similarities is maximized. The bold lines in the right part of Figure 3 mark the maximum matching edges in the bipartite graph representation. In this example, the best mapping is A1 ↔ B2, A2 ↔ B1, A3 ↔ B3. The MCA index is then defined as the ratio of the number of data points in the intersection sets of the corresponding clusters to the overall number of data points n, that is, MCA(P, Q) = (1/n) · max over π of Σ_{i=1..k} |Ci ∩ Dπ(i)|.
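
The matching step described above can be tried out with a minimal Python sketch. It is an illustration only, not the authors' Java implementation: the function name `mca_index`, the label-list input format, and the padding with empty clusters when the two partitionings have different numbers of clusters are all assumptions made here. The sketch searches every bijection π by brute force, which is only practical for small k. The seven-point labelling in the usage lines is made up for the example and is not the data behind Figure 3.

```python
from itertools import permutations


def mca_index(labels_p, labels_q):
    """Maximum cluster assignment (MCA) index of two partitionings.

    labels_p[i] and labels_q[i] are the cluster labels of data point i in
    the partitionings P and Q (hypothetical input format for this sketch).
    """
    if len(labels_p) != len(labels_q):
        raise ValueError("both partitionings must label the same data points")
    if not labels_p:
        raise ValueError("at least one data point is required")

    clusters_p = sorted(set(labels_p))
    clusters_q = sorted(set(labels_q))
    # Assumption: if the cluster counts differ, pad with empty clusters
    # so that a bijection between the two cluster sets exists.
    k = max(len(clusters_p), len(clusters_q))
    row = {c: i for i, c in enumerate(clusters_p)}
    col = {c: j for j, c in enumerate(clusters_q)}

    # overlap[i][j] = |Ci ∩ Dj|, the edge weights of the bipartite graph.
    overlap = [[0] * k for _ in range(k)]
    for a, b in zip(labels_p, labels_q):
        overlap[row[a]][col[b]] += 1

    # Brute-force search over all k! bijections for the maximum total overlap.
    best = max(
        sum(overlap[i][pi[i]] for i in range(k))
        for pi in permutations(range(k))
    )
    return best / len(labels_p)


# Made-up labels for seven data points under two clusterings.
p = [0, 0, 0, 1, 1, 2, 2]
q = [1, 1, 0, 0, 0, 2, 1]
score = mca_index(p, q)  # 5 of the 7 points agree under the best mapping
```

Because only the overlaps matter, renaming the clusters of one partitioning leaves the index unchanged. For larger k, the factorial search could be replaced by a polynomial-time assignment solver such as the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment` with `maximize=True`.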
