Abstract— This paper applies Differential Evolution (DE) and the Genetic Algorithm (GA) to the task of automatic fuzzy clustering in a Multi-objective Optimization (MO) framework. It evaluates the performance of a hybrid of the GA and DE algorithms (GADE) on the fuzzy clustering problem, where two conflicting fuzzy validity indices are optimized simultaneously. The resultant Pareto-optimal set of solutions from each algorithm consists of a number of non-dominated solutions, from which the user can choose the most promising ones according to the problem specifications. A real-coded representation of the search variables, accommodating a variable number of cluster centers, is used for GADE. The performance of GADE is also contrasted with that of two of the best-known MO clustering schemes.

1. Introduction

Optimization-based automatic clustering algorithms rely heavily on a cluster validity function (optimization criterion), the optima of which serve as proxies for the unknown “correct classification” of a previously unseen dataset [1]. Different formulations of the clustering problem vary in the optimization criterion used. Most existing clustering methods, however, attempt to optimize just one such criterion, modeled by a single cluster validity index. This often results in considerable discrepancies between the solutions produced by different algorithms on the same data. A single-objective clustering method may prove futile (as judged by expert knowledge) in a context where the criterion employed is inappropriate. In situations where the best solution corresponds to a tradeoff between different conflicting objectives, common sense advocates a multi-objective framework for clustering.

Although a plethora of papers report single-objective evolutionary clustering techniques (comprehensive surveys can be found in [1, 2]), very few research works have so far been undertaken towards the application of evolutionary multi-objective optimization algorithms (EMOA) to pattern clustering [3, 4]. The literature indicates that DE has already proved itself a promising candidate in the field of evolutionary multi-objective optimization (EMO) [5 – 8]. It has also been successfully applied to single-objective partitional clustering [9 – 11]. The work reported in [3] is based on Deb et al.’s celebrated NSGA II (Non-dominated Sorting Genetic Algorithm II) [12], and the clustering method described in [4] is based on PESA II (Pareto Envelope-based Selection Algorithm II) [13]; both algorithms are multi-objective variants of the Genetic Algorithm (GA). However, to the best of our knowledge, the multi-objective variants of DE have not been applied to general data clustering problems to date. Since DE is, by nature, a real-coded population-based optimization algorithm, we resort here to a centroid-based representation scheme for the search variables. An MO algorithm, in general, ends up with a number of Pareto-optimal solutions. Here we consider the Xie-Beni index [14] and the Fuzzy C-Means (FCM) measure ($J_q$) [15] as the objective functions. The performance of GADE is also contrasted with that of the two best-known EMOA-based clustering methods to date. The first of these is MOCK by Handl and Knowles [4], while the second is based on NSGA II and was used by Bandyopadhyay et al. for pixel clustering in remote sensing satellite image data [3]. Here we report the results for ten representative datasets, including the microarray Yeast sporulation data [16].

2. Multi-objective Optimization Using DE

2.1 The MO Problem

In many practical or real-life problems, there are many (possibly conflicting) objectives that need to be optimized simultaneously. Under such circumstances there no longer exists a single optimal solution but rather a whole set of possible solutions of equivalent quality. The field of Multi-objective Optimization (MO) [17 – 19] deals with the simultaneous optimization of multiple, possibly competing, objective functions.
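
As a minimal illustrative sketch (not taken from the original work), Pareto dominance for minimized objectives and the extraction of the non-dominated set can be written in Python as follows. The function names and the sample objective values are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical pairs of two minimized objectives for four candidate solutions
candidates = [(0.2, 9.0), (0.3, 7.0), (0.4, 8.0), (0.5, 5.0)]
front = non_dominated(candidates)  # (0.4, 8.0) is dominated by (0.3, 7.0)
```

None of the three remaining points is better than another in both objectives at once, which is the sense in which the solutions of a Pareto front are of "equivalent quality".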
2.2 The Differential Evolution (DE) Algorithm

DE [20, 21] is a population-based global optimization algorithm that uses a floating-point (real-coded) representation. It uses crossover (binomial in this case) and mutation operations to optimize a given cost function. For want of space, we omit the details of the DE algorithm here and refer the reader to the aforementioned literature.
2.3 The Multi-objective Variant of DE

We have used the Multi-objective DE (MODE) proposed by Xue et al. [8]. This algorithm uses a variant of the original DE, in which the best individual is adopted to create the offspring. A Pareto-based approach is introduced to implement the selection of the best individual. If a solution is dominated, a set of non-dominated individuals can be identified and the “best” turns out to be any individual (randomly picked) from this set.
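
The offspring generation described above may be sketched as follows, assuming the common DE/best/1 mutation with binomial crossover; the exact operator of [8] may differ. The crossover rate `cr`, the function name and the way `best` is supplied (for instance, a randomly picked non-dominated individual) are assumptions made for illustration. The scale factor is drawn from [0.5, 1], the range stated in the experimental settings.

```python
import random
from typing import List, Optional


def de_best_1_bin(pop: List[List[float]], best: List[float], i: int,
                  cr: float = 0.9,
                  rng: Optional[random.Random] = None) -> List[float]:
    """Create one trial vector for parent pop[i] (illustrative DE/best/1/bin)."""
    rng = rng or random.Random()
    dim = len(pop[i])
    # Two distinct population members other than the parent
    r1, r2 = rng.sample([k for k in range(len(pop)) if k != i], 2)
    f = rng.uniform(0.5, 1.0)  # random scale factor F
    # Mutation: perturb the chosen "best" vector with a scaled difference
    mutant = [best[j] + f * (pop[r1][j] - pop[r2][j]) for j in range(dim)]
    # Binomial crossover: position j_rand always comes from the mutant
    j_rand = rng.randrange(dim)
    return [mutant[j] if (j == j_rand or rng.random() < cr) else pop[i][j]
            for j in range(dim)]
```

Forcing one component to come from the mutant guarantees that the crossover step itself never simply copies the parent, which is the usual reason for the `j_rand` position in binomial crossover.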
3. Multi-objective Clustering Scheme

3.1 Search-variable Representation and Description of the New Algorithm

In the proposed method, for n data points, each d-dimensional, and for a user-specified maximum number of clusters $K_{\max}$, a chromosome is a vector of real numbers of dimension $K_{\max} + K_{\max} \times d$. The first $K_{\max}$ entries are positive floating-point numbers in [0, 1], each of which controls whether the corresponding cluster is to be activated (i.e. actually used for classifying the data) or not. The remaining entries are reserved for $K_{\max}$ cluster centers, each d-dimensional. For example, the i-th vector is represented as:

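$Z_i = (T_{i,1}, T_{i,2}, \ldots, T_{i,K_{\max}}, m_{i,1}, m_{i,2}, \ldots, m_{i,K_{\max}})$

where $T_{i,j}$ is the control gene of the j-th cluster and $m_{i,j}$ is the d-dimensional center of the j-th cluster.

The j-th cluster center in the i-th chromosome is active, i.e. selected for partitioning the associated dataset, if its control gene $T_{i,j}$ exceeds the activation threshold. Otherwise, the j-th cluster is inactive in the i-th vector of the DE population. Thus the $T_{i,j}$'s behave like control genes:

IF $T_{i,j}$ exceeds the activation threshold THEN the j-th cluster center $m_{i,j}$ is ACTIVE

ELSE $m_{i,j}$ is INACTIVE. (1)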
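
The decoding implied by rule (1) can be sketched as follows. This is an illustration only: the threshold value of 0.5, the flat layout of the centers after the control genes, and the function name are assumptions, since the text does not fix them.

```python
from typing import List


def decode(chromosome: List[float], k_max: int, d: int,
           threshold: float = 0.5) -> List[List[float]]:
    """Return the active cluster centers encoded in one chromosome.

    Assumed layout: k_max control genes followed by k_max centers of d values.
    The threshold of 0.5 is an assumed value, not taken from the text.
    """
    if len(chromosome) != k_max + k_max * d:
        raise ValueError("chromosome length must be k_max + k_max * d")
    genes = chromosome[:k_max]
    centers = [chromosome[k_max + j * d: k_max + (j + 1) * d]
               for j in range(k_max)]
    # A center is used only if its control gene exceeds the threshold
    return [c for g, c in zip(genes, centers) if g > threshold]


# Three possible 2-dimensional clusters; the second one is switched off
active = decode([0.9, 0.2, 0.7, 1.0, 1.0, 5.0, 5.0, 9.0, 9.0], k_max=3, d=2)
```

Because the number of active centers depends only on the control genes, vectors of one fixed length can encode partitions with different numbers of clusters.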
Conjunction of GA and DE algorithms: The Differential Evolution algorithm is applied to the $K_{\max}$ cluster centers of the chromosome (as activated by the corresponding control genes), whereas the control genes form a binary-encoded GA population, which is operated on by the genetic operators of selection, crossover and mutation. Binary tournament selection is employed in this case. The GA operators are not described here due to space limitations.

3.2 Selecting the Objective Functions

Conflict among the objective functions is often beneficial since it guides the search towards globally optimal solutions. In this work we choose the Xie-Beni index $XB_q$ and the FCM objective function $J_q$ as the two objectives. The FCM measure $J_q$ may be defined as:

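$J_q = \sum_{j=1}^{n} \sum_{i=1}^{K} u_{ij}^{q} \, d^{2}(x_j, v_i)$, (2)

where q is the fuzzy exponent, K is the number of active clusters, $d(x_j, v_i)$ indicates a distance measure between the j-th pattern vector $x_j$ and the i-th cluster centroid $v_i$, and $u_{ij}$ denotes the membership of the j-th pattern in the i-th cluster. The XB index is defined as a function of the ratio of the total variation $\sigma$ to the minimum separation $sep$ of the clusters. Here $\sigma$ and $sep$ may be written as:

$\sigma = \sum_{i=1}^{K} \sum_{j=1}^{n} u_{ij}^{2} \, d^{2}(x_j, v_i)$, (3)

$sep = \min_{i \neq l} d^{2}(v_i, v_l)$, (4)

so that, following [14],

$XB_q = \dfrac{\sigma}{n \times sep}$. (5)

The memberships are computed in the usual FCM manner [15]:

$u_{ij} = \left[ \sum_{p=1}^{K} \left( \dfrac{d(x_j, v_i)}{d(x_j, v_p)} \right)^{2/(q-1)} \right]^{-1}$, $1 \le i \le K$, $1 \le j \le n$. (6)

Note that while computing the $u_{ij}$'s using equation (6), if $d(x_j, v_p)$ is equal to zero for some p, then $u_{ij}$ is set to zero for all $i = 1, \ldots, K$, $i \neq p$, while $u_{pj}$ is set equal to one. Subsequently the centers encoded in a vector are updated using:

$v_i = \dfrac{\sum_{j=1}^{n} u_{ij}^{q} \, x_j}{\sum_{j=1}^{n} u_{ij}^{q}}$, $1 \le i \le K$. (7)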
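
One evaluation of the two objectives for a fixed set of active centers can be sketched as below. The sketch assumes the standard FCM membership and center-update formulas [15], the original Xie-Beni form with squared memberships [14], and the squared Euclidean distance; all function names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def memberships(data: List[Vector], centers: List[Vector],
                q: float = 2.0) -> List[List[float]]:
    """u[i][j]: membership of pattern j in cluster i (standard FCM formula)."""
    k = len(centers)
    u = [[0.0] * len(data) for _ in range(k)]
    for j, x in enumerate(data):
        d2 = [sq_dist(x, v) for v in centers]
        zero = [p for p in range(k) if d2[p] == 0.0]
        if zero:
            u[zero[0]][j] = 1.0  # pattern coincides with a center
            continue
        for i in range(k):
            # (d_i / d_p)^(2/(q-1)) expressed with squared distances
            u[i][j] = 1.0 / sum((d2[i] / d2[p]) ** (1.0 / (q - 1.0))
                                for p in range(k))
    return u


def update_centers(data: List[Vector], u: List[List[float]],
                   q: float = 2.0) -> List[Vector]:
    """Membership-weighted mean of the patterns for every cluster."""
    dim = len(data[0])
    centers = []
    for row in u:
        w = [m ** q for m in row]
        total = sum(w)  # zero only for a degenerate cluster
        centers.append([sum(wj * x[c] for wj, x in zip(w, data)) / total
                        for c in range(dim)])
    return centers


def objectives(data: List[Vector], centers: List[Vector],
               u: List[List[float]], q: float = 2.0) -> Tuple[float, float]:
    """Return (XB, J_q); both are to be minimized."""
    k = len(centers)
    pairs = [(i, j) for i in range(k) for j in range(len(data))]
    jq = sum(u[i][j] ** q * sq_dist(data[j], centers[i]) for i, j in pairs)
    sigma = sum(u[i][j] ** 2 * sq_dist(data[j], centers[i]) for i, j in pairs)
    sep = min(sq_dist(centers[i], centers[l])
              for i in range(k) for l in range(k) if i != l)
    return sigma / (len(data) * sep), jq


data = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]]
centers = [[0.0, 0.5], [4.0, 0.5]]
u = memberships(data, centers)
xb, jq = objectives(data, centers, u)
```

The divisions by `total` and by `sep` are the places where a degenerate vector (an empty cluster or two coincident centers) would cause a division by zero.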
3.3 Avoiding Erroneous Vectors

In our scheme, a division by zero may be encountered during computation of $XB_q$ or $J_q$. This may occur when one of the selected cluster centers in a DE vector lies outside the region occupied by the data. To avoid this problem, we first check whether any cluster has fewer than two data points in it. If so, the cluster center positions of that chromosome are re-initialized by an average computation.
3.4 Selecting the Best Solution from the Pareto Front

To choose the most interesting solutions from the Pareto front, we apply the Gap statistic of Tibshirani et al. [24], a statistical method for determining the number of clusters in a data set.

3.5 Evaluating the Clustering Quality

In this work, the final clustering quality is evaluated using two validity measures: the Adjusted Rand Index [25] (an external measure that generalizes the Rand index [26]) and the Silhouette index [27] (an internal measure). Silhouette width reflects the compactness and separation of the clusters. Given a set of n data points $X = \{x_1, \ldots, x_n\}$ and a clustering solution $C = \{C_1, \ldots, C_K\}$, the silhouette width $s(x_i)$ of each data point $x_i$ belonging to cluster $C_j$ indicates a measure of the confidence of belongingness, and it is defined as:

$s(x_i) = \dfrac{b(x_i) - a(x_i)}{\max\{a(x_i), b(x_i)\}}$. (8)

Here $a(x_i)$ denotes the average distance of data point $x_i$ from the other data points of the cluster to which it is assigned (i.e. cluster $C_j$). On the other hand, $b(x_i)$ represents the minimum of the average distances of data point $x_i$ from the data points belonging to clusters $C_l$, $l = 1, \ldots, K$ and $l \neq j$. The value of $s(x_i)$ lies between -1 and +1. Large values of $s(x_i)$ (near 1) indicate that the data point $x_i$ is well clustered. The overall silhouette index of a clustering solution is defined as the mean silhouette width over all the data points:

$S(C) = \dfrac{1}{n} \sum_{i=1}^{n} s(x_i)$. (9)
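
The silhouette computation described above can be sketched in a few lines of Python. The Euclidean distance, the function name and the convention of assigning a width of 0 to a point that is alone in its cluster are assumptions made for this illustration.

```python
from math import dist
from typing import Hashable, Sequence


def silhouette_index(data: Sequence[Sequence[float]],
                     labels: Sequence[Hashable]) -> float:
    """Mean silhouette width over all points (Euclidean distance assumed)."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    widths = []
    for i, x in enumerate(data):
        # Distance sums and member counts per cluster, excluding x itself
        totals = {c: 0.0 for c in clusters}
        counts = {c: 0 for c in clusters}
        for k, y in enumerate(data):
            if k != i:
                totals[labels[k]] += dist(x, y)
                counts[labels[k]] += 1
        own = labels[i]
        if counts[own] == 0:
            widths.append(0.0)  # singleton cluster: width 0 (assumed convention)
            continue
        a = totals[own] / counts[own]  # mean distance within own cluster
        b = min(totals[c] / counts[c] for c in clusters if c != own)
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)


points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_index(points, [0, 0, 1, 1])  # compact, well-separated pairs
bad = silhouette_index(points, [0, 1, 0, 1])   # each cluster spans both groups
```

For the four sample points, the natural grouping gives a clearly positive index while the mixed labeling gives a negative one, which matches the interpretation of values near +1 and below 0.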

4. Experimental Results

4.1 Datasets Used

Experimental results showing the effectiveness of the multi-objective DE-based clustering are provided for six artificial and four real-life datasets. Table 1 presents the details of the datasets. The real-life datasets are iris, wine, breast-cancer [28] and the yeast sporulation data. The sporulation dataset is available from [31].
4.2 Parameters for the Algorithms

GADE was run with 40 parameter vectors in each generation, and each run of each algorithm was continued for 100 generations. The scale factor F is a random value between 0.5 and 1. The other parameters for the multi-objective GA (NSGA II) based clustering are fixed as follows: number of generations = 100, population size = 50, crossover probability = 0.8, mutation probability =. Please note that GADE and NSGA II use the same parameter representation scheme. Clustering with MOCK was performed with the source code available from [32].
4.3 Presentation of Results

The mean Silhouette index values of the best-of-run solutions provided by six contestant algorithms over the 10 datasets are given in Table 2. The best entry in each row is marked in boldface. Table 3 lists the adjusted Rand index values, except for the Yeast sporulation data, as no standard nominal classification is known for this dataset.

4.4 Significance and Validation of Microarray Data Clustering Results

In this section, the best clustering solutions provided by the different algorithms on the yeast sporulation data are visualized using cluster profile plots (in parallel coordinates [30]) in MATLAB version 7.0.4. Parallel coordinates are a common way of visualizing high-dimensional geometry. Cluster profile plots of the seven clusters for the best clustering result (provided by GADE) on the yeast sporulation data are shown in Figure 1. The blue polylines indicate the member genes within a cluster, while the black polyline indicates the centroid of that cluster. The heatmap and FatiGO results may be obtained from [33].

5. Conclusions

This paper compared and contrasted the performance of GADE in an automatic clustering framework with that of two other prominent multi-objective clustering algorithms. The multi-objective GADE variant used the same variable representation scheme as the NSGA II based method. Tables 2 to 4 indicate that GADE was usually able to produce better final clustering results than MOCK or NSGA II in terms of both the adjusted Rand index and the Silhouette index when all the algorithms were run for an equal number of generations. Future research may extend the multi-objective GADE-based clustering schemes to handle discrete chromosome representation schemes that no longer depend on cluster centroids and thus are not biased in any sense towards spherical clusters.
References