At present, a large number of clustering algorithms are available to group objects having similar characteristics. However, many of these algorithms are difficult to apply to categorical data: some cannot handle categorical attributes at all, while others cannot handle uncertainty. Many also suffer from stability and efficiency problems. This necessitated the development of algorithms that cluster categorical data while also dealing with uncertainty. In 2007, an algorithm termed MMR was proposed [3], which uses rough set theory to address these problems in clustering categorical data. In 2009, this algorithm was improved to the algorithm MMeR [2], which can handle hybrid data. Very recently, in 2011, MMeR was further improved to an algorithm called SDR [22], which can also handle hybrid data. The last two algorithms handle uncertainty and categorical data at the same time, with SDR more efficient than MMeR and MMR. In this paper, we propose a new algorithm in this sequence, which is better than all its predecessors MMR, MMeR and SDR; we call it the SSDR (Standard deviation of Standard Deviation Roughness) algorithm. It takes numerical and categorical data simultaneously, besides taking care of uncertainty. The algorithm also gives better performance when tested on well-known datasets.

The basic objective of clustering is to group data or objects having similar characteristics in the same cluster while being dissimilar to other clusters. Clustering has been used in data mining tasks such as unsupervised classification and data summarization. It is also used in segmenting large heterogeneous data sets into smaller homogeneous subsets that can be easily managed, separately modeled and analyzed [8]. The basic goal of cluster analysis is to discover the natural groupings of objects [11]. Clustering techniques are used in many areas such as manufacturing, medicine, nuclear science, radar scanning, and research and development. For example, Wu et al. [21] developed a clustering algorithm specifically designed to handle the complexity of gene data. Jiang et al. [13] analyze a variety of cluster techniques that can be applied to gene expression data. Wong et al. [16] present an approach used to segment tissues in a nuclear medical imaging method known as positron emission tomography (PET). Haimov et al. [20] use cluster analysis to segment radar signals in scanning land and marine objects. Finally, Mathieu and Gibson [19] use cluster analysis as part of a decision support tool for large-scale research and development planning, to identify programs to participate in and to determine resource allocation.

The problem with all the above mentioned algorithms is that they mostly deal with numerical data sets, that is, databases whose attributes have numeric domains. The basic reason for dealing with numerical attributes is that they are very easy to handle and it is easy to define similarity on them. Categorical data, however, have multi-valued attributes. Thus, similarity can be defined as common objects, common values for attributes, and the association between the two. In such cases both horizontal co-occurrences (common values for the objects) and vertical co-occurrences (common values for the attributes) can be examined [21].

Other algorithms that can handle categorical data have been proposed, including work by Huang [3], Gibson et al. [4], Guha et al. [13] and Dempster et al. [1]. While these algorithms or methods are very helpful in forming clusters from categorical data, they have the disadvantage that they cannot deal with uncertainty. However, in real-world applications it has been found that there is often no sharp boundary between clusters. Recently, some work has been done by Huang [8] and Kim et al. [14], who developed clustering algorithms using fuzzy sets that can handle categorical data. But these algorithms suffer from the stability problem, as multiple runs of the algorithms do not provide consistent values.

Therefore, there is a need for a robust algorithm that can handle uncertainty and categorical data together. In this sequence, S. Parmar et al. [3] in 2007, B. K. Tripathy et al. [2] in 2009 and [22] in 2011 proposed three algorithms that deal with both uncertainty and categorical attributes together. But efficiency and stability come into play when the purity ratio is measured: the purity ratios of MMR, MMeR and SDR are in increasing order.

In this paper, a new algorithm called the Standard Deviation of Standard Deviation Roughness (SSDR) algorithm is proposed, which has a higher purity ratio than all the previous algorithms in this series and those before it. We establish the superiority of this algorithm over the others by testing them on a familiar database, the zoo data set taken from the UCI repository.

MATERIALS AND METHODS

2.1 Materials
In this section we first present the literature review that forms the basis of the proposed work, the definitions of the concepts to be used in the work, and the notations to be used.

2.1.1 Literature Review
In this section we present the literature of existing categorical clustering algorithms. Dempster et al. [1] present a partitional clustering method called the Expectation-Maximization (EM) algorithm. EM first randomly assigns different probabilities to each class or category, for each cluster. These probabilities are then successively adjusted to maximize the likelihood of the data given the specified number of clusters. Since the EM algorithm computes the classification probabilities, each observation belongs to each cluster with a certain probability. The actual assignment of observations to a cluster is determined based on the largest classification probability. After a large number of iterations, EM terminates at a locally optimal solution. Han et al. [26] propose a clustering algorithm to cluster related items in a market database based on an association rule hypergraph. A hypergraph is used as a model for relatedness. The approach targets binary transactional data. It assumes that the item sets which define clusters are disjoint and there is no overlap amongst them; however, this assumption may not hold in practice, as transactions in different clusters may have a few common items. K-modes [8] extends K-means and introduces a new dissimilarity measure for categorical data. The dissimilarity measure between two objects is calculated as the number of attributes whose values do not match. The K-modes algorithm then replaces the means of clusters with modes, using a frequency-based method to update the modes in the clustering process to minimize the clustering cost function. One advantage of K-modes is that it is useful in interpreting the results [8]. However, K-modes generates locally optimal solutions that depend on the initial modes and the order of objects in the data set, and it must be run multiple times with different starting values of modes to test the stability of the clustering solution. Ralambondrainy [15] proposes a method to convert multi-category attributes into binary attributes, using 0 and 1 to represent the absence or presence of a category, and to treat the binary attributes as numeric in the K-means algorithm. Huang [8] also proposes the K-prototypes algorithm, which allows clustering of objects described by a combination of numeric and categorical data. CACTUS (Clustering Categorical Data Using Summaries) [23] is a summarization-based algorithm. In CACTUS, the authors cluster categorical data by generalizing the definition of a cluster for numerical attributes; summary information constructed from the data set is assumed to be sufficient for discovering well-defined clusters. CACTUS finds clusters in subsets of all attributes and thus performs a subspace clustering of the data. Guha et al. [6] propose a hierarchical clustering method termed ROCK (Robust Clustering using Links), which can measure the similarity or proximity between a pair of objects. Using ROCK, the number of links is computed as the number of common neighbors between two objects. An agglomerative hierarchical clustering algorithm is then applied: first, the algorithm assigns each object to a separate cluster; clusters are then merged repeatedly according to the closeness between clusters, where closeness is defined as the sum of the number of links between all pairs of objects. Gibson et al. [4] propose an algorithm called STIRR (Sieving Through Iterated Relational Reinforcement), a generalized spectral graph partitioning method for categorical data. STIRR is an iterative approach which maps categorical data to non-linear dynamic systems. If the dynamic system converges, the categorical data can be clustered. Clustering naturally lends itself to combinatorial formulation. However, STIRR requires a non-trivial post-processing step to identify sets of closely related attribute values [23], and certain classes of clusters are not discovered by STIRR [23].
Moreover, Zhang et al. [24] argue that STIRR cannot guarantee convergence and therefore propose a revised dynamic system algorithm that assures convergence. He et al. [7] propose an algorithm called Squeezer, a one-pass algorithm: Squeezer puts the first tuple in a cluster, and each subsequent tuple is either put into an existing cluster or rejected to form a new cluster, based on a given similarity function. He et al. [25] explore the categorical data clustering (CDC) and link clustering (LC) problems, propose LCBCDC (Link Clustering Based Categorical Data Clustering), and compare the results with Squeezer and K-modes. In reviewing these algorithms, some of the methods, such as the STIRR and EM algorithms, cannot guarantee convergence, while others have scalability issues. In addition, all of the algorithms have one common assumption: each object can be classified into only one cluster, and all objects have the same degree of confidence when grouped into a cluster [5]. However, in real-world applications it is difficult to draw clear boundaries between the clusters. Therefore, the uncertainty of the objects belonging to a cluster needs to be considered.
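For concreteness, the simple matching dissimilarity at the heart of K-modes (and of the fuzzy variants discussed below) can be sketched as follows; this is a minimal illustration, and the attribute tuples are invented for the example.

def simple_matching_dissimilarity(x, y):
    """Number of attributes on which two categorical objects disagree."""
    return sum(1 for a, b in zip(x, y) if a != b)

# Two invented animals described by (size, animality, color):
print(simple_matching_dissimilarity(("Small", "Bird", "Red"),
                                    ("Small", "Fish", "Red")))  # -> 1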

One of the first attempts to handle uncertainty was fuzzy K-means [9]. In this algorithm, each pattern or object is allowed to have membership in all clusters rather than a distinct membership in exactly one cluster. Krishnapuram and Keller [18] propose a possibilistic approach to clustering in which the membership of a feature vector in a class has nothing to do with its membership in other classes, and modified clustering methods are used to generate membership distributions. Krishnapuram et al. [17] present several fuzzy and probabilistic algorithms to detect linear and quadratic shell clusters. Note that the initial work on handling uncertainty was based on numerical data. Huang [8] proposes a fuzzy K-modes algorithm with a new procedure to generate the fuzzy partition matrix from categorical data within the framework of the fuzzy K-means algorithm. The method finds fuzzy cluster modes when a simple matching dissimilarity measure is used for categorical objects. By assigning confidence values to objects in different clusters, the core and boundary objects of the clusters can be decided, which provides more useful information for dealing with boundary objects. More recently, Kim et al. [14] have extended the fuzzy K-modes algorithm by using fuzzy centroids to represent the clusters of categorical data instead of the hard-type centroids used in the fuzzy K-modes algorithm. The use of fuzzy centroids makes it possible to fully exploit the power of fuzzy sets in representing the uncertainty in the classification of categorical data. However, the fuzzy K-modes and fuzzy centroid algorithms suffer from the same problem as K-modes: they require multiple runs with different starting values of modes to test the stability of the clustering solution. In addition, these algorithms have to adjust a control parameter for membership fuzziness to obtain better solutions, which again requires multiple runs to determine an acceptable value of this parameter. Therefore, there is a need for a categorical data clustering method that can handle uncertainty in the clustering process while providing stable results. One methodology with potential for handling uncertainty is Rough Set Theory (RST), which has received considerable attention in the computational intelligence literature since its development by Pawlak in the 1980s. Unlike fuzzy set based approaches, rough sets have no requirement of domain expertise to assign fuzzy memberships, yet they may still provide satisfactory results for rough clustering. The objective of this work is to develop a rough set based approach for categorical data clustering. The approach, termed Standard deviation of Standard deviation Roughness (SSDR), is presented and its performance is evaluated on large-scale data sets.

2.1.2 Basics of rough sets
Most of our traditional tools for formal modeling, reasoning and computing are deterministic and precise in character. Real situations are very often not deterministic and cannot be described precisely. A complete description of a real system would often require far more detailed data than a human being could ever recognize, process and understand simultaneously. This observation led to the extension of the basic concept of a set so as to model imprecise data and enhance its modeling power. The fundamental concept of a set has been extended in many directions in the recent past. The notion of Fuzzy Sets, introduced by Zadeh [10], deals with approximate membership, while the notion of Rough Sets, introduced by Pawlak [12], captures indiscernibility of the elements of a set. These two theories have been found to complement each other instead of being rivals. The idea of a rough set is the approximation of a set by a pair of sets, called the lower and upper approximations of the set. The basic assumption in rough set theory is that knowledge depends upon the classification capabilities of human beings. Since every classification (or partition) of a universe and the concept of an equivalence relation are interchangeable notions, the definition of rough sets depends upon equivalence relations as its mathematical foundation [12].

Let U (≠ ∅) be a finite set of objects, called the universe, and let R be an equivalence relation over U. By U/R we denote the family of all equivalence classes of R (or the classification of U), referred to as categories or concepts of R, and $[x]_R$ denotes the category in R containing an element x ∈ U. By a knowledge base, we understand a relational system K = (U, R), where U is as above and R is a family of equivalence relations over U.

For any subset P (≠ ∅) ⊆ R, the intersection of all equivalence relations in P is denoted by IND(P) and is called the indiscernibility relation over P. The equivalence classes of IND(P) are called the P-basic knowledge about U in K. For any Q ∈ R, Q is called a Q-elementary knowledge about U in K, and the equivalence classes of Q are called Q-elementary concepts of knowledge R. The family of P-basic categories for all ∅ ≠ P ⊆ R is called the family of basic categories in knowledge base K. By IND(K) we denote the family of all equivalence relations defined in K; symbolically, IND(K) = {IND(P) : ∅ ≠ P ⊆ R}.

For any X ⊆ U and an equivalence relation R ∈ IND(K), we associate two subsets, $\underline{R}X = \bigcup\{Y \in U/R : Y \subseteq X\}$ and $\overline{R}X = \bigcup\{Y \in U/R : Y \cap X \neq \emptyset\}$, called the R-lower and R-upper approximations of X respectively. The R-boundary of X is denoted by $BN_R(X)$ and is given by $BN_R(X) = \overline{R}X - \underline{R}X$. The elements of $\underline{R}X$ are those elements of U which can be certainly classified as elements of X employing knowledge of R. The borderline region is the undecidable area of the universe. We say X is rough with respect to R if and only if $\underline{R}X \neq \overline{R}X$, equivalently $BN_R(X) \neq \emptyset$. X is said to be R-definable if and only if $\underline{R}X = \overline{R}X$, or $BN_R(X) = \emptyset$. So, a set is rough with respect to R if and only if it is not R-definable.

2.1.3 Definitions
Definition 2.1.3.1 (Indiscernibility relation Ind(B)): Ind(B) is a relation on U. Given two objects $x_i, x_j \in U$, they are indiscernible by the set of attributes B in A if and only if $a(x_i) = a(x_j)$ for every $a \in B$. That is, $(x_i, x_j) \in Ind(B)$ if and only if $\forall a \in B$, where $B \subseteq A$, $a(x_i) = a(x_j)$.

Definition 2.1.3.2 (Equivalence class $[x_i]_{Ind(B)}$): Given Ind(B), the set of objects having the same values for the set of attributes in B constitutes an equivalence class, $[x_i]_{Ind(B)}$. It is also known as the elementary set with respect to B.

Definition 2.1.3.3 (Lower approximation): Given the set of attributes B in A and a set of objects X in U, the lower approximation of X is defined as the union of all the elementary sets which are contained in X. That is,

$\underline{B}X = \bigcup \{x_i \mid [x_i]_{Ind(B)} \subseteq X\}$.

Definition 2.1.3.4 (Upper approximation): Given the set of attributes B in A and a set of objects X in U, the upper approximation of X is defined as the union of the elementary sets which have a nonempty intersection with X. That is,

$\overline{B}X = \bigcup \{x_i \mid [x_i]_{Ind(B)} \cap X \neq \emptyset\}$.

Definition 2.1.3.5 (Roughness): The ratio of the cardinality of the lower approximation to the cardinality of the upper approximation is defined as the accuracy of estimation, which is a measure of roughness. It is presented as

$R_B(X) = 1 - \dfrac{|\underline{B}X|}{|\overline{B}X|}$

If $R_B(X) = 0$, X is crisp with respect to B; in other words, X is precise with respect to B. If $R_B(X) > 0$, X is rough with respect to B; that is, B is vague with respect to X.
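To make Definitions 2.1.3.1 to 2.1.3.5 concrete, the following is a minimal Python sketch; the data layout (a dictionary mapping object identifiers to attribute-value dictionaries) and the function names are our own choices, not the paper's.

from collections import defaultdict

def equivalence_classes(U, B):
    """Equivalence classes of Ind(B): objects are grouped together
    exactly when they agree on every attribute in B."""
    classes = defaultdict(set)
    for obj, row in U.items():
        classes[tuple(row[a] for a in B)].add(obj)
    return list(classes.values())

def approximations(U, B, X):
    """B-lower and B-upper approximations of a set of objects X."""
    lower, upper = set(), set()
    for cls in equivalence_classes(U, B):
        if cls <= X:      # elementary set contained in X
            lower |= cls
        if cls & X:       # elementary set meeting X
            upper |= cls
    return lower, upper

def roughness(U, B, X):
    """R_B(X) = 1 - |lower|/|upper|; 0 means X is crisp with respect to B."""
    lower, upper = approximations(U, B, X)
    return 1 - len(lower) / len(upper) if upper else 0.0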

Definition 2.1.3.7 (Mean roughness): Let A have n attributes and let $a_i \in A$. Let X be the subset of objects having a specific value α of the attribute $a_i$. Then we define the mean roughness for the equivalence class $a_i = \alpha$, denoted MeR($a_i = \alpha$), as

$MeR(a_i = \alpha) = \dfrac{\sum_{j=1,\, j \neq i}^{n} R_{a_j}(X \mid a_i = \alpha)}{n - 1}$

where $R_{a_j}(X \mid a_i = \alpha)$ denotes the relative roughness of X with respect to the single attribute $a_j$, computed as in Definition 2.1.3.5 with B = {$a_j$}.

Definition 2.1.3.8 (Standard deviation): After calculating the mean roughness of each $a_i \in A$, we apply the standard deviation to each $a_i$ by the formula

$SD(a_i = \alpha) = \left( \dfrac{1}{n-1} \sum_{j=1,\, j \neq i}^{n} \left( R_{a_j}(X \mid a_i = \alpha) - MeR(a_i = \alpha) \right)^2 \right)^{1/2}$
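Continuing the sketch above, the mean roughness and its standard deviation for one attribute-value pair can be computed as follows (same assumed data layout):

import math

def mean_sd_roughness(U, attributes, a_i, alpha):
    """MeR(a_i = alpha) and SD(a_i = alpha) per Definitions 2.1.3.7 and
    2.1.3.8: the relative roughness of X = {objects with a_i = alpha} is
    evaluated against every other attribute, then averaged and dispersed."""
    X = {obj for obj, row in U.items() if row[a_i] == alpha}
    rel = [roughness(U, [a_j], X) for a_j in attributes if a_j != a_i]
    n = len(attributes)
    mer = sum(rel) / (n - 1)
    sd = math.sqrt(sum((r - mer) ** 2 for r in rel) / (n - 1))
    return mer, sd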

Definition 2.1.3.9 (Distance of relevance): Given two objects B and C of categorical data with n attributes, the distance of relevance DR of the objects is defined as $DR(B, C) = \sum_{i=1}^{n} DR(b_i, c_i)$, where

1. $DR(b_i, c_i) = 0$ if $b_i = c_i$ and $a_i$ is a categorical attribute;
2. $DR(b_i, c_i) = 1$ if $b_i \neq c_i$ and $a_i$ is a categorical attribute;
3. $DR(b_i, c_i) = \dfrac{|eq_i^B - eq_i^C|}{no_i}$ if $a_i$ is a numerical attribute, where $eq_i^B$ is the number assigned to the equivalence class that contains $b_i$, $eq_i^C$ is similarly defined, and $no_i$ is the total number of equivalence classes of the numerical attribute $a_i$.
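A sketch of the distance of relevance under this definition; the object representation and the numeric_class helper (which maps a numerical value to its equivalence-class number) are illustrative assumptions on our part.

def distance_of_relevance(b, c, categorical, numeric_class, no_classes):
    """DR(B, C) as the sum of per-attribute distances: 0/1 matching for
    categorical attributes, normalized class distance for numerical ones."""
    total = 0.0
    for a in b:                       # b and c map attribute names to values
        if a in categorical:
            total += 0 if b[a] == c[a] else 1
        else:
            total += abs(numeric_class(a, b[a]) - numeric_class(a, c[a])) / no_classes[a]
    return total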

Definition 2.1.3.10 (Purity ratio): In order to compare SSDR with MMR, MMeR, SDR and the other algorithms that have been proposed to handle categorical data, we developed an implementation. The traditional approach for calculating the purity of a cluster is given below.

$Purity(i) = \dfrac{\text{the number of data occurring in both the } i\text{-th cluster and its corresponding class}}{\text{the number of data in the data set}}$

$\text{Overall Purity} = \dfrac{\sum_{i=1}^{\#\text{ of clusters}} Purity(i)}{\#\text{ of clusters}}$
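A small sketch of the purity computation, reading the "corresponding class" of a cluster as the true class most frequent in it (that reading is our assumption):

from collections import Counter

def overall_purity(clusters, true_class, n_total):
    """Purity(i) = (# objects of cluster i's majority class inside cluster i)
    divided by the size of the whole data set, averaged over the clusters."""
    purities = []
    for cluster in clusters:
        counts = Counter(true_class[obj] for obj in cluster)
        purities.append(counts.most_common(1)[0][1] / n_total)
    return sum(purities) / len(purities)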

2.2 Methods

In this section we present the main algorithm of the paper, and the experimental part deals with an example.

2.2.1 Proposed Algorithm
In this section we present our algorithm, which we call SSDR. The notations and definitions of the concepts used have been discussed in the previous section.

Experimental Part
In this section we present the experimental hybrid table, which gives the characterization of various animals in terms of size, animality, color and age. In a later section we will show the efficiency of this algorithm. The experimental table is as follows:

Let us consider the value of k to be 3, that is, k = 3, which means the number of clusters will be 3. Initially the value of CNC is 1 and the value of the ParentNode is U, which indicates that the initial value of ParentNode is the whole table. So, we need to apply our algorithm three times to get the desired clusters.

Computational Part
So, initially CNC < k, and CNC ≠ 1 is false. The algorithm would then calculate the average distance of the parent node, but initially we have only one table, so there is no need to calculate the average distance; we directly calculate the roughness of each attribute relative to the rest of the attributes, which is known as relative roughness. So, when i = 1, the value of $a_i$ is SIZE, that is, $a_i$ = SIZE. This attribute has three distinct values, Small, Medium and Large. Considering α = Small first, we get X = {A1, A4} (where X is the subset of objects having the one specific value α of attribute $a_i$), and considering j = 2 (as i ≠ j) we get $a_j$ = ANIMALITY. The equivalence classes of $a_j$ are {(A1, A2), A3, A4, (A5, A6, A7)}, the lower approximation of X with respect to $a_j$ is $\underline{X}(a_j) = \emptyset$, and the upper approximation is $\overline{X}(a_j) = \{A1, A2, A4\}$. So, the roughness of $a_i$ (when $a_i$ = SIZE and α = Small) is given by

(/jX iR X a= )= 1 -| ( || ( |jja ia iX aX a= )= )= 1 -03= 1

Now, by changing the value of j (j = 3, 4) and keeping the values of $a_i$ ($a_i$ = SIZE) and α (α = Small) constant, we find the roughness of $a_i$ relative to the attributes COLOR (when j = 3) and AGE (when j = 4); these come out to be 1 and 0.

Now, to get the standard deviation of $a_i$ ($a_i$ = SIZE) when α = Small, we need the mean of these values, which is (1 + 1 + 0)/3 = 2/3. Applying the standard deviation formula, we get $\sqrt{\frac{1}{3}\left((1 - \frac{2}{3})^2 + (1 - \frac{2}{3})^2 + (0 - \frac{2}{3})^2\right)} = 0.4714$, which is stored in a variable.
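This arithmetic can be checked directly; the three relative roughness values are hard-coded from the text above.

import math

rel = [1, 1, 0]                     # roughness of SIZE = Small w.r.t. the other attributes
n = 4                               # total number of attributes
mer = sum(rel) / (n - 1)            # (1 + 1 + 0) / 3 = 0.6667
sd = math.sqrt(sum((r - mer) ** 2 for r in rel) / (n - 1))
print(round(mer, 4), round(sd, 4))  # 0.6667 0.4714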

A similar process is continued by changing the value of α (for α = Medium and α = Large) while keeping the value of $a_i$ constant. In the end we get three standard deviation values, one for each value of α, and we store these values as well. After calculating the SD (standard deviation) for each α, we take the minimum of these different values of α and store it in another variable.

The above procedure is continued for each $a_i$ (for $a_i$ = ANIMALITY, COLOR and AGE when i = 2, 3 and 4), and the corresponding values are stored. After completing this step, we take those minimum values for the next calculation: we apply SD (standard deviation) to the minimum values to get the splitting attribute. If the value of SD does not match any of the minimum values, then we take the nearest minimum value as determining the splitting attribute, and we perform a binary split, that is, we divide the table into two clusters.
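The text leaves some details of this splitting step open, so the following is only one possible reading, sketched under the assumption that the attribute whose minimum standard deviation lies nearest to the standard deviation of all the minima is chosen:

import math

def choose_splitting_attribute(min_sd):
    """min_sd maps each attribute to the minimum SD over its values alpha;
    pick the attribute whose minimum is nearest to the SD of those minima."""
    vals = list(min_sd.values())
    mean = sum(vals) / len(vals)
    sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / (len(vals) - 1))
    return min(min_sd, key=lambda a: abs(min_sd[a] - sd))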

Suppose that after splitting we get two clusters C1 and C2, where C1 contains 2 elements and C2 contains 5 elements. We now need to calculate the average distance in order to choose the cluster table for further calculation. This can be done by applying the distance of relevance formula.

Let us see how we calculate DR (distance of relevance). For example, let us take the two tuples A4 and A6, which are as follows:

For DR(b_age, c_age), however, we need to follow a different method, as AGE is a numerical attribute. To calculate the DR of a numerical attribute, we exclude that numerical attribute from the table and find the average number of equivalence classes over all the remaining attributes. So, in this case we first exclude the attribute AGE and then find the average equivalence class count.

So, the average number of equivalence classes is (3 + 4 + 2)/3 = 3. In this case we get an integer value, but if we get a fraction we take either its floor or its ceiling value.

Now we need to sort the values of the attribute AGE. After sorting in ascending order we get {5, 7, 9, 16, 25, 28, 30}. We now distribute these numbers into three sets, as follows:

Set 1 = {5, 7}
Set 2 = {9, 16}
Set 3 = {25, 28, 30}

Now we calculate DR(b_age, c_age). In our case b_age = 30 and c_age = 5, so we put 3 and 1 in place of 30 and 5, as 30 belongs to Set 3 and 5 belongs to Set 1. By Definition 2.1.3.9, this gives DR(b_age, c_age) = |3 − 1|/3 = 2/3.
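The same numbers checked in code, with the binning following the three sets above:

sets = [{5, 7}, {9, 16}, {25, 28, 30}]    # equivalence classes of AGE

def age_class(value):
    """1-based class number of an AGE value."""
    return next(i + 1 for i, s in enumerate(sets) if value in s)

dr_age = abs(age_class(30) - age_class(5)) / len(sets)
print(dr_age)                             # |3 - 1| / 3 = 0.666...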

In this fashion we apply the algorithm until we get the desired number of clusters. In our case we stop when we get C3, because the total number of clusters required is 3.
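Putting the pieces together, the overall control flow described in this section can be sketched as follows; average_dr and binary_split stand in for the average distance-of-relevance and splitting steps above, and the choice of the cluster with the larger average distance as the next ParentNode is our assumption.

def ssdr_clustering(universe, k, average_dr, binary_split):
    """Repeat binary splitting until k clusters are obtained."""
    clusters = [set(universe)]      # ParentNode starts as the whole table (CNC = 1)
    while len(clusters) < k:
        parent = max(clusters, key=average_dr)  # most dispersed cluster
        clusters.remove(parent)
        clusters.extend(binary_split(parent))   # binary split on the splitting attribute
    return clusters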

RESULTS AND DISCUSSION

In this section we present the results of testing on the ZOO dataset, which was also used for the MMR, MMeR and SDR algorithms.

The ZOO data set has 18 attributes, of which 15 are Boolean attributes, 2 are numeric and 1 is the animal name, and it has 101 objects. The objects are divided into seven classes, so we stop when we obtain seven clusters. Taking the ZOO dataset as input, we obtained the output shown in Table 3.

3.2.1 Comparison of SSDR with MMeR, MMR, SDR and Algorithms based on FUZZY Set Theory
Until the development of MMR, the only algorithms which aimed at handling uncertainty in the clustering process were based upon fuzzy set theory [26]. These algorithms based on fuzzy set theory include fuzzy K-modes and fuzzy centroids. The K-modes algorithm replaces the means of the clusters (K-means) with modes and uses a frequency-based method to update the modes in the clustering process to minimize the clustering cost function. Fuzzy K-modes generates a fuzzy partition matrix from categorical data. By assigning a confidence to objects in different clusters, the core and boundary objects of the clusters are determined for clustering purposes. The fuzzy centroids algorithm uses the concepts of fuzzy set theory to derive fuzzy centroids and create clusters of objects with categorical attributes. MMR, MMeR and SDR, in contrast, are built on rough set concepts; in terms of efficiency, MMeR is more efficient than MMR and less efficient than SDR, while SSDR is more efficient than all of the others.

3.2.2 Empirical Analysis
The earlier algorithms for classification with uncertainty, K-modes, fuzzy K-modes and fuzzy centroids on one hand, and MMR, MMeR and SDR on the other, were applied to the ZOO data set. Table 4 below provides the comparison of purity for these algorithms on this data set. It is observed that SSDR has a better purity than all the other algorithms when applied to the zoo data set.

As mentioned earlier, all the fuzzy set based algorithms face a challenging problem, namely the problem of stability. These algorithms require great effort to adjust the parameter which is used to control the fuzziness of membership of each data point. At each value of this parameter, the algorithms need to be run multiple times to achieve a stable solution.

MMR, MMeR and SDR, on the other hand, have no such problem. SSDR retains the advantages of MMR, MMeR and SDR over the other algorithms mentioned above, but it has a higher purity than MMR, MMeR and SDR, which establishes its superiority over them.

Table 4

*In this case we obtained the same purity ratio as SDR, but as the standard deviation has better central tendency than the mean or minimum, it will give better results for other data sets. It has been checked manually on a small data set that it gives much better results than MMR, MMeR and SDR.

CONCLUSION

In this paper, we proposed a new algorithm called SSDR, which is more efficient than most of the earlier algorithms, including MMR, MMeR and SDR, the most recent algorithms developed in this direction. It handles uncertain data using rough set theory. Firstly, we have provided a method in which both numerical and categorical data can be handled; secondly, by using the distance of relevance to choose the table to be clustered, we obtain much better results than MMR, which chooses according to the number of objects. The comparison of purity ratios shows its superiority over MMeR. Future enhancements of this algorithm may be possible by considering hybrid techniques like rough-fuzzy clustering or fuzzy-rough clustering.