What is Cluster Analysis?
- Cluster: a collection of data objects
  - similar (or related) to one another within the same group
  - dissimilar (or unrelated) to the objects in other groups
- Cluster analysis (or clustering, data segmentation, ...): finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
- Unsupervised learning: no predefined classes (i.e., learning by observation, vs. learning by examples in supervised learning)
- Typical applications
  - As a stand-alone tool to get insight into the data distribution
  - As a preprocessing step for other algorithms

Applications of Cluster Analysis
- Data reduction
  - Summarization: preprocessing for regression, PCA, classification, and association analysis
  - Compression: image processing, vector quantization
- Hypothesis generation and testing
- Prediction based on groups: cluster, then find characteristics/patterns for each group
- Finding K-nearest neighbors: localizing search to one or a small number of clusters
- Outlier detection: outliers are often viewed as those "far away" from any cluster

Clustering: Application Examples
- Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus, and species
- Information retrieval: document clustering
- Land use: identification of areas of similar land use in an earth observation database
- Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults
- Climate: understanding earth climate, finding patterns of atmospheric and ocean behavior
- Economic science: market research

Basic Steps to Develop a Clustering Task
- Feature selection
  - Select information concerning the task of interest
  - Minimal information redundancy
- Proximity measure: similarity of two feature vectors
- Clustering criterion: expressed via a cost function or some rules
- Clustering algorithms: choice of algorithms
- Validation of the results: validation test (also, clustering tendency test)
- Interpretation of the results: integration with applications

Quality: What Is Good Clustering?
- A good clustering method will produce high-quality clusters
  - high intra-class similarity: cohesive within clusters
  - low inter-class similarity: distinctive between clusters
- The quality of a clustering method depends on
  - the similarity measure used by the method,
  - its implementation, and
  - its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering
- Dissimilarity/similarity metric
  - Similarity is expressed in terms of a distance function, typically a metric: d(i, j)
  - The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
  - Weights should be associated with different variables based on applications and data semantics
- Quality of clustering
  - There is usually a separate "quality" function that measures the "goodness" of a cluster
  - It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective

Requirements and Challenges
- Scalability: clustering all the data instead of only samples
- Ability to deal with different types of attributes: numerical, binary, categorical, ordinal, linked, and mixtures of these
- Constraint-based clustering
  - User may give inputs on constraints
  - Use domain knowledge to determine input parameters
- Interpretability and usability
- Others
  - Discovery of clusters with arbitrary shape
  - Ability to deal with noisy data
  - Incremental clustering and insensitivity to input order
  - High dimensionality

Major Clustering Approaches (II)
- Model-based
  - A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model
  - Typical methods: EM, SOM, COBWEB
- Frequent pattern-based
  - Based on the analysis of frequent patterns
  - Typical methods: p-Cluster
- User-guided or constraint-based
  - Clustering by considering user-specified or application-specific constraints
  - Typical methods: COD (obstacles), constrained clustering
- Link-based clustering
  - Objects are often linked together in various ways
  - Massive links can be used to cluster objects: SimRank, LinkClus

Partitioning Algorithms: Basic Concept
- Partitioning method: partition a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where $c_i$ is the centroid or medoid of cluster $C_i$):

  $E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)^2$

- Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: the k-means and k-medoids algorithms
  - k-means (MacQueen'67, Lloyd'57/'82): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method
- Given k, the k-means algorithm is implemented in four steps (see the sketch below):
  1. Partition objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when the assignment does not change
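A minimal NumPy sketch of these four steps (a Lloyd-style iteration). The function name `k_means`, the random choice of initial seed points, and the convergence test on centroid movement are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k objects as initial seed points (an arbitrary initial partition)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1).argmin(axis=1)
        # Step 2: recompute each centroid as the mean point of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: stop when the centroids (and hence the assignment) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```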

An Example of K-Means Clustering
- (Figure: the initial data set; arbitrarily partition objects into k groups; update the cluster centroids; reassign objects; loop if needed)
- Procedure:
  - Partition objects into k nonempty subsets
  - Repeat
    - Compute the centroid (i.e., mean point) for each partition
    - Assign each object to the cluster of its nearest centroid
    - Update the cluster centroids
  - Until no change

Comments on the K-Means Method
- Strength: efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally k, t << n
  - For comparison, PAM is O(k(n-k)^2) per iteration, and CLARA is O(ks^2 + k(n-k))
- Comment: often terminates at a local optimum
- Weaknesses
  - Applicable only to objects in a continuous n-dimensional space
    - Use the k-modes method for categorical data
    - In comparison, k-medoids can be applied to a wide range of data
  - Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
  - Sensitive to noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes

Variations of the K-Means Method
- Most variants of k-means differ in
  - Selection of the initial k means
  - Dissimilarity calculations
  - Strategies to calculate cluster means
- Handling categorical data: k-modes
  - Replacing means of clusters with modes
  - Using new dissimilarity measures to deal with categorical objects
  - Using a frequency-based method to update modes of clusters
- A mixture of categorical and numerical data: the k-prototype method

What Is the Problem of the K-Means Method?
- The k-means algorithm is sensitive to outliers!
  - An object with an extremely large value may substantially distort the distribution of the data
- K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster

The K-Medoids Clustering Method
- K-medoids clustering: find representative objects (medoids) in clusters
- PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
  - Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
  - PAM works effectively for small data sets, but does not scale well to large data sets (due to the computational complexity)
- Efficiency improvements on PAM
  - CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
  - CLARANS (Ng & Han, 1994): randomized re-sampling

AGNES (Agglomerative Nesting)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical packages, e.g., Splus
- Uses the single-link method and the dissimilarity matrix
- Merges the nodes that have the least dissimilarity
- Goes on in a non-descending fashion
- Eventually all nodes belong to the same cluster

Dendrogram: Shows How Clusters are Merged
- Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
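As a sketch of this cut-the-dendrogram idea, the example below builds a single-link hierarchy with SciPy and extracts a flat clustering at a chosen level; the two-blob toy data and the choice of two clusters are made up for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(8, 1, (20, 2))])  # two loose blobs

Z = linkage(X, method="single")                   # AGNES-style agglomerative merging, single link
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram so that 2 clusters remain
print(labels)
```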

DIANA (Divisive Analysis)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages, e.g., Splus
- Inverse order of AGNES
- Eventually each node forms a cluster on its own

Distance between Clusters
- Single link: smallest distance between an element in one cluster and an element in the other, i.e., $dist(K_i, K_j) = \min_{t_{ip} \in K_i,\, t_{jq} \in K_j} dist(t_{ip}, t_{jq})$
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., $dist(K_i, K_j) = \max_{t_{ip} \in K_i,\, t_{jq} \in K_j} dist(t_{ip}, t_{jq})$
- Average: average distance between an element in one cluster and an element in the other, i.e., $dist(K_i, K_j) = \mathrm{avg}_{t_{ip} \in K_i,\, t_{jq} \in K_j}\, dist(t_{ip}, t_{jq})$
- Centroid: distance between the centroids of two clusters, i.e., $dist(K_i, K_j) = dist(C_i, C_j)$
- Medoid: distance between the medoids of two clusters, i.e., $dist(K_i, K_j) = dist(M_i, M_j)$
  - Medoid: a chosen, centrally located object in the cluster
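A small NumPy helper that evaluates these inter-cluster distances for two point sets; the function and key names are illustrative (the medoid variant is omitted since it needs a chosen representative per cluster):

```python
import numpy as np

def cluster_distances(A, B):
    """Single-link, complete-link, average, and centroid distances between point sets A and B."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)   # all pairwise distances
    return {
        "single":   float(d.min()),    # smallest cross-cluster distance
        "complete": float(d.max()),    # largest cross-cluster distance
        "average":  float(d.mean()),   # average cross-cluster distance
        "centroid": float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))),
    }
```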

Centroid, Radius and Diameter of a Cluster (for numerical data sets)
- Centroid: the "middle" of a cluster: $C_m = \frac{1}{N}\sum_{i=1}^{N} t_i$
- Radius: square root of the average squared distance from any point of the cluster to its centroid: $R_m = \sqrt{\frac{\sum_{i=1}^{N}(t_i - C_m)^2}{N}}$
- Diameter: square root of the average squared distance between all pairs of points in the cluster: $D_m = \sqrt{\frac{\sum_{i=1}^{N}\sum_{j=1}^{N}(t_i - t_j)^2}{N(N-1)}}$

Extensions to Hierarchical Clustering
- Major weaknesses of agglomerative clustering methods
  - Can never undo what was done previously
  - Do not scale well: time complexity of at least O(n^2), where n is the number of total objects
- Integration of hierarchical and distance-based clustering
  - BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
  - CHAMELEON (1999): hierarchical clustering using dynamic modeling

BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)
- Zhang, Ramakrishnan & Livny, SIGMOD'96
- Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
  - Phase 1: scan the DB to build an initial in-memory CF-tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
  - Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
- Weakness: handles only numeric data, and is sensitive to the order of the data records

CF-Tree in BIRCH
- Clustering feature:
  - Summary of the statistics for a given subcluster: the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view
  - Registers crucial measurements for computing clusters and utilizes storage efficiently
- A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
  - A nonleaf node in the tree has descendants or "children"
  - The nonleaf nodes store sums of the CFs of their children
- A CF-tree has two parameters
  - Branching factor: max # of children
  - Threshold: max diameter of sub-clusters stored at the leaf nodes
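A toy sketch of a clustering feature kept as the triple (N, LS, SS), i.e., the 0th, 1st, and 2nd moments above, together with the additivity used when sub-clusters are merged; the class name and the centroid/radius formulas derived from the moments are my own illustration, not BIRCH's actual data structures:

```python
import numpy as np

class ClusteringFeature:
    """CF = (N, LS, SS): count, linear sum, and sum of squared norms of a sub-cluster."""
    def __init__(self, points):
        pts = np.atleast_2d(np.asarray(points, dtype=float))
        self.n = len(pts)                    # 0th moment
        self.ls = pts.sum(axis=0)            # 1st moment
        self.ss = float((pts ** 2).sum())    # 2nd moment

    def merge(self, other):
        """CF additivity: the CF of two merged sub-clusters is the component-wise sum."""
        merged = object.__new__(ClusteringFeature)
        merged.n, merged.ls, merged.ss = self.n + other.n, self.ls + other.ls, self.ss + other.ss
        return merged

    @property
    def centroid(self):
        return self.ls / self.n

    @property
    def radius(self):
        # sqrt of the average squared distance from the members to the centroid,
        # computable from the moments alone: SS/N - ||LS/N||^2
        return float(np.sqrt(max(self.ss / self.n - (self.centroid ** 2).sum(), 0.0)))
```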

The BIRCH Algorithm
- Cluster diameter: $D = \sqrt{\frac{\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i - x_j)^2}{n(n-1)}}$
- For each point in the input
  - Find the closest leaf entry
  - Add the point to the leaf entry and update the CF
  - If the entry diameter > max_diameter, then split the leaf, and possibly the parents
- The algorithm is O(n)
- Concerns
  - Sensitive to the insertion order of data points
  - Since the size of leaf nodes is fixed, the clusters may not be very natural
  - Clusters tend to be spherical given the radius and diameter measures

CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)
- CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
- Measures the similarity based on a dynamic model
  - Two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters
- Graph-based, two-phase algorithm
  1. Use a graph-partitioning algorithm: cluster objects into a large number of relatively small sub-clusters
  2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters

KNN Graphs & Interconnectivity
- k-nearest-neighbor graphs are built from the original data (e.g., in 2-D)
- $EC_{\{C_i, C_j\}}$: the absolute interconnectivity between $C_i$ and $C_j$: the sum of the weights of the edges that connect vertices in $C_i$ to vertices in $C_j$
- Internal interconnectivity of a cluster $C_i$: the size of its min-cut bisector $EC_{C_i}$ (i.e., the weighted sum of edges that partition the graph into two roughly equal parts)
- Relative interconnectivity (RI):

  $RI(C_i, C_j) = \frac{|EC_{\{C_i, C_j\}}|}{\tfrac{1}{2}\left(|EC_{C_i}| + |EC_{C_j}|\right)}$

Relative Closeness & Merge of Sub-Clusters
- Relative closeness between a pair of clusters $C_i$ and $C_j$: the absolute closeness between $C_i$ and $C_j$ normalized w.r.t. the internal closeness of the two clusters $C_i$ and $C_j$:

  $RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{|C_i|}{|C_i|+|C_j|}\,\bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i|+|C_j|}\,\bar{S}_{EC_{C_j}}}$

  where $\bar{S}_{EC_{C_i}}$ and $\bar{S}_{EC_{C_j}}$ are the average weights of the edges that belong to the min-cut bisectors of clusters $C_i$ and $C_j$, respectively, and $\bar{S}_{EC_{\{C_i, C_j\}}}$ is the average weight of the edges that connect vertices in $C_i$ to vertices in $C_j$
- Merge of sub-clusters:
  - Merge only those pairs of clusters whose RI and RC are both above some user-specified thresholds
  - Merge those maximizing the function that combines RI and RC

Overall Framework of CHAMELEON
- Data set → construct a sparse k-NN graph (p and q are connected if q is among the top k closest neighbors of p) → partition the graph into many small sub-clusters → merge partitions → final clusters
- Merge criteria:
  - Relative interconnectivity: connectivity of c1 and c2 over their internal connectivity
  - Relative closeness: closeness of c1 and c2 over their internal closeness

Probabilistic Hierarchical Clustering
- Algorithmic hierarchical clustering
  - Nontrivial to choose a good distance measure
  - Hard to handle missing attribute values
  - Optimization goal not clear: heuristic, local search
- Probabilistic hierarchical clustering
  - Uses probabilistic models to measure distances between clusters
  - Generative model: regard the set of data objects to be clustered as a sample of the underlying data generation mechanism to be analyzed
  - Easy to understand, same efficiency as algorithmic agglomerative clustering methods, can handle partially observed data
  - In practice, assume the generative models adopt common distribution functions, e.g., Gaussian or Bernoulli distributions, governed by parameters

Generative Model
- Given a set of 1-D points X = {x_1, ..., x_n} for clustering analysis, assume they are generated by a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$
- The probability that a point $x_i \in X$ is generated by the model:

  $P(x_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$

- The likelihood that X is generated by the model:

  $L(\mathcal{N}(\mu, \sigma^2) : X) = \prod_{i=1}^{n} P(x_i \mid \mu, \sigma^2)$

- The task of learning the generative model: find the parameters $\mu$ and $\sigma^2$ that maximize the likelihood:

  $\mathcal{N}(\mu_0, \sigma_0^2) = \arg\max_{\mu, \sigma^2} L(\mathcal{N}(\mu, \sigma^2) : X)$
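A short sketch of this model on made-up 1-D points: the maximum-likelihood parameters of a single Gaussian are just the sample mean and the (biased) sample variance, and the log-likelihood is the quantity the clustering quality on the next slide is built from:

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma2):
    """Log of prod_i P(x_i | mu, sigma^2) for 1-D points under N(mu, sigma^2)."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)))

x = np.array([1.0, 1.2, 0.8, 5.0, 5.3, 4.9])      # toy data
mu_hat, var_hat = x.mean(), x.var()               # maximum-likelihood estimates of mu, sigma^2
print(mu_hat, var_hat, gaussian_log_likelihood(x, mu_hat, var_hat))
```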

A Probabilistic Hierarchical Clustering Algorithm
- For a set of objects partitioned into m clusters $C_1, \ldots, C_m$, the quality can be measured by

  $Q(\{C_1, \ldots, C_m\}) = \prod_{i=1}^{m} P(C_i)$

  where P() is the maximum likelihood
- If we merge two clusters $C_{j_1}$ and $C_{j_2}$ into a cluster $C_{j_1} \cup C_{j_2}$, then the change in quality of the overall clustering is

  $Q(\{C_1,\ldots,C_m\}) \cdot \left(\frac{P(C_{j_1} \cup C_{j_2})}{P(C_{j_1})\,P(C_{j_2})} - 1\right)$

- Distance between clusters $C_i$ and $C_j$:

  $dist(C_i, C_j) = -\log \frac{P(C_i \cup C_j)}{P(C_i)\,P(C_j)}$

- If $dist(C_i, C_j) < 0$, merge $C_i$ and $C_j$
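A minimal sketch of this merge test for 1-D clusters, assuming the Gaussian generative model of the previous slide; working in log space, dist(C1, C2) = log P(C1) + log P(C2) - log P(C1 ∪ C2), and the merge criterion is dist < 0. The variance floor is an ad-hoc safeguard for tiny clusters and not part of the method as stated:

```python
import numpy as np

def cluster_log_likelihood(points):
    """Maximum log-likelihood of a cluster under a single fitted 1-D Gaussian."""
    pts = np.asarray(points, dtype=float)
    mu, var = pts.mean(), pts.var() + 1e-9        # small floor avoids a zero variance
    return float(np.sum(-0.5 * np.log(2 * np.pi * var) - (pts - mu) ** 2 / (2 * var)))

def merge_distance(c1, c2):
    """dist(C1, C2) = -log( P(C1 u C2) / (P(C1) P(C2)) ), evaluated with log-likelihoods."""
    return cluster_log_likelihood(c1) + cluster_log_likelihood(c2) \
           - cluster_log_likelihood(np.concatenate([c1, c2]))

# merge C1 and C2 only if merge_distance(C1, C2) < 0
print(merge_distance(np.array([1.0, 1.1, 0.9]), np.array([1.2, 1.05])))
```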

Density-Based Clustering: Basic Concepts
- Two parameters:
  - Eps: maximum radius of the neighbourhood
  - MinPts: minimum number of points in an Eps-neighbourhood of that point
- $N_{Eps}(q) = \{p \in D \mid dist(p, q) \le Eps\}$
- Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  - p belongs to $N_{Eps}(q)$, and
  - the core point condition holds: $|N_{Eps}(q)| \ge MinPts$
- (Figure: example with MinPts = 5, Eps = 1 cm, showing points p and q)

Density-Reachable and Density-Connected
- Density-reachable: a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points $p_1, \ldots, p_n$, with $p_1 = q$ and $p_n = p$, such that $p_{i+1}$ is directly density-reachable from $p_i$
- Density-connected: a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts

DBSCAN: Density-Based Spatial Clustering of Applications with Noise
- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise
- (Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5)

DBSCAN: The Algorithm (see the sketch below)
- Arbitrarily select a point p
- Retrieve all points density-reachable from p w.r.t. Eps and MinPts
  - If p is a core point, a cluster is formed
  - If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
- Continue the process until all of the points have been processed
- If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects; otherwise, the complexity is O(n^2)
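A compact, brute-force sketch of this procedure in NumPy (O(n^2) neighborhood queries, i.e., without a spatial index); label -1 stands for noise, and cluster expansion follows the density-reachability chain described above. Function and variable names are my own:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)                     # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    # brute-force Eps-neighborhoods (a spatial index would reduce this to ~O(n log n))
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    cluster_id = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:         # not a core point (may later become a border point)
            continue
        labels[i] = cluster_id                  # start a new cluster at core point i
        seeds = list(neighbors[i])
        while seeds:                            # expand via density-reachability
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:     # j is also a core point
                    seeds.extend(neighbors[j])
            if labels[j] == -1:                      # core or border point joins the cluster
                labels[j] = cluster_id
        cluster_id += 1
    return labels
```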

OPTICS: A Cluster-Ordering Method (1999)
- OPTICS: Ordering Points To Identify the Clustering Structure
  - Ankerst, Breunig, Kriegel, and Sander (SIGMOD'99)
- Produces a special order of the database w.r.t. its density-based clustering structure
- This cluster ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
- Good for both automatic and interactive cluster analysis, including finding the intrinsic clustering structure
- Can be represented graphically or using visualization techniques

OPTICS: Some Extensions from DBSCAN
- Index-based: k = # of dimensions, N = # of points
  - Complexity: O(N log N)
- Core distance of an object p: the smallest value ε' such that the ε'-neighborhood of p has at least MinPts objects
  - Let $N_\varepsilon(p)$ be the ε-neighborhood of p, where ε is a distance value
  - core-distance_{ε,MinPts}(p) = undefined if $|N_\varepsilon(p)|$ < MinPts; MinPts-distance(p) otherwise
- Reachability distance of object p from core object q: the minimum radius value that makes p density-reachable from q
  - reachability-distance_{ε,MinPts}(p, q) = undefined if q is not a core object; max(core-distance(q), dist(q, p)) otherwise
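A direct transcription of these two definitions, assuming a precomputed pairwise distance matrix where `dist[q]` is the row of distances from object q (so q's own distance of 0 counts toward its ε-neighborhood); `None` plays the role of "undefined":

```python
import numpy as np

def core_distance(dist_row, eps, min_pts):
    """Smallest radius at which the eps-neighborhood of the object holds >= min_pts objects."""
    neigh = np.sort(dist_row[dist_row <= eps])     # includes the object itself at distance 0
    return None if len(neigh) < min_pts else float(neigh[min_pts - 1])   # MinPts-distance

def reachability_distance(p, q, dist, eps, min_pts):
    """max(core-distance(q), dist(q, p)); undefined if q is not a core object."""
    cd = core_distance(dist[q], eps, min_pts)
    return None if cd is None else max(cd, float(dist[q, p]))
```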

DENCLUE: Using Statistical Density Functions
- DENsity-based CLUstEring by Hinneburg & Keim (KDD'98)
- Uses statistical density functions:
  - Influence of y on x: $f_{Gauss}(x, y) = e^{-\frac{d(x,y)^2}{2\sigma^2}}$
  - Total influence on x (overall density): $f^D_{Gauss}(x) = \sum_{i=1}^{N} e^{-\frac{d(x,x_i)^2}{2\sigma^2}}$
  - Gradient of the density at x in the direction of $x_i$: $\nabla f^D_{Gauss}(x, x_i) = \sum_{i=1}^{N} (x_i - x)\, e^{-\frac{d(x,x_i)^2}{2\sigma^2}}$
- Major features
  - Solid mathematical foundation
  - Good for data sets with large amounts of noise
  - Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
  - Significantly faster than existing algorithms (e.g., DBSCAN)
  - But needs a large number of parameters

DENCLUE: Technical Essence
- Uses grid cells, but only keeps information about grid cells that actually contain data points and manages these cells in a tree-based access structure
- Influence function: describes the impact of a data point within its neighborhood
- The overall density of the data space can be calculated as the sum of the influence functions of all data points
- Clusters can be determined mathematically by identifying density attractors
  - Density attractors are local maxima of the overall density function
- Center-defined clusters: assign to each density attractor the points density-attracted to it
- Arbitrarily shaped clusters: merge density attractors that are connected through paths of high density (> threshold)
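A sketch of these ideas with a Gaussian influence function: the overall density is the sum of influences, and a density attractor can be found by simple gradient ascent. The step size, iteration count, and function names are illustrative choices, not DENCLUE's actual procedure:

```python
import numpy as np

def influence(x, y, sigma=1.0):
    """Gaussian influence of data point y on position x."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def density(x, data, sigma=1.0):
    """Overall density at x: sum of the influences of all data points."""
    return float(sum(influence(x, y, sigma) for y in data))

def hill_climb(x, data, sigma=1.0, step=0.1, iters=200):
    """Gradient ascent from x toward a density attractor (a local maximum of the density)."""
    x = np.asarray(x, dtype=float).copy()
    data = np.asarray(data, dtype=float)
    for _ in range(iters):
        w = np.array([influence(x, y, sigma) for y in data])
        grad = (w[:, None] * (data - x)).sum(axis=0) / sigma ** 2   # gradient of the density
        x += step * grad
    return x
```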

STING: A Statistical Information Grid Approach
- Wang, Yang and Muntz (VLDB'97)
- The spatial area is divided into rectangular cells
- There are several levels of cells corresponding to different levels of resolution

The STING Clustering Method
- Each cell at a high level is partitioned into a number of smaller cells at the next lower level
- Statistical information for each cell is calculated and stored beforehand and is used to answer queries
- Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells
  - count, mean, standard deviation, min, max
  - type of distribution: normal, uniform, etc.
- Use a top-down approach to answer spatial data queries
  - Start from a pre-selected layer, typically with a small number of cells
  - For each cell in the current level, compute the confidence interval

STING Algorithm and Its Analysis
- Remove the irrelevant cells from further consideration
- When the current layer has been examined, proceed to the next lower level
- Repeat this process until the bottom layer is reached
- Advantages
  - Query-independent, easy to parallelize, incremental update
  - O(K), where K is the number of grid cells at the lowest level
- Disadvantages
  - All cluster boundaries are either horizontal or vertical; no diagonal boundary is detected

CLIQUE (Clustering In QUEst)
- Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
- Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
- CLIQUE can be considered both density-based and grid-based
  - It partitions each dimension into the same number of equal-length intervals
  - It partitions an m-dimensional data space into non-overlapping rectangular units
  - A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter
  - A cluster is a maximal set of connected dense units within a subspace

CLIQUE: The Major Steps
1. Partition the data space and find the number of points that lie inside each cell of the partition
2. Identify the subspaces that contain clusters using the Apriori principle
3. Identify clusters
   - Determine dense units in all subspaces of interest
   - Determine connected dense units in all subspaces of interest
4. Generate a minimal description for the clusters
   - Determine the maximal regions that cover a cluster of connected dense units, for each cluster
   - Determine the minimal cover for each cluster
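A sketch of the first step only: partitioning each single dimension into ξ equal-length intervals and keeping the dense 1-D units, which seed the Apriori-style search over higher-dimensional subspaces. The parameter names `xi` (ξ) and `tau` (τ) follow common CLIQUE descriptions, but the code itself is illustrative:

```python
import numpy as np

def dense_units_1d(X, xi=10, tau=0.05):
    """For each dimension, return the indices of the intervals whose point fraction exceeds tau."""
    n, d = X.shape
    dense = {}
    for dim in range(d):
        lo, hi = X[:, dim].min(), X[:, dim].max()
        # map each value to one of xi equal-length intervals on [lo, hi]
        bins = np.clip(((X[:, dim] - lo) / (hi - lo + 1e-12) * xi).astype(int), 0, xi - 1)
        counts = np.bincount(bins, minlength=xi)
        dense[dim] = [u for u in range(xi) if counts[u] / n > tau]
    return dense
```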

Strengths and Weaknesses of CLIQUE
- Strengths
  - Automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
  - Insensitive to the order of records in the input and does not presume any canonical data distribution
  - Scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
- Weakness
  - The accuracy of the clustering result may be degraded at the expense of the simplicity of the method

Determine the Number of Clusters
- Empirical method
  - # of clusters k ≈ √(n/2) for a dataset of n points, e.g., n = 200 gives k = 10
- Elbow method
  - Use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters
- Cross-validation method
  - Divide a given data set into m parts
  - Use m - 1 parts to obtain a clustering model
  - Use the remaining part to test the quality of the clustering
    - E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set
  - For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k's, and find the # of clusters that fits the data best
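A sketch of the elbow method using scikit-learn's k-means: inspect (or plot) the within-cluster sum of squares for increasing k and look for the bend in the curve; the blob data and the range of k are made up for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# within-cluster sum of squares (inertia) for k = 1..10; the "elbow" suggests a good k
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in range(1, 11)]
print(wcss)
```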

Measuring Clustering Quality
- Three kinds of measures: external, internal, and relative
  - External: supervised, employs criteria not inherent to the dataset
    - Compare a clustering against prior or expert-specified knowledge (i.e., the ground truth) using a clustering quality measure
  - Internal: unsupervised, criteria derived from the data itself
    - Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are, e.g., the silhouette coefficient
  - Relative: directly compare different clusterings, usually those obtained via different parameter settings for the same algorithm
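As an internal-measure sketch, the silhouette coefficient mentioned above can be computed with scikit-learn; the toy data and the choice of k are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))   # close to 1: compact, well-separated clusters
```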

Measuring Clustering Quality: External Methods
- Clustering quality measure: Q(C, T), for a clustering C given the ground truth T
- Q is good if it satisfies the following 4 essential criteria
  - Cluster homogeneity: the purer, the better
  - Cluster completeness: objects belonging to the same category in the ground truth should be assigned to the same cluster
  - Rag bag: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (i.e., a "miscellaneous" or "other" category)
  - Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces

Entropy-Based Measure (I): Conditional Entropy
- Entropy of clustering C: $H(C) = -\sum_{i=1}^{r} p_{C_i} \log p_{C_i}$, where $p_{C_i} = |C_i|/n$
- Entropy of partitioning T: $H(T) = -\sum_{j=1}^{k} p_{T_j} \log p_{T_j}$
- Entropy of T w.r.t. cluster $C_i$: $H(T \mid C_i) = -\sum_{j=1}^{k} \frac{n_{ij}}{n_i} \log \frac{n_{ij}}{n_i}$
- Conditional entropy of T w.r.t. clustering C: $H(T \mid C) = \sum_{i=1}^{r} \frac{n_i}{n}\, H(T \mid C_i)$
- The more a cluster's members are split into different partitions, the higher the conditional entropy
- For a perfect clustering, the conditional entropy value is 0, while the worst possible conditional entropy value is log k

Entropy-Based Measure (II): Normalized Mutual Information (NMI)
- Mutual information: quantifies the amount of shared information between the clustering C and the partitioning T:

  $I(C, T) = \sum_{i=1}^{r} \sum_{j=1}^{k} p_{ij} \log \frac{p_{ij}}{p_{C_i}\, p_{T_j}}$

- It measures the dependency between the observed joint probability $p_{ij}$ of C and T and the expected joint probability $p_{C_i}\, p_{T_j}$ under the independence assumption
- When C and T are independent, $p_{ij} = p_{C_i}\, p_{T_j}$ and I(C, T) = 0; however, there is no upper bound on the mutual information
- Normalized mutual information (NMI):

  $NMI(C, T) = \frac{I(C, T)}{\sqrt{H(C)\, H(T)}}$

- Value range of NMI: [0, 1]; a value close to 1 indicates a good clustering
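A small sketch that computes the conditional entropy H(T | C) and the NMI directly from the formulas above, starting from two label vectors; the function name and the use of natural logarithms are my own choices:

```python
import numpy as np

def conditional_entropy_and_nmi(cluster_labels, truth_labels):
    """H(T|C) and NMI from a contingency table of clustering C vs. ground truth T."""
    _, c = np.unique(cluster_labels, return_inverse=True)
    _, t = np.unique(truth_labels, return_inverse=True)
    n = len(c)
    cont = np.zeros((c.max() + 1, t.max() + 1))
    for i, j in zip(c, t):
        cont[i, j] += 1
    p_ij = cont / n                              # joint probabilities
    p_c = p_ij.sum(axis=1, keepdims=True)        # cluster marginals p_Ci
    p_t = p_ij.sum(axis=0, keepdims=True)        # partition marginals p_Tj
    nz = p_ij > 0
    h_c = -np.sum(p_c * np.log(p_c))
    h_t = -np.sum(p_t * np.log(p_t))
    h_t_given_c = -np.sum(p_ij[nz] * np.log((p_ij / p_c)[nz]))
    mi = np.sum(p_ij[nz] * np.log((p_ij / (p_c * p_t))[nz]))
    return h_t_given_c, mi / np.sqrt(h_c * h_t)

print(conditional_entropy_and_nmi([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2]))
```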

Summary
- Cluster analysis groups objects based on their similarity and has wide applications
- Measures of similarity can be computed for various types of data
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
- K-means and k-medoids algorithms are popular partitioning-based clustering algorithms
- BIRCH and CHAMELEON are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms
- DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
- STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm
- The quality of clustering results can be evaluated in various ways

PAM (Partitioning Around Medoids) (1987)
- PAM (Kaufman and Rousseeuw, 1987), built into Splus
- Uses real objects to represent the clusters (see the sketch below)
  1. Select k representative objects arbitrarily
  2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih
  3. For each pair of i and h,
     - If TC_ih < 0, i is replaced by h
     - Then assign each non-selected object to the most similar representative object
  4. Repeat steps 2-3 until there is no change
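A brute-force sketch of this swap-based search: the cost of a medoid set is the total distance of all objects to their closest medoid, and a swap (i, h) is accepted when it lowers that cost (TC_ih < 0). The greedy acceptance and random initialization are simplifications of my own:

```python
import numpy as np

def total_cost(dist, medoids):
    """Sum of distances from every object to its closest medoid."""
    return dist[:, medoids].min(axis=1).sum()

def pam(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = list(rng.choice(n, size=k, replace=False))     # arbitrary initial representatives
    best = total_cost(dist, medoids)
    for _ in range(max_iter):
        improved = False
        for i in list(medoids):                 # selected object i
            for h in range(n):                  # non-selected object h
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                cost = total_cost(dist, candidate)
                if cost < best:                 # TC_ih < 0: the swap improves the clustering
                    medoids, best, improved = candidate, cost, True
        if not improved:                        # no beneficial swap left
            break
    labels = dist[:, medoids].argmin(axis=1)    # assign objects to their closest medoid
    return np.array(medoids), labels
```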

PAM Clustering: Finding the Best Cluster Center
- Case 1: p currently belongs to o_j. If o_j is replaced by o_random as a representative object and p is closest to one of the other representative objects o_i, then p is reassigned to o_i

What Is the Problem with PAM?
- PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean
- PAM works efficiently for small data sets but does not scale well to large data sets
  - O(k(n-k)^2) for each iteration, where n is the # of data objects and k is the # of clusters
- Sampling-based method: CLARA (Clustering LARge Applications)

CLARA (Clustering LARge Applications) (1990)
- CLARA (Kaufmann and Rousseeuw, 1990)
  - Built into statistical analysis packages, such as SPlus
- It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
- Strength: deals with larger data sets than PAM
- Weaknesses
  - Efficiency depends on the sample size
  - A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

CLARANS ("Randomized" CLARA) (1994)
- CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han'94)
- Draws a sample of neighbors dynamically
- The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
- If a local optimum is found, it starts from a new randomly selected node in search of a new local optimum
- Advantages: more efficient and scalable than both PAM and CLARA
- Further improvements: focusing techniques and spatial access structures (Ester et al.'95)

ROCK Algorithm
- Method
  - Compute the similarity matrix
    - Use link similarity
  - Run agglomerative hierarchical clustering
  - When the data set is big
    - Get a sample of transactions
    - Cluster the sample
- Problems
  - Guarantees cluster interconnectivity: any two transactions in a cluster are very well connected
  - Ignores information about the closeness of two clusters: two separate clusters may still be quite connected


Assessing Clustering Tendency
- Assess whether non-random structure exists in the data by measuring the probability that the data is generated by a uniform data distribution
- Test spatial randomness with a statistical test: the Hopkins statistic (see the sketch below)
  - Given a dataset D regarded as a sample of a random variable o, determine how far away o is from being uniformly distributed in the data space
  - Sample n points, p_1, ..., p_n, uniformly from the data space of D. For each p_i, find its nearest neighbor in D: x_i = min{dist(p_i, v)}, where v in D
  - Sample n points, q_1, ..., q_n, uniformly from D. For each q_i, find its nearest neighbor in D - {q_i}: y_i = min{dist(q_i, v)}, where v in D and v ≠ q_i
  - Calculate the Hopkins statistic:

    $H = \frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$

  - If D is uniformly distributed, ∑x_i and ∑y_i will be close to each other and H is close to 0.5
  - If D is clustered, H is close to 1
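A sketch of the Hopkins statistic as defined above, with the p_i drawn uniformly from the bounding box of the data and the q_i drawn from the data itself (their own distance of zero is excluded); values near 0.5 suggest uniform data, values near 1 suggest clustering tendency. Function and parameter names are my own:

```python
import numpy as np

def hopkins(D, n_samples=50, seed=0):
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    n = min(n_samples, len(D) - 1)
    # p_i: points drawn uniformly from the bounding box of the data space
    lo, hi = D.min(axis=0), D.max(axis=0)
    P = rng.uniform(lo, hi, size=(n, D.shape[1]))
    x = np.array([np.linalg.norm(D - p, axis=1).min() for p in P])
    # q_i: actual data points; the nearest neighbor excludes the point itself
    idx = rng.choice(len(D), size=n, replace=False)
    y = np.array([np.partition(np.linalg.norm(D - D[i], axis=1), 1)[1] for i in idx])
    return x.sum() / (x.sum() + y.sum())   # ~0.5 for uniform data, -> 1 for clustered data
```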