Data Mining Algorithms In R/Clustering/Density-Based Clustering

Clustering techniques have an important role in class identification of records on a database, therefore it’s been established as one of the main topics of research in data mining. Spatial clustering techniques are a subset of clustering techniques applied on databases whose records have attributes intrinsically related to some spatial semantics. Most of the traditional clustering techniques described in the previous chapters can be applied to spatial databases. However, when applied to tasks like class identification on a spatial context, the traditional techniques might be unable to achieve good results, e.g. elements in the same cluster might not share enough similarity or the performance may be prohibitively poor.

The class identification task assisted by spatial clustering algorithms has a wide range of applications as finding relevant information on increasingly large spatial databases have recently become a highly demanded task. Examples include geographic data from satellite images, medical x-ray image processing or pattern recognition in machine learning samples.

Although there have been proposed an extensive number of techniques for clustering space-related data, many of the traditional clustering algorithm specified by them suffer from a number of drawbacks. Firstly, techniques based on k-partitioning such as those based on k-means are restricted to clusters structured on a convex-shaped fashion. Many databases have clusters with a broad variety of shapes (Figure 1), hence the traditional k-partitioning algorithms will fail to produce satisfactory results. Secondly, most techniques require previous knowledge of the database (domain knowledge) to determine the best input parameters. For example, k-means takes as input the number of expected clusters, k, which must be previously known for whichever database it’s applied on. In many real-life databases there is not an a priori domain knowledge and therefore choosing parameters values based on guesses will probably lead to incomplete and undesirable results. Finally, most techniques are not scalable, which means they can not be used for large databases, such as those made up by hundreds of thousands of elements. Although most techniques derives algorithms belonging to a polynomial run time complexity class, the cost can become prohibitively large when they are applied on huge databases, e.g. millions or billions of elements.

The restrictions mentioned above can be overcome by using a new approach, which is based on density for deciding which clusters each element will be in. The next session will introduce this new approach, DBSCAN, which stands for density-based algorithm for discovering clusters in large spatial databases with noise.

DBSCAN: Density Based Spatial Clustering of Applications with NoiseEdit

The idea behind constructing clusters based on the density properties of the database is derived from a human natural clustering approach. By looking at the two-dimensional database showed in figure 1, one can almost immediately identify three clusters along with several points of noise. The clusters and consequently the classes are easily and readily identifiable because they have an increased density with respect to the points they possess. On the other hand, the single points scattered around the database are outliers, which means they do not belong to any clusters as a result of being in an area with relatively low concentration.

Figure 1: A two-dimensional spatial database with three non-convex shaped clusters and several outliers.

Furthermore, as will be explained in the following sections, the DBSCAN algorithm requires at most two parameters: a density metric and the minimum size of a cluster. As a result, estimating the number of clusters a priori is not a need, as opposed to other techniques, namely k-means. Finally, as will be demonstrated later, the DBSCAN is efficient even when applied on large databases.

Prior to describing the DBSCAN algorithm, some concepts must be explained to its fully understanding. The elements of the database can be classified in two different types: the border points, the points located on the extremities of the cluster, and the core points, which are located on its inner region. A neighborhood of a point p is a set of all points that have a distance measure less than a predetermined value, called Eps. Therefore, the neighborhood size of the core points is generally bigger than that of the border points. A point p is directly density-reachable from another point q if it belongs to the neighborhood of q and q's neighborhood size is greater than a given constant MinPts. By deriving canonically the previous concept, one can define a generic density-reachability: a point p is density-reachable from q if there exists a chain of points p1,..., pn, where p1 = q, pn = p, and pi+1 is directly density-reachable from pi. The former concepts are then combined to the definition of a cluster D:

q,p : if q D and if q is density-reachable from p and p's neighborhood is greater than the MinPts threshold, then p belongs to D.

q,p : if q D and p D then there must be a point t such that q and p are directly reachable from t.

According to the above definition, there might be some points not belonging to any of the generated clusters, those points are outliers (noise). In other words, a cluster D is formed by a set of points that respects a certain degree of concentration, which is set by the MinPts and Eps constraints. By adjusting those values, one can find clusters of varying shapes and densities.

In other words, a cluster D is formed by a set of points that respects a certain degree of concentration, which is set by the MinPts and Eps constraints. By adjusting those values, one can and clusters of varying shapes and densities.

The DBSCAN algorithm (Algorithm 1) starts by randomly selecting an element P from the database. If P is not a core point, i.e. P has fewer than MinPts neighbors, it will be marked as noise. Otherwise it will be marked as being in the current cluster and the ExpandCluster (Algorithm 2) function will be called. Its purpose is to find all points that are density-reachable from P and are currently being marked as unclassified or noise. Despite being a recursive function, ExpandCluster is implemented without using recursion explicitly. The recursive behaviour is accomplished by using a set whose size varies as new density-reachable points are found. The algorithm ends when all points have been properly classified.

Finally, after identifying all clusters, one might wonder that a border point might belong to two clusters at the same time. For this matter, the currently implement algorithm will consider the ambiguous points as being part of the cluster which aggregated them firstly.

R is a programming language and software environment for statistical computing. Besides being a widely used tool for statistical analysis, R aggregates several data mining techniques as well. Therefore it has become a major tool for simple tasks aiming to discover knowledge on databases. R's source code is freely available under the GNU General Public License and has been ported for several platforms other than Unix varieties, like Windows and MacOS. Although R uses primarily a command line interface, several GUIs are available, which increases its user-friendliness. The DBSCAN technique is available on R's fpc package, by Christian Hennig, which implements clustering tasks for fixed point clusters. The DBSCAN implementation offers high-configurability, as it allows choosing several parameters and options values. Moreover, fpc's DBSCAN has a visualization interface, which make it possible to visualize the clustering process iteratively.

seeds: Configuration regarding to the inclusion of the isseed-vector in the dbscan-object, can be TRUE or FALSE.

showplot: Plotting configuration:

0: no plot

1: plot per iteration

2: plot per subiteration

countmode: NULL or vector of point numbers for reporting progress.

The first three parameters are, by far, the most important ones: data is the input data used for clustering; eps and MinPts variables retains the same meaning as before, that is, they are the minimum distance between elements and the necessary quantity of points to form a cluster respectivelly. The parameter showplot is related to the visualization of the results. Parameter method impacts directly the efficiency of the algorithm and can be used for calibrating memory-performance tradeoff on memory-constrained environments.

The example in this section will illustrate the fpc's DBSCAN usage on the database depicted in figure 1. It will also show the clustering mechanism as a iterative visualization process. Firstly, the data to be clustered must be created:

Each cluster is represented by a single column and the rows shows how many of seed (core), border and total points it has. Notice that the tiny clusters scattered throughout the database are, in fact, formed by more than a single point. Thus, they are considered valid clusters because the MinPts parameter was not set and its default value is too low.

Points that belong to a cluster are delimited by triangles of different colors. All points belonging to the same cluster are delimited by triangles of the same color. Points that do not belong to any cluster retain the same color as before and, as will be showed, are usually represented by black circles. Moreover, setting showplot parameter to 2 implies that on each subiteration of the DBSCAN algorithm, the routine will show the partial clusters on the screen. Following are five of these moments:

First image shows the points before the DBSCAN routine is called (command plot(x)). The second one shows the beginning of the clustering algorithm, the first major cluster is being formed in the upper center of the database. The fifth image shows the 19 clusters which were obtained when the algorithm had ended.

Identifying celestial objects by capturing the radiation they emit is recurring task on astronomy. An astronomical entity might be itself the source of electromagnetic radiation (i.e. a star) or might reflect it from other sources (i.e. a planet). Typically, an entity will emit radiation on different wavelengths, which, together, will help identifying its class: it might be a planet, an star of any kind or age, or even a galaxy or other exotic entity previously unidentified. The intensity collected from each range of the eletromagnetic spectrum is stored on an individual two-dimensional grid (e.g. one for ultraviolet and other for gamma rays). Modern research groups on astronomy are capable of collecting and storing thousands of large-dimension grids, each of them representing a different view (or image) from the sky and using up to several terabytes of storage space.

Besides capturing the electromagnetic intensity emitted from an entity at a given range of the spectrum, the sensors might capture noise caused by the sensors as well as diffuse emission from the atmosphere and the space itself. Traditionally, a method for eliminating diffuse emission and some of the noise is to constrain the relevant intensity by a known threshold. For example, the intensity will only be considered relevant when surpassing 5 rms. A greater problem arises due to the fact that electromagnetc waves at different spectrum ranges will yield different interference when traversing a medium. This phenomenum is called Rayleigh Scattering and will ultimately cause the wave to defleect upon being captured by the sensor. Therefore, when considering several images taken from the same region of space, but from different spectrum ranges, the following steps that must done for correctly identifying the celestial entities.

Figure 2: The original image. At a wavelength of 4,865 MHz, it shows the center of a galaxy.

Figure 3: Source emissions after pre-processing.

Figure 4: Clustering results for MinPts = 5 and eps = 5.

As an example, consider figure 2, which shows an image taken from the center of a galaxy. This image consists of a pathologic example of noise on an astronomical picture. Following is the methodology for extracting the celestial objects on this image.

Firstly, a pre-processing step must be applied to the removal of noise and diffuse emission. As stated before, this might be accomplished by using a threshold. This step it is shown at figure 3, whereas the original image can be viewed at figure 2 and represents the original image depicting the center of a galaxy. The threshold was set to 50 which means that only pixels whose intensity are less than 50 (and consequently darker) are being considered.

Secondly, the DBSCAN algorithm can be applied on individual pixels to link together a complete emission area at the images for each channel of the electromagnetic spectrum. This is done by setting the eps parameter to some value that will define the minimum area required for a source to be considered. The eps parameter will define the distance metric in terms of pixels. Each of the generated cluster will define a celestial entity. Figure 4 shows the result of this step with eps and MinPts parameter set to 5.

After identifying all clusters, one can apply a multi-espectral correlation process in order to consider the results (generated clusters) from every electromagnetic wavelength. It will not be detailed here, but a common approach would be only considering clusters which has one or more counterparts close enough with respect to some threshold on the other channels of the electrogmagnetic spectrum.

Figure 4 shows the results after the clustering has been performed by the DBSCAN algorithm with MinPts and eps set to 5. By looking at the results, we can see that many isolated points have not been clustered because MinPts parameter restrict the size of a cluster by a minimum value of elements. This restriction will remove clusters that are too small to be classified as a relevant emission source and consequently are classified as noise. By restricting the MinPts parameter, celestial objects whose intensity is weak (e.g. they might be too far or do not emit strong radiation.) are eliminated from the results. This is generally desirable but can become a problem when emission sources defined by smaller areas must be analysed. Therefore, MinPts value must be carefully chosen since it may drastically modify the results.

Nevertheless, 64 clusters and 224 outliers were found. Most clusters do not have a large number of points, with some exceptions, like the large cluster located at the center of the database, which represents galaxy core which is an area with strong emission. The eps parameter takes an important role as it will define the minimum radius of pixels that represents the same emission source. Being set to 5 implies that points whose euclidian distance is greater than 5 do not belong to the same emission source. Although five is relatively large for the eps parameter, some clusters like those on (300,0) were not clustered together. This is because despite being close to each other they do not represent the same celestial object according to the eps value. Therefore, setting eps plays a role similar to that of MinPts, since it will define noise objects with respect to the distance measure, the same was done by the count measure of the MinPts.

The need to automatize the process of class identification of celestial entities is becoming increasinly important as astronomy and astrophysics become prominent areas of research. There is a growing demand for this task as technology evolves ands yields more and larger data samples to be analysed. Otherwise impossible without computational aid, this task have become easier with the assistance of the DBSCAN technique, which allowed larger samples to be analysed along with more precise results.