Cluster analysis and classification of process data by use of principal curves

Abstract:

In this thesis a new method of clustering and a new method of classification are proposed. Cluster analysis is a statistical method used to search for natural groups in an unstructured multivariate data set. Clusters are formed so that observations belonging to the same group are more alike than observations from different groups. For instance, long data records are found in mineral processing plants, where the data can be reduced to clusters according to different ore types. Most existing clustering methods do not give reliable results when applied to engineering data, since these methods were mainly developed in the domains of psychology and biology.

Classification analysis can be regarded as the natural continuation of cluster analysis. In order to classify objects, two types of observations are needed. The first are observations whose group memberships are known a priori; these memberships can be acquired through cluster analysis. The second are observations whose group memberships are unknown. By means of classification, these observations are allocated to one of the existing groups.

Both of the proposed techniques are based on a smooth one-dimensional curve passing through the middle of the data set. Principal curves were developed by Hastie and Stuetzle (1989) to formalise this idea. A principal curve summarises the data in a non-linear fashion. For clustering, the principal curve of the entire unstructured data set is extracted, and this one-dimensional representation of the data set is then used to search for different clusters. For classification, a principal curve is fitted to every known group in the data set. Each new observation is then allocated to the group whose principal curve lies closest to it.

Clustering with principal curves grouped engineering data better than most of the well-known clustering algorithms. Some shortcomings of this method were also identified.
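
The allocation rule for classification can be illustrated with a minimal sketch. It is not the implementation used in the thesis: it assumes that each group's principal curve has already been fitted and is approximated by a polyline (an ordered array of vertices), and the names `distance_to_polyline`, `classify` and the example curves are hypothetical. Fitting the curves themselves (Hastie and Stuetzle, 1989) is not shown.

```python
import numpy as np


def distance_to_polyline(point, vertices):
    """Shortest Euclidean distance from `point` to a piecewise-linear curve.

    `vertices` is an (n, d) array of ordered points approximating a curve
    (an assumption of this sketch, not the thesis's curve representation).
    """
    p = np.asarray(point, dtype=float)
    v = np.asarray(vertices, dtype=float)
    a, b = v[:-1], v[1:]                      # segment start and end points
    d = b - a                                 # segment direction vectors
    seg_len2 = np.einsum("ij,ij->i", d, d)    # squared segment lengths
    # Position of the orthogonal projection on each segment, clipped to [0, 1]
    # so that the closest point never leaves the segment.
    t = np.einsum("ij,ij->i", p - a, d) / np.where(seg_len2 > 0, seg_len2, 1.0)
    t = np.clip(t, 0.0, 1.0)
    closest = a + t[:, None] * d
    return float(np.min(np.linalg.norm(closest - p, axis=1)))


def classify(point, curves):
    """Allocate `point` to the group whose curve lies nearest to it."""
    return min(curves, key=lambda g: distance_to_polyline(point, curves[g]))


# Hypothetical curves for two known groups in two dimensions
# (illustrative values only, not data from the thesis).
curves = {
    "group_a": np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 0.0]]),
    "group_b": np.array([[0.0, 3.0], [1.0, 2.5], [2.0, 3.0]]),
}
label = classify([1.0, 0.8], curves)
```

Because the rule depends only on distances to the fitted curves, the sketch makes no assumption about the distribution of the data within each group.
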
Classification with principal curves gave optimal results, similar to those of some existing classification methods. This classification method can be applied to data of any distribution, unlike statistical classification technique