
CHAPTER 4

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

4.1 Introduction

Optical character recognition is one of the most popular areas of research in pattern recognition because of its immense application potential. There are two fundamental approaches to character recognition: template matching and feature classification. In template matching, recognition is based on the correlation of the test character with a set of stored templates. Template matching techniques are sensitive to font and size variations of the characters, and their time complexity grows linearly with the number of templates. Because of these disadvantages, classification methods based on learning from examples have been widely applied to character recognition, and artificial neural networks with supervised learning are among the most successful classifiers for this task.

Character recognition, an application of pattern recognition, basically involves identifying the data within a collection that most resembles a new input. Since artificial neural networks can learn from examples, generalize well from training, and create relationships amongst the information, they are suitable for the recognition of handwritten characters. In the present work, radial basis function networks and probabilistic neural networks are selected for the following reasons. Radial basis function networks train and learn quickly because of their locally tuned neurons, exhibit the universal approximation property, and have good generalization ability (Park and Sandberg, 1991). A probabilistic neural network integrates the characteristics of statistical pattern recognition and back-propagation neural networks, and it has the ability to identify the boundaries between categories of patterns (Jeatrakul and Wang, 2009).
This chapter explores the application of radial basis function networks and probabilistic neural networks for Telugu character recognition.

4.2 Classification with Neural Networks

Classification is one of the most frequently encountered decision-making tasks of human activity. A classification problem occurs when an object needs to be assigned to a predefined group or class based on a number of observed attributes related to that object. The objective of classification is therefore to analyze the input data and to develop an accurate description, or model, of each class using the features present in the data. The model is used to predict the class label of unknown records, and such modeling is referred to as predictive modeling. The identification of handwritten characters comes under classification because the decision, or prediction, is made based on samples collected from different persons to cover various handwriting styles.

Artificial neural networks, usually called neural networks, have emerged as an important tool for classification. Neural networks are simplified models of the biological nervous system, consisting of a highly interconnected network of a large number of processing elements called neurons, in an architecture inspired by the brain (Rajasekaran and Pai, 2009). Neural networks learn by example: they can be trained with known examples of a problem, and once appropriately trained, the network can be put to effective use in solving unknown or untrained instances of the problem. Research activity in neural networks has established that they are promising alternatives to various conventional classification methods (Zhang, 2000). The advantage of neural networks lies in the following theoretical aspects. First, neural networks are data-driven, self-adaptive methods in that they can adjust themselves to the data without any explicit specification of the functional or distributional form of the underlying model. Second, they are universal function approximators in that neural networks can approximate any function with arbitrary accuracy (Hornik et al., 1991).
Since any classification procedure seeks a functional relationship between group membership and the attributes of the object, accurate identification of the underlying function is doubtless important. Third, neural networks are non-linear models, which makes them flexible in modeling real-world complex relationships. Finally, neural networks are able to estimate posterior probabilities, which provide the basis for establishing classification rules and performing statistical analysis.

Because of the advantages mentioned above, the system was designed using two types of artificial neural networks: radial basis function networks and probabilistic neural networks.

4.3 Classifier Accuracy Measures

Using the training data to build a classifier or predictor and then estimating the accuracy of the resulting model on the same training set can result in misleadingly optimistic estimates due to over-specialization of the learning algorithm to the data. Accuracy is better measured on a test set consisting of tuples that were not used to train the model. The accuracy of a classifier on a given set is the percentage of test-set tuples that are correctly classified by the classifier. In the pattern recognition literature this is also referred to as the overall recognition rate of the classifier, i.e., it reflects how well the classifier recognizes tuples of the various classes.

A confusion matrix is a useful tool for analyzing how well a classifier can recognize tuples of different classes (Han and Kamber, 2009); it tabulates the records correctly and incorrectly predicted by the model. Each entry C_ij in the confusion matrix denotes the number of records from class i predicted to be of class j. For a classifier to have good accuracy, most of the tuples should be represented along the diagonal of the confusion matrix, with the rest of the entries being close to zero. The confusion matrix may have additional rows or columns to provide totals or the recognition rate per class. Although the confusion matrix provides the information needed to determine how well a classification model performs, summarizing that information in a single number makes it convenient to compare the performance of different models. It is also necessary to know how well a classifier identifies tuples of a particular class and how well it correctly labels the tuples that do not belong to the class.
These two requirements can be met by using performance metrics such as sensitivity (recall), specificity, positive predictive value (PPV) or precision, F-measure and accuracy (Tan et al., 2007).
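The tabulation described above can be sketched in a few lines of Python; the labels below are toy data, not the thesis data set:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """C[i, j] = number of records of class i predicted to be of class j."""
    C = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    return C

# Toy 3-class example; correct predictions fall on the diagonal.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
C = confusion_matrix(y_true, y_pred, 3)
print(C)
print("overall recognition rate:", np.trace(C) / C.sum())
```

The overall recognition rate is the sum of the diagonal entries divided by the total number of records, here 4/6.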

Sensitivity (Recall): Sensitivity measures the proportion of actual members of the class that are correctly identified as such. It is also referred to as the true positive rate (TPR) and is defined as the fraction of positive examples predicted correctly by the classification model:

Sensitivity = TP / (TP + FN)

Classifiers with large sensitivity have very few positive examples misclassified as the negative class.

Specificity: Specificity is also referred to as the true negative rate. It is defined as the fraction of negative examples that are predicted correctly by the model:

Specificity = TN / (TN + FP)

Precision (Positive Predictive Value): Precision determines the fraction of records that actually turn out to be positive in the group the classifier has declared as the positive class:

Precision = TP / (TP + FP)

The higher the precision, the lower the number of false positive errors committed by the classifier.

Negative Predictive Value (NPV): The NPV is the proportion of samples that do not belong to the class under consideration and are correctly identified as non-members of the class:

NPV = TN / (TN + FN)

F-measure: Precision and recall are two widely used metrics for evaluating the correctness of a pattern recognition algorithm. Building a model that maximizes both precision and recall is the key challenge of a classification algorithm. Precision and recall can be summarized into another metric known as the F-measure, which is the harmonic mean of precision and recall and is given by,

F-measure = (2 × Precision × Recall) / (Precision + Recall)

Accuracy: Accuracy is used as a statistical measure of how well a binary classification test identifies or excludes a condition. It is the proportion of true results:

Accuracy = (TP + TN) / (TP + FP + TN + FN)

where TP = true positives, TN = true negatives, FP = false positives and FN = false negatives.

4.4 Evaluating the Performance of a Classifier

It is often useful to measure the performance of a classifier on a test set because such a measure provides an unbiased estimate of its generalization error. The accuracy computed from the test set can also be used to compare the relative performance of classifiers on the same domain. This section addresses some of the methods for estimating the performance of a classifier using the measures discussed in the previous section.

4.4.1 Holdout Method

In this method the data set is partitioned into two disjoint sets, called the training and test sets respectively. A classification model is induced from the training set and its performance is evaluated on the test set. The proportion of data reserved for training and for testing is typically at the discretion of the user. The accuracy of the classifier can be estimated from the accuracy of the induced model on the test set.

The holdout method has certain drawbacks. First, fewer labeled examples are available for training because some of the records are withheld for testing. Second, the method may be highly dependent on the composition of the training and test sets: the smaller the training set, the larger the variance of the model. On the other hand, if the training set is too large, then the estimated accuracy computed from the smaller test set is less reliable.
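The accuracy measures of section 4.3 can all be computed directly from the four counts TP, FP, TN and FN of a binary confusion matrix. A short sketch follows; the counts are hypothetical, not the thesis data:

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy measures derived from a binary confusion matrix."""
    sensitivity = tp / (tp + fn)              # recall / true positive rate
    specificity = tn / (tn + fp)              # true negative rate
    precision = tp / (tp + fp)                # positive predictive value
    npv = tn / (tn + fn)                      # negative predictive value
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "npv": npv,
            "f_measure": f_measure, "accuracy": accuracy}

m = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print(m)  # accuracy = 85/100 = 0.85, precision = 40/50 = 0.8
```

Because the F-measure is a harmonic mean, it is high only when precision and recall are both high.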

4.4.2 Random Subsampling

The holdout method can be repeated several times to improve the estimate of a classifier's performance. Let acc_i be the model accuracy during the i-th iteration. The overall accuracy is given by,

acc = (1/k) Σ_{i=1}^{k} acc_i

This method still encounters some of the problems associated with the holdout method because it does not utilize as much data as possible for training. It also has no control over the number of times a record is used for testing and training; consequently some records might be used for training more often than others.

4.4.3 Cross Validation

An improvement on random subsampling is cross validation. In this approach each record is used the same number of times for training and exactly once for testing. Suppose the data is partitioned into two equal-sized subsets: one of the subsets is used for training and the other for testing, and then the roles of the two subsets are swapped. This approach is called twofold cross validation. The total error is obtained by summing the errors of both runs, and each record is used exactly once for training and once for testing.

K-fold cross validation generalizes this approach by segmenting the data into K equal-sized partitions. During each run, one of the partitions is chosen for testing while the rest are used for training. The procedure is repeated K times so that each partition is used for testing exactly once, and again the total error is found by summing up the errors of the K runs. A special case of K-fold cross validation sets K = N, where N is the size of the data set. This is called the leave-one-out approach, in which each test set contains only one record. It has the advantage of utilizing as much data as possible for training; its drawbacks are that it is computationally expensive to repeat the procedure N times and that, since each test set contains only one record, the variance of the estimated performance metrics tends to be high.
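The K-fold protocol above can be sketched as follows. The scorer below is a deliberately trivial majority-class model, used only to show how each partition is held out for testing exactly once; the data are hypothetical:

```python
import numpy as np

def cross_validate(X, y, k, train_and_score, seed=0):
    """Each of the k folds is used for testing exactly once; the rest train."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[train], y[train], X[test], y[test]))
    return float(np.mean(scores))   # overall accuracy over the k runs

def majority_scorer(Xtr, ytr, Xte, yte):
    """Trivial model: predict the majority class of the training set."""
    majority = np.bincount(ytr).argmax()
    return np.mean(yte == majority)

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)
acc = cross_validate(X, y, k=10, train_and_score=majority_scorer)
print("10-fold accuracy of the majority model:", acc)   # 15/20 = 0.75
```

With equal-sized folds, averaging the per-fold accuracies equals the overall fraction of correctly classified records.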

4.5 Architecture of Radial Basis Function Network

Radial basis function networks have attracted extensive research interest because they are universal approximators, they learn quickly owing to their locally tuned neurons (Moody and Darken, 1989), and they have a more compact topology than many other neural networks. The radial basis function network is used for a wide range of applications, primarily because it can approximate any regular function and its training is faster than that of the multilayer perceptron (MLP). The architecture of the RBF network is shown in Figure 4.1.

Figure 4.1: Architecture of Radial Basis Function Network (input layer, hidden layer, output layer)

The radial basis function network consists of three layers, namely the input layer, hidden layer and output layer. Each node in the input layer corresponds to a component of the feature vector F. The second layer is the only hidden layer in the network; it applies a non-linear transformation from the input space into the hidden space by employing a non-linear activation function such as the Gaussian kernel. The output layer consists of linear neurons connected to all the hidden neurons, and the number of neurons in the output layer is equal to the number of classes. The number of neurons and the activation functions at the hidden and output layers describe the behaviour of the network; these two issues are addressed in the next two sections.

4.5.1 Selection of Centers in the Hidden Layer

The hidden layer of an RBF neural network classifier can be viewed as a function that maps the input patterns from a non-linearly separable space to a linearly separable space. In the new space, the responses of the hidden layer neurons form a new feature vector for pattern discrimination, so the discriminative power of the network is determined by the RBF centers. There are different methods to select the centers; commonly used methods are:

i. Choose a hidden neuron centered on each training pattern. However, this method is computationally very costly and takes up a huge amount of memory.

ii. Choose a random subset of the training set and set the centers of the Gaussian radial basis functions to the centers of this subset. The drawback of this method is that it may lead to the use of an unnecessarily large number of basis functions in order to achieve adequate performance.

iii. Use K-means clustering to find a set of centers that more accurately reflects the distribution of the data points. The number of centers is decided in advance, and each center is supposed to be representative of a group of data points. The steps of the K-means algorithm are as follows:

1. Select K points as initial centers.
2. Repeat
3.   Form K clusters by assigning each point to the closest center.
4.   Re-compute the centroid of each cluster.
5. Until the centroids do not change.

4.5.2 Activation Functions

The commonly used activation function is the localized Gaussian basis function given by

G(||x − µ_i||) = exp( −||x − µ_i||^2 / (2σ^2) )        (4.1)

where x is the training example, µ_i is the center of the i-th hidden neuron and σ is the spread factor, or width, which has a direct effect on the smoothness of the interpolating function. The width of the basis function is set to a value that is a multiple of the average distance between the centers; this value governs the amount of smoothing. The activation at the output neurons is defined by the summation

Y(x) = Σ_i w_i G(||x − µ_i||) + b        (4.2)

where w is the weight vector, computed by

W = (G^T G)^(-1) G^T d        (4.3)

where d is the target class matrix.

4.6 Design and Implementation of Radial Basis Function Network

The universal approximation property of radial basis functions makes the network suitable for character recognition, one of the important applications of pattern recognition; the architecture of the network has been explained in the previous section. The number of neurons in the input layer is equal to the number of attributes in the feature vector of the character image. The data set of character images was collected from 60 persons. The features were extracted from the preprocessed images and dimensionality reduction was performed using factor analysis, as explained in Chapter 3; the 18 variables obtained after factor analysis represent the elements of the feature vector. Hence the number of neurons in the input layer is 18.

The discriminative power of the network depends on the selection and number of centers in the hidden layer. The K-means clustering algorithm was used to form the centers. Classification accuracy with different numbers of centers was verified, and the accuracy was found to be maximum when the number of centers was 100. The information is provided in Table 4.1.

Table 4.1: Percentage of Characters Correctly Classified for Different Numbers of Centers
Number of Centers | % Characters Correctly Identified

The activation function of the hidden neurons is calculated using the Gaussian radial basis function given in equation 4.1. The smoothing parameter, or width, of the basis function, which is a multiple of the average distance between the centers, is set equal to 2.4, where the classifier accuracy is maximum. The average width of the neurons is 0.6, and the classifier accuracies for different widths that are multiples of the average width are shown in Table 4.2.

Table 4.2: Percentage of Characters Correctly Classified for Different Values of σ with RBF Network
σ | % Characters Correctly Classified

The number of neurons in the output layer is equal to the number of classes used for classification, which in this case is 10. The activation of the output neurons is calculated by the summation function given in equation 4.2. The confusion matrix with 100 hidden neurons and a basis-function width of 2.4 is shown in Figure 4.2. With 10-fold cross validation the accuracy of the classification is 78.8%.
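A minimal sketch of the RBF design described above (K-means centers, Gaussian hidden layer as in equation 4.1, and output weights obtained by least squares, W = (G^T G)^(-1) G^T d) is given below on hypothetical two-dimensional data; the thesis itself uses 18-dimensional feature vectors and 100 centers:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain K-means: pick k initial centers, then alternate assign/recompute."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def rbf_design_matrix(X, centers, sigma):
    """Hidden activations G(||x - mu_i||) = exp(-||x - mu_i||^2 / (2 sigma^2))."""
    d2 = ((X[:, None] - centers[None]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def train_rbf(X, y_onehot, k, sigma):
    centers = kmeans(X, k)
    G = rbf_design_matrix(X, centers, sigma)
    G = np.hstack([G, np.ones((len(X), 1))])          # bias column b
    W, *_ = np.linalg.lstsq(G, y_onehot, rcond=None)  # least-squares weights
    return centers, W

def predict_rbf(X, centers, W, sigma):
    G = rbf_design_matrix(X, centers, sigma)
    G = np.hstack([G, np.ones((len(X), 1))])
    return np.argmax(G @ W, axis=1)

# Two well-separated 2-D blobs; sizes and parameters are hypothetical.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
Y = np.eye(2)[y]                                      # one-hot target matrix d
centers, W = train_rbf(X, Y, k=4, sigma=1.0)
acc = np.mean(predict_rbf(X, centers, W, sigma=1.0) == y)
print("training accuracy:", acc)
```

In the thesis design, the number of output columns would be 10 (one per class) and σ would be tuned as in Table 4.2.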

Figure 4.2: Confusion Matrix with Radial Basis Function Network

4.7 Architecture of Probabilistic Neural Network

The architecture of the probabilistic neural network is shown in Figure 4.3. The probabilistic neural network is composed of many interconnected processing units, or neurons, organized in four successive layers: an input layer, two hidden layers (a pattern layer and a summation layer) and an output layer. The input layer does not perform any computation; it simply distributes the input to the neurons in the pattern layer.

Figure 4.3: Architecture of Probabilistic Neural Network

On receiving a pattern x from the input layer, the neurons x_ij of the pattern layer compute their output as

φ_ij(x) = [1 / ((2π)^(d/2) σ^d)] exp( −(x − x_ij)^T (x − x_ij) / (2σ^2) )        (4.4)

where d denotes the dimension of the pattern vector x, σ is the smoothing parameter and x_ij is the neuron's weight vector. The summation layer neurons compute the maximum likelihood of pattern x being classified into class C_i by summing and averaging the outputs of all pattern neurons that belong to the same class:

P_i(x) = [1 / ((2π)^(d/2) σ^d)] (1/N_i) Σ_{j=1}^{N_i} exp( −(x − x_ij)^T (x − x_ij) / (2σ^2) )        (4.5)

where N_i denotes the total number of samples in class C_i. If the a priori probabilities for each class are the same, and the losses associated with making an incorrect decision for each class are the same, the decision layer classifies the pattern x in accordance with Bayes' decision rule, based on the outputs of all the summation layer neurons:

Ĉ(x) = arg max_i { P_i(x) },   i = 1, 2, ..., m        (4.6)

where Ĉ(x) denotes the estimated class of pattern x and m is the total number of classes in the training samples.

4.8 Design and Implementation of Probabilistic Neural Network

The probabilistic neural network integrates the characteristics of statistical pattern recognition and back-propagation neural networks and is capable of identifying the boundaries between categories of patterns. Because of this property the probabilistic neural network was selected for character recognition; its architecture has been described in the previous section. The network architecture is determined by the number of samples in the training set and the number of attributes used to represent each sample (Specht, 1990).

The input layer provides the input values to all neurons in the pattern layer and has as many neurons as the number of attributes used to represent the character image. So the number of input neurons is 18, the same as for the radial basis function network, as explained in section 4.6. The number of pattern neurons is determined by the number of samples in the training set; each pattern neuron computes a distance measure between the input and the training sample represented by that neuron, using equation 4.4. The summation layer has one neuron for each class, and each neuron sums the outputs of all pattern neurons corresponding to members of its class to obtain the estimated probability density function, using equation 4.5. The single neuron in the output layer then determines the final class of the input image by comparing the probability density functions from all the summation neurons and choosing the class with the highest value, as in equation 4.6.

The value of the smoothing parameter σ, one of the factors that influence the classification accuracy, is fixed at 1.4, where the classification accuracy is maximum.
The values of σ and the percentage of characters correctly classified for each σ are shown in Table 4.3.

Table 4.3: Percentage of Characters Correctly Classified for Different Values of σ with PNN
σ | % Characters Correctly Classified

The model developed with the probabilistic neural network was tested with σ = 1.4 and 10-fold cross validation. For each fold, 540 images were used for training and 60 images for testing. The percentage of characters correctly classified is 72.5%, and the results of classification are shown as a confusion matrix in Figure 4.4.

Figure 4.4: Confusion Matrix with Probabilistic Neural Network
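A minimal sketch of the PNN computation described in sections 4.7 and 4.8 follows: every training sample acts as a pattern neuron (equation 4.4), the summation layer averages the Gaussian kernels per class (equation 4.5), and the decision layer takes the class with the largest averaged density (equation 4.6). The data below are hypothetical toy samples, not the thesis data set:

```python
import numpy as np

def pnn_predict(X_train, y_train, X_test, sigma):
    """Classify each test pattern by the class with the largest averaged kernel."""
    classes = np.unique(y_train)
    d = X_train.shape[1]
    norm = 1.0 / (((2 * np.pi) ** (d / 2)) * sigma ** d)
    preds = []
    for x in X_test:
        dist2 = ((X_train - x) ** 2).sum(axis=1)
        kern = norm * np.exp(-dist2 / (2 * sigma ** 2))        # pattern layer
        scores = [kern[y_train == c].mean() for c in classes]  # summation layer
        preds.append(classes[int(np.argmax(scores))])          # decision layer
    return np.array(preds)

# Hypothetical toy data: two well-separated 2-D classes.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
X_test = np.array([[0.2, -0.1], [3.8, 4.1]])
preds = pnn_predict(X_train, y_train, X_test, sigma=1.0)
print(preds)
```

Note that no iterative training occurs: the network is built directly from the training samples, and only σ needs tuning, as in Table 4.3.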

4.9 Results and Discussion

To compare the performance of the different classifiers, it is convenient to summarize the information for each class using the performance metrics sensitivity, specificity, accuracy and F-measure, as explained in section 4.3. The summaries of the confusion matrices for the radial basis function network and the probabilistic neural network are shown in Table 4.4 and Table 4.5 respectively.

Table 4.4: Summary of Performance Metrics for RBF Network
Class | Accuracy | Sensitivity | Specificity | Precision | NPV | F-Measure

Table 4.5: Summary of Performance Metrics for PNN Network
Class | Accuracy | Sensitivity | Specificity | Precision | NPV | F-Measure

The observations from the results are as follows:

1. The percentage of characters classified correctly with the RBF network is 78.8%, while with the PNN it is 72.5%.

2. The performance metric accuracy, which is a function of specificity and sensitivity, is a measure for comparing two classifiers. The accuracy of the RBF network for all classes except those with labels 8 and 10 is above 95%, whereas with the PNN the accuracy for the four classes with labels 1, 3, 4 and 5 is above 95%, and for the remaining classes it is less than 95%. The comparison of the accuracy measure is shown in Figure 4.5.

Figure 4.5: Accuracy Measure

3. Building a model that maximizes both precision and recall is a key challenge in classification (Tan et al., 2007). Precision and recall can be summarized into another metric known as the F-measure, as explained in the section on performance metrics. A high value of the F-measure ensures that both precision and recall are reasonably high; from the definition of the F-measure it is evident that the maximum possible value is 1, and values nearer to 1 indicate good classifier performance. The F-measure for both classifiers is shown as a graph in Figure 4.6. With the RBF network the value of the F-measure is less than 0.7 for the classes with labels 8 and 10, and with the PNN it is less than 0.7 for the classes with labels 2, 6, 8 and 10.

Figure 4.6: F-Measure

4.10 Conclusions

In this work two classification models, radial basis function networks and probabilistic neural networks, have been implemented using MATLAB (R2009b). The work was carried out with 600 images collected from 60 people, and the results were tested with 10-fold cross validation. With the RBF network 474 characters were classified correctly, while with the PNN 435 characters were classified correctly. The following observations are made from the results:

1. Only for the class with label 3 are the values of accuracy and F-measure found to be better with the PNN; for all the remaining classes the RBF network shows better results.

2. Except for the class with label 10, the value of the F-measure is near one; the reason is that the character considered for the class with label 10 has a structure similar to those of the classes with labels 2, 6 and 7.

The accuracy for all the classes is above 90% with both methods, and from the results the overall accuracy of the RBF network is found to be better.
