Neural network classifiers for GIS data: improved search strategies

Gordon German
School of Computing, Curtin University of Technology, Kent Street, Perth, Western Australia
E-mail: gordon@cs.curtin.edu.au

Abstract

Artificial neural networks have several advantages when used as classifiers of complex geographic and remotely-sensed datasets. They normally require no assumptions on the data distribution and can be trained with relatively small sample sets. Further, they are robust classifiers that require little data preparation prior to use. The author has previously presented a robust methodology for selecting an architecture and reducing the time required for training (German et al., 1998), in which each hidden-layer node is responsible for the creation and positioning of a separating hyperplane in attribute-space.

In an extension of the prior methodology, hyperplanes that already produce adequate separation in attribute space are 'frozen' in position prior to training. This allows the network to focus on areas of poor separability and spawn additional hyperplanes if required. Results obtained on a complex real-world dataset show a significant improvement in classification performance, as well as a reduction in training time.

1. Introduction

The pixel-level classification of complex geographic datasets is an arduous task; the increase in the amount and variety of satellite-acquired data (Wilkinson et al., 1995) has made it ever more difficult for human classifiers to deduce the relationships within the data. Automated supervised classifiers, based on both statistical and machine-learning approaches, are often used to produce a discrete choropleth map in which each pixel is assigned a particular class label.

A classifier’s function can be formulated in terms of a mapping of its input variables to its (given) output conditions. We can write:

G(n, p) : ℜ^p → P,  P = {c_1, c_2, ..., c_q} .............. (1)

where p is the number of attributes, q is the number of classes and n is the number of samples. The goal of classification is to select, for each input vector x ∈ ℜ^p, an output class from a different phenomenological domain (P, the classification scheme) to that of the input attributes (ℜ^p). These transformation models can be categorised as unsupervised or supervised classifiers. In the case of supervised classification, the user chooses the scheme P and the classifier learns an approximation G′(n, p) to the required transfer function G(n, p). This is accomplished by examining a small set (the training set) of the data for which the correct classification has already been determined (by ground survey, or a previous classification). Commonly, the scheme is ground-cover type, derived from an attribute domain that may comprise several bands of LANDSAT data, as well as ancillary data such as digital elevation models, rainfall, etc.

The classifier must then be of such complexity as to enable encoding of the approximated transformation function G′(n, p). For statistical classifiers, such as the gaussian maximum likelihood classifier (MLC) or minimum-distance-to-mean (MDM; see Richards, 1986), this often translates into defining some set of probability density functions (p.d.f.s) that describe the class distributions. A discriminant function is then constructed that uses these p.d.f.s to provide the required transformation, i.e., these classifiers operate by use of a discriminant rule, which divides the p-dimensional attribute-space ℜ^p into disjoint regions, one region per class. The problem for these statistical methodologies lies in the production of a satisfactory p.d.f. for each class, which may require certain assumptions and limitations on the data. For instance, the popular MLC assumes a gaussian distribution for the class variability and requires, as an absolute minimum, p+1 input vectors per class in the training set to avoid the class covariance matrix (Σ_i) becoming singular. Generally, it is recommended to have at least 10p to 100p input vectors per class to accurately model the class distribution with the gaussian p.d.f. (Mardia et al., 1979; Swain & Davis, 1978).
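As an illustration of the statistical approach just described, the gaussian discriminant rule can be sketched as follows. This is a hypothetical Python/NumPy sketch, not code from the paper; the function names are mine.

```python
import numpy as np

def mlc_fit(X, y):
    """Estimate per-class mean, covariance and prior for a gaussian MLC.
    X: (n, p) training vectors, y: (n,) integer class labels."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]                       # needs more than p samples per class,
        mu = Xc.mean(axis=0)                 # else the covariance matrix is singular
        cov = np.cov(Xc, rowvar=False)
        params[c] = (mu, cov, len(Xc) / len(X))
    return params

def mlc_classify(x, params):
    """Assign x to the class with the largest gaussian log-discriminant."""
    best, best_g = None, -np.inf
    for c, (mu, cov, prior) in params.items():
        d = x - mu
        g = (np.log(prior) - 0.5 * np.log(np.linalg.det(cov))
             - 0.5 * d @ np.linalg.solve(cov, d))
        if g > best_g:
            best, best_g = c, g
    return best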

In contrast, the task-based multi-layered perceptron (MLP), a form of neural network based on machine-learning techniques, requires no assumptions on the class data distributions (although in general it does assume a gaussian noise distribution through the data) and has no real lower limit on the size of the training classes. The MLP consists of three layers of nodes, as per Figure 1. Each node of the (single) hidden layer, along with its subset of the W weight matrix, is responsible for the generation of a separating hyperplane within the p-dimensional attribute-space. The task of each node is to separate one particular class (c_i) from one other (c_j). The output-layer nodes and associated connections are then responsible for the amalgamation of one or more of these hyperplanes into q-1 class-boundary decision surfaces and, in a process analogous to 'superclass' formation, can sum disjoint regions to provide a final classification strategy. Used correctly, the MLP can give a significant improvement over the classification performance of an MLC on complex, real-world data, as shown later (or see Gahegan & West, 1998 for comparison details).
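The three-layer structure can be sketched as a single forward pass. Again this is a hypothetical Python/NumPy sketch, not the DONNET implementation; each row of W acts as one hidden-node hyperplane and U amalgamates the hyperplanes into class outputs, as described above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W, bW, U, bU):
    """One forward pass through a three-layer MLP.
    W: (hidden, p) weights -- each row is the normal of one separating hyperplane.
    U: (q, hidden) weights -- combines hyperplanes into class decision surfaces."""
    h = sigmoid(W @ x + bW)       # which side of each hyperplane x lies on
    z = sigmoid(U @ h + bU)       # amalgamate into q class outputs
    return int(np.argmax(z))      # winner-takes-all class label
```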

2. Overview of the MLP classifier

The MLP classifier’s operation is covered in detail in German & Gahegan (1996). In brief, an input vector is placed on the input nodes and is propagated to the output layer via the weight connections and the hidden-layer. This is done for each vector in the training set (one iteration). Each node in the hidden and output layers transforms the sum of its inputs via an activation function, normally a sigmoid of the form:

y = 1 / (1 + e^(-φ(x + θ))) .............. (2)

where y is the node's output, x is the summed input, θ is a threshold (bias) value and φ is a gain, normally treated as a constant and set to 1. During the training phase, the network seeks to reduce its output error E by varying the connecting weight matrices W and U after each iteration. The cost function E is the metric used by the network to judge the effectiveness with which it has learned the transfer function G(n, p) of Eqn. 1. The most commonly used cost function is the mean-squared error (MSE) function given by:

E = (1/(2n)) Σ_{i=1..n} Σ_{j=1..q} (t_j - z_j)^2 .............. (3)

where n = number of input patterns (vectors) in the training set, q = number of output nodes, t_j is the expected output at node j (of the output layer) and z_j is the actual output of that node, for pattern i. Ideally, E should be a measure of how far from optimal the current state of the network is. In this guise, the MLP classifier is a "winner-takes-all" classifier, i.e., it provides a single class label for each input pixel.
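A minimal sketch of the MSE cost and the winner-takes-all PCC rate, in hypothetical Python/NumPy (assuming the 1/(2n) normalisation; function names are mine):

```python
import numpy as np

def mse_cost(T, Z):
    """Mean-squared error over the training set.
    T: (n, q) expected outputs, Z: (n, q) actual network outputs."""
    return 0.5 * np.sum((T - Z) ** 2) / T.shape[0]

def pcc(T, Z):
    """Winner-takes-all percentage-correct classification rate."""
    return 100.0 * np.mean(np.argmax(Z, axis=1) == np.argmax(T, axis=1))
```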

The weight variation is performed by some form of minimisation routine (converging on a minimal E across the multi-dimensional weight-space), and there are many numerical routines available to implement this (e.g., Battiti & Masulli, 1990; Luenberger, 1984; Bryson & Ho, 1969). Traditionally, the standard neural network minimisation function is based on gradient descent, a rather simplistic numerical method that uses first-order information derived from the partial derivatives with respect to the weight vectors and requires the user to "guess" certain parameters prior to implementation. An MLP implementing gradient descent is commonly referred to as a "vanilla back-prop" neural network; it has a reputation for being difficult to configure and initialise (e.g., Murata et al., 1991; Knerr et al., 1990; Jacobs, 1988) and may take 5,000 to 100,000 iterations to converge. Even so, it is unfortunately the method of choice for many GIS researchers when implementing neural network classifiers (for example, see Kamata & Kawaguchi, 1993; Foody, 1995; Bischof et al., 1992). The minimisation routine favoured by the author is a modified, multiple-strategy scaled conjugate gradient (SCG) routine, adapted from Moller (1993), which requires no user-provided parameters. It uses estimated second-order information, performs better in complex spaces than gradient descent without the overhead of true second-order routines such as Newton-Raphson, and can choose between several search strategies, depending on the local error surface (see German, 1999). Acceptable convergence is often in the order of a few hundred iterations.
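The contrast drawn above can be made concrete with the one-line gradient-descent update. This is a sketch, not the author's SCG routine; note the learning rate eta, exactly the kind of parameter the user must "guess" in advance.

```python
import numpy as np

def gradient_descent_step(w, grad_fn, eta=0.1):
    """One 'vanilla back-prop' update: first-order information only,
    and eta must be chosen by the user before training starts."""
    return w - eta * grad_fn(w)

# Toy run: minimise f(w) = w^2 (gradient 2w) starting from w = 1.0.
w = 1.0
for _ in range(100):
    w = gradient_descent_step(w, lambda v: 2 * v)
```

With a badly guessed eta (say eta = 1.1 here) the same loop diverges, which is why parameter-free routines such as SCG are attractive.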

3. Problems with the MSE cost function

Ideally, during training, one would expect an inverse relationship between the cost function and the classification performance: for every decrease in E, there would be a corresponding increase in the percentage-correct (PCC) classification rate (see Figure 2). However, for the MSE cost function of Eqn. 3, the relationship on the Kioloa dataset is as shown in Figure 3.

Obviously, a reduction in E does not always lead to an increase in classification performance. Alternative cost functions, such as the classification figure-of-merit (CFM function, see Hampshire & Kumar, 1993) and the buffered MSE (BMSE, see German, 1999), have been devised to alleviate this problem. Unfortunately, CFM cost functions, although highly efficient, do not model output noise as a gaussian distribution and so cannot be used directly with the SCG-based minimisation routines. The author has used CFM functions with vanilla back-prop, where they do exhibit greater performance, but with all the drawbacks that go along with that form of neural architecture (see German, 1999, for details). On the other hand, BMSE can be used readily with SCG. The underlying idea of BMSE is that, so long as the correct output node has the highest output value (hence the input vector is assigned to the correct class), the network should not be penalised for small 'errors' in the output of losing nodes, provided they are below some error threshold (k). By comparison, the MSE algorithm penalises all nodes regardless. Figure 4 shows the performance of the BMSE algorithm on the Kioloa dataset, which is much closer to the ideal relationship of Figure 2.
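One plausible reading of the BMSE idea can be sketched in Python/NumPy. The definitive formulation is in German (1999); the threshold k, the masking rule and the function name here are my assumptions.

```python
import numpy as np

def bmse_cost(T, Z, k=0.1):
    """Buffered MSE sketch: when the correct node already wins, losing-node
    errors smaller than the threshold k are not penalised (set to zero).
    T: (n, q) targets, Z: (n, q) actual outputs."""
    E = (T - Z) ** 2
    won = np.argmax(Z, axis=1) == np.argmax(T, axis=1)  # correct node highest?
    losers = T != T.max(axis=1, keepdims=True)          # nodes whose target is 'low'
    small = np.abs(T - Z) < k                           # errors under the buffer
    E[won[:, None] & losers & small] = 0.0              # buffer those errors away
    return 0.5 * np.sum(E) / T.shape[0]
```

Compared with plain MSE, only the winning node's own error remains when the classification is already correct and the losing errors are small.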

Figure 4: Relationship of BMSE cost function and performance on the Kioloa dataset.

4. Performance of the MLP classifier

Table 1 gives comparative performance figures for various classifiers on the same complex dataset. The dataset is from the Kioloa area of New South Wales, Australia, which has been made available as a NASA pathfinder dataset through the Australian National University in Canberra (Lees and Ritman, 1991). A Landsat composite image is shown in Figure 5. The study area is a complex natural forest habitat of approximately 15 km square; there are 9 floristic-level classes to be delineated from 11 attribute layers (four of these represent nominal or ordinal data, four are Landsat TM bands), i.e. q = 9 and p = 11. The output classes, in order, are: dry sclerophyll, E. botryoides, lower slope, wet E. maculata, dry E. maculata, rainforest ecotone, rainforest, cleared land and water. The ground truth (1705 pixels from an image of 275 625 pixels) does not consist of homogeneous regions of pixels; rather, the pixels are randomly scattered throughout the dataset, in all a difficult task for any classifier, as can be seen from Table 1. The PCC figure gives the overall percentage correctly identified by the classifier, and the ANR figure is an average of the performance over each of the 9 classes. The Discrete Output Neural NETwork (DONNET) MLP is a simulation written by the author to implement the results of the current research at Curtin University into neural network classifiers. The code is available on the Web at www.cs.curtin.edu.au/~gordon/donnet. The DONNET MLP's architecture and weights are derived as per German & Gahegan (1996). The DONNET MLP, vanilla back-prop and the C4.5 figures were all produced by the author, with the C4.5 decision tree optimised for maximum generalisation. The MLC classification is from Gahegan & West (1998), who do not provide validation figures.

Table 1: Comparison of various classifiers on the Kioloa dataset.

                        Training                Validation
Net Type                PCC       ANR           PCC       ANR
DONNET MLP              81.13%    73.38%        72.61%    60.37%
C4.5 Decision Tree      78.31%    65.27%        66.96%    52.36%
Vanilla back-prop       68.88%    47.18%        50.77%    43.23%
MLC                     50.50%    48.50%        -         -

As is described in German et al. (1998), the MLP’s performance can be further improved by the addition of particular hidden-layer nodes to aid in the formation of more complex decision-surfaces. With prudently placed additional nodes (determined from analysis of the zero-confusion task matrix and the boundary misclassification rate, or BMR matrix as per German et al., 1998), the performance can be improved to 67.32% ANR for the validation set (82.11% ANR for the training set) after only 300 iterations, considerably better than the vanilla back-prop, MLC or decision tree figures.

5. Restricting the search space

Prior to training, the task-based MLP (German, 1994) has its initial weights, and therefore separating hyperplanes, calculated from the set of linear discriminant functions that maximise the ratio of the between-class variance to the within-class variance for each pair of classes. The class confusion matrix of the classifier prior to training is shown in Table 2. Note that this already shows some classification ability, due to the non-random way in which the initial weights are selected. For such a classifier, we can define a Boundary Misclassification Rate (BMR) matrix. The initial BMR matrix for an MLP with the Kioloa dataset is given in Table 3. This table gives the performance at each class-pair boundary and is derived from the class confusion matrix of Table 2 (see German et al., 1997). Positive figures are errors of omission, negative figures are errors of commission, and figures in bold are errors comprising both (dual-error figures). For example, the table shows that the hyperplane defining the class 1:3 boundary (-9.7%, hidden node 2) has incorrectly assigned 9.7% of class 1 to class 3. Similarly, the class 2:3 boundary (+2.2%, node 9) has incorrectly assigned 2.2% of class 3 to class 2, and the class 3:4 boundary (0%, node 16) has correctly assigned all of classes 3 and 4. In fact, there are 13 zero-error boundaries. This implies that the weights incident to hidden-layer nodes 8, 15, 16, 21, 25, 26, 29, 30, 32, 33, 34, 35 and 36 (reading across the columns of Table 3) are optimally positioned to accomplish their specific primary tasks.
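The notion of a signed boundary error for one pairwise hyperplane can be sketched as follows. This is illustrative only; the actual BMR derivation is given in German et al. (1997), and the sign convention and function name here are my assumptions based on the description above.

```python
import numpy as np

def boundary_bmr(w, b, Xi, Xj):
    """Boundary errors for the hyperplane w.x + b = 0 that is meant to put
    class i on the positive side and class j on the negative side.
    Xi: (ni, p) class-i samples, Xj: (nj, p) class-j samples.
    Returns (omission %, commission %) at this class-pair boundary."""
    side_i = Xi @ w + b > 0
    side_j = Xj @ w + b > 0
    omission = 100.0 * np.mean(~side_i)    # class i samples on the j side
    commission = 100.0 * np.mean(side_j)   # class j samples on the i side
    return omission, commission
```

A boundary with both figures at zero corresponds to a "zero-confusion" entry in the BMR matrix.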

Some of these "zero-confusion" hyperplanes will be redundant in the sense that one (or more) of them may, in fact, do more than one task (see Figure 6). The redundant hyperplanes can be identified by the network (German et al., 1997) and can be used by the network to aid in the construction of more complex decision surfaces elsewhere in the attribute-space.

Table 2: 9 Class Confusion Matrix (0 iterations)

         Class1  Class2  Class3  Class4  Class5  Class6  Class7  Class8  Class9  Totals
True 1      162       1       0      17      15       0       4       5       0     204
True 2       24       0       0       5       3       3       5       4       0      44
True 3       25       0       1       1       3       0       4       1       0      35
True 4       42       0       0      97      21       2       6       0       0     168
True 5       22       0       0      22      73       1       3       0       0     121
True 6       10       0       0      31       6      17       1       0       0      65
True 7       14       3       0       9       7       0      25       0       0      58
True 8        0       2       0       0       0       0       0     109       0     111
True 9        0       0       0       0       0       0       0       0     333     333

Table 3: 9 Class BMR Matrix (0 iterations) on the Kioloa dataset.

         Class1  Class2  Class3  Class4  Class5  Class6  Class7  Class8  Class9   Avg
Class1      -    10.4%   -9.7%   12.4%   10.7%    7.3%   +5.7%    2.2%     0%    7.3%
Class2      -      -     +2.2%    2.5%   -1.1%   -1.6%    4.3%   -1.1%     0%    2.9%
Class3      -      -       -       0%    -2.3%   -0.9%   -2.9%   -1.2%     0%    2.4%
Class4      -      -       -       -     17.8%  -11.8%    9.8%     0%      0%    6.8%
Class5      -      -       -       -       -     +1.0%    3.0%     0%      0%    4.5%
Class6      -      -       -       -       -       -      4.7%     0%      0%    3.4%
Class7      -      -       -       -       -       -       -       0%      0%    3.8%
Class8      -      -       -       -       -       -       -       -       0%    0.6%
Class9      -      -       -       -       -       -       -       -       -     0.0%

As can be seen from Figure 6, as long as one of the redundant hyperplanes is kept in its optimal position, the others can be moved by the network for use in other areas; however, during training, this optimally positioned hyperplane is still often repositioned by the minimisation routine in its search for a global minimum for E. Even for the BMSE cost function, there are still areas of Figure 4 where the classification rate does not increase for a given decrease in the cost function.

Figure 6: Hyperplane 'A' is constructed to separate classes 1 and 2, and 'C' classes 2 and 3; however, each does both tasks perfectly, so one of them is redundant.

Removing the associated weights from the minimisation routine's search space could eliminate this unnecessary "jittering" of the optimally placed hyperplanes. In essence, we "freeze" the values of the W weights associated with one of these optimally placed hyperplanes. This should produce shorter training times and an incremental increase in classification performance.

Freezing of these weights will necessitate some method of setting the U weights that are associated with the hyperplane, as these are not static, but vary as the network attempts to produce a final classification scheme. The values for this subset of the U matrix (u_tmp) can be calculated via the methodology shown in German & Gahegan (1996). The procedure is as follows:

1. Collect the appropriate hidden-layer node outputs (a matrix X of size n × 1, where n = number of samples in the training set).

2. Determine the matrix L (of size n × q) as the values expected at the inputs of the output-layer nodes for the given targets (if the output activation function is sigmoid, an input value of, say, -2.0 will give an output approaching 0.1; +2.0 will give an output approaching 0.9). Each row of L represents an expected vector of node inputs for a particular training vector and each column represents one of the output nodes.

3. A u_tmp vector is then calculated as the inner product of X⁻¹ and L. A singular-value decomposition (SVD) is used to determine X⁻¹, as X is a non-square, non-symmetric matrix (a pseudo-inverse algorithm is too computationally intensive).

4. The u_tmp vector is inserted into the U matrix at the appropriate points.

5. Steps 1 to 4 are repeated for each hidden-layer node to be frozen.
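Steps 1 to 4 above can be sketched as follows, assuming Python/NumPy (not the DONNET code). The ±2.0 target inputs follow the sigmoid example in step 2, and np.linalg.pinv is SVD-based, matching step 3; the function name is mine.

```python
import numpy as np

def frozen_output_weights(h, T, lo=-2.0, hi=2.0):
    """Compute the u_tmp row for one frozen hidden node.
    h: (n,) frozen node's output over the training set (step 1).
    T: (n, q) 0/1 target matrix."""
    X = h.reshape(-1, 1)              # step 1: the (n, 1) output matrix X
    L = np.where(T == 1, hi, lo)      # step 2: expected output-node inputs
    u_tmp = np.linalg.pinv(X) @ L     # step 3: SVD-based pseudo-inverse of X
    return u_tmp                      # step 4: (1, q) row for the U matrix
```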

The minimisation routine does not use the frozen weights of the W matrix, nor the associated output weights of the U matrix. Because of this restricted search space, we would expect a slight reduction in the computational time per iteration, as well as a reduction in the number of iterations required for equivalent convergence.
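Excluding the frozen weights from the search space can be sketched as a masked update. A plain gradient step stands in for the author's SCG routine here; the masking, not the step rule, is the point, and all names are mine.

```python
import numpy as np

def masked_step(W, grad_W, frozen_rows, eta=0.1):
    """Restricted search space: the weights feeding a frozen hidden node
    (one row of W here) are excluded from the minimisation update."""
    mask = np.ones(W.shape[0], dtype=bool)
    mask[frozen_rows] = False
    W = W.copy()
    W[mask] = W[mask] - eta * grad_W[mask]   # only unfrozen hyperplanes move
    return W
```

In the worked example below, freezing hidden node 8 corresponds to frozen_rows = [8].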

6. A worked example on the Kioloa dataset

The MLP is constructed with 11 input nodes, 36 hidden nodes, and 9 output nodes. The W weights are calculated from the Fisher pairwise linear discriminant functions and the U matrix is calculated from the known training class labels and the hidden-layer outputs (see German & Gahegan, 1996, for complete details). In this initial state, prior to training, the class confusion matrix of Table 2, the BMR matrix of Table 3 and the zero-confusion task matrix of Table 4 can be derived. The zero-confusion task matrix, derived by counting the number of times each hidden-layer node "fires" for a particular class, shows the group of 7 compatibly redundant hyperplanes that do exactly the same tasks with 100% accuracy, or zero confusion. In this particular example there is only one group, unsurprisingly associated with class 9 (water); the confusion matrix shows that only class 9 is initially perfectly separated (in general, the zero-confusion task matrix may describe multiple groups with identical zero-confusion tasks). Of these 7 hyperplanes, one must be left in position (which one is arbitrary), while the others can be used by the network in other areas that may require more complex decision surfaces (e.g., the class 4:5 boundary, as can be seen from Table 2). Let us choose hyperplane h8 to be "frozen" in position. This hyperplane will perform all of the tasks in the 3rd column of Table 4 with 100% accuracy. To accomplish this, we must ensure that the network's minimisation routine does not change any of the weights feeding into node 8, and that the weights feeding out of it are modified appropriately at each iteration.

We run the network for 300 iterations and produce the tables and figures below. Table 5 gives the classification results, compared to the original MLP model, Table 6 shows the class confusion matrix, and Figure 7 shows the comparative % classification as a function of the number of training iterations.

Table 5: Comparison of various classifiers on the Kioloa dataset.

                        Training                Validation
Net Type                PCC       ANR           PCC       ANR
Restricted DONNET       84.72%    76.53%        75.44%    64.34%
Unrestricted DONNET     81.13%    73.38%        72.61%    60.37%
Vanilla back-prop       68.88%    47.18%        50.77%    43.23%

Table 6: 9 Class Confusion Matrix (300 iterations)

         Class1  Class2  Class3  Class4  Class5  Class6  Class7  Class8  Class9  Totals
True 1      170       6       1      13      13       0       1       0       0     204
True 2        5      28       0       3       2       0       3       3       0      44
True 3       13       1      15       0       2       0       3       1       0      35
True 4       17       3       2     125      13       6       2       0       0     168
True 5       10       3       0      14      91       3       0       0       0     121
True 6        3       3       0       8       1      50       0       0       0      65
True 7        5       1       0       6       4       0      42       0       0      58
True 8        0       0       0       0       0       0       0     111       0     111
True 9        0       0       0       0       0       0       0       0     333     333

Figure 7: Classification performance during training on the Kioloa training set. The upper plot is for the restricted network.

As can be seen from Table 5 and Figure 7, the restricted MLP produces a better final classification of 64.34% ANR on the validation set (up by almost 4 percentage points), with a slightly higher learning rate. The final classified image is shown in Figure 8, along with an image classified by the MLC. The superiority of the MLP image is evident in the lower granularity of the image, especially in the central regions.

7. Conclusions

We have shown how to improve the performance of a neural network classifier on GIS data by intelligent restriction of hyperplane movement. This is achieved by freezing the appropriate W weights and recalculating the associated hidden-layer/output-layer weights. The method requires additional analysis prior to starting the training phase; however, by reducing the amount of unnecessary hyperplane readjustment, an improvement in the learning rate is achieved. Although this improvement is only slight, over the few hundred iterations typical of the training phase it leads to a significant overall benefit.

Figure 8: Classified images of the Kioloa dataset. The upper image is from the MLC classifier, the lower from the DONNET MLP.

References

German, G., M. Gahegan and G. West, 1997. Predictive Assessment of Neural Network Classifiers For Applications in GIS, Proceedings of the 2nd International Conference on Geocomputation, University of Otago, NZ.