1 Department of Photogrammetry and Geoinformatics, Faculty of Civil Engineering, Budapest University of Technology and Economics, 1111 Budapest, Műegyetem rkp. 3, palancz@epito.bme.hu

2 Department of Geodesy and Surveying, Faculty of Civil Engineering, Budapest University of Technology and Economics, volgyesi@eik.bme.hu

3 Department of Control Engineering and Information Technology, Faculty of Electrical Engineering and Informatics, Budapest University of Technology and Economics, lkovacs@iit.bme.hu

Abstract: The efficiency of soft computing methods such as Artificial Neural Networks (ANN) or Support Vector Machines (SVM) depends considerably on the representativeness of the learning sample set used to train the model. In this study, a simple method based on the Coefficient of Representativity (CR) is proposed for extracting a representative learning set from measured geospatial data. The method, which successively eliminates the sample points with low CR values from the dataset, is implemented in Mathematica, and its application is illustrated by the data preparation for the correction model of the Hungarian gravimetrical geoid based on current GPS measurements.

Keywords: machine learning, representativeness of data, geospatial data.

1 Introduction

During the last decade, machine learning algorithms such as artificial neural networks (ANN) and support vector machines (SVM) have been used extensively in a wide range of applications. They have been applied to classification, regression, feature extraction, data prediction and spatial data analysis.

To ensure the generalization properties of machine learning methods such as artificial neural networks and support vector machines, the set of measured data should be split into learning and testing sets [1]. The question is how to divide the measured sample set into these sets so as to extract as much information as possible. This is especially important when the number of samples is relatively small. Different methods have been suggested for carrying out the learning and testing process with this requirement in mind [2]. The optimal sampling scheme would be a regular triangular or square grid, which keeps the maximum standard error to a minimum [3]. However, geospatial data samples are usually irregularly spaced and do not form a rectangular grid. Qualitatively, these irregularities appear as local clustering and dispersion, but numerical computations need a quantitative characterization of the deviation from the optimal, uniform spatial sample distribution. Different indices have been introduced to indicate the representativeness of a real sample distribution [4]. In this study we employ the Coefficient of Representativity (CR) proposed in [4].

2 Measures of representativity

Let us suppose that we have measured sample points {xi, yi, zi} and that their coordinates {xi, yi} lie in a convex region, see Figure 1.

2.1 Nearest Neighbours Index

One possible characterization of the representativity of this sample set, the Nearest Neighbours Index (NNI), was suggested by [5]. The NNI is defined as the ratio of the mean of the Nearest Neighbours distances (NNdist),

$$ \overline{NNdist} = \frac{1}{N} \sum_{i=1}^{N} NNdist_i \tag{1} $$

where $N$ is the number of sampling points, to the mean of the Nearest Neighbours distances for a uniform random distribution of the points. This Mean Random Distance (MRD) is defined as

$$ MRD = \frac{1}{2} \sqrt{\frac{S_{Total}}{N}} \tag{2} $$

where $S_{Total}$ is the total surface of the investigated region. Thus the NNI is equal to

$$ NNI = \frac{\overline{NNdist}}{MRD} = \frac{2\,\overline{NNdist}}{\sqrt{S_{Total}/N}} \tag{3} $$

Figure 1

Measured data sample points and the border of the convex region.

The NNI is close to 1 for sampling points with a uniform random spatial distribution. When NNI < 1, the samples are more clustered than expected for a uniform random distribution. In contrast, NNI > 1 indicates dispersion of the samples.

The main limitation of this index is that it is a global measure and gives no information about local clusters or dispersions.
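As an illustration of the index described above, the following minimal Python sketch computes the NNI by brute force. It is not the authors' Mathematica code. The function names and the sample data are hypothetical, and the MRD is assumed to be the usual expected mean nearest-neighbour distance of a random pattern, 0.5·sqrt(S_Total/N).

```python
import math


def nearest_neighbour_distances(points):
    """Distance from each point to its closest other point (brute force, O(N^2))."""
    return [
        min(math.hypot(px - qx, py - qy)
            for j, (qx, qy) in enumerate(points) if j != i)
        for i, (px, py) in enumerate(points)
    ]


def nearest_neighbour_index(points, total_area):
    """NNI = mean nearest-neighbour distance / Mean Random Distance (MRD)."""
    nn = nearest_neighbour_distances(points)
    mean_nn = sum(nn) / len(nn)
    # Assumed MRD: expected mean NN distance of a random pattern of equal density.
    mrd = 0.5 * math.sqrt(total_area / len(points))
    return mean_nn / mrd


# Hypothetical data: a regular 4 x 4 grid and a tight cluster, both in a 4 x 4 region.
grid = [(i + 0.5, j + 0.5) for i in range(4) for j in range(4)]
cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
print(nearest_neighbour_index(grid, 16.0))     # dispersed pattern, NNI > 1
print(nearest_neighbour_index(cluster, 16.0))  # clustered pattern, NNI < 1
```

For the regular square grid the mean nearest-neighbour distance equals sqrt(S_Total/N), twice the MRD, so the index is 2. Being a single number for the whole set, it cannot show where the clusters are.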

2.2 Voronoi polygons

Each Voronoi polygon contains exactly one measurement, and its geometry includes all the points that are closer to this measurement than to any other one; isolated points are therefore expected to have larger polygons than those associated with clustered data [6]. The area of the Voronoi polygon belonging to a sample point may be considered as the region of attraction of this point, because the points of this region are closer to this sample point than to the other sample points, see Figure 2.

Figure 2

Voronoi polygons of the data samples and the border points.

Figure 3

Intensity plot of the Voronoi polygons corresponding to their size.

In the case of a uniform distribution of the sample points, the size of the region of attraction of every sample point, that is, the area of the corresponding Voronoi polygon, is the same.

Therefore the histogram of the areas of these polygons may help to describe quantitatively the homogeneity of the sample set.

The main handicap of this measure is that points can be clustered and still have relatively large Voronoi polygons. In other words, large Voronoi polygons do not guarantee that the points are isolated.

For example, the Voronoi polygon belonging to point 6 is larger than those belonging to point 3 or point 5. However, the distance between points 3 and 5 is greater than the distance between points 5 and 6 (Figure 2).
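The weakness of the polygon area as a measure can be reproduced with a small Python sketch. It rests on several assumptions: the region is a rectangle rather than a general convex region, and the Voronoi areas are only approximated by assigning each cell of a raster to its nearest sample point. The function names and the point set are hypothetical and are not the data of Figure 2.

```python
import math


def voronoi_areas(points, bounds, res=80):
    """Approximate Voronoi cell areas in a rectangle by nearest-point rasterisation."""
    xmin, ymin, xmax, ymax = bounds
    dx, dy = (xmax - xmin) / res, (ymax - ymin) / res
    counts = [0] * len(points)
    for i in range(res):
        for j in range(res):
            cx, cy = xmin + (i + 0.5) * dx, ymin + (j + 0.5) * dy
            # Index of the sample point closest to this raster cell centre.
            nearest = min(range(len(points)),
                          key=lambda k: (points[k][0] - cx) ** 2
                          + (points[k][1] - cy) ** 2)
            counts[nearest] += 1
    return [c * dx * dy for c in counts]


def nearest_neighbour_distances(points):
    """Distance from each point to its closest other point."""
    return [
        min(math.hypot(px - qx, py - qy)
            for j, (qx, qy) in enumerate(points) if j != i)
        for i, (px, py) in enumerate(points)
    ]


# Hypothetical data: a close pair on the sparse left side, eight points on the right.
bounds = (0.0, 0.0, 4.0, 4.0)
points = [(1.0, 2.0), (1.2, 2.0)] + [
    (x, y) for x in (3.0, 3.6) for y in (0.5, 1.5, 2.5, 3.5)
]
areas = voronoi_areas(points, bounds)
nn = nearest_neighbour_distances(points)
mean_area = sum(areas) / len(areas)
# The first point has the smallest NN distance but a polygon far above the mean.
print(areas[0], mean_area, nn[0], min(nn))
```

In this configuration the two points of the pair are each other's nearest neighbours, yet each of them owns a large polygon because the surrounding area is empty.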

2.3 Coefficient of Representativity

Dubois [4] suggested a new measure that combines both the distance of each point to its nearest neighbour and the surface of the Voronoi polygons. This measure, called the Coefficient of Representativity (CR), is a product of two terms:

$$ CR = A \cdot B \tag{4} $$

The first term, $A$, takes into account the surface of the Voronoi polygon. It is equal to the ratio of the surface of the Voronoi polygon ($S_V$) to the ideal surface it would have in the case of a homogeneous sample set. This ideal surface is simply defined as the mean surface ($S_m$), that is, the total area of the investigated region, $S_{Total}$, divided by the number of sampling points $N$:

$$ A = \frac{S_V}{S_m}, \qquad S_m = \frac{S_{Total}}{N} \tag{5} $$

The second term, $B$, is equal to the ratio of the squared distance between a point and its nearest neighbour (NNdist) to the mean surface of the Voronoi polygons:

$$ B = \frac{NNdist^2}{S_m} \tag{6} $$

For a regular square grid where the points lie in the middle of each grid cell, $NNdist^2 = S_V = S_m$, and therefore $CR = 1$. The measure based on the area of the Voronoi polygons differs from the measure based on CR, compare Figure 3 and Figure 4.

Figure 4

Intensity plot of the CR values. The gray level intensity of a polygon is proportional to its CR.
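The product of the area term and the nearest-neighbour term described in this section can be sketched in Python as follows. This is an illustrative approximation, not the implementation used in the study. It assumes a rectangular region instead of a general convex one, it estimates the Voronoi areas on a raster, and the function name and test data are hypothetical.

```python
def coefficient_of_representativity(points, bounds, res=60):
    """Approximate CR = (S_V / S_m) * (NNdist^2 / S_m) for every point.

    Assumptions: the region is the rectangle `bounds` = (xmin, ymin, xmax, ymax),
    and each Voronoi area S_V is estimated by counting the raster cells whose
    centre is closest to the point.
    """
    xmin, ymin, xmax, ymax = bounds
    n = len(points)
    dx, dy = (xmax - xmin) / res, (ymax - ymin) / res
    counts = [0] * n
    for i in range(res):
        for j in range(res):
            cx, cy = xmin + (i + 0.5) * dx, ymin + (j + 0.5) * dy
            nearest = min(range(n),
                          key=lambda k: (points[k][0] - cx) ** 2
                          + (points[k][1] - cy) ** 2)
            counts[nearest] += 1
    s_m = (xmax - xmin) * (ymax - ymin) / n  # mean surface S_m = S_Total / N
    cr = []
    for k, (x, y) in enumerate(points):
        nn_sq = min((x - u) ** 2 + (y - v) ** 2
                    for q, (u, v) in enumerate(points) if q != k)
        a = counts[k] * dx * dy / s_m  # A = S_V / S_m
        b = nn_sq / s_m                # B = NNdist^2 / S_m
        cr.append(a * b)
    return cr


# Hypothetical data: a regular 3 x 3 grid, then the same grid with one extra
# point placed very close to an existing one.
bounds = (0.0, 0.0, 3.0, 3.0)
grid = [(i + 0.5, j + 0.5) for i in range(3) for j in range(3)]
print(coefficient_of_representativity(grid, bounds))
print(min(coefficient_of_representativity(grid + [(0.6, 0.5)], bounds)))
```

On the regular grid every CR value is 1 up to rounding. Adding a near-duplicate point drives the CR of the close pair toward zero, because the squared nearest-neighbour distance in the second term becomes very small.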

3 Constructing optimal learning set

Once we have a measure of the representativity of a dataset, an algorithm can be developed to extract samples from the irregular dataset so as to form the best possible learning set.

This optimal extraction process can be considered as a combinatorial max-min problem. Namely, from the n measured patterns, one should select m < n samples in such a way that the minimum of CR in the constructed learning set is the greatest over all possible combinations. Strictly speaking, it is a max(min(CR)) combinatorial problem, and one may solve it by a genetic algorithm.

Figure 5

Intensity plot of the CR values after eliminating two samples.

However, such an algorithm is very time consuming, therefore a suboptimal algorithm may be employed as an alternative solution. In this case, we construct the learning set by successively eliminating samples from the original set of n samples. Namely, we simply drop the sample that currently has the minimal CR and repeat this step n - m times, until m samples remain.

The implementation of this algorithm in Mathematica 5.2 is available in [8].
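Independently of the Mathematica implementation, the two strategies of this section can be sketched in Python for a toy problem. The sketch is only illustrative. It assumes a rectangular region, approximates the Voronoi areas on a raster, and uses an exhaustive search in place of a genetic algorithm to obtain the exact max(min(CR)) solution. All function names and the data are hypothetical.

```python
from itertools import combinations


def cr_values(points, bounds, res=40):
    """Approximate CR per point; Voronoi areas are estimated on a raster (assumption)."""
    xmin, ymin, xmax, ymax = bounds
    n = len(points)
    dx, dy = (xmax - xmin) / res, (ymax - ymin) / res
    counts = [0] * n
    for i in range(res):
        for j in range(res):
            cx, cy = xmin + (i + 0.5) * dx, ymin + (j + 0.5) * dy
            nearest = min(range(n),
                          key=lambda k: (points[k][0] - cx) ** 2
                          + (points[k][1] - cy) ** 2)
            counts[nearest] += 1
    s_m = (xmax - xmin) * (ymax - ymin) / n  # mean surface for the current set
    out = []
    for k, (x, y) in enumerate(points):
        nn_sq = min((x - u) ** 2 + (y - v) ** 2
                    for q, (u, v) in enumerate(points) if q != k)
        out.append((counts[k] * dx * dy / s_m) * (nn_sq / s_m))
    return out


def greedy_learning_set(points, bounds, m, res=40):
    """Suboptimal scheme: drop the point with the currently lowest CR until m remain."""
    remaining = list(points)
    while len(remaining) > m:
        cr = cr_values(remaining, bounds, res)  # CR is recomputed after every removal
        remaining.pop(cr.index(min(cr)))
    return remaining


def exhaustive_learning_set(points, bounds, m, res=40):
    """Exact max(min(CR)) over all subsets of size m >= 2; feasible only for tiny n."""
    best = max(combinations(points, m),
               key=lambda subset: min(cr_values(list(subset), bounds, res)))
    return list(best)


# Hypothetical data: a regular 2 x 2 grid plus one near-duplicate point.
bounds = (0.0, 0.0, 2.0, 2.0)
grid = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5), (1.5, 1.5)]
samples = grid + [(0.6, 0.5)]
print(greedy_learning_set(samples, bounds, 4))
print(exhaustive_learning_set(samples, bounds, 4))
```

In this small example both strategies discard the near-duplicate point and keep the regular grid. In general the greedy result need not coincide with the exact optimum, but the exhaustive search has to evaluate every one of the C(n, m) subsets, which is out of reach for datasets of a few hundred points.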

Let us eliminate two samples from the dataset shown in Figure 1. Comparing Figure 5 with Figure 4, it can be clearly seen that the homogeneity of the sample set has been considerably improved by eliminating the sample points with low CR values.

To illustrate the application of the method to a real-world problem, a learning set will be constructed for a neural network to be trained to model the Hungarian gravimetrical/GPS geoid.

4 Learning set for the Hungarian geoid

4.1 Data preprocessing

Nowadays GPS measurements provide more precise data than the earlier gravimetrical measurements did. However, they are considerably fewer in number than the gravimetrical ones. Therefore it is reasonable to use them for correction. The values of the correction of the gravimetrical geoid, the so-called corrector surface, are based on the differences between the GPS and the gravimetrical measurements [7]. In the case of Hungary we have the following dataset for the corrector surface, see Figure 6.

Figure 6

Locations of the sample values of the corrections and the convex border of the Hungarian region.

Clustering and dispersion of the data points can be clearly seen in Figure 6.

4.2 Computing Voronoi tessellations

First, we compute the Voronoi polygons, see Figure 7.

Figure 7

Voronoi tessellations.

4.3 Computing Coefficient of Representativity

The CR values for the sample points can be computed, see Figure 8.

Figure 8

The distribution of CR in the Voronoi cells. The smaller the value of CR, the darker the corresponding cell.

Figure 9 shows the distribution of the CR values, most of which are small.

The statistics of the CR distribution of the original sample set are shown in Table 1.

Figure 9

The histogram of the CR distribution of the original data set.

Table 1. Statistics of the CR distribution of the original data set (304 points).

Min        Max      Mean     Standard deviation
0.00235    4.712    0.449    0.593

4.4 Successive elimination of sample points having low CR

In order to create the learning set, we eliminate 110 sample points from the original n = 304 data points, which leaves m = 194 points.

Figure 10

Locations of the sample values of the corrections after elimination of 110 points.

Figure 11

Voronoi tessellations of the learning set.

Figures 10-12 show the points remaining after elimination, which form the learning set, the Voronoi tessellation and the distribution of the CR values, respectively.

Figure 13 shows how considerably the CR distribution has changed.

The statistics of the CR distribution of the learning set are given in Table 2.

Figure 12

The distribution of CR in the Voronoi cells in the learning set.

Figure 13

The histogram of the CR distribution in the learning set.

Table 2. Statistics of the CR distribution of the learning set (194 points).

Min       Max      Mean     Standard deviation
0.1606    4.767    0.563    0.469

Conclusions

The suggested method proved successful in considerably decreasing the inhomogeneity of the learning dataset and the differences in the CR indices of the data points. An improvement of this method would be the application of the Voronoi tessellation to non-convex regions. In this way the effect of a non-convex country border could be taken into account and more realistic CR values could be computed.

Acknowledgement

The authors would like to thank A. Kenyeres for providing the GPS/levelling data of Hungary.