Criteria for Evaluating Alternative Hypocentres

R J Willemann

Like many agencies computing hypocentres, the International
Seismological Centre may change its location algorithm or travel time
model in an attempt to improve hypocentral accuracy. Where so-called
ground-truth locations are known, a newly computed epicentre or depth
that is closer to ground-truth is certainly superior. For the great
majority of events, however, no ground truth location is known. Is
there any way to evaluate the quality of new hypocentres for which no
ground truth is known?

My answer is that a better hypocentre should fit the data better.
Perhaps you agree that this is true in the case of a better
travel-time model, but doubt that it is true for a better location
algorithm. That case, however, simply emphasizes the importance of
measuring the misfit appropriately. For example, if your algorithm
assigns different a priori weights to different data, then the misfit
measure might include the weights. It turns out, however, that this
must be done with some care and that the treatment of a posteriori weights
can be particularly troublesome.

For my examples, I will use some of the results from a relocation study
carried out mainly by Chen Qi-fu of the China Seismological Bureau,
who spent most of last year at the ISC as a Royal Society visiting
research fellow. Qi-fu is helping the ISC to evaluate the utility of
3-dimensional global tomographic models for routinely locating
earthquakes for the ISC Bulletin. His objectives last year were based
on simply replacing the Jeffreys-Bullen travel times with various
other models. For the relatively few events with good ground truth he
obtained the most accurate locations with a block model of Kelli
Karason and Rob van der Hilst. I will be using Qi-fu's relocations of
about 3800 ISC earthquakes from January 1998, which include no
ground-truth events.

What I show on this slide is that, at first glance, there appears to
be very little improvement in the travel time residuals - the average
of the RMS misfits was reduced by just a few per cent. If you study
the histogram of RMS residuals you may be able to convince yourself
that there is "clear" improvement but even if we managed to show that
this change is statistically significant I think that you would agree
with me that the improvement is surprisingly small. Now one thing
that we have to keep in mind is that the misfit computed by the ISC is
a weighted RMS value. The weights are computed using Jeffreys' method
of Uniform Reduction, which enabled Jeffreys and Bullen to develop
their famous travel time tables.

Unlike standard weighted least squares, Uniform Reduction assumes
no a priori information about data quality, but instead assigns
smaller weights to arrival times with larger residuals. The weights
must be computed iteratively, of course, based on the residuals in the
previous iteration. There is one free parameter, m, for which the
ISC always uses 0.05. The other apparent parameter, σ, is actually
computed iteratively as well: it is the weighted RMS, computed from the
residuals and weights of the previous iteration.
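To make the iteration concrete, here is a minimal sketch in Python. It
assumes the common mixture-model form of Jeffreys' weight,
w(r) = 1 / (1 + μ exp(r²/2σ²)); the constant μ and the convergence test
used in the ISC's actual code may well differ.

    import numpy as np

    def uniform_reduction_weights(residuals, m=0.05, n_iter=20):
        # Iterative Uniform Reduction weights: a sketch assuming the common
        # mixture form w(r) = 1 / (1 + mu * exp(r^2 / (2 sigma^2))).
        # mu below is an assumed relation to the fraction m; the exact
        # form used at the ISC may differ.
        r = np.asarray(residuals, dtype=float)
        w = np.ones_like(r)                                # start from unit weights
        mu = m / (1.0 - m)
        for _ in range(n_iter):
            # sigma is the weighted RMS from the previous iteration
            sigma = np.sqrt(np.sum(w * r**2) / np.sum(w))
            z = np.minimum(r**2 / (2.0 * sigma**2), 50.0)  # cap the exponent for safety
            w = 1.0 / (1.0 + mu * np.exp(z))
        sigma_u = np.sqrt(np.sum(w * r**2) / np.sum(w))    # the final sigma_U
        return w, sigma_u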

This "data dependent" parameter turns out to have some peculiar
effects. Suppose, for example, that the original data include some
clear outliers, say 10% to 30% of residuals are twice as large as the
rest of the data. Suppose, further, that you find some way to bring
those into line with the other data; perhaps they come from stations
where inaccurate station corrections were used previously. It is easy
to show with some numerical trials that your final estimate of σ_U is
quite likely to increase! The reason is that the outliers were
previously assigned w = 0 and made no contribution to σ_U. But after
correcting them, the former outliers do contribute to σ_U and, of
course, are as likely to be larger than the other residuals as
smaller.
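A trial of this kind is easy to run. The sketch below re-uses
uniform_reduction_weights from above on made-up numbers - the sample
sizes and the 20% outlier fraction are illustrative, not Qi-fu's data -
and simply counts how often σ_U goes up when the outliers are replaced
by well-behaved values.

    # Re-uses uniform_reduction_weights (and numpy as np) from the sketch above.
    rng = np.random.default_rng(1)
    n_up, n_trials = 0, 200
    for _ in range(n_trials):
        good = rng.normal(0.0, 1.0, 400)        # well-behaved residuals
        bad = 2.0 * rng.normal(0.0, 1.0, 100)   # ~20% outliers, twice as large
        fixed = rng.normal(0.0, 1.0, 100)       # the same picks after correction
        _, s_before = uniform_reduction_weights(np.concatenate([good, bad]))
        _, s_after = uniform_reduction_weights(np.concatenate([good, fixed]))
        n_up += s_after > s_before
    print(f"sigma_U increased in {n_up} of {n_trials} trials despite better data")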

We could try simply ignoring the weights when computing a final misfit
value. I am not suggesting abandoning Uniform Reduction - the
locations are still computed in the same way with iteratively updated
weights and σ. I'm saying that after we have computed the
hypocentres using Uniform Reduction, let's re-calculate the data
misfits ignoring the weights. So that's what I've plotted here and, I
hope you'll agree, improvement of the residuals as a result of using a
3-D model is much more clear. (As a reminder, I have included the
histogram of weighted RMS from the first slide in miniature.)
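In code, the distinction between the two misfit measures is just this
(a sketch; r and w stand for one event's residuals and its final
Uniform Reduction weights):

    def weighted_rms(r, w):
        # the misfit the ISC reports, with Uniform Reduction weights
        return np.sqrt(np.sum(w * r**2) / np.sum(w))

    def unweighted_rms(r):
        # the re-calculated misfit: every residual counts equally
        return np.sqrt(np.mean(np.asarray(r)**2))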

Now among these unweighted RMS values, the larger ones are quite
meaningless: many of the values > 1s are heavily influenced by a few
large outliers. Thus, to quantify the improvement I might simply say
let's consider all of the events where the unweighted RMS misfit of
the original locations was <1s. For these events the new travel time
model and relocations cut the average RMS misfit from 0.51s to 0.38s.
This is a 25% reduction, which is certainly much better than the
reduction of the weighted RMS residual.

Perhaps you are still disappointed. After all, we are switching from
a model developed closer to the year when seismometers were invented
than to the present. You had hoped, maybe, that the improvement would
be expressed not as a percentage reduction but as a factor - 3 times
smaller errors, 5 times, who knows, 10 times smaller errors. Perhaps
the problem is that we are ignoring exactly the cases where the most
improvement was possible: where the residuals were large!

Also, the question of statistical significance has still not been
addressed. So these are the two issues I want to dig into: statistical
significance and a way to include all of the data.

These histograms show the shape of the probability density functions.
Perhaps they even look like chi-square distributions to you.
Unfortunately, since the data include many outliers, they are not
expected to be chi-square and, in fact, they show large deviations
from chi-square even if you take account of the number of stations
used to compute each hypocentre. Without a good statistical model it
is very difficult to state the uncertainties of the means. That is,
while we can say that the mean unweighted RMS of Bulletin residuals is
0.51, we cannot say, for example, that it was 0.51 ± 0.03, meaning
that we expect it to be between 0.48 and 0.54 in about ten out of
twelve months each year.

If, instead of plotting the density distributions we plot the
cumulative distributions then, perhaps, you will be reminded of a very
well-known test of statistical significance. These plots are based on
the same data as the histograms, but show the integrals of the density
function. If these were two independent samples drawn from the same
population, then their cumulative distributions would never differ
very much. In fact, they differ by quite a bit: Out of all events
that were "not too bad" to begin with, only about 1 in 8 Bulletin
hypocentres have an unweighted RMS residual less than 0.3s, but nearly
half of the hypocentres computed with Karason and van der Hilst's
model fit the data this well.

One standard way of comparing two distributions plotted this way is
the Kolmogorov-Smirnov test. Kolmogorov and Smirnov showed how to
compute the probability of the CDF's of two samples from the same
population differing by any given amount, without making any
assumptions about the population distribution. That's exactly what we
require here, since we don't know the distribution of residuals.
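With scipy the two-sample test is a one-liner. Here is a sketch on
stand-in arrays; the gamma draws are placeholders for the two sets of
per-event unweighted RMS values, not real data:

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    rms_jb = rng.gamma(shape=4.0, scale=0.13, size=1400)  # placeholder Bulletin values
    rms_3d = rng.gamma(shape=4.0, scale=0.10, size=1400)  # placeholder relocated values

    stat, p = ks_2samp(rms_jb, rms_3d)
    print(f"D = {stat:.3f}, p = {p:.3g}")  # a tiny p means the CDFs differ significantly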

With samples of more than 1400 values, a difference as large as this
is basically impossible for two samples drawn from the same population
- the probability of this happening by chance is something like
10^-137. So, Qi-fu's relocations of ISC events achieved a better fit
to the data with a very high level of confidence.

The other problem with the tests that we have looked at so far is
treatment of large residuals, which unduly influence the unweighted
RMS, or standard deviation. The feature of standard deviation that is
causing trouble is that it is not robust. That is, an arbitrarily
small fraction of the data - even just one sample - can wreak complete
havoc.

Now, what I have shown here is the cumulative distribution function of
arrival time residuals for one earthquake. The residuals from the ISC
Bulletin are shown in black and the residuals from Qi-fu's relocation
are shown in red, and each set looks reasonably well-behaved. The
density function is proportional to the slope, so it has a high value
near r=0 and smaller values for larger positive or negative residuals
- that is, the residuals are more or less normally distributed.

In a normal distribution, 68% of the values are within 1 standard
deviation of the mean. In this example the standard deviation of
Qi-fu's residuals is 0.9s, but 80% of the residuals are within ±0.9s,
quite a bit more than 68%. What's gone wrong is that there is one
moderately outlying residual of -2.5s, which changes the standard
deviation so much that it no longer remotely represents the range over
which we find 68% of the values.

Of course if that's what we really wanted we could just compute it:
sort the absolute value of residuals, go up to 68% and, bang, that's
our measure of how broadly the residuals are distributed. Some people
would object that this ignores the other 32% of the data but it
doesn't, really, because those other 32 out of every 100 values tell
us where to draw the line. Some unknown fraction of the values beyond
that are outliers, so we are happy to limit their role in computing our
statistic in order to make it robust.
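That computation is trivial. A sketch, with the moderate outlier from
the example above wired in as made-up numbers:

    import numpy as np

    def scale_68(residuals):
        # robust scale: 68th percentile of |residuals|; for a clean normal
        # sample this approximates one standard deviation
        return np.percentile(np.abs(residuals), 68.0)

    rng = np.random.default_rng(2)
    r = np.append(rng.normal(0.0, 0.6, 40), -2.5)  # one moderate outlier
    print(f"std = {np.std(r):.2f}, 68th percentile = {scale_68(r):.2f}")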

Actually, the flaw in the 68th percentile measure is that it doesn't
make as much use as it might of small residuals, i.e. the good data.
Just as large residuals can be increased without bound and leave the
68th percentile unchanged, so almost all of the small residuals could
be reduced to 0 - leaving aside only the values just below the 68th
percentile - and the statistic would be unchanged. Thus, a better measure of
the scale of the residuals is to set aside some fixed fraction,
starting from the largest ones, and then compute the mean square of
those that remain. Statisticians call this "trimmed variance". It is
no longer an estimate of sample variance, but it is robust and it does
take full advantage of all of the "good data", i.e. values within the
fraction of all data that we decide to keep.

An alternative to trimming the sample, throwing away some of the data,
is to "Winsorise" it, which means to take all of the data that we were
about to throw away and instead pretend that they have values equal to
the largest ones that we are keeping. It sounds peculiar, but
statistics of the Winsorised sample have the same robustness features
as those of the trimmed sample, yet for samples without any outliers
the Winsorised statistics are somewhat closer to the standard sample
statistics.
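Both measures are only a few lines of numpy and scipy. A sketch, using
two-sided limits (the 20% defaults match what I use later) and taking
mean squares about zero, as befits residuals, rather than variances
about the mean:

    import numpy as np
    from scipy.stats.mstats import winsorize

    def trimmed_variance(residuals, frac=0.2):
        # drop the largest and smallest `frac` of the sorted values,
        # then take the mean square of what remains
        x = np.sort(np.asarray(residuals, dtype=float))
        k = int(frac * len(x))
        return np.mean(x[k:len(x) - k] ** 2)

    def winsorised_variance(residuals, frac=0.2):
        # clamp the extreme `frac` at each end to the nearest retained
        # value instead of discarding it, then take the mean square
        x = winsorize(np.asarray(residuals, dtype=float), limits=(frac, frac))
        return np.mean(np.asarray(x) ** 2)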

So, I have mentioned five different "measures of scale", and let's
review their properties. Standard deviation takes advantage of all
good data - the small residuals - but it is not robust because it
tries too hard to take advantage of all the data, some of which are
outliers. Uniform Reduction also takes advantage of all good data, by
assigning them large weights, and it achieves robustness by assigning
null weights to outliers. Unfortunately it has the flaw I described
previously of sometimes getting larger when we reduce some of the
residuals. Statisticians refer to a measure of scale that avoids this
flaw as "measures of dispersion", so we can say that the Uniform
Reduction RMS fails to be a measure of dispersion.

Each of the other measures of scale is robust and a measure of
dispersion; in principle we could use any of them to compare differently
computed hypocentres. But the 68th percentile would fail to show
improvement if we only made the good half of the residuals better, so
I'm not going to use it. Between the trimmed variance and the
Winsorised variance it really doesn't make any difference. Actually,
I sort of wish that I had used the trimmed variance because it's
easier to describe, but that's not what I did and so, rather than
re-computing everything I'm going to show results using the Winsorised
variance.

So, we are interested in how well a new travel time model or algorithm
works with different subsets of the events. If it was a new algorithm
we might be interested in how well it performs when there are only a
few stations or when the station distribution is poor. For the
example I'm using, model accuracy varies from place to place, so
I'm going to look at regional subsets. From the viewpoint of the
method it doesn't really matter: somehow the evaluator chooses
potentially interesting subsets of the events.

For each event in a subset, I compute the square root of the
Winsorised variance of its residuals, i.e. this robust measure of
dispersion that I have described. I used the 20% Winsorised sample.
This sounds pretty extreme - replacing the 20% largest values and the
20% smallest, leaving only 60% of the original sample untouched - but
it's what they do in all of the textbooks on robust statistics, so as
I set out across this relatively unfamiliar terrain I'll just follow
the convention.

So now I have two numbers in hand for each earthquake: the root
Winsorised variance of the Bulletin residuals and of the re-location
residuals. I sort each set of numbers, plot the sample CDF's, as
shown here, measure the Kolmogorov-Smirnov statistic, and then compute
the level of confidence with which I can say that the samples differ.
This example is for Flinn-Engdahl region 12 - Kermadec - and the
samples differ with more than 99.9% confidence. That is, re-location
of Kermadec trench earthquakes using the 3-dimensional model has
definitely improved fit to the data.
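Putting the pieces together, the per-region procedure is short. In this
sketch, `events` is a hypothetical list of (Bulletin residuals,
relocation residuals) pairs for one Flinn-Engdahl region, and
winsorised_variance is the function sketched earlier:

    import numpy as np
    from scipy.stats import ks_2samp

    def region_confidence(events, frac=0.2):
        # root Winsorised variance per event, under each set of hypocentres
        rwv_old = [np.sqrt(winsorised_variance(old, frac)) for old, _ in events]
        rwv_new = [np.sqrt(winsorised_variance(new, frac)) for _, new in events]
        stat, p = ks_2samp(rwv_old, rwv_new)  # two-sample K-S across events
        return stat, 1.0 - p                  # confidence the distributions differ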

So, I repeat this for the 50 Flinn-Engdahl seismic regions around the
world. Actually I do it for 48 regions because Qi-fu's test included
no events in two of the regions. Now, if the changes were actually
random rather than genuine improvements then I would expect the
confidence levels to be uniformly distributed between 0% and 100%.
That is, just by chance I would expect 10% - 4 or 5 of 48 regions - to
have distributions that apparently differ with more than 90%
confidence. In fact, it turns out that the difference is significant
at the 90% level in 13 of 48 regions, or more than one quarter.

One thing to keep in mind here is that the Kolmogorov-Smirnov test has
relatively "low power". That is, when we control the rate of "false
positives", the test will produce a large rate of "false negatives".
Obviously this is not desirable; it's the price we pay for not knowing
the population distribution.

So, in the global map of Flinn-Engdahl regions I have blanked out the
identifying numbers in most cases - I left in the numbers for the 13
regions where the distribution of Winsorised RMS is most significantly
changed. Many of them are regions with poor station coverage - broad
oceans and the southern hemisphere generally. There are exceptions to
this characterisation of regions where we were able to improve ISC
hypocentres; the Philippines have excellent coverage of course. But
regions 11-14 have the most significant improvement of all, and the
poor station coverage in this part of the world is notorious, since
only Australian and New Zealand stations to their west and south
record many events.

Well, having identified subsets of events where some new hypocentres
fit data particularly well, or poorly, we will probably want to take a
look at individual earthquakes. We never did solve the problem of
estimating uncertainty of our statistics: we can say what the value of
Winsorised variance is, but we can't give an uncertainty and then
expect some known fraction of the values to fall in that range, say
for events next month.

So, we're back to using the relatively low-power Kolmogorov-Smirnov
test, this time to compare residuals of individual earthquakes. Since
the number of values in each sample can be quite small we're going to
suffer from a high rate of false negatives: instances where the test
fails to show that the residuals have been improved. We also face the
problem that the two samples are not drawn from completely independent
populations. The new travel times have, hopefully, corrected a large
part of the model error that contributes to the residuals but errors
arising from mis-measured time, mis-association of arrivals, and so on
are all common to the two samples. So, realistically, we might come
to use quite low confidence levels to point us to earthquakes where
residuals have changed enough to be interesting.
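At the level of a single earthquake the test is the same call, just
applied to the raw residuals; a sketch with placeholder arrays:

    # Hypothetical residuals for one event (placeholders, not real picks).
    res_bulletin = np.array([0.8, -1.1, 0.3, 1.9, -0.4, 0.6, -2.5])
    res_reloc = np.array([0.2, -0.6, 0.1, 0.9, -0.3, 0.4, -2.5])
    stat, p = ks_2samp(res_bulletin, res_reloc)
    print(f"confidence = {1.0 - p:.2f}")  # small samples mean low power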

This example is the same one that I used earlier when I described the
Winsorised variance. In this case we don't have to worry about all of
those caveats about low power and dependent populations: the change in
the residuals is significant at more than the 99% confidence level.

So, we now have two ways to compare the residuals from different
hypocentres for an earthquake: the Kolmogorov-Smirnov confidence that
the distributions differ and the difference between the root
Winsorised variances, for which I cannot estimate an uncertainty (or
even a confidence that it is non-null), but which at least does give
me an idea of whether the new hypocentre made the residuals better or
worse.

OK, two interesting numbers for many examples - let's make a scatter
plot and see if there's any relationship between them. Here's what I
found: In most of the 13 regions where I previously reckoned that
there was a significant change, the difference between the Winsorised
variances indicated improvement in almost every case where the
Kolmogorov-Smirnov confidence exceeded something like 0.2. I show
examples from a couple of the Flinn-Engdahl regions here. I think that
we're really getting somewhere; we have good evidence that while the
residuals do become somewhat more dispersed in some cases, those are
just the earthquakes where we didn't change the residual distributions
perceptibly anyway.

What's more, we have a general rule that works so well that the
exceptions are rare enough that we can think about taking a look at
them individually to try and work out what went wrong. If, instead,
there were many points in the upper right quarter of these figures
then we might be seriously concerned, since that would mean a good
number of earthquakes where we changed the residuals a lot and made them worse.

Let's take a closer look at those four southwestern Pacific
regions, New Zealand, Kermadec, Tonga-Fiji and Vanuatu, where the
change in the distribution of Winsorised variances was most
significant. First the moderately good news: the Winsorised variance
indicates that we have reduced dispersion of the residuals for 183 of
these 242 earthquakes. The bad news, of course, is that we have
increased dispersion of the residuals for the other 59 earthquakes
and, more seriously, we have done so while changing the residuals
significantly in many cases. That is, there is an uncomfortably large
number of earthquakes for which the relocation changed the residuals
quite a bit and made them worse!

The most striking thing that I found among earthquakes in these
regions is that, with only a few exceptions, the residuals became more
dispersed only when the re-location was done with a fixed depth. That
is, if the re-location failed to converge with a free depth, the
program fixed the earthquake at the depth in the Bulletin hypocentre
and tried to converge on a solution that way. That's why there's this
concentration of points representing earthquakes for which the
Bulletin and 3-D hypocentres have exactly the same depth. It didn't
just work out that way; the data were insufficient to tell the depth,
so depth in re-location was fixed to the depth in the Bulletin. The
fact that the residuals often became more dispersed tells us that,
while we don't know the depth, in those cases the 3-D travel time
model is not very consistent with the Bulletin depth.

Before I give a summary of the methods that I've talked about, I must
describe the results for one of the 13 regions where the
Kolmogorov-Smirnov test indicated a significant change in the
distribution of Winsorised variances - region 48, "Pamir - Hindu
Kush". For 23 of the 25 test events in this region the Winsorised
variance increased: the residuals became more dispersed. What's more,
the Kolmogorov-Smirnov test for comparing the residual distributions
of individual earthquakes often gave a confidence greater than 0.2.
That is, by the criterion that I have come to use to indicate
significant changes in the residuals, these residuals really did
change. And they got worse!

After what I saw in the southwestern Pacific, the first thing I
looked for was fixed depths. No joy there: these earthquakes mostly
have depths of 100 - 150 km in the Bulletin, and with the 3-D model
the hypocentral solutions converged with a free depth again, usually
ending up 10 - 20 km deeper, and occasionally >40 km deeper.

This region is extraordinary on several counts: there is no place
in the world with crust much thicker than here. For earthquakes at
mantle depths thick crust would not ordinarily have much of an effect
on the majority of ISC hypocentres, since we think of ourselves as
computing teleseismic locations. Many of the earthquakes in this
region, however, are located using an unusually large proportion of
locally recorded arrival times, at least compared to the other 12
regions where residuals changed significantly.

Ray paths used in the tomographic inversion are not as dense here
as in some other regions, so the model may be less accurate in this area.
Alternatively, perhaps we have not been sufficiently thorough in
making use of a realistic 3-D mantle model, but it is not immediately
apparent how this could explain why the change from the J.-B. travel
times (with no crustal corrections) actually made things worse.

In the last few slides I have started to take a close look at some
features of Chen Qi-fu's re-locations using Karason and van der
Hilst's tomographic model. My main point is somewhat more general:
that the fit to the data is an important feature of any set of
hypocentres, and that a quantitative comparison of how well two sets
of hypocentres fit the data can be meaningful. There are a few
caveats.

First, the misfit measure must have a couple of properties that,
fortunately, statisticians have already described pretty well. One of
them, however, is not a property of most misfit measures used in
iteratively computing hypocentres.

Second, incomplete knowledge of the population distribution limits our
choice of tests for statistical significance. And third, the
residuals from two sets of hypocentres computed from the same data are
not really independent. Both of these latter two points tend to
increase the rate of false negatives. That is, if you put complete
faith in the statistical tests, then things are probably better than
you think.