Benchmarking,
in this context, is the assessment of homogenisation algorithm
performance against a set of realistic synthetic worlds of station data
where the locations and size/shape of inhomogeneities are known a priori.
Crucially, these inhomogeneities are not known to those performing the
homogenisation, only to those performing the assessment. Assessing both
the ability of algorithms to find changepoints and their ability to
return the synthetic data to its clean form (prior to the addition of
inhomogeneities) has three main purposes:

1) quantification of uncertainty remaining in the data due to inhomogeneity;
2) inter-comparison of climate data products in terms of fitness for a specified purpose;
3) providing a tool for further improvement in homogenisation algorithms.

Here we describe what we believe would be a good approach to a
comprehensive homogenisation algorithm benchmarking system. This
includes an overarching cycle of: benchmark development; release of
formal benchmarks; assessment of the homogenised benchmarks; and a
review of where we can improve next time around (Figure 1).

Creation of realistic clean synthetic station data

Firstly, we must be able to synthetically recreate the 30000+ ISTI
stations such that they have the same variability, autocorrelation and
inter-station cross-correlations as the real data but are free from
systematic error. In other words, they must contain a realistic seasonal
cycle and features of natural variability (e.g., ENSO, volcanic
eruptions). There must be realistic month-to-month persistence at each
station and realistic coherence geographically across nearby stations.
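
To make these requirements concrete, here is a minimal Python sketch of one way to generate such a toy network; it is not the project's actual generator. It combines a common seasonal cycle, AR(1) month-to-month persistence and an exponential distance-decay model of inter-station correlation; the station coordinates, AR coefficient, noise scale and amplitudes are all assumed values chosen for illustration.

```python
import numpy as np

def synth_network(n_stations=5, n_years=50, decorr_scale=200.0, seed=0):
    """Toy clean-world generator: seasonal cycle + AR(1) persistence
    + distance-decaying inter-station correlation (all assumed)."""
    rng = np.random.default_rng(seed)
    n_months = 12 * n_years

    # Hypothetical station coordinates (km) on a local plane.
    xy = rng.uniform(0.0, 500.0, size=(n_stations, 2))
    dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    corr = np.exp(-dist / decorr_scale)   # exponential distance-decay correlation
    chol = np.linalg.cholesky(corr)       # imposes the cross-station structure

    phi, sigma = 0.6, 1.0                 # assumed persistence and noise scale
    anoms = np.zeros((n_stations, n_months))
    for t in range(1, n_months):
        # Spatially correlated innovations driving an AR(1) process.
        shock = chol @ rng.normal(0.0, sigma, n_stations)
        anoms[:, t] = phi * anoms[:, t - 1] + shock

    # Common seasonal cycle; the 10 degC amplitude is illustrative.
    month = np.arange(n_months) % 12
    seasonal = 10.0 * np.cos(2.0 * np.pi * (month - 6) / 12.0)
    return anoms + seasonal
```

Scaling this idea to 30000+ stations with realistic spectra is, of course, the hard part: a full generator would fit these statistical properties to the real ISTI holdings rather than assume them.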

Creation of realistic error models to add to the clean station data

The added inhomogeneities should cover all known types of inhomogeneity
in terms of their frequency, magnitude and seasonal behaviour. For
example, inhomogeneities could be any, or a combination, of the
following (a sketch of injecting such errors is given after the list):

- geographically or temporally clustered due to events which affect entire networks or regions (e.g. a change in observation time);
- close to the end points of a time series;
- gradual or sudden;
- variance-altering;
- combined with the presence of a long-term background trend;
- small or large;
- frequent;
- seasonally or diurnally varying.
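
As a minimal sketch of how such errors might be injected, assuming numpy and purely illustrative magnitudes, timings and frequencies (the real error models would be informed by collected metadata, not these numbers), the following corrupts a clean monthly series with a sudden step, a gradual drift and a seasonally varying bias, retaining the ground truth for the assessors:

```python
import numpy as np

def add_inhomogeneities(clean, rng=None):
    """Inject illustrative inhomogeneities into a clean monthly series
    (assumes a multi-decade series, i.e. at least ~240 months)."""
    rng = rng or np.random.default_rng(1)
    n = clean.size
    corrupted = clean.copy()
    truth = []  # ground truth, known only to the assessors

    # Sudden changepoint: a step of random sign and magnitude.
    cp = int(rng.integers(n // 4, 3 * n // 4))
    step = rng.choice([-1.0, 1.0]) * rng.uniform(0.3, 2.0)
    corrupted[cp:] += step
    truth.append(("step", cp, step))

    # Gradual inhomogeneity: a slow drift over the final decade
    # (e.g. vegetation growing up around the screen).
    drift_start = n - 120
    corrupted[drift_start:] += np.linspace(0.0, 0.5, n - drift_start)
    truth.append(("drift", drift_start, 0.5))

    # Seasonally varying bias after the changepoint.
    month = np.arange(n) % 12
    seasonal_bias = 0.2 * np.sin(2.0 * np.pi * month / 12.0)
    corrupted[cp:] += seasonal_bias[cp:]
    truth.append(("seasonal", cp, 0.2))

    return corrupted, truth
```

Homogenisation teams would receive only `corrupted`; the `truth` list stays on the assessment side, preserving the blind nature of the benchmark.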

Design of an assessment system

Assessment of the homogenised benchmarks should be designed with the
three purposes of benchmarking in mind. Both the ability to correctly
locate changepoints and the ability to adjust the data back to its
homogeneous state are important. Assessment can be split into four
levels (a sketch of Level 1 and Level 2 scoring follows the list):

- Level 1: the ability of the algorithm to restore an inhomogeneous world to its clean-world state in terms of climatology, variance and trends.
- Level 2: the ability of the algorithm to accurately locate changepoints and detect their size/shape.
- Level 3: the strengths and weaknesses of an algorithm against specific types of inhomogeneity and observing-system issues.
- Level 4: a comparison of the benchmarks with the real world in terms of detected inhomogeneity, both to measure algorithm performance in the real world and to enable future improvement of the benchmarks.
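
As a sketch of what Level 1 and Level 2 scoring might look like, under two conventions that are our assumptions rather than the working group's specification (RMSE against the clean series for Level 1; changepoint hits counted within a +/- 12-month matching window for Level 2):

```python
import numpy as np

def assess(clean, homogenised, true_cps, found_cps, window=12):
    """Illustrative Level 1/2 scores for one station series."""
    # Level 1: how close is the adjusted series to the clean truth?
    rmse = float(np.sqrt(np.mean((homogenised - clean) ** 2)))

    # Level 2: count true changepoints matched by a detection within
    # the tolerance window; each detection may be used only once.
    unmatched = sorted(found_cps)
    hits = 0
    for cp in sorted(true_cps):
        match = next((f for f in unmatched if abs(f - cp) <= window), None)
        if match is not None:
            hits += 1
            unmatched.remove(match)
    hit_rate = hits / len(true_cps) if true_cps else 1.0
    false_alarms = len(unmatched)  # detections with no true break nearby

    return {"rmse": rmse, "hit_rate": hit_rate, "false_alarms": false_alarms}
```

Levels 3 and 4 would build on similar ingredients, for example by stratifying such scores by inhomogeneity type and by comparing the statistics of detected breaks against real-world expectations.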

The benchmark cycle

This should all take place within a clearly laid out framework, to
encourage groups to take part and to make the results as useful as
possible. Timing is important: too long a cycle and the benchmarks
become outdated; too short a cycle and fewer groups will be able to
participate.

Producing the clean synthetic station data on the global scale is a
complicated task that has taken several years; version 1 is now close
to completion. We have collected a list of known region-wide
inhomogeneities and a comprehensive understanding of the many different
types of inhomogeneity that can affect station data. We have also
considered a number of assessment options and decided to focus on
Levels 1 and 2 for assessment within the benchmark cycle. Our
benchmarking working group is aiming to release the first benchmarks by
January 2015.