Thursday, October 15, 2015

In June 2014, the first version of the global databank was released
(Rennie et al., 2014), which included
data from nearly 50 different sources and an algorithm to resolve duplicate
stations and piece together complete temperature time series. Since then, there
have been monthly updates, appending new data to existing stations. Thanks to
user feedback, along with additional analysis described below, minor changes were
made to the merge program to ensure the most accurate data were
incorporated in the final product. These changes, along with updates to current sources,
required a small change to the versioning system. The remainder of this post highlights the changes implemented in the global land surface
databank, version 1.1.0. More information about the structure of the databank,
including sources, formats, and merge algorithm, can be found on the databank
website (www.surfacetemperatures.org/databank).

Updates to Stage 1 and Stage 2 Data

The databank design includes six
data Stages, ranging from the original observation to the final
quality-controlled and bias-corrected products. For the purposes of this update, only
three stages were modified: digitized data (Stage One), data converted to a
common format (Stage Two), and the merged dataset (Stage Three).

The highest priority source is the
Global Historical Climatology Network – Daily (GHCN-D) dataset (Menne et al. 2012). In June 2015, GHCN-D
underwent a large update, which included a new average temperature element
(TAVG), along with the addition of 1,400 stations that are a part of the World
Meteorological Organization’s (WMO) Regional Basic Climatological Network (RBCN).
Because these stations are important for real-time updates, it was necessary to
include this new version in the latest merge.

Further assessment was also done on one of
our sources known as “russsource.” This source contained over 36,000 stations
reporting maximum and minimum temperature. While the original format was
consistent across all stations, it was discovered that this source included 27 individual
sources. It was decided to split these sources up and place them individually
in the merge following the source hierarchy defined by the databank working
group. Because of some duplication with sources used in GHCN-D, only 20 of the
27 sources were included. In addition, station IDs were brought into the Stage
Two data, so that the merge’s ID test could be implemented. The same was done
for the source known as “ghcnsource.”

Other than the above, no additional sources were added to the source hierarchy.
One source, however, was removed (crutem4). This source was used only as a last
resort, and because its data had been changed through bias corrections, its
stations were often added as unique stations rather than merged. Candidate stations from
crutem4 were matched with their respective target stations through the metadata
tests, but were judged unique by the data tests because of these
corrections. To avoid excessive station duplication, this source was
removed.

Changes to Merge Algorithm

The merge algorithm, as described by Rennie et al. (2014), underwent no code changes.
However, two thresholds were modified in order to maximize the amount
of data in the final recommended product. The thresholds are
defined in a configuration file that is required for the program to run
successfully.

The first step of the merge algorithm compares
the metadata of a target and a candidate station, including each
station’s latitude, longitude, elevation and name. A quasi-probabilistic
comparison is made and the result is a metadata metric between 0 and 1. In
version 1.0.0, this metric needed to pass a threshold of 0.50 for the pair to be
considered for merging. Analysis showed that too many stations were being
pulled through, forcing merges between stations that should not have been merged. As a
result, a stricter threshold of 0.75 was applied to avoid this issue.
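
To make the effect of the stricter threshold concrete, here is a minimal sketch of the pass/fail check only. It does not reproduce the quasi-probabilistic comparison itself. The function names, the example metrics, and the choice of an inclusive comparison are assumptions made for illustration; the actual merge program reads its thresholds from its configuration file.

```python
# Hypothetical constants: the values come from this post, the names do not.
METADATA_THRESHOLD_V100 = 0.50
METADATA_THRESHOLD_V110 = 0.75


def passes_metadata_test(metric: float, threshold: float) -> bool:
    """Return True if a target/candidate pair moves on to be considered for merging."""
    if not 0.0 <= metric <= 1.0:
        raise ValueError("metadata metric must lie between 0 and 1")
    # Assumption: the comparison is inclusive (metric >= threshold).
    return metric >= threshold


def surviving_pairs(metrics: dict, threshold: float) -> list:
    """List the pair names whose metadata metric passes the threshold."""
    return sorted(
        name for name, metric in metrics.items()
        if passes_metadata_test(metric, threshold)
    )


# Made-up metrics for three target/candidate pairs.
example_metrics = {"pair_a": 0.92, "pair_b": 0.61, "pair_c": 0.40}

pairs_v100 = surviving_pairs(example_metrics, METADATA_THRESHOLD_V100)
pairs_v110 = surviving_pairs(example_metrics, METADATA_THRESHOLD_V110)
```

With these made-up values, the borderline pair (0.61) is considered for merging under the v1.0.0 threshold but not under the v1.1.0 threshold.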

In addition, once a candidate station is
chosen to merge with a target station, it needs to fill in a gap of at least
60 months (5 years) in order to be added to the target station. It was
determined that this gap was too large, and target stations with short gaps in
their data were not being filled in by qualifying candidate stations. This gap
threshold has been reduced to 12 months as a result.
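
The gap rule can be sketched as follows, under one plausible reading: a candidate qualifies if it supplies at least the threshold number of months that the target lacks. Whether the merge program counts total new months or requires a single contiguous gap is not stated here, so treat that as an assumption. All names below are hypothetical.

```python
from typing import Set, Tuple

Month = Tuple[int, int]  # (year, month)


def month_range(start_year: int, end_year: int) -> Set[Month]:
    """All months from January of start_year through December of end_year."""
    return {(y, m) for y in range(start_year, end_year + 1) for m in range(1, 13)}


def months_contributed(target: Set[Month], candidate: Set[Month]) -> int:
    """Number of months the candidate has that the target is missing."""
    return len(candidate - target)


def fills_enough(target: Set[Month], candidate: Set[Month], min_new_months: int) -> bool:
    """Assumed rule: the candidate must add at least min_new_months of new data."""
    return months_contributed(target, candidate) >= min_new_months


# Made-up example: the target is missing 1960 and 1961 (24 months),
# and the candidate covers 1958 through 1963.
target = month_range(1950, 1959) | month_range(1962, 1970)
candidate = month_range(1958, 1963)

new_months = months_contributed(target, candidate)
accepted_v100 = fills_enough(target, candidate, 60)  # v1.0.0 threshold
accepted_v110 = fills_enough(target, candidate, 12)  # v1.1.0 threshold
```

In this example the candidate fills a two-year gap, which is too short to qualify under the 60-month threshold but does qualify under the 12-month threshold.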

Similar to version 1.0.0, all decisions made
were tested against an independent dataset generated from hourly data for US
stations available in the Integrated Surface Dataset (Smith et al. 2011). Results
show only a small change between the two versions.

Results

Version 1.1.0 of
the recommended merge contains 35,932 stations (Figure 1), nearly 4,000
more than v1.0.0 (32,142). Figure 2 shows that the added
stations are concentrated in the most recent period, with a relative increase of
about 10% in the number of stations since 1950. There is a drop
in coverage prior to 1950 with the new version. In the author’s
opinion, this drop results from removing crutem4 as one of the sources.
Including this source had made candidate stations unique, due to differences in
its data as a result of the data provider’s bias corrections. While the number
of stations is lower during this time period for v1.1.0, the number
of gridboxes used in analysis (Figure 3) is either equal to or
slightly higher than in v1.0.0.
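
To illustrate how the station count can fall while gridbox coverage holds steady, here is a small sketch that assigns stations to 5 degree gridboxes and counts the distinct boxes occupied. The function names and station coordinates are made up, and this simple count is an assumption: it is not necessarily how the coverage percentage in Figure 3 was computed.

```python
import math
from typing import Iterable, Set, Tuple


def gridbox_index(lat: float, lon: float, size: float = 5.0) -> Tuple[int, int]:
    """Map a latitude/longitude in degrees to a (row, column) gridbox index."""
    n_rows = int(180 / size)
    # Clamp so that lat = 90 falls in the top row instead of one row past it.
    row = min(int(math.floor((lat + 90.0) / size)), n_rows - 1)
    # Wrap longitude into [0, 360) measured from the -180 meridian.
    col = int(math.floor(((lon + 180.0) % 360.0) / size))
    return row, col


def occupied_gridboxes(stations: Iterable[Tuple[float, float]],
                       size: float = 5.0) -> Set[Tuple[int, int]]:
    """Set of gridboxes that contain at least one station."""
    return {gridbox_index(lat, lon, size) for lat, lon in stations}


# Made-up coordinates: the first two stations share a gridbox.
stations_before = [(51.5, -0.1), (52.2, -0.2), (40.7, -74.0)]
stations_after = [(51.5, -0.1), (40.7, -74.0)]

boxes_before = occupied_gridboxes(stations_before)
boxes_after = occupied_gridboxes(stations_after)
```

Dropping a station that shares a gridbox with another station lowers the station count but leaves the number of occupied gridboxes unchanged.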

Stage Three
normally includes a merge recommended and endorsed by ISTI, along with variants
showing the structural uncertainty of the algorithm. Due to time constraints,
these variants are not yet available, but will be provided at a later date.

Figure 1: Location of all stations in the recommended Stage Three component of the databank. The color corresponds to the number of years of data available for each station. Stations with longer periods of record mask stations with shorter periods of record when they are in approximately identical locations.

Figure 2: Station count of recommended merge v1.1.0 by year from 1850 to 2014, compared to version 1.0.0, along with GHCN-M version 3.

Figure 3: Percentage of global coverage with respect to 5 degree gridboxes for the recommended merge v1.1.0 by year from 1850 to 2014, compared to version 1.0.0, along with GHCN-M version 3.