Thursday, December 11, 2014

I'm doing an analysis of Diurnal Temperature Range (DTR; more on that when published). As part of this I played with a little toy box model, and the result is of sufficiently general interest to highlight here and maybe get some feedback.

So, for most stations in the databank we have data for maximum (Tx) and minimum (Tn) temperature, which we then average to get the mean (Tm). That is not the only transform possible: there is also DTR, which is Tx-Tn. Although that is not part of the databank archive, it's a trivial transform. In the results from running NCDC's pairwise algorithm, distinct differences in breakpoint detection efficacy and adjustment distribution arise, which have caused great author team angst.

This morning I constructed a simple toy box where I just played "what if". More precisely: what if I allowed seeded breaks in Tx and Tn within the bounds -5 to 5 and considered the break size effects in Tx, Tn, Tm and DTR:

The top two panels are hopefully pretty self-explanatory. Tm and DTR effects are orthogonal, which makes sense. In the lowest panel (note colours chosen from colorbrewer, but please advise if there are issues for colour-blind folks):
red: Break largest in Tx
blue: Break largest in Tn
purple: Break largest in DTR
green: Break largest in Tm (yes, there is precisely no green)
Cases with breaks equal in size have no colour (infinitesimally small lines along the diagonal and vertices at Tx and Tn = 0)

So …

If we just randomly seeded Tx and Tn breaks into the series in an entirely uncorrelated manner, then we would get 50% of breaks largest in DTR and 25% each in Tx and Tn. DTR should be broader in its overall distribution and Tm narrower, with Tx and Tn intermediate.

If we put in correlated Tx and Tn breaks such that they were always the same sign (but not magnitude), then they would always be largest in either Tx or Tn (or equal with Tm when Tx=Tn).

If we put in anti-correlated breaks then they would always be largest in DTR.
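The uncorrelated case can be checked with a quick Monte Carlo sketch. This is not the code behind the figure, just a minimal illustration: it assumes the Tx and Tn breaks are drawn independently and uniformly between -5 and 5 (the distribution is my assumption), and the function name `simulate_breaks` and its parameters are made up for this example.

```python
import random
import statistics


def simulate_breaks(n=100_000, bound=5.0, seed=0):
    """Seed n independent Tx/Tn break pairs and tally which element
    (Tx, Tn, Tm or DTR) shows the largest absolute break."""
    rng = random.Random(seed)
    counts = {"Tx": 0, "Tn": 0, "Tm": 0, "DTR": 0, "tie": 0}
    samples = {"Tx": [], "Tn": [], "Tm": [], "DTR": []}
    for _ in range(n):
        # Assumption: breaks are uniform and uncorrelated within the bounds.
        tx = rng.uniform(-bound, bound)
        tn = rng.uniform(-bound, bound)
        breaks = {
            "Tx": tx,
            "Tn": tn,
            "Tm": (tx + tn) / 2.0,  # break in the mean
            "DTR": tx - tn,         # break in the diurnal range
        }
        for name, value in breaks.items():
            samples[name].append(value)
        sizes = {name: abs(value) for name, value in breaks.items()}
        largest = max(sizes.values())
        winners = [name for name, size in sizes.items() if size == largest]
        # Exact ties (the uncoloured lines in the figure) are counted separately.
        counts[winners[0] if len(winners) == 1 else "tie"] += 1
    fractions = {name: count / n for name, count in counts.items()}
    spread = {name: statistics.pstdev(values) for name, values in samples.items()}
    return fractions, spread


fractions, spread = simulate_breaks()
print("fraction of breaks largest in each element:", fractions)
print("standard deviation of break size per element:", spread)
```

With a large enough sample the fractions should settle near one half for DTR and one quarter each for Tx and Tn, with none for Tm, and the spread should be narrowest for Tm and broadest for DTR. Swapping the second draw for something tied to the first (same sign or opposite sign) is an easy way to explore the correlated and anti-correlated cases.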

Perhaps most importantly, as alluded to above, breaks will only be equal largest for Tm in a very special set of cases where Tx break = Tn break. Breaks will, on average, be smallest in Tm. If breakpoint detection and adjustment is a signal-to-noise problem, it's not sensible to look where the signal is smallest. This has potentially serious implications for our ability to detect and adjust for breakpoints if we limit ourselves to Tm, and is why we should try to rescue Tx and Tn data for the large amount of early data for which we only have Tm in the archives.

Maybe in future we can consider this as an explicitly joint estimation problem of finding breaks in the two primary elements and two derived elements and then constructing physically consistent adjustment estimates from the element-wise CDFs. Okay, I know I'm losing you now, so I'll shut up ... for now ...

Tuesday, December 9, 2014

It has been nearly six months since we released the first version of the databank. While this was a big achievement for the International Surface Temperature Initiative, our work is not done. We have taken on many different tasks since the release, and a brief description of each is below:

Monthly Update System
As described in this post, we have implemented a monthly update system that appends near-real-time (NRT) data to the databank. On the 5th of each month, four sources (ghcnd, climat-ncdc, climat-uk, mcdw-unpublished) update their Stage 1 data, and on the 11th their common formatted data (Stage 2) are updated. In addition, an algorithm appends the new data to the recommended merge, which is also updated on the 11th.

Bug Fixes
Users have submitted some minor issues with version 1. Some stations in Serbia were given a country code of "RB" when they should have been given "RI." These have been addressed, and a new version of the databank (v1.0.1) was released.

There have been concerns about how the station name is displayed. Non-ASCII characters pose problems with some text interpreters. A module has been created in the Stage 1 to Stage 2 conversion scripts where these characters are either changed or removed to avoid this problem in the future.

Of course, issues could still exist. If you find any, please let us know! As an open and transparent initiative, we encourage constructive criticism and will apply any reasonable suggestions to future versions.

New Sources
We have acquired new sources that will be added as Stage 1 and Stage 2 data soon, including:

300 UK Stations from the Met Office

German data released by DWD

EPA's Oregon Crest to Coast Dataset

LCA&D: Latin American Climate Assessment and Dataset

Daily Chinese Data

NCAR Surface Libraries

Stations from Meteomet project

Libya Stations sent by their NMS

C3/EURO4M Stations

Additional Digitized Stations from the University of Giessen

Homogenized Iranian Data

It is not too late to submit new data. If you have a lead on sources please let us know at data.submission@surfacetemperatures.org. We will freeze the sources again on February 28th, 2015, in order to work on the next version of the merge.

Friday, December 5, 2014

NOAA's National Climatic Data Center have undertaken an inventory of their substantial basement holdings of hard copy data. These include a rich mix of data types on varied media including paper, fiche and microfilm.

One row of several dozen in the NCDC archive of hard copy paper holdings from around the world

Microfilm holdings arising from Europe over the second world war

Some, but far from all, of this data has been imaged and / or digitized. NCDC have now released the catalogue online and made it searchable. The catalogue interface can be found at https://www.ncdc.noaa.gov/cdo/f?p=222 (click on search records). The degree to which a given holding has been catalogued varies, but this is a good place to at least begin to ascertain what holdings exist and what their status is. For example, searching on American Samoa as the country provides a list of holdings, most of which are hard copy only.

Example search results for American Samoa

For those interested in aspects of data rescue, this is likely to be a useful tool to ascertain whether NCDC hold any relevant records. By reasonable estimates at least as much data exists in hard copy / imaged format as has been digitised for the pre-1950 period. That is a lot of unknown knowns and could provide such rich information to improve understanding ...