2. get ISTI data

I added code so that ccc-gistemp “knows” how to download the ISTI data:

tool/fetch.py isti

This creates the file input/isti.merged.inv and input/isti.merged.dat.

3. QC the ISTI data

The ISTI data comes from a variety of sources, some more raw than others. Lots of the data is not Quality Controlled (QC). It’s a good idea to QC it, and I suggest MADQC, see What Good is MADQC for more.

../madqc/mad.py --progress input/isti.merged.dat

This creates the file input/isti.merged.qc.dat; you also need to copy the .inv file:

cp input/isti.merged.inv input/isti.merged.qc.inv

4. run ccc-gistemp

It is a lot slower than the usual run because there are far more stations in ISTI than GHCN-M. It took about 50 minutes on my late 2014 ultrabook class laptop (about the same as a low- to mid-range cloud instance).

The data_sources parameter tells ccc-gistemp to use the ISTI data; the element parameter tells ccc-gistemp to analyse the TAVG element (monthly mean). This latter parameter was added specifically for ISTI because the ISTI files have TMIN, TMAX, and TAVG all in the same file.

The ISTI data does not have the metadata that GHCN-M comes with, in particular, no urban indicators. ccc-gistemp’s Step 2 modifies urban stations, we have to skip this step, hence the -s 0-1,3-5 option.

5. view results

The results are available in fixed format text files in the result/ directory.

The land-only result is available in result/landGLB.Ts.GHCN.CL.PA.txt

If the land and ocean data sources finish on the same month then there will be a file result/mixedGLB.Ts.ERSST.GHCN.CL.PA.txt containing the merged land–ocean dataset.

If the land and ocean data sources do not finish on the same month then the blended land–ocean dataset makes no sense for the most recent months (which will have either no ocean contribution, or no land contribution); the results are produced but placed in the file tainted*; best to use the land-only result.

Do you prefer CSV files? You can run

tool/gistemp2csv.py result/*.txt

to convert the fixed format text files to CSV files (also written to the result/ directory).

Well, this being a blog where I mostly write about ccc-gistemp which computes global temperature anomaly, maybe you can guess.

Preliminary analysis shows that 2014 was the hottest year on record.

Of course there are caveats and qualifications. ccc-gistemp is just one analysis, and while it mimics NASA GISTEMP very closely, rounding and other sources of computational ambiguity sometimes result in a slightly different result. The input to this analysis is GHCN and other surface data published yesterday. It’s not unusual for these datasets to be update again later in the month as more observations reach the data centres.

The biggest caveat of course is that the error estimates for the anomaly in recent decades are about 0.1°C. Meaning that’s it’s extremely unwise to talk about 2014, anomaly +0.67, beating 2010, anomaly +0.66. That doesn’t stop everyone else doing it of course.

For 2014 both ccc-gistemp and GISTEMP report an anomaly of 0.62 °C for the June, July, August season (using a base period of 1951 to 1980). And normally we’re an exact match, occasionally the rounding shifts one way or the other and we can be 0.01 °C out. For 2013 we’re a whopping 0.03 °C different. Does anyone want to investigate?

When we’re using the ISTI dataset with ccc-gistemp, what advantage does it give us? The northern hemisphere is already well sampled, so it doesn’t give us much there. Does it do any better in the southern hemisphere?

This is a plot after figure 2 of Hansen and Lebedeff 1987. It shows, for each of the 80 boxes used in the analysis, the earliest year that has data:

A word of caution for those comparing this to the actual figure 2 of Hansen and Lebedeff 1987. Their figure shows the date “when continuous coverage began” for each box, which I take to mean the date when continuous reporting began for any station within the box. The plot I give above is of data used in an analysis, and a box will include data from stations outside of the box (as per the 1200 km rule in Hansen and Lebedeff 1987); it is also why my boxes are clamped at 1880.

The top figure in each box is the earliest year of data for the ISTI MADQC dataset; the bottom figure is the earliest year of data for the GHCN-M QCU dataset (years in brackets mean that the given year is the earliest year of continuous reporting, but there are earlier fragments). The little figure in the bottom right corner of each box is the box number, using the same convention as figure 2 of Hansen and Lebedeff 1987. A box is blue when ISTI has earlier data (hence, more), and is pink when GHCN-M has earlier data.

Where ISTI wins the most is box 65, where ISTI has extended the data period back from 1950 to 1887 and a little bit before (almost the full period for the analysis). There are only a handful of stations contributing to this box, so it’s just about sensible to plot them all on one plot. All the stations start reporting around 1950, except for one:

That station is MP00006199, Plaisance (now renamed Sir Seewoosagur Ramgoolam International Airport; the renaming of airports is a history-in-miniature of colonialisation: airports are built by the colonial powers, then renamed as the newly independent ex-colonies stamp their mark on them).

The equivalent plot for the ccc-gistemp analysis done with the GHCN-M QCU dataset has all of the stations (including 12961990000 Plaisance) starting in 1950 or later:

Meteorological stations don’t just spring up out of nowhere and start reporting all in the same year of 1951. The fact that many stations have records starting in 1951 is an artefact of the collection process: There were deliberate attempts to recover and digitise existing records from 1951 to 1980 so that they could be used for normals (Peterson and Vose, 1997). This is one of the key benefits of the ISTI dataset. By bringing together data from diverse sources, we find and can make use of longer records.

So it seems likely that some of these other stations contributing to box 65 could have more data coaxed out of them. For this box the station of most interest would be WMO station 61996 (listed as 15761996000 Ile Nouvelle-Amsterdam in GHCN-M v3, and FS000061996 Martin-de-Viviès in ISTI Stage 3) because this station is actually within the bounds of the box, whereas the others are not. Sadly, this is a remote island that didn’t have any settlement at all until 1949, so we’re not going to suddenly find significantly more data for this station.

Box 58 is a case where ISTI has more data, but it’s not in a period that connects to the data starting in 1935. Plotting the contributing stations we see that the period of continuous reporting hasn’t actually changed:

What’s changed is that we have two extra data fragments for a single station: FP000091938, Tahiti. Given the huge gaps between reporting periods, for an analysis like ccc-gistemp we would be better off just discarding those data. I’m sharpening my scalpels.

Perhaps it would be worth doing some data archaeology for Tahiti. Can it really be the case that reliable temperatures were only reported from 1935 onwards? The international airport opened in 1960, so the period of reporting isn’t directly related to the airport opening. Perhaps there’s a stack of yellowing paper forms in wherever the French keep their archives.

Box 40 is the only box in the northern hemisphere to have its period of data extended by ISTI. The, by now traditional, plot of contributing stations shows that this is due to 3 stations (I’ve truncated the plot at 1945 to avoid showing a large number of stations that began in 1950/1951):

The contributions of BPXLT466819, Tulagi, and KRXLT605164, Ocean Island, are most welcome. The record for the 3rd station, NRXLT092567, Nauru, is somewhat obscured, but when I plot that alone, we see that it’s just the sort of record that I’ve come to complain about:

A series of unrelated periods joined into a single record for no particularly good reason. The ISTI dataset is certainly good for studies of imhomogeneity because it seems to create lots of inhomogeneities to study. I think if we’re going to continue to use ISTI for ccc-gistemp, I’ll have to implement something like Rohde’s scalpel.

I previously blogged about running ccc-gistemp with the ISTI Stage 3 dataset, and I slipped into that blog post the fact that I had to QC (Quality Control) the data. The purpose of QC is to eliminate data that is obviously entirely spurious and not representative of the climate of the station in question.

Did this island in the middle of the pacific really experience sub-zero monthly means prior to 1942? Or is it simply that somewhere along the arduous journey from paper records to our digital data, some bogus data (from some other station?) crept in. It’s QCs job to eliminate data like that.

As mentioned in the earlier article, to do the QC I made a really simple QC routine which I called MADQC.

The pink in the plot above is MADQC working; it shows the data that MADQC eliminates (the ISTI record is plotted in pink, then on top I plot in blue the data after MADQC has been applied; thus the pink shows where MADQC removed data).

In a similar vein, no one really believes that station FG000081405 (listed as Rochambeau, but now known as Félix Éboué airport), experienced a monthly average well above 100°C for a month in 1903:

This is transcription error or something similar, and is eliminated by MADQC.

The next case is a little different. Station WAXLT556695, Walfisch Bay:

The record between 1941 and 1951 is obviously not consistent with the other two fragments, but does at least look like real temperature data. Except for being about 40 degrees warmer than it should be for this location. A little inspiration suggests that this period has been recorded in degrees Fahrenheit (further supported by the fact that the annual variation is higher in the period between 1941 and 1951). This is unfortunate, but it is potentially correctable. MADQC doesn’t care about correcting it and just eliminates that period entirely.

This station, WAXLT556695, also illustrates a different problem with these records (also present in GHCN-M in various forms): no one really believes that the fragment of data around 1890 is in any way connected to the fragment from the 1960’s and 1970’s, separated, as they are, by many decades of no data. ccc-gistemp doesn’t make any attempt to correct for large gaps, and in some sense would prefer to digest such data as two or more separate records (since it can make use of records as long as they overlap with stations with 1200km; they don’t have to have the common reference period that CAM requires). This is the inspiration behind Rohde’s scalpel (Rohde et al 2013). I haven’t investigated, but should the number of large gaps in the ISTI dataset prove problematic, I may have to get my own scalpels out.

The final station I show is a case where MADQC doesn’t help. Station AYM00089034:

This is listed in ISTI as Belgrano II, but the record is clearly a composite of records from nearby stations, only one of which is Belgrano II. There is an obvious inhomogeneity in both mean temperature and range. The merging algorithm that ISTI uses is entirely automatic, so it is inevitable that some mistakes are made. In terms of how ccc-gistemp could correct for this, maybe we need to consider merge history (it is currently ignored), use one of the alternative merge results, or use some more sophisticated form of homogenisation.

I didn’t select these stations at random, nor did I search through all the stations to find particular illustrative examples. They were all found because when I did my first analysis using the ISTI dataset there were significant discrepencies that were visible at the hemispheric level. Hunting down the discrepancies led me to these stations, and then led me to realise that I need to QC the data first. The usual data source for ccc-gistemp is GHCN-M and SCAR READER, both of which have at least been QC’d at source (and in addition to which ccc-gistemp normally uses the same stop list as NASA GISTEMP), so previous to using ISTI data, our own QC routine wasn’t necessary.

Don’t analyse your data without QC. If you’d like to try your own analysis, give MADQC a try.

I’ve recently modified ccc-gistemp so that it can use the dataset recently released by the International Surface Temperature Initiative. Normally ccc-gistemp uses GHCN-M, but the ISTI dataset is much larger. Since ISTI publish the Stage 3 dataset in the same format as GHCN-M v3 the required changes were relatively minor, and Climate Code Foundation appreciates the fact that ISTI is published in several formats, including GHCN-M v3.

The ISTI dataset is not quality controlled, so, after re-reading section 3.3 of Lawrimore et al 2011, I implemented an extremely simple quality control scheme, MADQC. In MADQC a data value is rejected if its distance from the median (for its station’s named month) exceeds 5 times the median absolute deviation (MAD, hence MADQC); any series with fewer than 20 values (for each named month) is rejected.

So far I’ve found MADQC to be reasonable at rejecting the grossest non climatic errors.

Let’s compare the ccc-gistemp analysis using the ISTI Stage 3 dataset versus using the GHCN-M QCU dataset. The analysis for each hemisphere:

Now we can see the agreement in the northern hemisphere is excellent. In the southern hemisphere agreement is very good. The trend is slightly higher for the ISTI dataset.

The additional data that ISTI has gathered is most welcome, and this analysis shows that the warming trend in both hemispheres was not due to choosing a particular set of stations for GHCN-M. The much more comprehensive station network of ISTI shows the same trends.

The original inspiration for Climate Code Foundation was ccc-gistemp. Our original pro-bono rewrite of NASA GISTEMP. Software that shows global historical temperature change.

We wanted the average person on the Clapham omnibus to be able to use it, inspect it, reason about it, and change it. ccc-gistemp successfully reproduces the NASA GISTEMP analysis (to within a few millikelvin for any particular monthly value) in a few thousand lines of Python code.

A few thousand lines is still a lot of code. There are still a few corners of ccc-gistemp that I haven’t fully looked into. Can we make something simpler and smaller that does more or less the same job? Obviously we can’t still expect to use exactly the same algorithm as NASA GISTEMP, and nor would we want to, because the exact details of which arctic locations use SST anomalies and which use LSAT anomalies are just not very important (for estimating global temperature change). It can be distracting to get bogged down in detail.

ZONTEM attempts to discard all constraints and make an analysis that is as simple as possible. The input data is monthly temperature records from GHCN-M. The Earth is divided into 20 latitudinal zones. The input records are distributed into the zones (by choosing the zone according to the latitude of the station). The records are then combined in two steps: first combining all the stations in a zone into a zonal record; then combining all zonal records into a single global record. The global record is converted into monthly anomalies, and then averaged into yearly anomalies. The zones are chosen to be equal area.

This is simpler in so many ways: only one source of input; no ad hoc fixing or rejection of records; no correction for UHI or other inhomogeneity; no use of Sea Surface Temperatures; only a single global result (no gridded output, no separate hemispherical series).

The result is about 600 lines of Python. This is split into 3 roughly equally sized pieces: ghcn.py, series.py, zontem.py. ghcn.py understands the GHCN-M v3 file format for data and metadata (this is vital, but scientifically uninteresting); series.py is borrowed from the ccc-gistemp project and consists of the detailed routines to combine monthly records and to convert to anomalies; zontem.py is the main driver algorithm, it allocates stations to zones, and picks a particular order to combine station records.

A good chunk of zontem.py is concerned with finding files and parsing command line arguments. The actual interesting bit, the core of the ZONTEM algorithm, is expressed as a very short Python function:

This is a useful 7 line summary of the algorithm even though it glosses over some essential details (how are stations split into zones? How are station records combined together?). The details are of course found in the remaining source code.

I like to think of ZONTEM as a napkin explanation of how a simple way to estimate global historical temperature change works. I can imagine describing the whole thing over a couple of pints in the pub. ZONTEM has probably been simplified to the point where it is no longer useful to Science. But that does not mean it is not useful for Science Communication. In just the same way we might use a simplified sketch of a cell or an atom to explain how a real cell or atom works, ZONTEM is a sketch of a program to explain how a real (“better”) analysis works.

If ZONTEM seems simplistic because it doesn’t increase coverage by using ocean data, well, that’s because GISTEMP and HadCRUT4 do that. If it seems simplistic because it doesn’t produce gridded maps at monthly resolution, well, that’s because Berkeley Earth (and the others) do that. Every way in which ZONTEM has been made simpler is probably a way in which NASA GISTEMP, or a similar analysis that already exists, gets a more accurate result.

For too long I have neglected ccc-gistemp (our clear rewrite of GISTEMP). For a while now it has not been possible to run it. The problems were mostly due to finding the right Sea Surface Temperature (SST) file. The old file was called SBBX.HadR2 (a combination of Hadley ISST and Reynold’s Optimal Interpolation v2). GISS withdrew this file in favour of SBBX.ERSST which is Smith et al’s 2008 Extended Reconstruction

In the final stages (Step 5) of ccc-gistemp, SSTs from the ocean file are combined with temperature anomalies from land-based meteorological stations to produce zonal means that are then averaged into hemispherical and global means. The choice of which dataset to use for SSTs is not completely straightforward, there are different groups with different ways to assimilate all the available observations. Hansen et al’s 2010 paper “Global Surface Temperature Change” does a good job of comparing some of the available options.

So now we’ve caught up with GISS and can do once again do an analysis of combined Land and Ocean Temperatures:

I’ve also moved the code to github, which is a much nicer place and you should move there too.

My home is flooded, for the second time in a month. The mighty Thames is reclaiming its flood-plains, and making humans – especially the UK government’s Environment Agency – look puny and irrelevant. As I wade to and fro, putting sandbags around the doors, carrying valuables upstairs, and adding bricks to the stacks that prop up the heirloom piano, I occasionally check the river level data at the Agency website, and try to estimate how high the water will rise, and when.

There are thousands of river monitoring stations across the UK, recording water levels every few minutes. The Agency publishes the resulting data on its website, in pages like this. For each station it shows a graph of the level over the last 24 hours (actually, the 24 hours up to the last reported data: my local station stopped reporting three days ago, presumably overwhelmed by the water), and has some running text giving the current level in metres above a local datum. There’s a small amount of station metadata, and that’s all. No older data, and no tabular data. I can’t:

See the levels over the course of a previous flood;

Measure how quickly the river typically rises, or how long it typically takes to go down;

Compare today’s flood to that four weeks ago (or those in 2011 or 2003);

Easily navigate to the data for neighbouring stations up and down river;

Get a chart showing the river level, or river level anomalies, along the length of the Thames;

Get a chart comparing that longitudinal view of the flood with the situation at any previous time;

Make a maps mash-up showing river level anomalies across the Thames catchment;

Make a crowd-sourced flooding community site combining river level data, maps, pictures, observations, and advice (“sandbags are now available at the village hall”)

Make a mash-up combining river level data with precipitation records;

Make a flood forecasting tool by combining historical river level, ground-water, and precipitation records with precipitation forecasts.

Most of these things (not the last!) would be a small matter of programming, if the data were available. The Thames Valley is teeming with programmers who would be interested in bashing together a quick web app; or taking part in a larger open-source project to deliver more detailed, more accessible, and more useful flood data. But if we want to do any of those things, we have to pay a license fee to access the data, and the license would apparently then require us to get pre-approval from the Environment Agency before releasing any “product”. All this for data which is gathered, curated, and managed by a part of the UK government, nominally for the benefit of all.

Admittedly I couldn’t do any of those things this week anyway – too many boxes to carry, too much furniture to prop up. But surely this is a prime example of the need for open data.

Last week I gave a short talk to the SoundSoftware Workshop 2013. SoundSoftware is a group of researchers in the field of music and acoustics, based at Queen Mary University in London, who promote the use of sustainable and reusable software and data. My talk was entitled “Ten reasons you must publish your code”, and SoundSoftware have now published the video. It was intended to stimulate debate, and the time was limited, so it’s long on polemic and short on evidence, although there is plenty of evidence out there to support almost everything I said. The short list of reasons is as follows:

Review: to improve your chances of passing review, publish your code;

Reproducibility: if you want others to be able to reproduce your results, publish your code;