USHCN "Raw" – A Small Puzzle

During the past few days, I’ve been assessing the GHCN-Daily dataset, which is very large, and I plan to do a number of posts on this topic, including a description of the data set. It turns out that literally hundreds of stations that expire around 1989 or 1990 in the NASA data set are alive and thriving in the GHCN-Daily parallel universe. More on this over the next few days.

Before I get to this, I’d like to document a small puzzle in connection with the calculation of the USHCN “raw” monthly average that arose out of inspection of the GHCN-D data, a puzzle that takes us back to the Detroit Lakes MN station, the investigation of which led to the identification of Hansen’s Y2K error.

I’m not saying that these small puzzles necessarily or even probably “matter” in terms of world averages, but they are relevant in terms of craftsmanship and I presume the following: if one is going to the trouble of making these large temperature collations, then the craftsmanship should be as good as possible. By commenting on issues pertaining to craftsmanship, I am not imputing malfeasance, as some perpetually agitated commenters allege. However, as far as I can tell, no one – journal peer reviewer, NASA peer reviewer (if such existed), rival climate scientist, skeptic – ever seems to have gone to the trouble of parsing through the actual craftsmanship of the large temperature calculations and I see no harm and some benefit in doing so. It looks like NASA is paying some attention and has already implemented a couple of recommendations made at CA. (I’ll have some similar suggestions in the near future.)

The GHCN-D dataset contains daily max and min information for nearly all the USHCN stations, as well as thousands of ROW stations. (I’ll discuss some discrepancies in the USHCN station lists on another occasion.) The GHCN-D data set is available in a huge zipped file, but is also available on a station-by-station basis. Most of the identification codes are different from those in GHCN-M (and thus GISS), but I’ve managed to create a concordance of over 3300 station identifications – and do not preclude the possibility of further gains. I’ve created time series of monthly means for these 3300 or so GHCN-D series. As a first cut, I simply took a monthly average of available values, without requiring a minimum number of values to constitute an average (which I usually do and would probably do if I re-run the results). I then calculated the monthly mean as the average of the mean monthly minimum and mean monthly maximum – in some cases, the two would be based on different numbers of measurements.
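For readers who want to try the same first cut, here is a minimal sketch of the averaging described above. It is not the script used for the figures in this post; the function name, the record layout and the optional `min_days` threshold are all assumptions made for illustration.

```python
from collections import defaultdict

def monthly_mean_from_daily(records, min_days=0):
    """Monthly mean as the average of mean monthly max and mean monthly min.

    records: iterable of (year, month, tmax, tmin); tmax or tmin may be None
    when that day's value is missing (hypothetical layout).
    min_days: minimum number of daily values required for each of max and
    min; 0 reproduces the "average whatever is available" first cut.
    """
    maxes = defaultdict(list)
    mins = defaultdict(list)
    for year, month, tmax, tmin in records:
        if tmax is not None:
            maxes[(year, month)].append(tmax)
        if tmin is not None:
            mins[(year, month)].append(tmin)

    result = {}
    # Only months with at least one max and one min can be averaged.
    for key in sorted(set(maxes) & set(mins)):
        if len(maxes[key]) < min_days or len(mins[key]) < min_days:
            continue
        mean_max = sum(maxes[key]) / len(maxes[key])
        mean_min = sum(mins[key]) / len(mins[key])
        result[key] = (mean_max + mean_min) / 2
    return result
```

Note that the max and min averages can rest on different numbers of days, exactly as in the first cut described above; setting `min_days` to something like 20 would drop thin months instead.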

The figure below shows the difference between the USHCN “raw” monthly mean and my calculation from daily information for a station (Kalispell MT) with an excellent match. There are rounding differences, but the two versions clearly reflect the same provenance. In this case, I presume that the small spike differences result from some procedural difference in calculation of monthly averages. While the differences appear attributable to rounding, the differences are not truly random: there are far more +0.1 differences than -0.1 differences, but this is unrelated to time.

Figure 1. USHCN Raw monthly (NOAA) minus monthly average calculated from GHCN-D. (I was at Kalispell airport once in the late 1970s and had an amusing experience there.)
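The asymmetry between +0.1 and -0.1 differences is easy to tabulate. The following is a rough sketch under the assumption that both monthly series are held as dictionaries keyed by (year, month); the names are hypothetical and nothing here reproduces NOAA's procedure.

```python
from collections import Counter

def tally_differences(ushcn, daily_based, ndigits=1):
    """Count how often each rounded difference (USHCN minus daily-based) occurs.

    ushcn, daily_based: dicts mapping (year, month) -> monthly mean.
    Only months present in both series are compared.
    """
    counts = Counter()
    for key in ushcn.keys() & daily_based.keys():
        diff = round(ushcn[key] - daily_based[key], ndigits)
        counts[diff] += 1
    return counts
```

Comparing `counts[0.1]` with `counts[-0.1]` gives the imbalance directly, and grouping the keys by decade first would show whether the imbalance changes over time.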

Next here is the same plot for Detroit Lakes MN, a station which had a puzzling jump around 2000 in the original NASA version (a jump that could be attributed in part to the Y2K error.) This particular error has now been patched by NASA. In this case, the tracking looks very similar to the Kalispell tracking from 1950 to about 1980. But in the late 1990s-2000s, the USHCN Raw version (and thus the downstream versions) jumps up relative to the average calculated from GHCN-D daily information. Why is this?

Figure 2. As Figure 1, but for Detroit Lakes MN.

I parsed through about 40 such plots, most of which were in between Kalispell and Detroit Lakes in appearance. But there were a couple of oddballs: here’s one. It looks like the USHCN Raw version must be spliced from two different GHCN-D stations, with values after 1980 or so from the present station and earlier values from some other station.

Here’s a station (Dillon MT) which has a somewhat similar appearance of being spliced – only this time, it looks like the USHCN station is drawn from the GHCN-D data set prior to 1980 and perhaps some other related source after 1980.
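A crude way to put a date on an apparent splice is to search the difference series for the split that gives the largest shift in mean. This is only a sketch of a simple mean-shift search, not the change-point method used by NOAA or anyone else, and the function name and `min_segment` parameter are assumptions.

```python
def largest_step(diffs, min_segment=12):
    """Find the split of a difference series with the largest shift in mean.

    diffs: list of monthly differences (USHCN minus daily-based) in time order.
    min_segment: minimum number of months required on each side of the split.
    Returns (index, step), where index is the first month of the later
    segment and step is mean(after) - mean(before); (None, 0.0) if the
    series is too short or perfectly flat.
    """
    best_index, best_step = None, 0.0
    n = len(diffs)
    for i in range(min_segment, n - min_segment + 1):
        before = sum(diffs[:i]) / i
        after = sum(diffs[i:]) / (n - i)
        step = after - before
        if abs(step) > abs(best_step):
            best_index, best_step = i, step
    return best_index, best_step
```

A large step near 1980 would be consistent with the splice interpretation, but it says nothing about which source was used on either side.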

The puzzle that needs to be resolved is the exact relationship between the USHCN “Raw” and GHCN-D data. If this can be sorted out, then NASA could make a substantial gain in the timeliness of their reporting.

GHCN-D versions of USHCN stations are current through early March 2008. Right now NASA’s USHCN data is only current to March 2006 – the date of the most recent GHCN update. Following a CA suggestion, NASA is moving to make its USHCN stations more current by adopting the USHCN (NOAA) source, which is more current than the versions at GHCN-M or CDIAC.

However, the GHCN-D data is truly current. NASA already uses “raw” USHCN data for its current results, using a patch to splice each station to the FILNET version used for historic values. If monthly averages calculated from GHCN-D data were used instead of GHCN-M data, then NASA could report USHCN stations right through to February 2008 (and keep current) instead of the current system of being up to two years out of date for USHCN stations. (A better system would be for NASA to write NOAA and ask them to update the USHCN data set on a monthly basis, which should be trivial to program and dispense with the patch altogether.)

Gaining two years in report timeliness for USHCN stations is a small thing but worth doing. In some forthcoming posts, I’ll discuss how NASA can gain nearly 20 years in reporting timeliness for many international stations.

In the past NASA has always posted a monthly update, like UAH, RSS and HADCRU. Is NASA going to avoid the monthly update? Shrugs. I hope not. I was counting on following this cold year month by month.

Climate Science is at much the same state as electronics were about a hundred years ago. Eventually there will probably be an organization similar to the IEEE (which was started by a couple of engineers in their basement) to establish what standards should be applied. We need a group of various scientists, held together by a few business managers, to start up an organization of volunteers to establish what the standards are, how they should be applied, and recommendations for assessing data. Their findings would probably be published on the web, and be downloadable for offline reference. They would not be interested in the data itself, but be more interested in methods of collection, archiving, and data analysis. Interpretation of the data would be left to others. This goes beyond simply auditing what occurs, though that is a valuable function of determining where standards should be applied.

A long while ago JerryB and I had a discussion about the rounding rules for daily and monthly values. It’s OT here, but I recall noting that the rounding rules would appear to bias things upwards. The trend most likely is not going to be impacted by this, since it’s a constant practice over time. But I have noted in the past that I get minor differences between daily and monthly.
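To illustrate how a rounding rule could bias things upwards, here is a small sketch. It assumes, purely for illustration, that max and min are recorded in whole degrees and that the daily mean is rounded half up to a whole degree; the rule actually applied to the station records is not established here.

```python
import math

def round_half_up(x):
    """Round to the nearest integer, with exact halves always going up."""
    return math.floor(x + 0.5)

def mean_rounding_bias(pairs):
    """Average of (rounded daily mean - exact daily mean) over (tmax, tmin) pairs."""
    total = 0.0
    for tmax, tmin in pairs:
        exact = (tmax + tmin) / 2
        total += round_half_up(exact) - exact
    return total / len(pairs)
```

Under that assumption every day with an odd max-plus-min sum is pushed up by half a degree, so a run of days split evenly between odd and even sums carries an average bias of a quarter of a degree. As the comment says, a constant bias of this kind shifts the level but not the trend.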

It looks like NASA is paying some attention and has already implemented a couple of recommendations made at CA.

IMHO, it constitutes professional plagiarism for NASA to implement Steve’s suggestions without some form of recognition. This doesn’t even have to be on the NASA webpage, but it wouldn’t kill Gavin or Reto to post a comment on CA to the effect that even though they usually disagree with most of what Steve says, he had a good point with regard to such-and-such, that procedures have been / will be revised accordingly, and thanks for the suggestion(s).

It turns out that literally hundreds of stations that expire around 1989 or 1990 in the NASA data set are alive and thriving in the GHCN-Daily parallel universe.

Even more curious are the “zombie” stations that are dead and buried, yet still continue to crank out adjusted data.

A case in point is Delaware OH, which hasn’t had a daily observation since 1/01 and was officially closed in 5/03. However, this doesn’t stop CDIAC from continuing to provide annual average readings with “final” adjustments through at least 2005!

Being from Orange County I’ve investigated the Newport Harbor record before and stumbled on the discrepancy you note. According to the NCDC USHCN station history file, data from “Newport Harbor” prior to 1981 is actually from Avalon on Catalina Island. This seems very strange as the Newport Beach station had data available back to at least the 1930s.

I don’t see how you can splice together data from Avalon with Newport Beach.

1) Newport Beach and Avalon are at least 25 miles apart and separated by the ocean.
2) Newport Beach is on the south shore of mainland Orange County; Avalon is on the north shore of Catalina Island.
3) Newport Beach is on flatland with no significant hills or mountains for at least 10 miles; Avalon is in a sheltered cove at the base of a string of peaks reaching up to 2000 feet.

I’d sure like to know why Avalon was used prior to 1980 when data from the Newport Beach site was available.

Actually my question was more about the source of data (measurements) from the stations that are effectively dead.😉 Some mystical currents from the defunct line to the formerly active MMTS? Teleconnection? New station under the same codename?

Steve “the differences are not truly random: there are far more +0.1 differences than -0.1 differences”
Don’t you think that this is interesting, bearing in mind that the odd +0.1C here and there could make a significant difference to the +0.7 or so “increase” over the last century?

Maybe a mere coincidence but in the Newport Beach graph, it’s like they try to cool the 1900-1945 warming and warm the 1945-1977 cooling. A sort of hockey-sticking of modern temperatures.
Steve, is it possible to have statistics for those discrepancies (% of uniform, upward, downward differences)?

This page seems to indicate that an urban adjustment is made to the USHCN data. Would any GISS urban adjustments be in addition to this? When did NOAA start this adjustment?

It has been my impression that GISS uses the USHCN FILNET version (before the urban adjustment) and does its own urban adjustment using the Hansen satellite lighted index. Look for the Karl references at the USHCN site to determine when and how USHCN makes their urban adjustments.

One last point about the Newport Beach Harbor record, the Avalon/Newport Beach splice makes no sense from a UHI perspective.

Due to Avalon’s island location, the only population in a 25 mile radius is that on the island – about 3600 people. Newport Beach on the other hand, has most of Orange County and a significant portion of Los Angeles County within 25 miles – on the order of 3 million people.

So pre-1981 data is rural and post-1981 data is about as urban as it can get, despite the fact that the Newport Beach station had continuous data back to 1921. No worries, I’m sure it’s been adjusted out.

Re #8, to NASA’s credit, I have found, on digging deeper, that although the defunct station at Delaware OH is in GHCN (#42500332119) and GISS (#42572428004), GISS (unlike CDIAC) reports no monthly data for it after it stopped reporting altogether in 1/01.

There is still a problem that it is missing a lot of data after 1996, but at least there was some real data during that period.

Re: Newport Beach. This is a completely bogus placement for the temperature sensor. It’s just off the SE corner of a flat roof and no more than 3-4 above the thermals rising from that roof. A prevailing westerly wind (coming from the bay) has to traverse the width of the roof before reaching the sensor. An idiot couldn’t have done a worse job (or could it have been a genius with an agenda?)!

RE 27
According to this link station 332119 is still reporting. The inventory and history files were last modified in 2005. The data files have data through 2007. Could this be simply another case of GHCN and GISS inexplicably dropping the station in 2000? Is there some other reference to the station indicating it being closed?

By commenting on issues pertaining to craftsmanship, I am not imputing malfeasance, as some perpetually agitated commenters allege.

I don’t impute malfeasance to issues of craftsmanship in climate science when a perfectly clear case of incompetence coupled with overweening arrogance will produce the same result. It’s clear to me that one of the key historical purposes of climate science, as the final port of academia for people not bright enough for the hard sciences yet apparently overqualified for a career in the private sector, has not actually changed over time.

Lest you think that this is simply my own biased opinion (whose isn’t?) I watched Richard Lindzen make the same point in a television interview.

RE 27
According to this link station 332119 is still reporting. The inventory and history files were last modified in 2005. The data files have data through 2007. Could this be simply another case of GHCN and GISS inexplicably dropping the station in 2000? Is there some other reference to the station indicating it being closed?

According to MMS, it has been inactive since 1/01 and was officially closed 6/03. CDIAC’s USHCN monthly and daily data page at http://cdiac.ornl.gov/epubs/ndp/ushcn/newushcn.html has good daily data through 1996, then spotty data in 1997, getting worse in 1999 and 2000, finally ending 1/30/01. But this doesn’t stop it from having “monthly data” through 2005, when CDIAC’s coverage generally stops.

The other I thought was the raw GHCN plus USHCN corrected data available from GISS. However, when I create an average from the daily version and compare it to the GISS “raw” version, I do not get nearly the correlation you get. My calculated averages for the earlier years are higher – sometimes 0.5 C or more – than the GISS version. Thus, I think I am looking at a second dataset that is different from the one you are looking at.

Can you post a link to both Kalispell datasets that you are comparing?

I get mostly excellent matches. Where I seem to have discrepancies are the months with one or more missing daily values. That leads me to believe that NOAA is estimating the missing values before calculating an average for the month.
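One way to test that hunch is to separate the months with a full set of daily values from the months with gaps and compare the size of the discrepancies in each group. The sketch below assumes hypothetical dictionaries keyed by (year, month); it does not attempt to reproduce whatever estimation NOAA may apply.

```python
import calendar

def missing_days(year, month, observed_days):
    """Number of calendar days in the month with no daily value."""
    return calendar.monthrange(year, month)[1] - len(set(observed_days))

def split_by_completeness(diffs, observed):
    """Split absolute monthly differences into complete and incomplete months.

    diffs: {(year, month): monthly difference between the two versions}
    observed: {(year, month): list of day numbers that have a daily value}
    """
    complete, incomplete = [], []
    for (year, month), diff in diffs.items():
        days = observed.get((year, month), [])
        if missing_days(year, month, days) == 0:
            complete.append(abs(diff))
        else:
            incomplete.append(abs(diff))
    return complete, incomplete
```

If the discrepancies cluster in the incomplete group, that would support the idea that missing days are being estimated before the monthly average is taken.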

Do you have a pointer to a description of the file format? I don’t understand what each of the four rows per station per year means, and there are codes next to each monthly entry that I cannot find a definition for. I’ve poked through a number of files on the ftp site and cannot find an appropriate guide.

I’m happy to help build that flow chart, and in particular I am happy to try to duplicate the process that turns data in one file to data in another file. Actually, this is why I am making this comment. I want to duplicate the process of turning daily into monthly prior to GISS using the data, but I don’t know which monthly to use.

The four strings of data for each station (q1 … q4) in the USHCNv1 data set contain the following versions:

1. Raw: the data in this version have been through all quality control but have no data adjustments.
2. TOB: these data have also been subjected to the time-of-observation bias adjustment.
3. Adjusted: these data have been adjusted for the time-of-observation bias, MMTS bias, and station moves, etc.
4. Urban: these data have all adjustments including the urban heat (UHI) adjustments.

The q4 (UHI) adjustment for example contains the absurd manipulations to the NYC Central Park station referred to in an earlier CA posting and on other sites. It is yet to be explained.

The NCDC brief descriptions to the various USHCNv1 parameters can be found by following the links from the parent USHCNv1 site. Click on the various explanatory links to the left of the US map.

For USHCNv2, there is a hot link to USHCNv2 near the top of the page. In this version, q4, the UHI adjustment, has been eliminated. The explanation for this is that it is no longer necessary as urban effects are taken care of with a “change-point detection algorithm”. NCDC cites this procedure as a reason for not needing an urban adjustment, stating: “no specific urban correction is applied in HCN version 2 because the change-point detection algorithm effectively accounts for any ‘local’ trend at any individual station. In other words, the impact of urbanization and other changes in land use is likely small in HCN version 2.”

How this method accounts for UHI effects is beyond my comprehension as change-point analysis is basically a statistical procedure designed to locate the most likely inflection points in a continuous stream of data. Nevertheless, the NCDC descriptions proceed to describe this procedure by citing several papers where presumably these problems are addressed. To my inadequate brain it appears to be more bafflegab than sound analysis.

Finally, those wishing to download the latest USHCN version, try this USHCNv2 link.

The “GHCN Daily” file collection is of recent vintage, having first been
published in December 2006. Also, the temperature data that it contains
are TMAX/TMIN. It is too young to be the source for either GHCN V2, or
USHCN V1, monthly data, and could not be the source for stations for which
TMAX/TMIN data are not reported.

Various parts of the US NCDC route data to various collections such as
GHCN and USHCN. The USHCN data go through USHCN processing before they
become “official” and go to GHCN, or get published in the USHCN
directory, on the NCDC FTP server. Then GISS, or whoever wishes to do
so, e.g. CDIAC, can pick them up.

It would appear that in 1992 (H92) they selected the best 138 stations, and then later expanded that to 1062, changing the quality guidelines to gain spatial coverage.

“The selection of stations for inclusion
in H92 was performed with the following data quality issues in mind.

1. The degree to which each station maintained a constant
observation time for maximum and minimum temperatures,
excursions from a station’s predominant observing time of no
more than four years being desired.

2. At least 95% of a station’s pre-1951 data should be contained
in NCDC digital daily archives.

3. A station’s potential for heat island bias over time should be
low.

Since the release of H92, much more work has been conducted at NCDC
involving compilation and digitizing of daily data. However, to enable
the compilation of a database providing better spatial coverage of the
contiguous United States, the four station selection criteria listed
above were not strictly adhered to in the current version of the HCN/D
presented here.”

Moving from 138 stations to 1062 by “changing a few” quality rules seems to me to be a step change of sorts.

One criterion that interested me was the attempt to select stations that had minimal TOBS adjustments (not in magnitude but in number). I think it would be interesting from a purely theoretical standpoint to look at what trends and errors we get in the US record if we select homogeneous stations at the start, rather than trying to homogenize bad stations to good ones and pretending that we gain “area coverage” thereby. (I think Willis and I have beaten this dead horse and are ready to mummify it.)

When I look at the records for Kalispell, the “4. Urban: these data have all adjustments including the urban heat (UHI) adjustments.” records all have a small magnitude. Does this represent the amount of adjustment? It clearly does not represent the adjusted temperature.

The 128°F “outlier” from McConnellsville Lock OH back on Jan 2, 1900 would never have been a problem for contemporary weather plotters and analysts back then. They would have simply noted that the three nearest first order stations (Columbus OH, Parkersburg WV and Cincinnati OH) all reported max temperatures of 28°F for that date, and you can guess what they plotted for McConnellsville.

This sort of illustrates the value of what was lost when computers programmed without AI oversight replaced thinking humans. Of course computers can do a lot more and much more quickly than people, but the point is that before the computer age, with much less to do and more time to do it, weather observers on the whole were more precise and fussier about their observations then than they are today when the dependence is on machines and electronics. The McConnellsville record was clearly a typo rather than a faulty observation. Where and how it entered into the NCDC data base is a matter of speculation, but you can bet that no observer ever entered that number onto the original record.
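The check those analysts did by eye can be sketched in a few lines. This is only an illustration of a neighbor comparison, with a hypothetical function name and an arbitrary tolerance; it is not a description of any quality control NCDC actually runs.

```python
from statistics import median

def flag_against_neighbors(value, neighbor_values, tolerance=30.0):
    """Return True if value is further than tolerance from the neighbor median.

    value: the reading under suspicion (same units as the neighbors).
    neighbor_values: readings from nearby stations for the same date.
    tolerance: allowed departure from the neighbor median (arbitrary choice).
    """
    return abs(value - median(neighbor_values)) > tolerance
```

With the McConnellsville reading of 128 and three neighbors at 28, the departure is 100 degrees and the value is flagged, while a reading of 30 passes.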

If you are looking in hcn_doe_mean_data.Z , the fourth line
labeled 3C is a “confidence factor”, and is not related to
the urban adjustment. The USHCN urban adjusted mean temps
are in a different file: urban_mean_fahr.Z .

In your Kalispell plot, do both datasets you used to compare begin in 1896 or 1899? I ask because the monthly data I find begins in 1896, but the daily data found in http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/42500244558.dly begins in 1899. Right now I am assuming that you are ignoring the years 1896 through 1898, but it is hard to tell from the plot.
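For anyone checking the start year of a .dly file themselves, here is a rough parser. It assumes the fixed-width layout I understand the GHCN-Daily readme to describe (11-character station ID, 4-digit year, 2-digit month, 4-character element, then 31 groups of a 5-character value plus three flag characters, with -9999 for missing and temperatures in tenths of a degree C). Check this against the readme for the file you actually download; the function names are hypothetical.

```python
def parse_dly_line(line):
    """Parse one fixed-width GHCN-Daily record into a dict (assumed layout)."""
    values = []
    for day in range(31):
        start = 21 + day * 8          # each day: 5-char value + 3 flag chars
        raw = int(line[start:start + 5])
        values.append(None if raw == -9999 else raw / 10.0)
    return {
        "station": line[0:11],
        "year": int(line[11:15]),
        "month": int(line[15:17]),
        "element": line[17:21],
        "values": values,             # index 0 is day 1 of the month
    }

def first_year_with_data(lines, element="TMAX"):
    """Earliest year with at least one non-missing value for the element."""
    years = [
        rec["year"]
        for rec in map(parse_dly_line, lines)
        if rec["element"] == element
        and any(v is not None for v in rec["values"])
    ]
    return min(years) if years else None
```

Running `first_year_with_data` over the lines of the Kalispell file for both TMAX and TMIN would show whether the daily record really starts in 1899 or whether earlier years are present but entirely missing.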
