I have just analyzed daily climatic station data from the publicly available Global Surface Summary of the Day (GSOD) dataset issued by the National Climatic Data Center. I compared it to the station data we purchased from our national institute, CHMI (which in fact owns the stations); in particular, monthly mean temperatures for March through June. I was surprised to find quite a big difference (TGSOD - TCHMI: min = -0.57°C, max = 0.17°C, RMSE = 0.24°C), considering these are monthly averages. Even worse, there is a systematic bias: the GSOD values are mostly smaller, and the difference grows with growing temperature:

Rounding errors are out of the question: the precision of the input data is tenths of a °F or °C, and rounding errors would also be much smaller and would not grow with increasing temperature.
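A quick back-of-the-envelope check (assuming the 0.1°F resolution stated in the GSOD format) makes the point:

```python
# Upper bound on a single rounding error for data stored to tenths of °F.
# A temperature *difference* converts from °F to °C by a factor of 5/9.
max_err_f = 0.05               # half of the 0.1 °F resolution
max_err_c = max_err_f * 5 / 9  # worst-case error per daily value, in °C
print(round(max_err_c, 3))     # → 0.028
```

Since independent daily errors partially cancel when averaged over ~30 days, the monthly error is even smaller, far below the observed 0.24°C RMSE.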

Interestingly, according to the GSOD docs, the mean daily temperature reported by GSOD is not the value reported by the station, but is computed from the individual observations:

Count 32-33 Int. Number of observations used in
calculating mean temperature

So (details of the comparison): I took the mean daily temperature from the GSOD data and kept only records where Count = 24 (obs./day). Then I averaged them across months and compared the result (only for months with data for all days) to the monthly means which we bought from CHMI.
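For concreteness, the filtering and monthly aggregation can be sketched like this (the column names and values are invented; real GSOD files are fixed-width text with TEMP in °F):

```python
import pandas as pd

# Hypothetical daily GSOD records already parsed into a DataFrame
gsod = pd.DataFrame({
    "date": pd.to_datetime(["2010-03-01", "2010-03-02", "2010-03-03"]),
    "temp_c": [4.1, 5.3, 3.8],
    "count": [24, 24, 23],   # observations behind each daily mean
})

# Keep only days whose mean was computed from all 24 observations
full = gsod[gsod["count"] == 24].copy()

# Average across months; a real comparison should additionally drop
# months in which any day was filtered out, as described above
full["month"] = full["date"].dt.to_period("M")
monthly = full.groupby("month")["temp_c"].mean()
print(monthly)
```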

According to the docs, since our timezone is UTC+1/+2, some observations can be grouped into a different day:

... these elements are derived from the stations' reports during the day,
and may comprise a 24-hour period which includes a portion of the
previous day.

but since the days are consecutive, this cannot explain the systematic bias in monthly means.

So, where is the problem? Is GSOD biased and why?

Could it be some methodological aspect? In fact, the difference is correlated not only with the temperature but also with the diurnal range (the difference between daily maximum and minimum temperature), which is itself correlated with temperature, so the causality could lie there.
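To quantify that, one can correlate the monthly bias with the mean diurnal range; a minimal sketch with invented numbers (substitute your own series):

```python
import numpy as np

# Hypothetical monthly means for Mar-Jun, °C (illustrative values only)
t_gsod = np.array([3.9, 8.8, 13.5, 17.0])
t_chmi = np.array([4.1, 9.1, 13.9, 17.5])
diurnal_range = np.array([6.0, 8.5, 10.0, 11.5])  # mean daily Tmax - Tmin

diff = t_gsod - t_chmi
# Pearson correlation of the bias with the diurnal range
r = np.corrcoef(diff, diurnal_range)[0, 1]
print(round(r, 3))
```

A strongly negative r here would support the hypothesis that the bias grows with the diurnal range rather than with temperature itself.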

1 Answer

Without all of the original data, and its metadata, any answer here can only offer a guide as to how to start answering your questions.

Your first question is: "where is the problem?"

Your second question is: "is GSOD biased?"

Both of these must start with further statistical analysis.

And you need to analyse the metadata for the datasets you are comparing. Go through the definitions side by side methodically, and look for overlaps and for differences.

Compare your available data at its most temporally disaggregated level, in more detail than what you've done so far: monthly means can hide so much of relevance. If possible, try to recreate the monthly means yourself from the individual readings; that can often highlight the issues that cause this sort of discrepancy, e.g. the way that missing data or outliers are handled.
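A minimal sketch (invented values) of why the aggregation convention matters when recreating monthly means yourself:

```python
import statistics

daily = [4.0, 5.0, None, 6.0, 7.0, None, 5.5]  # None = missing day

# Convention A: silently drop missing days and average the rest
mean_drop = statistics.mean(v for v in daily if v is not None)

# Convention B: require a complete month, otherwise report no value
mean_strict = (statistics.mean(daily)
               if all(v is not None for v in daily) else None)

print(mean_drop, mean_strict)
```

Two datasets applying different conventions to the same raw readings will disagree in exactly the kind of systematic, weather-dependent way described in the question.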

It would also be very helpful to set out your prior. And it would be helpful to analyse what's happening in the other months too.

As for using linear regression as an analytic tool in this case, do remember that it's a really blunt, unsophisticated tool; it will more often mislead than give useful information. Remember, courtesy of Andrew Gelman, the criteria for its applicability:

Validity. Most importantly, the data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient. . . .

Additivity and linearity. The most important mathematical assumption of the regression model is that its deterministic component is a linear function of the separate predictors . . .