The USHCN Basketball Team

This is the climatological station of record for Odessa, Washington. It is at the residence of a COOP weather observer administered by NOAA. The photo was taken by surfacestations.org volunteer surveyor Bob Meyer.

In addition to the sensor's proximity to the house and to asphalt, both closer than the published NOAA standard of 100 feet, we have a basketball goal nearby. This is a first as far as I know. I don't know if any studies or standards exist that describe what effects, if any, having the MMTS sensor whacked by errant basketballs might have.

Still an interesting discussion at scienceblogs, see http://scienceblogs.com/deltoid/2007/07/cherry_picking_stationsorg.php
I am personally getting a slagging off, as apparently Eli Rabett is always right.
I cannot see how data from these poor sites can be valid at all, for either cooling or warming.
Is there any information on whether data from these poor sites was used as input to the models?

Fight global warming. Tell them to water the lawn. A lot.
I thought it funny that the Happy Camp Observer noted the impact of watering on Tmax.
perhaps we can demonstrate the importance of microsite issues by asking observers to water frequently, especially on hot days.

1. Anthony Watts is aiming at a complete CENSUS. Not a sample. This means he wants to document
all the sites. He is not selecting a sample. As documentation comes in, as sites are visited,
he posts them. He has highlighted the standard. As sites come in that do not meet the standard
he highlights them. The point of an audit is to find mistakes. Criticizing an audit because
it shows mistakes is rather dense.

2. If he WERE selecting a sample, the concerns I would have would be:

A. Volunteers will only visit sites close to where they live, and as such the sample would have an urban bias.
This concern is testable, as we can readily compare the urban/rural ratio in the sample to the
whole. We can also see the distance to which some have travelled: Happy Camp is a long
way from Anthony's home, as is Lodi.

B. Volunteers will only document impaired sites and will not document sites that meet compliance.
This concern has merit. I think most volunteers understand the importance of documenting good
sites, and as the census grows this becomes less of a concern. I am heartened by the number of
reports that show good sites, but this is subjective.

C. Geographical concentration. A valid concern. Addressable, but a concern.

3. Anthony's site selection criteria have been open and published. ANYONE can document a site.
If Rabett or others are concerned about cherry picking, they should PICK UP A CAMERA.
They should defend global warming by energizing their crowd to fight the denialists and
document the good sites. Finally, they have a double standard. Compare what Anthony has
done to some of the people they quote.

A. Parker's site selection criteria for his UHI study have not been explained. We have the
list of stations but no clue as to how they were selected.

B. Hansen's site selection criteria have not been explained. We have the list, but not the
method.

C. Jones's site selection criteria have not been explained. We have neither the list nor the criteria.

D. Peterson's site selection criteria have not been explained. We have the list but not the method.

Now, Parker selected 290 sites WORLDWIDE for his study. We have this list, but not the method of selection.
Peterson selected 289 sites in the US. We have the list, but not the method.

Now, the AGW crowd gave Parker and Peterson a PASS on random sampling. No question about the selection,
because they liked the answer. Anthony has a published method. We may quibble with it, but the approach is
clear: try to sample every site. People don't like what they see after 230 of 1221 sites have been surveyed.
So they suddenly become the statisticians they SHOULD HAVE BEEN, and highly selective sceptics.

At some of the other blogs, some authoritatively state that the "power of large numbers" increases the quality of data. The specific example they cite is stargazers estimating the brightness of stars, which attains an accuracy as good as or better than instrument readings.

They then state, (without explanation) that this applies to the surface site measurements and therefore all is well and fine.

Can anyone give an explanation of the power of large numbers and why (or why not) it is an effective tool for surface site temperature measurement accuracy?

Eli Rabbet is always making "droppings". You must understand this is an evolutionary advantage to leave your enemies' mouth full of Sh%%, when the enemy thought that he would get nice bunnie flesh. By the way, this evolutionary advantage has been recognised for amphibians, nematodes, birds, and all sorts of animals before mammals came about. Eli is just the last of a long line of ANIMALS that have used this SCAT (as in scatological) approach to defense. If you think that he has left a bad taste in your mouth…rinse twice and consult ClimateAudit for the proper way to prevent future Rabbet halitosis.
Do not worry about asking the Rabbet pertinent questions. Rabbets are ALWAYS on the run and couldn’t be bothered with actually answering a relevant question. If you are actually able to grab hold of a Rabbet, remember, he will do his best to put a massive amount of SH%% in your mouth, as he scampers away. This is what he does.

Re #8, I have done basic statistics so my answer should suffice for lower end answers

Paul, essentially the larger the sample, the more confidence you can have in the results obtained. However, if a large number of the samples are producing bad data, you will have less confidence in the results gained. You are better off with a smaller set of quality data than a larger set with crap data in this case. There are also ways of separating the crap data from the good data, but that is beyond me. Maybe someone else can elaborate on this.

#8 The long and short of it, as far as I am able to determine, is that they do not recognize the limits of significant figures. The best example of this is on RC, where lots of really competent bloggers (in their fields of expertise) talk about the ability to pull a low-frequency signal from a high level of noise. Their analogy fails because in their examples the signal (in their areas of expertise) is KNOWN, while in the case of temperature it is ASSUMED or CLAIMED to be known. It is actually UNKNOWN (we do not have credible measurements of temperature back to 10,000 BCE). They do not realize what the difference between unknown and known means with respect to significant figures, and they will fill gigs of hard disk space about how you are wrong, without ever considering that you are asking a pertinent, simple question.

The power of "large numbers" is much over-rated if you do not discuss significant figures, if you do not discuss assumptions, if you do not prepare your answers for an audit (hence CA… climate AUDIT).

a) If there is no systematic bias, many crude estimates (little precision between the estimates) will average to an accurate number. This has been labeled the "Wisdom of Crowds," following a famous publication by Sir Francis Galton in 1907 in the journal Nature. He showed that a crowd of (~800) people could visually assess the weight of an ox to within one pound.

However, this only works when there is no systematic bias. If UHI is a real effect, it would constitute a systematic bias, because the overwhelming number of sites would experience an increasing, not a decreasing, UHI (growing population, expansion of roads, air-conditioning, etc.). Similarly, changes in instrumentation as technology develops also cause systematic changes that will not necessarily cancel out of the record by averaging alone. This is an artifact of standardization, e.g., Stevenson screens transitioning to MMTS.

b) Another concern is that when the variance of the samples is high, your confidence in the average should go down. This is typically evaluated using Student's t-test.
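A quick simulation illustrates point (a) and its failure mode (all numbers here are invented for illustration): averaging many crude but unbiased estimates homes in on the truth, while a shared systematic bias survives any amount of averaging.

```python
import random

random.seed(42)

TRUE_TEMP = 15.0   # hypothetical "true" value being estimated
BIAS = 0.5         # a systematic offset, e.g. a warm microsite
N = 100_000

# Unbiased crude estimates: noisy but centered on the truth.
unbiased = [TRUE_TEMP + random.gauss(0, 2.0) for _ in range(N)]

# Biased estimates: the same noise, plus a constant offset.
biased = [TRUE_TEMP + BIAS + random.gauss(0, 2.0) for _ in range(N)]

mean_unbiased = sum(unbiased) / N
mean_biased = sum(biased) / N

print(f"unbiased mean: {mean_unbiased:.3f}")  # close to 15.0
print(f"biased mean:   {mean_biased:.3f}")    # close to 15.5, not 15.0
```

The noise averages away in both cases; the 0.5 offset never does.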

As the site surveyor I would like to add that the caretaker said that NOAA came out regularly to check the instruments. He sounded very sincere and dedicated so I suspect that he took reasonably good care of the station.

However, the most interesting thing about this station was the fact that it was not where the NOAA metadata website said that it was. The website indicated a recent move to what would have been a location with fewer nearby objects but apparently that move never happened. I drove around that area for an hour looking for the instruments but found nothing. I found the actual location almost by accident. The site is actually located close to where the records indicate was a previous location. The caretaker, however, said that there had been no moves.

Since NOAA came out regularly to check the instruments they must have known where the instruments actually were but for some reason neglected to correct the information on their website. Probably just a normal bureaucratic SNAFU but it could be indicative of a lack of thoroughness.

This is the second time that I found a station in what was listed as a previous location. I have the survey data from Davenport, WA that I have not yet uploaded because I have been unable to contact the listed caretaker. However, this time it may only be the result of multiple rounding/conversion errors.

As for the data generated by this station I used the office of the Washington State Climatologist whose website allows you to plot the average, maximum and minimum temperatures of most northwest stations. The OWSC website proudly proclaims:

Use the options below to analyze temperature and precipitation trends around the northwest using high quality data from the United States Historical Climatology Network (USHCN) (through 2005) and Canada’s Adjusted Historical Canadian Climate Data (AHCCD) (through 2004). These data sets have been adjusted for biases or inhomogeneities resulting from changes in the environment or operation of individual observing sites (e.g., urbanization, station moves, and instrument and time of observation changes).

I compared the OWSC plot of average temperature to a similar plot I made on the Idsos (CO2science.org) website which uses unadjusted data. Interestingly, the OWSC site shows a slight decline in temperature (-.005 degs C per year) while the Idsos website showed virtually no change (about -.00014 deg C per year) over the same time period (1915 to 2003).

So we may have a first here in Odessa – a site that showed a cooling trend after adjustments but not before.

Does that sensor have a hemi? Does NOAA have any integrity? Who but the most lackadaisical employee would ever site a sensor in such a location? Or are we to assume it used to be rural and became urbanized? This goes beyond negligence to malfeasance.

I will attempt to explain the thinking behind the “power of large numbers.” The assumption is that biases are random and will therefore cancel each other out. This report mentions this concept.

However, from the evidence I have seen the biases are not random. We are not seeing some stations with a warming bias and some with a cooling bias. It is hard to imagine what a site with a cooling bias might look like. Can you think of a site where they tore down a tennis court and put up a garden?

A report by Hale and co-authors claims that 95% of stations with land use/land cover changes had a warming bias. This certainly seems to be confirmed by the work Anthony Watts is doing.

When working with a set of observations, an example might help. Suppose we seek to establish the abundance, in a flask of water, of the element potassium, symbol K. Nothing tricky; an exercise geochemists do all the time. Potassium is so abundant that it does not strain the capabilities of the usual instruments that analyse for it.

The common instruments are usually calibrated by making a solution with a weighed amount of soluble potassium in a known volume of water. With a large enough volume, that vessel containing potassium can have measurement after measurement made on it. It can be measured on a daily basis for years, to give thousands of results. If one knows statistics, one examines the distribution of results about the mean or median or whatever is appropriate to determine the type of distribution being generated, such as normal, log-normal, binomial or a host of others. The correct mathematics are then selected and an estimate is made of the PRECISION of measurement of this machine with this pure synthetic solution.

After thousands of measurements, the addition of one new measurement will have a negligible effect on the mean. A by-product of this is that one can state the mean and the precision of its measurements to a certain number of significant figures, or assign a probability that a given portion of the readings (99%, 95%, 66%, or whatever) will fall within a stated interval of the mean.
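As a sketch of that point, here is a toy version of the calibration exercise (the concentration and noise level are made up): once thousands of readings are in hand, one more reading barely moves the mean, and the standard error of the mean shrinks roughly as 1/√n.

```python
import random
import statistics

random.seed(0)
true_k = 100.0  # hypothetical ppm of K in the calibration solution

readings = [true_k + random.gauss(0, 1.0) for _ in range(5000)]
mean_before = statistics.fmean(readings)

# One new reading, even a 5-sigma outlier, barely moves a mean
# built on thousands of prior readings.
readings.append(true_k + 5.0)
mean_after = statistics.fmean(readings)
shift = abs(mean_after - mean_before)
print(f"shift from one new reading: {shift:.5f}")

# Standard error of the mean shrinks like 1/sqrt(n).
se_100 = statistics.stdev(readings[:100]) / 100 ** 0.5
se_all = statistics.stdev(readings) / len(readings) ** 0.5
print(f"SE with 100 readings:  {se_100:.4f}")
print(f"SE with all readings:  {se_all:.4f}")
```

Note this only quantifies PRECISION; the biases discussed next are untouched by it.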

Suppose however that the instrument is not 100% specific for the measurement of potassium. Suppose we can show that if there is sodium, and/or magnesium and/or lithium in the solution, then the instrument will give a different reading. This is quite common in analytical chemistry laboratories. I used to own such a lab.

We now have a problem, because the flask that gives us potassium to test also has various amounts of sodium, magnesium, and lithium. If we do not correct for this complication, we introduce a BIAS into the ACCURACY. So our calibration solution for potassium needs to have measured amounts of Na, Mg and Li added to it to better mimic the test solution. How much of each is to be added? One can estimate this, but then every new sample from a different source is able to have different ratios of these contaminants. After a while, we can take a guess and make a reasonable calibration solution that gives results for potassium that agree well with other methods of analysing it.

If we are still up to the armpits in alligators, we can make a whole series of solutions where the 4 substances are systematically varied and combined, so that multiple regression analysis can be used. (Heard that term in climate science?) There is still a problem, because the lack of precision in correcting for the spurious visitor elements adds to the original imprecision we calculated for pure K solutions. Every correction we make worsens the precision. Often, we do not know if the contaminants give a linear dose response, a negative or positive one, a log response, a step with a plateau, or any other shape that comes to mind.

We are not so much interested in precision (repeatability) as in accuracy (getting the unique correct answer) for our analysis of potassium. It is easy to believe, mistakenly, that the problem has been sorted, when along comes a paper saying that strontium affects the instrument too, but nobody had suspected it. Too late, we discover that we have been wrong all along, despite our sophisticated models, our multiple regressions, our extremely high precision on pure solutions and all of the rest of the scenario.

The moral of the story is that no matter how many times we analysed, we would have been wrong.

It does not matter if a weather station thermometer has a precision that gives the same answer to 0.01 of a degree in a flask of pure boiling water at a given pressure. That just shows it has high precision at those conditions. But if in use it sits next to hot car exhausts, there will be a bias and no amount of reconstruction will really properly account for that bias. The answer will be wrong.

That is why the documenting of these sites is so important. When the Future of Life as we Know It depends on bum equipment, it matters not how many readings are taken. The only ones that should be used are those that cannot be shown to be wrong to the best of scientific understanding at the time.

We are far from this happy state. Gross bias, in both directions, is high on the suspicion list. Science is being debased despite standards having been set. That is about the worst sin that a scientist can make.

The law of large numbers applies to the precision of a set of measurements of a "thing". The larger the number of measurements, the more precise the estimate. However, as Geoff's example illustrates, there are two embedded assumptions. First, the "thing" is a constant and definite "thing" and does not vary between measurements. I am still not sure whether the world's temperature anomaly actually meets this assumption in the way that multiple measurements of the temperature in my backyard do. Second, the measurement and the "thing" are in constant relation to each other; otherwise what you can treat as a large number of measurements of a single thing is much smaller than it appears. For example, if there are 1000 estimates of temperature using a standard instrument and we add another estimate, say from Farmer Jones's aching knees, do we have 1001 estimates of the temperature? Yes in a trivial sense, and definitely no in the sense that the 1001st estimate added precision. On the other hand, if the prior 1000 estimates had come from Farmer Jones's knees, then the law of large numbers argues that the 1001st would in fact increase the precision of Farmer Jones's estimate of temperature.

All this is independent of the question of the accuracy of the estimates. Under most conceivable conditions, Farmer Jones’ knees are less accurate measures of temperature than is a regular thermometer. Farmer Jones’ knees may in fact be a much more accurate proxy of humidity than they are of temperature, a bit like tree rings.

Anthony Watts et al.'s work on documenting the immediate physical environment of surface stations is pointing out that the conditions for the accurate measurement of temperature are not being adhered to, and that there is a strong possibility of non-random biases in the existing data sets. Under such conditions, a large number of estimates may increase the precision, but what they are actually measuring is more suspect.

Gradual changes in the immediate environment over time, such as vegetation growth, or encroachment by built features such as paths, roads, runways, fences, parking lots, and buildings into the vicinity of the instrument site typically lead to trends in the cooling ratio series. Distinct régime transitions can be caused by seemingly minor instrument relocations (such as from one side of the airport to another, or even within the same instrument enclosure) or due to vegetation clearance. This contradicts the view that only substantial station moves, involving significant changes in elevation and/or exposure, are detectable in temperature data (G91).

It is not surprising that small station moves, even without changes of elevation or exposure, are capable of introducing inhomogeneities into the record, because there are often several confounding changes occurring at the same time. For example, a station move often coincides with screens being repainted, cleaned, or replaced, new instruments installed, and observers being reinstructed about their practices. Further, it is common for the new instrument site to be without grass for a few years, and there are many indications of muddy conditions around the instruments until grass is both planted and properly maintained. These factors, combined with subtle changes in the immediate surroundings (such as moving away from a parking lot or building), appear to be a significant cause of inhomogeneities in temperature records. As isolated occurrences, activities such as painting, cleaning, or releveling screens or instruments do not frequently cause significant changes to cooling régimes.

The law of large numbers (LLN) is a theorem in probability that describes the long-term stability of a random variable. Given a sequence of independent and identically distributed random variables with a finite population mean and variance, the average of these observations will eventually approach and stay close to the population mean.

A) The "law" (not "power") of large numbers implies that the larger a sample grows, the closer its mean gets to the population mean. If the sample has a bias, then it will approach the biased value; i.e., the law of large numbers does not magically erase biases.

B) "Independent and identically distributed" is important, since the temperature measurements themselves are not the random variable. The RV is the measurement error, of which there are many sources. Again, a bias in each violates the iid assumption, which renders the whole concept of the LLN impossible to apply.
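Point B can be illustrated with a simulation (station counts, offsets, and noise are invented): give each station its own systematic offset drawn from a one-sided range, and the grand mean converges, very precisely, to the wrong number.

```python
import random

random.seed(3)

TRUE = 20.0
N_STATIONS = 500
READINGS_PER_STATION = 200

# Each station has its own systematic offset. If offsets were centered
# on zero they would cancel in the average; here they skew warm
# (e.g. encroaching asphalt), averaging about +0.5.
offsets = [random.uniform(0.0, 1.0) for _ in range(N_STATIONS)]

grand = []
for off in offsets:
    grand.extend(TRUE + off + random.gauss(0, 0.5)
                 for _ in range(READINGS_PER_STATION))

grand_mean = sum(grand) / len(grand)
print(f"grand mean: {grand_mean:.3f}")  # near 20.5, not 20.0
```

With 100,000 readings the estimate is extremely precise, and still half a degree wrong.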

The statistical "experts" who play on the alarmist side and invoke this as some magical way to remove bias and error are neither statistical nor expert. It is actually rather humorous, since most probably don't even have a single undergraduate class in statistics, and only serve as sounding boards for something someone else said. Silly, at best.

There's a lot of power in the value of repeated measurements, and many aspects of sampling theory and practice in applied fields like polling, market research, and atomic physics are based on it. Of course, if there is a systematic bias, then repeated measurements will just approach (true + bias). But it is a powerful force, and it especially does away with the worries about measurement unit resolution.

Geoff, As an analytical chemist who did elemental analysis by atomic spectroscopy I appreciate the example you use. There is an additional source of error in long term measurements, drift or 1/f noise. Every time you open your bottle of test solution, a little bit evaporates and the concentration goes up. Also, the analyte may show an increase or decrease due to interaction with the container. Potassium can be leached from the wall of a glass bottle over time. If you use a plastic bottle, water vapor can diffuse through the wall of the bottle. None of these errors are eliminated by averaging.

Of course, if there is a systematic bias, then repeated measurements will just approach (true + bias). But it is a powerful force and especially does away with the worries about measurement unit resolution.

Exactly. Those who are trying to use the LLN to remove biases don't understand exactly what the law does. If they were attempting to increase the accuracy of a single station by placing 100 different measuring units in the same area, then they could use the LLN to remove the random measurement error (assuming iid), but not any systematic bias.

From the temperature measuring site pictures, I obtain the view that we are not really considering a simple measuring error, amenable to greater sampling, and a similar bias in all measurements. What we have potentially are many sites with varying degrees of bias without knowing how varying or in what direction. This means that more measurements are not necessarily better.

Under the assumption that all these sites (or at least nearly all) are compliant with the stated process and specifications, one could use more sampling to reduce sampling errors and, of course, eliminate the uncertainty of bias errors. Correction procedures are in place in attempts to reduce biases that can be detected when they occur over very short periods of time, but it is not clear how well they would work where many sites are out of compliance, and particularly when changes occur over long time periods. I would guess that the assumption of nearly complete compliance is part and parcel of the calculation of uncertainties, and is why the specifications are developed and assumed to be adhered to.

A further concern that I have is in not knowing how the (lack of) coverage uncertainty would be handled/calculated if the assumption of compliance, or a proper and complete correction for it, in almost all sites was not found to be the case.

A poster recently pointed to the obvious truism that one cannot test quality into a product in reference to making adjustments to the data product, but I think in these cases, we are not at all certain how much quality is in the product or how much of the uncertainty calculations depend on an assumption of that quality.

I do not understand this comment from above: “But it is a powerful force and especially does away with the worries about measurement unit resolution.”

I think a ruler marked with a resolution of one meter cannot be used to obtain an accurate measurement of an object of length one micron no matter how large the number of attempts. In this case “accurate” meaning that the difference between the actual length of the object and the mean of the reported measurements is a small fraction of the actual length of the object.
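This intuition is worth testing. In signal processing, averaging beats instrument resolution only when noise "dithers" the readings across quantization boundaries; with a noiseless instrument, every reading lands in the same bin and averaging recovers nothing. A sketch (resolution, noise level, and object size are all arbitrary):

```python
import random

random.seed(5)

def read_with_resolution(true_value, step, noise_sd, n):
    """Average n readings quantized to `step`, with Gaussian noise
    added before quantization."""
    total = 0.0
    for _ in range(n):
        noisy = true_value + random.gauss(0, noise_sd)
        total += round(noisy / step) * step
    return total / n

TRUE = 0.3  # object length in metres, measured with a 1 m resolution ruler

# With noise comparable to the step, averaging recovers sub-step detail.
with_noise = read_with_resolution(TRUE, step=1.0, noise_sd=0.5, n=100_000)

# With no noise, every reading quantizes the same way; averaging is useless.
no_noise = read_with_resolution(TRUE, step=1.0, noise_sd=0.0, n=100_000)

print(f"with dither: {with_noise:.3f}")  # near 0.3
print(f"no dither:   {no_noise:.3f}")    # exactly 0.0
```

So the one-micron object measured with a meter-stick is a fair objection: with no noise straddling the markings, no number of attempts helps.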

People are pretty crappy at guessing the temperature. They are an instrument, but one with
a pretty big error. Still, as our friends tell us, the LLN fixes everything.
And if 800 people can guess the weight of an ox to within one pound, then imagine what a million
could do guessing temperature?

So, I suppose, using the LLN, if 1 million people tell me it was warmer 25 years ago than today… then
even though they are imprecise instruments, the result is magically fixed by the LLN.

You are quite correct. The quote to which you refer confuses accuracy with precision. There are many forms of "measurement bias" or "inaccuracy" that are unaffected by the number of observations made. As a simple example with your meter-long ruler: if the operator chose one a foot long by mistake, all readings would have gross errors.

Somewhat difficult to tell from the photo, but with the chain link fences in the front yards and no trees, that looks like an, um, “interesting” neighborhood. I’d worry about theft of the unit! Or perhaps, worry about it getting tagged by graffiti!

Mark T (#26) points out that the LLN is a theorem of probability. Basically, the more samples I have of an independent and identically distributed random variable, the more confidence I have that the observed mean approaches the expected mean, plus or minus some guardband.

The simplest example is that of a coin flip. If I flip a fair coin 1000 times I would expect to see the percentage of heads approach 50%, but I would be quite surprised if it were exactly 50%.

The question is, how large is large? If I move to a more complex variable, such as a six-sided die, how many rolls of the die do I need to have confidence that the observed mean is within some guardband of the expected mean of 3.5? If I make it two or more dice, how many additional rolls are required to converge on the expected mean, again within a specific guardband?

The number of dice rolls required for convergence does not change for loaded (biased) dice. If a die is loaded to favor six, for example, the expected mean might be 4.3, but 1000 or so rolls are still required to have confidence (guardbanded) in the observed mean.
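That claim about loaded dice is easy to check in simulation (the weighting is chosen arbitrarily): the sample mean converges just as readily for a loaded die, but to the biased expectation rather than 3.5.

```python
import random

random.seed(11)

def roll_mean(weights, n):
    """Mean of n rolls of a die with the given face weights (faces 1..6)."""
    faces = [1, 2, 3, 4, 5, 6]
    rolls = random.choices(faces, weights=weights, k=n)
    return sum(rolls) / n

N = 200_000
fair = roll_mean([1, 1, 1, 1, 1, 1], N)    # expected mean 3.5
loaded = roll_mean([1, 1, 1, 1, 1, 3], N)  # six is 3x as likely

# Expected mean of the loaded die: (1+2+3+4+5+3*6)/8 = 4.125
print(f"fair die:   {fair:.3f}")
print(f"loaded die: {loaded:.3f}")
```

The convergence behaviour is identical; only the value converged to differs. That is the whole problem with invoking the LLN against site biases.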

When applying the LLN to temperature measurements, we need to think of the problem in many, many more dimensions. Here are some I can think of off the top of my head:

1) How many measurements of a specific thermometer by a specific observer are required to have confidence in the (guardbanded) accuracy of the observer’s reading of the thermometer?

2) How many measurements of temperature are needed in a single day to have confidence in the mean temperature (for example, a cold February day where the temperature jumps for a two hour period would register an artificially warm average if only daily min/max are used).

3) How many measurements are required in a specific grid cell to have confidence in the cell’s average?

One might argue that millions of thermometer readings are available, but one must also recognize that those readings have space and time dimensions to them as well. If I want to compare the temperature of 1999 with 1899, are 365 daily averages from each year "large enough"? Are they independent enough (remember, this is a probability theorem requiring variable independence)? Is the number of stations sampling the temperature on a given day in a given cell large enough? Etc.