MoshTemp 102

In the last installment we started by downloading the v2.mean.Z file, unzipping it, and processing the raw data into an R object. The code for that is Moshtemp101.R, located here. Executing that code leaves us with an R object containing all the data of v2.mean: Data<-readV2Mean() gets us the Data object. At your console, execute the following command on Data:

> str(Data)
'data.frame': 597373 obs. of 15 variables:
 $ Id       : num  1.02e+10 1.02e+10 1.02e+10 1.02e+10 1.02e+10 ...
 $ Duplicate: int  0 0 0 0 0 0 0 0 0 0 ...
 $ Year     : int  1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 ...
 $ Jan      : int  NA 100 109 117 129 116 101 112 122 115 ...
 $ Feb      : int  NA 112 123 114 110 104 128 101 118 109 ...
 $ Mar      : int  NA 129 130 134 122 105 141 106 137 119 ...
 $ Apr      : int  NA 143 160 149 138 155 NA 136 134 136 ...
 $ May      : int  NA 181 181 182 165 183 160 183 181 167 ...
 $ Jun      : int  NA 193 207 194 211 202 204 216 216 192 ...
 $ Jul      : int  NA 239 247 223 232 238 220 242 NA 240 ...
 $ Aug      : int  NA 255 241 241 249 271 228 246 240 247 ...
 $ Sep      : int  NA 225 223 220 233 226 210 235 223 235 ...
 $ Oct      : int  NA 201 191 185 NA 186 199 187 170 NA ...
 $ Nov      : int  133 166 162 161 148 133 154 140 141 139 ...
 $ Dec      : int  110 107 128 108 114 112 119 121 NA 118 ...

The result, shown above, tells us that we have read the file in as a "data.frame", one of the workhorse data structures in R. The columns of Data all have names, and we can access the data in many ways. For example, we can look at the span of years in the data:

> max(Data$Year)
[1] 2010
> min(Data$Year)
[1] 1701

We can also access the data using a column number. The Duplicate variable is in the second position, so we can summarize it by doing the following:

> summary(Data[,2])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.0000  0.0000  0.5781  1.0000  9.0000

This tells us that some station Ids come with many duplicate records, as many as 9, while most have none. We can see this visually by plotting a histogram. I'll spare you the graphics.

> hist(Data$Duplicate, breaks=10)

And we can see that there are a fair number of temperature values as well as missing values. First, we will count the temperature values. The function is.na() returns TRUE (which counts as 1) for every value that is NA (missing); !is.na(Data) gets us the opposite, a structure with TRUE in every place where there is a temperature. Summing a logical structure counts its TRUEs:

> sum(!is.na(Data))
[1] 8579259
> sum(is.na(Data))
[1] 381336

And we can take a mean of all temperatures by month if we like: we issue the mean command over all the columns we want. We have to remove NAs, or the mean will be NA.
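A sketch of what that looks like, using the month names from the str() output above (the helper name monthlyMean is mine, not from the original code):

```r
# Take the per-month mean over all twelve month columns at once,
# dropping NAs within each column so the result is not NA.
months <- c("Jan","Feb","Mar","Apr","May","Jun",
            "Jul","Aug","Sep","Oct","Nov","Dec")
monthlyMean <- function(d) colMeans(d[months], na.rm = TRUE)
# monthlyMean(Data) then gives one mean per month column.
```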

That's enough basics for now. Let's proceed to the problem of combining duplicates. First, an example of what a duplicate looks like: in the example below a duplicate is actually just the same values, but that is not always the case.

As you can see, duplicate records 6 and 7 have the same value for most months, but in certain months they differ, in this case by 1/100 of a degree C. I spent considerable time comparing duplicate records. It's a lot of grunt work. In the end, you have to decide how to process these records. There are several analytical choices, and for my work I decided to average the duplicates. This entails finding all the duplicate records within an Id that have overlapping years and then averaging them by month. Handling this in a looping structure is fairly straightforward. Also, since most of the stations don't have duplicates, it might be attractive to segregate the stations into those that have duplicates and those that don't. I've coded up several versions of this process using various built-in functions of R. In the end, I settled on a solution offered up by the R help list. This approach just computes the mean even if there are no duplicates. Since averaging duplicates only needs to happen when you refresh the base data, I decided that an elegant one-line wonder was preferable to more complicated code. The averaging code:
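A sketch of an averageDuplicates() consistent with the description that follows (the exact original code isn't reproduced here; this assumes columns named Id, Duplicate, Year, and Jan through Dec):

```r
averageDuplicates <- function(v2mean) {
  months <- c("Jan","Feb","Mar","Apr","May","Jun",
              "Jul","Aug","Sep","Oct","Nov","Dec")
  # Average the monthly columns within each Id/Year group.
  # A group whose values are all NA comes back as NaN from mean().
  data <- aggregate(v2mean[months], by = v2mean[c("Id", "Year")],
                    FUN = mean, na.rm = TRUE)
  # is.na(NaN) is TRUE, so this turns the NaN results back into NA.
  data[is.na(data)] <- NA
  # Quick sort on the Ids (and Years) to keep everything in order.
  data[order(data$Id, data$Year), ]
}
```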

That's it! The aggregate function works on a data structure, in this case those columns of the incoming data that are labeled with month names. Aggregate needs to be fed a set of indices; here we feed it an index for Id and an index for Year: v2mean[c('Id','Year')]. The function mean is then applied to each of the columns, NAs are removed, and the result is output. A last couple of details. I've included a line that looks like it makes no sense, data[is.na(data)]<-NA, which appears to just assign NA to an element that is already NA! In reality, when we aggregate over the data frame there will be cases where we are averaging nothing but NAs, and when we call mean(..., na.rm=TRUE) on a group that is all NAs the result is NaN. In previous versions I had written code to detect this and apply the mean() function conditionally, that is, only if the values were not all NA. Since is.na(NaN) tests TRUE, I just chose to replace the NaN with NA. Finally, I do a quick sort on the Ids, just to make sure everything has remained in order. This isn't strictly necessary.

At the end of Moshtemp101 we had read in the raw v2.mean and created an R object Data. Now we average all the duplicates in that file:

Data<-averageDuplicates(Data)

Or if we want to do it all in one line, we just do this:

Data<-averageDuplicates(readV2Mean())

Next up: selecting subsets of the data, like spans of years, and writing out data objects. Since we won't be processing duplicates very often, we will be saving the results of these processes as R objects. More on that in Moshtemp103. And yes, I hate my color scheme already. Find MoshTemp102 at the drop.

I don't know what came over me when I picked the template. Artsy more than techy has always been the dilemma.
Aggregate is very slick, although I've had some issues with it of late when using it to sum months into years.
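One aggregate() gotcha worth noting when summing with na.rm (whether this is the issue meant above is my assumption):

```r
# sum(..., na.rm = TRUE) over a group that is ALL NA returns 0, not NA,
# so a year with no data at all silently becomes a zero total.
df  <- data.frame(Id = 1, Year = c(1950, 1951), Jan = c(10, NA))
out <- aggregate(df["Jan"], by = df[c("Id", "Year")],
                 FUN = sum, na.rm = TRUE)
# out$Jan for 1951 is 0 even though 1951 has no observations at all;
# mean() in the same situation at least returns NaN, which is detectable.
```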

Combining duplicates when the monthly values differ by 1/10 degree C seems safe because those differences are due to rounding during the conversion from degrees F to degrees C. However, here’s an example of when NOT to combine duplicates:

Somewhere in my stack of stuff I have the total statistics on all duplicates: the maximum seen difference, the median, the average, etc.

The duplicates are the same location. The location is given by the GHCN id 42572469000. The duplicate flag means that NOAA has two official reports for the same station at the same location; the 0 or 1 does not refer to a different location. Regardless, if they were different locations you would still average them.

That is the same GHCNID. BOTH DENVER. Two records for Denver: record number 0 and record number 1.
When NOAA built the GHCN they went to get records. In this case they got TWO records from the same location.

Now, location: 42572469001 is different. That is Dillon.

Some methods, like Hansen's, will COMBINE at the IMOD level using a reference station method. I don't.

The difference between IMOD and duplicate is tough because you actually have to read the NOAA documentation, which is long and boring. But you are confusing the two. The 11-digit number is the GHCNID; the 12-digit number is the ID PLUS the duplicate flag. Duplicate means: SAME ID, multiple scribal records for the same location. It arises because the same locations often have TWO or more instruments. So we average over the duplicates. Same location.

Steven Mosher

August 19, 2010 at 8:58 PM

WRT averaging Boston and New York: yes, I would average them, IF you ask me what the average temperature is for that AREA.

Today it is 50F in San Francisco. Suppose it's 70 in LA. If that is ALL the data I have for California and you ask me what's the average for California, I will say "my BEST estimate is 60". If you tell me that San Jose is 60F then I have some analytical choices to make.

A: (70 + 50 + 60)/3 = 180/3 = 60
B: (70 + (50 + 60)/2)/2 = 125/2 = 62.5

In the first case they are not adjusted for area; the two close stations (SJ and SF) are given equal weight.
In the second case I average by AREA: averaging WITHIN a certain distance first, THEN averaging over the blocks.
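The two choices work out like this in R (grouping SF with SJ into one block, per the example):

```r
temps <- c(SF = 50, SJ = 60, LA = 70)
# Choice A: every station gets equal weight, regardless of location.
choiceA <- mean(temps)                                       # 60
# Choice B: average the nearby stations first, then average the blocks.
choiceB <- mean(c(mean(temps[c("SF", "SJ")]), temps["LA"]))  # 62.5
```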

Good question, though, and a good lead into our next step. The question is WHAT AREA can one average over? Which turns on the spatial autocorrelation in TREND. So you can test that by doing the averages over different-sized areas.

I completely understand imod vs. duplicate. There are two different records in 1948 for Denver 42572469000. I did not show you any data for Dillon 42572469001. Dillon has nothing to do with this discussion. The two records I showed you are for Denver with different duplicate numbers. So GHCN is saying that they are supposed to be from the same station – same imod. But in actuality, they are not.

Then you claimed “Because duplicate 0 is airport data and duplicate 1 is city data. They are totally different stations, even though they are both “in” Denver.” I guess I’m asking this:

1. Do you have the documentation that shows this? In looking through all the GHCN documentation I see no metadata for duplicates.
2. If they ARE different stations within Denver, yes, I WOULD average them. Always.

It's now 55F in SOMA (south of Market); across town in the Marina it's 62F.
What's the temperature in the Tenderloin?

If you DON'T average them, how do you get an average for Denver? Add them?

1.) The documentation is not in GHCN - that is one of the problems - there is not enough information to use the data correctly. You have to look at other data sources, like World Weather Records, to untangle the GHCN mess.

2.) Denver city reported from 1873-1970. Denver Stapleton reported from 1940-2001. The reason you can’t average them is because they don’t have the same period of record. Simple example: every year from 1873-1970, the city temperature is 10 degrees. Every year from 1941-2001, the airport temperature is 8 degrees. Average them, you get 10 degrees for 1873-1940, 9 degrees for 1941-1970, and 8 degrees for 1971-2001. Now use 1941-1970 as your anomaly base period. Was the temperature in Denver really 1 degree above normal until 1940 and then 1 degree below normal from 1971-2000? No!
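The arithmetic in that example can be checked directly (constant hypothetical temperatures, exactly as stated):

```r
# City reports 10 degrees from 1873-1970, airport reports 8 from 1941-2001;
# naively average the two wherever either reports.
years    <- 1873:2001
city     <- ifelse(years <= 1970, 10, NA)
airport  <- ifelse(years >= 1941, 8, NA)
combined <- rowMeans(cbind(city, airport), na.rm = TRUE)
# Combined series: 10 until 1940, 9 during the 1941-1970 overlap, 8 after.
base    <- mean(combined[years >= 1941 & years <= 1970])  # 9
anomaly <- combined - base
# anomaly is +1 before 1941, 0 in the overlap, -1 after 1970,
# even though neither station's temperature ever changed.
```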

Steven Mosher

August 20, 2010 at 12:23 AM

1. link to the WWR so I can cross check the records

2. In the case of stitched records

a. The sensitivity to stitching methods is nil. The reference station method shows this.
b. The number of stitched records is relatively low.
c. Monte Carlo testing of the duplicate records shows the same results.
d. The impact to trend is nil.

The bottom line is that it's getting warmer. One can, for example, show this by selecting only continuous records. When I'm done posting the code you are welcome to do your own testing of that. All the tools will be provided.

If I had a HIGH number of stations like that I would use Nick Stokes', Jeff Id's, or Chad Herman's (McKitrick) least squares approach. The least squares approach doesn't depend upon the overlap. My personal preference is Roman/Id's approach for just this reason. However, since our answers are highly correlated, the concern you raise is interesting, discussed before, and OBE. If I get time I may play with the stitching issue. If you have any more questions, then read up on the reference station method and have a look at Nick's code or Id's code. It's one of those things where I thought "it will make a huge difference"; in the end, not important. The issues lie in metadata. The issues lie in adjustments. The issues lie in UHI. Not in handling duplicates. But you have the data, go make a case for the global importance of it.