Introduction

The latest forecast discussions for Northern Alaska have included
warnings that we are likely to experience an extended period of below
normal temperatures starting at the end of this week, and yesterday’s
Deep Cold blog post discusses the
similarity of model forecast patterns to patterns seen in the 1989 and
1999 extreme cold events.

Our dogs spend most of their time in the
house when we’re home, but
if both of us are at work they’re outside in the dog yard. They have
insulated dog houses, but when it’s colder than −15° F, we put them into
a heated dog barn. That means one of us has to come home in the middle
of the day to let them out to go to the bathroom.

Since we’re past the Winter Solstice, and day length is now increasing, I was
curious to see if that has an effect on daily temperature, hopeful that the
frequency of days when we need to put the dogs in the barn is decreasing.

Methods

We’ll use daily minimum and maximum temperature data from the Fairbanks
International Airport station, counting, for each day of the year, how many
years the temperature was below −15° F and dividing by the total number of
years to get a frequency. We live in a cold valley on Goldstream Creek, so our
temperatures are typically several degrees colder than the Fairbanks
Airport, and we often don’t warm up as much during the day as in other
places, but minimum airport temperature is a reasonable proxy for the
overall winter temperature at our house.
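
Here’s a sketch of that calculation in R, assuming a data frame named
fia_daily with one row per date and hypothetical columns date, tmin_f and
tmax_f holding the airport minimum and maximum temperatures in Fahrenheit:

    library(dplyr)
    library(lubridate)

    # For each day of the year, count the years where the minimum (or maximum)
    # temperature was below -15 F and divide by the number of years of record.
    below_15_freq <- fia_daily %>%
        mutate(doy = yday(date)) %>%
        group_by(doy) %>%
        summarize(n_years = n(),
                  min_freq = sum(tmin_f < -15, na.rm = TRUE) / n_years,
                  max_freq = sum(tmax_f < -15, na.rm = TRUE) / n_years)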

Results

The following plot shows the frequency of minimum (the top of each line)
and maximum (the bottom) temperature colder than −15° F at the airport
over the period of record, 1904–2016. The curved blue line represents a
best fit line through the minimum temperature frequency, and the
vertical blue line is drawn at the date when the frequency is the
highest.

The maximum frequency is January 12th, so we have a few more days before
the likelihood of needing to put the dogs in the barn starts to decline.
The plot also shows that we could still reach that threshold all the way
into April.

For fun, here’s the same plot using −40° as the threshold:

The date when the frequency starts to decline is shifted slightly to
January 15th, and you can see the frequencies are lower. In mid-January,
we can expect the minimum temperature to be colder than −15° F more than
half the time, but temperatures colder than −40° occur just under 15% of
the time.
There’s also an interesting anomaly in mid to late December where the
frequency of very cold temperatures appears to drop.

Introduction

So far this winter we’ve gotten only 4.1 inches of snow, well below the normal
19.7 inches, and there are only 2 inches of snow on the ground. At this point
last year we had 8 inches and I’d been biking and skiing on the trail to work
for two weeks. In his North Pacific Temperature Update
blog post, Richard James mentions that winters like this one, with a combined
strongly positive Pacific Decadal Oscillation phase and strongly negative North
Pacific Mode phase, tend to be a “distinctly dry” pattern for interior Alaska. I
don’t pretend to understand these large scale climate patterns, but I thought it
would be interesting to look at snowfall and snow depth in years with very
little mid-November snow. In other years like this one do we eventually get
enough snow that the trails fill in and we can fully participate in winter
sports like skiing, dog mushing, and fat biking?

Data

We will use daily data from the Global Historical Climatology Network (GHCND)
data set for the Fairbanks International Airport station. Data prior to 1950 is
excluded because
of poor quality snowfall and snow depth data and because there’s a good chance
that our climate has changed since then and patterns from that era aren’t a good
model for the current climate in Alaska.

We will look at both snow depth and the cumulative winter snowfall.

Results

The following tables show the ten years with the lowest cumulative
snowfall and snow depth values from 1950 to the present on November 18th.

Year   Cumulative Snowfall (inches)
1953    1.5
2016    4.1
1954    4.3
2014    6.0
2006    6.4
1962    7.5
1998    7.8
1960    8.5
1995    8.8
1979   10.2

Year   Snow depth (inches)
1953   1
1954   1
1962   1
2016   2
2014   2
1998   3
1964   3
1976   3
1971   3
2006   4

2016 has the second-lowest cumulative snowfall, behind 1953, and is tied
with 2014 for the second-lowest snow depth; 1953, 1954 and 1962 all had
only 1 inch of snow on the ground on November 18th.

It also seems like recent years appear in these tables more frequently
than would be expected. Grouping by decade and averaging cumulative
snowfall and snow depth yields the pattern in the chart below. The error
bars (not shown) are fairly large, so the differences between decades
aren’t likely to be statistically significant, but there is a pattern of
lower snowfall amounts in recent decades.
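
A minimal sketch of that grouping, assuming a data frame nov18 with one row
per year and hypothetical columns year, cum_snowfall and snow_depth holding
the November 18th values:

    library(dplyr)

    # Average the November 18th snowfall and snow depth values by decade.
    by_decade <- nov18 %>%
        mutate(decade = floor(year / 10) * 10) %>%
        group_by(decade) %>%
        summarize(avg_snowfall = mean(cum_snowfall, na.rm = TRUE),
                  avg_depth = mean(snow_depth, na.rm = TRUE))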

Now let’s see what happened in those years with low snowfall and snow depth
values in mid-November starting with cumulative snowfall. The following plot
(and the subsequent snow depth plot) shows the data for the low-value years (and
one very high snowfall year—1990), with each year’s data as a separate line. The
smooth dark cyan line through the middle of each plot is the smoothed line
through the values for all years; a sort of “average” snowfall and snow depth
curve.

In all four mid-November low-snowfall years, the cumulative snowfall values
remain below average throughout the winter, but snow did continue to fall as the
season went on. Even the lowest winter year here, 2006–2007, still ended the
winter with 15 inches of snow on the ground.

The following plot shows snow depth for the four years with the lowest snow
depth on November 18th. The data is formatted the same as in the previous plot
except we’ve jittered the values slightly to make the plot easier to read.

The pattern here is similar, but the snow depths get much closer to the
average values. Snow depth in all four low-snow years remains low
throughout November, but starts rising in December, dramatically so in
1954 and 2014.

One of the highest snowfall years between 1950 and 2016 was 1990–1991 (shown on
both plots). An impressive 32.8 inches of snow fell in eight days between
December 21st and December 28th, accounting for the sharp increase in
cumulative snowfall and snow depth shown on both plots. There are five years in
the record where the cumulative total for the entire winter was lower than these
eight days in 1990.

Conclusion

Despite the lack of snow on the ground to this point in the year, the
record shows that we are still likely to get enough snow to fill in the
trails. We may need to wait until mid to late December, but it’s even
possible we’ll eventually reach the long-term average depth before spring.

Appendix

Here’s the R code used to generate the statistics, tables and plots from this
post:

May 13th seems very early in the year to hit 80 degrees in Fairbanks, so I
decided to check it out. What I’m doing here is selecting all the dates where
the temperature is above 80°F, then ranking those dates by year and date, and
extracting the “winner” for each year (where rank is 1).
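
Here’s a rough sketch of that approach with dplyr, assuming a data frame
fairbanks_daily with hypothetical columns date and tmax_f:

    library(dplyr)
    library(lubridate)

    # Keep only days above 80 F, then pull out the earliest such date each year.
    first_80 <- fairbanks_daily %>%
        filter(tmax_f > 80) %>%
        mutate(year = year(date)) %>%
        group_by(year) %>%
        mutate(rank = row_number(date)) %>%   # earliest qualifying date gets rank 1
        filter(rank == 1) %>%
        ungroup() %>%
        arrange(date) %>%
        select(year, date, tmax_f)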

Introduction

One of the best sources of weather data in the United States comes from
the National Weather Service's Cooperative Observer Network (COOP),
which is available from
NCDC.
It's daily data, collected by volunteers at more than 10,000 locations.
We participate in this program at our house (station id DW1454 /
GHCND:USC00503368), collecting daily minimum and maximum temperature,
liquid precipitation, snowfall and snow depth. We also collect river
heights for Goldstream Creek as part of the Alaska Pacific River
Forecast Center (station GSCA2). Traditionally, daily temperature
measurements were collected using a minimum/maximum thermometer, which
meant that the only way to calculate average daily temperature was by
averaging the minimum and maximum temperature. Even though COOP
observers typically have an electronic instrument that could calculate
average daily temperature from continuous observations, the daily
minimum and maximum data is still what is reported.

In an earlier post we looked
at methods used to calculate average daily temperature, and whether there
are any biases in the way the National Weather Service calculates it
using the average of the minimum and maximum daily temperature. We
looked at five years of data collected at my house every five minutes,
comparing the average of these temperatures against the average of the
daily minimum and maximum. Here, we will be repeating this analysis
using data from the Climate Reference
Network stations in the United
States.

The US Climate Reference Network is a collection of 132 weather stations
that are properly sited and maintained, and include multiple redundant
measures of temperature and precipitation. Data is available from
http://www1.ncdc.noaa.gov/pub/data/uscrn/products/ and includes monthly,
daily, and hourly statistics, and sub-hourly (5-minute) observations.
We’ll be focusing on the sub-hourly data, since it closely matches the
data collected at my weather station.

The code to insert all of this data into a database can be found
here.
Once inserted, I have a table named crn_stations that has the
station data, and one named crn_subhourly with the five minute
observation data.

Methods

Once again, we’ll use R to read the data, process it, and produce plots.

Remove observations without temperature data, group by station and date,
calculate average daily temperature using the two methods, remove any
daily data without a full set of data, and collect the results into an R
data frame. This looks very similar to the code used to analyze the data
from my weather station.
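
Here’s a sketch of those steps; the database name, column names, and the
completeness check are assumptions (the actual names in the database may
differ):

    library(dplyr)

    noaa <- src_postgres(dbname = "noaa")

    daily_averages <- tbl(noaa, "crn_subhourly") %>%
        filter(!is.na(air_temperature)) %>%
        group_by(station_id, obs_date) %>%
        summarize(n_obs = n(),
                  true_avg = mean(air_temperature),
                  minmax_avg = (min(air_temperature) + max(air_temperature)) / 2) %>%
        collect() %>%
        filter(n_obs == 288) %>%   # keep only days with all 288 five-minute observations
        mutate(anomaly = minmax_avg - true_avg)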

Results

The average anomaly across all stations and all dates is 0.44 degrees
Celsius (0.79 degrees Fahrenheit). That’s a pretty significant error.
Half the data is between −0.1 and 1.0°C (−0.23 and +1.8°F) and the full
range is −11.9 to +10.8°C (−21.4 to +19.4°F).

Plots

Let’s look at some plots.

Raw data by latitude

To start, we’ll look at all the anomalies by station latitude. The plot
only shows one percent of the actual anomalies because plotting 512,460
points would take a long time and the general pattern is clear from the
reduced data set.

The clouds of points show the differences between the min/max daily
average and the actual daily average temperature, where numbers above
zero represent cases where the min/max calculation overestimates daily
average temperature. The blue line is the fit of a linear model relating
latitude with temperature anomaly. We can see that the fitted anomaly is
positive at all latitudes, averaging around half a degree at lower latitudes
and dropping somewhat as we proceed northward. You also get a sense from the actual
data of how variable the anomaly is, and at what latitudes most of the
stations are found.
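
Something like the following would produce a plot along these lines, assuming
a data frame named anomalies with one row per station × date and columns
latitude and anomaly (a join of the daily results with the crn_stations
table):

    library(dplyr)
    library(ggplot2)

    set.seed(42)
    anomaly_sample <- anomalies %>% sample_frac(0.01)   # plot roughly one percent of the points

    ggplot(anomaly_sample, aes(x = latitude, y = anomaly)) +
        geom_point(alpha = 0.25, size = 0.5) +
        geom_smooth(method = "lm", colour = "blue") +
        labs(x = "Station latitude",
             y = "Anomaly (min/max average minus true average, °C)")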

The overall model and coefficients are highly significant, and show a
slight decrease in the positive anomaly as we move farther north.
Perhaps this is part of the reason why the analysis of my station (at a
latitude of 64.89) showed an average anomaly close to zero (−0.07°C /
−0.13°F).

Anomalies by month and latitude

One of the results of our earlier analysis was a seasonal pattern in the
anomalies at our station. Since we also know there is a latitudinal
pattern in the data, let’s combine the two, plotting anomaly by month
and faceting by latitude.

Station latitudes are binned into groups for plotting, and the plots
themselves show the range that covers half of all anomalies for each
latitude category × month. Including the full range of anomalies in each
group tends to obscure the overall pattern, and the plot of the raw data
didn’t show an obvious skew to the rarer anomalies.
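
A sketch of the binning and summary step; the bin breaks below are arbitrary,
for illustration only:

    library(dplyr)
    library(lubridate)

    # Bin station latitudes, then summarize the anomaly quartiles by bin and month.
    by_lat_month <- anomalies %>%
        mutate(lat_bin = cut(latitude, breaks = c(20, 35, 45, 55, 65, 75)),
               month = month(obs_date)) %>%
        group_by(lat_bin, month) %>%
        summarize(q1 = quantile(anomaly, 0.25),
                  mean_anomaly = mean(anomaly),
                  q3 = quantile(anomaly, 0.75),
                  n = n())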

And the plot itself. At the end, we’re using a function called
facet_adjust, which adds x-axis tick labels to the facet on the
right that wouldn't ordinarily have them. The code comes from this
stack overflow
post.

Each plot shows the range of anomalies from the first to the third
quartile (50% of the observed anomalies) by month, with the dot near the
middle of the line at the mean anomaly. The orange horizontal line shows
the overall mean anomaly for that latitude category, and the count at
the bottom of the plot indicates the number of “station years” for that
latitude category.

It’s clear that there are seasonal patterns in the differences between
the mean daily temperature and the min/max estimate. But each plot looks
so different from the next that it’s not clear if the patterns we are
seeing in each latitude category are real or artificial. It is also
problematic that three of our latitude categories have very little data
compared with the other two. It may be worth performing this analysis in
a few years when the lower and higher latitude stations have a bit more
data.

Conclusion

This analysis shows that there is a clear bias in using the average of
minimum and maximum daily temperature to estimate average daily
temperature. Across all of the CRN stations, the min/max estimator
overestimates daily average temperature by almost half a degree
Celsius (0.8°F).

We also found that this error is larger at lower latitudes, and that
there are seasonal patterns to the anomalies, although the seasonal
patterns don’t seem to have clear transitions moving from lower to
higher latitudes.

The current length of the CRN record is quite short, especially for the
sub-hourly data used here, so the patterns may not be representative of
the true situation.

Abstract

The following is a document-style version of a presentation I gave at work a
couple weeks ago. It's a little less useful for a general audience because you
don't have access to the same database I have, but I figured it might be useful
for someone who is looking at using dplyr or manipulating the GHCND data
from NCDC.

Introduction

Today we’re going to briefly take a look at the GHCND climate database
and a couple new R packages (dplyr and tidyr) that make data
import and manipulation a lot easier than using the standard library.

For further reading, consult the vignettes for dplyr and tidyr,
and download the cheat sheet.

GHCND database

The GHCND
database
contains daily observation data from locations around the world. The
README linked above describes the data set and the way the data is
formatted. I have written scripts that process the station data and the
yearly download files and insert them into a PostgreSQL database
(noaa).
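
A minimal sketch of connecting to that database from R with dplyr; the
observation and variable table names here are placeholders (ghcnd_stations
is referenced later in the text):

    library(dplyr)

    noaa <- src_postgres(dbname = "noaa")

    ghcnd_stations  <- tbl(noaa, "ghcnd_stations")
    ghcnd_obs       <- tbl(noaa, "ghcnd_obs")        # placeholder name for the observation table
    ghcnd_variables <- tbl(noaa, "ghcnd_variables")  # placeholder name for the variables table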

“Tidy” data

Without going into too much detail on the subject (read Hadley
Wickham’s paper for more information), the basic idea is that it is much
easier to analyze data when it is in a particular, “tidy”, form. A tidy
dataset has a single table for each type of real-world object or type of
data, and each table has one column per variable measured and one row per
observation.

For example, here’s a tidy table representing daily weather observations
with station × date as rows and the various variables as columns.

dplyr and tidyr are the data import and manipulation libraries
we will use, knitr is used to produce tabular data in report-quality
forms, ggplot2 and scales are plotting libraries, and
lubridate is a library that makes date and time manipulation easier.
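
Loading those packages looks like this:

    library(dplyr)      # data import and manipulation
    library(tidyr)      # reshaping data between long and wide forms
    library(knitr)      # report-quality tables
    library(ggplot2)    # plotting
    library(scales)     # axis formatting helpers for ggplot2
    library(lubridate)  # date and time manipulation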

Also note the warnings about how several R functions have been “masked”
when we imported dplyr. This just means we’ll be getting the
dplyr versions instead of those we might be used to. In cases where
we need both, you can prefix the function with its package:
base::filter would use the normal filter function instead of the
one from dplyr.

Each row in the observation table contains the station_id, date,
a variable code, the raw value for that variable, and a series of flags
indicating data quality, source, and special measurements such as the
“trace” value used for precipitation below the minimum measurable amount.

Each row in the variables table contains a variable code, description
and the multiplier used to convert the raw value from the observation
table into an actual value.

This is an example of completely “normalized” data, and it’s stored this
way because not all weather stations record all possible variables, and
rather than having a single row for each station × date with a whole
bunch of empty columns for those variables not measured, each row
contains a single station × date × variable observation.

We are also missing information about the stations, so let’s load that
data:

The first part is the same as before, loading the ghcnd_stations
table, but we are filtering that data down to just the Fairbanks area
stations with long term records. To do this, we use the pipe operator
%>% which takes the data from the left side and passes it to the
function on the right side, the filter function in this case.

filter requires one or more conditional statements with variable
names on the left side and the condition on the right. Multiple
conditions can be separated by commas if you want all the conditions to
be required (AND) or separated by a logic operator (& for AND, |
for OR). For example: filter(latitude > 70, longitude < -140).

When used on database tables, filter can also use conditionals that
are built into the database which are passed directly as part of a
WHERE clause. In our code above, we’re using the %in% operator
here to select the stations from a list.
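
For example, something like the following, where the station ids in the
vector are only for illustration:

    # Filter the stations table down to a list of station ids.
    fairbanks_stations <- ghcnd_stations %>%
        filter(station_id %in% c("USW00026411", "USC00503368"))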

Now we have the station_ids we need to get just the data we want
from the observation table and combine it with the other tables.

Join the observation table with the filtered station data, using
station_id as the variable to combine against. Because this is an
“inner” join, we only get results where station_id matches in
both the observation and the filtered station data. At this point we
only have observation data from our long-term Fairbanks stations.

Join the variable table with the Fairbanks area observation data,
using variable to link the tables.

Add a new variable called value which is calculated by
multiplying raw_value (coming from the observation table) by
raw_multiplier (coming from the variable table).

Remove rows where the quality flag is not an empty space.

Select only the station name, date, variable and actual value columns
from the data. Before we did this, each row would contain every
column from all three tables, and most of that information is not
necessary.

Finally, we “collect” the results. dplyr doesn’t actually perform
the full SQL until it absolutely has to. Instead it’s retrieving a
small subset so that we can test our operations quickly. When we are
happy with the results, we use collect() to grab the full data.
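
Putting those steps together, a sketch of the pipeline might look like this
(column names follow the descriptions above; the names in your database may
differ):

    fairbanks_raw <- ghcnd_obs %>%
        inner_join(fairbanks_stations, by = "station_id") %>%   # keep only Fairbanks-area stations
        inner_join(ghcnd_variables, by = "variable") %>%        # bring in the variable multipliers
        mutate(value = raw_value * raw_multiplier) %>%          # convert raw values to actual values
        filter(quality_flag == " ") %>%                         # drop rows flagged for quality problems
        select(station_name, date, variable, value) %>%
        collect()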

De-normalize it

The data is still in a format that makes it difficult to analyze, with
each row in the result containing a single station × date × variable
observation. A tidy version of this data requires each variable to be a
column in the table, with each row being a single date at each station.

To “pivot” the data, we use the spread function, and we'll also
calculate a new variable and reduce the number of columns in the result.

spread takes two parameters, the variable we want to spread across
the columns, and the variable we want to use as the data value for each
row × column intersection.
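
A sketch of the pivot, assuming the GHCND variable codes come through as
lower-case column names after the spread; the calculated variable here is
just an example:

    library(dplyr)
    library(tidyr)

    fairbanks <- fairbanks_raw %>%
        spread(variable, value) %>%
        mutate(tavg = (tmin + tmax) / 2) %>%   # an example of a calculated variable
        select(station_name, date, tmin, tmax, tavg, prcp, snow, snwd)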

Examples

Now that we've got the data in a format we can work with, let's look at a few
examples.

Find the coldest temperatures by winter year

First, let’s find the coldest winter temperatures from each station, by winter
year. “Winter year” is just a way of grouping winters into a single value.
Instead of the 2014–2015 winter, it’s the 2014 winter year. We get this by
subtracting 92 days (the days in January, February, March) from the date, then
pulling off the year.
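
Here’s a sketch of that calculation (a reconstruction, not the original code;
the completeness threshold below is illustrative):

    library(dplyr)
    library(tidyr)
    library(lubridate)

    coldest <- fairbanks %>%
        filter(!is.na(tmin)) %>%
        mutate(winter_year = year(date - days(92))) %>%
        group_by(station_name, winter_year) %>%
        summarize(min_temp = min(tmin),
                  n_days = n()) %>%
        ungroup() %>%
        filter(n_days > 365 - 14) %>%   # drop winter years missing more than two weeks of data
        select(station_name, winter_year, min_temp) %>%
        spread(station_name, min_temp)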

See if you can follow the code above. The pipe operator makes it easy to
see each operation performed along the way.

There are a couple new functions here, group_by and summarize.
group_by indicates at what level we want to group the data, and
summarize uses those groupings to perform summary calculations using
aggregate functions. We group by station and winter year, then we use
the minimum and n functions to get the minimum temperature and number of
days in each year where temperature data was available. You can see we
are using n to remove winter years where more than two weeks of data
are missing.

Also notice that we’re using spread again in order to make a single
column for each station containing the minimum temperature data.

Here’s how we can write out the table data as a reStructuredText
document, which can be converted into many document formats (PDF, ODF,
HTML, etc.):
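
    library(knitr)

    # A sketch using knitr's kable, which can emit reStructuredText directly;
    # the output file name here is arbitrary.
    writeLines(kable(coldest, format = "rst"), "coldest_temperatures.rst")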

To plot the data, we need the data in a slightly different format with
each row containing winter year, station name and the minimum
temperature. We’re plotting minimum temperature against winter year,
coloring the points and trendlines using the station name. That means
all three of those variables need to be on the same row.

To do that we use gather. The first parameter is the name of the
variable the column names will be moved into (the station names, which are
currently columns, will become values in a column named station_name).
The second is the name of the column that stores the observations
(tmin) and the parameters after that are the list of columns to
gather together. In our case, rather than specifying the names of the
columns, we’re specifying the inverse: all the columns except winter_year.
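
A sketch of the reshaping and plot described above:

    library(dplyr)
    library(tidyr)
    library(ggplot2)

    # One row per winter year x station, with the minimum temperature as tmin.
    coldest_long <- coldest %>%
        gather(station_name, tmin, -winter_year)

    ggplot(coldest_long, aes(x = winter_year, y = tmin, colour = station_name)) +
        geom_point() +
        geom_smooth(method = "lm", se = FALSE) +
        labs(x = "Winter year", y = "Minimum winter temperature")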