Using the countyweather package

Rachel Severson and Brooke Anderson

2016-10-25

While data from weather stations is available at the specific location of each station, it is often useful to have estimates of daily or hourly weather aggregated over a larger spatial area. For U.S.-based studies, it can be particularly useful to be able to pull time series of weather by county. For example, the health data used in U.S. environmental epidemiology studies is often aggregated at the county level, so the ability to create weather datasets by county makes it much easier to pair weather exposures with those health outcomes.

This package builds on functions from the rnoaa package to identify weather stations within a county based on its FIPS code and then pull weather data for a specified date range from those weather stations. It then does some additional cleaning and aggregating to produce a single, county-level weather dataset. Further, it maps the weather stations used for that county and date range and allows you to create and write datasets for many different counties using a single function call.

If you are pulling weather data from a single weather station, you should use rnoaa directly. However, countyweather allows you to pull and aggregate data from weather stations more easily at the county level for the U.S.

Required set-up for this package

To use this package, you will need an API key from NOAA to be able to access the weather data. This API key is input with some of your data requests to NOAA within functions in this package. You can request an API key from NOAA here: http://www.ncdc.noaa.gov/cdo-web/token. You should keep this key private.

Once you have this NOAA API key, you’ll need to pass it through to some of the functions in this package that pull data from NOAA. The most secure way to use this API key is to store it in your .Renviron configuration file. Then you can save it as the value of an object in R code or R markdown documents without having to include the key itself in the script. To store the NOAA API key in your .Renviron configuration file, first check and see if you already have an .Renviron file in your home directory. You can check this by running the following from your R command line:

any(grepl("^\\.Renviron", list.files("~", all.files = TRUE)))

If this call returns TRUE, then you already have an .Renviron file.

If you already have it, open that file (for example, with system("open ~/.Renviron")). If you do not yet have an .Renviron file, open a new text file (in RStudio, do this by navigating to File > New File > Text File) and save this text file as .Renviron in your home directory. If you are prompted with a warning about using a filename that begins with a dot, confirm that you do want to use that filename.

Once you have opened or created an .Renviron file, type the following into the file, replacing “your_emailed_key” with the actual string that NOAA emails you:

noaakey=your_emailed_key

Do not put quotation marks or anything else around your key. Do make sure to add a blank line as the last line of the file. If you find you’re having problems getting this to work, go back and confirm that you’ve included a blank line as the last line in your .Renviron file. This is the most common reason for this part not working.

Next, you’ll need to restart R. Once you restart R, you can get the value of this NOAA API key from .Renviron anytime with the call Sys.getenv("noaakey"). Before using functions that require the API key, set up the object noaakey to have your NOAA API key by running:

options("noaakey" = Sys.getenv("noaakey"))

This will pull your NOAA API key from the .Renviron file and save it as the object noaakey, which functions in this package need to pull weather data from NOAA’s web services. You will want to put this line of code as one of the first lines of code in any R script or R Markdown file you write that uses functions from this package.

Basic examples of using the package

Weather data is collected at weather stations, and there are often multiple weather stations within a county. The countyweather package allows you to pull weather data from all stations in a specified county over a specified date range. The two main functions in the countyweather package are daily_fips and hourly_fips, which pull daily and hourly weather data, respectively. By default, the weather data pulled from all weather stations in a county will then be averaged for each time point to create an average time series of daily or hourly measurements for that county. There is also an option that allows the user to opt out of the default aggregation across weather stations, and instead pull separate time series for each weather station in the county. This option is explained in more detail later in this document. Opting out of the default aggregation can be useful if you would like to use a method other than a simple average to aggregate across weather stations within a county.

Throughout, functions in this package identify a county using the county’s Federal Information Processing Standard (FIPS) code. FIPS codes are 5-digit codes that uniquely identify every U.S. county. The first two digits of a county FIPS code specify the state and the last three specify the county within the state. This package pulls data based on FIPS designations as of the 2010 Census. Users will not be able to pull data for the few FIPS codes that have undergone substantial changes since 2010; for a list of those codes, see the Census Bureau’s summary of county changes for the 2010s.

Currently, this package can pull daily and hourly weather data for variables like temperature and precipitation. For resources with complete lists of weather variables available through this package, as well as sources of this weather data, see the section later in this document titled “More on the weather data”.

Pulling daily data

The daily_fips function can be used to pull daily weather data for all weather stations within the geographic boundaries of a county. This daily weather data comes from NOAA’s Global Historical Climatology Network. When pulling data for a county, the user can specify date ranges (date_min, date_max), which weather variables to include in the output dataset (var), and restrictions on how much non-missing data a weather station must have over the time period to be included when generating daily county average values (coverage). This function will pull any available data for weather stations in the county under the specified restrictions and output both a dataset of average daily observations across all county weather stations, as well as a map plotting the stations used in the county-wide averaged data.

Here is an example of creating a dataset with daily precipitation for Miami-Dade county (FIPS code = 12086) for August 1992, when Hurricane Andrew struck:
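A sketch of this call is shown below (this requires the NOAA API key set-up described above, and an internet connection; the object name andrew_precip matches the examples that follow):

```r
library(countyweather)

# Daily precipitation for Miami-Dade county (FIPS 12086), August 1992
andrew_precip <- daily_fips(fips = "12086",
                            date_min = "1992-08-01",
                            date_max = "1992-08-31",
                            var = "prcp")
```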

The output from this function call is a list that includes three elements: a daily timeseries of weather data for the county (andrew_precip$daily_data), a dataframe with meta-data about the weather stations used to create the timeseries data, as well as statistical information about the weather values pulled from these stations (andrew_precip$station_metadata), and a map showing the locations of weather stations included in the county-averaged dataset (andrew_precip$station_map).

The dataset includes columns for date (date), precipitation (in mm, prcp), and also the number of stations used to calculate each daily average precipitation observation (prcp_reporting).

This function performs some simple data cleaning and quality control on the weather data originally pulled from NOAA’s web services; see the “More on the weather data” section later in this document for more details, including the units for the weather observations collected by this function.

Here is a plot of this data, with colors used to show the number of stations included in each daily observation:
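As a sketch, a plot like this can be built with ggplot2 from the daily_data element, using the column names described above:

```r
library(ggplot2)

# Daily county-average precipitation, colored by the number of
# stations contributing to each daily value
ggplot(andrew_precip$daily_data,
       aes(x = date, y = prcp, color = prcp_reporting)) +
  geom_line() +
  xlab("Date in 1992") +
  ylab("Daily total precipitation (mm)") +
  scale_color_continuous(name = "# stations\nreporting")
```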

From this plot, you can see both the extreme precipitation associated with Hurricane Andrew (Aug. 24) and that the storm knocked out quite a few of the weather stations normally available.

A map is also included in the output of daily_fips with the stations used for the county average, as the station_map element:

andrew_precip$station_map

This map uses U.S. Census TIGER/Line shapefiles (vintage 2011) and functions from the ggmap package to overlay weather station locations on a map showing the county’s boundaries.

The station_metadata dataframe gives information about all of the stations contributing data to the daily_data dataframe, as well as information about how the values by each station vary within each weather variable. If a weather station is contributing data for multiple variables, it will show up in this dataframe multiple times. Here’s what the station_metadata dataframe looks like for the andrew_precip list:

For each station, the dataframe gives an id and name, as well as latitude and longitude. var indicates the variable for which the station is contributing data. If a station is contributing data for multiple variables, that station will show up in the dataframe once for each of those variables. For each variable and station combination, the dataframe also shows calc_coverage, which is the calculated percent of non-missing values; you can filter on this using the daily_fips option coverage. standard_dev gives the standard deviation for each sample of weather data from each station and weather variable, min and max give the minimum and maximum values, and range gives the range of these values. These last four statistical calculations (standard deviation, maximum, minimum, and range) are only included for the seven core hourly weather variables (wind_direction, wind_speed, ceiling_height, visibility_distance, temperature, temperature_dewpoint, and air_pressure – for more details on these variables, see the “More on the weather data” section below). The values of these columns are set to NA for other variables, such as quality flag data.

If you are interested in looking at the weather values for certain stations, you can use the average_data = FALSE option in daily_fips. For more on this option and a few others, see the “Further options available in the package” section below.

Pulling hourly data

You can use the hourly_fips function to pull hourly weather data by county from NOAA’s Integrated Surface Data (ISD) weather dataset. In this case, NOAA’s web services will not identify weather stations by FIPS, so instead this function will pull all stations within a certain radius of the county’s population mean center to represent weather within that county. While there are seven main weather variables that are possible to pull (listed below in the “More on the weather data” section), temperature and wind_speed tend to be non-missing most often.

An estimated radius is calculated for each county using 2010 U.S. Census Land Area data – each county is assumed to be roughly circular. The calculated radius (in km), as well as the longitude and latitude of the geographic center for each county, are included as elements in the list returned from hourly_fips.

Here is an example of pulling hourly data for Miami-Dade, for the year of Hurricane Andrew. While daily weather data can be pulled using a date range specified to the day, hourly data can only be pulled by year (for one or multiple years) using the year argument:
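A sketch of the call, pulling two of the variables that tend to be non-missing most often:

```r
library(countyweather)

# Hourly wind speed and temperature for Miami-Dade county in 1992
andrew_hourly <- hourly_fips(fips = "12086",
                             year = 1992,
                             var = c("wind_speed", "temperature"))
```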

The output from this call is a list object that includes six elements. andrew_hourly$hourly_data is an hourly timeseries of weather data for the county. The other five elements, station_metadata, station_map, radius, lat_center, and lon_center, are explained in more detail below.
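A time series plot of the hourly data can be sketched with ggplot2. This assumes the hourly_data element includes a date_time column and a wind_speed_reporting column giving the number of stations contributing to each hourly average:

```r
library(ggplot2)

# Hourly county-average wind speed, colored by the number of
# stations contributing to each hourly value
ggplot(andrew_hourly$hourly_data,
       aes(x = date_time, y = wind_speed, color = wind_speed_reporting)) +
  geom_line() +
  xlab("Date in 1992") +
  ylab("Wind speed (m / s)") +
  scale_color_continuous(name = "# stations\nreporting")
```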

Again, the intensity of conditions during Hurricane Andrew is clear, as is the reduction in the number of reporting stations during the storm.

The list object returned by hourly_fips also includes a map of station locations (station_map):

andrew_hourly$station_map

Because hourly data is pulled by radius from each county’s geographic center, this plot includes the calculated radius from which stations are pulled. This radius is calculated for each county using 2010 U.S. Census Land Area data. U.S. Census TIGER/Line shapefiles are used to provide the county outlines included on this plot as well. Because stations are pulled within a radius from the county’s center, stations from outside of the county’s boundaries may sometimes be providing data for that county.

Other list elements returned by hourly_fips include station_metadata, radius, lat_center, and lon_center. radius is the estimated radius (in km) for the county calculated using 2010 U.S. Census Land Area data – the county is assumed to be roughly circular. lat_center and lon_center are the latitude and longitude of the geographic center for the county, respectively.

The station_metadata dataframe gives information about all of the stations contributing data to the hourly_data dataframe, as well as information about how the values by each station vary within each weather variable. If a weather station is contributing data for multiple variables, it will show up in this dataframe multiple times. Here’s what the station_metadata dataframe looks like for the andrew_hourly list:

usaf and wban are station ids. station is a unique identifier for each station – the usaf and wban ids pasted together, separated by “-”. (Note: values for wban or usaf are sometimes missing (originally indicated by “99999” or “999999”), which could result in a station value like 722024-NA.) station_name is the name for each station, and var indicates the variable for which the station is contributing data. If a station is contributing data for multiple variables, that station will show up in the dataframe once for each of those variables. For each variable and station combination, the dataframe also shows calc_coverage, which is the calculated percent of non-missing values; you can filter on this using the hourly_fips option coverage. standard_dev gives the standard deviation for each sample of weather data from each station and weather variable, and range gives the range of these values. The dataframe also gives each station’s country, state, elevation (in meters), the earliest and latest dates for which the station has available data (begin and end, respectively), longitude, and latitude. Here, we can see in row 7 that the OPA LOCKA station has a very low percent coverage for temperature (0.0018), and a correspondingly high standard deviation (17.94). If you are interested in looking at the weather values for certain stations, you can use the average_data = FALSE option in hourly_fips. For more on this option and a few others, see the “Further options available in the package” section below.

Writing out timeseries files

There are a few functions that allow the user to write out daily or hourly timeseries datasets for many different counties to a specified local directory, as well as plots of this data. For daily weather data, see the functions write_daily_timeseries and plot_daily_timeseries. For hourly, see write_hourly_timeseries and plot_hourly_timeseries.

For example, if we wanted to compare daily weather in the month of August for three counties in southern Florida, we could run:
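A sketch of such a call is shown below. The three FIPS codes (12086, 12087, and 12011, for Miami-Dade, Monroe, and Broward counties) and the output directory are illustrative choices:

```r
library(countyweather)

# Write daily August 1992 precipitation time series for three
# southern Florida counties to a local directory
write_daily_timeseries(fips = c("12086", "12087", "12011"),
                       date_min = "1992-08-01",
                       date_max = "1992-08-31",
                       var = "prcp",
                       out_directory = "~/andrew_data")
```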

The write_daily_timeseries function saves each county’s timeseries as a separate file in a subdirectory called “data” of the directory specified in the out_directory option. The data_type argument allows the user to specify either .rds or .csv files (the default is to write .rds files). Each file is a time series dataframe of daily weather data. A dataframe of station metadata is saved in a second subdirectory called “metadata”, and maps showing locations of weather stations contributing to each time series are saved in a subdirectory called “maps.” At this stage, if you were to include a county in the fips argument without available data, a file would not be created for that county.

The function plot_daily_timeseries creates and saves plots for each of these files. (Note: the data_type argument for this function also defaults to read .rds files, so if you chose to write .csv files, make sure to change that argument in this function as well to data_type = "csv".)
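For example, assuming the time series were written as .rds files to a "data" subdirectory of ~/andrew_data (a hypothetical path), a sketch of the call would be:

```r
# Create and save one plot per county time series file
plot_daily_timeseries(var = "prcp",
                      date_min = "1992-08-01",
                      date_max = "1992-08-31",
                      data_directory = "~/andrew_data/data",
                      plot_directory = "~/andrew_plots")
```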

Here’s an example of what the timeseries plots for the three Florida counties would look like:

Further options available in the package

coverage

For hourly_fips, daily_fips, and time series functions, the user can choose to filter out any stations that report variables for less than a certain percent of the specified date range (coverage). For example, if you were to set coverage to 0.90, only stations that reported non-missing values at least 90% of the time over the specified date range would be included in your data.
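As a sketch, here is the earlier daily_fips call restricted to stations with at least 90% coverage:

```r
library(countyweather)

# Only use stations reporting prcp on at least 90% of the days
# in the specified date range
andrew_precip_90 <- daily_fips(fips = "12086",
                               date_min = "1992-08-01",
                               date_max = "1992-08-31",
                               var = "prcp",
                               coverage = 0.90)
```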

average_data

In both daily_fips and hourly_fips, the default is to return a single daily average for the county for each day in the timeseries, giving the value averaged across all available stations on that day. However, there is also an option called average_data which allows the user to specify whether they would like the weather data returned before it has been averaged across stations. If this argument is set to FALSE, the functions will return separate daily data for each station in the county. For our Hurricane Andrew example, we can specify average_data = FALSE:
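A sketch of this call (station_label = TRUE is included here so the returned station map carries station labels, as used later in this section):

```r
library(countyweather)

# Pull station-specific daily data rather than a county-wide average
not_averaged <- daily_fips(fips = "12086",
                           date_min = "1992-08-01",
                           date_max = "1992-08-31",
                           var = "prcp",
                           average_data = FALSE,
                           station_label = TRUE)
```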

In this example, there are six stations contributing weather data to the timeseries. We can plot the data by station to get a sense for how values from each station compare, and which stations were presumably knocked out by the storm, with different colors used to show values for different stations:
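As a sketch, assuming the station-specific daily data identifies each station in an id column:

```r
library(ggplot2)

# One precipitation time series per station, colored by station
ggplot(not_averaged$daily_data,
       aes(x = date, y = prcp, color = id)) +
  geom_line()
```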

It might be interesting here to compare this plot with the station map, this time with station labels included (done using station_label = TRUE when we pulled this data using daily_fips):

not_averaged$station_map

Quality Flags

The hourly Integrated Surface Data includes quality codes for each of the main weather variables. For more information about the hourly weather variables, see the “More on the weather data” section below. We can use these codes to remove suspect or erroneous values from our data. The values in wind_speed_quality, for example, take on the following values (pulled from the ISD documentation file):

| code | definition |
|------|------------|
| 0 | Passed gross limits check |
| 1 | Passed all quality control checks |
| 2 | Suspect |
| 3 | Erroneous |
| 4 | Passed gross limits check, data originate from an NCEI data source |
| 5 | Passed all quality control checks, data originate from an NCEI data source |
| 6 | Suspect, data originate from an NCEI data source |
| 7 | Erroneous, data originate from an NCEI data source |
| 9 | Passed gross limits check if element is present |

Because it doesn’t make sense to average these codes across stations, the codes should only be pulled when using the option to pull station-specific values (average_data = FALSE).
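As a sketch, here is one way to pull station-specific hourly wind speed along with its quality code and then filter on it. This assumes the quality code column comes back as numeric; if it is returned as character, the comparison values would need quotes:

```r
library(countyweather)
library(dplyr)

# Pull station-level wind speed along with its quality code
andrew_wind <- hourly_fips(fips = "12086",
                           year = 1992,
                           var = c("wind_speed", "wind_speed_quality"),
                           average_data = FALSE)

# Keep only observations that passed quality checks (codes 0, 1, 4, 5)
wind_checked <- andrew_wind$hourly_data %>%
  filter(wind_speed_quality %in% c(0, 1, 4, 5))
```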

More on the weather data

Daily weather data

Functions in this package that pull daily weather values (daily_fips(), for example) are pulling data from the Daily Global Historical Climatology Network (GHCN-Daily) through NOAA’s FTP server. The data is archived at the National Centers for Environmental Information (NCEI) (formerly the National Climatic Data Center (NCDC)), and spans from the 1800s to the current year.

Users can specify which weather variables they would like to pull. The five core daily weather variables are precipitation (prcp), snowfall (snow), snow depth (snwd), maximum temperature (tmax) and minimum temperature (tmin). The daily weather data is filtered so that included weather variables fall within a range of possible values. These ranges were chosen to include national maximum recorded values.

| Variable | Description | Units | Most extreme value |
|----------|-------------|-------|--------------------|
| prcp | precipitation | mm | 1100 mm |
| snow | snowfall | mm | 1600 mm |
| snwd | snow depth | mm | 11500 mm |
| tmax | maximum temperature | degrees Celsius | 57 degrees C |
| tmin | minimum temperature | degrees Celsius | -62 degrees C |

tmax, tmin, and prcp were originally recorded in tenths of units, and are listed as such in NOAA documentation. These values are converted to standard units (degrees Celsius for tmax and tmin, and mm for prcp) in countyweather output.

There are several additional, non-core variables available. For example, acmc gives the “average cloudiness midnight to midnight from 30-second ceilometer data (percent).” The complete list of available weather variables can be found under ‘element’ from the GHCND’s readme file.

While the datasets resulting from functions in this package return a cleaned and aggregated dataset, Menne et al. (2012) give more information about the raw data in the GHCND database.

Hourly weather data

Hourly weather data in this package is pulled from NOAA’s Integrated Surface Data (ISD), and is available from 1901 to the current year. The data is archived at the National Centers for Environmental Information (NCEI) (formerly the National Climatic Data Center (NCDC)), and is also pulled through NOAA’s FTP server.

The seven core hourly weather variables are wind_direction, wind_speed, ceiling_height, visibility_distance, temperature, temperature_dewpoint, and air_pressure. Values in this table were pulled from the ISD documentation file.

| Variable | Description | Units | Minimum | Maximum |
|----------|-------------|-------|---------|---------|
| wind_direction | The angle, measured in a clockwise direction, between true north and the direction from which the wind is blowing | Angular Degrees | 1 | 360 |
| wind_speed | The rate of horizontal travel of air past a fixed point | Meters per Second | 0 | 90 |
| ceiling_height | The height above ground level of the lowest cloud or obscuring phenomena layer aloft with 5/8 or more summation total sky cover, which may be predominately opaque, or the vertical visibility into a surface-based obstruction | Meters | 0 | 22000 (indicates ‘Unlimited’) |
| visibility_distance | The horizontal distance at which an object can be seen and identified | Meters | 0 | 160000 |
| temperature | The temperature of the air | Degrees Celsius | -93.2 | 61.8 |
| temperature_dewpoint | The temperature to which a given parcel of air must be cooled at constant pressure and water vapor content in order for saturation to occur | Degrees Celsius | -98.2 | 36.8 |
| air_pressure | The air pressure relative to Mean Sea Level | Hectopascals | 860 | 1090 |

There are other columns available in addition to these weather variables, such as quality codes (e.g., wind_direction_quality — each of the main weather variables has a corresponding quality code that can be pulled by adding _quality to the end of the variable name).

For more information about the weather variables described in the above table and other available columns, see the ISD documentation file.

Error and warning messages you may get

Not able to pull data from a station

The following error message will come up after running functions pulling daily data if there isn’t available data (for your specified date range, coverage, and weather variables) for a particular station or stations:

Warning in rnoaa::meteo_pull_monitors(monitors = stations, keep_flags = FALSE, ... :
  The following stations could not be pulled from the GHCN ftp:
  USR0000FTEN
  Any other monitors were successfully pulled from GHCN.

The following error message will come up after running functions pulling hourly data (hourly_fips()) if there isn’t available data for any of the stations in your specified county. Note: some weather variables tend to be missing more often than others.

If you get this error, wait a little while and then try again. If you still get an error, you may not have set up your NOAA API key correctly in your .Renviron file. See the “Required set-up” section of this document for more details on doing that correctly.

NOAA web services down

Sometimes, some of NOAA’s web services will be off-line. In this case, you may get an error message when you try to pull data like: