R Libraries to Install:

ggplot2

lubridate

dplyr

Important - Data Organization

Before you begin this lesson, be sure that you've downloaded the dataset above. You will need to UNZIP the zip file. When you do this, be sure that your directory looks like the image below: note that all of the data are directly within the week_02 directory, not nested within another directory. You may have to copy and paste your files to make the structure look right.

Your `week_02` file directory should look like the one above. Note that the data directory is directly under the earth-analytics folder.

Get Started with Time Series Data

To begin, load the ggplot2, lubridate and dplyr libraries. Also, set your working directory. Finally, set stringsAsFactors to FALSE globally using options(stringsAsFactors = FALSE).

```r
# set your working directory to the earth-analytics directory
# setwd("working-dir-path-here")

# load packages
library(ggplot2)
library(lubridate)

## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##     date

library(dplyr)

# set strings as factors to false
options(stringsAsFactors = FALSE)
```

Import Precipitation Time Series

You will use a precipitation dataset collected by the National Centers for Environmental Information (formerly the National Climatic Data Center) Cooperative Observer Network (COOP) station 050843 in Boulder, CO. The data cover the time span from 1 January 2003 through 31 December 2013.
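A minimal, runnable sketch of the import step is below. It uses a tiny stand-in file created on the fly; the actual file name and path are whatever you unzipped into your week_02 directory, so substitute your own path in read.csv().

```r
# A tiny stand-in for the real precipitation file. The file name used here
# is hypothetical -- use the path to the file you unzipped into week_02.
writeLines(c("STATION,DATE,DAILY_PRECIP",
             "COOP:050843,8/21/03,0.1",
             "COOP:050843,8/22/03,999.99"),
           "precip-demo.csv")

# import the .csv file into a data.frame
boulder_daily_precip <- read.csv("precip-demo.csv",
                                 header = TRUE)

# view the structure: column names and the class of each column
str(boulder_daily_precip)
```

Notice in the str() output that DATE imports as class chr, which is exactly the issue addressed below.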

About the Data

Viewing the structure of these data, you can see that different types of data are included in this file:

STATION and STATION_NAME: Identification of the COOP station.

ELEVATION, LATITUDE and LONGITUDE: The spatial location of the station.

DATE: The date when the data were collected in the format: YYYYMMDD. Notice that DATE is currently class chr, meaning R interprets it as text rather than as a date.

DAILY_PRECIP: The total precipitation in inches. Important: the metadata notes that the value 999.99 indicates missing data. Also important, hours with no precipitation are not recorded.

YEAR: The year the data were collected.

JULIAN: The JULIAN DAY the data were collected.

Additional information about the data, known as metadata, is available in the PRECIP_HLY_documentation.pdf. The metadata tell us that the noData value for these data is 999.99. IMPORTANT: These data have been modified a bit for ease of teaching and learning. Specifically, they have been aggregated to daily sum values, and some noData values were added to ensure you learn how to clean them!

You can download the original complete data subset with additional documentation here.

Next, take care of the date field. In this case you have month/day/year. You can use ?strptime to figure out which letters you need to use in the format = argument to ensure your data elements (month, day and year) are understood by R.

In this case you want to use

%m - for month

%d - for day

%y - for year (two-digit; use %Y if your dates have a four-digit year)

Also take note of the format of your date. In this case, each date element is separated by a /.

NA Values and Warnings

When you plot the data, you get a warning that says:

## Warning: Removed 4 rows containing missing values (geom_point).

You can get rid of this warning by removing NA (missing) data values from your data. A warning is just R's way of letting you know that something may be wrong. In this case, R can't plot 4 data points because those rows contain missing values.

Let’s remove the missing data value rows using a dplyr pipe and the na.omit() function. You will learn about pipes in just a minute!
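A sketch of that cleaning step on a toy data frame (the column names mirror the lesson's data; the values are made up):

```r
library(dplyr)  # provides the %>% pipe

# toy stand-in with one missing precipitation value
df <- data.frame(DATE = c("8/21/03", "8/22/03", "8/23/03"),
                 DAILY_PRECIP = c(0.1, NA, 0.3))

# send df through the pipe to na.omit(), which drops rows containing NA
df_clean <- df %>%
  na.omit()
```

On the real data the same pattern would be boulder_daily_precip %>% na.omit().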

Optional Challenge

Use the min() and max() functions to determine the minimum and maximum precipitation values for the 10-year span.
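One way to approach the challenge, shown here on made-up values (if you haven't removed the NA values yet, the na.rm = TRUE argument tells both functions to ignore them):

```r
# toy precipitation values with one missing observation
precip <- c(0.1, NA, 2.5, 0.0)

min(precip, na.rm = TRUE)  # smallest value, ignoring NA
max(precip, na.rm = TRUE)  # largest value, ignoring NA
```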

Introduction to the Pipe %>%

Above you used pipes to manipulate your data. Specifically you removed NA values in a pipe with na.omit().

Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same data set. Pipes in R look like %>% and are made available via the magrittr package, installed automatically with dplyr.
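The difference is easiest to see side by side. Nested calls read inside-out, while a pipe reads left to right:

```r
library(dplyr)  # provides %>% (via magrittr)

# nested: evaluate sqrt(10) first, then round the result
round(sqrt(10), 2)

# piped: same computation, read left to right --
# each result feeds into the next function
10 %>%
  sqrt() %>%
  round(2)
```

Both expressions return 3.16; the pipe simply changes how the steps are written, not what they do.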

Notice that each time you assign the result of a pipe back to the same variable, you are overwriting that variable:

```r
boulder_daily_precip <- boulder_daily_precip %>%
  na.omit()
```

In this case you are just updating your current boulder_daily_precip variable.

The process above avoids processing the data in separate steps and potentially creating new variables each time. You can even send the output to ggplot(). When you send output to ggplot() in a pipe, you don't need to use the data argument (data = boulder_daily_precip) because you send the data through the pipe. Like this:

Note that because you are creating a plot with the code below, you don't need to assign the pipe to a variable. Thus you leave out the boulder_daily_precip <- assignment at the start of the pipe.
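A sketch of piping cleaned data straight into ggplot(). The toy data frame here stands in for boulder_daily_precip, with the DATE and DAILY_PRECIP column names taken from the lesson:

```r
library(dplyr)
library(ggplot2)

# toy stand-in for boulder_daily_precip (values are made up)
boulder_daily_precip <- data.frame(
  DATE = as.Date(c("2013-09-10", "2013-09-11", "2013-09-12")),
  DAILY_PRECIP = c(0.2, NA, 1.7)
)

# no assignment and no data = argument: the cleaned data frame
# flows through the pipe directly into ggplot()
boulder_daily_precip %>%
  na.omit() %>%
  ggplot(aes(x = DATE, y = DAILY_PRECIP)) +
  geom_point()
```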

Subset the Data

You may want to work with only a subset of your time series data. Let's create a subset of data for the time period around the flood, from 15 August to 15 October 2013. You use the filter() function in the dplyr package, together with pipes, to do this!

In the code above, you use the pipe to send the boulder_daily_precip data through a filter step. In that filter step, you filter out only the rows within the date range that you specified. Since %>% takes the object on its left and passes it as the first argument to the function on its right, you don’t need to explicitly include it as an argument to the filter() function.
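The filter step might look like this sketch on a toy data frame (it assumes, as in the lesson, that DATE has already been converted to Date class):

```r
library(dplyr)

# toy frame standing in for boulder_daily_precip; DATE is already Date class
precip <- data.frame(
  DATE = as.Date(c("2013-08-01", "2013-09-12", "2013-10-20")),
  DAILY_PRECIP = c(0.1, 2.3, 0.0)
)

# keep only rows between 15 August and 15 October 2013; the pipe
# supplies precip as the (implicit) first argument to filter()
precip_aug_oct <- precip %>%
  filter(DATE >= as.Date("2013-08-15") & DATE <= as.Date("2013-10-15"))
```

Only the 12 September row falls inside that window, so precip_aug_oct contains one row.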