What you need

About the Data

The data that you will use for this workshop is stored in the cloud. It contains precipitation information over time for several locations in Colorado.

All you have to get started with is a list of URLs - one for each data file. Each data file is in .csv format. You can find this list of URLs in the data/ directory of the version-control-hot-mess GitHub repository that you cloned or downloaded for this workshop.

Data Exploration

To begin this lesson you will explore your data.

What Is the Length of Record For Each Site?

Your end goal in this workshop is to create plots of precipitation data over time by station and by month / year. However, you have yet to explore your data. To begin, open the first URL in the .csv file that lists the URLs of the data files. Remember that this file is located at data/data_urls.csv.

Explore your data and calculate the length of record for each site in the data.

For this activity you will use the readr library to import your data - a powerful library for parsing and reading tabular data. The readr package will attempt to convert known character formats including date/times, numbers and other formats into the correct R class.

# load libraries
library(readr)
library(ggplot2)
library(dplyr)

Next, open the file that contains URLs to the data. Note that we are using data that are stored on Amazon Web Services (AWS) servers.
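A minimal sketch of this step is shown here. It assumes that data/data_urls.csv has a header row and keeps the URLs in its first column; the helper name read_first_dataset is hypothetical and is not part of the workshop materials.

```r
library(readr)

# Hypothetical helper: read the list of URLs, then import the file
# that the first URL points to.
# Assumption: the list file has a header row and the URLs are in
# its first column.
read_first_dataset <- function(url_file) {
  data_urls <- read_csv(url_file)
  first_url <- data_urls[[1]][1]
  # read_csv accepts a URL as well as a local path
  read_csv(first_url)
}

# Only runs if the workshop repository is your working directory
if (file.exists("data/data_urls.csv")) {
  precip_data <- read_first_dataset("data/data_urls.csv")
  print(head(precip_data))
}
```

If the URL list turns out to have no header row, passing col_names = FALSE to the first read_csv call would be one way to adjust the sketch.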

Note that when you use readr::read_csv, it reports the class that each column was converted to. Notice that lat, lon, and elevation are all of type double, which is a number with decimal places.

The DATE field was converted to a proper datetime class.

The HPCP column stores precipitation. This is the data that you ultimately want to plot. Notice that those data were not converted to a numeric format. You will explore that issue later in this lesson.
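To check column classes yourself, you can inspect the imported data directly. The sketch below uses a small made-up file, not the workshop data, to show how readr guesses a class for each column. The station name and values are invented for illustration. The non-numeric entry in HPCP is only one possible reason a column stays as character, and is not necessarily what happens in the workshop files.

```r
library(readr)

# Made-up rows for illustration only; this is NOT the workshop data
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "STATION,DATE,HPCP",
  "SITE_A,2003-01-01 01:00:00,0.1",
  "SITE_A,2003-01-02 01:00:00,missing"
), tmp)

example_data <- read_csv(tmp)

# Report the first class of each column
col_classes <- vapply(example_data, function(x) class(x)[1], character(1))
print(col_classes)
```

In this made-up file the DATE column is parsed as a datetime, while HPCP is read as character because one of its entries is not a number.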

What is Pseudocode?

Before you start to code, think about your goals. Rather than simply jumping into R and coding (which is what we all want to do initially!), plan things out.

Write down the steps associated with what you wish to accomplish, in English. Writing out the steps required to complete an operation is called pseudocode. Pseudocode is useful for organizing coding operations. It allows you to think through what you wish to accomplish and the most efficient way to go about it BEFORE you write your code.

GOAL: You want to calculate the total time in days that is represented in the precipitation data for Colorado for each station or site.

Write Pseudocode

Once your goal is clear, write out the steps that you will need to implement in order to achieve your goal. It’s ok if you don’t know all of the functions yet to implement this. Organize first, look up functions second.

## Below is the pseudocode for calculating length of record
# 1. open up the file containing the data
# 2. group the data by the station name field
# 3. calculate the total time by subtracting the min date from the max date

Once your pseudocode is written out, it’s time to associate R functions with each step. To do that you will use the tidyverse.

Get Started with tidyverse

To get going with tidyverse, there are a few things that you should know.

The pipe %>% is fundamental to tidyverse. The pipe is a way to connect a sequence of operations together. Pipes are efficient because they:

Don’t create intermediate outputs, which saves memory

Combine operations into a clean chunk of code

Allow you to send one output as an input to the next operation.

When combined with tidyverse functions, you also gain extremely expressive code. Pipes are often used with a data.frame object and are written as follows:

my_data_frame %>% perform_some_operation

Pipes are a powerful tool for clearly expressing a sequence of multiple operations. - Hadley Wickham, R for Data Science
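As a rough illustration of the pipe syntax, the sketch below runs the same two steps with and without %>%. The rain data.frame and its values are made up for this example.

```r
library(dplyr)

# Made-up values for illustration only
rain <- data.frame(
  station = c("A", "A", "B"),
  precip  = c(0.1, 0.3, 0.2)
)

# Without the pipe: each step stores an intermediate object
step_1  <- filter(rain, precip > 0.1)
no_pipe <- summarise(step_1, total = sum(precip))

# With the pipe: the output of each step feeds the next one
with_pipe <- rain %>%
  filter(precip > 0.1) %>%
  summarise(total = sum(precip))

print(with_pipe)
```

Both versions give the same result; the piped version simply avoids naming the intermediate step.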

R tidyverse summarise and group_by Functions

The next operations that you need to know are the summarise and group_by functions.

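group_by: As the name suggests, group_by allows you to group by one or more variables.

summarise: Collapses each group into a single row of summary values, such as a minimum, maximum or sum.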
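A small sketch of group_by used together with summarise is shown here. The rain data.frame, its column names and its values are made up for this example.

```r
library(dplyr)

# Made-up values for illustration only
rain <- data.frame(
  station = c("A", "A", "B", "B", "B"),
  precip  = c(0.1, 0.3, 0.2, 0.0, 0.4)
)

# One output row per station: the number of observations and their sum
station_summary <- rain %>%
  group_by(station) %>%
  summarise(n_obs = n(), total_precip = sum(precip))

print(station_summary)
```

Here group_by defines the groups, and summarise then computes one value per group for each named summary column.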

Calculate Total Days of Observations

You can calculate the total number of days represented in your data by subtracting the minimum date from the maximum date for each station. The dates were stored in a format that readr could recognize and convert to a datetime class.

Your code to calculate length of record will thus look something like this:
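One way this could look is sketched below, following the three pseudocode steps. It assumes the station name field is called STATION (the lesson only confirms the DATE column name) and uses a made-up data.frame in place of the imported file.

```r
library(dplyr)

# Made-up stand-in for the imported data; the STATION column name
# is an assumption, and the dates are invented for illustration
precip_example <- data.frame(
  STATION = c("SITE_A", "SITE_A", "SITE_B", "SITE_B"),
  DATE = as.POSIXct(
    c("2003-01-01 00:00:00", "2003-01-11 00:00:00",
      "2010-06-01 00:00:00", "2010-06-04 00:00:00"),
    tz = "UTC"
  )
)

# Group by station, then subtract the min date from the max date.
# difftime with units = "days" keeps the result in days.
length_of_record <- precip_example %>%
  group_by(STATION) %>%
  summarise(
    total_days = as.numeric(difftime(max(DATE), min(DATE), units = "days"))
  )

print(length_of_record)
```

Setting units explicitly is a deliberate choice here: subtracting two datetimes directly lets R pick the unit, which can differ from one dataset to the next.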

On Your Own (OYO)

Create a plot of precipitation over time using the .csv file that is accessed through the first URL in the list. This is the same file we’ve been using throughout this lesson. To help you create your plot, an example of creating a scatter plot with ggplot and sending a data.frame to ggplot is below.
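A minimal sketch of a ggplot scatter plot, with a data.frame sent to ggplot through the pipe, might look like the following. The example_df object, its precip column and its values are made up; swap in your own data and column names.

```r
library(ggplot2)
library(dplyr)

# Made-up values for illustration only
example_df <- data.frame(
  DATE = as.POSIXct(
    c("2003-01-01 01:00:00", "2003-01-02 01:00:00", "2003-01-03 01:00:00"),
    tz = "UTC"
  ),
  precip = c(0.1, 0.0, 0.3)
)

# Pipe the data.frame into ggplot, then add layers with +
precip_plot <- example_df %>%
  ggplot(aes(x = DATE, y = precip)) +
  geom_point() +
  labs(x = "Date", y = "Precipitation",
       title = "Example scatter plot (made-up data)")

precip_plot
```

Note that the data reach ggplot through %>%, while the plot layers are joined with +.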