2 Exercise 1: BRFSS Survey Data

We will explore a subset of data collected by the CDC through its extensive Behavioral Risk Factor Surveillance System (BRFSS) telephone survey. See the CDC’s BRFSS website for more information.

Use file.choose() to find the path to the file ‘BRFSS-subset.csv’.

path <- file.choose()

Input the data using read_csv(), assign the result to a variable brfss, and display the first few rows.
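A minimal sketch of this step. So that it runs without the course file, it writes a small made-up CSV to a temporary file and reads that instead; the values are invented for illustration, and only the column names Weight, Sex and Year are taken from the exercise text.

```r
library(readr)

# Assumption: a tiny made-up CSV stands in for 'BRFSS-subset.csv'
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Weight,Sex,Year",
  "60.1,Female,1990",
  "82.5,Male,1990",
  "65.3,Female,1990",
  "70.4,Female,2010",
  "88.0,Male,2010"
), path)

brfss <- read_csv(path)   # in the exercise, path comes from file.choose()
brfss                     # printing a tibble shows its rows and column types
head(brfss, 3)            # or ask for a specific number of rows
```

With the real file, only the read_csv() and printing lines are needed.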

The tidyverse uses a ‘pipe’, %>%, to send data from one command to another. There is a small number of key functions for manipulating data. We’ll use group_by() to group the data by Sex, and then summarize(n = n()) to count the number of observations in each group.
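The grouping and counting can be sketched as follows. The tibble here is a made-up stand-in for brfss, used only so the example runs on its own.

```r
library(dplyr)

# Assumption: toy stand-in for brfss; the values are invented
brfss <- tibble::tibble(
  Sex    = c("Female", "Male", "Female", "Female", "Male"),
  Weight = c(60.1, 82.5, 65.3, 70.4, 88.0)
)

# Group by Sex, then count the rows in each group
counts <- brfss %>%
  group_by(Sex) %>%
  summarize(n = n())
counts
```

The result is a tibble with one row per group and a column n holding the group size.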

Year is input as a numeric vector (an integer vector with older versions of readr), and Sex as a character vector. Conceptually, though, both are factors. Use mutate() and factor() to update the type of these columns. Re-assign the updated tibble to brfss.
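One way to do this, sketched on a made-up stand-in for brfss (the values are invented for illustration):

```r
library(dplyr)

# Assumption: toy stand-in for brfss as it looks right after input
brfss <- tibble::tibble(
  Sex    = c("Female", "Male", "Female", "Male"),
  Year   = c(1990, 1990, 2010, 2010),
  Weight = c(60.1, 82.5, 70.4, 88.0)
)

# Replace both columns by factor versions and re-assign the result
brfss <- brfss %>%
  mutate(Sex = factor(Sex), Year = factor(Year))
brfss
```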

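There are several other pipes available (see the magrittr package, which must be loaded with library(magrittr) to use them). %$% exposes the columns of the data on its left to the expression on its right, so it can be used to extract a column. Here we look at the levels() of the factors that we created.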

brfss %$% Sex %>% levels()

## [1] "Female" "Male"

brfss %$% Year %>% levels()

## [1] "1990" "2010"

It’s usually better to ‘clean’ data as soon as possible. Visit the help page ?read_csv, look at the col_types = argument, and the help pages ?cols and ?col_factor. Input the data in its correct format, with Sex and Year as factors.
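A sketch of reading the columns as factors at input time. A made-up CSV in a temporary file stands in for the course file, and the factor levels are written out explicitly as an assumption based on the levels shown above.

```r
library(readr)

# Assumption: a tiny made-up CSV stands in for 'BRFSS-subset.csv'
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Weight,Sex,Year",
  "60.1,Female,1990",
  "82.5,Male,1990",
  "70.4,Female,2010",
  "88.0,Male,2010"
), path)

# Columns not named in cols() are still guessed by read_csv()
col_types <- cols(
  Sex  = col_factor(levels = c("Female", "Male")),
  Year = col_factor(levels = c("1990", "2010"))
)
brfss <- read_csv(path, col_types = col_types)
brfss
```

Spelling out the levels fixes their order and makes unexpected values in the file easier to spot.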

Use filter() to create a subset of the data consisting of only the 1990 observations (Year in the set that consists of the single element 1990, Year %in% 1990). Optionally, save this to a new variable brfss_1990.
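The subsetting step might look like this, again on a made-up stand-in for brfss:

```r
library(dplyr)

# Assumption: toy stand-in for brfss; the values are invented
brfss <- tibble::tibble(
  Sex    = factor(c("Female", "Male", "Female", "Male")),
  Year   = factor(c(1990, 1990, 2010, 2010)),
  Weight = c(60.1, 82.5, 70.4, 88.0)
)

# Keep only the rows whose Year is in the one-element set 1990
brfss_1990 <- brfss %>% filter(Year %in% 1990)
brfss_1990
```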

Pipe this subset to t.test() to ask whether Weight depends on Sex. The first argument to t.test is a ‘formula’ describing the relation between dependent and independent variables; we use the formula Weight ~ Sex. The second argument to t.test is the data set to use – indicate the data from the pipe with data = .
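A sketch of the piped t-test. The 1990 subset below is made up, so the numbers it produces say nothing about the real survey.

```r
library(dplyr)

# Assumption: toy stand-in for the 1990 subset; the values are invented
brfss_1990 <- tibble::tibble(
  Sex    = factor(rep(c("Female", "Male"), each = 3)),
  Weight = c(58.1, 63.5, 66.2, 78.0, 84.4, 90.7)
)

# The formula is the first argument; '.' is the data arriving from the pipe
result <- brfss_1990 %>% t.test(Weight ~ Sex, data = .)
result
```

Because . appears as an argument, the pipe places the data there rather than in the first position.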

What about differences between weights of males (or females) in 1990 versus 2010?
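One possible approach, sketched on made-up data: filter to a single sex, then let Year play the role of the grouping variable in the formula.

```r
library(dplyr)

# Assumption: toy stand-in for brfss; the values are invented
brfss <- tibble::tibble(
  Sex    = factor(rep(c("Female", "Male"), times = 6)),
  Year   = factor(rep(c(1990, 2010), each = 6)),
  Weight = c(58.1, 78.0, 63.5, 84.4, 66.2, 90.7,
             61.0, 83.2, 69.8, 88.5, 72.3, 95.1)
)

# Compare the weights of males between the two survey years
result <- brfss %>%
  filter(Sex %in% "Male") %>%
  t.test(Weight ~ Year, data = .)
result
```

Changing "Male" to "Female" gives the corresponding comparison for females.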

Use boxplot() to plot the weights of the Male individuals. Can you transform weight, e.g., taking the square root, before plotting? Interpret the results. Do similar boxplots for the t-tests of the previous question.
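A sketch of the plots, using a made-up stand-in for brfss. The transformation can be written directly in the formula.

```r
library(dplyr)

# Assumption: toy stand-in for brfss; the values are invented
brfss <- tibble::tibble(
  Sex    = factor(rep(c("Female", "Male"), times = 6)),
  Year   = factor(rep(c(1990, 2010), each = 6)),
  Weight = c(58.1, 78.0, 63.5, 84.4, 66.2, 90.7,
             61.0, 83.2, 69.8, 88.5, 72.3, 95.1)
)

males <- brfss %>% filter(Sex %in% "Male")

# Weight of males by year; boxplot() invisibly returns the plotted statistics
bp <- males %>% boxplot(Weight ~ Year, data = .)

# The same plot on the square-root scale
males %>% boxplot(sqrt(Weight) ~ Year, data = ., ylab = "sqrt(Weight)")
```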

3 Exercise 2: ALL Phenotypic Data

Choose the file that contains ALL (acute lymphoblastic leukemia) patient information and input the data using read.csv(), assigning the result to a variable pdata; for read.csv(), use row.names=1 to indicate that the first column contains row names.
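A minimal sketch of the input step. A made-up CSV in a temporary file stands in for the ALL file; the patient identifiers, the values, and the column name age are assumptions for illustration, while mol.biol and BT are the columns named in this exercise.

```r
# Assumption: a tiny made-up CSV stands in for the ALL patient file
path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,age,mol.biol,BT",
  "p01,53,BCR/ABL,B2",
  "p02,19,NEG,B2",
  "p03,38,ALL1/AF4,B1",
  "p04,31,NEG,T2"
), path)

# row.names = 1: use the first column as row names, not as data
pdata <- read.csv(path, row.names = 1)
head(pdata)
```

In the exercise, path would come from file.choose() as before.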

Use the mol.biol column to filter the data to contain individuals in the set c("BCR/ABL", "NEG") (i.e., they have mol.biol equal to BCR/ABL or NEG).

bcrabl <- pdata %>% filter(mol.biol %in% c("BCR/ABL", "NEG"))

We’d like to tidy the data by mutating mol.biol to be a factor. We’d also like to mutate the BT column (B- or T-cell subtypes) to be just B or T, using substr(BT, 1, 1) (i.e., for each element of BT, taking the substring that starts at letter 1 and goes to letter 1 – the first letter)
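The tidying step can be sketched as follows, on a made-up stand-in for pdata (the values and the age column are assumptions for illustration):

```r
library(dplyr)

# Assumption: toy stand-in for pdata; the values are invented
pdata <- data.frame(
  age      = c(53, 19, 38, 31),
  mol.biol = c("BCR/ABL", "NEG", "ALL1/AF4", "NEG"),
  BT       = c("B2", "B2", "B1", "T2"),
  stringsAsFactors = FALSE
)

bcrabl <- pdata %>%
  filter(mol.biol %in% c("BCR/ABL", "NEG")) %>%
  mutate(
    mol.biol = factor(mol.biol),   # only the two retained groups become levels
    BT       = substr(BT, 1, 1)    # first letter only: "B" or "T"
  )
bcrabl
```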

Use t.test() to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using boxplot(). In both cases, use the formula interface and . to refer to the incoming data set. Consult the help page ?t.test and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change?
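A sketch of the two tests and the plot, on a made-up stand-in for bcrabl; the column name age and all values are assumptions, so the output is only useful for seeing the shape of the calls.

```r
library(dplyr)

# Assumption: toy stand-in for bcrabl; the values are invented
bcrabl <- data.frame(
  age      = c(53, 44, 61, 19, 27, 31),
  mol.biol = factor(rep(c("BCR/ABL", "NEG"), each = 3))
)

# Default test: variances are not assumed equal
welch <- bcrabl %>% t.test(age ~ mol.biol, data = .)
welch

# Same comparison as a boxplot
bcrabl %>% boxplot(age ~ mol.biol, data = .)

# Re-do the test assuming equal variances in the two groups
pooled <- bcrabl %>% t.test(age ~ mol.biol, data = ., var.equal = TRUE)
pooled
```

Printing welch and pooled one after the other makes it easy to compare the two outputs line by line.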