Note: Simulation is helpful when you have no actual data or only limited data. This is unlikely to be the case for most data science work.

Week 3: Functions

Basic components of functions: body and arguments

name <- function(arg)
{
BODY
}

Functions can have many arguments (separated by commas):

function(arg1, arg2, arg3)

Variables can be defined inside or outside a function. When a name is used in the body, R looks inside the function first and only then in the environment outside it.
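
The lookup order for variables can be sketched with a small example. The names rescale, rescale_outer, and scale_factor are hypothetical and chosen only for illustration; they do not come from the course material.

```r
# Hypothetical example: all names here are illustrative only.
scale_factor <- 10            # defined outside any function

rescale <- function(x, offset) {
  scale_factor <- 2           # defined inside: R finds this one first
  x * scale_factor + offset
}

rescale_outer <- function(x, offset) {
  # no local scale_factor, so R falls back to the one defined outside
  x * scale_factor + offset
}

result_inner <- rescale(3, 1)        # 3 * 2 + 1
result_outer <- rescale_outer(3, 1)  # 3 * 10 + 1
```

The assignment inside rescale does not change the outer scale_factor, which still holds 10 after both calls.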

A tip for writing functions… start with pseudo-code

Distribution <- function(vector,number)
{
# only keep the elements within the vector that are less than the number, and store the number of eligible elements into the variable "count"
# calculate the percentage and return the results
}

Stepwise coding with functions

vec <- c(1,2,3,4,5)
val <- 2

only keep the elements within the vector that are less than the number, and store the number of eligible elements into the variable “count”

Start simple and add complexity.

1. Flag which elements in the vector are less than the number

vec < val

## [1] TRUE FALSE FALSE FALSE FALSE

2. Count the number of elements in the vector that are less than the number

sum(vec < val)

## [1] 1

Example using length

vec[vec < val]

## [1] 1

length(vec[vec < val])

## [1] 1

Distribution <- function(vector,number)
{
# only keep the elements within the vector that are less than the number, and store the number of eligible elements into the variable "count"
count <- length(vector[vector < number])
# calculate the percentage and return the results
}
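
One way the remaining pseudo-code step might be filled in is sketched below. It assumes that "percentage" means the share of all elements in the vector that fall below the number, expressed on a 0 to 100 scale; the source does not spell this out, so the variable percentage and that formula are assumptions.

```r
# Sketch only: assumes "percentage" = count of eligible elements
# divided by the total number of elements, times 100.
Distribution <- function(vector, number) {
  # keep only the elements below the number and count them
  count <- length(vector[vector < number])
  # calculate the percentage and return the result
  percentage <- count / length(vector) * 100
  return(percentage)
}

vec <- c(1, 2, 3, 4, 5)
val <- 2
pct <- Distribution(vec, val)  # 1 of 5 elements is below 2
```

Because the body now refers to its own arguments (vector and number), the function gives the same answer regardless of what vec and val happen to be in the workspace.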

Exploratory Data Analysis (EDA): Summarizing data using dplyr

The dplyr package is powerful for munging and summarizing data.

Select certain columns of data.
Filter your data to select specific rows.
Arrange the rows of your data into an order.
Mutate your data frame to contain new columns.
Summarize chunks of your data in some way.
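
The five verbs can be chained together, as in the following minimal sketch. It assumes the dplyr package is installed and uses R's built-in mtcars data set, which is not part of the course material; the cutoff of 15 mpg and the column names wt_lb, mean_mpg, and mean_wt_lb are arbitrary choices for illustration.

```r
library(dplyr)

# Illustrative only: mtcars ships with R; the thresholds are arbitrary.
cars_by_cyl <- mtcars %>%
  select(mpg, cyl, wt) %>%              # Select: keep three columns
  filter(mpg > 15) %>%                  # Filter: keep rows above 15 mpg
  mutate(wt_lb = wt * 1000) %>%         # Mutate: wt is recorded in 1000 lbs
  group_by(cyl) %>%                     # define the chunks to summarize
  summarize(mean_mpg = mean(mpg),       # Summarize: one row per cyl group
            mean_wt_lb = mean(wt_lb)) %>%
  arrange(desc(mean_mpg))               # Arrange: highest mean mpg first

print(cars_by_cyl)
```

Each verb takes a data frame and returns a data frame, which is what makes it possible to pipe one step into the next.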

Week 4: Inferential stats for Lab

A brief overview of Week 4: Sampling

Sampling allows us to make inferences about the underlying truth (i.e., the population).

In R: sample(x = , size = , replace = )

Obtain a sample of size 5 from the PopA vector with replacement

PopA <- rnorm(1000,80,10)

sample(PopA,5,replace = TRUE)

## [1] 77.10141 89.02092 76.44148 89.98562 81.71340
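
A reproducible variant of the sampling step is sketched below. The seed value is arbitrary and is an addition for illustration; without set.seed() the numbers drawn differ on every run. The names with_replacement and without_replacement are hypothetical.

```r
set.seed(42)  # arbitrary seed, added only so the draws are repeatable

# Simulated population: 1000 values, mean 80, standard deviation 10
PopA <- rnorm(1000, mean = 80, sd = 10)

# With replacement: the same element may be drawn more than once
with_replacement <- sample(x = PopA, size = 5, replace = TRUE)

# Without replacement: each element can be drawn at most once
without_replacement <- sample(x = PopA, size = 5, replace = FALSE)

# A sample statistic that can be compared with the population parameter
sample_mean <- mean(with_replacement)
population_mean <- mean(PopA)
```

With only five values, the sample mean can sit noticeably far from the population mean, which is why a single small sample is a weak basis for conclusions.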

A brief overview of Week 4: Evaluating two distributions

Comparing two distributions

Helpful for evaluating whether two datasets are the “same”, i.e., come from the same distribution.

To make this determination we can compare the sample statistics from the “unknown” population to the known population parameters.

A scenario:

You have the parameters of Pop A, and you want to know whether Pop B, with a single sample mean of 70.1608428, comes from the same population as Pop A, which has a mean of 80.5812465.

We can compare the sample mean of Pop B (70.1608428) against the distribution of Pop A to determine whether it falls within the acceptable range.

A brief overview of Week 4 for lab: Evaluating two distributions

Is the mean value for Pop B within an acceptable range?

We can determine whether the mean for Pop B is within our range of truth by (1) setting a threshold and (2) comparing the mean of Pop B to that threshold. If it is outside the threshold, it is not likely from the same population.

Our acceptable threshold is between the 5th and 95th percentiles of the data in Pop A.
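
The two steps can be sketched with quantile(). This assumes Pop A is simulated as earlier with rnorm(1000, 80, 10); the seed is arbitrary, so the simulated mean will not match 80.5812465 exactly. The names bounds, mean_b, and within_range are hypothetical.

```r
set.seed(42)  # arbitrary seed, for repeatability only

# Assumption: Pop A is simulated the same way as in the sampling example
PopA <- rnorm(1000, mean = 80, sd = 10)

# (1) Set the threshold: 5th and 95th percentiles of Pop A
bounds <- quantile(PopA, probs = c(0.05, 0.95))

# (2) Compare the mean of Pop B (value taken from the scenario) to it
mean_b <- 70.1608428
within_range <- unname(mean_b >= bounds[1] & mean_b <= bounds[2])

print(bounds)
print(within_range)
```

For a normal distribution with mean 80 and standard deviation 10, the theoretical 5th and 95th percentiles are roughly 63.6 and 96.4, so the simulated bounds should land close to those values.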