Also, I gathered some cool data from draftexpress.com that we can use to practice on if we have time. Here is the data problem:

Every summer, 60 of the best basketball players in the world are drafted into the National Basketball Association (NBA). The majority of the draftees are selected from among thousands of eligible NCAA-level college basketball players. Who will be drafted?

If you are not a sports fan, you can still appreciate the data if you like numbers. Get the draft data.

# this runs k-fold cross validation. When k = the number of observations in your dataset, then that's LOOCV
# to run LOOCV, set k=n or just don't specify (its default is k=n)
# cost specifies how to evaluate how good the model is; the default is the average squared error function.
# it requires a glm object as input, so you need to run your model first and then validate it.
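The comments above read like a description of `cv.glm()` from the boot package; that is an assumption, since the function is not named here. A minimal sketch under that assumption, using the built-in `mtcars` data and an arbitrary model as stand-ins for your own:

```r
# install.packages("boot")
library(boot)

# glm() with the default gaussian family is ordinary linear regression
fit <- glm(mpg ~ wt + hp, data = mtcars)

# K is left unspecified, so it defaults to the number of rows: LOOCV
cv_loo <- cv.glm(data = mtcars, glmfit = fit)

# K-fold version; fold assignment is random, so set a seed first.
# K = 8 divides the 32 rows of mtcars evenly.
set.seed(1)
cv_8 <- cv.glm(data = mtcars, glmfit = fit, K = 8)

# delta[1] is the raw cross-validation estimate of prediction error
# (average squared error by default); delta[2] is a bias-adjusted version
cv_loo$delta
cv_8$delta
```

To score the model differently, pass your own function of (observed, predicted) as the `cost` argument.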

The knitr package provides the kable function, which allows you to export data frames as HTML, markdown, and more. It’s really useful, along with some background in LaTeX or HTML/CSS, for making nicely formatted tables directly from your R output. The code below should get you started.

(Bonus: a brief look at plyr power!)

#install.packages('knitr')
require(knitr)

#Let's use the apropos Number of Breaks in Yarn during Weaving data
data(warpbreaks)

#check it out
summary(warpbreaks)

#Let's get the mean and SD for each level of wool
# at each level of tension.

#install.packages('plyr') # This is a great package
require(plyr)

#Split the data by wool and tension, get mean and sd for each
# and return a data frame.
descriptives<-ddply(.data=warpbreaks,
.variables=c('wool','tension'),
.fun=function(x){
round(c(Mean=mean(x$breaks),SD=sd(x$breaks)),2)
})
descriptives

#The default is markdown, and looks pretty good to copy and
# paste into a text file.
kable(descriptives)
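For the HTML and LaTeX output mentioned above, `kable` takes a `format` argument. A small sketch; it uses base R's `aggregate` instead of `ddply` only to keep the example self-contained, and the object names are made up for illustration:

```r
library(knitr)

data(warpbreaks)

# Mean number of breaks for each wool/tension combination
cell_means <- aggregate(breaks ~ wool + tension, data = warpbreaks, FUN = mean)

# Same table, two output formats; digits controls rounding
html_table  <- kable(cell_means, format = "html",  digits = 2)
latex_table <- kable(cell_means, format = "latex", digits = 2)

html_table
latex_table
```

The HTML version can be styled with CSS, and the LaTeX version can be pasted into a `.tex` file.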

# REFRESHER: POWER is the probability of correctly rejecting a null hypothesis that is actually false.
# P(rejecting NULL hypothesis|NULL hypothesis is false)
# In other words, it's the prob. of finding an effect that is really present.
# Power = 1 - P(Type II error)

# Power, effect size, sample size, and the significance level are inter-related, and if you know 3 of these
# quantities you can calculate the fourth (exciting eh?). Let's get to it.

#install.packages('pwr')
require(pwr)

#For example, what is the power for a two-sided ind. samples t-test given n=25 and a medium effect size?
pwr.t.test(n=25, d=.5, sig.level= .05)

#Another example: What is the sample size needed to achieve a pwr=.80 and a small effect size?
pwr.t.test(d=.2, sig.level= .05, power=.8)
# Note: if you add "$n" to the end of the command only the necessary sample size will be displayed.
pwr.t.test(d=.2, sig.level= .05, power=.8)$n
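The "know three, solve for the fourth" idea works in any direction: leave out whichever quantity you want and `pwr.t.test` solves for it. A short sketch solving for the effect size (the object names are just for illustration):

```r
library(pwr)

# With n = 25 per group, alpha = .05, and power = .80, what is the
# smallest effect size we could reliably detect? d is left unspecified.
detectable <- pwr.t.test(n = 25, sig.level = .05, power = .8)
detectable$d

# Round trip: plugging that d back in should return power of about .80
check <- pwr.t.test(n = 25, d = detectable$d, sig.level = .05)
check$power
```

To solve for the significance level instead, you have to pass `sig.level = NULL` explicitly, because it otherwise defaults to .05.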

# For example, find the power for a multiple regression test with 2 continuous predictors and 1 categorical
# predictor (i.e. Marital status with k=3 so 3-1=2 dummy codes) that has a large effect size and a sample size of 30.
# u = numerator df = number of predictors = 2 + 2 = 4
# v = denominator df = n - u - 1 = 30 - 4 - 1 = 25
pwr.f2.test(u =4, v =25, f2 =.35, sig.level =.05)

# Generating a table of sample sizes
# What is the required sample size for an independent samples t-test with pwr=.80?
seq <- seq(.1,.8, .1)
FindN <- rep(0,8)
for(i in 1:8) FindN[i]=pwr.t.test(d=seq[i],power=.8,sig.level=.05,type="two.sample")$n
data.frame(d=seq , N=ceiling(FindN))
# can swap in other functions, just make sure arguments are relevant to test
seq <- seq(.1,.8, .1)
FindN <- rep(0,8)
for(i in 1:8) FindN[i]=pwr.anova.test(k=4, f=seq[i], power=.8,sig.level=.05)$n
data.frame(f=seq , N=ceiling(FindN))
# NOTE: "ceiling" is a function that rounds up to the next whole number, so the values in the table are
# slightly larger than the exact sample sizes we get for specified values. For example...
pwr.anova.test(k =4,f=.3, sig.level =.05, power=.8)

#We have books read, classes attended, and grade.
#We're interested in predicting student grades from
#the number of books they've read, and the number
#of classes they attended.

##Graphing with ggplot involves a process of specifying layers of graphic objects.
#Let's start with a histogram

#What does the grade distribution look like?
ggplot(data, aes(GRADE)) + #The first step is to identify the data you want to graph.
#In this case, we want to graph the variable GRADE, which can be found in data
geom_histogram() #Now, we need to tell ggplot2 how we want the data represented.

ggplot(data, aes(x = factor(BOOKS), y = GRADE)) + geom_bar(stat="identity")
#What happened? Why are the grades so high?
#It looks like we have summed grades, which is not particularly interesting.
#Let's use stat_summary to tell ggplot we want the mean grade for each level of book read.
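One way the `stat_summary` step might look. This is a sketch on made-up data (`toy` is hypothetical; swap in your own data frame), and it assumes ggplot2 3.3.0 or later, where the argument is called `fun` (older versions call it `fun.y`):

```r
library(ggplot2)

# Made-up stand-in for the class data: 8 students at each level of BOOKS
set.seed(613)
toy <- data.frame(BOOKS = rep(0:4, each = 8))
toy$GRADE <- 50 + 5 * toy$BOOKS + rnorm(nrow(toy), sd = 10)

# stat_summary collapses GRADE to one number per x value before drawing,
# so each bar is the mean grade rather than the sum
p <- ggplot(toy, aes(x = factor(BOOKS), y = GRADE)) +
  stat_summary(fun = mean, geom = "bar")
p
```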

For a recent assignment in Sanjay’s SEM class, we had to plot interactions between two continuous variables – the model was predicting students’ grades (GRADE) from how often they attend class (ATTEND) and how many of the assigned books they read (BOOKS), and their interaction. I did all the plotting in ggplot2. It was my first time trying to add lines for different categories to the same plot, and I really wanted labels for each line to show up in the plot legend, which was trickier than I would have thought. I got it to work, though!

Run a simple regression in which you regress GRADE on ATTEND. (In regressionspeak, you say “regress Y on X,” where Y is the dependent/response variable and X is the independent/input variable. Thus, I am telling you to treat GRADE as the dependent variable and ATTEND as the independent variable.) Using the output of your analysis, do the following:

(1a) Write out the algebraic equation representing this analysis, using the unstandardized coefficient estimates from your output (that is, write out the best-fitting linear model predicting GRADE from ATTEND).

(1b) Create a graph in which the Y-axis is GRADE and the X-axis is ATTEND. Draw a line representing the slope of ATTEND for a realistic range of values. (You can do this by hand, or using whatever software you'd like.)
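If you go the software route for (1a) and (1b), the steps could be sketched like this. The data here are made up (`toy` is a hypothetical stand-in), so the numbers mean nothing; only the workflow carries over:

```r
# Made-up data standing in for the class data set
set.seed(613)
toy <- data.frame(ATTEND = round(runif(40, min = 5, max = 25)))
toy$GRADE <- 40 + 1.5 * toy$ATTEND + rnorm(40, sd = 8)

# Regress GRADE on ATTEND
model1 <- lm(GRADE ~ ATTEND, data = toy)
b <- coef(model1)  # b[1] is the intercept, b[2] is the slope of ATTEND

# (1a) the fitted equation, built from the unstandardized estimates
cat(sprintf("predicted GRADE = %.2f + %.2f * ATTEND\n", b[1], b[2]))

# (1b) scatterplot with the fitted line over the observed range of ATTEND
plot(GRADE ~ ATTEND, data = toy)
abline(model1)
```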

Run a multiple regression in which you regress GRADE on BOOKS, ATTEND, and the interaction of BOOKS with ATTEND. Do not center any of the variables. Using the output of your analysis:

(3a) Write the algebraic equation representing the results of this analysis three times. The first time, write it in standard form. The second time, rearrange it so that you can easily see the conditional slope of BOOKS. The third time, rearrange again so you can easily see the conditional slope of ATTEND.

(3b) Draw a graph with GRADE on the Y-axis and ATTEND on the X-axis. Draw 3 lines depicting the regressions of ATTEND for students who have read 0, 2, and 4 books.

(3c) Draw a graph with GRADE on the Y-axis and BOOKS on the X-axis. Draw 3 lines depicting the regressions of BOOKS for students who have attended 10, 15, and 20 times.
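As a template for (3a), here are the three forms with placeholder coefficients $b_0$ through $b_3$ (the symbols are mine, not from the assignment; substitute the estimates from your own output):

```latex
\begin{align*}
\widehat{\text{GRADE}} &= b_0 + b_1\,\text{ATTEND} + b_2\,\text{BOOKS} + b_3\,(\text{ATTEND} \times \text{BOOKS}) \\
\widehat{\text{GRADE}} &= (b_0 + b_1\,\text{ATTEND}) + (b_2 + b_3\,\text{ATTEND})\,\text{BOOKS} \\
\widehat{\text{GRADE}} &= (b_0 + b_2\,\text{BOOKS}) + (b_1 + b_3\,\text{BOOKS})\,\text{ATTEND}
\end{align*}
```

In the second line, $(b_2 + b_3\,\text{ATTEND})$ is the conditional slope of BOOKS at a given level of ATTEND; in the third, $(b_1 + b_3\,\text{BOOKS})$ is the conditional slope of ATTEND at a given number of books. The lines in (3b) and (3c) come from plugging the chosen values (0, 2, 4 books; 10, 15, 20 classes) into those expressions.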

# Run a multiple regression in which you regress GRADE on BOOKS, ATTEND, and
# the interaction of BOOKS with ATTEND
model3 <- lm(GRADE ~ ATTEND * BOOKS, data = data)
summary(model3)

Repeat the analysis you ran for #3, only this time you should first center BOOKS and ATTEND around their means, and then regress GRADE on BOOKS(centered), ATTEND(centered), and their interaction.

(4a) Write the algebraic equation representing the results of this analysis three times. The first time, write it in standard form. The second time, rearrange it so that you can easily see the conditional slope of BOOKS. The third time, rearrange again so you can easily see the conditional slope of ATTEND.

(4b) Draw a graph with GRADE on the Y-axis and ATTEND on the X-axis. Draw 3 lines depicting the regressions of ATTEND for students who have read 0, 2, and 4 books.

(4c) Draw a graph with GRADE on the Y-axis and BOOKS on the X-axis. Draw 3 lines depicting the regressions of BOOKS for students who have attended 10, 15, and 20 times.
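The centering step for #4 could be sketched as follows, again on made-up data (`toy`, `BOOKS_c`, and `ATTEND_c` are names invented for this example):

```r
# Made-up data standing in for the class data set
set.seed(613)
n <- 60
toy <- data.frame(BOOKS  = sample(0:4, n, replace = TRUE),
                  ATTEND = round(rnorm(n, mean = 14, sd = 4)))
toy$GRADE <- 55 + 0.5 * toy$ATTEND + 2 * toy$BOOKS +
  0.4 * toy$ATTEND * toy$BOOKS + rnorm(n, sd = 10)

# Center each predictor around its own mean
toy$BOOKS_c  <- toy$BOOKS  - mean(toy$BOOKS)
toy$ATTEND_c <- toy$ATTEND - mean(toy$ATTEND)

# Uncentered (as in #3) and centered (as in #4) versions of the same model
model3_toy <- lm(GRADE ~ ATTEND * BOOKS, data = toy)
model4_toy <- lm(GRADE ~ ATTEND_c * BOOKS_c, data = toy)

# The interaction coefficient and the fitted values are identical;
# the intercept and the two lower-order slopes change
coef(model3_toy)
coef(model4_toy)
```

For (4b) and (4c), remember that "0, 2, and 4 books" are raw values; on the centered scale they are `0 - mean(BOOKS)`, `2 - mean(BOOKS)`, and `4 - mean(BOOKS)`, and likewise for attendance.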

Repeat #3, only this time you should first z-score all three variables (GRADE, BOOKS, and ATTEND), and then run the regression on the z-scored variables, including the product of z-BOOKS times z-ATTEND.

(5a) Write the algebraic equation representing the results of this analysis three times. The first time, write it in standard form. The second time, rearrange it so that you can easily see the conditional slope of BOOKS. The third time, rearrange again so you can easily see the conditional slope of ATTEND.

(5b) Draw a graph with GRADE on the Y-axis and ATTEND on the X-axis. Draw 3 lines depicting the regressions of ATTEND: one line for students who have read an average number of books, one line for students whose value on BOOKS is 1 standard deviation below the mean, and one line for students whose value on BOOKS is 1 standard deviation above the mean.

(5c) Draw a graph with GRADE on the Y-axis and BOOKS on the X-axis. Draw 3 lines depicting the regressions of BOOKS: one line for students who have attended an average number of times, one line for students whose attendance is 1 standard deviation below the mean, and one line for students whose attendance is 1 standard deviation above the mean.
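For #5, base R's `scale()` does the z-scoring. A sketch on made-up data (`toy` and the `z`-prefixed names are invented for this example):

```r
# Made-up data standing in for the class data set
set.seed(613)
n <- 60
toy <- data.frame(BOOKS  = sample(0:4, n, replace = TRUE),
                  ATTEND = round(rnorm(n, mean = 14, sd = 4)))
toy$GRADE <- 55 + 0.5 * toy$ATTEND + 2 * toy$BOOKS +
  0.4 * toy$ATTEND * toy$BOOKS + rnorm(n, sd = 10)

# scale() returns a one-column matrix, so convert back to a plain vector
toy$zGRADE  <- as.numeric(scale(toy$GRADE))
toy$zBOOKS  <- as.numeric(scale(toy$BOOKS))
toy$zATTEND <- as.numeric(scale(toy$ATTEND))

# zATTEND * zBOOKS in the formula adds the product of the two z-scores;
# the product itself is not re-standardized
model5_toy <- lm(zGRADE ~ zATTEND * zBOOKS, data = toy)
coef(model5_toy)
```

On this scale, "average," "1 SD below," and "1 SD above" in (5b) and (5c) are simply 0, -1, and 1.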

I wanted to share this way of doing the simple slopes using the 'predict' function. This also demonstrates how to produce data on the fly -- good for reproducible examples!

#Replace this with your data.
# For now, making up new stuff.
summary(books<-round(runif(100,min=0,max=4),0)) #Get number of books from a uniform distribution from 0-4
summary(attend<-rnorm(100,mean=14,sd=4.3)) #Get the number of days attended from a normal distribution
summary(grade<-55-.137*attend-6.2*books+.74*attend*books+ #The grade is related to books and attend...
rnorm(100,0,20)) #...plus unmeasured things!

head(theData<-data.frame(books=books,
attend=attend,
grade=grade))

#Update below is cool, but not necessary. It's just
# an easy way to make nested models. Maybe you've
# come across something like this or better?
summary(mod1<-lm(grade~attend,data=theData))
summary(mod3<-update(mod1,.~.+books+attend:books))

#Uses a model to get predicted values for each row –
# you can use the original data or new data
(theData$Predicted_Grade<-predict(mod3,theData,type='response'))

require(ggplot2)
(plot3a<-ggplot(theData, aes(y=Predicted_Grade,x=attend)) +
#I subset the data for geom_line (through its data argument) or else we get a line for every value of books
geom_line(data=subset(theData, books %in% c(0,2,4)), aes(colour = as.factor(books)), size = 1)+
geom_point(aes(y=grade,x=attend))+scale_color_discrete(name='Books'))
#One nice thing about this method is that the lines don't extend past the
# data. So honest!
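The "or new data" option for `predict` can also be sketched with a small grid of values. The data below are regenerated so the example runs on its own, and `newData` is a name invented for it:

```r
# Regenerate made-up data like the example above
set.seed(42)
books  <- sample(0:4, 100, replace = TRUE)
attend <- rnorm(100, mean = 14, sd = 4.3)
grade  <- 55 - .137 * attend - 6.2 * books + .74 * attend * books +
  rnorm(100, 0, 20)
theData <- data.frame(books = books, attend = attend, grade = grade)

mod3 <- lm(grade ~ attend * books, data = theData)

# Every combination of the attendance and books values we want lines for
newData <- expand.grid(attend = c(10, 15, 20), books = c(0, 2, 4))
newData$Predicted_Grade <- predict(mod3, newdata = newData)
newData

# The simple slope of attend at books = 2 is b_attend + 2 * b_interaction
slope_at_2 <- unname(coef(mod3)["attend"] + 2 * coef(mod3)["attend:books"])
slope_at_2
```

The trade-off is that a grid will happily predict outside the range of the observed data, so the "honest" property of predicting on the original rows is lost unless you choose the grid values with care.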

Daryn will be going over how to make surface plots. It's just kind of a fun way to show off pretty things you can do in R. Here (rsm-plots) is a document that covers surface plots, and here is her code!

Have you never used R? Nor programmed at all? The swirl package* will get you on your feet so fast. It teaches you how to use R directly from the R prompt. (* Of course, if you’re entirely new, you don’t know what a package is yet. Don’t worry! It’s sort of like an app — it wraps up a bunch of useful functions into a nice neat package.)

Get started with these instructions — they’ll walk you through installing R, R Studio, and swirl. If you need help, well, that’s what R club is for.

Here’s a snippet of what you’ll see as you run through the swirl tutorial:

| To assign the result of `5 + 7` to a new variable
| called `x`, you type `x <- 5 + 7`. This can be
| read as "x gets 5 plus 7." Give it a try now.
> x <- 5 + 7
| You nailed it! Good job!
| You'll notice that R did not print the result
| of 12 this time. When you use the assignment
| operator, R assumes that you don't want to
| see the result immediately, but rather that
| you intend to use the result for something
| else later on.

To get us started off with some quick topic presentations at R Club this term, I’ll go over a little matrix algebra on Thursday. Those of you who have taken PSY613 may remember doing similar exercises in SPSS, in MATLAB, and/or by hand. Those of you who are in PSY613 currently will remember this from, well, the future, since we’ll be working on matrix algebra in lab on Friday.