Monday, December 1, 2008

I just posted an example of using flogs to compare two years of placement scores. But some of you may be confused about why we take flogs in the first place. Here are the main points.

1. We want to compare two proportions, say p1 and p2. There are two problems with a direct comparison like p1/p2.

a) This measure depends on whether we consider the proportions p1, p2 or the complementary proportions 1-p1, 1-p2, and that's a problem. The choice of p1, p2 or 1-p1, 1-p2 shouldn't change our comparison.

b) Proportions near 0 (or 1) tend to have smaller variation than proportions near 0.5. The ratio p1/p2 = 1.5 is more meaningful (significant) if the p's are close to zero than if the p's are close to 0.5.

So we want to transform a proportion p so that

-- it doesn't matter if we consider p or 1-p

-- the variance of the transformed p will be roughly the same for p near 0 and p near 0.5.

By using the flog reexpression log(p/(1-p)) we achieve these two goals.

By the way, the flog reexpression is the basis for logistic regression models. You will likely be using logistic regression in your statistical life, but we are trying here to motivate why we use the flog reexpression.
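To make this concrete, here is a quick R illustration with two made-up proportions (the numbers are just for show):

```r
# two made-up placement proportions
p1 <- 0.10
p2 <- 0.15

# the flog reexpression
flog <- function(p) log(p / (1 - p))

# comparing the p's or the 1-p's gives the same answer
flog(p1) - flog(p2)          # difference on the flog scale
flog(1 - p2) - flog(1 - p1)  # identical value, since flog(1-p) = -flog(p)
```

Notice that swapping p and 1-p just flips the sign of the flog, so the comparison is unchanged.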

Friday, November 21, 2008

Some of you appear to be confused on this "binning" homework. Let's outline what you are supposed to do in this assignment.

1. First you bin your data and construct a histogram. This is easy (I hope). The histogram gives you counts of each bin.

By the way, a rootogram is just like a histogram, but you are graphing the root counts against the bins. (Why do you do this? Look at the notes.)

2. You find a Gaussian curve that fits your data -- you have a mean and a standard deviation.

3. Now you want to find the expected or fitted counts from the Gaussian curve. You do this by running the function fit.gaussian -- the inputs to this function are your bins, the vector of raw data, and the mean and standard deviation of the Gaussian curve. Suppose you save the output of this function into the variable s. Then

s$counts are the observed bin counts

s$expected are the expected counts
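I won't reproduce fit.gaussian here, but a hypothetical sketch of the idea behind it (my own code, not the course function) is:

```r
# hypothetical sketch of the idea behind fit.gaussian (not the course function):
# the expected count in a bin is n times the Gaussian probability of the bin
fit.gaussian.sketch <- function(bins, y, m, s) {
  h <- hist(y, breaks = bins, plot = FALSE)
  # probability of each bin under the Normal(m, s) curve
  # (this ignores the small probability outside the extreme bins)
  p <- diff(pnorm(bins, m, s))
  list(counts = h$counts, expected = length(y) * p)
}
```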

4. The question at this point is -- does the Gaussian curve provide a good fit to the histogram? Well, maybe yes, and maybe no.

How can you tell?

You look at the residuals which are essentially the deviations of the counts from the expected counts.

I defined several residuals in the notes.

Simple rootogram residuals are

r.simple = sqrt(d) - sqrt(e)

These are graphed by use of the rootogram function.

The double root residuals are defined by

DRR = sqrt(2+4d) - sqrt(1+4e).

How do I compute these? Well, you already have computed the d's and the e's -- you can use R to compute the DRR's.
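Assuming you have saved the output of fit.gaussian in s as above, the computation is only a couple of lines:

```r
# observed and expected counts from the Gaussian fit
d <- s$counts
e <- s$expected

# simple rootogram residuals
r.simple <- sqrt(d) - sqrt(e)

# double root residuals
DRR <- sqrt(2 + 4 * d) - sqrt(1 + 4 * e)
```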

5. Once I have computed the residuals and graphed them, am I done? Well, the point of graphing the residuals is to see if they show any systematic pattern -- if there is a pattern, then the Gaussian curve is not a good fit.

6. Well, we are almost done. Part 1 (d) asks you to fit a G comparison curve to the root data. (Boy, your instructor seems to like taking roots.)

All I'm asking here is to FIRST transform the data by a root, and then repeat all of the above with the root data.

7. Lastly, in #2, I'm asking you to fit a G curve to the women's heights. If you have loaded in the studentdata dataset, then to get the female heights, you type

3. The rootogram function plots the residuals. Both plots below show the same info, but in different ways.

# load library vcd that contains rootogram function

library(vcd)

# again save histogram object

h=hist(time,breaks=bins,plot=F)

# illustrate "hanging" style of rootogram

r=rootogram(h$counts,s$expected,type="hanging") # this is the hanging style

# illustrate "deviation" style of rootogram

r=rootogram(h$counts,s$expected,type="deviation")

What have we learned? Although the lengths of baseball games are roughly normal in shape, we see that there are some large negative residuals on the right tail.

What does this mean? Baseball game lengths have a long right tail, which means one tends to see a good number of unusually long games. If you know anything about baseball, you know that a game goes into extra innings when it is tied after 9 innings, and these extra-inning games cause the long right tail.

Sunday, November 9, 2008

Our department recently was interested in exploring the relationship between student class attendance and performance in a number of 100-level math classes. We all know that missing math classes will have an adverse effect on grades, but we wanted to learn more about this relationship.

For many students in four 100-level math classes, we collected each student's attendance (percentage of classes attended) and performance (a percentage where 90% or above corresponds to A work, 80-90% to a B, etc.). We decided to categorize attendance using the categories 0-50%, 50-70%, 70-90%, and over 90%, and then we found the mean performance for students in each combination of attendance category and course. We got the following two-way table.

            Attendance Pct
COURSE     0-50  50-70  70-90  90-100
-------------------------------------
MATH 112   61.9   68.0   68.8    75.5
MATH 115   54.3   71.8   76.6    83.3
MATH 122   56.1   64.8   73.1    78.1
MATH 126   50.6   66.3   73.9    78.1

I'll demonstrate median polish with this data.

1. First, I have to get this data into a matrix form into R. I'll first use the matrix command to form the matrix (by default, one enters data column by column) and then I'll add row and column labels.

grade=matrix(c(61.9,54.3,56.1,50.6,
               68.0,71.8,64.8,66.3,
               68.8,76.6,73.1,73.9,
               75.5,83.3,78.1,78.1),c(4,4))
dimnames(grade)=list(c("MATH 112","MATH 115","MATH 122","MATH 126"),
                     c("0-50","50-70","70-90","90-100"))

To check that I've read in the data correctly, I'll display "grade":

> grade
         0-50 50-70 70-90 90-100
MATH 112 61.9  68.0  68.8   75.5
MATH 115 54.3  71.8  76.6   83.3
MATH 122 56.1  64.8  73.1   78.1
MATH 126 50.6  66.3  73.9   78.1

2. Now I can implement median polish to get an additive fit. I'll store the output in the variable "my.fit" and then I'll display the different components.

> my.fit=medpolish(grade)

Here's the common value.

> my.fit$overall

[1] 69.7

Here are the row effects.

> my.fit$row

MATH 112 MATH 115 MATH 122 MATH 126

-0.7375 4.5000 0.2875 -0.2875

Here are the column effects.

> my.fit$col

0-50 50-70 70-90 90-100

-16.35000 -2.75625 2.75625 8.40000

I'll interpret this additive fit.

The average performance of these students is 69.7%.

Looking at the row effects, we see that MATH 115 students get grades that are 4.5 - (0.2875) approx 4.2 points higher than MATH 122 students. MATH 122 students tend to be 0.2875 - (-0.2875) approx 0.57 points higher than MATH 126 students, and MATH 112 students are about a half percentage point lower than MATH 126 students.

Looking at the column effects, we see a clear relationship between attendance and performance. The best (90-100%) attenders do 8.4 - 2.75 = 5.65 points better (on average) than the 70-90 attendance group, do 8.4 - (-2.75) = 11.15 points better than the 50-70 attendance group, and a whopping 8.4 - (-16.35) = 24.75 points better than the "no shows" (the under 50% attending group).

3. It might be helpful to plot this fit. I wrote a function plot2way in the LearnEDA package:

> library(LearnEDA)

> plot2way(my.fit$overall+my.fit$row,my.fit$col,dimnames(grade)[[1]],

dimnames(grade)[[2]])

Note that the plot2way function has four arguments: the row part, the column part, the vector of names of the rows, and the vector of names of the columns. Here's the figure.

Actually, this plot would look better if I had figured out how to rotate the figure so that the FIT lines are horizontal. (I'll give extra credit points to anyone who can fix my function to do that.)

From this graph we see that the best performances are the MATH 115 students who attend over 90% of the classes; the worst performances are the MATH 112 students who have under 50% attendance.

4. Are we done? Not quite. We have looked at the fit, but have not looked at the residuals -- the differences between the performance and the fit. They are stored in my.fit$residual -- I will round the values so they are easier to view.

> round(my.fit$residual)
         0-50 50-70 70-90 90-100
MATH 112    9     2    -3     -2
MATH 115   -4     0     0      1
MATH 122    2    -2     0      0
MATH 126   -2     0     2      0

I see one large residual. MATH 112 students who don't come to class (under 50% attendance) seem to do 9 points better than one would expect based on the additive fit. This might deserve further study.
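One way to check your understanding of the additive fit: the data should decompose exactly into the common value, the row effect, the column effect, and the residual. In R:

```r
# rebuild the fitted values from the pieces of the additive fit
fit <- my.fit$overall + outer(my.fit$row, my.fit$col, "+")

# data = common value + row effect + column effect + residual,
# so this difference should be essentially zero
max(abs(grade - fit - my.fit$residual))
```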

Thursday, October 30, 2008

Last night was an exciting night. My team, the Philadelphia Phillies, won the World Series for only the second time in a franchise history that began in 1883.

So it seems appropriate to give a statistical tribute to the Phillies.

I collected the Phillies winning percentage (percentage of games won) for each season in their history. Below I plot the winning percentage against the season year.

It is hard to see the general pattern in this graph, so it makes sense to smooth. I could use Tukey's resistant smooth described in the EDA notes, but I'll illustrate the use of an alternative smoothing method called lowess. Here is the R code to graph the scatterplot and overlay the lowess smooth.

plot(Year,Win.Pct,pch=19,col="red")

lines(lowess(Year,Win.Pct,f=.2),lwd=3,col="red")

title("PHILLIES WINNING PERCENTAGES",col="red")

abline(h=50,lwd=2)

I added a horizontal line at Win.Pct = 50, which corresponds to an average season.

Looking at the graph, you should see why the Philadelphia fans are so excited about the Phillies winning this season.

1. The Phillies teams generally have been crummy, especially between 1920 and 1950. The team hasn't experienced much success.

2. But there have been two recent periods where the Phillies have been successful. One period was in the 1970's and the climax of this success was the Phillies first World Series win in 1980. The second period is in recent history and of course the Phillies won their second World Series in 2008.

Sunday, October 26, 2008

Here are some general and specific comments on your efforts on Homework 4 on plotting and straightening.

First, most of you did well on this homework. You were good in describing the fit and the pattern in the residuals. Also you made reasonable choices at reexpressing the x and/or the y variable so that the graph looked pretty straight.

But there were some things that caused you to lose points.

1. What are you looking for in a residual plot? In this assignment, the focus was looking for nonlinear patterns. For example, is it appropriate to fit a line to the (year, population) data for the England and Wales dataset? We can answer this question by looking for a nonlinear pattern in the plot of residuals against year. There is significant curvature (that is, a quadratic pattern) in the residual plot which tells you that the population growth is not linear.

By the way, some of you plotted log population against log year -- why did you take the log of year? It doesn't make any sense to me.

2. Always talk about the fit and the residuals in the context of the data. Someone plotted x against y without telling me the variables. The fun part of statistics is that you can always talk about the application.

3. Remember that funny problem where the scatterplot shows two groups of points with a clear separation? This is one type of nonlinear pattern that you won't be able to straighten with a single choice of power transformation. But since the graph clearly divides into two parts, it makes sense to treat this as two separate problems and try to straighten each part.

4. Should one fit a line by least-squares or by a resistant line? In many situations, it won't make a difference -- either fit will work. But least-squares can give you relatively poor fits when there are outliers.

How can you tell if least-squares isn't the best fit? Look at the residual plot. If you still see some increasing or decreasing pattern, then this tells you that least-squares hasn't explained all of the "tilt" pattern in the graph.

Monday, October 20, 2008

This week, the main topic is reexpressing either the x or y variable to straighten a nonlinear pattern that we see in a scatterplot. Although the manipulation may be straightforward, it is possible to miss the main message in this material. Here are some questions that may help in your explanations in the homework.

Question 1: What is a simple description of the pattern in a scatterplot?

Question 2: Why do we prefer to fit lines instead of more complicated curves like quadratic or cubic?

Question 3: Is it possible to straighten all nonlinear patterns by power transformations?

Question 4: Can you think of a situation or an example where it is not possible to straighten by power transformations?

Don't forget to look at your data first -- you may see right away that it is not possible to straighten the graph.

Wednesday, October 15, 2008

In baseball, the objective is to win games and a team wins a game by scoring more runs than its opponent. An interesting question is "how important is a single run" towards the goal of winning a game? Suppose one collects the runs scored, the runs allowed, the number of wins and the number of losses for a group of teams. Bill James (a famous guy who works on baseball data) discovered the empirical relationship

log(wins/losses) = 2 log(runs scored/runs allowed)

He called this the Pythagorean Relationship.

Let's try to demonstrate this relationship by use of a resistant fit.

1. First, I collected data for all baseball teams in the 2008 season. The dataset teams2008.txt contains, for each of the 30 teams ...

Team -- the name of the team
Wins -- the number of wins
Losses -- the number of losses
Runs.Scored -- the total number of runs scored
Runs.Allowed -- the total number of runs allowed

2. I read this dataset into R and compute the variables log.RR and log.WL.
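Here is a sketch of that step and the resistant fit (assuming the variable names listed in the dataset description above):

```r
# read the team data (assuming teams2008.txt is in the working directory)
teams <- read.table("teams2008.txt", header = TRUE, sep = "\t")
attach(teams)

# the two variables in the Pythagorean relationship
log.RR <- log(Runs.Scored / Runs.Allowed)
log.WL <- log(Wins / Losses)

# plot and overlay Tukey's resistant line
plot(log.RR, log.WL)
fit <- line(log.RR, log.WL)
abline(coef(fit))
coef(fit)  # James' rule says the slope should be close to 2
```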

7. What are we looking for in the residual plot? First, we look for general patterns that we didn't see earlier in the first plot. I don't see any trend, so it appears that we removed the tilt by fitting a line.

Also we are looking for unusually small or large residuals. Here a "lucky team" corresponds to a team that seemed to win more games than one would expect based on its runs scored and runs allowed.

Which team was unusually lucky in the 2008 season? A hint: they were a "heavenly" team from the west coast.

Monday, October 6, 2008

Your instructor is currently in "Phillies Heaven". His baseball team rarely has a chance to win the World Series (they have won one World Series in over 120 seasons of competing) and they are currently in the National League Championship Series against the Dodgers!

Oh right -- I'm supposed to talk about the EDA class.

I finished the grading on your "symmetry" homework and Fathom assignment. You generally did fine, but I'll explain some issues that may have caused you to lose points.

What was I looking for?

In the notes, we talked about several methods for assessing symmetry of a batch, including looking at the sequence of midsummaries, using a symmetry plot, and Hinkley's quick method. For each dataset you consider, here's an outline of what you should do:

What is going on? The problem is that you were defining a reexpression like p = -1/2 as

data^(-1/2)

when you should have used

-data^(-1/2)

By adding the negative sign, all of your transformations are increasing functions, and then you can make better sense of the reexpressed graphs and the methods.
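A tiny R demonstration of the point (with made-up data values):

```r
# a few made-up data values, in increasing order
x <- c(1, 4, 9, 25)

# without the negative sign, the reexpression reverses the order of the data
x^(-1/2)

# with the negative sign, it is an increasing function, so order is preserved
-x^(-1/2)
```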

2. Reexpression only works when there is sufficient spread in your data. Suppose you have data that ranges from 50 to 60 -- here the ratio of the largest value to the smallest is only 60/50 = 1.2. Reexpression using any value of p won't work -- that is, it won't change the shape of the data.

3. One of you used a normal probability plot to determine if the graph was symmetric. What's wrong with this? First, it is not one of the methods we discussed in the notes. Second, checking for normality is different from (though related to) checking for symmetry. Data can be symmetric but not normally distributed.

Monday, September 29, 2008

When I grade your homework, I'm more interested in your explanation and comments rather than the mechanics. This homework is a good illustration of this.

You started by constructing a spread vs level plot of the weights by the supplements. Here's the graph produced by the spread.level.plot function in the LearnEDA package.

Here are the main questions:

1. Is there a dependence between spread and level?

Yes, but the pattern in the graph is a bit confused with the outlying point in the lower-right section of the plot. If we removed this point, there would appear to be a stronger relationship.

2. Can we improve by a suitable reexpression?

There is a line of slope 0.18 drawn on the graph. This would suggest the use of a power transformation with power p = 1 - 0.18 = 0.82.

3. Is this a reasonable strategy?

If we transform by a 0.82 power, this won't really change things. It is almost equivalent to taking a 1 power, which is no change at all.

But looking more carefully at the graph, we would fit a different line if we ignored that one outlier. Then one would get a line with a smaller slope, say 0.50, and this would suggest the use of a root transformation. This reexpression would be nontrivial and would help reduce the general dependence between spread and level.
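If you want to see the mechanics in base R, here's a sketch using the built-in chickwts data (chick weights by feed supplement, which may well resemble the homework data); I use the IQR as a stand-in for the fourth-spread:

```r
# spread vs level plot by hand, using R's built-in chickwts data
level  <- log(tapply(chickwts$weight, chickwts$feed, median))
spread <- log(tapply(chickwts$weight, chickwts$feed, IQR))
plot(level, spread)

# slope of a resistant line suggests the power p = 1 - slope
b <- coef(line(level, spread))[2]
1 - b
```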

If you made some comments that were similar in spirit to the ones I've made above, then you got full credit. You could have lost points if you went through the mechanics without commenting on what you actually did.

One day, a boy was interested in taking a course in exploratory data analysis, but he didn't have the money to pay for it. He decided to ask his grandmother for the money for the course. She agreed to help, but she said, "I hope this is a worthwhile class for you."

Anyway, the boy visited his grandmother recently and told her that the course was going well. The grandmother asked what he was learning and the boy responded:

Last week, we learned how to compare groups of data. We compared the yearly snowfall of Buffalo with Cairo. It was hard to compare the two groups since there was a dependence between level and spread that I learned by constructing a spread versus level graph. But this graph suggested the use of a p = 0.5 power transformation and when I did a spread versus level graph of the transformed data, the dependence between spread and level was reduced.

The grandmother, listening intently, responded "So what did you conclude from your data analysis?"

The boy said proudly "There is more snow in Buffalo than Cairo."

The grandmother then with a heavy sigh said "Can we get our money back?"

Tuesday, September 16, 2008

Since we are comparing groups in EDA, I thought I would give some guidance on how to subset data in R.

Suppose we want to construct stemplots for the areas of the islands in each continent in the homework. Here is some R work for constructing a stemplot of the island areas in the Arctic Ocean. The key command is "subset".
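The R work in the original post was shown as a figure; here is a hypothetical sketch of what it looks like (the data frame, variable names, and areas below are all made up for illustration):

```r
# a small made-up data frame standing in for the homework data
islands.data <- data.frame(
  Island = c("A", "B", "C", "D", "E", "F"),
  Ocean  = c("Arctic", "Arctic", "Arctic", "Arctic", "Atlantic", "Pacific"),
  Area   = c(184, 76, 43, 21, 90, 14)
)

# the key command: subset picks out the rows for one group
arctic <- subset(islands.data, Ocean == "Arctic")

# stemplot of the areas in that group (needs the aplpack package)
library(aplpack)
stem.leaf(arctic$Area)
```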

By the way, I don't think there is a simple way of constructing parallel stemplots in R.

Sunday, September 14, 2008

I finished grading your Fathom assignment on the number of bins. Generally, you all did well on this, but there are a couple of things I should mention.

1. The moral of this assignment is that as you have more data (bigger n), you should use a small bin width and have more bins. It seemed that your best histograms by eye were similar to the ones chosen by the "optimal rule" formula.

2. I think the rule wasn't that effective for constructing a histogram of the Old Faithful data. By using a small number of bins, you didn't see any structure in each of the two humps.

3. If you lost points, it probably was due to some confusion in your calculations or maybe not the best answer to a question -- like the one about the histogram of the Old Faithful data. If you don't know why you lost points, just email me.

Looking ahead, the next assignment is on EFFECTIVE COMPARISON. You'll learn a specific method for equalizing spreads between batches. Although you might understand the method (spread vs. level plot, reexpressing, etc), it is important not to lose sight of what we are trying to accomplish. We want to make a reasonable comparison between groups.

So, when you do your homework this week, don't forget to think about the BIG PICTURE. Conclude your work by making a comparison.

Last, we'll be using some new R commands. Don't forget to look at the "Chapter 3 work" file that illustrates the use of these new commands.

Sunday, September 7, 2008

Most of you are doing great on the homework so far. But I thought I should explain how I grade and why you may be losing points.

Generally I am more interested in your explanations and how you are answering the main questions of interest. For example, in the graphs and summaries homework, I am not interested as much in your R work and your computation. Most of you are doing ok in getting R to produce stemplots and compute letter values. But the BIG questions are ...

-- what is the best choice of stemplot?

-- what have we learned about the data in terms of shape, average, and spread?

-- are there observations that deviate from the rest and why are these observations unusual?

You should be addressing these BIG questions in the first R homework.

In the Fathom activity, we were looking at the number of outliers one would expect for samples from different population distributions.

For normal data, we don't see many outliers. But if the data comes from a fat-tailed distribution (like the t distribution), outliers are more common. If this general conclusion wasn't obvious from your work, then you may have lost points.

Here is a final quibble (small point). Most of the stemplots you showed me were hard to read and certainly you wouldn't want to use them for any presentation.

Which stemplot do you prefer?

Stemplot A:

1 | 2: represents 1.2

leaf unit: 0.1

n: 50

2 -1. | 55

6 -1* | 0233

11 -0. | 67899

22 -0* | 01111113344

(9) 0* | 112223334

19 0. | 789999

13 1* | 0011233

6 1. | 68

4 2* | 023

1 2. | 9

Stemplot B:

1 | 2: represents 1.2

leaf unit: 0.1

n: 50

2 -1. | 55

6 -1* | 0233

11 -0. | 67899

22 -0* | 01111113344

(9) 0* | 112223334

19 0. | 789999

13 1* | 0011233

6 1. | 68

4 2* | 023

1 2. | 9

The message here is that you should use a monospaced font, where each character takes the same width (like Courier), so that the stems and leaves line up.

Thursday, September 4, 2008

Here are some common questions I've heard recently about R and Fathom.

1. Some of you are having problems reading in datafiles which is a big concern. There are two ways you can mess up.

(a) First, it is important that R can find your files. Put all of your R work in a particular folder, say EDA, and then, by choosing the menu item File -> Change dir ..., select the folder EDA. To check that the working directory really has changed, type

dir()

and you should see your data files.

(b) A general form to read in a text datafile is

data=read.table(file.name, header=T, sep="\t")

where file.name is in double-quotes. The header option says that the first line in the file contains the variable names and the sep option says that columns are separated by the tab character.

2. How do you plot curves on Fathom?

Suppose you have created a scatterplot and wish to add a curve. You select the graph and choose the menu item Graph -> Plot Function. Then you just type the function (using the variable name on the x axis) in the box.

Tuesday, August 26, 2008

Here is a typical data analysis on R. I'm a baseball fan and I'm interested in the wins and losses for the current Major League teams and how these wins and losses are related to the runs scored and runs allowed.

In Excel, I create a dataset that contains the wins, losses, runs scored, and runs allowed for all 30 teams. Here are the first 4 rows of the dataset.

Team        League    W   L   RS   RA
Tampa Bay   American  79  50  597  515
Boston      American  75  55  670  559
NY Yankees  American  70  60  632  585
Toronto     American  67  63  574  510

I save this dataset as "baseball2008.txt" -- text, tab-delimited format. I save this file in a folder called "eda" and I make this the current R working directory so R will find this file.

Here is my analysis:

# read the datafile into R

data=read.table("baseball2008.txt",header=T,sep="\t")

# attach the data to make the variable names available in R

attach(data)

# compute the winning proportion for all teams

win.prop=W/(W+L)

# what was the largest winning proportion?

max(win.prop)

[1] 0.6183206

# which team had the largest winning proportion?

Team[win.prop==max(win.prop)]

[1] Chicago Cubs

# for each team, compute the number of runs scored per game

runs.game=RS/(W+L)

# construct a stemplot of the runs scored per game
# (I'm assuming you have the aplpack package installed)
library(aplpack)
stem.leaf(runs.game)

Monday, August 25, 2008

R is a great program but it has a relatively steep learning curve. To help you get started, I have three help sessions on R planned this week -- you need only attend one of the sessions.

I have four documents R_INTRO_PART_I, R_INTRO_PART_II, R_INTRO_PART_III, and R_INTRO_PART_IV in the Course Documents section that describe different aspects of R.

1. Manipulating vectors. A basic object in R is a vector. R_INTRO_PART_I discusses how to create and work on vectors.

2. Input and output. One typically wants to read datafiles into R -- read.table is useful for doing this. Data is typically stored in a R object called a data frame. Also you'll want to save R output including graphs and paste this material into a Word document.

3. Matrices. You should be comfortable working and manipulating matrices in R.

4. Plotting. You should be familiar with basic plotting commands and understand how one can add things (like labels and titles) to graphs.

I hope you have R installed on your laptop. You may find it helpful to bring your laptop to the help session.

Friday, August 22, 2008

Now that you have successfully installed R, you have access to many functions in the R "base" package. But we'll want to add packages that will give us additional functions helpful for the class. I'll illustrate adding the package "aplpack" that we'll need to get the "stem.leaf" function. Also, I'll show you how to add the "LearnEDA" package that I wrote for the class.

Installing the package aplpack from CRAN.

1. In R, choose the menu item Packages -> Install Packages.

2. You'll be asked to choose a CRAN mirror site -- choose one in the United States.

3. Then you'll see a list of all available Packages -- choose aplpack.

4. At this point, the package will be downloaded and installed -- if you see the message

package 'aplpack' successfully unpacked and MD5 sums checked

you're in good shape.

5. The package aplpack is installed but not yet loaded into R. To load the package, you type

library(aplpack)

6. To check to see if you have the aplpack commands, try constructing a stemplot on a random sample of values from a standard normal distribution:

stem.leaf(rnorm(50))

If you get some display, you're set.

Installing the class package LearnEDA.

This is a package that I wrote that has some special functions and all the datasets we'll use. This is not available on CRAN yet, since it is still in development. But it is available from my website.

1. Go to the class web folder http://bayes.bgsu.edu/EDA/ and find the zip file