Hypothesis Testing

Previously we used confidence intervals to estimate an unknown population parameter. For example, we constructed 1-proportion confidence intervals to estimate the true population proportion, the parameter of interest. We also compared two intervals to see whether they overlapped (if so, we concluded that there was no difference between the population proportions for the two groups) and checked whether an interval contained a specific parameter value.

Statistical Significance

A sample result is called statistically significant when the p-value for a test statistic is less than the level of significance, which for this class we will keep at 0.05. In other words, the result is statistically significant when we reject the null hypothesis.

Five Steps in a Hypothesis Test (Note: some texts will label these steps differently, but the premise is the same)

Step 1: Check any necessary assumptions and write null and alternative hypotheses.

Step 2: Calculate an appropriate test statistic.

Step 3: Determine a p-value associated with the test statistic.

Step 4: Decide between the null and alternative hypotheses.

Step 5: State a "real world" conclusion.

Now let’s try to tie together the concepts we discussed regarding Sampling and Probability to delve further into statistical inference with the use of hypothesis tests.

Two designs for producing data are sampling and experimentation, both of which should employ randomization. We have learned that randomization is advantageous because it controls bias. Now we will see another advantage: because chance governs our selection, we may make use of the laws of probability (the scientific study of random behavior) to draw conclusions about the entire population from which the units (e.g. students, machined parts, U.S. adults) originated. Again, this process is called statistical inference.

Previously we had defined population and sample and what we use to describe their values, but we will revisit these:

Parameter: a number that describes the population. It is fixed but rarely do we know its value. (e.g. the true proportion of PSU undergraduates that would date someone of a different race.)

Statistic: a number that describes the sample. This value is known but can vary from sample to sample. For instance, from the Class Survey data we may get one proportion of those who said they would date someone of a different race, but if we gave that survey to another sample of PSU undergraduate students, would the proportion from that sample be identical to ours? Almost certainly not.

EXAMPLES

1. A survey is carried out at a university to estimate the mean GPA of undergraduates living off campus during the current term. Population: all undergraduates at the university who live off campus; sample: those undergraduates surveyed; parameter: mean GPA of all undergraduates at that university living off campus; statistic: mean GPA of sampled undergraduates.

2. A balanced coin is flipped 100 times and the percentage of heads is 47%. Population: all coin flips; sample: the 100 coin flips; parameter: 50%, the percentage of all coin flips that would result in heads if the coin is balanced; statistic: 47%.

Hypothesis Testing for a Proportion

Ultimately we will measure statistics (e.g. sample proportions and sample means) and use them to draw conclusions about unknown parameters (e.g. population proportion and population mean). This process, using statistics to make judgments or decisions regarding population parameters, is called statistical inference.

Example 2 above produced a sample proportion of 47% heads and is written:

p̂ (read "p-hat") = 47/100 = 0.47

P-hat is called the sample proportion, and remember that it is a statistic (soon we will look at sample means, x̄). But how can p-hat be an accurate measure of p, the population parameter, when another sample of 100 coin flips could produce 53 heads? And for that matter, we only did 100 coin flips out of an unlimited number of possible flips!

The fact that the results will vary from one random sample to the next is referred to as sampling variability. Sampling variability is acceptable because if we took many samples of 100 coin flips, calculated the proportion of heads in each sample, and then constructed a histogram or boxplot of the sample proportions, the resulting shape would look approximately normal (i.e. bell-shaped) with a mean of 50%.
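The repeated-sampling idea can be tried out with a short simulation. The following is only a sketch: the number of simulated samples (2000), the seed, and the helper name simulate_sample_proportions are choices made here for illustration, not part of the course materials.

```python
import random
import statistics

def simulate_sample_proportions(num_samples=2000, flips_per_sample=100,
                                p_heads=0.5, seed=1):
    """Return the proportion of heads in each simulated sample of coin flips."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    proportions = []
    for _ in range(num_samples):
        heads = sum(1 for _ in range(flips_per_sample) if rng.random() < p_heads)
        proportions.append(heads / flips_per_sample)
    return proportions

proportions = simulate_sample_proportions()
mean_of_proportions = statistics.mean(proportions)
sd_of_proportions = statistics.stdev(proportions)

# Theory: sample proportions center on p with standard deviation sqrt(p(1-p)/n).
theoretical_sd = (0.5 * 0.5 / 100) ** 0.5

print(f"mean of sample proportions: {mean_of_proportions:.3f}")
print(f"sd of sample proportions:   {sd_of_proportions:.3f} (theory: {theoretical_sd:.3f})")
```

Individual samples land at values such as 0.47 or 0.53, but their average should sit very close to 0.50, which is why a single p-hat can still be a reasonable estimate of p.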

[The reason we selected a simple coin flip as an example is that the concepts just discussed can be difficult to grasp, especially since earlier we mentioned that rarely is the population parameter value known. But most people accept that a balanced coin will produce roughly equal numbers of heads and tails when flipped many times.]

A statistical hypothesis test is a procedure for deciding between two possible statements about a population. The phrase significance test means the same thing as the phrase "hypothesis test."

The two competing statements about a population are called the null hypothesis and the alternative hypothesis.

A typical null hypothesis is a statement that two variables are not related. Other examples are statements that there is no difference between two groups (or treatments) or that there is no difference from an existing standard value.

An alternative hypothesis is a statement that there is a relationship between two variables or there is a difference between two groups or there is a difference from a previous or existing standard.

NOTATION: The notation Ho represents a null hypothesis and Ha represents an alternative hypothesis. The symbol po is read as p-naught or p-zero and represents the null hypothesized value. Shortly, we will substitute μo for po when discussing a test of means.

For a test of a proportion, the null hypothesis is Ho: p = po, and the alternative hypothesis takes one of three forms: Ha: p ≠ po, Ha: p > po, or Ha: p < po. The first Ha is called a two-sided test since "not equal" implies that the true value could be either greater than or less than the test value, po. The other two Ha are referred to as one-sided tests since they restrict the conclusion to a specific side of po.

Example 3 – This is a test of a proportion:

A Tufts University study finds that 40% of 12th grade females feel they are overweight. Is this percent lower for college age females? Let p = proportion of college age females who feel they are overweight. Competing hypotheses are:

Ho: p = .40 (or greater). That is, no difference from the Tufts study finding.
Ha: p < .40 (the proportion feeling they are overweight is less for college age females).

Example 4 – This is a test of a mean:

Is there a difference between the mean amount that men and women study per week? Competing hypotheses are:

Null hypothesis: There is no difference between mean weekly hours of study for men and women, written in statistical notation as μ1 = μ2
Alternative hypothesis: There is a difference between mean weekly hours of study for men and women, written in statistical notation as μ1 ≠ μ2

This notation is used since the study would consider two independent samples: one from Women and another from Men.

Test Statistic and p-value

A test statistic is a summary of a sample that is in some way sensitive to differences between the null and alternative hypotheses.

A p-value is the probability that the test statistic would "lean" as much (or more) toward the alternative hypothesis as it does if the real truth is the null hypothesis. That is, the p-value is the probability of obtaining a sample statistic at least as extreme as the one observed, under the presumption that the null hypothesis is true.

A small p-value favors the alternative hypothesis. A small p-value means the observed data would not be very likely to occur if we believe the null hypothesis is true. So we believe in our data and disbelieve the null hypothesis. An easy (hopefully!) way to grasp this is to consider the situation where a professor states that you are just a 70% student. You doubt this statement and want to show that you are better than a 70% student. If you took a random sample of 10 of your previous exams and calculated the mean percentage of these 10 tests, which mean would be less likely to occur if in fact you were a 70% student (the null hypothesis): a sample mean of 72% or one of 90%? Obviously the 90% would be less likely and therefore would have a small probability (i.e. p-value).
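To put rough numbers on the 70% student example, here is a minimal sketch. It assumes exam scores vary with a standard deviation of 10 percentage points and that a normal model applies; both are assumptions made only for illustration, since the text gives no spread for the exam scores. The helper name upper_tail_p_value is likewise hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_p_value(sample_mean, null_mean, sd, n):
    """P(sample mean >= observed) if the true mean is null_mean (normal model, known sd)."""
    standard_error = sd / sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    return 1 - NormalDist().cdf(z)

ASSUMED_SD = 10  # hypothetical exam-to-exam standard deviation, not from the text

p_value_72 = upper_tail_p_value(72, 70, ASSUMED_SD, 10)
p_value_90 = upper_tail_p_value(90, 70, ASSUMED_SD, 10)

print(f"sample mean 72: p-value = {p_value_72:.3f}")
print(f"sample mean 90: p-value = {p_value_90:.2e}")
```

Under these assumptions a mean of 72% is quite plausible for a true 70% student, while a mean of 90% is extremely unlikely, so only the 90% result would give strong evidence against the professor's claim.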

Using the p-value to Decide between the Hypotheses

The significance level of a test is the border used for deciding between the null and alternative hypotheses.

Decision Rule: We decide in favor of the alternative hypothesis when a p-value is less than or equal to the significance level. The most commonly used significance level is 0.05.

In general, the smaller the p-value the stronger the evidence is in favor of the alternative hypothesis.

EXAMPLE 3 CONTINUED:

In a recent elementary statistics survey, the sample proportion (of women) saying they felt overweight was 37/129 = .287. Note that this leans toward the alternative hypothesis that the "true" proportion is less than .40. [Recall that the Tufts University study finds that 40% of 12th grade females feel they are overweight. Is this percent lower for college age females?]

Step 1: Let p = proportion of college age females who feel they are overweight.

Ho: p = .40 (or greater). That is, no difference from the Tufts study finding.
Ha: p < .40 (the proportion feeling they are overweight is less for college age females).

Step 2:

If npo ≥ 10 and n(1 − po) ≥ 10 then we can use the Z-test statistic. Since both (129)*(0.4) = 51.6 and (129)*(0.6) = 77.4 are at least 10 [or consider that the number of successes and failures, 37 and 92 respectively, are at least 10] we calculate the test statistic by:

z = (p-hat − po) / sqrt(po(1 − po)/n) = (0.287 − 0.40) / sqrt((0.40)(0.60)/129) = −2.62

Note: In computing the Z-test statistic for a proportion we use the hypothesized value po, not the sample proportion p-hat, in calculating the standard error! We do this because we "believe" the null hypothesis to be true until evidence says otherwise.

Step 3: Calculating the p-value:

In our example we are using Ha: p < .40, so our p-value is P(Z ≤ z) = P(Z ≤ -2.62), which from the Standard Normal Table is equal to 0.0044.

Step 4: We compare the p-value to alpha, which we set at 0.05. Since 0.0044 is less than 0.05 we reject the null hypothesis and decide in favor of the alternative, Ha.

Step 5: We'd conclude that the percentage of college age females who felt they were overweight is less than 40%. [Note: we are assuming that our sample, although not random, is representative of all college age females.]

The p-value = .004 indicates that we should decide in favor of the alternative hypothesis. Thus we decide that less than 40% of college women think they are overweight.

The "Z-value" (-2.62) is the test statistic. It is a standardized score for the difference between the sample proportion p-hat and the null hypothesis value p = .40. The p-value is the probability that the z-score would lean toward the alternative hypothesis as much as it does if the true population proportion really were p = .40.
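The hand calculation for Example 3 can be checked with a few lines of Python. This is a sketch using only the standard library; the function name one_proportion_z_test is a hypothetical helper introduced here, and the code covers only the "less than" alternative used in this example.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0):
    """Lower-tailed z-test of Ho: p = p0 against Ha: p < p0."""
    # The normal approximation is used only if n*p0 and n*(1 - p0) are both at least 10.
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("sample too small for the normal approximation")
    p_hat = successes / n
    standard_error = sqrt(p0 * (1 - p0) / n)  # built from p0, not from p_hat
    z = (p_hat - p0) / standard_error
    p_value = NormalDist().cdf(z)  # P(Z <= z) for a "less than" alternative
    return p_hat, z, p_value

p_hat, z, p_value = one_proportion_z_test(successes=37, n=129, p0=0.40)
alpha = 0.05
reject_null = p_value <= alpha

print(f"p-hat = {p_hat:.3f}, z = {z:.2f}, p-value = {p_value:.4f}")
print("reject Ho" if reject_null else "do not reject Ho")
```

With 37 successes out of 129 and po = 0.40, the computed z rounds to -2.62 and the p-value agrees with the table value of 0.0044 to within rounding, so the decision to reject Ho is the same.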

Using Software to Perform a One Proportion Test Analysis Using Raw Data

Check the box for Perform Hypothesis Test and enter 0.4 (note that for Minitab versions earlier than 15 this test is found under the Options)

Click Options and select the correct Alternative (e.g. less than)

Check the box for Use Test and Interval Based on Normal Distribution (remember to verify this use by checking that the number of successes and failures are at least ten)

Click OK twice

This should result in the following output:

To perform a summarized one proportion test analysis in SPSS:

Open SPSS without data

Enter in the first empty cell the number of successes, 37

Enter in the cell below that one the number of failures, 92

Click Data > Weight Cases

Click the radio button Weight Cases By and enter in the text box the variable of interest from the variable list (should only be one variable VAR00001 if you started with an empty data set) --(see image spss_02)

Click OK

Go to Analyze > Nonparametric Tests > Binomial

Enter the variable of interest into the Test Variable List (see image spss_03)

Change the test proportion value to 0.4

Click OK

NOTE: SPSS does not provide a method based on the normal approximation (even though the notation in the output refers to a Z approximation). SPSS uses exact methods based on the binomial distribution. However, the hypothesis setup, decision rules and conclusion follow the same approach as when using normal approximation techniques, i.e. the z-method.
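For comparison with the z-method, an exact binomial p-value for the same data can be sketched as below. The assumption here is that the exact one-sided p-value is the lower-tail probability P(X ≤ 37) for X ~ Binomial(129, 0.40); this illustrates the idea of an exact method and is not a reproduction of SPSS's internal calculation. The helper name binomial_lower_tail is hypothetical.

```python
from math import comb

def binomial_lower_tail(successes, n, p0):
    """Exact P(X <= successes) when X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(successes + 1))

exact_p_value = binomial_lower_tail(37, 129, 0.40)
print(f"exact one-sided p-value: {exact_p_value:.4f}")
```

The exact value will not match the normal-approximation p-value digit for digit, but it is also well below 0.05, so the decision between the hypotheses does not change.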

Running the SPSS steps above should result in the following output:


Hypothesis Testing for a Mean

Quantitative Response Variables and Means

We usually summarize a quantitative variable by examining the mean value. We summarize categorical variables by considering the proportion (or percent) in each category. Thus we use the methods described in this handout when the response variable is quantitative. Again, examples of quantitative variables are height, weight, blood pressure, pulse rate, and so on.

Null and Alternative Hypotheses for a Mean

For one population mean, a typical null hypothesis is H0 : population mean μ = a specified value. We'll actually give a number where it says "a specified value." For paired data the null hypothesis would be H0 : μd = a specified value. Typically, when considering differences, this specified value is zero.

The alternative hypothesis might be either one-sided (a specific direction of inequality is given) or two-sided (a not equal statement).

Test Statistics

The test statistic for examining hypotheses about one population mean:

t = (x̄ − μ0) / (s / √n)

where x̄ = the observed sample mean, μ0 = value specified in null hypothesis, s = standard deviation of the sample measurements and n = the number of measurements (for paired data, the number of differences).

Notice that the top part of the statistic is the difference between the sample mean and the null hypothesized value. The bottom part of the calculation is the standard error of the mean.

It is a convention that a test using a t-statistic is called a t-test. That is, hypothesis tests using the above statistic would be referred to as a "1-sample t test".
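As a small illustration of the 1-sample t statistic, the sketch below computes it from raw measurements. The eight pulse rates are made-up values chosen so the arithmetic is easy to follow; they are not the class survey data, and one_sample_t is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(data, mu0):
    """Return (t, df) for testing H0: mu = mu0 from raw measurements."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)               # sample standard deviation (n - 1 divisor)
    standard_error = s / sqrt(n)  # bottom part of the statistic
    t = (x_bar - mu0) / standard_error
    return t, n - 1               # degrees of freedom are n - 1

pulse_rates = [68, 74, 80, 72, 76, 70, 78, 82]  # made-up values for illustration only
t, df = one_sample_t(pulse_rates, mu0=72)

print(f"t = {t:.2f} with df = {df}")
```

Here the sample mean is 75, so the top part of the statistic is 75 − 72 = 3, and the result is then scaled by the standard error of the mean.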

Finding the p-value

Recall that a p-value is the probability that the test statistic would "lean" as much (or more) toward the alternative hypothesis as it does if the real truth is the null hypothesis.

When testing hypotheses about a mean or mean difference, a t-distribution is used to find the p-value. This is a close cousin to the normal curve. The t-distributions are indexed by a quantity called degrees of freedom, calculated as df = n – 1 for the situation involving a test of one mean or test of mean difference.

The p-values for the t-distribution are found in your text or a copy can be found at the following link: T-Table. To interpret the table, use the column under DF to find the correct degrees of freedom. Use the top row under Absolute Value of t-Statistic to locate your calculated t-value. Most likely you will not find an exact match for your t-value, so locate the range for your t-value. This means that your t-value will be either less than 1.28; between two t-statistics in the table; or greater than 3.00. Once you have located the range, find the corresponding p-value(s) associated with your range of t-statistics. This would be your p-value used to compare to alpha of 0.05.

NOTE: the t-statistics increase from left to right, but the p-values decrease! So if your range for the t-statistic is greater than 3.00 your p-value would be less than the corresponding p-value listed in the table.

Examples of reading the T-Table [recall degrees of freedom for 1-sample t are equal to n − 1, or one less than the sample size]. The table value is read as p-value = P(T > |t|). NOTE: If this formula appears familiar, it should, as it closely resembles that for finding probability values using the Standard Normal Table with z-values.

If you had a sample of size 15, resulting in DF = 14, and t-value = 1.20, your t-value range would be less than 1.28, producing a p-value of p > 0.111. That is, P(T > 1.20) is greater than 0.111.

If you had a sample of size 15, resulting in DF = 14, and t-value = 1.95, your t-value range would be from 1.80 to 2.00, producing a p-value of 0.033 < p < 0.047. That is, P(T > 1.95) is between 0.033 and 0.047.

If you had a sample of size 15, resulting in DF = 14, and t-value = 3.20, your t-value range would be greater than 3.00, producing a p-value of p < 0.005. That is, P(T > 3.20) is less than 0.005.

NOTE: The increments for the degrees of freedom in the T-Table are not always 1. This column increases by 1 up to DF = 30, then the increments change. If your DF is not found in the table just go to the nearest DF. Also, note that the last row, "Infinite", displays the same p-values as those found in the Standard Normal Table. This is because as n increases the t-distribution approaches the standard normal distribution.
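The table entries used in the examples above can be approximated numerically. The sketch below is one possible way to do it with the Python standard library: it integrates the t density with Simpson's rule, which is a choice made here for illustration and not how the table itself was produced. The function names t_pdf and upper_tail are hypothetical.

```python
from math import gamma, pi, sqrt

def t_pdf(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    # math.gamma overflows for very large df; fine for the small df used here.
    coefficient = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coefficient * (1 + t * t / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """P(T > |t|): 0.5 minus the area from 0 to |t|, found with Simpson's rule."""
    t = abs(t)
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

for t_value in (1.28, 1.80, 2.00, 3.00):
    print(f"df = 14, t = {t_value:.2f}: P(T > t) = {upper_tail(t_value, 14):.3f}")
```

For DF = 14 the printed values should land close to the table entries quoted above (0.111, 0.047, 0.033, and just under 0.005). For a two-sided alternative the p-value is twice the upper-tail value.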

Using Software to Perform a One Mean Test Analysis Using Raw Data

Example:

Students measure their pulse rates. Is the mean pulse rate for college age women equal to 72 (a long-held standard for average pulse rate)?

Click the check box for Perform Hypothesis Test and enter the hypothesized value into the text box for Hypothesized Mean (e.g. 72)

Click Options. Here you can select the correct alternative hypothesis (the default is not equal to; keep that for now)

Click OK

Click OK

This should result in the following output:

SPSS cannot perform a hypothesis test for a mean using summarized data.

INTERPRETATION:

The p-value is p = 0.019. This is below the .05 standard, so the result is statistically significant. This means we decide in favor of the alternative hypothesis. We're deciding that the population mean is not 72.

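The test statistic is t = (x̄ − 72) / (s / √n), computed from the sample mean x̄ and sample standard deviation s of the n = 35 pulse rates; its magnitude here is 2.47.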

Because this is a two-sided alternative hypothesis, the p-value is the combined area to the right of 2.47 and the left of −2.47 in a t-distribution with 35 – 1 = 34 degrees of freedom.

Example 2:

In the same "survey" there were n = 57 men. Is the mean pulse rate for college age men equal to 72?

Null hypothesis: μ = 72
Alternative hypothesis: μ ≠72

RESULTS:

INTERPRETATION:

The p-value is p = 0.236. This is not below the .05 standard, so we do not reject the null hypothesis. Thus it is possible that the true value of the population mean is 72. The 95% confidence interval suggests the mean could be anywhere between 67.78 and 73.06.

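The test statistic is t = (x̄ − 72) / (s / √n) = −1.20, with n = 57. It is negative because the sample mean (70.42, the midpoint of the confidence interval) falls below 72.

The p-value is the combined probability that a t-value would be less than (to the left of) −1.20 or greater than (to the right of) +1.20.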
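Example 2 also shows how a confidence interval and a two-sided test tell the same story: a 95% t-interval contains the hypothesized mean exactly when the two-sided t-test at the 0.05 level does not reject it. The sketch below checks this using only the numbers reported above; the variable names are illustrative.

```python
ci_lower, ci_upper = 67.78, 73.06  # 95% confidence interval reported for the men
null_mean = 72                     # value specified in the null hypothesis
p_value = 0.236                    # two-sided p-value reported for the same data
alpha = 0.05

# A t-interval is centered on the sample mean, so the midpoint recovers x-bar.
sample_mean = (ci_lower + ci_upper) / 2
margin_of_error = (ci_upper - ci_lower) / 2

null_inside_interval = ci_lower <= null_mean <= ci_upper
reject_null = p_value <= alpha

print(f"sample mean = {sample_mean:.2f}, margin of error = {margin_of_error:.2f}")
print(f"72 inside the interval: {null_inside_interval}; reject H0: {reject_null}")
```

Since 72 lies inside the interval, the test does not reject the null hypothesis, matching the p-value of 0.236.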

Errors, Practicality and Power in Hypothesis Testing

Errors in Decision Making – Type I and Type II

How do we determine whether to reject the null hypothesis? It depends on the level of significance α, which is the probability of a Type I error.

What is Type I error and what is Type II error?

When doing hypothesis testing, two types of mistakes may be committed and we call them Type I error and Type II error.

Decision                     Reality: H0 is true    Reality: H0 is false
Reject H0 and conclude Ha    Type I error           Correct
Do not reject H0             Correct                Type II error

If we reject H0 when H0 is true, we commit a Type I error. The probability of a Type I error is denoted by alpha, α (as we already know, this is commonly 0.05).

If we fail to reject H0 when H0 is false, we commit a Type II error. The probability of a Type II error is denoted by beta, β.

Our convention is to set up the hypotheses so that type I error is the more serious error.
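The meaning of α as the probability of a Type I error can be illustrated by simulation: generate data for which H0 really is true, run the test many times, and count how often H0 is (wrongly) rejected. The sketch below uses a balanced coin with n = 100 flips and a two-sided z-test; the number of simulated tests, the seed, and the function name type_one_error_rate are arbitrary choices for illustration.

```python
import random
from math import sqrt

def type_one_error_rate(num_tests=2000, n=100, p0=0.5, z_critical=1.96, seed=7):
    """Fraction of simulated tests that reject H0: p = p0 when H0 is actually true."""
    rng = random.Random(seed)
    standard_error = sqrt(p0 * (1 - p0) / n)
    rejections = 0
    for _ in range(num_tests):
        heads = sum(1 for _ in range(n) if rng.random() < p0)  # data generated under H0
        z = (heads / n - p0) / standard_error
        if abs(z) >= z_critical:  # two-sided test at the 0.05 level
            rejections += 1
    return rejections / num_tests

rate = type_one_error_rate()
print(f"proportion of true null hypotheses rejected: {rate:.3f}")
```

The rejection rate should come out near 0.05, though not exactly, because the number of heads is a whole number and because of simulation noise.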

Example 1: Mr. Orangejuice goes to trial where Mr. Orangejuice is being tried for the murder of his ex-wife.

We can put it in a hypothesis testing framework. The hypotheses being tested are:

Mr. Orangejuice is guilty

Mr. Orangejuice is not guilty

Set up the null and alternative hypotheses where rejecting the null hypothesis when the null hypothesis is true results in the worst scenario:

H0 : Not Guilty
Ha : Guilty

Here we put "Mr. Orangejuice is not guilty" in H0 since we consider false rejection of H0 a more serious error than failing to reject H0. That is, finding an innocent person guilty is worse than finding a guilty person not guilty.

Type I error is committed if we reject H0 when it is true. In other words, a Type I error occurs when Mr. Orangejuice is not guilty but is found guilty.

α = probability( Type I error)

Type II error is committed if we fail to reject H0 when it is false. In other words, a Type II error occurs when Mr. Orangejuice is guilty but is found not guilty.

β = probability( Type II error)

Relation between α, β

Note that, for a fixed sample size, the smaller we specify the significance level α, the larger will be the probability β of failing to reject a false null hypothesis.

Cautions About Significance Tests

If a test fails to reject Ho, it does not necessarily mean that Ho is true; it just means we do not have compelling evidence to refute it. This is especially true for small sample sizes n. To grasp this, if you are familiar with the judicial system you will recall that when a judge/jury renders a decision the decision is "Not Guilty". They do not say "Innocent". This is because you are not necessarily innocent, just that you haven't been proven guilty by the evidence (i.e. statistics) presented!

Our methods depend on a normal approximation. If the underlying distribution is not normal (e.g. heavily skewed, several outliers) and our sample size is not large enough to offset these problems (think of the Central Limit Theorem from Chapter 9) then our conclusions may be inaccurate.

Power of a Test

When the data indicate that one cannot reject the null hypothesis, does it mean that one can accept the null hypothesis? For example, when the p-value computed from the data is 0.12, one fails to reject the null hypothesis at α = 0.05. Can we say that the data support the null hypothesis?

Answer: When you perform hypothesis testing, you only set the size of Type I error and guard against it. Thus, we can only present the strength of evidence against the null hypothesis. One can sidestep the concern about Type II error if the conclusion never mentions that the null hypothesis is accepted. When the null hypothesis cannot be rejected, there are two possible cases: 1) one can accept the null hypothesis, 2) the sample size is not large enough to either accept or reject the null hypothesis. To make the distinction, one has to check β. If β at a likely value of the parameter is small, then one accepts the null hypothesis. If β is large, then one cannot accept the null hypothesis.

The relationship between α and β:

If the sample size is fixed, then decreasing α will increase β. If one wants both to decrease, then one has to increase the sample size.
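The trade-off between α and β can be made concrete with a small calculation. The sketch below reuses the setting of the overweight example (Ho: p = 0.40 against Ha: p < 0.40, n = 129) and assumes a hypothetical true proportion of 0.30, which is chosen only for illustration. It relies on the normal approximation, and the function name type_two_error is likewise hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def type_two_error(alpha, n, p0, p_true):
    """Approximate beta for the lower-tailed z-test of Ho: p = p0 when the truth is p_true."""
    z_alpha = STANDARD_NORMAL.inv_cdf(alpha)         # negative critical z-value
    cutoff = p0 + z_alpha * sqrt(p0 * (1 - p0) / n)  # reject Ho when p-hat <= cutoff
    se_true = sqrt(p_true * (1 - p_true) / n)        # spread of p-hat under the true p
    # beta = P(p-hat > cutoff | p_true), i.e. the chance of failing to reject Ho
    return 1 - STANDARD_NORMAL.cdf((cutoff - p_true) / se_true)

P_TRUE = 0.30  # hypothetical true proportion, chosen only for illustration

for alpha in (0.10, 0.05, 0.01):
    print(f"n = 129, alpha = {alpha:.2f}: beta = {type_two_error(alpha, 129, 0.40, P_TRUE):.3f}")

for n in (129, 400):
    print(f"alpha = 0.05, n = {n}: beta = {type_two_error(0.05, n, 0.40, P_TRUE):.3f}")
```

With the sample size held at 129, β grows as α shrinks; raising the sample size to 400 while keeping α at 0.05 lowers β, which is the only way to make both error probabilities small.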