Hypothesis Testing

Construct an appropriate null hypothesis and alternative hypothesis and distinguish between the two.

Construct and apply confidence intervals for one-sided and two-sided hypothesis tests, and interpret the results of hypothesis tests with a specific level of confidence.

Differentiate between a one-sided and a two-sided test and identify when to use each test.

Explain the difference between Type I and Type II errors and how these relate to the size and power of a test.

Understand how a hypothesis test and a confidence interval are related.

Explain what the p-value of a hypothesis test measures.

Interpret the results of hypothesis tests with a specific level of confidence.

Identify the steps to test a hypothesis about the difference between two population means.

Explain the problem of multiple testing and how it can bias results.

Hypothesis testing is the process of determining whether a hypothesis is consistent with the sample data. It tests whether the observed data are likely if the hypothesis is true. Hypothesis testing starts by stating the null hypothesis and the alternative hypothesis. The null hypothesis is an assumption about the population parameter. The alternative hypothesis, on the other hand, states the parameter values for which the null hypothesis is rejected. The critical values are determined by the distribution of the test statistic (when the null hypothesis is true) and the size of the test (the probability of rejecting the null hypothesis when it is true).

Components of the Hypothesis Testing

The elements of a hypothesis test include:

The null hypothesis

The alternative hypothesis

The Test statistic

The size of the hypothesis test and Errors

The critical value

The Decision rule

The Null hypothesis

As stated earlier, the first stage of the hypothesis test is the statement of the null hypothesis. The null hypothesis is a statement concerning the values of the population parameter. It expresses the notion that “there is nothing unusual about the data.”

The null hypothesis, denoted as H0, represents the current state of knowledge about the population parameter that is the subject of the test. In other words, it represents the “status quo.” For example, the U.S. Food and Drug Administration may walk into a cooking oil manufacturing plant intending to confirm that each 1 kg oil package has, say, 0.15% cholesterol and not more. The inspectors will formulate a hypothesis like:

H0: Each 1 kg package has 0.15% cholesterol

A test would then be carried out to confirm or reject the null hypothesis.

Other typical statements of H0 include:

$$H_0:\mu={\mu}_0$$

$$H_0:\mu≤{\mu}_0$$

Where

\(μ\) = true population mean and,

\(μ_0\)= the hypothesized population mean.

The Alternative Hypothesis

The alternative hypothesis, denoted H1, is a contradiction of the null hypothesis. It determines the values of the population parameter at which the null hypothesis is rejected. Thus, rejecting H0 lends support to H1. We accept the alternative hypothesis when the “status quo” is discredited and found to be untrue.

Using our FDA example above, the alternative hypothesis would be:

H1: Each 1 kg package does not have 0.15% cholesterol

Typical statements of H1 include:

$$H_1:\mu \neq {\mu}_0$$

$$H_1:\mu > {\mu}_0$$

Where

\(μ\) = true population mean and,

\(μ_0\)= the hypothesized population mean.

Note that each statement of the alternative hypothesis contradicts the corresponding statement of the null hypothesis above.

The Test statistic

A test statistic is a standardized value computed from sample information when testing hypotheses. It compares the given data with what we would expect under the null hypothesis. Thus, it is a major determinant when deciding whether to reject H0, the null hypothesis.

We use the test statistic to gauge the degree of agreement between sample data and the null hypothesis. Analysts use the following formula when calculating the test statistic:

$$\text{Test statistic}=\frac{\text{Sample statistic}-\text{Hypothesized value}}{\text{Standard error of the sample statistic}}$$

The test statistic is a random variable that changes from one sample to another. Test statistics assume a variety of distributions. We shall focus on normally distributed test statistics because they are used in hypothesis tests concerning means, regression coefficients, and other econometric models.

We shall consider the hypothesis test on the mean. Consider a null hypothesis \(H_0:μ=μ_0\). Assume that the data used is iid, so that the estimator of the mean is asymptotically normally distributed as:

$$\sqrt{n} (\hat{\mu}-\mu) \sim N(0, {\sigma}^2)$$

Where \({\sigma}^2\) is the variance of the sequence of iid random variables used. The asymptotic distribution leads to the test statistic:

$$T=\frac{\hat{\mu}-{\mu}_0}{\sqrt{\frac{\hat{\sigma}^2}{n}}} \sim N(0,1)$$

Note this is consistent with our initial definition of the test statistic.

The following table gives a brief outline of the various test statistics used regularly, based on the distribution the data is assumed to follow:

| Hypothesis Test | Test Statistic |
|---|---|
| Z-test | z-statistic |
| Chi-Square Test | Chi-square statistic |
| t-test | t-statistic |
| ANOVA | F-statistic |

We can subdivide the set of values that can be taken by the test statistic into two regions: the non-rejection region, which is consistent with H0, and the rejection region (critical region), which is inconsistent with H0. If the test statistic has a value found within the critical region, we reject H0.

As with any other statistic, the distribution of the test statistic must be completely specified when H0 is true.

The Size of the Hypothesis Test and the Type I and Type II Errors

While using sample statistics to draw conclusions about the parameters of the population as a whole, there is always the possibility that the sample collected does not accurately represent the population. Consequently, statistical tests carried out using such sample data may yield incorrect results that may lead to erroneous rejection (or lack thereof) of the null hypothesis. We have two types of error:

Type I Error

Type I error occurs when we reject a true null hypothesis. For example, a type I error would manifest in the form of rejecting \(H_0:μ=0\) when μ is actually zero.

Type II Error

Type II error occurs when we fail to reject a false null hypothesis. In such a scenario, the test provides insufficient evidence to reject the null hypothesis when it’s actually false.

The level of significance, denoted by α, represents the probability of making a type I error, i.e., rejecting the null hypothesis when it is in fact true. Its counterpart is β, the probability of making a type II error. The ideal but practically impossible statistical test would be one that simultaneously minimizes α and β; for a fixed sample size, reducing one generally increases the other. We use α to determine critical values that subdivide the distribution into the rejection and the non-rejection regions.
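The meaning of α as the probability of a type I error can be checked by simulation. The following is a minimal sketch rather than part of the original material: it assumes normally distributed data with a known standard deviation of 1 and a two-sided z-test, and the sample size, number of trials, and seed are illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist

def type_one_error_rate(alpha=0.05, n=30, trials=5000, seed=0):
    """Estimate how often a two-sided z-test rejects a true null (mu = 0)."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 5%
    rejections = 0
    for _ in range(trials):
        # Draw a sample for which H0 is true: mean 0, known sigma 1 (assumed).
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        t_stat = (sum(sample) / n) / (1.0 / sqrt(n))
        if abs(t_stat) > critical:
            rejections += 1  # a type I error: H0 is true but was rejected
    return rejections / trials

rate = type_one_error_rate()
print(f"Estimated type I error rate: {rate:.3f}")
```

The estimated rejection rate should be close to the chosen α, since every rejection in this setup is a type I error.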

The Critical Value and the Decision Rule

The decision to reject or not to reject the null hypothesis is based on the distribution assumed by the test statistic. For example, if the test statistic follows a normal distribution, we use the level of significance (α) of the test to obtain critical values from the standard normal distribution.

The decision rule is a result of combining the critical value (denoted by \(C_α\)), the alternative hypothesis, and the test statistic (T). It determines whether to reject the null hypothesis in favor of the alternative hypothesis or to fail to reject the null hypothesis.

For the t-test, the decision rule depends on the alternative hypothesis. When testing against a two-sided alternative, reject the null hypothesis if \(|T|>C_α\), that is, if the absolute value of the test statistic is greater than the critical value. When testing against a one-sided alternative, reject the null hypothesis if \(T<C_α\) for a one-sided lower alternative (where \(C_α\) is the negative, lower-tail critical value) and if \(T>C_α\) for a one-sided upper alternative. When a null hypothesis is rejected at the α significance level, we say that the result is significant at the α significance level.

Note that prior to decision making, one must decide whether the test should be one-tailed or two-tailed. The following is a brief summary of the decision rules under different scenarios:

Left One-tailed Test

H1: parameter < X

Decision rule: Reject H0 if the test statistic is less than the critical value. Otherwise, do not reject H0.

Right One-tailed Test

H1: parameter > X

Decision rule: Reject H0 if the test statistic is greater than the critical value. Otherwise, do not reject H0.

Two-tailed Test

H1: parameter ≠ X (not equal to X)

Decision rule: Reject H0 if the test statistic is greater than the upper critical value or less than the lower critical value.
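The three decision rules can be sketched in code. This is an illustrative example that assumes a test statistic following the standard normal distribution; the function name `reject_null` and the labels used for the alternatives are hypothetical.

```python
from statistics import NormalDist

def reject_null(t_stat, alpha=0.05, alternative="two-sided"):
    """Return True if H0 is rejected, assuming a standard normal test statistic."""
    z = NormalDist()
    if alternative == "two-sided":
        # Split alpha between the two tails.
        return abs(t_stat) > z.inv_cdf(1 - alpha / 2)
    if alternative == "lower":
        # Left one-tailed test: the critical value is negative.
        return t_stat < z.inv_cdf(alpha)
    if alternative == "upper":
        # Right one-tailed test: the critical value is positive.
        return t_stat > z.inv_cdf(1 - alpha)
    raise ValueError("alternative must be 'two-sided', 'lower' or 'upper'")

print(reject_null(2.2))                         # two-tailed, critical values about +/-1.96
print(reject_null(-1.7, alternative="lower"))   # left-tailed, critical value about -1.645
print(reject_null(1.5, alternative="upper"))    # right-tailed, critical value about 1.645
```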

For a one-sided lower alternative, the rejection region lies in the left tail of the distribution. In this case, the hypotheses are stated as \(H_0:μ≥μ_0\) vs \(H_1:μ<μ_0\).

For a one-sided upper alternative, the rejection region lies in the right tail of the distribution. In this case, the hypotheses are stated as \(H_0:μ≤μ_0\) vs \(H_1:μ>μ_0\).

Example: Hypothesis Test on the Mean

Consider the returns from a portfolio \(X=(x_1,x_2,\dots, x_n)\) from 1980 through 2020. The mean of the returns is estimated to be 7.50%, with a standard deviation of 17%. We wish to determine whether the expected value of the return is different from 0 at a 5% significance level.

The hypotheses are \(H_0:μ=0\) vs \(H_1:μ≠0\). At \(α=5\%\), the critical values are \(±1.96\). Since this is a two-sided test, the rejection regions are ( \(-\infty,-1.96\) ) and (\(1.96, \infty \) ).

Since the test statistic (2.79) is higher than the upper critical value, we reject the null hypothesis in favor of the alternative hypothesis.
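The calculation behind this example can be sketched as follows. The sample size is not stated in the text; n = 40 is an assumption made here because it reproduces the quoted test statistic of 2.79.

```python
from math import sqrt
from statistics import NormalDist

mean_return = 0.075  # sample mean from the example
std_dev = 0.17       # sample standard deviation from the example
mu_0 = 0.0           # value of the mean under H0
n = 40               # ASSUMPTION: not given in the text; chosen to match T = 2.79

# Z-test statistic: (estimate - hypothesized value) / standard error
t_stat = (mean_return - mu_0) / (std_dev / sqrt(n))
critical = NormalDist().inv_cdf(1 - 0.05 / 2)  # two-sided test at the 5% level

print(f"T = {t_stat:.2f}, critical value = {critical:.2f}")
print("Reject H0" if abs(t_stat) > critical else "Fail to reject H0")
```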

The example above is a Z-test, which is the test mostly emphasized in this chapter and follows immediately from the central limit theorem (CLT). However, we can use the Student’s t-distribution if the random variables are iid and normally distributed and the sample size is small (n<30).

In the Student’s t-test, we use the unbiased estimator of the variance. That is:

$$s^2=\frac{1}{n-1}\sum_{i=1}^{n}{\left(X_i-\hat{\mu}\right)}^2$$

Therefore the test statistic for \(H_0:μ=μ_0\) is given by:

$$T=\frac{\hat{\mu}-{\mu}_0}{\sqrt{\frac{s^2}{n}}} \sim t_{n-1}$$
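As an illustration, the t-statistic can be computed from raw data as sketched below. The return series is made up for the example, and the function name `t_statistic` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(sample, mu_0):
    """t-statistic for H0: mu = mu_0, using the unbiased sample variance."""
    n = len(sample)
    mu_hat = mean(sample)
    s2 = variance(sample)  # statistics.variance divides by n - 1
    return (mu_hat - mu_0) / sqrt(s2 / n)

# Hypothetical small sample of returns (n = 8, so 7 degrees of freedom).
returns = [0.02, -0.01, 0.04, 0.03, 0.00, 0.05, -0.02, 0.01]
t_stat = t_statistic(returns, mu_0=0.0)
print(f"T = {t_stat:.4f} with {len(returns) - 1} degrees of freedom")
```

The resulting value is compared with a critical value from the t-table for n - 1 degrees of freedom rather than with the standard normal critical value.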

The Type II Error and the Test Power

The power of a test is the complement of the probability of a type II error. While the level of significance gives us the probability of rejecting the null hypothesis when it is, in fact, true, the power of a test gives the probability of correctly rejecting the null hypothesis when it is false. Denoting the probability of a type II error by β, the power of a test is given by:

$$ \text{Power of a test}=1-\beta $$

The power of a test measures the likelihood that a false null hypothesis is rejected. It is influenced by the sample size, the distance between the hypothesized parameter and the true value, and the size of the test.
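The effect of the sample size on power can be illustrated with a short sketch. It assumes a two-sided z-test with a known standard deviation, so that the test statistic is normal with unit variance and a mean shifted away from zero when the null is false; the function name and the input values are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(true_mu, mu_0, sigma, n, alpha=0.05):
    """Power of a two-sided z-test when the true mean is true_mu (sigma assumed known)."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha / 2)
    # Under the true mean, T is approximately N(shift, 1).
    shift = (true_mu - mu_0) * sqrt(n) / sigma
    # Probability that T lands in either rejection region.
    return (1 - z.cdf(critical - shift)) + z.cdf(-critical - shift)

powers = [z_test_power(0.075, 0.0, 0.17, n) for n in (10, 40, 160)]
for n, power in zip((10, 40, 160), powers):
    print(f"n = {n:3d}: power = {power:.3f}, beta = {1 - power:.3f}")
```

When the true mean equals the hypothesized mean, the same function returns α, which is the size of the test.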

Confidence Intervals

A confidence interval can be defined as the range of parameter values within which the true parameter is expected to lie at a given confidence level. For instance, a 95% confidence interval consists of the set of parameter values for which the null hypothesis cannot be rejected when using a 5% test size. Therefore, a 1-α confidence interval contains the values that cannot be rejected at a test size of α.

It is important to note that the confidence interval depends on the alternative hypothesis statement in the test. Let us start with the two sided-test alternatives.

Example: Calculating Two-Sided Alternative Confidence Intervals

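Consider the returns from a portfolio \(X=(x_1,x_2,…, x_n)\) from 1980 through 2020. The mean of the returns is estimated to be 7.50%, with a standard deviation of 17%. Calculate the 95% confidence interval for the portfolio return.

For a two-sided alternative, the \(1-α\) confidence interval is:

$$\left[\hat{\mu}-C_α\frac{\hat{\sigma}}{\sqrt{n}},\ \hat{\mu}+C_α\frac{\hat{\sigma}}{\sqrt{n}}\right]$$

where \(C_α=1.96\) for a 95% interval.

Note that for a one-sided alternative the critical value decreases from 1.96 to 1.645, because the entire 5% test size lies in a single tail.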
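A numerical sketch of both kinds of interval follows. The sample size is not stated in the example; n = 40 is an assumption used here for illustration only.

```python
from math import sqrt
from statistics import NormalDist

mean_return = 0.075  # sample mean from the example
std_dev = 0.17       # sample standard deviation from the example
n = 40               # ASSUMPTION: the sample size is not given in the text
alpha = 0.05
std_error = std_dev / sqrt(n)

# Two-sided interval: alpha is split between the two tails.
c_two = NormalDist().inv_cdf(1 - alpha / 2)
two_sided = (mean_return - c_two * std_error, mean_return + c_two * std_error)

# One-sided interval for an upper alternative: all of alpha sits in one tail,
# so the interval is [lower bound, infinity).
c_one = NormalDist().inv_cdf(1 - alpha)
one_sided_lower_bound = mean_return - c_one * std_error

print(f"Two-sided critical value {c_two:.3f}: [{two_sided[0]:.4f}, {two_sided[1]:.4f}]")
print(f"One-sided critical value {c_one:.3f}: [{one_sided_lower_bound:.4f}, inf)")
```

Under this assumed sample size, zero lies outside the two-sided interval, which agrees with rejecting the null hypothesis of a zero mean at the 5% level.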

The p-Value

When carrying out a statistical test with a fixed value of the significance level (α), we merely compare the observed test statistic with some critical value. For example, we might “reject H0 using a 5% test” or “reject H0 at 1% significance level”. The problem with this ‘classical’ approach is that it does not give us the details about the strength of the evidence against the null hypothesis.

Determination of the p-value gives statisticians a more informative approach to hypothesis testing. The p-value is the lowest significance level at which we can reject H0. This means that the strength of the evidence against H0 increases as the p-value becomes smaller. The p-value depends on the alternative hypothesis.

The p-Value for One-Tailed Test Alternative

For one-tailed tests, the p-value is given by the probability that lies below the calculated test statistic for left-tailed tests. Similarly, the likelihood that lies above the test statistic in right-tailed tests gives the p-value.

Denoting the test statistic by T, the p-value for \(H_1:μ>0\) is given by:

$$P(Z>T)=1-P(Z≤T)=1- \Phi (T) $$

Conversely, for \(H_1:μ<0 \) the p-value is given by:

$$ P(Z≤T)= \Phi (T)$$

Where Z is a standard normal random variable and \(\Phi\) is its cumulative distribution function.

The p-Value for Two-Tailed Test Alternative

If the test is two-tailed, the p-value is given by the sum of the probabilities in the two tails. We start by determining the probability lying below the negative value of the test statistic. Then, we add this to the probability lying above the positive value of the test statistic. That is, the p-value for the two-tailed hypothesis test is given by:

$$2\left[1-\Phi (|T|)\right]$$

The absolute value of T ensures that the right tail is measured whether T is negative or positive.
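The p-value calculations for the three alternatives can be sketched as a single function. This is an illustrative example assuming a standard normal test statistic; it uses the signed statistic for one-sided alternatives, so that a statistic falling in the wrong tail produces a large p-value. The function name `p_value` is hypothetical.

```python
from statistics import NormalDist

def p_value(t_stat, alternative="two-sided"):
    """p-value of a standard normal test statistic for the given alternative."""
    phi = NormalDist().cdf
    if alternative == "upper":
        return 1 - phi(t_stat)             # probability above T
    if alternative == "lower":
        return phi(t_stat)                 # probability below T
    if alternative == "two-sided":
        return 2 * (1 - phi(abs(t_stat)))  # both tails beyond |T|
    raise ValueError("alternative must be 'two-sided', 'lower' or 'upper'")

for alt in ("upper", "lower", "two-sided"):
    print(f"{alt:>9}: p-value = {p_value(1.5, alt):.4f}")
```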

Example 1: p-Value for One-Sided Alternative

Let θ represent the probability of obtaining a head when a coin is tossed. Suppose we toss the coin 200 times and heads come up in 85 of the trials. Test the following hypothesis at 5% level of significance.

H0: θ = 0.5

H1: θ < 0.5

Solution

First, note that repeatedly tossing a coin follows a binomial distribution.

Our p-value will be given by P(X ≤ 85), where X ~ Binomial(200, 0.5), assuming H0 is true.

$$ P\left[ Z< \frac{85.5-100}{\sqrt{50}}\right] $$

$$ =P(Z<-2.05)=1-0.97982=0.02018 $$

(We have applied the Central Limit Theorem by taking the binomial distribution as approx. normal)

Since the p-value is less than 0.05, we reject H0: the observed outcome would be unlikely if H0 were true, so we have strong evidence against H0 in favor of H1. Expressing this result clearly, we could say:

“There is very strong evidence against the hypothesis that the coin is fair. We, therefore, conclude that the coin is biased against heads.”

Remember, failure to reject H0 does not mean it’s true. It means there’s insufficient evidence to justify rejecting H0, given a certain level of significance.
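The coin example can be checked numerically. The sketch below compares the exact binomial tail probability with the normal approximation (including the continuity correction) used in the solution; it is an illustration only.

```python
from math import comb, sqrt
from statistics import NormalDist

n, heads, theta_0 = 200, 85, 0.5

# Exact p-value: P(X <= 85) when X ~ Binomial(200, 0.5).
exact_p = sum(
    comb(n, k) * theta_0 ** k * (1 - theta_0) ** (n - k)
    for k in range(heads + 1)
)

# Normal approximation with continuity correction, as in the solution.
mean = n * theta_0                          # 100
sd = sqrt(n * theta_0 * (1 - theta_0))      # sqrt(50)
approx_p = NormalDist().cdf((heads + 0.5 - mean) / sd)

print(f"Exact binomial p-value:      {exact_p:.5f}")
print(f"Normal approximation p-value: {approx_p:.5f}")
print("Reject H0 at 5%" if approx_p < 0.05 else "Fail to reject H0 at 5%")
```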

Example 2: p-Value for Two-Sided Alternative

A CFA candidate conducts a statistical test about the mean value of a random variable X.

H0: μ = μ0 vs. H1: μ ≠ μ0

She obtains a test statistic of 2.2. Given a 5% significance level, determine and interpret the p-value:

A. 1.39%

B. 2.78

C. 2.78%

Solution

The correct answer is C.

$$ \text{P-value}=2P(Z>2.2)=2[1-P(Z≤2.2)] =1.39\%×2=2.78\%$$

(We have multiplied by two since this is a two-tailed test)

Interpretation

The p-value (2.78%) is less than the level of significance (5%). Therefore, we have sufficient evidence to reject H0. In fact, the evidence is strong enough that we would also reject H0 at significance levels of 4% and 3%. However, at significance levels of 2% or 1%, we would not reject H0 since the p-value exceeds these values.

Hypothesis about the Difference between Two Population Means

It’s common for analysts to be interested in establishing whether there exists a significant difference between the means of two different populations. For instance, they might want to know whether the average returns for two subsidiaries of a given company exhibit significant differences.

Now, consider a bivariate random variable:

$$W_i=[X_i,Y_i]$$

Assume that the components \(X_i\) and \(Y_i\) are both iid and may be correlated. That is:
\(\text{Corr} (X_i,Y_i )≠0\)

Now, suppose that we want to test the hypothesis that:

$$H_0:μ_X=μ_Y$$

$$H_1:μ_X≠μ_Y$$

In other words, we want to test whether the constituent random variables have equal means. Note that the hypothesis statement above can be written as:

$$H_0:μ_X-μ_Y=0$$

$$H_1:μ_X-μ_Y≠0$$

To execute this test, consider the variable:

$$Z_i=X_i-Y_i$$

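Therefore, considering the above random variable, if the null hypothesis is correct then,

$$E(Z_i)=E(X_i)-E(Y_i)=μ_X-μ_Y=0$$

This is a standard hypothesis test of \(H_0:μ_Z=0\), with the test statistic:

$$T=\frac{\hat{\mu}_Z}{\sqrt{\frac{\hat{\sigma}_Z^2}{n}}} \sim N(0,1)$$

where \(\hat{\sigma}_Z^2=\hat{\sigma}_X^2+\hat{\sigma}_Y^2-2\hat{\sigma}_{XY}\) accounts for the covariance between \(X_i\) and \(Y_i\). In the special case where \(X_i\) and \(Y_i\) are independent, the test statistic is:

$$T=\frac{\hat{\mu}_X-\hat{\mu}_Y}{\sqrt{\frac{\hat{\sigma}_X^2}{n_X}+\frac{\hat{\sigma}_Y^2}{n_Y}}}$$

Where \(n_X\) and \(n_Y\) are the sample sizes of \(X_i\) and \(Y_i\), respectively.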

Example: Hypothesis Test on Two Means

An investment analyst wants to test whether there is a significant difference between the means of two portfolios at the 95% confidence level. The first portfolio X consists of 30 government-issued bonds and has a mean of 10% and a standard deviation of 2%. The second portfolio Y consists of 30 private bonds with a mean of 14% and a standard deviation of 3%. The correlation between the two portfolios is 0.7. Calculate the test statistic and state whether the null hypothesis is rejected.

Solution

The hypothesis statement is given by:

$$H_0:μ_X-μ_Y=0 \text{ vs } H_1:μ_X-μ_Y≠0$$

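Note that this is a two-tailed test. At the 95% level, the test size is α=5% and thus the critical value is \(C_α=±1.96\). Since the two portfolios are correlated, the test statistic is given by:

$$T=\frac{\hat{\mu}_X-\hat{\mu}_Y}{\sqrt{\frac{\hat{\sigma}_X^2+\hat{\sigma}_Y^2-2\hat{\sigma}_{XY}}{n}}}=\frac{0.10-0.14}{\sqrt{\frac{0.02^2+0.03^2-2(0.7)(0.02)(0.03)}{30}}}=-10.215$$

The test statistic is far less than -1.96. Therefore the null hypothesis is rejected at the 95% level.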
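The arithmetic of this example can be reproduced with a short sketch. It assumes, as the example implies, that the two portfolios are paired with n = 30 observations each, and that the covariance is recovered from the stated correlation.

```python
from math import sqrt
from statistics import NormalDist

mu_x, sigma_x = 0.10, 0.02  # government bond portfolio X
mu_y, sigma_y = 0.14, 0.03  # private bond portfolio Y
rho, n = 0.7, 30            # correlation and number of paired observations

# Variance of the difference Z = X - Y accounts for the covariance term.
cov_xy = rho * sigma_x * sigma_y
var_z = sigma_x ** 2 + sigma_y ** 2 - 2 * cov_xy

t_stat = (mu_x - mu_y) / sqrt(var_z / n)
critical = NormalDist().inv_cdf(1 - 0.05 / 2)  # two-tailed test, alpha = 5%

print(f"T = {t_stat:.3f}, critical values = +/-{critical:.2f}")
print("Reject H0" if abs(t_stat) > critical else "Fail to reject H0")
```

Because the correlation is positive, the variance of the difference is smaller than it would be under independence, which makes the test statistic larger in absolute value.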

The Problem of Multiple Testing

Multiple testing refers to conducting multiple hypothesis tests on the same data. Reusing the data in this way produces spurious results and unreliable conclusions. The actual test size becomes much larger than the assumed size, which increases the probability of a type I error and hence introduces bias into the results.

Several control methods have been developed to combat the multiple testing problem. These include the Bonferroni correction and procedures that control the False Discovery Rate (FDR) or the Familywise Error Rate (FWER).
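To see how quickly the test size inflates, the sketch below computes the familywise error rate under the simplifying assumption that the tests are independent, and then applies the Bonferroni correction, which compares each p-value with α divided by the number of tests. The p-values and function names are hypothetical.

```python
def familywise_error_rate(alpha, m):
    """Probability of at least one type I error across m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni_reject(p_values, alpha=0.05):
    """Reject each H0 only if its p-value is at most alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

for m in (1, 5, 20):
    print(f"{m:2d} tests at 5%: FWER = {familywise_error_rate(0.05, m):.3f}")

p_values = [0.001, 0.012, 0.030, 0.048, 0.200]  # hypothetical p-values
decisions = bonferroni_reject(p_values)
print(decisions)
```

Without the correction, four of the five hypothetical p-values fall below 5%; with it, only the smallest does.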

Question

An experiment was done to find out the number of hours that candidates spend preparing for the FRM part 1 exam. It was discovered that for a sample of 10 students, the following times were spent:

318, 304, 317, 305, 309, 307, 316, 309, 315, 327

If the sample mean and standard deviation are 312.7 and 7.2 respectively, calculate a symmetrical 95% confidence interval for the mean time a candidate spends preparing for the exam using the t-table.

A. [307.5, 317.9]

B. [307.6, 317.8]

C. [307.9, 317.5]

D. [307.3, 318.2]

The correct answer is A.

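Since the population variance is unknown, we must use the t-score.

Our \(t_{\frac{α}{2}}\) value is found from the t-table using (10 – 1 =) 9 degrees of freedom and a cumulative probability of (1 – 0.025 =) 0.975, which gives 2.262. The confidence interval is therefore:

$$312.7 ± 2.262 × \frac{7.2}{\sqrt{10}} = 312.7 ± 5.15 ≈ [307.5, 317.9]$$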
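The interval can be verified with a short sketch. The critical value 2.262 is taken from the t-table as in the solution, and the standard deviation is rounded to 7.2 as given in the question; using the unrounded sample standard deviation would shift the bounds slightly.

```python
from math import sqrt
from statistics import mean, stdev

hours = [318, 304, 317, 305, 309, 307, 316, 309, 315, 327]
t_crit = 2.262  # t-table value for 9 degrees of freedom and 0.975

x_bar = mean(hours)  # sample mean, 312.7
s_exact = stdev(hours)  # unrounded sample standard deviation
s = 7.2              # rounded value given in the question

half_width = t_crit * s / sqrt(len(hours))
lower, upper = x_bar - half_width, x_bar + half_width

print(f"Sample mean = {x_bar:.1f}, sample standard deviation = {s_exact:.2f}")
print(f"95% confidence interval: [{lower:.1f}, {upper:.1f}]")
```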