Means and Proportions with two populations

Statistical inference about means and proportions with two populations seems to be one of the most commonly used applications in the field of analytics – comparing campaign response rates between 2 groups of customers, pre and post campaign sales, membership renewal rates, etc.

Call it chance or whatever, but whenever these kinds of tasks come up, I hear people talking only about t-tests. No issues as long as you want to compare means, that is, when your target variable is a continuous value. But how or why do people talk about the t-test when they want to compare ratios or proportions? Whatever happened to the Chi-Square tests or the Z-test for difference in proportions?

I did a bit of research on the net, a bit of calculation using pen and paper [very good exercise for the brain in this age of calculators and spreadsheets 🙂 ], read a very good article by Gerard E. Dallal, and I found the answers.

Going back to our introductory class in statistics, let’s check out the formulae for the t-tests.
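The formulae themselves do not appear in this copy of the post. For reference, here is a sketch of the standard textbook forms. The numbering (1) to (3) and the pooled form of the Z statistic are assumptions, inferred from how the equations are referred to later in the post; m1 and m2 denote the number of observations with the outcome in each group.

```latex
% (1) Two-sample t statistic, equal variances assumed (pooled)
t = \frac{\bar{X}_1 - \bar{X}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}

% (2) Two-sample t statistic, equal variances not assumed
t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}

% (3) Z statistic for the difference in two proportions (pooled proportion P)
Z = \frac{P_1 - P_2}{\sqrt{P(1 - P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},
\qquad
P = \frac{m_1 + m_2}{n_1 + n_2}
```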

The test statistic Z (equation 3) is equivalent to the chi-square test of homogeneity of proportions.

But how different are proportions from means? The proportion having the desired outcome is the number of individuals/observations with the outcome divided by the total number of individuals/observations. Suppose we create a variable that equals 1 if the subject has the outcome and 0 if not. The proportion of individuals/observations with the outcome is the mean of this variable, because the sum of these 0s and 1s is the number of individuals/observations with the outcome.

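Let’s suppose there are m 1s and (n-m) 0s among the n observations. Then XMean (= P) = m/n, and the deviation (X - XMean) is equal to (1 - m/n) for m observations and (0 - m/n) for (n-m) observations. When these results are combined, the sum of squared deviations is m(1 - m/n)^2 + (n-m)(m/n)^2 = nP(1-P), so the variance is P(1-P) when we divide by n (dividing by n-1 instead makes little difference for large n).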
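The algebra above is easy to check numerically. The snippet below is a minimal sketch using made-up values for n and m (they are illustrative assumptions, not data from any real campaign): it builds the 0/1 variable, confirms that its mean is the proportion m/n, and compares the sum of squared deviations with nP(1-P).

```python
# Hypothetical data: m ones (outcome present) and n - m zeros (outcome absent)
n, m = 50, 18
x = [1] * m + [0] * (n - m)

# The mean of the 0/1 variable is the proportion with the outcome
p = sum(x) / n

# Sum of squared deviations from the mean, computed directly
ss = sum((xi - p) ** 2 for xi in x)

print("proportion:", p, "m/n:", m / n)
print("sum of squares:", ss, "n*P*(1-P):", n * p * (1 - p))
```

The two numbers on each printed line should agree up to floating-point rounding.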

Substituting this variance into equation 3 (for the Z statistic), we get (P1 – P2) / sqrt(Variance/n1 + Variance/n2), which is not so different from equation 2 (the formula for the “equal variances not assumed” version of the t-test).

As long as the sample size is relatively large, the distributional assumptions are met, and the response is binomial – the t test and the z test will give p-values that are very close to one another.
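To see how close the two statistics are in practice, here is a rough sketch that computes both on the same 0/1 data using only the Python standard library. The response counts are invented for illustration, and the z statistic uses a pooled proportion, which is an assumption about the form of equation 3. Only the test statistics are compared; with samples this large the t distribution is nearly normal, so similar statistics mean similar p-values.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x1, x2):
    """t statistic, equal variances not assumed (sample variances use n - 1)."""
    se = sqrt(variance(x1) / len(x1) + variance(x2) / len(x2))
    return (mean(x1) - mean(x2)) / se

def pooled_z(x1, x2):
    """z statistic for a difference in proportions, using the pooled proportion."""
    n1, n2 = len(x1), len(x2)
    p = (sum(x1) + sum(x2)) / (n1 + n2)
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (mean(x1) - mean(x2)) / se

# Hypothetical campaign responses: 1 = responded, 0 = did not respond
group_a = [1] * 120 + [0] * 880
group_b = [1] * 90 + [0] * 910

t_stat = welch_t(group_a, group_b)
z_stat = pooled_z(group_a, group_b)
print("t statistic:", t_stat)
print("z statistic:", z_stat)
```

With small samples or proportions near 0 or 1, the gap between the two would be expected to widen.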

And in the case where we have only two categories, the z test and the chi-square test turn out to be exactly equivalent, though the chi-square is by nature a two-tailed test. The chi-square distribution for 1 df is just the square of the z distribution.
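The equivalence can be verified on a 2x2 table. The sketch below assumes the pooled form of the z statistic and the Pearson chi-square without continuity correction; the counts are made up for illustration. It uses the fact that the upper tail of a chi-square with 1 df at x equals the two-sided normal tail at sqrt(x).

```python
from math import sqrt, erfc

def pooled_z(m1, n1, m2, n2):
    """z statistic for two proportions m1/n1 and m2/n2 with a pooled proportion."""
    p = (m1 + m2) / (n1 + n2)
    return (m1 / n1 - m2 / n2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

def chi_square_2x2(m1, n1, m2, n2):
    """Pearson chi-square (no continuity correction) for the 2x2 table."""
    observed = [[m1, n1 - m1], [m2, n2 - m2]]
    row_totals = [n1, n2]
    col_totals = [m1 + m2, (n1 - m1) + (n2 - m2)]
    total = n1 + n2
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical counts: 120 of 1000 responders versus 90 of 1000
z = pooled_z(120, 1000, 90, 1000)
chi2 = chi_square_2x2(120, 1000, 90, 1000)

p_z = erfc(abs(z) / sqrt(2))   # two-sided p-value from the standard normal
p_chi2 = erfc(sqrt(chi2 / 2))  # upper-tail p-value from chi-square with 1 df

print("z squared:", z ** 2, "chi-square:", chi2)
print("p (z, two-sided):", p_z, "p (chi-square):", p_chi2)
```

Note that the match is with the two-sided z test; a one-sided z test has no chi-square counterpart.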

The various tests and their assumptions, as listed in Wikipedia, are given below:

1. Two-sample pooled t-test, equal variances: (Normal populations or n1 + n2 > 40) and independent observations and σ1 = σ2 and (σ1 and σ2 unknown)