Do you remember the term “Student’s t-test” from your statistics lessons? And do you use its intention in case you are doing performance measurements in your day-to-day life?

William Sealy Gosset was a chemist working at the Guinness brewery in Dublin where he has been recruited because he was one of the best graduates at Oxford. The brewery’s idea was to use the scientific knowledge in order to optimize the industrial processes. During his work at Guinness William Sealy Gosset developed a way to test hypothesis like “The means of these two populations are equal.”. But because publishing scientific results gathered during work was not allowed at Guinness, he published his work under the pseudonym “Student”. That’s why we all know this kind of hypothesis testing as “Student’s t-test”.

When we measure the performance of two different algorithms on the same hardware, we cannot just compare the resulting mean values in order conclude if one of them is faster. According the “Student’s t-test” we have to formulate a “null hypothesis” that could sound in this example like “There is no effective difference between the sample means of the two observations”. The next step is to compute the so called “t value”. For this computation we assume that both series of samples are independent, i.e. the observations in the first series are in no way related to the observations in the second series, and that the distribution of values follows a normal distribution. As we do not know if both series have the same variance, we must use the so called “heteroscedastic t-test” with the following formula:

t = (x - y) / sqrt( Sx^2 / n1 + Sy^2 / n2 )
x: mean of the first series
y: mean of the second series
Sx: standard deviation of the first series
Sy: standard deviation of the second series
n1: number of samples in the first series
n2: number of samples in the second series

Let’s assume we have measured the following data:

X

Y

154.3

230.4

191.0

202.8

163.4

202.8

168.6

216.8

187.0

192.9

200.4

194.4

162.5

211.7

To compute the t value we can utilize Apache’s “commons-math3” library:

The example code above also computes the so called “homoscedastic” t-test, which assumes that the two samples are drawn from subpopulations with equal variances. The two methods from the commons library compute the smallest “significance level” at which one can reject the null hypothesis that the two means are equal. The “confidence level”, which is easier to understand, can be computed by subtracting the “significance level” from 1. As the result is a probability, we can multiply it with 100 in order to get a statement in percentage:

This means that we can reject the statement that the mean value of both sample series is equal with a probability of 99.8%. Or the other way round that the probability that both series have the same mean value is only 0.2%. Hence the two measurements are very likely to be different. But the result is not always as clear as in this case. Let’s take a look at these two series:

At first glance the second series of sample values performs much slower. But the probability that we can reject the null hypothesis that both means are equal is only 59.5%. In other words: The probability that both series have the same mean value is only about 40.5%.