What's the Difference? An Overview of Non-Parametric Tests

Are my datasets different?

I ran into the following problem:

I am performing a regression analysis. I have a training dataset named train and a test dataset named test, each with a single dependent variable (my output). When I perform cross-validation on the training set I get a low MSE and a high R² value (yay!). Now that I have some faith this model could work, I train a model on all my training data and then predict on the features from the test set. When I check R² on the test set I find that it is negative, and the MSE is much higher. So what could be my issue?

In my case I am using data that was collected at different times, i.e. train was collected a year before test. Is it possible that some of the features have different distributions in the two datasets? I can use the following tests to find out.

The tests and methods I’ll cover in this post can be used for a number of different purposes outside of the example I provided.

Purpose

The purpose of this post is to provide examples of non-parametric tests and methods along with brief (generalized) descriptions of what each test does. If you are looking for advice on which test to choose for your application, that is beyond the scope of this post. However I would point you to the links below as a great starting point:

Probability Density of a and b

Empirical CDF of a and b

The Kolmogorov-Smirnov (KS) test compares the empirical cumulative distribution functions (CDFs) of two samples. It calculates a D statistic, which is the maximum vertical distance between the two CDFs.
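To make the D statistic concrete, here is a minimal sketch that computes it by hand from two empirical CDFs and checks it against ks.test. The samples a and b below are hypothetical stand-ins (two normal samples of 1000 points each), not the data behind the plots in this post.

```r
# Hypothetical samples; the post's actual a and b are not shown.
set.seed(42)
a <- rnorm(1000, mean = 0, sd = 1)
b <- rnorm(1000, mean = 0.2, sd = 1)

# Empirical CDF of each sample (ecdf() returns a step function)
ecdf_a <- ecdf(a)
ecdf_b <- ecdf(b)

# Both ECDFs only change at observed values, so the largest gap
# can be found by evaluating them at the pooled sample points.
pooled <- sort(c(a, b))
d_manual <- max(abs(ecdf_a(pooled) - ecdf_b(pooled)))

# Statistic reported by the two-sample KS test
d_ks <- unname(ks.test(a, b)$statistic)

print(c(manual = d_manual, ks_test = d_ks))
```

The two values should agree, which is a quick way to convince yourself that D is nothing more than the biggest vertical gap in the empirical CDF plot.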

Things to know about KS:

Null hypothesis: the two samples come from the same distribution

The p-value can be used to decide whether to reject the null hypothesis (e.g. reject the null if p < 0.05)

D ranges from 0 to 1; D close to 1 indicates the two samples are from very different distributions

D close to 0 indicates the two samples have very similar distributions

In the case of our example dataset the KS test returns a p-value < 0.05, so we would
reject the null hypothesis. However, the D statistic is only 0.1, so although the difference
between the distributions is statistically significant, the distributions are not far apart.
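Reading the test result in R might look like the sketch below. Again the samples are hypothetical (I am assuming 1000 points per sample with a small shift between them), so the numbers it prints will not match the ones quoted above.

```r
# Hypothetical samples standing in for a and b.
set.seed(42)
a <- rnorm(1000, mean = 0, sd = 1)
b <- rnorm(1000, mean = 0.2, sd = 1)

ks_result <- ks.test(a, b)

# Pull out the two numbers discussed in the text
d_stat <- unname(ks_result$statistic)  # maximum gap between the ECDFs
p_val  <- ks_result$p.value

# Decision at the 0.05 level, kept separate from the size of the effect
reject_null <- p_val < 0.05
cat(sprintf("D = %.3f, p-value = %.4g, reject null: %s\n",
            d_stat, p_val, reject_null))
```

Keeping the decision (the p-value) and the effect size (D) as separate values makes it harder to confuse "significantly different" with "very different", which matters with samples this large.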

Median Comparison

The next set of tests, the Wilcoxon signed-rank test and the Mann-Whitney U test, examine a difference in the
median value between the two distributions. The Wilcoxon signed-rank test is used for paired data, while the
Mann-Whitney U test is used for unpaired data.

Given the example I cited at the beginning of the post, one would assume the data is not paired, but I will
go through how to test paired samples anyway. In R, both the Wilcoxon and Mann-Whitney tests
are carried out using the wilcox.test function. To run the Wilcoxon signed-rank test,
set the paired argument to TRUE; to run the Mann-Whitney test, set paired
to FALSE.
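As a rough sketch of the two calls, assuming two hypothetical samples a and b of equal length (the paired test needs matched observations):

```r
# Hypothetical samples; equal lengths so the paired call is valid.
set.seed(42)
a <- rnorm(1000, mean = 0, sd = 1)
b <- rnorm(1000, mean = 0.2, sd = 1)

# Wilcoxon signed-rank test: a[i] and b[i] are treated as a matched pair.
signed_rank <- wilcox.test(a, b, paired = TRUE)

# Mann-Whitney U test (R labels it the Wilcoxon rank sum test): unpaired samples.
mann_whitney <- wilcox.test(a, b, paired = FALSE)

# R names the paired statistic "V" and the unpaired statistic "W".
print(signed_rank$statistic)
print(mann_whitney$statistic)
print(c(paired_p = signed_rank$p.value, unpaired_p = mann_whitney$p.value))
```

Pairing here is purely by position in the vectors, so the paired call only makes sense when row i of both samples really refers to the same unit.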

The results of the Wilcoxon test are a p-value less than 0.05 and a large statistic value. In this case we reject the null
hypothesis that the median difference between the paired samples is zero. (Note: R lists the W statistic I mentioned before as V.)

For the Mann-Whitney test the p-value is also less than 0.05, so we reject the null hypothesis. However, if we look at the value of the
U-statistic we can see it is reasonably close to 5e5 (n1*n2/2), the value expected when the two samples have the same distribution. From this we can gather that the null hypothesis should be rejected, but the medians are not all that far from one another.
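The comparison against n1*n2/2 can be sketched as follows. I am assuming 1000 observations per sample, which is what 5e5 = n1*n2/2 implies for equal sample sizes, and the samples themselves are hypothetical.

```r
# Hypothetical samples, 1000 points each so that n1 * n2 / 2 = 5e5.
set.seed(42)
a <- rnorm(1000, mean = 0, sd = 1)
b <- rnorm(1000, mean = 0.2, sd = 1)
n1 <- length(a)
n2 <- length(b)

# For unpaired data, the statistic R reports as W is the U statistic of the first sample.
u_stat <- unname(wilcox.test(a, b, paired = FALSE)$statistic)

# Without ties, U is the number of (a, b) pairs in which the a value is larger.
u_manual <- sum(outer(a, b, ">"))

# Value of U expected when neither sample tends to be larger
u_null <- n1 * n2 / 2

# U scaled to [0, 1]: 0.5 means no tendency either way
u_ratio <- u_stat / (n1 * n2)

print(c(U = u_stat, U_manual = u_manual, U_null = u_null, ratio = u_ratio))
```

The scaled value is a handy effect size: it estimates how often a randomly chosen value from a exceeds a randomly chosen value from b, so values near 0.5 go with "rejected, but not far apart".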

The functions in the above snippet perform the example I outlined previously. They compute the
difference between the means each time the data is shuffled. To make things a bit more compact I
first create a dataframe, ds, which contains the name of the original distribution and the values
from the original distribution.

We can see from the results that the estimated p-value for the permutation test is less than
0.05, indicating that the null hypothesis should be rejected. In the case of this test my null
hypothesis was the same as for a t-test: the means of the two distributions are equal.
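A permutation test of this kind can be sketched in a few lines of base R. This is not the original snippet: the helper mean_diff, the column names dist and value, the number of permutations, and the simulated data are all my own assumptions. Only the dataframe name ds and the idea of shuffling labels and recomputing the difference in means come from the description above.

```r
# Hypothetical data in the shape described in the text:
# one column naming the original distribution, one holding the values.
set.seed(42)
ds <- data.frame(
  dist  = rep(c("a", "b"), each = 1000),
  value = c(rnorm(1000, mean = 0, sd = 1), rnorm(1000, mean = 0.2, sd = 1))
)

# Difference in means between the values labeled "a" and those labeled "b"
mean_diff <- function(values, labels) {
  mean(values[labels == "a"]) - mean(values[labels == "b"])
}

observed <- mean_diff(ds$value, ds$dist)

# Shuffle the labels many times and recompute the difference each time
n_perm <- 2000
perm_diffs <- replicate(n_perm, mean_diff(ds$value, sample(ds$dist)))

# Two-sided p-value; the +1 counts the observed labeling as one permutation
p_value <- (sum(abs(perm_diffs) >= abs(observed)) + 1) / (n_perm + 1)
print(c(observed = observed, p_value = p_value))

# Permutation distribution with the observed difference as a dashed red line
hist(perm_diffs, xlim = range(c(perm_diffs, observed)),
     main = "Permuted differences in means", xlab = "Difference in means")
abline(v = observed, col = "red", lty = 2)
```

The +1 in the numerator and denominator is one common convention that keeps the estimated p-value from ever being exactly zero. Other conventions exist, so treat it as a design choice rather than the only option.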

The plot above shows the resulting distribution of the difference in means
from each permutation. The difference between the true sample means is shown by
the dashed red line. We can see that the permutation distribution doesn’t reach the
observed difference between the sample means, further reinforcing the rejection of
the null hypothesis.

Example 2: Multiple Feature Testing with Broom

The above example is great if we only have one feature we want to check, but what if I’m trying to determine which features have the same distribution in both datasets and which features have different distributions? We can do the same thing, but then we end up with a bunch of messy test output objects (lists). That is where the broom package comes into play. It takes the outputs of all our tests and puts them in a dataframe.
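To show what "messy list versus dataframe" means in practice, here is a small sketch on a single test. It assumes the broom package is installed, and the samples are hypothetical.

```r
library(broom)

# Hypothetical samples
set.seed(42)
a <- rnorm(1000, mean = 0, sd = 1)
b <- rnorm(1000, mean = 0.2, sd = 1)

# The raw result is a list-like "htest" object
raw_result <- ks.test(a, b)
str(raw_result, max.level = 1)

# tidy() turns it into a one-row data frame with columns
# such as statistic and p.value
tidy_result <- tidy(raw_result)
print(tidy_result)
```

Once every test result is a one-row data frame, results for many features can be stacked and filtered like any other table.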

First I’ll create a dataframe which has two features, one with a gamma distribution and one with a normal distribution. For the gamma-distributed feature, the parameters of the distribution vary depending on the label, a or b.
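A dataframe of this shape could be built as in the sketch below. The column names, sample size, and distribution parameters are all assumptions of mine. The only things taken from the post are that feature 1 is normal with the same parameters for both labels, and feature 2 is gamma with parameters that differ between a and b.

```r
# All names and parameters here are hypothetical.
set.seed(42)
n <- 500  # observations per label

df <- data.frame(
  # Feature 1: normal, same parameters for a and b
  feature_1 = rnorm(2 * n, mean = 10, sd = 2),
  # Feature 2: gamma, shape differs between a (first n rows) and b (last n rows)
  feature_2 = c(rgamma(n, shape = 2, rate = 1),
                rgamma(n, shape = 4, rate = 1)),
  # Dataset label: a = train, b = test
  label = rep(c("a", "b"), each = n)
)

head(df)
table(df$label)
```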

Below are the contents of the dataframe. Each row is an observation and each column is a variable; the first two columns are independent variables (features) and the last column is the label of the dataset they reside in (a is from the train dataset, b is from the test dataset). I didn’t create a dependent variable in this case as we are only concerned with checking if a given feature has a different distribution based on which dataset it came from.

In the above snippet I used the tidyr package to gather the feature columns so that I now have a feature_name column and a feature_value column. This structure will allow me to run my remaining analysis with broom.
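A reshaping step like the one just described might look like the following. The dataframe is a hypothetical rebuild, and I use pivot_longer here because the tidyr documentation lists gather as superseded by it; gather with the same column names should produce an equivalent long table.

```r
library(tidyr)

# Hypothetical wide data: two feature columns plus a dataset label
set.seed(42)
n <- 500
df <- data.frame(
  feature_1 = rnorm(2 * n, mean = 10, sd = 2),
  feature_2 = c(rgamma(n, shape = 2, rate = 1),
                rgamma(n, shape = 4, rate = 1)),
  label = rep(c("a", "b"), each = n)
)

# Stack the feature columns into feature_name / feature_value pairs,
# keeping the label column alongside each value
long_df <- pivot_longer(
  df,
  cols = c(feature_1, feature_2),
  names_to = "feature_name",
  values_to = "feature_value"
)

head(long_df)
```

With one row per (observation, feature) pair, grouping by feature_name is all it takes to run the same test on every feature.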

Shapiro-Wilk

The first test I’m going to run here is the Shapiro-Wilk test for
normality. The Shapiro-Wilk test is one of the available options for testing normality, along with the KS test (using a normal distribution for comparison) and the Anderson-Darling test.
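Running Shapiro-Wilk for every feature and label combination could be sketched as below, assuming dplyr, tidyr, and broom are installed. The data is a hypothetical rebuild, and group_modify is my choice of grouping verb rather than something taken from the original post.

```r
library(dplyr)
library(tidyr)
library(broom)

# Hypothetical long-format data: feature_name, feature_value, label
set.seed(42)
n <- 500
long_df <- data.frame(
  feature_1 = rnorm(2 * n, mean = 10, sd = 2),
  feature_2 = c(rgamma(n, shape = 2, rate = 1),
                rgamma(n, shape = 4, rate = 1)),
  label = rep(c("a", "b"), each = n)
) %>%
  pivot_longer(cols = c(feature_1, feature_2),
               names_to = "feature_name", values_to = "feature_value")

# One Shapiro-Wilk test per (feature, label) group, one tidy row per test
shapiro_results <- long_df %>%
  group_by(feature_name, label) %>%
  group_modify(~ tidy(shapiro.test(.x$feature_value))) %>%
  ungroup()

print(shapiro_results)
```

Note that shapiro.test only accepts between 3 and 5000 observations, so very large groups would need to be subsampled first.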

From the above test results we can gather that the null hypothesis should only be rejected for feature 2 (both the a and b samples). However, let’s assume each feature comes from a single distribution (a and b are the same, which we already know is not true for feature 2). We can draw a q-q plot for each feature and do a follow-up Shapiro-Wilk test on each feature as a whole (a and b combined).
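The q-q plots and the follow-up tests on the combined data can be produced with base R alone. This is a sketch on hypothetical data, with a normal feature 1 and a gamma feature 2 whose shape differs between the a and b halves.

```r
# Hypothetical features, a and b halves combined into one vector each
set.seed(42)
n <- 500
feature_1 <- rnorm(2 * n, mean = 10, sd = 2)
feature_2 <- c(rgamma(n, shape = 2, rate = 1),
               rgamma(n, shape = 4, rate = 1))

# Q-Q plot: points are sample quantiles, the line is the normal reference
qqnorm(feature_1, main = "Q-Q Plot of Feature 1 (a and b)")
qqline(feature_1)

qqnorm(feature_2, main = "Q-Q Plot of Feature 2 (a and b)")
qqline(feature_2)

# Follow-up Shapiro-Wilk tests on each combined feature
shapiro_1 <- shapiro.test(feature_1)
shapiro_2 <- shapiro.test(feature_2)
print(c(feature_1 = shapiro_1$p.value, feature_2 = shapiro_2$p.value))
```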

Q-Q Plot of Feature 1 (a and b)

In the q-q plot above the sample data runs along the diagonal (the points are from the sample data and
the line is from a normal distribution). This goes along with the result from the Shapiro-Wilk test, which
has a p-value > 0.05 for both breakdowns of feature 1, indicating that the null hypothesis (NH: the distribution being
tested is normal) should not be rejected.

Q-Q Plot of Feature 2 (a and b)

The q-q plot and Shapiro-Wilk test for feature 2 show a much different result. The test has a very small p-value, indicating the null hypothesis should be rejected. The q-q plot confirms this result, as we can see the sample data does not overlay the solid line representing a normal distribution.

The above results confirm what we already know, but I wanted to make sure I run through at least one test for normality prior to
moving on with the non-parametric tests.

The following snippets run through KS, Mann-Whitney, and a permutation test which checks the difference between the median values of the two distributions. Each test is implemented using the broom package along with dplyr so that I can group by feature and compare the distributions of the a and b labeled data.
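The grouped KS and Mann-Whitney tests might be sketched as follows, assuming dplyr, tidyr, and broom are installed. The data is a hypothetical rebuild, and the helper compare_ab is a name I made up for illustration.

```r
library(dplyr)
library(tidyr)
library(broom)

# Hypothetical long-format data: feature_name, feature_value, label
set.seed(42)
n <- 500
long_df <- data.frame(
  feature_1 = rnorm(2 * n, mean = 10, sd = 2),
  feature_2 = c(rgamma(n, shape = 2, rate = 1),
                rgamma(n, shape = 4, rate = 1)),
  label = rep(c("a", "b"), each = n)
) %>%
  pivot_longer(cols = c(feature_1, feature_2),
               names_to = "feature_name", values_to = "feature_value")

# Hypothetical helper: run a two-sample test on the a and b values of one feature
compare_ab <- function(group_df, test_fun, ...) {
  tidy(test_fun(group_df$feature_value[group_df$label == "a"],
                group_df$feature_value[group_df$label == "b"], ...))
}

# One row of KS results per feature
ks_results <- long_df %>%
  group_by(feature_name) %>%
  group_modify(~ compare_ab(.x, ks.test)) %>%
  ungroup()

# One row of Mann-Whitney results per feature
mw_results <- long_df %>%
  group_by(feature_name) %>%
  group_modify(~ compare_ab(.x, wilcox.test, paired = FALSE)) %>%
  ungroup()

print(ks_results)
print(mw_results)
```

Because both result tables share the feature_name, statistic, and p.value columns, they can be stacked or joined to compare the tests side by side.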

Result of Median Difference Permutation Test:

Null Hypothesis: The median difference between the distributions is zero.

Feature 1: Do not reject null hypothesis

Feature 2: Reject null hypothesis
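A median-difference permutation test per feature can be sketched in base R as below. The function name perm_median_test, the number of permutations, and the simulated data are all assumptions, so the p-values it prints are illustrative only.

```r
# Hypothetical data: normal feature 1, gamma feature 2 that differs by label
set.seed(42)
n <- 500
df <- data.frame(
  feature_1 = rnorm(2 * n, mean = 10, sd = 2),
  feature_2 = c(rgamma(n, shape = 2, rate = 1),
                rgamma(n, shape = 4, rate = 1)),
  label = rep(c("a", "b"), each = n)
)

# Hypothetical helper: two-sided permutation p-value for the difference in medians
perm_median_test <- function(values, labels, n_perm = 2000) {
  median_diff <- function(l) {
    median(values[l == "a"]) - median(values[l == "b"])
  }
  observed <- median_diff(labels)
  # Shuffle the labels and recompute the median difference each time
  perm_diffs <- replicate(n_perm, median_diff(sample(labels)))
  # The +1 counts the observed labeling as one of the permutations
  (sum(abs(perm_diffs) >= abs(observed)) + 1) / (n_perm + 1)
}

p_values <- c(
  feature_1 = perm_median_test(df$feature_1, df$label),
  feature_2 = perm_median_test(df$feature_2, df$label)
)
print(p_values)
```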

Conclusion for Example 2

Test outcomes:

Feature      Shapiro-Wilk     KS               Mann-Whitney     Permutation Test
Feature 1    Do Not Reject    Do Not Reject    Do Not Reject    Do Not Reject
Feature 2    Reject           Reject           Reject           Reject

The table above contains the results from the tests performed on the sample data. In this case all our tests reinforce what we already know about the dataset. That is not to say the tests will always agree, but with the broom package it is simple enough to run through several tests and compare the results.