This tip describes a method to verify equal distributions between your training and testing sets.

It is typical to split the data into training set and test set so that we train a data mining model from the training set and test the model with test set. This way, we can optimize the accuracy of a data mining model without the problem of over fitting. However, this will work only when the training set and test set are randomly split. For instance, if there is a strong pattern found only either in training set or test set, the optimization process won’t be able produce a model with the pattern.

There is a simple yet elegant way to know if the two set are randomly split by using data mining itself. We can simply create a decision tree to predict whether a case belongs to training set or test set based on all other input attributes. The expectation is that the decision should not have any strong pattern. Any strong pattern indicates that a significant group of cases belongs to one set than other set and therefore the sets are not random. The following shows an example of the process:

Suppose we have two tables: “TrainingSet” for training set and “TestingSet” for testing set.

Create an AS project.

Create a data source to point to the database where the TrainingSet and TestingSet tables exist.

Create a data source view and add the following named query:

SELECT *, 'Training' AS DataUsage FROM TrainingSet UNIONALL

SELECT *, 'Test' AS DataUsage FROM TestingSet

Create a mining structure with a decision tree model in it to predict DataUsage from all available columns except key.

Edit the mining parameters of the model so that it can find fine patterns.

Process the model by deploying the project.

Browse the mining model content (tree view). The following shows the content example:

From the example, we can clearly see that the are many significant patterns found between training set and test set. For instance, the distribution for “All” node (population: 4269 cases) shows that 75% are in training set, 25% are in test set while “Normalized Avg Asset Dwell Time < 134” node (population: 1708) shows that 82% are in training set and 18% are in test set. A more serious example is the node, “Dwell Time not = Missing” where all 116 cases appear only in test set.

The split method taken in the above example was based on a home-grown function that involves some identifier and a modulus operator, which typically ends up non-random split. As an extended experimentation, we applied SQL Server SSIS Percentage Transform to produce the training set and test set. The following tree shows the patterns found. Although it found a few patterns, none of them are significant; either the distributions between training/test set do not vary significantly and/or the support of the nodes is low.

Disclaimer: SQLServerDataMining.com is currently managed by members of the SQL Server Data Mining development team at Microsoft Corporation. It does not represent Microsoft’s official position on its products or technologies. All content is provided “AS-IS” with no warranties, and confers no rights.