Sampling Arbitrary Data

Generating data usually requires a variance-covariance matrix, which restricts the simulation to linear relationships between the variables. A linear assumption, however, can miss important nonlinear relationships. This post uses quantile random forests to simulate an arbitrary dataset.
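To see why a covariance matrix alone can miss structure, here is a minimal Python sketch (my own toy example, not part of the original analysis). The variable y is completely determined by x, yet the Pearson correlation, which is all a covariance matrix records, comes out as essentially zero. The helper `pearson` and the toy data are illustrative assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data (assumed for illustration): y is a deterministic, nonlinear
# function of x, with x placed symmetrically around zero.
xs = [i / 10 for i in range(-50, 51)]
ys = [x * x for x in xs]

r = pearson(xs, ys)
print(f"correlation between x and x^2: {r:.6f}")
```

A simulation driven only by this correlation would treat x and y as unrelated, even though knowing x pins down y exactly.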

MCMC Sampling: MCMC sampling is a method that can simulate from any multivariate density as long as one has the full conditionals (drawing each variable in turn from its full conditional is known as Gibbs sampling). Full conditionals in this case are models of a particular variable given all other variables. To make this concrete, suppose we have a dataset with n variables (x1, x2, …, xn). The estimated full conditionals in this case are:

f(x1 | x2, x3, …, xn)

f(x2 | x1, x3, …, xn)

⋮

f(xn | x1, x2, …, x(n-1))

Here f() is a machine learning algorithm of choice that can provide a distribution of values. In this case I use quantregForest() for continuous variables and randomForest() for categorical variables.
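The sampling loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind this post: instead of quantregForest() or randomForest(), the full conditional is approximated by a simple nearest-neighbour draw (`draw_conditional`, a hypothetical helper), and the toy data, `k`, and `burn_in` are made-up choices. Any model that can return a draw of one variable given the others could be dropped in at that point.

```python
import random

def draw_conditional(data, state, j, k, rng):
    """Draw a value for variable j given the other variables in `state`.

    Stand-in for an estimated full conditional f(xj | all other x): take the
    k rows of `data` closest to `state` in every coordinate except j, then
    return the j-th value of one of those rows chosen at random.
    Distances are unscaled, which is only reasonable for this toy data.
    """
    def dist(row):
        return sum((row[i] - state[i]) ** 2 for i in range(len(state)) if i != j)
    neighbours = sorted(data, key=dist)[:k]
    return rng.choice(neighbours)[j]

def gibbs_simulate(data, n_samples, k=5, burn_in=20, seed=0):
    """Simulate rows by cycling through the estimated full conditionals."""
    rng = random.Random(seed)
    state = list(rng.choice(data))      # start the chain at an observed row
    n_vars = len(state)
    samples = []
    for sweep in range(burn_in + n_samples):
        for j in range(n_vars):         # one sweep updates each variable in turn
            state[j] = draw_conditional(data, state, j, k, rng)
        if sweep >= burn_in:            # discard the first sweeps as burn-in
            samples.append(tuple(state))
    return samples

# Toy data (assumed): x2 is roughly x1 squared, a nonlinear relationship.
toy_rng = random.Random(1)
toy = []
for _ in range(200):
    x1 = toy_rng.uniform(-2, 2)
    toy.append((x1, x1 * x1 + toy_rng.gauss(0, 0.1)))

simulated = gibbs_simulate(toy, n_samples=100)
print(simulated[:3])
```

One limitation of this stand-in is that it can only reuse values already observed in the data, whereas a quantile-based model can return values between the observed ones.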

To check the simulation, a random forest model was built to predict whether an observation came from the simulated data or the original data. The resulting out-of-sample KS test was not significant, so we retain the null hypothesis that the model cannot distinguish between original and simulated data.
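For readers who want to see the mechanics of such a check, here is a hedged Python sketch. It assumes one plausible reading of the test: the classifier's held-out scores for original rows are compared with its scores for simulated rows using the two-sample KS statistic. The function names are my own, the p-value comes from a permutation test rather than the usual asymptotic formula, and the scores are randomly generated placeholders, not results from this post.

```python
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the
    empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        while i < len(a) and a[i] <= v:   # step both CDFs past the value v
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def permutation_p_value(a, b, n_perm=500, seed=0):
    """Share of random relabelings whose KS statistic is at least as large
    as the observed one (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Placeholder scores (assumed): both groups drawn from the same distribution,
# mimicking a classifier that cannot separate original from simulated rows.
score_rng = random.Random(42)
scores_original = [score_rng.random() for _ in range(50)]
scores_simulated = [score_rng.random() for _ in range(50)]

d = ks_statistic(scores_original, scores_simulated)
p = permutation_p_value(scores_original, scores_simulated)
print(f"KS statistic = {d:.3f}, permutation p-value = {p:.3f}")
```

A large p-value here means the score distributions are not detectably different, which is the sense in which the null hypothesis is retained.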

That is a good result, since it shows that a model cannot distinguish between simulated and original data. Visually, however, we can see a difference between the two datasets shown below.

Original Iris Data

Simulated Iris Data Set

The simulated dataset shows sharper boundaries between points than the original dataset does. For example, the scatterplot of Sepal.Length against Sepal.Width shows a sharp boundary at a Sepal.Length of approximately 5 in the simulated data, whereas the transition is more gradual in the original data.

Conclusion: This post discussed a method to simulate an arbitrary dataset. While I could not build a model that distinguishes the two datasets, they are visually distinguishable.