Wednesday, December 4, 2013

Python: bootstrapping with sklearn

sklearn.cross_validation.Bootstrap returns indices of random bootstrap samples from your data. I am not yet very familiar with using bootstrapping for two-sample comparison, so I'm using the means as a way to characterise the distributions (that's better than a 2D K-S test, which doesn't exist anyway!). The mean is of course an incomplete statistic: I have reason to suspect that two of the samples I am working with have different skewnesses. I'm working with three sets of absolute r magnitudes (M_r) here.
Here's how it looks in practice:

import numpy as np
from sklearn import cross_validation
#Let's call our sample array data. It can be N-dimensional.
#len(data) -- total number of data points in the dataset,
#nBoot -- number of bootstrap samples,
#train_size -- bootstrap sample size (a fraction of the whole sample, or an absolute number)
#Here we create an empty array to store the means of the bootstrap samples
means = np.empty((nBoot, ))
#then we get the class instance, drawing half of the sample with replacement every time
bs = cross_validation.Bootstrap(len(data), nBoot, train_size=0.5, random_state=0)
#Filling the means array while iterating over the bootstrap samples (indexed by train_index):
for i, (train_index, test_index) in enumerate(bs):
    means[i] = np.mean(data[train_index])
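
The same kind of resampling can also be sketched with plain NumPy, which is handy because later scikit-learn releases dropped the Bootstrap class. This is only a rough equivalent: it draws indices with replacement straight from the whole array and does not reproduce Bootstrap's train/test split. The function name bootstrap_means, the seed values and the synthetic data array are assumptions made up for illustration.

```python
import numpy as np

def bootstrap_means(data, n_boot, sample_size=None, seed=0):
    """Return the means of n_boot resamples drawn with replacement from data."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    if sample_size is None:
        sample_size = len(data)  # classic bootstrap: resample at full size
    means = np.empty(n_boot)
    for i in range(n_boot):
        # random indices into data, repeats allowed
        idx = rng.integers(0, len(data), size=sample_size)
        means[i] = data[idx].mean()
    return means

# Synthetic stand-in for an array of absolute magnitudes (made-up numbers)
data = np.random.default_rng(1).normal(loc=-21.0, scale=1.0, size=500)
# Draw half of the sample each time, as in the snippet above
means = bootstrap_means(data, n_boot=1000, sample_size=len(data) // 2)
```

The resulting means array can be passed to a histogram routine in the same way as the one filled in the loop above.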

I've repeated this for all three distributions I'm interested in, and here is a histogram of all the bootstrap sample means. The difference is just what I expected: some faint galaxies (with M_r < 20 or so) were manually rejected from the green and red distributions, so their means are shifted towards brighter absolute magnitudes. It remains to be seen how important this is. If it turns out to be worth it, I'll try bootstrapping other statistics of the distributions some other time.
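
One possible way to put a number on the shift between two of the samples is to bootstrap the difference of their means and read off a percentile interval. The sketch below is a minimal version of that idea, not what was done for the histogram: the function name bootstrap_mean_difference, the two synthetic samples and the 95% level are all assumptions chosen for illustration.

```python
import numpy as np

def bootstrap_mean_difference(a, b, n_boot=2000, seed=0):
    """Bootstrap distribution of mean(a) - mean(b), resampling each sample separately."""
    a, b = np.asarray(a), np.asarray(b)
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # resample each dataset with replacement, at its own full size
        ra = rng.choice(a, size=len(a), replace=True)
        rb = rng.choice(b, size=len(b), replace=True)
        diffs[i] = ra.mean() - rb.mean()
    return diffs

# Two made-up magnitude samples; the second is shifted towards brighter values
rng = np.random.default_rng(42)
sample_a = rng.normal(-21.0, 1.0, size=400)
sample_b = rng.normal(-21.3, 1.0, size=400)

diffs = bootstrap_mean_difference(sample_a, sample_b)
# Percentile interval for the difference of means (95% is an arbitrary choice)
lo, hi = np.percentile(diffs, [2.5, 97.5])
```

If the interval between lo and hi excludes zero, that suggests the shift in the means is not just resampling noise, although it says nothing about differences in shape such as skewness.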