Too tired to think straight and with very little time left, I came up with a primitive script to randomly spawn workloads for a series of experiments I was running. As it turns out, the natural order of the dataset [1] I am using needs a bit of randomisation: the records in the different SetSpecs have inconsistent structures. The moral of the story is that free-styling isn’t always appropriate 😉

I used the sample function from the random module… you’ll notice that I recursively swallow the entire dataset (~2 million records) into memory and then generate a random sample whose number of files equals the ‘bin’ value. I did mention that I was tired and not thinking straight; however, I did check to see if I was doing the right thing [2].
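For anyone who just wants the gist, here is a minimal sketch of the idea, not the actual script. It assumes the records are individual files under one root directory. The names `sample_files`, `root`, `bin_size` and `seed` are made up for illustration.

```python
import os
import random


def sample_files(root, bin_size, seed=None):
    """Return `bin_size` file paths drawn at random from under `root`."""
    # Recursively collect every file path; the whole list is held in memory.
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))

    # random.sample draws without replacement and raises ValueError
    # if bin_size is larger than the number of files found.
    rng = random.Random(seed)
    return rng.sample(paths, bin_size)
```

In this sketch only the paths are kept in memory, not the file contents, and passing a fixed `seed` makes a given workload repeatable across runs.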