It samples one in 100 records per block, which could result in under/over sampling of the data.

For extremely large inputs, 100 records per block results in over sampling of the data. These samples can then burden the single reducer which aggregates these samples.

Different operations need a different sampler. We thus need a generic sampling interface.

Proposed changes

An abstract "Sampler" class that defines the basic sampling operations. The existing RandomSampleLoader will be modified to extend the same.

Order By will continue to use the existing RandomSampleLoader where as SkewedJoin will define a new Sampler. The distinction is important since the sample rate is different between the two and the sample rate for skewed join will not be known during the compilation phase.

Skewed Join sampler will estimate the number of samples based on the size of the input.

Using a more uniform distribution for the skewed join sample loader instead of making it random. The distribution can be generated offline and stored in a file and later used by the sample loader to pick the samples.

Order By sampler

The existing order by samples the input data to get a sense of the distribution of ordering keys. Data is sampled 1 per 100 records by using RandomSampleLoader. This loader is a subclass of BinStorage and is able to skip all the records between samples. To calculate quantiles, the samples are divided into several partitions that is equal to the number of quantiles. The last record of each partition is retrieved to form a quantile list. It then counts the occurrence of the keys that fall in the quantile list for each partition. The probability that a key falls into a partition is then calculated and written to the quantile file along with the quantile list.

Skewed Join sampler

Since the frequency distribution of keys in the input is highly skewed, the underlying data can be modeled using a poisson distribution. The skewed join sampler tries to identify the keys that are too big to fit in memory and allocates reducers to those skewed keys. Given an input file of size N, we need to estimate the number of samples which represents this input.

The main purpose of a skewed join sampler is to come up with a reducer allocation map of the skewed keys. A custom slicer is used to estimate the number of maps that are required to run the sampler job. Although, the number of partitions provides us a base number of samples for the input, for an uniformly distributed random samples, we may be under sampling the data. Hence, we use a Poisson cumulative distribution function to estimate the total number of samples that are required to represent the underlying data.

To compute the base number of samples, we calculate the number of partitions of the file based on the amount of memory available and file size. For a 1TB file running on nodes having 512 mb of memory each:

Assuming we should atleast sample 1 record per partition, we end up with a base number of 4000 samples.

Estimating the number of samples

The probability that a partition has less than or equal to k samples is predicted by the Poisson cumulative distribution function. Although, the value of k needs to be experimented, a guidance value of 10 is obtained from various sources. A table of cumulative probabilities for a selected range of the sample rate (lambda) and the number of samples per partition is available here. From the table, for a 95% confidence and k (number of samples) set to 10, the sampling rate appears to be 17. Using these numbers, the number of samples that we need to obtain from the input is 68000.

Implementation

An abstract sampling class will define functions for getSamplingRate and skipinterval

The existing RandomSampleLoader will extend this sampling class. It will set the sampling rate to 100 which is similar to the existing implementation for order by.

Skewed join's sampler will extend the sampler class and set the sampling rate as described above.

The skip interval for the skewed join sampler will be uniform. In the future, this can be replaced by a uniformly distributed random interval.