In scientific (and other) applications, it is frequently necessary to classify a series of values into a number of bins. For instance, particles may be classified according to particle size into a number of bins of, say, 0.01 mm wide, yielding a histogram. Or, to take an example from my own work: satellite measurements taken all over the globe must often be classified into latitude/longitude boxes for further processing.

PDL has a dedicated function to make histograms, hist(). To create a histogram of particle size from 0 mm to 10 mm, in bins of 0.1 mm, you would write:

my $histogram = hist $particles, 0, 10, 0.1;

This will count the number of particles in every bin, yielding the 100 counts that form the histogram. But what if you wanted to perform other computations on the values in the bins? It is actually not that difficult to perform the binning by hand. The key is to associate a bin number with every data value. With fixed-size bins of 0.1 mm wide, that is accomplished with

my $bin_numbers = PDL::indx( $particles/0.1 );

(Note that the formulation above does not take care of data beyond 10 mm, but PDL::NDBin does.) We now have two arrays of data: the actual particle sizes in $particles, and the bin numbers associated with every data value in $bin_numbers. The histogram could now be produced with the following loop, $N being 100:
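A sketch of such a loop (illustrative; variable names as above):

```perl
my $histogram = zeroes( indx, $N );
for my $bin ( 0 .. $N-1 ) {
    # find the indices of the data values that fall in this bin ...
    my $want = which( $bin_numbers == $bin );
    # ... and store how many there are
    $histogram->set( $bin, $want->nelem );
}
```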

But, once we have the indices of the data values corresponding to any bin, it is a small matter to extend the loop to actually extract the data values in the bin. A user-supplied subroutine can then be invoked on the values in every bin:
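An illustrative extension of the loop above, $coderef holding the user-supplied subroutine:

```perl
for my $bin ( 0 .. $N-1 ) {
    my $want = which( $bin_numbers == $bin );
    # extract the data values that fall in this bin ...
    my $selection = $particles->index( $want );
    # ... and invoke the user-supplied subroutine on them
    $coderef->( $selection );
}
```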

(This is how early versions of PDL::NDBin were implemented.) The user subroutine could do anything with the values in the currently selected bin, $selection, including counting the number of elements. But the subroutine could also output the data to disk, or to a plot. Or the data could be collected to perform a regression. Anything that can be expressed with a subroutine, can now easily be plugged into this core loop.

This basic idea can even be extended by noticing that it is also possible to do multidimensional binning with the same core loop. The solution is to 'flatten' the bins, much like C and Perl flatten multidimensional arrays to a one-dimensional array in memory. So, you could perfectly bin satellite data along both latitude and longitude:
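A sketch of the flattening, assuming fixed-width boxes and bin numbers computed as before ($lat_step, $lon_step and $n_lon, the number of longitude bins, are assumed names):

```perl
my $lat_bins = PDL::indx( ( $latitude  +  90 ) / $lat_step );
my $lon_bins = PDL::indx( ( $longitude + 180 ) / $lon_step );
# combine the two bin numbers into one, row-major,
# much like C lays out a two-dimensional array in memory
my $flattened = $lat_bins * $n_lon + $lon_bins;
```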

$flattened now contains pseudo-one-dimensional bin numbers, and can be used in the core loop shown above.

I've left out many details to illustrate the idea. The basic idea is very simple, but the implementation does get a bit messy when multiple variables are binned in multiple dimensions, with user-defined actions. Of course, ideally, you'd like this to be fast, so you can handle several million data values without hitting memory constraints or running out of time. PDL::NDBin is there to handle the details for you, so you can write
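(a sketch; the axis bounds below are assumed, and ndbin() returns its results in an array reference)

```perl
my $result = ndbin(
    $longitude, -180, 180, 20,      # axis: min, max, step
    $latitude,   -90,  90, 20,
    vars => [ [ $flux => 'Avg' ] ], # average the flux in every bin
);
my $average_flux = $result->[0];
```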

to obtain the average of the flux, binned in boxes of 20x20 degrees latitude and longitude.

The rest of the documentation goes into more detail on the methods and implementation. You may also want to check out the examples (see EXAMPLES), or the comparison of PDL::NDBin with alternative solutions on CPAN (see "SEE ALSO" and further).

Please note that, although I do not anticipate major API changes, the interface and implementation are subject to change.

Axes may be binned either on a uniformly spaced grid or on a user-provided grid. If no specifications are given, data will be binned on a uniformly spaced grid automatically derived from the data.

Uniformly spaced grids are specified via a subset of the min, max, step, n, and round parameters, while a user-provided grid is specified via the grid parameter. Only specify the subset of parameters which exactly determines the grid. For example:
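A few illustrative axis specifications (the axis name and all parameter values are hypothetical):

```perl
# fully determined uniform grid: 10 bins of width 0.1 starting at 0
[ size => ( min => 0, step => 0.1, n => 10 ) ]

# uniform grid derived from the data: only the step is fixed
[ size => ( step => 0.1 ) ]

# user-provided grid: explicit bin boundaries
[ size => ( grid => [ 0, 0.1, 0.5, 1, 10 ] ) ]
```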

The width of the bins. Currently only a fixed step size is allowed, which means that all the bins have equal width. Optional; will be determined from the data range and the number of bins if not supplied.

The number of bins. Optional; will be determined from the data range and the step size if not supplied. If the step size is not supplied, n will be set to the number of data values, or to 100, whichever is smaller.

Either a piddle or a reference to an array containing the values of the bin boundaries. The first and last elements specify the minimum and maximum bin values, while the intermediate elements specify the common boundaries. A bin is inclusive of its lower bound and exclusive of its upper bound. For example, the following set of bins
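(an illustrative grid; the boundary values are hypothetical)

```perl
grid => [ 0, 1, 3, 6 ]
```

defines three bins, [0,1), [1,3) and [3,6], the last bin including its upper bound.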

The action to perform on this variable. May be either a code reference (a reference to a named or anonymous subroutine), a class name, or a hash reference. (See Actions under "IMPLEMENTATION NOTES" for more details.)

Only the name is required. All other specifications are optional and will be determined automatically as required. For a list of allowed axis specifications, consult add_axis(). Note that you cannot specify all specifications at the same time, because some may conflict.

At least one axis will eventually be required, although it needn't be specified at constructor time, and can be added later with add_axis(), if desired.

Set the piddles that will eventually be used for the axes and variables. Arguments must be specified as key-value pairs: for every piddle that is to be set, the key is its name, and the value is the piddle itself.

$binner->feed( latitude => $latitude, longitude => $longitude );

Note that not all piddles need be set in one call. This function can be called repeatedly to set all piddles. This can be very useful when data must be read from disk, as in the following example (assuming $nc is an object that reads data from disk):
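For instance (a sketch, with $nc->get() standing in for whatever read method your object provides):

```perl
$binner->feed( latitude  => $nc->get( 'latitude'  ) );
$binner->feed( longitude => $nc->get( 'longitude' ) );
$binner->feed( flux      => $nc->get( 'flux'      ) );
```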

Determine the following parameters for one axis automatically, if they have not been supplied by the user: the step size, the lowest bin, and the number of bins. Whatever combination of the user-supplied specifications and the data itself is needed will be used. Obviously, the piddles containing the data must have been set before calling this subroutine. Details of the automatic parameter calculation are given in the section on "IMPLEMENTATION NOTES" below.

It is not usually required to call this method, as it is called automatically by autoscale().

Determine the following parameters for all axes automatically, if they have not been supplied by the user: the step size, the lowest bin, and the number of bins. It will use whatever combination is needed of the specifications that have been supplied by the user, and the data itself. Obviously, the piddles containing the data must have been set before calling this subroutine. For more details on the autoscaling, consult autoscale_axis().

$binner->autoscale( x => $x, y => $y, z => $z );

autoscale() accepts, but does not require, arguments. They must be key-value pairs as for feed(), and indicate piddle data that must be fed into the object prior to autoscaling. Note that the autoscaling applies to all axes, not only to the axes whose data are supplied as arguments.

It is not usually required to call this method, as it is called automatically by process().

The core method. The actual piddles to be used for the axes and variables can be supplied to this function, although the argument list can be empty if all piddles have already been supplied. The argument list is the same as the one of feed(), i.e., a list of key-value pairs specifying name and piddle.

# if all piddles have already been set with feed()
$binner->process();

Return the output computed by the previous call(s) to process(). Each output variable is reshaped to make the number of dimensions equal to the number of axes, and the extent of each dimension equal to the number of bins along the axis.

The return value in list context is a hash, the keys and values of which correspond to the names and data of the variables. The return value in scalar context is a reference to this hash. When no variables have been supplied, a hash with a single key called histogram is returned.

my $result = $binner->output;
print $result->{average};

Note that it is not possible to call process() after having called output(), because the piddle data may have been reshaped.

PDL::NDBin provides the two functions ndbinning() and ndbin(), which are (almost) drop-in replacements for histogram() and hist(), except that they handle an arbitrary number of dimensions.

ndbinning() and ndbin() are actually wrappers around the object-oriented interface of PDL::NDBin, and may be the most convenient way to work with PDL::NDBin for simple cases. For more advanced usage, the object-oriented interface may be required.

Calculate an n-dimensional histogram from one or more piddles. The arguments must be specified (almost) like in histogram() and histogram2d(). That is, each axis must be followed by its three specifications step, min and n, being the step size, the minimum value, and the number of bins, respectively. The difference with histogram2d() is that the axis specifications follow the piddle immediately, instead of coming at the end.
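An illustrative two-dimensional call (the step, min and n values are assumed):

```perl
my $result = ndbinning(
    $x, 0.1, 0, 100,    # axis: step, min, n
    $y, 0.1, 0, 100,
);
my $histogram = $result->[0];
```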

If no variables are supplied, the behaviour of histogram() and histogram2d() is emulated, i.e., an n-dimensional histogram is produced. This function, although more flexible than the former two, is likely slower. If all you need is a one- or two-dimensional histogram, use histogram() and histogram2d() instead. Note that, when no variables are supplied, the returned histogram is of type indx (or long if your PDL doesn't have 64-bit support), in contrast with histogram() and histogram2d(). The histogramming is achieved by passing an action which simply counts the number of elements in the bin.

Unlike the output of output(), the resulting piddles are output as an array reference, in the same order as the variables passed in. There are as many output piddles as variables, and exactly one output piddle if no variables have been supplied. The output piddles take the type of the variables. All values in the output piddles are initialized to the bad value, so missing bins can be distinguished from zero.

Calculate an n-dimensional histogram from one or more piddles. The arguments must be specified like in hist(). That is, each axis may be followed by at most three specifications min, max, and step, being the minimum value, the maximum value, and the step size, respectively.

If no variables are supplied, the behaviour of hist() is emulated, i.e., an n-dimensional histogram is produced. This function, although more flexible than hist(), is likely slower. If all you need is a one-dimensional histogram, use hist() instead. Note that, when no variables are supplied, the returned histogram is of type indx (or long if your PDL doesn't have 64-bit support), in contrast with hist(). The histogramming is achieved by passing an action which simply counts the number of elements in the bin.

Unlike the output of output(), the resulting piddles are output as an array reference, in the same order as the variables passed in. There are as many output piddles as variables, and exactly one output piddle if no variables have been supplied. The output piddles take the type of the variables. All values in the output piddles are initialized to the bad value, so missing bins can be distinguished from zero.

(The table is also included in the example files, and is taken from John Fox, Applied Regression Analysis, Linear Models, and Related Methods, SAGE Publications, Inc., 1997). We will now write a script to compute the histogram of the Income field, in bins of 10 units wide.

use PDL;
use PDL::NDBin;

Note that loading PDL::NDBin does not automatically load PDL into your namespace, so you need to load PDL explicitly.
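The constructor call could look as follows (a sketch consistent with the hist() call further below: bins of width 10 covering 0 to 100):

```perl
my $binner = PDL::NDBin->new(
    axes => [ [ Income => ( step => 10, min => 0, n => 10 ) ] ],
    vars => [ [ Income => 'Count' ] ],
);
```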

First we build the object with a call to new(). Note that the same name can be used in both axes and variables (in this case, Income). step signifies the width of the bins. By associating the action Count with Income, we will produce a histogram of the elements in Income. (The action name is actually the name of a class in the PDL::NDBin::Action namespace.)

my( $prestige, $income, $education ) = rcols 'table';

Next, we read the data from the data file. The PDL function rcols() is very convenient to read tabular data of the kind shown above.

$binner->process( Income => $income );

The data is then 'fed' into the binning object, with a call to process(). Note that you need to specify the name that was given in the constructor call in order to associate the numerical data with the axis and variable.

my %results = $binner->output;
my $histogram = $results{Income};

We now recover the histogram with output(), which returns a hash with the results, keyed by name (again the same name as used in the constructor). To cross-check the number of elements in the bin with 40 <= income < 50, for instance, you could use the following awk(1) script:

$2 >= 40 && $2 < 50 { cnt++ }
END { print cnt }

Of course, for this very simple example, the histogram could as well be calculated with the following built-in function of PDL:

my $histogram = hist( $income, 0, 100, 10 );

If you'd rather print a stem-and-leaf plot, you could modify the constructor call as follows:
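For instance (a sketch):

```perl
my $binner = PDL::NDBin->new(
    axes => [ [ Income => ( step => 10, min => 0, n => 10 ) ] ],
    vars => [ [ Income => \&stem_and_leaf_plot ] ],
);
```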

Now the action associated with $income is no longer Count (which counts the elements in each bin), but a reference to the user-supplied subroutine stem_and_leaf_plot(). The latter could be implemented as shown below.
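One possible implementation (a sketch; it assumes integral-valued incomes):

```perl
sub stem_and_leaf_plot
{
    my $iter = shift;                  # a PDL::NDBin::Iterator object
    my $selection = $iter->selection;  # the elements in the current bin
    # print the bin number as the stem, and the last digit of
    # every element in the bin as the leaves
    printf "%2d | %s\n", $iter->bin,
        join ' ', map { int($_) % 10 } $selection->list;
}
```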

The only argument supplied to our callback stem_and_leaf_plot() is an object of the type PDL::NDBin::Iterator. This object is used to iterate over the bins of the variable ($income). With the method bin(), we can recover the current bin number. With selection(), we recover those elements of $income that fall in the current bin. Those elements are then printed in a neat list (retaining only the last digit).

This is a slightly more complicated example, where PDL::NDBin is used to average two-dimensional data in boxes of 1x1. Suppose you have elevation data of a particular area in the form of (x,y)-located samples:

(The data have been taken from Example 14 of the GMT Cookbook. You can find more information on GMT under "SEE ALSO".) Note that the samples are not distributed uniformly over the area. We want to compute the average elevation in boxes of 1 by 1, replacing multiple samples in any given box by the mean of those samples (e.g., prior to computing a surface through these points). When using the Generic Mapping Tools, you'd do it as follows:
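A sketch of the GMT invocation (the file name and region are assumed):

```shell
blockmean samples.xyz -R0/7/0/7 -I1 > averaged.xyz
```

With PDL::NDBin, the equivalent is set up with a constructor along these lines (an illustrative sketch; axis and variable names assumed):

```perl
my $binner = PDL::NDBin->new(
    axes => [ [ x => ( step => 1 ) ],
              [ y => ( step => 1 ) ] ],
    vars => [ [ x => 'Avg' ],
              [ y => 'Avg' ],
              [ z => 'Avg' ] ],
);
```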

The constructor call specifies two axes for two-dimensional binning, and will compute the average in each bin of three variables simultaneously: x- and y-coordinate, and elevation ($z). We need to average the coordinates, as we want to replace multiple points with a single, average point; that is why x and y appear in the axes as well as in the variables.

To produce a table with averaged data, proceed (roughly) like in the first example:
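Roughly (a sketch; wcols() writes the columns back out as text, here to standard output):

```perl
$binner->process( x => $x, y => $y, z => $z );
my %results = $binner->output;
wcols $results{x}->flat, $results{y}->flat, $results{z}->flat;
```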

The data are actual satellite data obtained with the GERB instrument (http://gerb.oma.be). The data are located by longitude, latitude, and the task at hand is to assign each sample to boxes of m degrees longitude by n degrees latitude, and then to average all samples belonging to any given box, as well as computing the standard deviation. An example of this kind of binning in Python is shown here. In awk(1), you could compute the average flux in the box bounded by -60 < longitude < -20 and -60 < latitude < -20 as follows:
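For instance (assuming columns of longitude, latitude and flux, in that order):

```awk
$1 > -60 && $1 < -20 && $2 > -60 && $2 < -20 { sum += $3; cnt++ }
END { if( cnt ) print sum/cnt }
```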

For the purpose of this example, the data sets have been stripped down very much, and the number of lat/lon boxes has been reduced greatly. A variant of this script is used to bin and average the samples for a complete month of data, totalling around 4GB of input data and more than 60 million samples.

In an application, a large volume of data would likely be spread over multiple data files. Suppose that the data are distributed over a number of .txt files (in a real application, a binary format would be preferred over plain text). The following loop then processes all files without loading the entire data volume into memory:
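A sketch of such a loop (the axis bounds are assumed; note the distinct variable names, all fed from $flux):

```perl
my $binner = PDL::NDBin->new(
    axes => [ [ longitude => ( step => 20, min => -180, n => 18 ) ],
              [ latitude  => ( step => 20, min =>  -90, n =>  9 ) ] ],
    vars => [ [ avg_flux   => 'Avg'    ],
              [ stdev_flux => 'StdDev' ],
              [ n_samples  => 'Count'  ] ],
);
for my $file ( glob '*.txt' ) {
    my ( $longitude, $latitude, $flux ) = rcols $file;
    $binner->process( longitude  => $longitude,
                      latitude   => $latitude,
                      avg_flux   => $flux,
                      stdev_flux => $flux,
                      n_samples  => $flux );
    # the data of this file are no longer needed and can be discarded
}
my %results = $binner->output;
```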

Note how the data are read from disk and immediately processed. After the call to process(), the data are no longer required, and can be discarded! The actions Avg, StdDev and Count (and also Sum which is not shown in this example) keep an internal state which allows them to 'chain' multiple calls to process(). Note how the same variable $flux is fed three times to three different actions in order to obtain its average, standard deviation, and count, respectively.

Another point to note in this example is that the optimized action classes Avg (and similar) are required for performance when processing large volumes of data. The average could in principle also be computed with a coderef:

avg => sub { shift->selection->avg }

Although the result will be the same, the computation will be much slower, since the call to selection() is very time-consuming.

All data equal to or less than the minimum (either supplied or automatically determined) will be binned in the lowest bin. All data equal to or larger than the maximum (either supplied or automatically determined) will be binned in the highest bin. This is a slight asymmetry, as all other bins contain their lower bound but not their upper bound. However, it does the right thing when binning floating-point data.

You are required to supply an action with every variable. An action can be a code reference (i.e., a reference to a subroutine, or an anonymous subroutine), the name of a class that implements the methods new(), process() and result(), or a hash reference.

The actions will be called in the order they are given for each bin, before proceeding to the next bin. You can depend on this behaviour, for instance, when you have an action that depends on the result of a previous action within the same bin.

In case the action specifies a code reference, this subroutine will be called with the following argument:

$coderef->( $iterator )

$iterator is an object of the class PDL::NDBin::Iterator, which will have been instantiated for you. Note that the action will be called once for every bin, with the given variable. The iterator must be used to retrieve information about the current bin and variable. With $iterator->selection(), for instance, you can access the elements that belong to this variable and this bin.

In case the action specifies a class name, an object of the class will be instantiated with

$object = $class->new( $N )

where $N signifies the total number of bins. The variables will be processed by calling

$object->process( $iterator )

where $iterator again signifies an iterator object. Results will be collected by calling

$object->result

The object is responsible for correct bin traversal, and for storing the result of the operation. The class must implement all three methods.

When supplying a class name instead of a code reference, it is possible to compute multiple bins at once in one call to process(). This can be much more efficient than calling the action for every bin, especially if the loop can be coded in XS.

Suppose an action invokes a misspelled method, say average() instead of the PDL method avg(). If the action were executed in an eval block, the typo would go unnoticed, and all values of the output piddle would be undefined. If you want to trap exceptions in actions, use a wrapper action defined as follows:
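A sketch of such a wrapper ($action holding the real action):

```perl
my $wrapped = sub {
    my $iter = shift;
    my @result = eval { $action->( $iter ) };
    # trap the exception and report it, instead of aborting the run
    warn 'action failed in bin ', $iter->bin, ": $@" if $@;
    return @result;
};
```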

By default, ndbin() will loop over all bins, and create a piddle per bin holding only the values in that bin. This piddle is accessible to your actions via the iterator object. This ensures that every action will only see the data in one bin at a time. You need to do this when, e.g., you are taking the average of the values in a bin with the standard PDL function avg(). However, the selection and extraction of the data is time-consuming. If you have an action that knows how to deal with indirection, you can do away with the selection and extraction. Examples of such actions are: PDL::NDBin::Action::Count, PDL::NDBin::Action::Sum, etc. They take the original data and the flattened bin numbers and produce an output piddle in one step.

Note that empty bins are not skipped. If you want to use an action that cannot handle empty piddles (e.g., PDL method min()), you can wrap the action as follows to skip empty bins:
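For example (a sketch wrapping PDL's min()):

```perl
my $min_action = sub {
    my $iter = shift;
    my $selection = $iter->selection;
    return unless $selection->nelem;   # skip empty bins
    return $selection->min;
};
```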

The range, when not given explicitly, is calculated from the data by calling min() and max() on the data. An exception will be thrown if the data range is zero. autoscale_axis() honours the round key to round bin boundaries to the nearest multiple of round.

The number of bins n, when not given explicitly, is determined automatically. If the step size is defined and positive, the number of bins is calculated from the range and the step size as discussed below. If neither the number of bins, nor the step size have been supplied by the user, the number of bins is taken equal to the number of data values, or equal to 100, whichever is smaller.

The calculation of the number of bins is based on the formula

n = range / step

but needs to be modified. First, n calculated in this way may well be fractional. When n is ultimately used in the binning, it is converted to integral type by truncation, so to have sufficient bins, n must be rounded up to the next integer. Second, the computation of n is, and should be, different for floating-point data and integral data.

For floating-point data, n is calculated as follows:

n = ceil( range / step )

The calculation is slightly different for integral data. When binning an integral number, say 4, it really belongs in a bin that spans the range 4 through 4.99...; to bin a list of data values with, say, min = 3 and max = 8, we must consider the range to be 9-3 = 6. A step size of 3 would yield 2 bins, one containing the values (3, 4, 5), and another containing the values (6, 7, 8). The correct formula for calculating the number of bins for integral data is therefore

n = ceil( (range+1) / step )

The modified formula for integral data values leads to more natural results, as the following example shows:
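Consider, for instance, the integers 0 through 9 binned with a step size of 3:

```
data: 0 1 2 3 4 5 6 7 8 9        (min = 0, max = 9, range = 9)

floating-point formula: n = ceil( 9/3 )     = 3  ->  (0 1 2) (3 4 5) (6 7 8 9)
integral formula:       n = ceil( (9+1)/3 ) = 4  ->  (0 1 2) (3 4 5) (6 7 8) (9)
```

With the floating-point formula, the value 9, being equal to the maximum, is lumped into the last bin, which then holds four values instead of three; the integral formula keeps all bins at most three values wide.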

The step size, when not given explicitly, is determined from the range and the number of bins n as follows:

step = range / n

for floating-point data, and

step = (range+1) / n

for integral data.

The step size may have a fractional part, even for integral data. For integral data, however, the step size must not be less than one. If it were, there would be more bins than distinct values in the data, and the function will abort.

Note that when n is not given either, a default value is substituted for it by PDL::NDBin, as described above.

The ability to process data piecewise means that the input data (i.e., the data points) required to produce the output (e.g., a histogram) do not have to be fed all at once. Instead, the input data can be fed in chunks of any size. The resulting output is of course identical, whether the input data be fed piecewise or all at once. However, the input data do not have to fit in memory all at once, which is very useful when dealing with very large data sets.

An example may help to understand this feature. Suppose you want to calculate the monthly mean cloud cover over an area of the globe, in boxes of 1 by 1 degree. The total amount of cloud cover data is too large to fit in memory, but fortunately, the data are spread over several files, one per day. With PDL::NDBin, you can do the following:
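A sketch (the file handling and the reader read_day() are hypothetical):

```perl
my $binner = PDL::NDBin->new(
    axes => [ [ longitude => ( step => 1, min => -180, n => 360 ) ],
              [ latitude  => ( step => 1, min =>  -90, n => 180 ) ] ],
    vars => [ [ cloud_cover => 'Avg' ] ],
);
for my $file ( @daily_files ) {
    # read one day's worth of data; read_day() is hypothetical
    my ( $longitude, $latitude, $cloud_cover ) = read_day( $file );
    $binner->process( longitude   => $longitude,
                      latitude    => $latitude,
                      cloud_cover => $cloud_cover );
}
my %results = $binner->output;
my $avg = $results{cloud_cover};
```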

In this example, only the data of a single day have to be kept in memory. The $binner object keeps a running average of the data, and retains the proper counts until the output $avg must be generated.

Only PDL::NDBin offers this feature. It can be simulated with other libraries for histograms, as long as histograms can be added together. PDL::NDBin extends the feature of piecewise data processing to sums, averages, and standard deviations.

Bad value support, when it is present, makes it possible to distinguish missing or invalid data from valid data. The missing or invalid data are excluded from the processing. Only the PDL-based libraries PDL and PDL::NDBin support bad values.

When data is co-located, e.g., cloud cover, cloud phase, and cloud optical thickness on a latitude-longitude grid, some time can be saved by binning the cloud variables together. Once the bin number has been determined for the given latitude and longitude, it can be reused for all cloud variables. This is marginally faster than binning the cloud variables separately. Only PDL::NDBin supports this feature.

PDL::NDBin can handle any type of calculation on the values in the bins that you can express in Perl or C, not only counting the number of elements in order to produce a histogram. At the time of writing (version 0.008), PDL::NDBin supports counting, summing, averaging, and taking the standard deviation of the values in each bin. Additionally, Perl or C subroutines can be defined and used to perform any operation on the values in each bin.

This feature, arguably the most important feature of PDL::NDBin, is not found in other modules.

Serialization is the process of storing a histogram to disk, or retrieving it from disk. Math::GSL::Histogram, Math::Histogram, Math::SimpleHisto::XS, and PDL all have built-in support for serialization. PDL::NDBin doesn't, but the serialization facilities of PDL can be used to store and retrieve data. (I usually store computed data in netCDF files with PDL::NetCDF.)

Data lower than the lower bound of the first bin, or higher than the upper bound of the last bin, are treated differently by different modules.

Math::GSL::Histogram ignores out-of-range values.

Math::Histogram and Math::SimpleHisto::XS have overflow bins, i.e., by default they create more bins than you define. These so-called overflow bins are situated at either end of every dimension. Out-of-range values end up in the overflow bins.

The histogramming functions of PDL, and PDL::NDBin, store low out-of-range values in the first bin, and high out-of-range values in the last bin.

To ignore out-of-range values with PDL::NDBin, define an additional bin at either end of every dimension, and disregard the values in these additional bins.

To simulate overflow and underflow bins with PDL::NDBin, define an additional bin at either end of every dimension.

Proc. means that the module has a procedural interface. OO means that the module has an object-oriented interface. PDL::NDBin has both. Which interface you should use is largely a matter of preference, unless you want to use advanced features such as piecewise data feeding, which require the object-oriented interface.

Math::GSL::Histogram has a somewhat awkward interface, requiring the user to explicitly deallocate the data structure after use.

Obviously, deep down, all data values are just C scalars. By 'native data type' is meant the data type used to communicate with the library in the most efficient way.

At the time of writing (Math::GSL version 0.27), Math::GSL::Histogram did not have a facility to enter multiple data points at once. It accepts only Perl scalars, and requires the user to input the data points one by one. Similarly, to produce the final histogram, the bins must be queried one by one.

Math::Histogram and Math::SimpleHisto::XS accept Perl arrays filled with values (although they also accept data points one by one as Perl scalars). Passing large amounts of data in an array is generally more efficient than passing the data points one by one as scalars.

PDL and PDL::NDBin operate on piddles only, which are memory-efficient, packed data arrays. This could be considered both an advantage and a disadvantage. The advantage is that the piddles can be operated on very efficiently in C. The disadvantage is that PDL is required!

In a weighted histogram, data points contribute by a fractional amount (or weight) between 0 and 1. All libraries, except PDL::NDBin, support weighted histograms. In PDL::NDBin, the weight of all data points is fixed at 1.

In PDL, threading is a technique to automatically loop certain operations over an arbitrary number of dimensions. An example is the sumover() operation, which calculates the row sum. It is defined over the first dimension only (i.e., the rows in PDL), but it will be looped automatically over all remaining dimensions. If the piddle is three-dimensional, for instance, sumover() will calculate the sum in every row of every matrix.

Threading is supported by the PDL functions histogram(), whistogram(), and their two-dimensional counterparts, but not by hist() or whist(). PDL::NDBin does not (yet) support threading.

In a histogram with variable-width bins, the width of the bins needn't be equal. This feature can be useful, for example, to construct bins on a logarithmic scale. Math::GSL, Math::Histogram, and Math::SimpleHisto::XS support variable-width bins; PDL does not, and is limited to fixed-width bins.

Since version 0.017, PDL::NDBin supports variable-width bins if a piddle or Perl array containing the bin boundaries is passed in via the grid parameter to axis specifications.

This section aims to give an idea of the performance of PDL::NDBin. Some of the most important features of PDL::NDBin aren't found in other modules on CPAN. But there are a few histogramming modules on CPAN, and it is interesting to examine how well PDL::NDBin does in comparison.

I've run a number of tests with PDL::NDBin version 0.008 on a laptop with an Intel i3 CPU running at 2.40 GHz, and on a desktop with an Intel i7 CPU running at 2.80 GHz and fast disks. The following table, obtained with 100 bins and a data file of 2 million data points, shows typical results on the laptop:

Although Math::GSL::Histogram is actually a wrapper around the C library GSL, its performance is rather low. The process of getting a large number of data points into Math::GSL::Histogram's data structures is inefficient, as the data points have to be input one by one.

Math::Histogram wraps another multidimensional histogramming library written in C. It allows inputting multiple data points at once. It is quite a bit faster than Math::GSL::Histogram, but does not offer the raw performance of PDL or of Math::Histogram's cousin Math::SimpleHisto::XS.

Math::SimpleHisto::XS, by the same author as Math::Histogram, is similar to the latter library, but implemented in XS for speed, and limited to one-dimensional histograms. It is slower than PDL::NDBin.

PDL's built-in functions hist() and histogram() are, on average, the fastest functions. Given that the core of these routines runs entirely in C, this is not very surprising. The PDL functions have very low overhead and are very memory-efficient.

Note that, in the tests, various data conversions between piddles and ordinary Perl arrays were required. The timings exclude these conversions, and count only the time required to produce a histogram from the "natural" data structure, i.e. piddles for PDL-based modules, and ordinary Perl arrays for the other modules.

Note also that the histograms produced by the different methods were verified to be equal.

Performance figures for a few tests on a particular machine don't say much. As PDL::NDBin is intended to handle large amounts of data, it is important to check how well PDL::NDBin's performance scales as the problem size increases.

The first and most obvious way in which a problem may be 'large' is the number of data points. If a given method cannot process a large number of data points, or can only do so with increased effort, it is not suitable for large problems. How large that is depends on the application, but in the field of satellite data retrieval (where I work), 33 million data points is not exceptional at all (but it is the largest size I could test). In this section, we examine how well PDL::NDBin's performance scales with the number of data points, and compare with alternative modules.

The following table shows timing data on the laptop for 100 bins, but with a variable number of data points:

Note that the tests couldn't be run with Math::GSL::Histogram, Math::Histogram, and Math::SimpleHisto::XS on the largest data file (33 million points), due to insufficient memory.

The methods show a linear increase in time per iteration with the number of data points, which translates to a fixed time per iteration per data point. This is the desired behaviour: it guarantees that the effort required to produce a histogram does not increase faster than the problem size. Every method examined here displays this behaviour.

Something else to note is the higher CPU time per iteration per data point of PDL::NDBin for small data files, although there is also a hint of this effect in the results for Math::Histogram. For large data files, the time per iteration per data point is more or less constant. This effect is not fully understood, but may indicate high overhead or start-up cost.

The results suggest that PDL::NDBin scales well with the number of data points, and that it is therefore well suited for large data. PDL::NDBin and histogram() (and hist()) are currently the only methods that allow processing very large data files.

The number of data points may not be the only way in which a problem may be 'large' or hard. The number of bins may also be high. In applications with satellite data, for instance, a latitude/longitude grid with a resolution of only 5 degrees already yields more than 2000 bins, and raising the resolution to 1 degree yields approximately 64,000 bins.

Most of the methods depend in some way on the number of bins. If the execution time depends to a significant extent on the number of bins, the method is not suitable for large numbers of bins. In this section, we examine how well PDL::NDBin's performance scales with the number of bins, and compare with alternative modules.

The following table shows timing data on the laptop for 2 million data points, with a variable number of bins:

Note that some data are missing because the associated test didn't run successfully (e.g., segmentation fault).

The methods show more or less constant execution time per iteration, independent of the number of bins. This is the desired behaviour: the overhead of managing the bins does not dominate the execution time.

At high bin counts, however, execution time increases for almost all methods, as there is always some amount of bookkeeping required proportional to the number of bins. Except for Math::GSL::Histogram, the bookkeeping is not prohibitive.

The results suggest that PDL::NDBin scales well with the number of bins up to 1,000,000, and that it is therefore well suited for large data.