Rethinking the Benchmark module

The first thing we instinctively do is wrap the code in TIMES.times with a large TIMES
constant. Surely the "errors" cancel each other out for large enough values of TIMES, right?
But how large is large enough?
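For reference, the naive version looks something like this. It's only a sketch: TIMES is picked arbitrarily (which is exactly the problem), and foo/bar are placeholder workloads standing in for whatever code is being measured:

    require 'benchmark'

    # placeholder workloads; any code under test would go here
    def foo; 1000.times { |i| i * i }; end
    def bar; (1..1000).reduce(:+); end

    TIMES = 10_000  # arbitrary: the whole problem is that we don't know how large it must be

    Benchmark.bm do |bm|
      bm.report("foo") { TIMES.times { foo } }
      bm.report("bar") { TIMES.times { bar } }
    end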

I wrote a simple AdaptativeBenchmark which works more or less like the
Benchmark in stdlib, but also decides how many times to repeat execution in order to
approach the average time with the desired precision for a given confidence
level:

    AdaptativeBenchmark.bm do |bm|
      # by default, approach the average with a 10% confidence interval for a 95%
      # confidence level
      bm.report("bar") { bar }
      bm.report("foo", :precision => 0.05, :confidence => 0.9) { foo }
    end

The AdaptativeBenchmark first estimates the sample variance, population
variance and population average by running the given block min_runs times (10 by
default), and then uses those initial estimates to compute how many iterations
are needed for the desired confidence interval and level. The initial number
of runs should probably be adjusted the same way the extra ones are, but I'm
not sure it's worth the effort.
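The adaptive loop boils down to something like the following sketch. This is not the actual AdaptativeBenchmark code: the adaptive_time name, the keyword parameters and the hard-coded z-value table are made up for illustration (a real implementation would compute the inverse normal CDF rather than tabulating a few levels).

    require 'benchmark'

    # Two-sided z-values for a few common confidence levels
    # (assumption: only these levels are supported in this sketch)
    Z = { 0.90 => 1.645, 0.95 => 1.960, 0.99 => 2.576 }

    def adaptive_time(min_runs: 10, precision: 0.1, confidence: 0.95, &blk)
      n = 0
      mean = 0.0
      m2 = 0.0                 # running sum of squared deviations (Welford)
      measure = lambda do
        t = Benchmark.realtime(&blk)
        n += 1
        delta = t - mean
        mean += delta / n
        m2 += delta * (t - mean)
      end
      # initial runs to get a first estimate of the average and variance
      min_runs.times { measure.call }
      variance = m2 / (n - 1)  # estimate of the population variance
      # runs needed so the confidence interval half-width is <= precision * mean
      needed = (Z[confidence] * Math.sqrt(variance) / (precision * mean))**2
      measure.call while n < needed.ceil
      mean
    end

    avg = adaptive_time(precision: 0.05, confidence: 0.95) { (1..1_000).reduce(:+) }
    puts "estimated average time: #{avg}s"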

The maths

In order to avoid storing all the sample values, the sample average is
computed with the recursive relation

$$\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}$$

and the sample variance with

$$s_n^2 = \frac{(n-1)\,s_{n-1}^2 + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)}{n}$$

the population variance being estimated as $\hat{\sigma}^2 = \frac{n}{n-1}\,s_n^2$.
The number of iterations is taken as

$$N = \left(\frac{z_{(1+P)/2}\,\hat{\sigma}}{\epsilon\,\bar{x}}\right)^2$$

where $\epsilon$ is the desired confidence interval (as a fraction of the population
average) and $P$ the confidence level, given as the :precision and :confidence parameters.
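As a made-up example of the magnitudes involved: with $P = 0.95$ (so $z_{0.975} \approx 1.96$), a precision of $\epsilon = 0.1$ and a measured coefficient of variation $\hat{\sigma}/\bar{x}$ of 0.5, this gives

$$N = \left(\frac{1.96 \times 0.5}{0.1}\right)^2 \approx 96$$

runs; halving $\epsilon$ to 0.05 quadruples that.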

Limitations

As said above, the initial estimates determine how many runs are deemed necessary, so
at times far too many will be scheduled, either because the variance happened to be high
during the first few runs or because the desired confidence interval is very narrow
(or the confidence level very high).

Computing the maximum likelihood estimates with the above recursive relations only makes sense
when there are lots of samples; for a handful of runs, the basic (non-recursive) definitions are probably more practical.
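The basic definitions referred to here are presumably just the standard textbook forms, computed over the stored sample values:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2.$$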