Tim Hochberg writes:
>           Overhead (c)  Overhead (nc)  TimePerElement (c)  TimePerElement (nc)
> NumPy            10 us          10 us               85 ps                95 ps
> NumArray        200 us         530 us               45 ps               135 ps
> Psymeric         50 us          65 us               80 ps                80 ps
> The times shown above are for Float64s and are pretty approximate, and
> they happen to be a particularly favorable array shape for Psymeric. I
> have seen Psymeric as much as 50% slower than NumPy for large arrays of
> certain shapes.
> The overhead for NumArray is surprisingly large. After doing this
> experiment I'm certainly more sympathetic to Konrad wanting less
> overhead for NumArray before he adopts it.
Wow! Do you really mean picoseconds? I never suspected that
either Numeric or numarray were that fast. ;-)
Anyway, this issue is timely [Err...]. As it turns out we started
looking at ways of improving small array performance a couple weeks
ago and are coming closer to trying out an approach that should
reduce the overhead significantly.
But I have some questions about your benchmarks. Could you show me
the code that is used to generate the above timings? In particular
I'm interested in the kinds of arrays that are being operated on.
It turns out that the numarray overhead depends on more than
just contiguity, and it isn't obvious to me which case you are testing.
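For reference, the Overhead and TimePerElement figures quoted above
presumably come from modelling the time of one operation on N elements
as overhead + per_element * N. Here is a minimal sketch of how the two
numbers could be separated from a set of measurements, assuming that
linear model holds. The function name fit_overhead and the sample
timings are made up for illustration; this is not Tim's benchmark code.

```python
def fit_overhead(sizes, times):
    """Least-squares fit of times ~= overhead + per_element * size.

    Returns (overhead, per_element) in the same time unit as `times`.
    """
    n = len(sizes)
    if n < 2 or n != len(times):
        raise ValueError("need at least two (size, time) pairs")
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    per_element = sxy / sxx                   # slope of the fitted line
    overhead = mean_y - per_element * mean_x  # intercept at size 0
    return overhead, per_element


# Illustrative timings in seconds, generated from a made-up
# 10 us overhead and 100 ns per element (not measured data).
sizes = [1, 1000, 100000]
times = [10e-6 + 100e-9 * s for s in sizes]
overhead, per_element = fit_overhead(sizes, times)
```

With real measurements the small sizes pin down the overhead and the
large sizes pin down the per-element time, so the sizes should span
several orders of magnitude.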
For example, Todd's benchmarks indicate that numarray's overhead is
about a factor of 5 larger than numpy when the input arrays are
contiguous and of the same type. On the other hand, if the array
is not contiguous or requires a type conversion, the overhead is
much larger. (Also, these cases require blocking loops over large
arrays; we have done nothing yet to optimize the block size or
the speed of that loop.) If you are doing the benchmark on
contiguous, same type arrays, I'd like to get a copy of the benchmark
program to try to see where the disagreement arises.
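To make the "blocking loop" case concrete: when an input needs a type
conversion, the operation can be run over fixed-size blocks, with each
block first copied into a small temporary buffer of the target type.
The following is only a pure-Python sketch of that general idea, not
numarray's actual implementation; blocked_add and block_size are
hypothetical names chosen for illustration.

```python
from array import array


def blocked_add(a, b, block_size=4):
    """Add a sequence of ints `a` to a sequence of floats `b`, block by block.

    Each block of `a` is converted into a temporary buffer of doubles
    before the inner loop runs, so the inner loop only sees
    same-type data.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if block_size < 1:
        raise ValueError("block_size must be positive")
    out = array("d", [0.0]) * len(a)  # preallocated output of doubles
    for start in range(0, len(a), block_size):
        stop = min(start + block_size, len(a))
        # Conversion step: copy this block into a temporary double buffer.
        tmp = array("d", (float(x) for x in a[start:stop]))
        # Inner loop over the converted block.
        for i in range(stop - start):
            out[start + i] = tmp[i] + b[start + i]
    return out


result = blocked_add([1, 2, 3, 4, 5], array("d", [0.5] * 5), block_size=2)
```

The per-block copy is extra work on top of the operation itself, which
is one reason the choice of block size matters for large arrays.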
The very preliminary indications are that we should be able to bring
numarray overheads down to approximately 3 times numpy's for all ufunc
cases. That's still slower, but not by a factor of 20 as shown above. How
much work it would take to reduce it further is unclear (the main
bottleneck at that point appears to be how long it takes to create
new output arrays).
We are still mainly in the analysis and design phase of how to
improve performance for small arrays and block looping. We believe
that this first step will not require moving very much of the
existing Python code into C (though some of it will be). Hopefully we
will have some working code in a couple weeks.
Thanks, Perry