On Feb 20, 2012, at 7:08 PM, Dag Sverre Seljebotn wrote:
> On 02/20/2012 09:34 AM, Christopher Jordan-Squire wrote:
>> On Mon, Feb 20, 2012 at 9:18 AM, Dag Sverre Seljebotn
>> <d.s.seljebotn@astro.uio.no> wrote:
>>> On 02/20/2012 08:55 AM, Sturla Molden wrote:
>>>> Den 20.02.2012 17:42, skrev Sturla Molden:
>>>>> There are still other options than C or C++ that are worth considering.
>>>>> One would be to write NumPy in Python. E.g. we could use LLVM as a
>>>>> JIT-compiler and produce the performance critical code we need on the fly.
>>>>
>>>> LLVM and its C/C++ frontend Clang are BSD licenced. It compiles faster
>>>> than GCC and often produces better machine code. They can therefore be
>>>> used inside an array library. It would give a faster NumPy, and we could
>>>> keep most of it in Python.
>>>
>>> I think it is moot to focus on improving NumPy performance as long as in
>>> practice all NumPy operations are memory bound due to the need to take a
>>> trip through system memory for almost any operation. C/C++ is simply
>>> "good enough". JIT is when you're chasing a 2x improvement or so, but
>>> today NumPy can be 10-20x slower than a Cython loop.
>>
>> I don't follow this. Could you expand a bit more? (Specifically, I
>> wasn't aware that numpy could be 10-20x slower than a cython loop, if
>> we're talking about the base numpy library--so core operations. I'm
>
> The problem with NumPy is the temporaries needed -- if you want to compute
>
>    A + B + np.sqrt(D)
>
> then, if the arrays are larger than cache size (a couple of megabytes),
> then each of those operations will first transfer the data in and out
> over the memory bus. I.e. first you compute an element of sqrt(D), then
> the result of that is put in system memory, then later the same number
> is read back in order to add it to an element in B, and so on.
>
> The compute-to-bandwidth ratio of modern CPUs is between 30:1 and
> 60:1... so in extreme cases it's cheaper to do 60 additions than to
> transfer a single number from system memory.
>
> It is much faster to only transfer an element (or small block) from each
> of A, B, and D to CPU cache, then do the entire expression, then
> transfer the result back. This is easy to code in Cython/Fortran/C and
> impossible with NumPy/Python.
>
> This is why numexpr/Theano exists.
Well, I can't speak for Theano (it is rather more general than numexpr, and more geared towards using GPUs, right?), but this was certainly the issue that led David Cooke to create numexpr. A more in-depth explanation of this problem can be found at:
http://www.euroscipy.org/talk/1657
which includes some graphical explanations.
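
To make the blocking idea quoted above a bit more concrete, here is a minimal sketch in plain NumPy. It is only an illustration, not how numexpr is actually implemented: the function names `naive` and `blocked` and the `block` parameter are hypothetical, and the code assumes 1-D floating-point arrays of equal length. The first version lets NumPy build full-size temporaries; the second evaluates the same expression one small block at a time, reusing a scratch buffer that should fit in cache.

```python
import numpy as np


def naive(A, B, D):
    # Each operator makes a full pass over the arrays and
    # allocates a full-size temporary for its result.
    return A + B + np.sqrt(D)


def blocked(A, B, D, block=4096):
    # Hypothetical helper: evaluate A + B + sqrt(D) block by block.
    # Assumes 1-D floating-point arrays of the same length.
    out = np.empty_like(A)
    tmp = np.empty(block, dtype=A.dtype)  # small reusable scratch buffer
    for start in range(0, A.size, block):
        stop = min(start + block, A.size)
        o = out[start:stop]        # view into the result, no copy
        t = tmp[:stop - start]     # last block may be shorter
        np.add(A[start:stop], B[start:stop], out=o)
        np.sqrt(D[start:stop], out=t)
        np.add(o, t, out=o)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, B, D = (rng.random(1_000_000) for _ in range(3))
    print(np.allclose(naive(A, B, D), blocked(A, B, D)))
```

Whether the blocked version is actually faster in pure Python depends on the block size and on the interpreter overhead of the loop, so treat it as a picture of the memory access pattern rather than as a benchmark.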
-- Francesc Alted