On 06/15/2011 07:25 PM, Sturla Molden wrote:
> Den 15.06.2011 23:22, skrev Christopher Barker:
>> I think the issue got confused -- the OP was not looking to speed up a
>> matrix multiply, but rather to speed up a whole bunch of independent
>> matrix multiplies.
> I would do it like this:
>> 1. Write a Fortran function that make multiple calls DGEMM in a do loop.
> (Or Fortran intrinsics dot_product or matmul.)
>> 2. Put an OpenMP pragma around the loop (!$omp parallel do). Invoke the
> OpenMP compiler on compilation. Use static or guided thread scheduling.
>> 3. Call Fortran from Python using f2py, ctypes or Cython.
>> Build with a thread-safe and single-threaded BLAS library.
>> That should run as fast as it gets.
>> Sturla
> _______________________________________________
> NumPy-Discussion mailing list
>NumPy-Discussion@scipy.org>http://mail.scipy.org/mailman/listinfo/numpy-discussion
The idea was to calculate:
innerProduct =numpy.sum(array1*array2) where array1 and array2 are, in
general, multidimensional.
Numpy has an inner product function called np.inner
(http://www.scipy.org/Numpy_Example_List_With_Doc#inner) which is a
special case of tensordot. See the documentation for what is does and
other examples. Also note that np.inner provides of the cross-products
as well. For example, you can just do:
>>> import numpy as np
>>> a=np.arange(10).reshape(2,5)
>>> a
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
>>> b=a*10
>>> b
array([[ 0, 10, 20, 30, 40],
[50, 60, 70, 80, 90]])
>>> c=a*b
>>> c
array([[ 0, 10, 40, 90, 160],
[250, 360, 490, 640, 810]])
>>> c.sum()
2850
>>>d=np.inner(a,b)
>>>d
array([[ 300, 800],
[ 800, 2550]])
>>>np.diag(d).sum()
2850
I do not know if numpy's multiplication, np.inner() and np.sum() are
single-threaded and thread-safe when you link to multi-threaded blas,
lapack or altas libraries. But if everything is *single-threaded* and
thread-safe, then you just create a function and use Anne's very useful
handythread.py (http://www.scipy.org/Cookbook/Multithreading). Otherwise
you would have to determine the best approach such doing nothing
(especially if the arrays are huge), or making the functions single-thread.
By the way, if the arrays are sufficiently small, there is a lot of
overhead involved such that there is more time in communication than
computation. So you have to determine the best algorithm for you case.
Bruce