Well I have proof we're going things right! And it also shows the
power of MKL...
I checked with EPD and my python snippet (see also my first email),
and all 4 timings are the same...
However, if this behavior is the right one, then my python snippet
should get the same trend. But it's not.
So I though that indeed MKL could be tricking me, so I built scipy and
time again with the same config (ATLAS 3.10).
I call the 'dot' function I put on pastebin as dot and here are the timings:
In [8]: %timeit dot(big, big)
10 loops, best of 3: 133 ms per loop
In [9]: %timeit dot(big.T, big.T)
10 loops, best of 3: 133 ms per loop
In [10]: %timeit dot(big.T, big)
10 loops, best of 3: 97.9 ms per loop
In [11]: %timeit dot(big, big.T)
10 loops, best of 3: 96.4 ms per loop
So you guys were right, it's cache friendly.
We're doing the right thing, and the differences are most likely not
significant (or the snippet is wrong too)!
In [14]: np.sqrt(((dot(big, big.T) - np.dot(big, big.T))**2).sum())
Out[14]: 0.015331409
In [15]: np.sqrt(((dot(big, big) - np.dot(big, big))**2).sum())
Out[15]: 0.0
All this to say that we are getting a 2x speed improvement on
a.dot(a.T), which is awesome.
-nicolas
Nath: same trend if arrays are different.
On Fri, Nov 9, 2012 at 3:05 PM, Nathaniel Smith <njs@pobox.com> wrote:
> In this case it's even possible the blas is being smart enough to notice
> that you've passed it the same pointer twice with the transpose switch on
> one of them, and is just skipping half the multiplies since the output is
> provably symmetric. Don't know if they actually do that but...
>> -n
>> On 9 Nov 2012 22:57, "Matthieu Brucher" <matthieu.brucher@gmail.com> wrote:
>>>> Hi,
>>>> A.A slower than A.A' is not a surprise for me. The latter is far more
>> cache friendly that the former. Everything follows cache lines, so it is
>> faster than something that will use one element from each cache line. In
>> fact it is exactly what "proves" that the new version is correct.
>> Good job (if all the tests were made and still pass ;) )
>>>> Cheers,
>>>> Matthieu
>>>>>> 2012/11/9 Nicolas SCHEFFER <scheffer.nicolas@gmail.com>
>>>>>> Ok: comparing apples to apples. I'm clueless on my observations and
>>> would need input from you guys.
>>>>>> Using ATLAS 3.10, numpy with and without my changes, I'm getting these
>>> timings and comparisons.
>>>>>> #
>>> #I. Generate matrices using regular dot:
>>> #
>>> big = np.array(np.random.randn(2000, 2000), 'f');
>>> np.savez('out', big=big, none=big.dot(big), both=big.T.dot(big.T),
>>> left=big.T.dot(big), right=big.dot(big.T))"
>>>>>> #
>>> #II. Timings with regular dot
>>> #
>>> In [3]: %timeit np.dot(big, big)
>>> 10 loops, best of 3: 138 ms per loop
>>>>>> In [4]: %timeit np.dot(big, big.T)
>>> 10 loops, best of 3: 166 ms per loop
>>>>>> In [5]: %timeit np.dot(big.T, big.T)
>>> 10 loops, best of 3: 193 ms per loop
>>>>>> In [6]: %timeit np.dot(big.T, big)
>>> 10 loops, best of 3: 165 ms per loop
>>>>>> #
>>> #III. I load these arrays and time again with the "fast" dot
>>> #
>>> In [21]: %timeit np.dot(big, big)
>>> 10 loops, best of 3: 138 ms per loop
>>>>>> In [22]: %timeit np.dot(big.T, big)
>>> 10 loops, best of 3: 104 ms per loop
>>>>>> In [23]: %timeit np.dot(big.T, big.T)
>>> 10 loops, best of 3: 138 ms per loop
>>>>>> In [24]: %timeit np.dot(big, big.T)
>>> 10 loops, best of 3: 102 ms per loop
>>>>>> 1. A'.A': great!
>>> 2. A.A' becomes faster than A.A !?!
>>>>>> #
>>> #IV. MSE on differences
>>> #
>>> In [25]: np.sqrt(((arr['none'] - none)**2).sum())
>>> Out[25]: 0.0
>>>>>> In [26]: np.sqrt(((arr['both'] - both)**2).sum())
>>> Out[26]: 0.0
>>>>>> In [27]: np.sqrt(((arr['left'] - left)**2).sum())
>>> Out[27]: 0.015900515
>>>>>> In [28]: np.sqrt(((arr['right'] - right)**2).sum())
>>> Out[28]: 0.015331409
>>>>>> #
>>> # CCl
>>> #
>>> While the MSE are small, I'm wondering whether:
>>> - It's a bug: it should be exactly the same
>>> - It's a feature: BLAS is taking shortcuts when you have A.A'. The
>>> difference is not significant. Quick: PR that asap!
>>>>>> I don't have enough expertise to answer that...
>>>>>> Thanks much!
>>>>>> -nicolas
>>> On Fri, Nov 9, 2012 at 2:13 PM, Nicolas SCHEFFER
>>> <scheffer.nicolas@gmail.com> wrote:
>>> > I too encourage users to use scipy.linalg for speed and robustness
>>> > (hence calling this scipy.dot), but it just brings so much confusion!
>>> > When using the scipy + numpy ecosystem, you'd almost want everything
>>> > be done with scipy so that you get the best implementation in all
>>> > cases: scipy.zeros(), scipy.array(), scipy.dot(), scipy.linalg.inv().
>>> >
>>> > Anyway this is indeed for another thread, the confusion we'd like to
>>> > fix here is that users shouldn't have to understand the C/F contiguous
>>> > concepts to get the maximum speed for np.dot()
>>> >
>>> > To summarize:
>>> > - The python snippet I posted is still valid and can speed up your
>>> > code if you can change all your dot() calls.
>>> > - The change in dotblas.c is a bit more problematic because it's very
>>> > core. I'm having issues right now to replicate the timings, I've got
>>> > better timing for a.dot(a.T) than for a.dot(a). There might be a bug.
>>> >
>>> > It's a pain to test because I cannot do the test in a single python
>>> > session.
>>> > I'm going to try to integrate most of your suggestions, I cannot
>>> > guarantee I'll have time to do them all though.
>>> >
>>> > -nicolas
>>> > On Fri, Nov 9, 2012 at 8:56 AM, Nathaniel Smith <njs@pobox.com> wrote:
>>> >> On Fri, Nov 9, 2012 at 4:25 PM, Gael Varoquaux
>>> >> <gael.varoquaux@normalesup.org> wrote:
>>> >>> On Fri, Nov 09, 2012 at 03:12:42PM +0000, Nathaniel Smith wrote:
>>> >>>> But what if someone compiles numpy against an optimized blas (mkl,
>>> >>>> say) and then compiles SciPy against the reference blas? What do you
>>> >>>> do then!? ;-)
>>> >>>
>>> >>> This could happen. But the converse happens very often. What happens
>>> >>> is
>>> >>> that users (eg on shared computing resource) ask for a scientific
>>> >>> python
>>> >>> environment. The administrator than installs the package starting
>>> >>> from
>>> >>> the most basic one, to the most advanced one, thus starting with
>>> >>> numpy
>>> >>> that can very well build without any external blas. When he gets to
>>> >>> scipy
>>> >>> he hits the problem that the build system does not detect properly
>>> >>> the
>>> >>> blas, and he solves that problem.
>>> >>>
>>> >>> Also, it used to be that on the major linux distributions, numpy
>>> >>> would not
>>> >>> be build with an optimize lapack because numpy was in the 'base' set
>>> >>> of
>>> >>> packages, but not lapack. On the contrary, scipy being in the
>>> >>> 'contrib'
>>> >>> set, it could depend on lapack. I just checked, and this has been
>>> >>> fixed
>>> >>> in the major distributions (Fedora, Debian, Ubuntu).
>>> >>>
>>> >>> Now we can discuss with such problems should not happen, and put the
>>> >>> blame on the users/administrators, the fact is that they happen
>>> >>> often. I
>>> >>> keep seeing environments in which np.linalg is unreasonnably slow.
>>> >>
>>> >> If this is something that's been a problem for you, maybe we should
>>> >> start another thread on things we could do to fix it directly? Improve
>>> >> build instructions, advertise build systems that set up the whole
>>> >> environment (and thus do the right thing), make numpy's setup.py
>>> >> scream and yell if blas isn't available...?
>>> >>
>>> >> -n
>>> >> _______________________________________________
>>> >> NumPy-Discussion mailing list
>>> >> NumPy-Discussion@scipy.org>>> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>>NumPy-Discussion@scipy.org>>>http://mail.scipy.org/mailman/listinfo/numpy-discussion>>>>>>>>>> --
>> Information System Engineer, Ph.D.
>> Blog: http://matt.eifelle.com>> LinkedIn: http://www.linkedin.com/in/matthieubrucher>> Music band: http://liliejay.com/>>>>>> _______________________________________________
>> NumPy-Discussion mailing list
>>NumPy-Discussion@scipy.org>>http://mail.scipy.org/mailman/listinfo/numpy-discussion>>>> _______________________________________________
> NumPy-Discussion mailing list
>NumPy-Discussion@scipy.org>http://mail.scipy.org/mailman/listinfo/numpy-discussion>