For 4 cores, on your system, your conclusion makes some sense. That
said, I played around with this on both a Core 2 Duo and the 12-core
system. On the 12-core system, in my tests the 0-thread case ran extremely
close to the 2-thread case for all my sizes.
The Core 2 Duo runs Windows 7, and after downloading pthreadGC2.dll
from the pthreads project, I was able to use OpenMP under a year-old
(32-bit) Python(x,y) distribution with MinGW. The result: the 0-thread
case came in slightly faster than one thread, 0.00102 versus 0.00106,
and 2 threads took 0.00060.
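To put those three Core 2 Duo numbers in proportion, here is a small sketch of the arithmetic in Python. The function name `scaling_summary` is my own, purely for illustration; the timings are the ones quoted above, and since only ratios are computed, the time unit does not matter.

```python
def scaling_summary(t_zero, t_one, t_two, n_threads=2):
    """Return (speedup, efficiency, overhead) from three timings.

    speedup    : 1-thread time divided by the n-thread time
    efficiency : speedup divided by the number of threads
    overhead   : fractional cost of the 1-thread case over the 0-thread case
    """
    speedup = t_one / t_two
    efficiency = speedup / n_threads
    overhead = (t_one - t_zero) / t_zero
    return speedup, efficiency, overhead


# Timings quoted above for the Core 2 Duo (0, 1 and 2 threads).
speedup, efficiency, overhead = scaling_summary(0.00102, 0.00106, 0.00060)

print(f"speedup with 2 threads: {speedup:.2f}x")      # 1.77x
print(f"parallel efficiency:    {efficiency:.0%}")    # 88%
print(f"1-thread overhead:      {overhead:.1%}")      # 3.9%
```

So on that machine two threads give roughly 1.77x over one thread, and the one-thread case costs about 4% more than the 0-thread case.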
My current theory is that gcc under Linux uses some background trick to
get two thread-like streams going. As I assess scale-up under Linux, I
will need to consider this behavior.
Creating optimal code with OpenMP certainly requires a considerable
commitment. Given the problem-specific fine tuning required, I would not
expect much gain in general-purpose routines. In specific routines like
cdist, it might make more sense. I talked to a Dell HPC rep today, and
he said that squeezing out an extra 15% performance boost on an Intel
CPU was a pleasant surprise, so the 30% improvement is maybe not so bad.
Cheers,
Eric