On 10/7/2010 6:52 PM, Matthew Brett wrote:
> Hi,
>> On Thu, Oct 7, 2010 at 3:47 PM, Andrew P. Mullhaupt
> <doc@zen-pharaohs.com> wrote:
>>> Most machines now
>>> and in the future are not going to choke on these issues (for a variety
>>> of reasons).
> 'Talk is cheap, show me the code' .
Yes, let's.
First we want to initialize memory mapped arrays which will be used for
the variables. This python script does that:
from numpy import *
N = 819200
X = random.uniform(0, 1, 2*N)
Y = random.uniform(0, 1, 2*N)
Z = random.uniform(0, 1, 2*N)
p = random.uniform(0, 1, N)
q = random.uniform(0, 1, N)
r = random.uniform(0, 1, N)
s = random.uniform(0, 1, N)
w = random.uniform(0, 1, N)
z = random.uniform(0, 1, N)
save('X', X)
save('Y', Y)
save('Z', Z)
save('p', p)
save('q', q)
save('r', r)
save('s', s)
save('w', w)
save('z', z)
Note that we have two arrays of length 2*N representing complex arrays
with interleaved real and imaginary parts, and we have four arrays of
length N representing the segregated real and complex parts.
We also create arrays for the sums; these are not sparse (since we are
concerned with the case where there actually are imaginary parts that
would require memory bandwidth).
Since we want to check the memory bandwidth we need a test which
essentially shovels these arrays through memory very quickly. I have
chosen to produce a result to ensure that everything actually gets done
- although this is probably not strictly necessary, why not.
First the interleaved version:
from numpy import *
X = load('X.npy', mmap_mode='r')
Y = load('Y.npy', mmap_mode='r')
Z = load('Z.npy', mmap_mode='r+')
N = Z.size / 2
its = int(8192*100000 / N)
for k in range(its) :
Z[:] = X + Y
print Z.max()
Then the segregated version:
from numpy import *
p = load('p.npy', mmap_mode='r')
q = load('q.npy', mmap_mode='r')
r = load('r.npy', mmap_mode='r')
s = load('s.npy', mmap_mode='r')
w = load('w.npy', mmap_mode='r+')
z = load('z.npy', mmap_mode='r+')
N = s.size
its = int(8192*100000 / N)
for k in range(its) :
w[:] = p + r
z[:] = s + q
print maximum(w.max(), z.max())
Note that the segregated version does not "cheat" by computing p + q
and r + s which presumably span less address space. In fact, we would
not expect that timing to be too much different.
Now lets conduct the interleaved timing:
andrew@flyer:/media/fraid/andrew/junk$ time python interleave.py
1.99960173329
real 0m6.438s
user 0m6.400s
sys 0m0.020s
and the segregated one:
andrew@flyer:/media/fraid/andrew/junk$ time python segregate.py
1.99858982651
real 0m6.198s
user 0m6.140s
sys 0m0.050s
Well OK, I would describe this as pretty close. Oddly enough the
segregated version is faster.
Another pair of runs with larger files:
andrew@flyer:/media/fraid/andrew/junk$ time python interleave.py
1.99995091618
real 0m11.480s
user 0m6.070s
sys 0m4.400s
andrew@flyer:/media/fraid/andrew/junk$ time python segregate.py
1.99989270428
real 0m12.865s
user 0m7.250s
sys 0m4.330s
Here we end up with another close run, but the segregated version is
slower.
So yeah, here is code you can use to see if your machine chokes on the
old sort of issues (which were very common in early RISC workstation
architectures - that awful TLB on the RS/6000 series springs immediately
to mind).
You can strobe through file sizes if you want to try and tickle one of
those weirder creases in memory architecture on your machine, but you
will probably not find anything that blatant. People who design
computers have been through this rodeo a couple times.
Best regards,
Andrew Mullhaupt