We're hoping to use yt parallel volume rendering on a very large generic
brick - it's a simple rectangular unigrid slab, but containing something
like 1.5e11 points, so much too large for load_uniform_grid() to load
into memory on a single machine.

I imagine it wouldn't be hard to do the domain decomposition by hand,
loading a different chunk of grid into each MPI process. But then
what? What would it take to invoke the volume renderer on each piece
and composite them together? Would it help if the chunks were stored
in a KDTree? Is there some example (one of the existing data loaders?)
which I could follow?
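
For the by-hand decomposition asked about above, one simple starting point is to cut the brick into slabs along its slowest-varying axis, one slab per MPI rank. This is only a sketch under that assumption: `slab_bounds` is a hypothetical helper, not part of yt, and it says nothing about how yt's own kd-tree would split the grid.

```python
def slab_bounds(n_cells, n_ranks, rank):
    """Return the half-open [start, stop) range of planes owned by `rank`.

    Planes are spread as evenly as possible; the first `extra` ranks
    each take one additional plane.
    """
    base, extra = divmod(n_cells, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Example: 10 planes split across 4 ranks.
bounds = [slab_bounds(10, 4, r) for r in range(4)]
print(bounds)  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

In an MPI job the `rank` and `n_ranks` values would come from the communicator; here they are plain integers so the arithmetic can be checked on its own.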

At that point, there are two possible routes that both may or may not work
well.

1) Just try rendering with ~256-512 cores, and the AMRKDTree should try to
geometrically split the grid before performing any I/O.
or
2) Use load_uniform_grid with the keyword nprocs=N (for this size
simulation, you probably need something like 256-1024 processors depending
on the memory per core). This should do the equivalent thing to (1), but it
may hit the I/O here instead of in the kd-tree.

I think (1) should be your best option, but I haven't tried rendering
this large of a single-grid output.

When you build the camera object, definitely start out using the keyword
"no_ghost=True", as this will extrapolate rather than interpolate from
boundary grids to the vertices. The rendering quality won't be quite as
good, but for unigrid simulations there isn't a tremendous difference.

Let us know how that goes! I'd be very excited to see images from such a
large sim...

Sam

On Fri Nov 07 2014 at 11:15:13 AM Stuart Levy <salevy@illinois.edu> wrote:

Thank you, Sam! I think this makes sense. Except, in case (1), do I
need to do something to bring the AMRKDTree into the picture? Or are you
telling me that it is automatically constructed whenever you
load_uniform_grid(), or volume-render it?

I think the available nodes have 64GB, so to load the whole ~600GB might
take at least 32 nodes or 1024 cores.

Will let you know how it goes!
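
The memory arithmetic behind those figures can be sketched as follows. The assumptions are mine: 4 bytes per value (which is what ~600GB for 1.5e11 points implies), 1 GB = 1e9 bytes, and no allowance for yt's own overhead or ghost zones. The node counts are therefore lower bounds, and a more generous figure like the 32 nodes above is the safer plan. `nodes_needed` is a hypothetical helper.

```python
import math


def nodes_needed(n_points, bytes_per_value, node_mem_gb):
    """Return (total size in GB, minimum node count) for a raw grid."""
    total_bytes = n_points * bytes_per_value
    total_gb = total_bytes / 1e9
    min_nodes = math.ceil(total_bytes / (node_mem_gb * 1e9))
    return total_gb, min_nodes


n_points = 150_000_000_000  # 1.5e11 grid points

gb32, nodes32 = nodes_needed(n_points, 4, 64)  # stored as 32-bit floats
gb64, nodes64 = nodes_needed(n_points, 8, 64)  # if cast to 64-bit floats

print(gb32, nodes32)  # 600.0 10
print(gb64, nodes64)  # 1200.0 19
```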

On Fri Nov 07 2014 at 9:03:51 AM Sam Skillman <samskillman@gmail.com> wrote:

Hi Stuart,

On Thu Nov 06 2014 at 8:36:28 AM Stuart Levy <salevy@illinois.edu> wrote:
> Hello all,

Are you loading directly using something like numpy.fromfile? If so, I
think the easiest method would be to replace that with a np.memmap
(http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html).
Once that is loaded, you should be able to use load_uniform_grid.
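
A small sketch of that substitution, using a tiny stand-in file so it can be run anywhere. The file name, the float32 dtype and the 8x8x8 shape are placeholders for the real brick, which is assumed to be a raw, headerless, C-ordered array.

```python
import os
import tempfile

import numpy as np

shape = (8, 8, 8)  # stand-in for the real grid dimensions
path = os.path.join(tempfile.mkdtemp(), "brick.raw")
np.arange(np.prod(shape), dtype=np.float32).tofile(path)

# np.fromfile would read every byte right away:
#   data = np.fromfile(path, dtype=np.float32).reshape(shape)
# np.memmap only maps the file; pages are read when they are touched.
density = np.memmap(path, dtype=np.float32, mode="r", shape=shape)

print(type(density).__name__, density.shape, float(density[1, 0, 0]))
```

The resulting array would then be handed to load_uniform_grid in a dict keyed by field name, in place of the array that numpy.fromfile returned.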
_______________________________________________
yt-users mailing list
yt-users@lists.spacepope.org <mailto:yt-users@lists.spacepope.org>
http://lists.spacepope.org/listinfo.cgi/yt-users-spacepope.org

On 11/7/14 1:33 PM, Sam Skillman wrote:

Yep, the volume rendering should build the AMRKDTree itself, and should
automatically decompose the giant brick into Np pieces. As for memory, you
may need to (eek) allow for yt casting to 64-bit floats for the data, but
you'll have to just experiment a bit.

Sam

On 11/7/14 11:08 AM, Sam Skillman wrote:

Ack, my calculation of 256-512 cores is probably low... feel free to push
up much higher.

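On Fri Nov 21 2014 at 3:07:49 PM Stuart Levy <salevy@illinois.edu> wrote:

OK, so I have experimented, though not much.

First, sheepishly: I was off by a decimal place in the original file
size. It's a unigrid with 1.5e10, not 1.5e11 voxels - big enough to be
a nuisance but not heroic.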

Second: load_uniform_grid() on a big numpy.memmap()'ed file, even a
modest 8GB fraction of the full grid, takes a long time - many tens of
minutes? I ran out of time slice before it finished even doing that.
Note this was just calling load_uniform_grid(), not any attempt at
calculation yet.

Speculation: something sweeps through the memory, causes a page fault,
sweeps a bit more, another page fault, etc. So there'd be many small
I/O calls triggered sequentially, wasting lots of time. Could that be?
If so, then I'm wondering if it could be possible to discover which
portions of the array will be in each node's domain, and prefetch those
in bulk first, using a few very efficient huge I/O calls (maybe via
madvise()).

Either that, or if I can do my own domain decomposition up front and
tell the AMRKDTree which nodes own which slabs of grid, then I could
just read() them in - also efficiently - and let yt do any further
decomposition, maybe.

Does either route make sense? Is there code I should look at?

Thanks as ever

Stuart
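
The "just read() them in" idea can be sketched with the standard library alone. This assumes the file is a raw, headerless, C-ordered array, so that a slab of whole planes along the first axis is one contiguous byte range. `read_slab` is a hypothetical helper, and telling the AMRKDTree which rank owns which slab is not covered here.

```python
import os
import tempfile


def read_slab(path, shape, itemsize, i0, i1):
    """Read planes [i0, i1) along the first (slowest) axis in one request."""
    nx, ny, nz = shape
    plane_bytes = ny * nz * itemsize
    buf = bytearray((i1 - i0) * plane_bytes)
    with open(path, "rb") as f:
        f.seek(i0 * plane_bytes)   # jump straight to the first owned plane
        n_read = f.readinto(buf)   # one large sequential read
    if n_read != len(buf):
        raise IOError("short read: file is smaller than expected")
    return buf


# Tiny demonstration: a 4x3x2 grid of 1-byte values holding 0..23.
path = os.path.join(tempfile.mkdtemp(), "tiny.raw")
with open(path, "wb") as f:
    f.write(bytes(range(24)))

slab = read_slab(path, (4, 3, 2), 1, 1, 3)  # planes 1 and 2
print(list(slab))  # [6, 7, 8, ..., 17]
```

Each rank would call this with its own plane range, so every process issues a single large read instead of faulting pages in one at a time.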

Hi Stuart,

I just tried to do the same thing on a fake 8GB file and am seeing similar
things. After killing the job, it exited while doing this:

^CTraceback (most recent call last):
  File "test_load.py", line 11, in <module>
    ds = yt.load_uniform_grid(data, (1024, 1024, 1024))
  File "/home/skillman/local/yt-x86_64/src/yt-3.0/yt/frontends/stream/data_structures.py", line 605, in load_uniform_grid
    field_units, data = unitify_data(data)
  File "/home/skillman/local/yt-x86_64/src/yt-3.0/yt/frontends/stream/data_structures.py", line 494, in unitify_data
    data = dict((field, np.array(val)) for field, val in data.iteritems())
  File "/home/skillman/local/yt-x86_64/src/yt-3.0/yt/frontends/stream/data_structures.py", line 494, in <genexpr>
    data = dict((field, np.array(val)) for field, val in data.iteritems())
KeyboardInterrupt

Looking at line 605 in yt/frontends/stream/data_structures.py, it was
choking when trying to "unitify_data". However, if you go in and change
line 494 from:

    data = dict((field, np.array(val)) for field, val in data.iteritems())

to

    data = dict((field, val) for field, val in data.iteritems())

So if you change that line, then at least the load_uniform_grid() call for
me takes about 0.02 seconds.

If you could test that out and report back, that would be awesome. Once you
make that change, you'll need to go to the root yt directory, like
path/src/yt-hg probably, then type "python setup.py develop".

I'll be flying for the next few hours, but if that all works, let us know,
and I can submit a pull request to fix it up.
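
Why the one-line change matters can be seen outside yt with a small memmap. This is only an illustration with a stand-in file: np.array() on a memmap makes a full in-memory copy, which forces the whole file to be read, while keeping the value as-is leaves it memory-mapped.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "field.raw")
np.zeros(1000, dtype=np.float32).tofile(path)
val = np.memmap(path, dtype=np.float32, mode="r")

copied = np.array(val)  # like the original line 494: a full in-memory copy
kept = val              # like the edited line: the memmap is passed through

print(isinstance(copied, np.memmap), isinstance(kept, np.memmap))
print(np.shares_memory(copied, val), np.shares_memory(kept, val))
```

On an 8GB file the copy has to pull every page from disk before load_uniform_grid() can return, which would fit the long load times reported above.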

I just tried to do the same thing on a fake 8GB file and am seeing similar
things. After killing the job, it exited while doing this:
^CTraceback (most recent call last):
File "test_load.py", line 11, in <module>
ds = yt.load_uniform_grid(data, (1024, 1024, 1024))
File
"/home/skillman/local/yt-x86_64/src/yt-3.0/yt/frontends/stream/data_structures.py",
line 605, in load_uniform_grid
field_units, data = unitify_data(data)
File
"/home/skillman/local/yt-x86_64/src/yt-3.0/yt/frontends/stream/data_structures.py",
line 494, in unitify_data
data = dict((field, np.array(val)) for field, val in data.iteritems())
File
"/home/skillman/local/yt-x86_64/src/yt-3.0/yt/frontends/stream/data_structures.py",
line 494, in <genexpr>
data = dict((field, np.array(val)) for field, val in data.iteritems())
KeyboardInterrupt

Looking at line 605 in yt/frontends/stream/data_structures.py, it was
choking when trying to "unitify_data". However, if you go in and change
line 494 from:
data = dict((field, np.array(val)) for field, val in data.iteritems())
to
data = dict((field, val) for field, val in data.iteritems())

First, sheepishly: I was off by a decimal place in the original file
size. It's a unigrid with 1.5e10, not 1.5e11 voxels - big enough to be a
nuisance but not heroic.

Second: load_uniform_grid() on a big numpy.memmap()'ed file, even a
modest 8GB fraction of the full grid, takes a long time - many tens of
minutes? I ran out of time slice before it finished even doing that.
Note this was just calling load_uniform_grid(), not any attempt at
calculation yet.

Speculation: something sweeps through the memory, causes a page fault,
sweeps a bit more, another page fault, etc. So there'd be many small I/O
calls triggered sequentially, wasting lots of time. Could that be? If
so, then I'm wondering if it could be possible to discover which portions
of the array will be in each node's domain, and prefetch those in bulk
first, using a few very efficient huge I/O calls (maybe via madvise()).

Either that, or if I can do my own domain decomposition up front and
tell the AMRKDTree which nodes own which slabs of grid, then I could just
read() them in - also efficiently - and let yt do any further
decomposition, maybe.

Does either route make sense? Is there code I should look at?

Thanks as ever

Stuart

On 11/7/14 1:33 PM, Sam Skillman wrote:

Yep, the volume rendering should build the AMRKDTree itself, and should
automatically decompose the giant brick into Np pieces. As for memory, you
may need to (eek) allow for yt casting to 64-bit floats for the data, but
you'll have to just experiment a bit.

Thank you, Sam! I think this makes sense.
Except, in case (1), do I
need to do something to bring the AMRKDTree into the picture? Or are you
telling me that it is automatically constructed whenever you
load_uniform_grid(), or volume-render it?

I think the available nodes have 64GB, so to load the whole ~600GB might
take at least 32 nodes or 1024 cores.

Will let you know how it goes!

On 11/7/14 11:08 AM, Sam Skillman wrote:

Ack, my calculation of 256-512 cores is probably low... feel free to
push up much higher.

We're hoping to use yt parallel volume rendering on a very large
generic
brick - it's a simple rectangular unigrid slab, but containing
something
like 1.5e11 points, so much too large for load_uniform_grid() to load
into memory in a single machine.

At that point, there are two possible routes that both may or may not
work well.

1) Just try rendering with ~256-512 cores, and the AMRKDTree should
try to geometrically split the grid before performing and I/O.
or
2) Use load_uniform_grid with the keyword nprocs=N ( for this size
simulation, you probably need something like 256-1024 processors depending
on the memory per core). This should do the equivalent thing to (1), but it
may hit the I/O here instead of in the kd-tree.

I think (1) should be your best option, but I haven't tried
rendering this large of a single-grid output.

When you build the camera option, definitely start out using the
keyword "no_ghost=True", as this will extrapolate rather than interpolate
from boundary grids to the vertices. The rendering quality won't be quite
as good but for unigrid simulations there isn't a tremendous difference.

Let us know how that goes! I'd be very excited to see images from
such a large sim...

Sam

>

I imagine it wouldn't be hard to do the domain
decomposition by hand,
loading a different chunk of grid into each MPI process. But then
what? What would it take to invoke the volume renderer on each piece
and composite them together? Would it help if the chunks were stored
in a KDTree? Is there some example (one of the existing data
loaders?)
which I could follow?

So if you change that line, then at least the load_uniform_grid() call
for me takes about 0.02 seconds.

If you could test that out and report back, that would be awesome.
Once you make that change, you'll need to go to the root yt directory,
like path/src/yt-hg probably, then type "python setup.py develop".

I'll be flying for the next few hours, but if that all works, let us
know, and i can submit a pull request to fix it up.

Hi Stuart,
I just tried to do the same thing on a fake 8GB file and am seeing
similar things. After killing the job, it exited while doing this:
^CTraceback (most recent call last):
File "test_load.py", line 11, in <module>
ds = yt.load_uniform_grid(data, (1024, 1024, 1024))
File
"/home/skillman/local/yt-x86_64/src/yt-3.0/yt/frontends/stream/data_structures.py",
line 605, in load_uniform_grid
field_units, data = unitify_data(data)
File
"/home/skillman/local/yt-x86_64/src/yt-3.0/yt/frontends/stream/data_structures.py",
line 494, in unitify_data
data = dict((field, np.array(val)) for field, val in
data.iteritems())
File
"/home/skillman/local/yt-x86_64/src/yt-3.0/yt/frontends/stream/data_structures.py",
line 494, in <genexpr>
data = dict((field, np.array(val)) for field, val in
data.iteritems())
KeyboardInterrupt
Looking at line 605 in yt/frontends/stream/data_structures.py, it
was choking when trying to "unitify_data". However, if you go in
and change line 494 from:
data = dict((field, np.array(val)) for field, val in
data.iteritems())
to
data = dict((field, val) for field, val in data.iteritems())
On Fri Nov 21 2014 at 3:07:49 PM Stuart Levy <salevy@illinois.edu
<mailto:salevy@illinois.edu>> wrote:
OK, so I have experimented, though not much.
First, sheepishly: I was off by a decimal place in the
original file size. It's a unigrid with 1.5e10, not 1.5e11
voxels - big enough to be a nuisance but not heroic.
Second: load_uniform_grid() on a big numpy.memmap()'ed file,
even a modest 8GB fraction of the full grid, takes a long time
- many tens of minutes? I ran out of time slice before it
finished even doing that. Note this was just calling
load_uniform_grid(), not any attempt at calculation yet.
Speculation: something sweeps through the memory, causes a
page fault, sweeps a bit more, another page fault, etc. So
there'd be many small I/O calls triggered sequentially,
wasting lots of time. Could that be? If so, then I'm
wondering if it could be possible to discover which portions
of the array will be in each node's domain, and prefetch those
in bulk first, using a few very efficient huge I/O calls
(maybe via madvise()).
Either that, or if I can do my own domain decomposition up
front and *tell* the AMRKDTree which nodes own which slabs of
grid, then I could just read() them in - also efficiently -
and let yt do any further decomposition, maybe.
Does either route make sense? Is there code I should look at?
Thanks as ever
Stuart
On 11/7/14 1:33 PM, Sam Skillman wrote:

Yep, the volume rendering should
build the AMRKDTree itself,
and *should* automatically decompose the giant brick into Np
pieces. As for memory, you may need to (eek) allow for yt
casting to 64-bit floats for the data, but you'll have to
just experiment a bit.
Sam
On Fri Nov 07 2014 at 11:15:13 AM Stuart Levy
<salevy@illinois.edu <mailto:salevy@illinois.edu>> wrote:
Thank you, Sam! I think this makes sense. Except, in
case (1), do I need to do something to bring the
AMRKDTree into the picture? Or are you telling me that
it is automatically constructed whenever you
load_uniform_grid(), or volume-render it?
I think the available nodes have 64GB, so to load the
whole ~600GB might take at least 32 nodes or 1024 cores.
Will let you know how it goes!
On 11/7/14 11:08 AM, Sam Skillman wrote:

Ack, my calculation of 256-512
cores is probably low...
feel free to push up much higher.
On Fri Nov 07 2014 at 9:03:51 AM Sam Skillman
<samskillman@gmail.com <mailto:samskillman@gmail.com>>
wrote:
Hi Stuart,
On Thu Nov 06 2014 at 8:36:28 AM Stuart Levy
<salevy@illinois.edu <mailto:salevy@illinois.edu>>
wrote:
Hello all,
We're hoping to use yt parallel volume rendering
on a very large generic
brick - it's a simple rectangular unigrid slab,
but containing something
like 1.5e11 points, so much too large for
load_uniform_grid() to load
into memory in a single machine.
Are you loading directly using something like
numpy.fromfile? If so, I think the easiest method
would be to replace that with a np.memmap
(http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html).
Once that is loaded, you should be able to use
load_uniform_grid.
At that point, there are two possible routes that
both may or may not work well.
1) Just try rendering with ~256-512 cores, and the
AMRKDTree should try to geometrically split the grid
before performing any I/O.
or
2) Use load_uniform_grid with the keyword nprocs=N
(for this size simulation, you probably need
something like 256-1024 processors depending on the
memory per core). This should do the equivalent
thing to (1), but it may hit the I/O here instead of
in the kd-tree.
I *think* (1) should be your best option, but I
haven't tried rendering this large of a single-grid
output.
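
As a rough sketch of the memmap-then-load_uniform_grid suggestion, something like the following might work. Assumptions are labeled: the file is taken to be raw C-ordered float32, the field name "density" and the helper names `open_brick` and `load_brick` are made up for illustration, and it is untested whether yt keeps the memmap lazy rather than copying it at this scale.

```python
import numpy as np


def open_brick(path, shape, dtype=np.float32):
    """Map a raw C-ordered binary file without reading it into memory."""
    return np.memmap(path, dtype=dtype, mode="r", shape=shape)


def load_brick(path, shape, field="density", nprocs=1):
    """Wrap the memory-mapped brick in a yt dataset (requires yt).

    nprocs > 1 corresponds to route (2): load_uniform_grid itself
    splits the brick into that many grids.
    """
    import yt  # imported lazily so open_brick works without yt installed

    data = {field: open_brick(path, shape)}  # "density" is a placeholder name
    return yt.load_uniform_grid(data, shape, nprocs=nprocs)
```

For route (1) one would leave nprocs at its default and rely on the volume renderer to do the splitting; for route (2), pass nprocs=N.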
When you build the camera object, definitely start
out using the keyword "no_ghost=True", as this will
extrapolate rather than interpolate from boundary
grids to the vertices. The rendering quality won't
be quite as good but for unigrid simulations there
isn't a tremendous difference.
Let us know how that goes! I'd be very excited to
see images from such a large sim...
Sam
I imagine it wouldn't be hard to do the domain decomposition
by hand, loading a different chunk of grid into each MPI
process. But then what? What would it take to invoke the
volume renderer on each piece and composite them together?
Would it help if the chunks were stored in a KDTree? Is there
some example (one of the existing data loaders?) which I
could follow?
_______________________________________________
yt-users mailing list
yt-users@lists.spacepope.org
http://lists.spacepope.org/listinfo.cgi/yt-users-spacepope.org

Thank you, Sam! I think this makes sense. Except, in case (1), do I
need to do something to bring the AMRKDTree into the picture? Or are you
telling me that it is automatically constructed whenever you
load_uniform_grid(), or volume-render it?

The AMRKDTree is part of the volume rendering infrastructure - one gets
constructed every time you make a volume rendering.

Sam gave a very nice talk at SciPy 2013 about making massively parallel
volume renders; it might help to give it a watch:
