[pdal] Does Entwine support distributed builds?

Hi there,

I have a question regarding the usage of Entwine and was hoping
somebody could help me. The use case is merging point clouds that
have been generated on different machines, where each point cloud
is part of the same final dataset. Entwine works great with the
current workflow:

entwine scan -i a.las b.las ... -o output/

# one build invocation per input file; each --run 1 inserts a
# single additional file into the index and then stops
for f in a.las b.las ...; do
    entwine build -i output/scan.json -o output/ --run 1
done

The "--run 1" is done to lower the memory usage. On small
datasets runtime is excellent, but with more models the runtime
starts to increase quite a bit. I'm looking specifically to see if
there are ways to speed the generation of the EPT index. In
particular, since I generate the various LAS files on different
machines, I was wondering if there was a way to let each machine
contribute its part of the index from the individual LAS files
(such index mapped to a network location) or if a workflow is
supported in which each machine can build its own EPT index and
then merge all EPT indexes into one? I don't think this is
possible, but wanted to check.

Re: [pdal] Does Entwine support distributed builds?

The `subset` option lets each iteration of the build process a spatially distinct region, and the resulting subsets can be trivially merged afterward - which sounds like what you're after. Another option could be to simply use multiple indexes: Potree can accept multiple input EPT sources, and a PDAL pipeline may contain multiple EPT readers.
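As a minimal sketch of that subset workflow, reusing the scan from the original post (the subset count of 4 and the paths are placeholders; -s <id> <of> is shorthand for --subset):

entwine build -i output/scan.json -o output/ -s 1 4
entwine build -i output/scan.json -o output/ -s 2 4
entwine build -i output/scan.json -o output/ -s 3 4
entwine build -i output/scan.json -o output/ -s 4 4
entwine merge output/

And a sketch of the multiple-indexes alternative as a PDAL pipeline, with hypothetical EPT paths and output filename:

{
    "pipeline": [
        { "type": "readers.ept", "filename": "out1/ept.json" },
        { "type": "readers.ept", "filename": "out2/ept.json" },
        { "type": "filters.merge" },
        { "type": "writers.las", "filename": "merged.las" }
    ]
}

Run with `pdal pipeline merge.json`: both EPT readers feed into filters.merge, so downstream stages see a single combined point view.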

Re: [pdal] Does Entwine support distributed builds?

Thanks - I suspected that was the case but wanted to confirm.

In regard to building subsets, is there a performance advantage
to using "entwine scan" versus passing the input files directly
to "entwine build" (or is scan a simple utility for finding
datasets within a folder)?
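For reference, the two invocation styles being compared - a direct build versus scan-then-build (the input directory is illustrative):

entwine build -i ./las/ -o output/

entwine scan -i ./las/ -o output/
entwine build -i output/scan.json -o output/

The scan writes per-file details (bounds, point counts, schema) to scan.json, which the build can consume instead of re-inspecting every input itself.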

Are there any tips or tricks that I should be aware of in terms
of memory usage when building with subsets? For example, is it
memory efficient to build, say, 64 subsets against the full scan,
or instead to run one build per input file (400 runs, in my case)
and then merge the results?

I've noticed two things with this. It seemed that as the number
of input files increased, the memory and time required to create
each subset increased as well (that's why I opted for scan +
build --run 1). The second is that I need to wait for all point
clouds to be available (both 1.las and 2.las need to exist before
I can start processing them).
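As a hedged sketch of how the subset variant could be spread across two machines - the subset count of 4 is a placeholder, and output/ is assumed to be a location both machines can write to (e.g. a network share or S3 bucket):

# machine 1
entwine build -i output/scan.json -o output/ -s 1 4
entwine build -i output/scan.json -o output/ -s 2 4

# machine 2
entwine build -i output/scan.json -o output/ -s 3 4
entwine build -i output/scan.json -o output/ -s 4 4

# run once, from anywhere, after all subsets have completed
entwine merge output/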

I wanted to rule out whether it was possible to do
something like (on two separate machines):

1] entwine build -i 1.las -o out1
2] entwine build -i 2.las -o out2

And then merge the resulting EPT indexes into a "global"
one:

entwine merge -i out1 out2 -o merged

But I don't think it's possible, correct?

-Piero

Re: [pdal] Does Entwine support distributed builds?

Hi Piero,

I'm watching your questions with interest - many have been on my mind also!

...did your second proposal (run 400 times) work?

That would, on the surface, use less memory, since you're reading from one LAS file at a time rather than from roughly 400/64 (about 6) LAS files per subset - potentially, assuming a lot about how the data are distributed in space. ...but it would also mean partial writing of each Entwine chunk, which will eventually contain data from potentially 400/64 of your files...

...so the question there is: can Entwine support partial writing of subsets?
