0. Overview and vote
I think the library delivers the high-level goods (parallel foreach,
map, reduce) but is a bit fuzzy in the lower level details.
The documentation obscures the library's capabilities and is essentially
of inadequate quality. Entity documentation and examples
do little more than give the impression of dispensing with an unpleasant
chore.
My vote in favor of acceptance is contingent upon a radical improvement
in the quality of documentation and examples. Most if not all artifacts
should be motivated by simple, compelling examples. The introductory
section must contain a brief and attractive synopsis of the flagship
capabilities. All terms must be defined before being used and introduced
in a carefully-chosen order. The relationship between various entities
should be clarified.
I've seen the argument that simple and strong examples are difficult to
find. Though I agree such examples are not easy to come up with, I also
believe the author of the library is in the best position to produce them.
================
1. Library proper:
* "In the case of non-random access ranges, parallel foreach is still
usable but buffers lazily to an array..." Wouldn't strided processing
help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
1, 5, 9, ... and so on.
* Example with squares would be better if double replaced uint, and if a
more complicated operation (sqrt, log...) were involved.
* I'm unclear on the tactics used by lazyMap. I'm thinking the obvious
method should be better: just use one circular buffer. The presence of
two dependent parameters makes this abstraction difficult to operate with.
* Same question about asyncBuf. What is wrong with a circular buffer
filled on one side by threads and consumed on the other by the
client? I can think of a couple of answers but it would be great if they
were part of the documentation.
* Why not make workerIndex a ulong and be done with it?
* Is stop() really trusted or just unsafe? If it's forcibly killing
threads then it's unsafe.
* uint size() should be size_t for conformity.
* defaultPoolThreads - should it be a property?
* I don't understand what Task is supposed to do. It is circularly
defined: "A struct that encapsulates the information about a task,
including its current status, what pool it was submitted to, and its
arguments." OK, but what's a task? Could a task be used outside a task
pool, and if so to what effect(s)?
* If LazyMap is only necessary as the result of lazyMap, could that
become a local type inside lazyMap?
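For concreteness, the single-circular-buffer design suggested in the
asyncBuf bullet can be sketched as follows. This is a Python analogue,
not D and not the library's implementation; `async_buf` and its
parameters are hypothetical names. A bounded synchronized queue stands
in for the ring buffer: one thread fills it while the caller drains it.

```python
import queue
import threading

def async_buf(source, buf_size=4):
    """Hypothetical analogue of a single bounded circular buffer:
    a worker thread fills the queue while the caller drains it."""
    q = queue.Queue(maxsize=buf_size)   # bounded: producer blocks when full
    done = object()                     # sentinel marking end of input

    def producer():
        for item in source:
            q.put(item)                 # lock acquired for every element
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()                  # lock acquired for every element
        if item is done:
            return
        yield item

print(list(async_buf(range(10))))
```

Note that every put()/get() takes a lock, i.e. this design pays one
synchronization per element, which is relevant to the answer given later
in the thread.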
2. Documentation:
* Documentation unacceptable. It lacks precision and uses terms before
defining them. For example: "This class encapsulates a task queue and a
set of worker threads" comes before the notions of "task queue" and
"worker thread" are defined. Clearly there is an intuition of what those
are supposed to mean, but in practice each library lays down some fairly
detailed definition of such.
* "Note: Initializing a pool with zero threads (as would happen in the
case of a single-core CPU) is well-tested and does work." The absence of
a bug should not be advertised in so many words. Simpler: "Note:
Single-CPU machines will operate transparently with zero-sized pools."
* "Allows for custom pool size." I have no idea what pool size means.
* "// Do something interesting." Worst example ever. You are the creator
of the library. If _you_ can't find a brief compelling example, who could?
* "Immediately after the range argument, an optional block size argument
may be provided. If none is provided, the default block size is used. An
optional buffer for returning the results may be provided as the last
argument. If one is not provided, one will be automatically allocated.
If one is provided, it must be the same length as the range." An example
should be inserted after each of these sentences.
* "// Do something expensive with line." Better you do something
expensive with line.
* "This is not the same thing as commutativity." I think this is obvious
enough to be left out. The example of matrices (associative but not
commutative) is nice though.
* "immutable n = 1000000000;" -> use underscores
* Would be great if the documentation examples included some rough
experimental results, e.g. "3.8 times faster on a 4-core machine than
the equivalent serial loop".
* No example for workerIndex and why it's useful.
* Is LocalWorker better than WorkerLocal? No, because it's not the
worker that's local, it's the storage - which is missing from the name!
WorkerLocalStorage? Also replace "create" with "make" or drop it
entirely. The example doesn't tell me how I can use bufs. I suspect
workerIndex has something to do with it but the example fails to divulge
that relationship.
* No example for join(), which is quite a central primitive.
* No example for put(), which would be interesting to see.
* No example for task().
* No example for run().
* The example for Task uses myFuture without defining it. In case Task
defines the promise/future idiom, the documentation misses a terrific
opportunity of clarifying that.
* What is 'run' in the definition of safe task()?
* Why is AsyncBuf's doc far away from the function returning it? Same
about WorkerLocal.
Andrei

Thanks for the advice. You mentioned in the past that the documentation
was inadequate but didn't give enough specifics as to how until now. As
the author of the library, things seem obvious to me that don't seem
obvious to anyone else, so I don't feel that I'm in a good position to
judge the quality of the documentation and where it needs improvement.
I plan to fix most of the issues you raised, but I've left comments for
the few that I can't/won't fix or believe are based on misunderstandings
below.
On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:

1. Library proper:
* "In the case of non-random access ranges, parallel foreach is still
usable but buffers lazily to an array..." Wouldn't strided processing
help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
1, 5, 9, ... and so on.

You can have this if you want, by setting the work unit size to 1.
Setting it to a larger size just causes more elements to be buffered,
which may be more efficient in some cases.
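The relationship between work unit size and striding can be sketched
with a small Python analogue (hypothetical, not the library's scheduler):
workers repeatedly claim the next contiguous block of `work_unit_size`
indices, so with size 1 the assignment degenerates into the fine-grained
interleaving Andrei describes.

```python
import threading

def parallel_chunks(n, work_unit_size, n_threads=4):
    """Sketch: workers grab the next contiguous block of indices.
    With work_unit_size == 1 this is effectively strided processing."""
    lock = threading.Lock()
    next_index = 0
    claimed = {t: [] for t in range(n_threads)}   # blocks claimed per worker

    def worker(tid):
        nonlocal next_index
        while True:
            with lock:                            # atomically claim a block
                start = next_index
                next_index += work_unit_size
            if start >= n:
                return
            claimed[tid].append(range(start, min(start + work_unit_size, n)))

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed

blocks = parallel_chunks(16, work_unit_size=1)
# every index 0..15 is claimed exactly once, one element per block
print(sorted(i for bs in blocks.values() for b in bs for i in b))
```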

* I'm unclear on the tactics used by lazyMap. I'm thinking the obvious
method should be better: just use one circular buffer. The presence of
two dependent parameters makes this abstraction difficult to operate with.
* Same question about asyncBuf. What is wrong with a circular buffer
filled on one side by threads and consumed on the other by the
client? I can think of a couple of answers but it would be great if they
were part of the documentation.

Are you really suggesting I give detailed rationales for implementation
decisions in the documentation? Anyhow, the two reasons for this choice
are to avoid needing synchronization/atomic ops/etc. on every write to
the buffer (which we would need since it can be read and written
concurrently and we need to track whether we have space to write to) and
because parallel map works best when it operates on relatively large
buffers, resulting in minimal synchronization overhead per element.
(Under the hood, the buffer is filled and then eager parallel map is
called.)
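The fill-then-eager-map strategy described here can be sketched in
Python (a hypothetical analogue, not the library's code): buffer a batch
of elements serially, map over the whole batch in parallel, yield the
results, repeat. Synchronization then happens once per buffer rather
than once per element.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def lazy_map(fn, source, buf_size=8, workers=4):
    """Sketch of double buffering: fill a buffer, then run an eager
    parallel map over it; per-element locking is avoided entirely."""
    it = iter(source)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            buf = list(islice(it, buf_size))   # serial fill of one buffer
            if not buf:
                return
            yield from pool.map(fn, buf)       # eager parallel map, in order

squares = list(lazy_map(lambda x: x * x, range(10)))
print(squares)
```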

* Why not make workerIndex a ulong and be done with it?

I doubt anyone's really going to create anywhere near 4 billion TaskPool
threads over the lifetime of a program. Part of the point of TaskPool
is recycling threads rather than paying the overhead of creating and
destroying them. Using a ulong on a 32-bit architecture would make
worker-local storage substantially slower. workerIndex is how
worker-local storage works under the hood, so it needs to be fast.
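The point that worker-local storage is built on a cheap integer index
can be sketched with a Python analogue (hypothetical names throughout):
storage is just an array with one slot per worker thread, indexed on
every access with no locking on the slot itself.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

N_WORKERS = 4
_worker_index = threading.local()   # each pool thread gets a small index
_counter = [0]
_counter_lock = threading.Lock()

def _init():
    with _counter_lock:             # assign the index once per pool thread
        _worker_index.value = _counter[0]
        _counter[0] += 1

def worker_local_sums(values):
    storage = [0] * N_WORKERS       # one slot per worker, no per-access lock
    def add(v):
        # each worker thread only ever touches its own slot
        storage[_worker_index.value] += v
    with ThreadPoolExecutor(max_workers=N_WORKERS, initializer=_init) as pool:
        list(pool.map(add, values))
    return storage

partials = worker_local_sums(range(100))
print(sum(partials))
```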

* No example for workerIndex and why it's useful.

It should just be private. The fact that it's public is an artifact of
when I was designing worker-local storage and didn't know how it was
going to work yet. I never thought to revisit this until now. It
really isn't useful to client code.

* Is stop() really trusted or just unsafe? If it's forcibly killing
threads then it's unsafe.

It's not forcibly killing threads. As the documentation states, it has
no effect on jobs already executing, only ones in the queue.
Furthermore, it's needed unless makeDaemon is called. Speaking of
which, makeDaemon and makeAngel should probably be trusted, too.

* defaultPoolThreads - should it be a property?

Yes. In spirit it's a global variable. It requires some extra
machinations, though, to be threadsafe, which is why it's not
implemented as a simple global variable.
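The "global variable in spirit, property in practice" shape can be
sketched in Python (an illustrative analogue of the machinery described,
not the actual implementation): the property reads and writes a private
field under a lock so concurrent access stays consistent.

```python
import threading

class PoolConfig:
    """Sketch: a thread-safe getter/setter pair exposed as a property."""
    def __init__(self, n):
        self._lock = threading.Lock()
        self._n = n

    @property
    def default_pool_threads(self):
        with self._lock:            # consistent read even during updates
            return self._n

    @default_pool_threads.setter
    def default_pool_threads(self, n):
        with self._lock:
            self._n = n

cfg = PoolConfig(4)
cfg.default_pool_threads = 8        # reads like a plain variable
print(cfg.default_pool_threads)
```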

* No example for task().

???? Yes there is, for both flavors, though these could admittedly be
improved. Only the safe version doesn't have an example, and this is
just a more restricted version of the function pointer case, so it seems
silly to make a separate example for it.

1. Library proper:
* "In the case of non-random access ranges, parallel foreach is still
usable but buffers lazily to an array..." Wouldn't strided processing
help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
1, 5, 9, ... and so on.

You can have this if you want, by setting the work unit size to 1.
Setting it to a larger size just causes more elements to be buffered,
which may be more efficient in some cases.

Please add an example showing that, too. Sure, the documentation says
that's what's being done, but an example would show it more clearly.
--
Simen

1. Library proper:
* "In the case of non-random access ranges, parallel foreach is still
usable but buffers lazily to an array..." Wouldn't strided processing
help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
1, 5, 9, ... and so on.

You can have this if you want, by setting the work unit size to 1.
Setting it to a larger size just causes more elements to be buffered,
which may be more efficient in some cases.

that's what's being done, but an example would show it more clearly.

I don't understand how this can be demonstrated in an example. It's an
under-the-hood thing. The only place this appears in the API is in the
workUnitSize parameter.

1. Library proper:
* "In the case of non-random access ranges, parallel foreach is still
usable but buffers lazily to an array..." Wouldn't strided processing
help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
1, 5, 9, ... and so on.

You can have this if you want, by setting the work unit size to 1.
Setting it to a larger size just causes more elements to be buffered,
which may be more efficient in some cases.

that's what's being done, but an example would show it more clearly.

I don't understand how this can be demonstrated in an example. It's an
under-the-hood thing. The only place this appears in the API is in the
workUnitSize parameter.

Yeah, scratch that. I for some reason thought the "array of size
workUnitSize" was global, but it's per thread, innit? Seems so logical now.
--
Simen

Thanks for the advice. You mentioned in the past that the documentation
was inadequate but didn't give enough specifics as to how until now. As
the author of the library, things seem obvious to me that don't seem
obvious to anyone else, so I don't feel that I'm in a good position to
judge the quality of the documentation and where it needs improvement. I
plan to fix most of the issues you raised, but I've left comments for
the few that I can't/won't fix or believe are based on misunderstandings
below.

Great, thanks.

On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:

1. Library proper:
* "In the case of non-random access ranges, parallel foreach is still
usable but buffers lazily to an array..." Wouldn't strided processing
help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
1, 5, 9, ... and so on.

You can have this if you want, by setting the work unit size to 1.
Setting it to a larger size just causes more elements to be buffered,
which may be more efficient in some cases.

Got it.

* Why not make workerIndex a ulong and be done with it?

I doubt anyone's really going to create anywhere near 4 billion TaskPool
threads over the lifetime of a program. Part of the point of TaskPool is
recycling threads rather than paying the overhead of creating and
destroying them. Using a ulong on a 32-bit architecture would make
worker-local storage substantially slower. workerIndex is how
worker-local storage works under the hood, so it needs to be fast.

If you're confident that overflow won't occur, you may want to eliminate
that detail from the docs. It throws off the reader.

> * No example for workerIndex and why it's useful.
It should just be private. The fact that it's public is an artifact of
when I was designing worker-local storage and didn't know how it was
going to work yet. I never thought to revisit this until now. It really
isn't useful to client code.

It could be public and undocumented.

* Is stop() really trusted or just unsafe? If it's forcibly killing
threads then its unsafe.

It's not forcibly killing threads. As the documentation states, it has
no effect on jobs already executing, only ones in the queue.
Furthermore, it's needed unless makeDaemon is called. Speaking of which,
makeDaemon and makeAngel should probably be trusted, too.

Great. The more safe/trusted, the better.

* defaultPoolThreads - should it be a property?

Yes. In spirit it's a global variable. It requires some extra
machinations, though, to be threadsafe, which is why it's not
implemented as a simple global variable.

Then it should be a property. I think ddoc doesn't reflect that, but an
example could.

* No example for task().

???? Yes there is, for both flavors, though these could admittedly be
improved. Only the safe version doesn't have an example, and this is
just a more restricted version of the function pointer case, so it seems
silly to make a separate example for it.

Towards the bottom of the document there are overloads of task that
don't have examples.

It should just be private. The fact that it's public is an artifact of
when I was designing worker-local storage and didn't know how it was
going to work yet. I never thought to revisit this until now. It really
isn't useful to client code.

It could be public and undocumented.

I've already made it private. I can't see what purpose having it public
would serve. The fact that it was public before was **purely** an
oversight.

* defaultPoolThreads - should it be a property?

Yes. In spirit it's a global variable. It requires some extra
machinations, though, to be threadsafe, which is why it's not
implemented as a simple global variable.

Then it should be a property. I think ddoc doesn't reflect that, but an
example could.

Right. It's always been a property but ddoc doesn't reflect this. I've
changed the docs slightly to call it a "property" instead of a
"function". It seems like overkill to me to give examples for this,
though, since it's just a getter and a setter.

* No example for task().

???? Yes there is, for both flavors, though these could admittedly be
improved. Only the safe version doesn't have an example, and this is
just a more restricted version of the function pointer case, so it seems
silly to make a separate example for it.

Towards the bottom of the document there are overloads of task that
don't have examples.

You mean TaskPool.task()? Since these are such slight variations of the
other overloads, I thought an example would be overkill. Since people
less familiar with the library don't think so, though, I've added
examples that are accordingly slight variations of the examples for the
other overloads.

Ok, I ended up moving the docs for run() directly above the stuff that
uses it. run() is described as "Calls a delegate or function pointer
with args. This is an adapter that makes Task work with delegates,
function pointers and functors instead of just aliases. It is included
in the documentation to clarify how this case is handled, but is not
meant to be used directly by client code."
I know the Higher Principles of Encapsulation say this should be private
and the relevant overloads should return auto. I strongly believe,
though, that being anal about encapsulation of this detail is silly
since it is so unlikely to change and that exposing it helps to clarify
what's really going on here. Encapsulation is good up to a point, but
sometimes it's just easier to think about things when you know how they
really work at a concrete level, and this tradeoff needs to be weighed.
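The adapter role that doc comment describes can be illustrated with a
tiny Python analogue (`run` and `Doubler` here are illustrative names,
not the library's symbols): one entry point that invokes plain
functions, bound methods, or functors uniformly.

```python
def run(fn_or_functor, *args):
    """Sketch of an adapter: call any callable with the given args,
    so higher-level code need not care what kind of callable it got."""
    return fn_or_functor(*args)

class Doubler:
    """A functor: an object that is callable like a function."""
    def __call__(self, x):
        return 2 * x

print(run(len, [1, 2, 3]), run(Doubler(), 21))
```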

Towards the bottom of the document there are overloads of task that
don't have examples.

You mean TaskPool.task()? Since these are such slight variations of the
other overloads, I thought an example would be overkill. Since people
less familiar with the library don't think so, though, I've added
examples that are accordingly slight variations of the examples for the
other overloads.

A great way to handle bunches of almost-identical overloads is to group
them together with /// ditto and explain the slight differences in the
consolidated documentation.
Andrei

Ok, thanks again for clarifying **how** the docs could be improved.
I've implemented the suggestions and generally given the docs a good
reading over and clean up. The new docs are at:
http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

* When using "parallelism" as a common noun, prefix it with a '_' so
ddoc doesn't underline it.
* "Most of this module completely subverts..." Vague characterizations
("most", "completely", "some") don't belong in technical
documentation. (For example there's either subversion going on or there
isn't.) Also, std.concurrency and std.parallelism address different
needs so there's little competition between them. Better: "Unless
explicitly marked as $(D trusted) or $(D safe), artifacts in this
module are not provably memory-safe and cannot be used with SafeD. If
used as documented, memory safety is guaranteed."
* Speaking of std.concurrency vs. std.parallelism, the first paragraph
might be something like: "This module implements high-level primitives
for shared memory SMP parallelism. These include parallel foreach,
parallel reduce, parallel eager map, pipelining and future/promise
parallelism primitives. $(D std._parallelism) is best recommended when
the same operation is to be executed in parallel over different data.
For communication between arbitrary threads, see $(D std.concurrency)."
* Still no synopsis example that illustrates in a catchy way the most
attractive artifacts.
* "After creation, Task objects are submitted to a TaskPool for
execution." I understand it's possible to use Task straight as a
promise/future, so s/are/may be/. Also it is my understanding that
TaskPool efficiently maps an arbitrary number of tasks onto the concrete
number of CPUs. If that's the case, it's worth mentioning.
* "A call to workWait(), yieldWait(), or spinWait() can be used to
retrive the return value after the function is finished executing." As
an aside, a spell checking step would be great ("retrieve") - just put
the text in an editor with spellchecking. I think what this means is:
"The methods workWait(), yieldWait(), or spinWait() make sure that the
function finishes execution and then return its result to the initiating
thread. Each uses a different waiting strategy as detailed below."
* "If a Task has been submitted to a TaskPool instance, is being stored
in a stack frame, and has not yet finished, the destructor for this
struct will automatically call yieldWait() so that the task can finish
and the stack frame can be destroyed safely." At this point in the doc
the reader doesn't understand that at all because TaskPool has not been
seen yet. The reader gets worried that she'll be essentially serializing
the entire process by mistake. Either move this explanation down or
provide an example.
* "Function results are returned from yieldWait() and friends by ref."
Someone coming from C++ may be thrown off by this sudden casual use of
"friends" and think there's a notion of friendship by reference in D.
Better: "The forcing methods yieldWait(), workWait(), and spinWait()
return the result by reference."
* Speaking of which, I'd replace "Wait" with "Force". Right now the
nomenclature is far removed from futures and promises.
* Is done() a property?
* The example that reads two files at the same time should NOT use
taskPool. It's just one task, why would the pool ever be needed? If you
also provided an example that reads n files in memory at the same time
using a pool, that would illustrate nicely why you need it. If a Task
can't be launched without being put in a pool, there should be a
possibility to do so. At my work we have a simple function called
callInNewThread that does what's needed to launch a function in a new
thread.
* The note below that example gets me thinking: it is an artificial
limitation to force users of Task to worry about scope and such. One
should be able to create a Future object (Task I think in your
terminology), pass it around like a normal value, and ultimately force
it. This is the case for all other languages that implement futures. I
suspect the "scope" parameter associated with the delegate a couple of
definitions below plays a role here, but I think we need to work for
providing the smoothest interface possible (possibly prompting
improvements in the language along the way).
* I'm not sure how to interpret the docs for
ReturnType!(F) run(F, Args...)(F fpOrDelegate, ref Args args);
So it's documented but I'm not supposed to care. Why not just remove?
Surely there must be higher-level examples that clarify that I can use
delegates etc.
* The examples have code at top-level. That's fine for short snippets
but not when using import etc. I recommend putting the code inside
unittests or function bodies for such cases.
* "If you want to escape the Task object from the function in which it
was created or prefer to heap allocate and automatically submit to the
pool, see TaskPool.task()." I'm uncomfortable that I need to remember a
subtle distinction of the same primitive name ("task") depending on
membership in a TaskPool or not, which is a tenuous relationship to
remember. How about "scopedTask()" vs. "task()"? Also, it's worth asking
ourselves what's the relative overhead of closure allocation vs. task
switching etc. Maybe we get to simplify things a lot at a small cost in
efficiency.
* As I mentioned, in the definition:
Task!(run,TypeTuple!(F,Args)) task(F, Args...)(F fun, Args args);
I can't tell what "run" is.
* osReportedNcpu - a case of naming convention gone wild... how about
totalCPUs or reportedCPUs.
* "A goto from inside the parallel foreach loop to a label outside the
loop will result in undefined behavior." Would this be a bug in dmd?
* I would group these two together:
ParallelForeach!(R) parallel(R)(R range, size_t workUnitSize);
ParallelForeach!(R) parallel(R)(R range);
Just put a /// ditto for the second, use only one code snippet that
illustrates two foreach loops, one with the explicit parameter and one
without. In fact swap the two - most of the time people will use the one
with the default work unit size. At best don't specify 512 because later
on you may come up with a better algorithm to determine the optimal
block size.
* "workUnitSize: The number of elements to evaluate in a single Task.
Must be less than or equal to bufSize, and in practice should be a
fraction of bufSize such that all worker threads can be used." Then why
not specify a different parameter such a multiplier or something? The
dependence between bufSize and workUnitSize is a sign that these two
should be improved. If you have good reasons that the user must have the
parameters in this form, give an example substantiating that.
* "auto myMax = taskPool.reduce!"a + b * b"(myArr);" This is not
computing anything meaningful. To compute the sum of squares, you need
to give the seed 0.0.
* Again: speed of e.g. parallel min/max vs. serial, pi computation etc.
on a usual machine?
* The join example is fine, but the repetitive code suggests that loops
should be better:
import std.file;
void main() {
auto pool = new TaskPool();
foreach (filename; ["foo.txt", "bar.txt", "baz.txt"]) {
pool.put(task!read(filename));
}
// Call join() to guarantee that all tasks are done running, the worker
// threads have terminated and that the results of all of the tasks can
// be accessed without any synchronization primitives.
pool.join();
ubyte[][] data; // void[], meh
// Use spinWait() since the results are guaranteed to have been computed
// and spinWait() is the cheapest of the wait functions.
foreach (task; pool.howDoIEnumerateTasks()) {
data ~= task.spinWait();
}
}
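As an aside on the reduce seed bullet above, the difference the seed
makes can be sketched with Python's serial reduce (an analogue only;
the parallel version raises the same issue):

```python
from functools import reduce

# Without an explicit seed the first element enters the fold unsquared,
# so "a + b*b" does not compute a sum of squares.
my_arr = [2, 3, 4]

wrong = reduce(lambda a, b: a + b * b, my_arr)      # first element 2 unsquared
right = reduce(lambda a, b: a + b * b, my_arr, 0)   # seed 0: true sum of squares
print(wrong, right)
```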
Andrei

* "Most of this module completely subverts..." Vague characterizations
("most", "completely", "some") don't belong in a technical
documentation. (For example there's either subversion going on or there
isn't.) Also, std.concurrency and std.parallelism address different
needs so there's little competition between them. Better: "Unless
explicitly marked as $(D trusted) or $(D safe), artifacts in this
module are not provably memory-safe and cannot be used with SafeD. If
used as documented, memory safety is guaranteed."

Actually, I think this is a bad description of what it subverts. What
it subverts isn't the memory-safety that SafeD provides, but the safety
against low-level races that even unsafe D protects against unless you
cast shared away. For instance:
import std.parallelism, std.stdio;

void main() {
    int sum = 0;
    foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
        sum += value;
    }
    writeln(sum);
}
The "+=" would need to be an atomic operation to avoid low-level races.
I think that ParallelForeach's opApply should only accept a shared
delegate. I define shared delegate as a delegate that does not
reference any non-shared variables of its outer scope. The problem is
that DMD currently doesn't know how to determine whether a delegate
literal is shared or not, so a delegate literal is never shared; if
ParallelForeach's opApply required a shared delegate, as it should, it
would just not work. Fix DMD to create shared delegate literals where
appropriate and everything can be guaranteed race-free.
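The low-level race Michel describes can be made concrete with a Python
threading analogue (illustrative only; `parallel_sum` is a hypothetical
name): `total += v` is a read-modify-write, so concurrent increments can
be lost unless the update is made atomic, here with a lock.

```python
import threading

def parallel_sum(values, n_threads=4):
    total = 0
    lock = threading.Lock()
    chunks = [values[i::n_threads] for i in range(n_threads)]

    def worker(chunk):
        nonlocal total
        for v in chunk:
            with lock:          # the "atomic +=" the text calls for
                total += v

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

print(parallel_sum([0, 2, 3, 6, 1, 4, 6, 3, 3, 3, 6]))
```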
--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

* "Most of this module completely subverts..." Vague characterizations
("most", "completely", "some") don't belong in a technical
documentation. (For example there's either subversion going on or
there isn't.) Also, std.concurrency and std.parallelism address
different needs so there's little competition between them. Better:
"Unless explicitly marked as $(D trusted) or $(D safe), artifacts in
this module are not provably memory-safe and cannot be used with
SafeD. If used as documented, memory safety is guaranteed."

Actually, I think this is a bad description of what it subverts. What it
subverts isn't the memory-safety that SafeD provides, but the safety
against low-level races that even unsafe D protects against unless you
cast shared away. For instance:
void main() {
    int sum = 0;
    foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
        sum += value;
    }
    writeln(sum);
}
The "+=" would need to be an atomic operation to avoid low-level races.
I think that ParallelForeach's opApply should only accept a shared
delegate. I define shared delegate as a delegate that does not reference
any non-shared variables of its outer scope. The problem is that DMD
currently doesn't know how to determine whether a delegate literal is
shared or not, thus a delegate literal is never shared and if
ParallelForeach's opApply asked for a shared delegate as it should, it would
just not work. Fix DMD to create shared delegate literals where
appropriate and everything can be guaranteed race-free.

If you want pedal-to-metal parallelism without insane levels of
verbosity, you can't have these safety features. I thought long and
hard about this issue before submitting this lib for review and
concluded that any solution would make std.parallelism so slow, so
limited and/or such a PITA to use that I'd rather it just punt these
issues to the programmer. In practice, parallel foreach is used with
very small, performance-critical snippets that are fairly easy to reason
about even without any language-level race safety. This is a
fundamental design decision that will not be changing. If it's
unacceptable then:
1. I give up.
2. I wish someone had told me earlier.

For instance:
void main() {
    int sum = 0;
    foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
        sum += value;
    }
    writeln(sum);
}
The "+=" would need to be an atomic operation to avoid low-level races.
I think that ParallelForeach's opApply should only accept a shared
delegate. I define shared delegate as a delegate that does not reference
any non-shared variables of its outer scope. The problem is that DMD
currently doesn't know how to determine whether a delegate literal is
shared or not, thus a delegate literal is never shared and if
ParallelForeach's opApply asked for a shared delegate as it should, it would
just not work. Fix DMD to create shared delegate literals where
appropriate and everything can be guaranteed race-free.

If you want pedal-to-metal parallelism without insane levels of
verbosity, you can't have these safety features.

I'm not sure where my proposal asks for more verbosity or less
performance. All I can see is a few less casts in std.parallelism and
that it'd disallow the case in my example above that is totally wrong.
Either you're interpreting it wrong or there are things I haven't
thought about (and I'd be happy to know about them).
But by looking at all the examples in the documentation, I cannot find
one that would need to be changed... well, except the one I'll discuss
below.

I thought long and hard about this issue before submitting this lib for
review and concluded that any solution would make std.parallelism so
slow, so limited and/or such a PITA to use that I'd rather it just punt
these issues to the programmer. In practice, parallel foreach is used
with very small, performance-critical snippets that are fairly easy to
reason about even without any language-level race safety.

I'm not too convinced about the "I know what I'm doing" argument when I
look at this example from asyncBuf's documentation:
auto lines = File("foo.txt").byLine();
auto duped = map!"a.idup"(lines); // Necessary b/c byLine()
recycles buffer
// Fetch more lines in the background while we process the lines already
// read into memory into a matrix of doubles.
double[][] matrix;
auto asyncReader = taskPool.asyncBuf(duped);
foreach(line; asyncReader) {
auto ls = line.split("\t");
matrix ~= to!(double[])(ls);
}
Look at the last line of the foreach. You are appending to a non-shared
array from many different threads. How is that not a race condition?
With my proposal, the compiler would have caught that because opApply
would want the foreach body to be a shared delegate, and
reading/writing to non-shared variable "matrix" in the outer scope from
a shared delegate literal would be an error.
I'm not too sure how hard it'd be to do that in the compiler, but I
think it's the right thing to do. Once the compiler can do this we can
have safety; until that time I'd agree to see std.parallelism stay
unsafe.
--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

I'm not too convinced about the "I know what I'm doing" argument when I
look at this example from asyncBuf's documentation:
auto lines = File("foo.txt").byLine();
auto duped = map!"a.idup"(lines);  // Necessary b/c byLine() recycles buffer

// Fetch more lines in the background while we process the lines already
// read into memory into a matrix of doubles.
double[][] matrix;
auto asyncReader = taskPool.asyncBuf(duped);
foreach(line; asyncReader) {
    auto ls = line.split("\t");
    matrix ~= to!(double[])(ls);
}
Look at the last line of the foreach. You are appending to a non-shared
array from many different threads. How is that not a race condition?

... or maybe I just totally misunderstood asyncBuf. Rereading the
documentation I'm under the impression I'd have to write this to get
what I expected:
foreach (line; parallel(asyncReader))
...
And that would cause a race condition. If that's the case, the example
is fine. Sorry for the misunderstanding.
--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

I'm not too convinced about the "I know what I'm doing" argument when
I look at this example from asyncBuf's documentation:
auto lines = File("foo.txt").byLine();
auto duped = map!"a.idup"(lines);  // Necessary b/c byLine() recycles buffer

// Fetch more lines in the background while we process the lines already
// read into memory into a matrix of doubles.
double[][] matrix;
auto asyncReader = taskPool.asyncBuf(duped);
foreach(line; asyncReader) {
    auto ls = line.split("\t");
    matrix ~= to!(double[])(ls);
}
Look at the last line of the foreach. You are appending to a
non-shared array from many different threads. How is that not a race
condition?

... or maybe I just totally misunderstood asyncBuf. Rereading the
documentation I'm under the impression I'd have to write this to get
what I expected:
foreach (line; parallel(asyncReader))
...
And that would cause a race condition. If that's the case, the example
is fine. Sorry for the misunderstanding.

Right. And this is pretty obviously a race. The other example (without
the parallel) is completely safe.

For instance:
void main() {
    int sum = 0;
    foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
        sum += value;
    }
    writeln(sum);
}
The "+=" would need to be an atomic operation to avoid low-level races.
I think that ParallelForeach's opApply should only accept a shared
delegate. I define shared delegate as a delegate that does not reference
any non-shared variables of its outer scope. The problem is that DMD
currently doesn't know how to determine whether a delegate literal is
shared or not, thus a delegate literal is never shared and if
ParallelForeach's opApply asked for a shared delegate as it should, it would
just not work. Fix DMD to create shared delegate literals where
appropriate and everything can be guaranteed race-free.

If you want pedal-to-metal parallelism without insane levels of
verbosity, you can't have these safety features.

I'm not sure where my proposal asks for more verbosity or less
performance. All I can see is a few less casts in std.parallelism and
that it'd disallow the case in my example above that is totally wrong.
Either you're interpreting it wrong or there are things I haven't
thought about (and I'd be happy to know about them).
But by looking at all the examples in the documentation, I cannot find
one that would need to be changed... well, except the one I'll discuss
below.

Ok, I've had some time to think about this. The following example is
safe, but wouldn't work if I understand correctly what you're proposing:
auto logs = new double[1_000_000];
foreach(i, ref elem; taskPool.parallel(logs)) {
    elem = log(i + 1.0);
}
The problem is that you're writing to the same array from multiple
threads, which the compiler would interpret as writing to the same
variable. However, you can guarantee that no two threads ever write to
the same element, so it's safe.
Note: I'm aware of the false sharing issue when writing to adjacent
memory addresses. However, when writing to an array this big it occurs
to a negligible extent. If you were using a smaller array, then either
the loop body would be so expensive that the cost of false sharing would
be negligible, or the whole loop would be too cheap to be worth
parallelizing.
I'm also aware that word tearing is a concern on some architectures,
though not x86. IMHO std.parallelism and its documentation should not
be pedantic about portability to architectures that a D compiler doesn't
even exist for yet.
Also, your example can be trivially modified to be safe.
void main() {
    int sum = 0;
    foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
        synchronized sum += value;
    }
    writeln(sum);
}
In this case that kills all parallelism, but in more realistic cases I
use this pattern often. I find it very common to have an expensive loop
body that can be performed in parallel, except for a tiny portion that must
update a shared data structure. I'm aware that it might be possible, in
theory, to write this more formally using reduce() or something. However:
1. If the portion of the loop that deals with shared data is very small
(and therefore the serialization caused by the synchronized block is
negligible), it's often more efficient to only keep one data structure
in memory and update it concurrently, rather than use stronger isolation
between threads like reduce() does, and have to maintain one data
structure for each thread.
2. In my experience synchronizing on a small portion of the loop body
works very well in practice. My general philosophy is that, in a
library like this, dangerous but useful constructs must be supported and
treated as innocent until proven guilty, not the other way round.
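As a hedged illustration of that pattern (the names and workload here are hypothetical), the expensive per-element work runs in parallel while only the cheap update of the shared structure is serialized:

```d
import std.math : sqrt;
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writeln;

void main() {
    double[] results;

    foreach (i; taskPool.parallel(iota(1, 1000))) {
        // Expensive work happens outside any lock...
        immutable r = sqrt(cast(double) i);  // stand-in for a costly computation
        // ...and only the tiny shared-state update is synchronized.
        synchronized {
            results ~= r;
        }
    }
    writeln(results.length);  // 999 results, appended in nondeterministic order
}
```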

For instance:
void main() {
    int sum = 0;
    foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
        sum += value;
    }
    writeln(sum);
}
The "+=" would need to be an atomic operation to avoid low-level races.
I think that ParallelForeach's opApply should only accept a shared
delegate. I define shared delegate as a delegate that does not reference
any non-shared variables of its outer scope. The problem is that DMD
currently doesn't know how to determine whether a delegate literal is
shared or not, thus a delegate literal is never shared and if
ParallelForeach's opApply asked for a shared delegate as it should, it would
just not work. Fix DMD to create shared delegate literals where
appropriate and everything can be guaranteed race-free.

If you want pedal-to-metal parallelism without insane levels of
verbosity, you can't have these safety features.

I'm not sure where my proposal asks for more verbosity or less
performance. All I can see is a few less casts in std.parallelism and
that it'd disallow the case in my example above that is totally wrong.
Either you're interpreting it wrong or there are things I haven't
thought about (and I'd be happy to know about them).
But by looking at all the examples in the documentation, I cannot find
one that would need to be changed... well, except the one I'll discuss
below.

Ok, I've had some time to think about this. The following example is
safe, but wouldn't work if I understand correctly what you're proposing:
auto logs = new double[1_000_000];
foreach(i, ref elem; taskPool.parallel(logs)) {
    elem = log(i + 1.0);
}
The problem is that you're writing to the same array from multiple
threads, which the compiler would interpret as writing to the same
variable. However, you can guarantee that no two threads ever write to
the same element, so it's safe.

I don't see a problem with the above. The array elements you modify are
passed through parallel's opApply which can check easily whether it's
safe or not to pass them by ref to different threads (by checking the
element's size) and allow or disallow the operation accordingly.
It could even do a clever trick to make it safe to pass things such as
elements of array of bytes by ref (by coalescing loop iterations for
all bytes sharing the same word into one task). That might not work for
ranges which are not arrays however.
That said, feel free to suggest more problematic examples.
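The size check being proposed might be sketched roughly like this (hypothetical code, not part of std.parallelism):

```d
import std.range : ElementType, isRandomAccessRange;

// Hypothetical guard for parallel's opApply: by-ref iteration is only
// allowed when no two elements can share a machine word, so writes from
// different threads can never overlap within one word.
template refIterationIsSafe(R) if (isRandomAccessRange!R)
{
    enum refIterationIsSafe = ElementType!R.sizeof % size_t.sizeof == 0;
}

static assert(refIterationIsSafe!(double[]));  // word-sized elements: allowed
static assert(!refIterationIsSafe!(byte[]));   // sub-word elements: rejected
```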

Note: I'm aware of the false sharing issue when writing to adjacent
memory addresses. However, when writing to an array this big it occurs
to a negligible extent. If you were using a smaller array, then either
the loop body would be so expensive that the cost of false sharing
would be negligible, or the whole loop would be too cheap to be worth
parallelizing.
I'm also aware that word tearing is a concern on some architectures,
though not x86. IMHO std.parallelism and its documentation should not
be pedantic about portability to architectures that a D compiler
doesn't even exist for yet.

No question there.

Also, your example can be trivially modified to be safe.
void main() {
    int sum = 0;
    foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
        synchronized sum += value;
    }
    writeln(sum);
}
In this case that kills all parallelism, but in more realistic cases I
use this pattern often. I find it very common to have an expensive
loop body that can be performed in parallel, except for a tiny portion that
must update a shared data structure. I'm aware that it might be
possible, in theory, to write this more formally using reduce() or
something. However:
1. If the portion of the loop that deals with shared data is very
small (and therefore the serialization caused by the synchronized block
is negligible), it's often more efficient to only keep one data
structure in memory and update it concurrently, rather than use
stronger isolation between threads like reduce() does, and have to
maintain one data structure for each thread.
2. In my experience synchronizing on a small portion of the loop body
works very well in practice. My general philosophy is that, in a
library like this, dangerous but useful constructs must be supported
and treated as innocent until proven guilty, not the other way round.

Your second example is not really a good justification of anything.
I'll refer you to how synchronized classes work. It was decided that
synchronized in a class protects everything that is directly stored in
the class. Anything behind an indirection is considered shared by the
compiler. The implication of this is that if you have an array or a
pointer to something that you want semantically to be protected by the
class's mutex, you have to cast things to unshared. It was decided that
things should be safe against low-level races first, and convenience
was relegated to a secondary concern. I don't like it very much, but
that's what was decided and written in TDPL.
Now we're in the exact same situation (except that no classes are
involved) and you're proposing that for this case we make convenience
take precedence over low-level race safety? To me this would be an utter
lack of coherency in the design of D's "synchronized" feature to go that
route.
For the case above, wouldn't it be better to use an atomic add instead
of a synchronized block? In this case you can make "sum" shared and
have the type system check that everything is safe (using my proposed
rules). And if your data structure is bigger, likely it'll be a
synchronized class and you won't have to resort to bypassing type-system
safeties inside your loop (although you might have to inside your
synchronized class, but that's another matter).
--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

I don't see a problem with the above. The array elements you modify are
passed through parallel's opApply which can check easily whether it's
safe or not to pass them by ref to different threads (by checking the
element's size) and allow or disallow the operation accordingly.
It could even do a clever trick to make it safe to pass things such as
elements of array of bytes by ref (by coalescing loop iterations for all
bytes sharing the same word into one task). That might not work for
ranges which are not arrays however.
That said, feel free to suggest more problematic examples.

Ok, I completely agree in principle, though I question whether it's
worth actually implementing something like this, especially until we get
some kind of support for shared delegates.

Also, your example can be trivially modified to be safe.
void main() {
    int sum = 0;
    foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
        synchronized sum += value;
    }
    writeln(sum);
}
In this case that kills all parallelism, but in more realistic cases I
use this pattern often. I find it very common to have an expensive
loop body that can be performed in parallel, except for a tiny portion
must update a shared data structure. I'm aware that it might be
possible, in theory, to write this more formally using reduce() or
something. However:
1. If the portion of the loop that deals with shared data is very
small (and therefore the serialization caused by the synchronized
block is negligible), it's often more efficient to only keep one data
structure in memory and update it concurrently, rather than use
stronger isolation between threads like reduce() does, and have to
maintain one data structure for each thread.
2. In my experience synchronizing on a small portion of the loop body
works very well in practice. My general philosophy is that, in a
library like this, dangerous but useful constructs must be supported
and treated as innocent until proven guilty, not the other way round.

Your second example is not really a good justification of anything. I'll
refer you to how synchronized classes work. It was decided that
synchronized in a class protects everything that is directly stored in
the class. Anything behind an indirection is considered shared by the
compiler. The implication of this is that if you have an array or a
pointer to something that you want semantically to be protected by the
class's mutex, you have to cast things to unshared. It was decided that
things should be safe against low-level races first, and convenience was
relegated to a secondary concern. I don't like it very much, but that's
what was decided and written in TDPL.

I'd go a little further. If the guarantees that shared was supposed to
provide are strong, i.e. apply no matter what threading module is used,
then I utterly despise it. It's one of the worst decisions made in the
design of D. Making things pedantically strict, so that the type system
gets in the way more than it helps, encourages the user to reflexively
circumvent the type system without thinking hard about doing this, thus
defeating its purpose. (The alternative of always complying with what
the type system "expects" you to do is too inflexible to even be worth
considering.) Type systems should err on the side of accepting a
superset of what's correct and treating code as innocent until proven
guilty, not the other way around. I still believe this even if some of
the bugs it could be letting pass through might be very difficult to
debug. See the discussion we had a few weeks ago about implicit integer
casting and porting code to 64-bit.
My excuse for std.parallelism is that it's pedal-to-metal parallelism,
so it's more acceptable for it to be dangerous than general case
concurrency. IMHO when you use the non-@safe parts of std.parallelism
(i.e. most of the library), that's equivalent to casting away shared in
a whole bunch of places. Typing "import std.parallelism;" in a
non-@safe module is an explicit enough step here.
The guarantee is still preserved that, if you only use std.concurrency
(D's flagship "safe" concurrency module) for multithreading and don't
cast away shared, there can be no low level data races. IMHO this is
still a substantial accomplishment in that there exists a way to do
safe, statically checkable concurrency in D, even if it's not the
**only** way concurrency can be done. BTW, core.thread can also be used
to get around D's type system, not just std.parallelism. If you want to
check that only safe concurrency is used, importing std.parallelism and
core.thread can be grepped just as easily as casting away shared.
If, on the other hand, the guarantees of shared are supposed to be weak
in that they only apply to programs where only std.concurrency is used
for multithreading, then I think strictness is the right thing to do.
The whole point of std.concurrency is to give strong guarantees, but if
you prefer more dangerous but more flexible multithreading, other
paradigms should be readily available.

Now we're in the exact same situation (except that no classes are
involved) and you're proposing that for this case we make convenience
take precedence over low-level race safety? To me this would be an utter
lack of coherency in the design of D's "synchronized" feature to go that
route.
For the case above, wouldn't it be better to use an atomic add instead
of a synchronized block?

Yes. I've used this pattern in much more complicated cases, though,
where atomic wouldn't cut it.

In this case you can make "sum" shared and have
the type system check that everything is safe (using my proposed rules).
And if your data structure is bigger, likely it'll be a synchronized
class and you won't have to resort to bypassing type-system safeties inside
your loop (although you might have to inside your synchronized class,
but that's another matter).

I'm **still** totally confused about how shared is supposed to work,
because I don't have a fully debugged/implemented implementation or good
examples of stuff written in this paradigm to play around with. TDPL
helps to some degree, but for me it's a lot easier to understand
something by actually trying to use it.

I don't see a problem with the above. The array elements you modify are
passed through parallel's opApply which can check easily whether it's
safe or not to pass them by ref to different threads (by checking the
element's size) and allow or disallow the operation accordingly.
It could even do a clever trick to make it safe to pass things such as
elements of array of bytes by ref (by coalescing loop iterations for all
bytes sharing the same word into one task). That might not work for
ranges which are not arrays however.
That said, feel free to suggest more problematic examples.

Ok, I completely agree in principle, though I question whether it's
worth actually implementing something like this, especially until we
get some kind of support for shared delegates.

Well, it'll work irrespective of whether shared delegates are used or
not. I think you could add a compile-time check that the array element
size is a multiple of the word size when the element is passed by ref
in the loop and leave the clever trick as a possible future
improvement. Would that work?

Also, your example can be trivially modified to be safe.
void main() {
    int sum = 0;
    foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
        synchronized sum += value;
    }
    writeln(sum);
}
In this case that kills all parallelism, but in more realistic cases I
use this pattern often. I find it very common to have an expensive
loop body that can be performed in parallel, except for a tiny portion
must update a shared data structure. I'm aware that it might be
possible, in theory, to write this more formally using reduce() or
something. However:
1. If the portion of the loop that deals with shared data is very
small (and therefore the serialization caused by the synchronized
block is negligible), it's often more efficient to only keep one data
structure in memory and update it concurrently, rather than use
stronger isolation between threads like reduce() does, and have to
maintain one data structure for each thread.
2. In my experience synchronizing on a small portion of the loop body
works very well in practice. My general philosophy is that, in a
library like this, dangerous but useful constructs must be supported
and treated as innocent until proven guilty, not the other way round.

Your second example is not really a good justification of anything. I'll
refer you to how synchronized classes work. It was decided that
synchronized in a class protects everything that is directly stored in
the class. Anything behind an indirection is considered shared by the
compiler. The implication of this is that if you have an array or a
pointer to something that you want semantically to be protected by the
class's mutex, you have to cast things to unshared. It was decided that
things should be safe against low-level races first, and convenience was
relegated to a secondary concern. I don't like it very much, but that's
what was decided and written in TDPL.

I'd go a little further. If the guarantees that shared was supposed to
provide are strong, i.e. apply no matter what threading module is used,
then I utterly despise it. It's one of the worst decisions made in the
design of D. Making things pedantically strict, so that the type
system gets in the way more than it helps, encourages the user to
reflexively circumvent the type system without thinking hard about
doing this, thus defeating its purpose. (The alternative of always
complying with what the type system "expects" you to do is too
inflexible to even be worth considering.) Type systems should err on
the side of accepting a superset of what's correct and treating code as
innocent until proven guilty, not the other way around. I still
believe this even if some of the bugs it could be letting pass through
might be very difficult to debug. See the discussion we had a few
weeks ago about implicit integer casting and porting code to 64-bit.

I agree with you that this is a serious problem. I think part of why it
hasn't been talked much yet is that nobody is currently using D2
seriously for multithreaded stuff at this time (apart from you I
guess), so we're missing experience with it. Andrei seems to think it's
fine to require casts as soon as you need to protect something beyond
an indirection inside synchronized classes, with the mitigating measure
that you can make classes share their mutex (not implemented yet, I
think) so if the indirection leads to a class it is less of a problem.
Personally, I don't.

My excuse for std.parallelism is that it's pedal-to-metal parallelism,
so it's more acceptable for it to be dangerous than general case
concurrency. IMHO when you use the non-@safe parts of std.parallelism
(i.e. most of the library), that's equivalent to casting away shared in
a whole bunch of places. Typing "import std.parallelism;" in a
non-@safe module is an explicit enough step here.

I still think this "pedal-to-metal" qualification needs to be
justified. Not having shared delegates in the language seems like an
appropriate justification to me. Wanting to bypass casts you normally
have to do around synchronized as the sole reason seems like a bad
justification to me.
It's not that I like how synchronized works, it's just that I think it
should work the same everywhere.

The guarantee is still preserved that, if you only use std.concurrency
(D's flagship "safe" concurrency module) for multithreading and don't
cast away shared, there can be no low level data races. IMHO this is
still a substantial accomplishment in that there exists a way to do
safe, statically checkable concurrency in D, even if it's not the
**only** way concurrency can be done. BTW, core.thread can also be
used to get around D's type system, not just std.parallelism. If you
want to check that only safe concurrency is used, importing
std.parallelism and core.thread can be grepped just as easily as
casting away shared.

Unless I'm mistaken, the only thing that bypasses race-safety in
core.thread is the Thread constructor that takes a delegate. Which
means it could easily be made race-safe by making that delegate
parameter shared (once shared delegates are implemented).

If, on the other hand, the guarantees of shared are supposed to be weak
in that they only apply to programs where only std.concurrency is used
for multithreading, then I think strictness is the right thing to do.
The whole point of std.concurrency is to give strong guarantees, but if
you prefer more dangerous but more flexible multithreading, other
paradigms should be readily available.

I think the language as a whole is designed to have strong guarantees,
otherwise synchronized classes wouldn't require guarantee-breaking casts
at every indirection.
I'm not too pleased with the way synchronized classes are supposed to
work, nor am I too pleased with how it impacts the rest of the
language. But if this is a problem (and I think it is), it ought to be
fixed globally, not by shutting down safeties in every module dealing
with multithreading that isn't std.concurrency.

I'm **still** totally confused about how shared is supposed to work,
because I don't have a fully debugged/implemented implementation or
good examples of stuff written in this paradigm to play around with.

I think nobody has played much with the paradigm at this point, or
we'd have heard some feedback. Well, actually we have your feedback,
which seems to indicate that it's better to shut off the safeties than to
play nice with them.
- - -
Quoting Andrei, February 4 of 2010 in "there is no escape" on the
dmd-concurrency mailing list:

As we already knew, shared/synchronized limit quite drastically the
range of lock-based designs without casts. Fortunately, using a class
reference member inside a synchronized object will be possible
because... well I'll explain in the text.
I continue to believe this is the right bet to make, but I expect push
back from experienced lock-based programmers.

Now is the beginning of that push back, I guess.
--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Well, it'll work irrespective of whether shared delegates are used or
not. I think you could add a compile-time check that the array element
size is a multiple of the word size when the element is passed by ref in
the loop and leave the clever trick as a possible future improvement.
Would that work?

On second thought, no, but for practical, not theoretical reasons: One,
you can't introspect whether a foreach loop is using a ref or a value
parameter. This is an issue with how opApply works. Two, AFAIK there's
no way to get the native word size.

I'd go a little further. If the guarantees that shared was supposed to
provide are strong, i.e. apply no matter what threading module is
used, then I utterly despise it. It's one of the worst decisions made
in the design of D. Making things pedantically strict, so that the
type system gets in the way more than it helps, encourages the user to
reflexively circumvent the type system without thinking hard about
doing this, thus defeating its purpose. (The alternative of always
complying with what the type system "expects" you to do is too
inflexible to even be worth considering.) Type systems should err on
the side of accepting a superset of what's correct and treating code
as innocent until proven guilty, not the other way around. I still
believe this even if some of the bugs it could be letting pass through
might be very difficult to debug. See the discussion we had a few
weeks ago about implicit integer casting and porting code to 64-bit.

I agree with you that this is a serious problem. I think part of why it
hasn't been talked about much yet is that nobody is currently using D2
seriously for multithreaded stuff (apart from you, I guess), so we're
missing experience with it. Andrei seems to think it's fine to require
casts as soon as you need to protect something beyond an indirection
inside synchronized classes, with the mitigating measure that you can
make classes share their mutex (not implemented yet, I think), so if the
indirection leads to a class it is less of a problem.
Personally, I don't.

My excuse for std.parallelism is that it's pedal-to-metal parallelism,
so it's more acceptable for it to be dangerous than general case
concurrency. IMHO when you use the non-@safe parts of std.parallelism
(i.e. most of the library), that's equivalent to casting away shared
in a whole bunch of places. Typing "import std.parallelism;" in a
non-@safe module is an explicit enough step here.

I still think this "pedal-to-metal" qualification needs to be justified.
Not having shared delegates in the language seems like an appropriate
justification to me. Wanting to bypass casts you normally have to do
around synchronized as the sole reason seems like a bad justification to
me.
It's not that I like how synchronized works, it's just that I think it
should work the same everywhere.

This is where you and I disagree. I think that the type system's
guarantees should be weak, i.e. only apply to std.concurrency. IMHO the
strictness is reasonable when using message passing as your primary
method of multithreading and only very little shared state. However,
it's completely unreasonable if you want to use a paradigm where shared
state is more heavily used. D, being a systems language, needs to allow
other styles of multithreading without making them a huge PITA that
requires casts everywhere.

The guarantee is still preserved that, if you only use std.concurrency
(D's flagship "safe" concurrency module) for multithreading and don't
cast away shared, there can be no low level data races. IMHO this is
still a substantial accomplishment in that there exists a way to do
safe, statically checkable concurrency in D, even if it's not the
**only** way concurrency can be done. BTW, core.thread can also be
used to get around D's type system, not just std.parallelism. If you
want to check that only safe concurrency is used, importing
std.parallelism and core.thread can be grepped just as easily as
casting away shared.
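To make the core.thread escape hatch concrete, here is a minimal sketch (standard druntime API) of handing unshared data to another thread without a single cast or mention of shared:

```d
import core.thread;

void main()
{
    int[] data = [1, 2, 3]; // unshared, i.e. nominally thread-local

    // The Thread constructor takes a plain delegate, so the closure can
    // capture unshared data and mutate it from the new thread; the type
    // system's no-shared-access guarantee is silently bypassed.
    auto t = new Thread({ data[0] = 42; });
    t.start();
    t.join();
}
```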

Unless I'm mistaken, the only thing that bypasses race-safety in
core.thread is the Thread constructor that takes a delegate. Which means
it could easily be made race-safe by making that delegate parameter
shared (once shared delegates are implemented).

And then you'd only need one cast to break it if you wanted to, not
casts everywhere. Just cast an unshared delegate to shared when passing
it to core.thread.

If, on the other hand, the guarantees of shared are supposed to be
weak in that they only apply to programs where only std.concurrency is
used for multithreading, then I think strictness is the right thing to
do. The whole point of std.concurrency is to give strong guarantees,
but if you prefer more dangerous but more flexible multithreading,
other paradigms should be readily available.

I think the language as a whole is designed to have strong guarantees,
otherwise synchronized classes wouldn't require guarantee-breaking casts
at every indirection.

Well, there's an easy way around that, too. Just declare the whole
method body synchronized, but don't declare the method synchronized in
the signature.

I'm not too pleased with the way synchronized classes are supposed to
work, nor am I too pleased with how it impacts the rest of the language.
But if this is a problem (and I think it is), it ought to be fixed
globally, not by shutting down safeties in every module dealing with
multithreading that isn't std.concurrency.

Again, I completely disagree. IMHO it's not fixable globally such that
both of the following are achieved:
1. The strong guarantees when using only std.concurrency are preserved.
2. More shared state-intensive multithreading can be done without the
type system getting in the way more than it helps.

I'm **still** totally confused about how shared is supposed to work,
because I don't have a fully debugged, complete implementation or good
examples of stuff written in this paradigm to play around with.

I think nobody has played much with the paradigm at this point, or we'd
have heard some feedback. Well, actually we have your feedback, which
seems to indicate that it's better to shut off the safeties than to play
nice with them.

I agree that it's better to shut off the safeties **unless** you're
doing very coarse-grained multithreading with very little shared state
like std.concurrency had in mind. If this is what you want, then the
safeties are great.
Unfortunately, I'm going to have to take a hard line on this one. The
issue of integrating std.parallelism into the race safety system had
been discussed a bunch in the past and it was basically agreed that
std.parallelism is a "here be dragons" module that cannot reasonably be
made to conform to such a model. Given the choice (and I hope I'm not
forced to make this choice) I'd rather std.parallelism be a third-party
module that's actually usable than a Phobos module that is such a PITA
to use that no one uses it in practice.


On second thought, no, but for practical, not theoretical reasons:
One, you can't introspect whether a foreach loop is using a ref or a
value parameter. This is an issue with how opApply works.

Indeed a problem. Either we fix the compiler to support that, or we
change the syntax to something like this:

    taskPool.apply(range, (ref int value) {
        ...
    });

Or we leave things as they are.

Two, AFAIK there's no way to get the native word size.

Right. That's a problem too... you could probably alleviate this by
doing a runtime check with some fancy instructions to get the native
word size, but I'd expect that to be rather convoluted.
I'd like to check if I understand that well. For instance this code:

    int[100] values;
    foreach (i, ref value; parallel(values))
        value = i;

would normally run fine on a 32-bit processor, but it'd create a
low-level race on a 64-bit processor (even a 64-bit processor running
a 32-bit program in 32-bit compatibility mode). And even that is a
generalization: some 32-bit processors out there *might* have 64-bit
native words. So the code above isn't portable. Is that right?
Which makes me think... we need to document those pitfalls somewhere.
Perhaps std.parallelism's documentation should link to a related page
about what you can and what you can't do safely. People who read that
"all the safeties are off" in std.parallelism aren't going to
understand what you're talking about unless you explain the pitfalls
with actual examples (like the one above).

Unfortunately, I'm going to have to take a hard line on this one. The
issue of integrating std.parallelism into the race safety system had
been discussed a bunch in the past and it was basically agreed that
std.parallelism is a "here be dragons" module that cannot reasonably be
made to conform to such a model.

About the "cannot reasonably be made to conform to such a model" part:
that is certainly true today, but might turn out not to be true as the
model evolves. It certainly is true as long as we don't have shared
delegates. Beyond that it becomes more fuzzy. The final goal of the
module shouldn't be to bypass safeties but to provide good parallelism
primitives. By all means, if safeties can be enabled reasonably they
should be (but as of today, they can't).
And I totally agree with you that it's quite silly to require casts
everywhere to use synchronized. I'll be the first to admit that it's
hard to see synchronized classes being very practical as they're
implemented today. There's room for improvements there too.


I'd like to check if I understand that well. For instance this code:

    int[100] values;
    foreach (i, ref value; parallel(values))
        value = i;

would normally run fine on a 32-bit processor, but it'd create a
low-level race on a 64-bit processor (even a 64-bit processor running
a 32-bit program in 32-bit compatibility mode). And even that is a
generalization: some 32-bit processors out there *might* have 64-bit
native words. So the code above isn't portable. Is that right?
Which makes me think... we need to document those pitfalls somewhere.
Perhaps std.parallelism's documentation should link to a related page
about what you can and what you can't do safely. People who read that
"all the safeties are off" in std.parallelism aren't going to
understand what you're talking about unless you explain the pitfalls
with actual examples (like the one above).

This problem is **much** less severe than you are suggesting. x86 can
address single bytes, so it's not a problem even if you're iterating
over bytes on a 64-bit machine. CPUs like Alpha (which no D compiler
even exists for) can't natively address individual bytes. Therefore,
writing to a byte would be implemented much like writing to a bit is on
x86: you'd read the full word in, change one byte, and write the full
word back. I'm not sure exactly how it would be implemented at the
compiler level, or whether you'd even be allowed to have a reference to
a byte in such an implementation. This is why I consider this more a
theoretical problem than a serious practical issue.
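As a concrete illustration of the scenario being debated (a sketch, not a claim that any current target actually tears):

```d
import std.parallelism;

void main()
{
    // Adjacent bytes here can share one native word. On x86 each store
    // is a genuine single-byte write, so this is fine. On a CPU that can
    // only address whole words, each store would compile to
    // load word -> splice byte -> store word, and two iterations hitting
    // the same word concurrently could silently lose one update.
    ubyte[64] flags;
    foreach (i, ref f; parallel(flags[]))
        f = 1;
}
```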


that is certainly true today, but might turn out not to be true as the
model evolves. It certainly is true as long as we don't have shared
delegates. Beyond that it becomes more fuzzy. The final goal of the
module shouldn't be to bypass safeties but to provide good parallelism
primitives. By all means, if safeties can be enabled reasonably they
should be (but as of today, they can't).

I'll agree that, if the model evolves sufficiently, everything I've said
deserves to be reconsidered. My comments only apply to the model as it
exists currently or will exist in the foreseeable future. It's just that
I have **serious** doubts that it will evolve much and I don't want to
delay things to have an academic discussion about what-ifs with regard
to this.

And I totally agree with you that it's quite silly to require casts
everywhere to use synchronized. I'll be the first to admit that it's
hard to see synchronized classes being very practical as they're
implemented today. There's room for improvements there too.

This is the point I'm trying to make: requiring these casts is perfectly
reasonable if you assume that shared state is going to be rare, as it is
in the std.concurrency model. When the user is looking for a more
fine-grained, more shared-state-heavy paradigm, I don't think it's
possible to make it safe without also making it virtually unusable. This
is why we punted and decided to go with message passing and very limited
shared data as the flagship multithreading model.
On the other hand, in a systems language, we can't effectively prohibit
shared-state-heavy multithreading by making it unusable. Thus, the
no-low-level-races guarantee needs to apply to std.concurrency only.


Actually, just remembered that word tearing is also an issue with
unaligned memory access. I guess I could just include a warning that
says not to do this with unaligned memory.


s/is/may be/. I'm trying to read up on word tearing and post questions
here and on StackOverflow. Good documentation about it seems
ridiculously hard to come by. About the only solid pieces of
information I've found are that x86 definitely **can** do byte
granularity (which doesn't mean it always does) and that Java makes
these guarantees (which is only useful if you're writing in Java). The
general gut feeling people here and on StackOverflow seem to have is
that it's not an issue, but there doesn't seem to be much backing this
up.
Maybe I'm just being paranoid and this is a complete non-issue except on
the most obscure/ancient hardware and/or most stupidly implemented
compilers, which is why few people even think of it.

* Still no synopsis example that illustrates in a catchy way the most
attractive artifacts.

I don't see what I could put here that isn't totally redundant with the
rest of the documentation. Anything I could think of would basically
just involve concatenating all the examples. Furthermore, none of the
other Phobos modules have this, so I don't know what one should look
like. In general I feel like std.parallelism is being held to a
ridiculously high standard that none of the other Phobos modules
currently meet.

* "After creation, Task objects are submitted to a TaskPool for
execution." I understand it's possible to use Task straight as a
promise/future, so s/are/may be/.

No. The only way Task is useful is by submitting it to a pool to be
executed. (Though this may change, see below.)

Also it is my understanding that TaskPool efficiently maps an arbitrary
number of tasks onto the concrete number of CPUs available. If that's
the case, it's worth mentioning.

Isn't this kind of obvious from the examples, etc.?

* "If a Task has been submitted to a TaskPool instance, is being stored
in a stack frame, and has not yet finished, the destructor for this
struct will automatically call yieldWait() so that the task can finish
and the stack frame can be destroyed safely." At this point in the doc
the reader doesn't understand that at all because TaskPool has not been
seen yet. The reader gets worried that she'll be essentially serializing
the entire process by mistake. Either move this explanation down or
provide an example.

This is getting ridiculous. There are so many circular dependencies
between Task and TaskPool, impossible to remove, that I'm not even going
to try. One or the other has to be introduced first, but neither can be
explained without mentioning the other. This is why I explain the
relationship briefly in the module-level summary, so that the user has
at least some idea. I think this is about the best I can do.

* Is done() a property?

Yes. DDoc sucks.

* The example that reads two files at the same time should NOT use
taskPool. It's just one task, why would the pool ever be needed? If you
also provided an example that reads n files in memory at the same time
using a pool, that would illustrate nicely why you need it. If a Task
can't be launched without being put in a pool, there should be a
possibility to do so. At my work we have a simple function called
callInNewThread that does what's needed to launch a function in a new
thread.

I guess I could add something like this to Task. Under the hood it
would (for implementation simplicity, to reuse a bunch of code from
TaskPool) fire up a new single-thread pool, submit the task, call
TaskPool.finish(), and then return. Since you're already creating a new
thread, the extra overhead of creating a new TaskPool for the thread
would be negligible and it would massively simplify the implementation.
My only concern is that, when combined with scoped versus non-scoped
tasks (which are definitely here to stay, see below) this small
convenience function would add way more API complexity than it's worth.
Besides, task pools don't need to be created explicitly anyhow.
That's what the default instantiation is for. I don't see how
callInNewThread would really solve much.
Secondly, I think you're reading **WAY** too much into what was meant to
be a simple example to illustrate usage mechanics. This is another case
where I can't think of a small, cute example of where you'd really need
the pool. There are plenty of larger examples, but the smallest/most
self-contained one I can think of is a parallel sort. I decided to use
file reading because it was good enough to illustrate the mechanics of
usage, even if it didn't illustrate a particularly good use case.
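The single-thread-pool implementation strategy described above might be sketched like this (`executeInNewThread` is a hypothetical name here; the eventual API could differ):

```d
import std.parallelism;

// Hypothetical convenience wrapper: run one task in its own thread by
// spinning up a throwaway pool with a single worker.
void executeInNewThread(T)(ref T t)
{
    auto pool = new TaskPool(1); // one worker thread
    pool.put(t);                 // submit the task to that worker
    pool.finish();               // worker drains the queue, then terminates
}
```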

* The note below that example gets me thinking: it is an artificial
limitation to force users of Task to worry about scope and such. One
should be able to create a Future object (Task I think in your
terminology), pass it around like a normal value, and ultimately force
it. This is the case for all other languages that implement futures. I
suspect the "scope" parameter associated with the delegate a couple of
definitions below plays a role here, but I think we need to work for
providing the smoothest interface possible (possibly prompting
improvements in the language along the way).

This is what TaskPool.task is for. Maybe this should be moved to the
top of the definition of TaskPool and emphasized, and the scoped/stack
allocated versions should be moved below TaskPool and de-emphasized?
At any rate, though, anyone who's playing with parallelism should be an
advanced enough programmer that concepts like scope and destructors are
second nature.

* I'm not sure how to interpret the docs for
ReturnType!(F) run(F, Args...)(F fpOrDelegate, ref Args args);
So it's documented but I'm not supposed to care. Why not just remove?
Surely there must be higher-level examples that clarify that I can use
delegates etc.

Yes, and there are such examples. It's just that I want to explicitly
state what type is returned in this case rather than returning auto. If
I omitted the docs for run(), you'd ask what run() is.

* "If you want to escape the Task object from the function in which it
was created or prefer to heap allocate and automatically submit to the
pool, see TaskPool.task()." I'm uncomfortable that I need to remember a
subtle distinction of the same primitive name ("task") depending on
membership in a TaskPool or not, which is a tenuous relationship to
remember. How about "scopedTask()" vs. "task()"? Also, it's worth asking
ourselves what's the relative overhead of closure allocation vs. task
switching etc. Maybe we get to simplify things a lot at a small cost in
efficiency.

We absolutely **need** scope delegates for calling stuff where a closure
can't be allocated due to objects on the stack frame having scoped
destruction. This is not going anywhere. Also, I **much** prefer to
have everything that does conceptually the same thing have the same
name. I think this is clearer, not less clear.

* As I mentioned, in the definition:
Task!(run,TypeTuple!(F,Args)) task(F, Args...)(F fun, Args args);
I can't tell what "run" is.

run() is just the adapter function described right above this code. I
cannot for the life of me understand how this could possibly be unclear.

* "A goto from inside the parallel foreach loop to a label outside the
loop will result in undefined behavior." Would this be a bug in dmd?

No, it's because a goto of this form has no reasonable, useful
semantics. I should probably mention in the docs that the same applies
to labeled break and continue.
I have no idea what semantics these should have, and even if I did,
given the long odds that even one person would actually need them, I
think they'd be more trouble than they're worth to implement. For
example, once you break out of a parallel foreach loop to some arbitrary
address (and different threads can goto different labels, etc.), well,
it's no longer a parallel foreach loop. It's just a bunch of completely
unstructured threading doing god-knows-what.
Therefore, I slapped undefined behavior on it as a big sign that says,
"Just don't do it." This also has the advantage that, if anyone ever
thinks of any good, clearly useful semantics, these will be
implementable without breaking code later.

I **STRONGLY** believe this does not belong in API documentation because
it's too machine specific, compiler specific, stack alignment specific,
etc. and almost any benchmark worth doing takes up more space than an
example should. Furthermore, anyone who wants to know this can easily
time it themselves. I have absolutely no intention of including this.
While in general I appreciate and have tried to accommodate your
suggestions, this is one I'll be standing firm on.

* The join example is fine, but the repetitive code suggests that a loop
would be better:

    import std.file;

    void main() {
        auto pool = new TaskPool();
        foreach (filename; ["foo.txt", "bar.txt", "baz.txt"]) {
            pool.put(task!read(filename));
        }
        // Call join() to guarantee that all tasks are done running, the
        // worker threads have terminated, and that the results of all of
        // the tasks can be accessed without any synchronization primitives.
        pool.join();
        ubyte[][] data; // void[], meh
        // Use spinWait() since the results are guaranteed to have been
        // computed and spinWait() is the cheapest of the wait functions.
        foreach (task; pool.howDoIEnumerateTasks()) {
            data ~= task.spinWait();
        }
    }

You can't enumerate the tasks like this, but there's a good reason for
this limitation. The design requires the pool not hold onto any
references to the tasks after they're out of the queue, so that they can
be safely destroyed as soon as the caller wants to destroy them. If you
need to enumerate over them like this, then:
1. You're probably doing something wrong and parallel foreach would be
a better choice.
2. If you insist on doing this, you need to explicitly store them in an
array or something.
As a more general comment, I know the join() example sucks. join() is
actually a seldom-used primitive, not a frequently used one as you
suggest. Most of the time you wait for individual tasks (explicitly or
implicitly through the higher level functions), not the whole pool, to
finish executing. I feel like it's obligatory that join() and finish()
exist, but I never use them myself. They were much more useful in
pre-release versions of this lib, when parallel foreach didn't exist and
it made sense to have explicit arrays of Tasks. The difficulty I've had
in coming up with a decent example for them makes me halfway tempted to
just rip those $#)*% functions out entirely.

Therefore, I slapped undefined behavior on it as a big sign that says,
"Just don't do it." This also has the advantage that, if anyone ever
thinks of any good, clearly useful semantics, these will be
implementable without breaking code later.

I think an improvement over undefined behaviour would be to throw an exception.
Also, what happens if one of the tasks throws an exception?
--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

I think an improvement over undefined behaviour would be to throw an
exception.

The only problem is that there's no easy, well-documented way to tell
from the return value of opApply whether it was a break, a goto, a
labeled break/continue, etc. This would be implementable only if I
changed the semantics of break to also throw. This might not be a bad
thing (IMHO any type of breaking out of a parallel foreach loop is just
silly) but others had asked for different semantics for break.

Also, what happens if one of the tasks throws an exception?

It gets rethrown when yieldWait()/spinWait()/workWait() is called. In
the case of the higher-level primitives, it gets re-thrown to the
calling thread at some non-deterministic point in the execution of these
functions. I didn't see the need to document this explicitly because it
"just works".

On 3/19/2011 2:35 PM, Michel Fortin wrote:
It gets rethrown when yieldWait()/spinWait()/workWait() is called. In
the case of the higher-level primitives, it gets re-thrown to the
calling thread at some non-deterministic point in the execution of these
functions. I didn't see the need to document this explicitly because it
"just works".

...Though now that you make me think of it I need to support exception
chaining for the case of multiple concurrently thrown exceptions instead
of just arbitrarily and non-deterministically rethrowing one. The code
that handles this was written before exception chaining existed. Will
get on that.
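A sketch of what that chaining might look like, using D2's `Throwable.next` link (`collectAndChain` and the caught-exception array are illustrative, not the module's actual internals):

```d
// Link all concurrently thrown exceptions into one chain via
// Throwable.next, so the caller sees every failure instead of an
// arbitrarily chosen survivor.
Throwable collectAndChain(Throwable[] caught)
{
    Throwable head = null, tail = null;
    foreach (e; caught)
    {
        if (head is null)
            head = tail = e;
        else
        {
            tail.next = e; // append to the chain
            tail = e;
        }
    }
    return head; // rethrow from the waiting thread if non-null
}
```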


The only problem is that there's no easy, well-documented way to tell
from the return value of opApply whether it was a break, a goto, a
labeled break/continue, etc. This would be implementable only if I
changed the semantics of break to also throw. This might not be a bad
thing (IMHO any type of breaking out of a parallel foreach loop is just
silly) but others had asked for different semantics for break.

It's not that silly.
Essentially, what you'd express like this with a normal function taking
a delegate:
taskPool.apply([1,2,3], (int i) {
    if (i == 1)
        return;
    // do some things
});
you'd express like this in a parallel foreach:
foreach (int i; parallel([1,2,3])) {
    if (i == 1)
        break;
    // do some things
}
It's not following the semantics of break within a foreach, but it's
still useful to be able to return early from a function (from a loop
iteration in this case), so I see the use case for making 'break' do
what it does.
My only gripe is that the semantics are distorted. In fact, just making
foreach parallel distorts its semantics. I was confused earlier about a
foreach being parallel when it was not; someone could also be confused
in the other direction, thinking a foreach is the normal foreach when it
actually is parallel. This makes code harder to review. Don't consider
only my opinion on this, but in my opinion the first form above
(taskPool.apply) is preferable because you absolutely can't mistake it
with a normal foreach. And I think it's especially important to make it
stand out given that the compiler can't check for low-level races.

Also, what happens if one of the tasks throws an exception?

It gets rethrown when yieldWait()/spinWait()/workWait() is called. In
the case of the higher-level primitives, it gets re-thrown to the
calling thread at some non-deterministic point in the execution of
these functions. I didn't see the need to document this explicitly
because it "just works".

That's indeed what I'd expect. But I think it'd still be worth
mentioning in a short sentence in yieldWait()/spinWait()/workWait()'s
documentation. It's comforting when the documentation confirms your
expectations.
--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Using continue is well defined behavior and does exactly what I think
you're suggesting. The problem with break is that it's supposed to stop
all subsequent iterations of the loop from being executed, not just end
the current one early. This only makes sense in a serial context, where
"subsequent iterations" is well-defined. Currently break breaks from
the current work unit but continues executing all other work units. I
put this semantic in because I couldn't think of anything else to make
it do and someone (I don't remember who) asked for it. I have never
encountered a use case for it.
IMHO all early termination that affects subsequent loop iterations as
well (break, goto, labeled break and continue, but not regular continue)
should just throw because they make absolutely no sense in a parallel
context.
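
The well-defined continue semantics described above can be sketched as
follows (a hypothetical snippet, not from the library's docs; it assumes
the parallel helper over an array, where continue merely ends the
current iteration early and leaves all other iterations unaffected):

```d
import std.math : log;
import std.parallelism : parallel;

void main() {
    auto results = new double[100];
    // continue only ends the current iteration early; every other
    // iteration still runs, which is well-defined even in parallel.
    foreach (i, ref elem; parallel(results)) {
        if (i % 2 == 0)
            continue; // skip even indices, leave those elements untouched
        elem = log(i + 1.0);
    }
}
```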

IMHO all early termination that affects subsequent loop iterations as well
(break, goto, labeled break and continue, but not regular continue) should just
throw because they make absolutely no sense in a parallel context.

What about some kind of parallel search? You just want to know if something is
there and on first sight you're done, so you want to break out of this iteration
and don't want any new parallel iterations to be done.
(Not sure what should be done with currently running iterations.. should they be
killed or go on until they're done - anyway, the executing thread shouldn't
start a new iteration after that)

It's an interesting suggestion, and I'll give some thought to it, but on
first glance it sounds **very** difficult to implement efficiently. To
avoid doing a lot of extra work after you've found what you need, you'd
need to use very small work units. To avoid excessive overhead you'd
need to use fairly large work units. The only case I can think of where
this would be do-able is when evaluating the predicate is very expensive.
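
Without new primitives, such a search could be approximated with a
shared flag that turns the remaining iterations into no-ops (a sketch,
not library functionality; it assumes flag accesses go through
core.atomic, and work units already dispatched still run to completion,
they just skip the expensive test):

```d
import core.atomic : atomicLoad, atomicStore;
import std.parallelism : parallel;

// Hypothetical early-exit parallel search. Smallish work units limit
// the wasted work after a hit, at the cost of some extra overhead.
bool parallelContains(int[] haystack, int needle) {
    shared bool found = false;
    foreach (e; parallel(haystack, 64)) {
        if (atomicLoad(found))
            continue;                 // short-circuit: skip the test
        if (e == needle)
            atomicStore(found, true);
    }
    return atomicLoad(found);
}
```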

* Still no synopsis example that illustrates in a catchy way the most
attractive artifacts.

I don't see what I could put here that isn't totally redundant with the
rest of the documentation. Anything I could think of would basically
just involve concatenating all the examples. Furthermore, none of the
other Phobos modules have this, so I don't know what one should look
like.

I'm thinking along the lines of:
http://www.digitalmars.com/d/2.0/phobos/std_exception.html
A nice synopsis would be the pi computation. Just move that up to the
synopsis. It's simple, clean, and easy to relate to. Generally, you'd
put here not all details but the stuff you think would make it easiest
for people to get into your library.
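
The pi computation suggested above could read roughly like this as a
synopsis (a sketch assuming taskPool.reduce and the midpoint rectangle
rule; the term count and names are illustrative):

```d
import std.algorithm : map;
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writeln;

immutable n = 1_000_000;
immutable delta = 1.0 / n;

// One term of the rectangle-rule approximation of the integral of
// 4 / (1 + x^2) over [0, 1], whose exact value is pi.
real getTerm(int i) {
    immutable x = (i - 0.5) * delta;
    return delta / (1.0 + x * x);
}

void main() {
    // reduce splits the sum across the worker threads and combines
    // the partial sums with "a + b".
    immutable pi = 4.0 * taskPool.reduce!"a + b"(map!getTerm(iota(n)));
    writeln(pi); // approximately 3.14159
}
```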

In general I feel like std.parallelism is being held to a
ridiculously high standard that none of the other Phobos modules
currently meet.

I agree, but that goes without saying. This is not a double standard;
it's a simple means to improve quality of Phobos overall. Clearly
certain modules that are already in Phobos would not make it if proposed
today. And that's a good thing! Comparing our current ambitions to the
quality of the average Phobos module would not help us.
Generally it seems you have already grown tired of the exchange and it
would take only a few more exchanges for you to say, "you know what,
either let's get it over with or forget about it."
Please consider for a minute how this is the wrong tone and attitude to
be having on several levels. This is not a debate and not the place to
get defensive. Your role in the discussion is not symmetric with the
others' and at best you'd use the review as an opportunity to improve
your design, not stick heels in the ground to defend its current
incarnation (within reason). Your potential customers are attempting to
help you by asking questions (some of which no doubt are silly) and by
making suggestions (some of which, again, are ill-founded).
Nevertheless, we _are_ your potential customers and in a way the
customer is always right. Your art is to steer customers from what they
think they want to what you know they need - because you're the expert!
- and to improve design, nomenclature, and implementation to the extent
that would help them.
Furthermore, you should expect that the review process will prompt
changes. My perception is that you consider the submission more or less
final modulo possibly a few minor nits. You shouldn't. I'm convinced you
know much more about SMP than most or all others in this group, but in
no way that means your design has reached perfection and is beyond
improvement even from a non-expert.
I know you'd have no problem finding the right voice in this discussion
if you only frame it in the right light. Again, people are trying to
help (however awkwardly) and in no way is that ridiculous.

* "After creation, Task objects are submitted to a TaskPool for
execution." I understand it's possible to use Task straight as a
promise/future, so s/are/may be/.

No. The only way Task is useful is by submitting it to a pool to be
executed. (Though this may change, see below.)

I very much hope this does change. Otherwise the role of Task in the
design could be drastically reduced (e.g. nested type inside of
TaskPool) without prejudice. At the minimum I want to be able to create
a task, launch it, and check its result later without involving a pool.
A pool is when I have many tasks that may exceed the number of CPUs etc.
Simplicity would be great.
// start three reads
auto readFoo = task!readText("foo.txt");
auto readBar = task!readText("bar.txt");
auto readBaz = task!readText("baz.txt");
// join'em all
auto foo = readFoo.yieldWait();
auto bar = readBar.yieldWait();
auto baz = readBaz.yieldWait();
There's no need at this level for a task pool. What would be nice would
be to have a join() that joins all tasks spawned by the current thread:
// start three reads
auto readFoo = task!readText("foo.txt");
auto readBar = task!readText("bar.txt");
auto readBaz = task!readText("baz.txt");
// join'em all
join();
// fetch results
auto foo = readFoo.spinWait();
auto bar = readBar.spinWait();
auto baz = readBaz.spinWait();
The way I see it is, task pools are an advanced means that coordinate m
threads over n CPUs. If I don't care about that (as above) there should
be no need for a pool at all. (Of course it's fine if used by the
implementation.)

Also it is my understanding that
TaskPool efficiently adapts the concrete number of CPUs to an arbitrary
number of tasks. If that's the case, it's worth mentioning.

Isn't this kind of obvious from the examples, etc.?

It wasn't to me. I had an intuition that that might be the case, but
obvious - far from it. The fact that I need a task pool even when I
don't care about m:n issues made me doubtful.

* "If a Task has been submitted to a TaskPool instance, is being stored
in a stack frame, and has not yet finished, the destructor for this
struct will automatically call yieldWait() so that the task can finish
and the stack frame can be destroyed safely." At this point in the doc
the reader doesn't understand that at all because TaskPool has not been
seen yet. The reader gets worried that she'll be essentially serializing
the entire process by mistake. Either move this explanation down or
provide an example.

This is getting ridiculous. There are so many circular dependencies
between Task and TaskPool, impossible to remove here, that I'm not even
going to try. One or the other has to be introduced first, but
neither can be explained without mentioning the other. This is why I
explain the relationship briefly in the module level summary, so that
the user has at least some idea. I think this is about the best I can do.

Perhaps cross-references could help (see macro XREF). As I mentioned, an
example here could go a long way, too.

* The example that reads two files at the same time should NOT use
taskPool. It's just one task, why would the pool ever be needed? If you
also provided an example that reads n files in memory at the same time
using a pool, that would illustrate nicely why you need it. If a Task
can't be launched without being put in a pool, there should be a
possibility to do so. At my work we have a simple function called
callInNewThread that does what's needed to launch a function in a new
thread.

I guess I could add something like this to Task. Under the hood it would
(for implementation simplicity, to reuse a bunch of code from TaskPool)
fire up a new single-thread pool, submit the task, call
TaskPool.finish(), and then return. Since you're already creating a new
thread, the extra overhead of creating a new TaskPool for the thread
would be negligible and it would massively simplify the implementation.
My only concern is that, when combined with scoped versus non-scoped
tasks (which are definitely here to stay, see below) this small
convenience function would add way more API complexity than it's worth.
Besides, task pools don't need to be created explicitly anyhow. That's
what the default instantiation is for. I don't see how callInNewThread
would really solve much.

If we sit down for a second and think outside of the current design, the
simplest Task use would be as such: "I start a task and get a sort of
future. I do something else. Then when I need the result of the future I
call some method force() against it." Nowhere is the notion of a task
pool to be found. To me this looks like an artifact of the
implementation that has spilled into the API, which I understand is
natural to you. But it's unnatural to me.

Secondly, I think you're reading **WAY** too much into what was meant to
be a simple example to illustrate usage mechanics. This is another case
where I can't think of a small, cute example of where you'd really need
the pool. There are plenty of larger examples, but the smallest/most
self-contained one I can think of is a parallel sort. I decided to use
file reading because it was good enough to illustrate the mechanics of
usage, even if it didn't illustrate a particularly good use case.

It's impossible to not have a good small example. Sorting is great. You
have the partition primitive already in std.algorithm, then off you go
with tasks. Dot product on dense vectors is another good one. There's
just plenty of operations that people understand are important to make fast.
So to me it looks like this:
* there are already some good examples of "I want 1-3 tasks but don't
care about pooling them": file reading and processing etc.
* there should be 1-2 good example of partitioning a large chunk of work
into many tasks and let taskPool carry them.
With such, people would clearly understand the capabilities of the
library and what they can achieve with various levels of involvement.

* The note below that example gets me thinking: it is an artificial
limitation to force users of Task to worry about scope and such. One
should be able to create a Future object (Task I think in your
terminology), pass it around like a normal value, and ultimately force
it. This is the case for all other languages that implement futures. I
suspect the "scope" parameter associated with the delegate a couple of
definitions below plays a role here, but I think we need to work for
providing the smoothest interface possible (possibly prompting
improvements in the language along the way).

This is what TaskPool.task is for. Maybe this should be moved to the top
of the definition of TaskPool and emphasized, and the scoped/stack
allocated versions should be moved below TaskPool and de-emphasized?
At any rate, though, anyone who's playing with parallelism should be an
advanced enough programmer that concepts like scope and destructors are
second nature.

I agree. That's not a reason to not find good names for entities. To me
the relationship "global task() <=> scoped delegate; TaskPool method
task() <=> closure" is inscrutable. I see no way of remembering it
except rote memorization.

We absolutely **need** scope delegates for calling stuff where a closure
can't be allocated due to objects on the stack frame having scoped
destruction. This is not going anywhere. Also, I **much** prefer to have
everything that does conceptually the same thing have the same name. I
think this is clearer, not less clear.

I agree in principle, but this case is just not that.

* As I mentioned, in the definition:
Task!(run,TypeTuple!(F,Args)) task(F, Args...)(F fun, Args args);
I can't tell what "run" is.

run() is just the adapter function described right above this code. I
cannot for the life of me understand how this could possibly be unclear.

OK.

* "A goto from inside the parallel foreach loop to a label outside the
loop will result in undefined behavior." Would this be a bug in dmd?

No, it's because a goto of this form has no reasonable, useful
semantics. I should probably mention in the docs that the same applies
to labeled break and continue.
I have no idea what semantics these should have, and even if I did,
given the long odds that even one person would actually need them, I
think they'd be more trouble than they're worth to implement. For
example, once you break out of a parallel foreach loop to some arbitrary
address (and different threads can goto different labels, etc.), well,
it's no longer a parallel foreach loop. It's just a bunch of completely
unstructured threading doing god-knows-what.
Therefore, I slapped undefined behavior on it as a big sign that says,
"Just don't do it." This also has the advantage that, if anyone ever
thinks of any good, clearly useful semantics, these will be
implementable without breaking code later.

Yah, I was actually thinking of disabling goto outside a local delegate
everywhere.

I **STRONGLY** believe this does not belong in API documentation because
it's too machine specific, compiler specific, stack alignment specific,
etc. and almost any benchmark worth doing takes up more space than an
example should. Furthermore, anyone who wants to know this can easily
time it themselves. I have absolutely no intention of including this.
While in general I appreciate and have tried to accommodate your
suggestions, this is one I'll be standing firm on.

If scalability information were present, in however non-committal a
form, then people would be compelled ("ok, so this shape of the loop would
actually scale linearly with CPUs... neat").
Speaking of efficiency, I assume parallel foreach uses opApply with a
delegate and the inherent overhead. So I'm thinking that a practical way
to take advantage of parallel foreach would be to parallelize at some
block level and then do a regular foreach inside the block?
foreach (i, ref elem; taskPool.parallel(logs)) {
    foreach (???)
        elem = log(i + 1.0);
}
How can I arrange things such that I compute e.g. 64 logs serially
inside each pass?
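
One way to get that serial inner loop is to parallelize over block start
indices rather than over elements, so the per-iteration delegate call is
paid once per block (a sketch; the block size of 64 and the iota-based
outer range are illustrative, and parallel's optional work-unit-size
parameter attacks the same overhead at a coarser grain):

```d
import std.algorithm : min;
import std.math : log;
import std.parallelism : parallel;
import std.range : iota;

void main() {
    auto logs = new double[1_000_000];
    enum blockSize = 64;
    // Parallel over block starts; each block is filled serially, so
    // the delegate overhead is amortized over blockSize elements.
    foreach (start; parallel(iota(0, logs.length, blockSize))) {
        foreach (j; start .. min(start + blockSize, logs.length))
            logs[j] = log(j + 1.0);
    }
}
```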

* The join example is fine, but the repetitive code suggests that a loop
would be better:
import std.file;

void main() {
    auto pool = new TaskPool();
    foreach (filename; ["foo.txt", "bar.txt", "baz.txt"]) {
        pool.put(task!read(filename));
    }
    // Call join() to guarantee that all tasks are done running, the worker
    // threads have terminated and that the results of all of the tasks can
    // be accessed without any synchronization primitives.
    pool.join();
    ubyte[][] data; // void[], meh
    // Use spinWait() since the results are guaranteed to have been computed
    // and spinWait() is the cheapest of the wait functions.
    foreach (task; pool.howDoIEnumerateTasks()) {
        data ~= task.spinWait();
    }
}

You can't enumerate the tasks like this, but there's a good reason for
this limitation. The design requires the pool not hold onto any
references to the tasks after they're out of the queue, so that they can
be safely destroyed as soon as the caller wants to destroy them. If you
need to enumerate over them like this, then:
1. You're probably doing something wrong and parallel foreach would be a
better choice.
2. If you insist on doing this, you need to explicitly store them in an
array or something.
As a more general comment, I know the join() example sucks. join() is
actually a seldom-used primitive, not a frequently used one as you
suggest. Most of the time you wait for individual tasks (explicitly or
implicitly through the higher level functions), not the whole pool, to
finish executing. I feel like it's obligatory that join() and finish()
exist, but I never use them myself. They were much more useful in
pre-release versions of this lib, when parallel foreach didn't exist and
it made sense to have explicit arrays of Tasks. The difficulty I've had
in coming up with a decent example for them makes me halfway tempted to
just rip those $#)*% functions out entirely.
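
The second option, keeping explicit references, could look like this (a
sketch using the wait-function names from this thread; it assumes task()
hands back a reference-style handle, and released versions of the module
may name or shape things differently):

```d
import std.file : read;
import std.parallelism : task, TaskPool;

void main() {
    auto pool = new TaskPool();
    auto names = ["foo.txt", "bar.txt", "baz.txt"];
    // The pool drops its reference to each task once it leaves the
    // queue, so we keep our own references in an array.
    typeof(task!read(names[0]))[] tasks;
    foreach (name; names) {
        auto t = task!read(name);
        tasks ~= t;
        pool.put(t);
    }
    void[][] data;
    foreach (t; tasks)
        data ~= t.yieldWait(); // blocks until this particular task is done
    pool.finish();
}
```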

I'm confused here. I use join() pretty much all the time. I launch stuff
(e.g. run a large matrix-vector multiplication for distinct vectors) and
then I join that. Then again and again. The thread of execution has a
repeated hourglass shape - it fans out and then becomes single-threaded
at join points. I assume some of those computations are indeed the
charter of parallel foreach, but I'd guess not all. But then what do I
know - I'm the customer :o).
Andrei

Fair enough. Now that I think of it most of my frustration is that
these details are only getting looked at now, when I have a week (and an
otherwise very busy week) to fix all this stuff, when this module has
been officially in review for the past two weeks and unofficially for
several months. I would be much more open to this process if the issues
raised could be fixed at my leisure rather than on a hard and tight
deadline. This is exacerbated by the fact that I have another
important, unrelated deadline, also next Friday.
At the same time, though, I'm afraid that if we didn't fix a vote date
and put some urgency into things, the pace of the reviews would be
glacial at best, like it was for the first two weeks of official review
and the months of unofficial review.
How about we make next Friday a soft deadline/start of voting? It can
be extended as necessary by mutual agreement of me and Lars (the review
manager), and likely will be if the review process is still yielding
good suggestions and/or I haven't had time to implement/clean up some
key things. Having a deadline breathing down your neck like this is
really not conducive to being open to suggestions and thoughtful
consideration, especially for issues that seem like fairly minor details.
Also adding to the deadline pressure, Michel Fortin just caused
me to rethink the issue of exception handling in parallel foreach. I
had more-or-less working code for this, but I realized it's severely
broken in subtle ways that I've (knock on wood) never actually run into
in real world code. It's gonna take some time to fix. These kinds of
issues with error handling code can very easily slip under the radar in
a library with only a few users, and unfortunately, one has. I
definitely know how to fix it, but I have to implement the fix and it's
somewhat non-trivial, i.e. I'm debugging it now and it's looking like a
marathon debugging session.

Exactly. I understand. Well here's a suggestion. How about we "git
stash" this review? I recall another submission was vying for the queue
so we can proceed with that while you have time to operate the changes
you want. If not, we can simply wait 2-3 weeks and then have you
resubmit for a shorter process (1-2 weeks) that would gather more
feedback (hopefully minor that time) and votes.
Lars?

Fixing these sooner rather than later would be great, so all the better
for suspending the review.
Andrei

Furthermore, you should expect that the review process will prompt
changes. My perception is that you consider the submission more or less
final modulo possibly a few minor nits. You shouldn't. I'm convinced you
know much more about SMP than most or all others in this group, but in
no way that means your design has reached perfection and is beyond
improvement even from a non-expert.

In addition to the deadline issues already mentioned and resolved, I
did misunderstand the review process somewhat. I didn't participate in
the reviews for std.datetime (because I know nothing about what makes a
good date/time lib) or for std.unittest (because I was fairly ambivalent
about it), so I didn't learn anything from them. I was under the
impression that the module is **expected** to be very close to its final
form and that, if a lot of issues are found, then that basically means
the proposal is going to be rejected.

Both std.datetime and std.unittests underwent a fair number of changes over the
course of the review process. A lot of the stuff stayed the same, but a lot of it
changed too. On the whole, though, the end results were much better for it.
- Jonathan M Davis

Furthermore, you should expect that the review process will prompt
changes. My perception is that you consider the submission more or less
final modulo possibly a few minor nits. You shouldn't. I'm convinced you
know much more about SMP than most or all others in this group, but in
no way that means your design has reached perfection and is beyond
improvement even from a non-expert.

In addition the the deadline issues already mentioned and resolved, I
did misunderstand the review process somewhat. I didn't participate in
the reviews for std.datetime (because I know nothing about what makes a
good date/time lib) or for std.unittest (because I was fairly ambivalent
about it), so I didn't learn anything from them. I was under the
impression that the module is **expected** to be very close to its final
form and that, if a lot of issues are found, then that basically means
the proposal is going to be rejected.

Both std.datetime and std.unittests underwent a fair number of changes over the
course the review process. A lot of the stuff stayed the same, but a lot of it
changed too. On the whole, though, the end results were much better for it.
- Jonathan M Davis

* Still no synopsis example that illustrates in a catchy way the most
attractive artifacts.

I don't see what I could put here that isn't totally redundant with the
rest of the documentation. Anything I could think of would basically
just involve concatenating all the examples. Furthermore, none of the
other Phobos modules have this, so I don't know what one should look
like.

I'm thinking along the lines of:
http://www.digitalmars.com/d/2.0/phobos/std_exception.html
A nice synopsis would be the pi computation. Just move that up to the
synopsis. It's simple, clean, and easy to relate to. Generally, you'd
put here not all details but the stuff you think would make it easiest
for people to get into your library.

Good example, will do.
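For concreteness, a sketch of the kind of synopsis being discussed, built around the pi computation (assuming taskPool.reduce and std.algorithm.map behave as documented; the 4/(1 + x^2) integrand is the standard midpoint-rule formulation of this example):

```d
import std.algorithm, std.math, std.parallelism, std.range, std.stdio;

// Approximate pi by integrating 4 / (1 + x^2) over [0, 1] with the
// midpoint rule, summing the n terms across all cores.
real computePi(int n) {
    immutable delta = 1.0L / n;

    // Each term evaluates the integrand at the midpoint of its slice.
    real getTerm(int i) {
        immutable x = (i - 0.5L) * delta;
        return delta / (1.0L + x * x);
    }

    return 4.0L * taskPool.reduce!"a + b"(map!getTerm(iota(1, n + 1)));
}

void main() {
    writefln("pi = %.10f", computePi(1_000_000));
}
```

It fits in a screenful, needs no setup, and shows the flagship capability (a parallel reduction over a lazy range) in one call.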

* "After creation, Task objects are submitted to a TaskPool for
execution." I understand it's possible to use Task straight as a
promise/future, so s/are/may be/.

No. The only way Task is useful is by submitting it to a pool to be
executed. (Though this may change, see below.)

I very much hope this does change. Otherwise the role of Task in the
design could be drastically reduced (e.g. nested type inside of
TaskPool) without prejudice. At the minimum I want to be able to create
a task, launch it, and check its result later without involving a pool.
A pool is when I have many tasks that may exceed the number of CPUs etc.
Simplicity would be great.
    // start three reads
    auto readFoo = task!readText("foo.txt");
    auto readBar = task!readText("bar.txt");
    auto readBaz = task!readText("baz.txt");
    // join'em all
    auto foo = readFoo.yieldWait();
    auto bar = readBar.yieldWait();
    auto baz = readBaz.yieldWait();

This is definitely feasible in principle. I'd like to implement it, but
there's a few annoying, hairy details standing in the way. For reasons
I detailed previously, we need both scoped and non-scoped tasks. We
also have alias vs. callable (i.e. function pointer or delegate) tasks.
Now we're adding pool vs. new-thread tasks. This is turning into a
combinatorial explosion and needs to be simplified somehow. I propose
the following:
1. I've reconsidered and actually like the idea of task() vs.
scopedTask(). task() returns a pointer on the heap. scopedTask()
returns a struct on the stack. Neither would be a member function of
TaskPool.
2. Non-scoped Task pointers would need to be explicitly submitted to
the task pool via the put() method. This means getting rid of
TaskPool.task().
3. The Task struct would grow a function runInNewThread() or something
similar. (If you think this would be a common case, even just execute()
might cut it.)
The work flow would now be that you call task() to get a heap-allocated
Task*, or scopedTask to get a stack-allocated Task. You then call
either TaskPool.put() to execute it on a pool or Task.runInNewThread()
to run it in a new thread. The creation of the Task is completely
orthogonal to how it's run.
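To make the proposed workflow concrete, here is a hedged sketch of how usage might look. Note that this is a sketch of the proposal, not existing API: runInNewThread and a put() overload taking a Task* are only being proposed in this thread, and yieldWait is the name used in this discussion.

```d
import std.file : readText;
import std.parallelism;

void example() {
    // Heap-allocated Task* from task(): can outlive the current scope,
    // so it is safe to hand off to the pool and collect later.
    auto readFoo = task!readText("foo.txt");
    taskPool.put(readFoo);

    // Stack-allocated Task from scopedTask(): cheaper, but must not
    // outlive this function.
    auto readBar = scopedTask!readText("bar.txt");
    taskPool.put(readBar);

    // Proposed: skip the pool entirely and run in a dedicated thread,
    // e.g. for I/O-bound work that shouldn't tie up a pool worker.
    auto readBaz = task!readText("baz.txt");
    readBaz.runInNewThread(); // proposed name, see point 3 above

    // Creation is orthogonal to execution; joining is uniform.
    auto foo = readFoo.yieldWait();
    auto bar = readBar.yieldWait();
    auto baz = readBaz.yieldWait();
}
```

The point of the design is visible in the sketch: the same Task, created the same way, can be run on a pool or on a fresh thread without changing anything else.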

There's no need at this level for a task pool. What would be nice would
be to have a join() that joins all tasks spawned by the current thread:
    // start three reads
    auto readFoo = task!readText("foo.txt");
    auto readBar = task!readText("bar.txt");
    auto readBaz = task!readText("baz.txt");
    // join'em all
    join();
    // fetch results
    auto foo = readFoo.spinWait();
    auto bar = readBar.spinWait();
    auto baz = readBaz.spinWait();

I don't understand how this would be a substantial improvement over the
first example, where you just call yieldWait() on all three.
Furthermore, implementing join() as shown in this example would require
some kind of central registry of all tasks/worker threads/task
pools/something similar, which would be a huge PITA to implement
efficiently.

Secondly, I think you're reading **WAY** too much into what was meant to
be a simple example to illustrate usage mechanics. This is another case
where I can't think of a small, cute example of where you'd really need
the pool. There are plenty of larger examples, but the smallest/most
self-contained one I can think of is a parallel sort. I decided to use
file reading because it was good enough to illustrate the mechanics of
usage, even if it didn't illustrate a particularly good use case.

It's impossible not to have a good small example. Sorting is great. You
have the partition primitive already in std.algorithm, then off you go
with tasks. Dot product on dense vectors is another good one. There's
just plenty of operations that people understand are important to make
fast.

I forgot about std.algorithm.partition. This makes a parallel quick
sort so trivial to implement (ignoring the issue of selecting a good
pivot, which I think can be safely ignored in example code) that it
might actually make a good example.
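For what it's worth, a hedged sketch of such an example, with the naive first-element pivot as discussed (task, put, and yieldWait per the names used in this thread):

```d
import std.algorithm, std.parallelism;

// Parallel quicksort: partition serially with std.algorithm.partition,
// then sort the two halves concurrently, handing one to the task pool.
void parallelSort(T)(T[] data, size_t grainSize = 4096) {
    // Below the grain size, fork/join overhead dominates: sort serially.
    if (data.length <= grainSize) {
        sort(data);
        return;
    }

    // Naive pivot: the first element (fine for example code).
    auto pivot = data[0];

    // Move everything smaller than the pivot to the front of the rest,
    // then drop the pivot into its final position. Partitioning only
    // data[1 .. $] guarantees the recursion always shrinks.
    auto greaterEq = partition!(a => a < pivot)(data[1 .. $]);
    immutable mid = data.length - greaterEq.length - 1;
    swap(data[0], data[mid]);

    // Sort the left half in another task while recursing on the right.
    auto leftTask = task!(parallelSort!T)(data[0 .. mid], grainSize);
    taskPool.put(leftTask);
    parallelSort(data[mid + 1 .. $], grainSize);
    leftTask.yieldWait();
}
```

The grain size cutoff is the only tuning knob; everything else is just the serial algorithm with one recursive call moved onto the pool.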

* "A goto from inside the parallel foreach loop to a label outside the
loop will result in undefined behavior." Would this be a bug in dmd?

No, it's because a goto of this form has no reasonable, useful
semantics. I should probably mention in the docs that the same applies
to labeled break and continue.
I have no idea what semantics these should have, and even if I did,
given the long odds that even one person would actually need them, I
think they'd be more trouble than they're worth to implement. For
example, once you break out of a parallel foreach loop to some arbitrary
address (and different threads can goto different labels, etc.), well,
it's no longer a parallel foreach loop. It's just a bunch of completely
unstructured threading doing god-knows-what.
Therefore, I slapped undefined behavior on it as a big sign that says,
"Just don't do it." This also has the advantage that, if anyone ever
thinks of any good, clearly useful semantics, these will be
implementable without breaking code later.

Yah, I was actually thinking of disabling goto outside a local delegate
everywhere.

See the discussion I'm having with Michael Fortin. My latest idea is
that break, labeled break/continue, return, and goto should all throw
exceptions when found inside a parallel foreach loop. They affect the
flow of subsequent iterations, and "subsequent iterations" only makes
sense when the loop is being executed in serial.

I **STRONGLY** believe this does not belong in API documentation because
it's too machine specific, compiler specific, stack alignment specific,
etc. and almost any benchmark worth doing takes up more space than an
example should. Furthermore, anyone who wants to know this can easily
time it themselves. I have absolutely no intention of including this.
While in general I appreciate and have tried to accommodate your
suggestions, this is one I'll be standing firm on.

If scalability information is present in however a non-committal form,
then people would be compelled ("ok, so this shape of the loop would
actually scale linearly with CPUs... neat").

Ok, I thought you were asking for something much more rigorous than
this. I therefore didn't want to provide it because I figured that, no
matter what I did, someone would be able to say that the benchmark is
flawed some how, yada, yada, yada. Given how inexact a science
benchmarking is, I'm still hesitant to put such results in API docs, but
I can see where you're coming from here.

Speaking of efficiency, I assume parallel foreach uses opApply with a
delegate and the inherent overhead. So I'm thinking that a practical way
to take advantage of parallel foreach would be to parallelize at some
block level and then do a regular foreach inside the block?
    foreach (i, ref elem; taskPool.parallel(logs)) {
        foreach (???)
            elem = log(i + 1.0);
    }
How can I arrange things such that I compute e.g. 64 logs serially
inside each pass?

Three things here:
1. For this kind of nano-parallelism, map() might be better suited. To
fill an existing array, just use map's explicit buffer feature.
    taskPool.map!log(iota(1, logs.length + 1), logs);
The option of using map() for nano-parallelism is part of my rationale
for keeping the pretty but mildly inefficient delegate call in parallel
foreach.
2. You're severely overestimating the overhead of the delegate call.
log() is a pretty cheap function and even so speedups are decent with
parallel foreach compared to a regular foreach loop.
    import std.stdio, std.parallelism, std.datetime, std.math;

    void main() {
        auto logs = new float[10_000_000];
        taskPool(); // Initialize TaskPool before timing.

        auto sw = StopWatch(AutoStart.yes);
        foreach(i, ref elem; logs) {
            elem = log(i + 1);
        }
        writeln("Serial: ", sw.peek.msecs);

        sw.reset();
        foreach(i, ref elem; parallel(logs)) {
            elem = log(i + 1);
        }
        writeln("Parallel Foreach: ", sw.peek.msecs);
    }
Results:
Serial: 619
Parallel Foreach: 388
I'd include parallel map, too, but for some reason (probably stack
alignment or something) including it changes the results for parallel
foreach.
3. If you really want to do as you suggest, just make a chunks range or
something (maybe this should be in std.range) that iterates over all
non-overlapping size N slices of a sliceable range. Use this for the
parallel loop, then loop over the individual elements of the slice inside.
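That suggestion can be sketched as follows (hedged: a chunks-style range over non-overlapping slices is assumed here; std.range did later gain exactly such a chunks primitive):

```d
import std.math, std.parallelism, std.range;

// Compute logs block-wise: parallelize over 64-element slices and run
// a plain serial foreach inside each slice, so the per-iteration
// delegate call is amortized over 64 elements instead of paid once
// per element.
double[] computeLogs(size_t n, size_t blockSize = 64) {
    auto logs = new double[n];

    foreach (blockIdx, block; taskPool.parallel(chunks(logs, blockSize))) {
        foreach (j, ref elem; block) {
            // Recover the global index from the block index.
            elem = log(blockIdx * blockSize + j + 1.0);
        }
    }
    return logs;
}
```

Each work unit now hands a whole slice to the loop body, which is exactly the block-level parallelization asked about above.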

I'm confused here. I use join() pretty much all the time. I launch stuff
(e.g. run a large matrix-vector multiplication for distinct vectors) and
then I join that. Then again and again. The thread of execution has a
repeated hourglass shape - it fans out and then becomes single-threaded
at join points. I assume some of those computations are indeed the
charter of parallel foreach, but I'd guess not all. But then what do I
know - I'm the customer :o).

Yes, parallel foreach is idiomatic here, or maybe parallel map.
In the pre-release versions of std.parallelism (my experimentation with
parallelism libraries in D goes back to mid-2008, over a year before I
released the first version as Parallelfuture), only the task primitive
existed. I discovered that, in practice, creating Task objects in a
loop is almost always a PITA. It can also be inefficient in that, in a
naive implementation, all of these Task objects exist in memory
simultaneously even when this is completely unnecessary. I decided to
create higher level
primitives to handle all the use cases I could think of where you might
otherwise want to create Task objects in a loop. If you're explicitly
creating Task objects in a loop in std.parallelism, I can just about
guarantee that there's an easier and at least equally efficient way to
accomplish what you want to accomplish. If there are any important use
cases I missed, I'd be much more interested in creating a few more
high-level primitives rather than making it easier to work with Task
objects created in a loop.

I tried putting a few benchmarks in for the performance of parallel
foreach, map and reduce. Basically, I just put the timings in for the
examples on my dual core machine, and for an equivalent serial version
that is not shown in the interest of brevity but is easily inferred. I
didn't do this for the pipelining or future/promise primitives because
these aren't as capable as map, reduce and foreach of automatically
taking full advantage of as many cores as you can throw at them. This
means that coming up with a benchmark that shows where a good speedup
can be obtained is a much more difficult art.

* "workUnitSize: The number of elements to evaluate in a single Task.
Must be less than or equal to bufSize, and in practice should be a
fraction of bufSize such that all worker threads can be used." Then why
not specify a different parameter such as a multiplier or something? The
dependence between bufSize and workUnitSize is a sign that these two
should be improved. If you have good reasons that the user must have the
parameters in this form, give an example substantiating that.

I did this for consistency with all the other functions, where work unit
size is specified directly. I don't want lazyMap to be the oddball
function where it's specified in a completely different way.

I know you'd have no problem finding the right voice in this
discussion if you only frame it in the right light. Again, people are
trying to help (however awkwardly) and in no way is that ridiculous.

Fair enough. Now that I think of it most of my frustration is that
these details are only getting looked at now, when I have a week (and
an otherwise very busy week) to fix all this stuff, when this module
has been officially in review for the past two weeks and unofficially
for several months. I would be much more open to this process if the
issues raised could be fixed at my leisure rather than on a hard and
tight deadline. This is exacerbated by the fact that I have another
important, unrelated deadline, also next Friday.
At the same time, though, I'm afraid that if we didn't fix a vote date
and put some urgency into things, the pace of the reviews would be
glacial at best, like it was for the first two weeks of official review
and the months of unofficial review.

Exactly. I understand. Well here's a suggestion. How about we "git
stash" this review? I recall another submission was vying for the queue
so we can proceed with that while you have time to operate the changes
you want. If not, we can simply wait 2-3 weeks and then have you
resubmit for a shorter process (1-2 weeks) that would gather more
feedback (hopefully minor that time) and votes.
Lars?

* "Most of this module completely subverts..." Vague characterizations
("most", "completely", "some") don't belong in technical
documentation. (For example, there's either subversion going on or there
isn't.) Also, std.concurrency and std.parallelism address different
needs so there's little competition between them. Better: "Unless
explicitly marked as $(D trusted) or $(D safe), artifacts in this
module are not provably memory-safe and cannot be used with SafeD. If
used as documented, memory safety is guaranteed."

Actually, I think this is a bad description of what it subverts. What it
subverts isn't the memory-safety that SafeD provides, but the safety
against low-level races that even unsafe D protects against unless you
cast shared away. For instance:
    import std.parallelism, std.stdio;

    void main() {
        int sum = 0;
        foreach (int value; taskPool.parallel([0, 2, 3, 6, 1, 4, 6, 3, 3, 3, 6])) {
            sum += value;
        }
        writeln(sum);
    }
The "+=" would need to be an atomic operation to avoid low-level races.
I think that ParallelForeach's opApply should only accept a shared
delegate. I define shared delegate as a delegate that does not reference
any non-shared variables of its outer scope. The problem is that DMD
currently doesn't know how to determine whether a delegate literal is
shared or not, thus a delegate literal is never shared and if
ParallelForeach's opApply asked a shared delegate as it should it would
just not work. Fix DMD to create shared delegate literals where
appropriate and everything can be guarantied race-free.

Technically, you'd only need a shared delegate or a const delegate to
guarantee race safety for parallel foreach.
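To illustrate both ways of fixing the racy example above, a sketch (assuming core.atomic.atomicOp and taskPool.reduce as documented):

```d
import core.atomic, std.parallelism;

// Race-free version of the manual loop: atomicOp serializes each
// read-modify-write on the shared accumulator, at the cost of
// contention on sum.
int atomicSum(int[] data) {
    shared int sum = 0;
    foreach (value; taskPool.parallel(data)) {
        atomicOp!"+="(sum, value);
    }
    return sum;
}

// The idiomatic alternative: reduce keeps per-thread partial sums and
// combines them only at the end, so no shared writes happen at all.
int reducedSum(int[] data) {
    return taskPool.reduce!"a + b"(data);
}
```

The reduce form is both safer and faster here, which is part of why the high-level primitives, rather than raw loop bodies, carry the safety story.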