
std.parallelism

std.parallelism implements high-level primitives for SMP parallelism.
These include parallel foreach, parallel reduce, parallel eager map, pipelining
and future/promise parallelism. std.parallelism is recommended when the
same operation is to be executed in parallel on different data, or when a
function is to be executed in a background thread and its result returned to a
well-defined main thread. For communication between arbitrary threads, see
std.concurrency.

std.parallelism is based on the concept of a Task. A Task is an
object that represents the fundamental unit of work in this library and may be
executed in parallel with any other Task. Using Task
directly allows programming with a future/promise paradigm. All other
supported parallelism paradigms (parallel foreach, map, reduce, pipelining)
represent an additional level of abstraction over Task. They
automatically create one or more Task objects, or closely related types
that are conceptually identical but not part of the public API.

After creation, a Task may be executed in a new thread, or submitted
to a TaskPool for execution. A TaskPool encapsulates a task queue
and its worker threads. Its purpose is to efficiently map a large
number of Tasks onto a smaller number of threads. A task queue is a
FIFO queue of Task objects that have been submitted to the
TaskPool and are awaiting execution. A worker thread is a thread that
is associated with exactly one task queue. It executes the Task at the
front of its queue when the queue has work available, or sleeps when
no work is available. Each task queue is associated with zero or
more worker threads. If the result of a Task is needed before execution
by a worker thread has begun, the Task can be removed from the task queue
and executed immediately in the thread where the result is needed.

Warning:
Unless marked as @trusted or @safe, artifacts in
this module allow implicit data sharing between threads and cannot
guarantee that client code is free from low level data races.

Task represents the fundamental unit of work. A Task may be
executed in parallel with any other Task. Using this struct directly
allows future/promise parallelism. In this paradigm, a function (or delegate
or other callable) is executed in a thread other than the one it was called
from. The calling thread does not block while the function is being executed.
A call to workForce, yieldForce, or spinForce is used to
ensure that the Task has finished executing and to obtain the return
value, if any. These functions and done also act as full memory barriers,
meaning that any memory writes made in the thread that executed the Task
are guaranteed to be visible in the calling thread after one of these functions
returns.

Function results are returned from yieldForce, spinForce and
workForce by ref. If fun returns by ref, the reference will point
to the returned reference of fun. Otherwise it will point to a
field in this struct.

Copying of this struct is disabled, since it would provide no useful semantics.
If you want to pass this struct around, you should do so by reference or
pointer.

Bugs:

Changes to ref and out arguments are not propagated to the
call site, only to args in this struct.

alias args = _args[1 .. $];

The arguments the function was called with. Changes to out and
ref arguments will be visible here.

alias ReturnType = typeof(fun(_args));

The return type of the function called by this Task. This can be
void.

@property ref @trusted ReturnType spinForce();

If the Task isn't started yet, execute it in the current thread.
If it's done, return its return value, if any. If it's in progress,
busy spin until it's done, then return the return value. If it threw
an exception, rethrow that exception.

This function should be used when you expect the result of the
Task to be available on a timescale shorter than that of an OS
context switch.

@property ref @trusted ReturnType yieldForce();

If the Task isn't started yet, execute it in the current thread.
If it's done, return its return value, if any. If it's in progress,
wait on a condition variable. If it threw an exception, rethrow that
exception.

This function should be used for expensive functions, as waiting on a
condition variable introduces latency, but avoids wasted CPU cycles.

@property ref @trusted ReturnType workForce();

If this Task was not started yet, execute it in the current
thread. If it is finished, return its result. If it is in progress,
execute any other Task from the TaskPool instance that
this Task was submitted to until this one
is finished. If it threw an exception, rethrow that exception.
If no other tasks are available or this Task was executed using
executeInNewThread, wait on a condition variable.
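
As a minimal sketch of the future/promise pattern described above, the following submits a Task to the global pool and later forces its result. The function sumTo is a hypothetical workload invented for this illustration; yieldForce could be swapped for spinForce or workForce depending on how long the wait is expected to be.

```d
import std.parallelism : task, taskPool;

// Hypothetical workload used only for illustration.
int sumTo(int n)
{
    int total = 0;
    foreach (i; 1 .. n + 1)
        total += i;
    return total;
}

void main()
{
    // Create a Task on the GC heap and submit it to the global pool.
    auto t = task!sumTo(100);
    taskPool.put(t);

    // Unrelated work could be done in the calling thread here.

    // Wait for the result, or run the Task in this thread if no
    // worker has started it yet, then read the return value.
    immutable result = t.yieldForce;
    assert(result == 5050);

    // Once a force property has returned, the Task reports completion.
    assert(t.done);
}
```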

Create a new thread for executing this Task, execute it in the
newly created thread, then terminate the thread. This can be used for
future/promise parallelism. An explicit priority may be given
to the Task. If one is provided, its value is forwarded to
core.thread.Thread.priority. See std.parallelism.task for
usage example.

Creates a Task on the GC heap that calls a function pointer, delegate, or
class/struct with overloaded opCall.

Examples:

// Read two files at the same time, using a function
// pointer instead of an alias to represent std.file.read.
import std.file;
import std.parallelism;

void main()
{
    // Create a Task for reading foo.txt and
    // execute it in a new thread.
    auto file1Task = task(&read, "foo.txt");
    file1Task.executeInNewThread();

    // Read bar.txt in parallel.
    auto file2Data = read("bar.txt");

    // Get the results of reading foo.txt.
    auto file1Data = file1Task.yieldForce;
}

Notes:
This function takes a non-scope delegate, meaning it can be
used with closures. If you can't allocate a closure due to objects
on the stack that have scoped destruction, see scopedTask, which
takes a scope delegate.

Version of task usable from @safe code. Usage mechanics are
identical to the non-@safe case, but safety introduces some restrictions:

1. fun must be @safe or @trusted.

2. F must not have any unshared aliasing as defined by
std.traits.hasUnsharedAliasing. This means it
may not be an unshared delegate or a non-shared class or struct
with overloaded opCall. This also precludes accepting template
alias parameters.

3. Args must not have unshared aliasing.

4. fun must not return by reference.

5. The return type must not have unshared aliasing unless fun is
pure or the Task is executed via executeInNewThread instead
of using a TaskPool.

These functions allow the creation of Task objects on the stack rather
than the GC heap. The lifetime of a Task created by scopedTask
cannot exceed the lifetime of the scope it was created in.

scopedTask might be preferred over task:

1. When a Task that calls a delegate is being created and a closure
cannot be allocated due to objects on the stack that have scoped
destruction. The delegate overload of scopedTask takes a scope
delegate.

2. As a micro-optimization, to avoid the heap allocation associated with
task or with the creation of a closure.

Usage is otherwise identical to task.

Notes:
Task objects created using scopedTask will automatically
call Task.yieldForce in their destructor if necessary to ensure
the Task is complete before the stack frame they reside on is destroyed.
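
A minimal sketch of scopedTask, assuming the Task is submitted to the global pool by reference. The helper countEven is hypothetical and exists only for this illustration.

```d
import std.parallelism : scopedTask, taskPool;

// Hypothetical helper: counts the even values in a slice.
size_t countEven(const(int)[] values)
{
    size_t n = 0;
    foreach (v; values)
        if (v % 2 == 0)
            ++n;
    return n;
}

void main()
{
    int[] data = [1, 2, 3, 4, 5, 6];

    // The Task lives on this stack frame, so no GC allocation is made for it.
    auto t = scopedTask!countEven(data);
    taskPool.put(t);

    assert(t.yieldForce == 3);
    // Had yieldForce not been called, the destructor would have waited
    // for the Task before this stack frame was destroyed.
}
```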

immutable uint totalCPUs;

The total number of CPU cores available on the current machine, as reported by
the operating system.

class TaskPool;

This class encapsulates a task queue and a set of worker threads. Its purpose
is to efficiently map a large number of Tasks onto a smaller number of
threads. A task queue is a FIFO queue of Task objects that have been
submitted to the TaskPool and are awaiting execution. A worker thread is a
thread that executes the Task at the front of the queue when one is
available and sleeps when the queue is empty.

This class should usually be used via the global instantiation
available via the std.parallelism.taskPool property.
Occasionally it is useful to explicitly instantiate a TaskPool:

1. When you want TaskPool instances with multiple priorities, for example
a low priority pool and a high priority pool.

2. When the threads in the global task pool are waiting on a synchronization
primitive (for example a mutex), and you want to parallelize the code that
needs to run before these threads can be resumed.

@trusted this();

Default constructor that initializes a TaskPool with
totalCPUs - 1 worker threads. The minus 1 is included because the
main thread will also be available to do work.

Implements a parallel foreach loop over a range. This works by implicitly
creating and submitting one Task to the TaskPool for each worker
thread. A work unit is a set of consecutive elements of range to
be processed by a worker thread between communication with any other
thread. The number of elements processed per work unit is controlled by the
workUnitSize parameter. Smaller work units provide better load
balancing, but larger work units avoid the overhead of communicating
with other threads frequently to fetch the next work unit. Large work
units also avoid false sharing in cases where the range is being modified.
The less time a single iteration of the loop takes, the larger
workUnitSize should be. For very expensive loop bodies,
workUnitSize should be 1. An overload that chooses a default work
unit size is also available.

In the case of non-random access ranges, parallel foreach buffers lazily
to an array of size workUnitSize before executing the parallel portion
of the loop. The exception is that, if a parallel foreach is executed
over a range returned by asyncBuf or map, the copying is elided
and the buffers are simply swapped. In this case workUnitSize is
ignored and the work unit size is set to the buffer size of range.

A memory barrier is guaranteed to be executed on exit from the loop,
so that results produced by all threads are visible in the calling thread.
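
The following sketch shows a parallel foreach with an explicit work unit size. The function fillSqrt and the work unit size of 500 are arbitrary choices for illustration; the loop body is cheap, so a fairly large work unit is used.

```d
import std.math : sqrt;
import std.parallelism : taskPool;

// Fills output[i] with sqrt(i) using a parallel foreach.
void fillSqrt(double[] output, size_t workUnitSize)
{
    // Each iteration writes only its own element, so no
    // synchronization between iterations is required.
    foreach (i, ref elem; taskPool.parallel(output, workUnitSize))
        elem = sqrt(cast(double) i);
}

void main()
{
    auto roots = new double[](10_000);
    fillSqrt(roots, 500);

    // The loop has finished and its writes are visible here.
    assert(roots[0] == 0.0);
    assert(roots[100] == 10.0);
}
```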

Exception Handling:

When at least one exception is thrown from inside a parallel foreach loop,
the submission of additional Task objects is terminated as soon as
possible, in a non-deterministic manner. All executing or
enqueued work units are allowed to complete. Then, all exceptions that
were thrown by any work unit are chained using Throwable.next and
rethrown. The order of the exception chaining is non-deterministic.
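
A small sketch of this behavior, assuming exactly one iteration throws so that the rethrown exception is known. The function runFailingLoop and the failing index 42 are hypothetical.

```d
import std.parallelism : parallel;
import std.range : iota;

// Returns the message of the exception that escapes the parallel
// loop, or null if none escapes.
string runFailingLoop()
{
    try
    {
        foreach (i; parallel(iota(100), 10))
        {
            if (i == 42)
                throw new Exception("bad element");
        }
    }
    catch (Exception e)
    {
        // Further exceptions, if any, would be reachable through e.next.
        return e.msg;
    }
    return null;
}

void main()
{
    assert(runFailingLoop() == "bad element");
}
```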

template amap(functions...)

auto amap(Args...)(Args args) if (isRandomAccessRange!(Args[0]));

Eager parallel map. The eagerness of this function means it has less
overhead than the lazily evaluated TaskPool.map and should be
preferred where the memory requirements of eagerness are acceptable.
functions are the functions to be evaluated, passed as template alias
parameters in a style similar to std.algorithm.map. The first
argument must be a random access range. For performance reasons, amap
will assume the range elements have not yet been initialized. Elements will
be overwritten without calling a destructor or performing an assignment.
As such, the range must not contain meaningful data: its elements must be
either uninitialized or in their .init state.

Immediately after the range argument, an optional work unit size argument
may be provided. Work units as used by amap are identical to those
defined for parallel foreach. If no work unit size is provided, the
default work unit size is used.

An output range for returning the results may be provided as the last
argument. If one is not provided, an array of the proper type will be
allocated on the garbage collected heap. If one is provided, it must be a
random access range with assignable elements, must have reference
semantics with respect to assignment to its elements, and must have the
same length as the input range. Writing to adjacent elements from
different threads must be safe.

Note:
A memory barrier is guaranteed to be executed after all results are written
but before returning so that results produced by all threads are visible
in the calling thread.

Tips:
To perform the mapping operation in place, provide the same range for the
input and output range.

To parallelize the copying of a range with expensive-to-evaluate elements
to an array, pass an identity function (a function that just returns
whatever argument is provided to it) to amap.
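
As a sketch of the argument order described above (range, optional work unit size, optional output buffer), the following calls amap twice. The function square and the work unit size of 2 are arbitrary choices for illustration.

```d
import std.parallelism : taskPool;

// Hypothetical element-wise function used for illustration.
int square(int x)
{
    return x * x;
}

void main()
{
    int[] input = [1, 2, 3, 4, 5];

    // Let amap allocate the result array on the GC heap.
    int[] squares = taskPool.amap!square(input);
    assert(squares == [1, 4, 9, 16, 25]);

    // Supply a work unit size and a preallocated output buffer
    // with the same length as the input.
    auto buffer = new int[](input.length);
    taskPool.amap!square(input, 2, buffer);
    assert(buffer == [1, 4, 9, 16, 25]);
}
```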

Exception Handling:

When at least one exception is thrown from inside the map functions,
the submission of additional Task objects is terminated as soon as
possible, in a non-deterministic manner. All currently executing or
enqueued work units are allowed to complete. Then, all exceptions that
were thrown from any work unit are chained using Throwable.next and
rethrown. The order of the exception chaining is non-deterministic.

A semi-lazy parallel map that can be used for pipelining. The map
functions are evaluated for the first bufSize elements and stored in a
buffer and made available to popFront. Meanwhile, in the
background a second buffer of the same size is filled. When the first
buffer is exhausted, it is swapped with the second buffer and filled while
the values from what was originally the second buffer are read. This
implementation allows for elements to be written to the buffer without
the need for atomic operations or synchronization for each write, and
enables the mapping function to be evaluated efficiently in parallel.

map has more overhead than the simpler procedure used by amap
but avoids the need to keep all results in memory simultaneously and works
with non-random access ranges.

Parameters:

S source

The input range to be mapped. If source is not random
access it will be lazily buffered to an array of size bufSize before
the map function is evaluated. (For an exception to this rule, see Notes.)

size_t bufSize

The size of the buffer to store the evaluated elements.

size_t workUnitSize

The number of elements to evaluate in a single
Task. Must be less than or equal to bufSize, and
should be a fraction of bufSize such that all worker threads can be
used. If the default of size_t.max is used, workUnitSize will be set to
the pool-wide default.

Returns:

An input range representing the results of the map. This range
has a length iff source has a length.

Notes:
If a range returned by map or asyncBuf is used as an input to
map, then as an optimization the copying from the output buffer
of the first range to the input buffer of the second range is elided, even
though the ranges returned by map and asyncBuf are non-random
access ranges. This means that the bufSize parameter passed to the
current call to map will be ignored and the size of the buffer
will be the buffer size of source.

Any exceptions thrown while iterating over source
or computing the map function are re-thrown on a call to popFront or,
if thrown during construction, are simply allowed to propagate to the
caller. In the case of exceptions thrown while computing the map function,
the exceptions are chained as in TaskPool.amap.
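
A minimal pipelining sketch, assuming two hypothetical stages triple and addOne. The buffer size of 64 and work unit size of 16 are arbitrary; the second call relies on the optimization described in the Notes, so it passes no buffer size of its own.

```d
import std.parallelism : taskPool;
import std.range : iota;

// Hypothetical pipeline stages used only for illustration.
int triple(int x) { return 3 * x; }
int addOne(int x) { return x + 1; }

// Sums addOne(triple(i)) for i in [0, n) through a two-stage pipeline.
long pipelineSum(int n)
{
    // Stage 1: buffers of 64 results, evaluated in work units of 16.
    auto stage1 = taskPool.map!triple(iota(n), 64, 16);

    // Stage 2 consumes stage 1 directly; its buffer size follows stage 1.
    auto stage2 = taskPool.map!addOne(stage1);

    long total = 0;
    foreach (v; stage2)
        total += v;
    return total;
}

void main()
{
    assert(pipelineSum(1000) == 1_499_500);
}
```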

auto asyncBuf(S)(S source, size_t bufSize = 100) if (isInputRange!S);

Given a source range that is expensive to iterate over, returns an
input range that asynchronously buffers the contents of
source into a buffer of bufSize elements in a worker thread,
while making previously buffered elements from a second buffer, also of size
bufSize, available via the range interface of the returned
object. The returned range has a length iff hasLength!S.
asyncBuf is useful, for example, when performing expensive operations
on the elements of ranges that represent data on a disk or network.
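
The sketch below buffers a lazily evaluated range in a worker thread while the calling thread consumes it. The function slowSquare stands in for an expensive source and is hypothetical, as is the buffer size of 32.

```d
import std.algorithm.iteration : map;
import std.parallelism : taskPool;
import std.range : iota;

// Stand-in for an element that is expensive to produce.
int slowSquare(int x)
{
    return x * x;
}

// Sums the squares of [0, n) while a worker thread fills the next buffer.
long sumBuffered(int n)
{
    // std.algorithm's lazy map: each element is computed only when iterated.
    auto source = iota(n).map!slowSquare;

    // Iteration of source now happens in a worker thread, 32 elements at a time.
    auto buffered = taskPool.asyncBuf(source, 32);

    long total = 0;
    foreach (v; buffered)
        total += v;
    return total;
}

void main()
{
    assert(sumBuffered(100) == 328_350);
}
```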

Given a callable object next that writes to a user-provided buffer and
a second callable object empty that determines whether more data is
available to write via next, returns an input range that
asynchronously calls next with a set of size nBuffers of buffers
and makes the results available in the order they were obtained via the
input range interface of the returned object. Similarly to the
input range overload of asyncBuf, the first half of the buffers
are made available via the range interface while the second half are
filled and vice-versa.

Parameters:

C1 next

A callable object that takes a single argument that must be an array
with mutable elements. When called, next writes data to
the array provided by the caller.

C2 empty

A callable object that takes no arguments and returns a type
implicitly convertible to bool. This is used to signify
that no more data is available to be obtained by calling next.

size_t initialBufSize

The initial size of each buffer. If next takes its
array by reference, it may resize the buffers.

Any exceptions thrown while iterating over range are re-thrown on a
call to popFront.

Warning:
Using the range returned by this function in a parallel foreach loop
will not work because buffers may be overwritten while the task that
processes them is in queue. This is checked for at compile time
and will result in a static assertion failure.

template reduce(functions...)

auto reduce(Args...)(Args args);

Parallel reduce on a random access range. Except as otherwise noted, usage
is similar to std.algorithm.reduce. This function works by splitting
the range to be reduced into work units, which are slices to be reduced in
parallel. Once the results from all work units are computed, a final serial
reduction is performed on these results to compute the final answer.
Therefore, care must be taken to choose the seed value appropriately.

Because the reduction is being performed in parallel,
functions must be associative. For notational simplicity, let # be an
infix operator representing functions. Then, (a # b) # c must equal
a # (b # c). Floating point addition is not associative
even though addition in exact arithmetic is. Summing floating
point numbers using this function may give different results than summing
serially. However, for many practical purposes floating point addition
can be treated as associative.

Note that, since functions are assumed to be associative, additional
optimizations are made to the serial portion of the reduction algorithm.
These take advantage of the instruction level parallelism of modern CPUs,
in addition to the thread-level parallelism that the rest of this
module exploits. This can lead to better than linear speedups relative
to std.algorithm.reduce, especially for fine-grained benchmarks
like dot products.

An explicit seed may be provided as the first argument. If
provided, it is used as the seed for all work units and for the final
reduction of results from all work units. Therefore, if it is not the
identity value for the operation being performed, results may differ from
those generated by std.algorithm.reduce or depending on how many work
units are used. The next argument must be the range to be reduced.

If no explicit seed is provided, the first element of each work unit
is used as a seed. For the final reduction, the result from the first
work unit is used as the seed.

// Find the sum of a range in parallel, using the first
// element of each work unit as the seed.
auto sum = taskPool.reduce!"a + b"(nums);

An explicit work unit size may be specified as the last argument.
Specifying too small a work unit size will effectively serialize the
reduction, as the final reduction of the result of each work unit will
dominate computation time. If TaskPool.size for this instance
is zero, this parameter is ignored and one work unit is used.
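
The three calling conventions described above can be sketched as follows. Integer addition is used because it is exactly associative, and the seed 0 is its identity, so every variant gives the same answer; the helper parallelSum and the work unit size of 2 are chosen only for illustration.

```d
import std.parallelism : taskPool;

// Sums a slice in parallel with an explicit seed of 0,
// the identity value for addition.
int parallelSum(int[] nums)
{
    return taskPool.reduce!"a + b"(0, nums);
}

void main()
{
    int[] nums = [3, 9, 2, 7, 5, 1, 8];

    // No seed: the first element of each work unit seeds that unit.
    assert(taskPool.reduce!"a + b"(nums) == 35);

    // Identity seed: the result does not depend on the number of work units.
    assert(parallelSum(nums) == 35);

    // Explicit work unit size as the last argument.
    assert(taskPool.reduce!"a + b"(0, nums, 2) == 35);
}
```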

Struct for creating worker-local storage. Worker-local storage is
thread-local storage that exists only for worker threads in a given
TaskPool plus a single thread outside the pool. It is allocated on the
garbage collected heap in a way that avoids false sharing, and doesn't
necessarily have global scope within any thread. It can be accessed from
any worker thread in the TaskPool that created it, and one thread
outside this TaskPool. All threads outside the pool that created a
given instance of worker-local storage share a single slot.

Since the underlying data for this struct is heap-allocated, this struct
has reference semantics when passed between functions.

The main use cases for WorkerLocalStorage are:

1. Performing parallel reductions with an imperative, as opposed to
functional, programming style. In this case, it's useful to treat
WorkerLocalStorage as local to each thread for only the parallel
portion of an algorithm.
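
An imperative reduction of this kind might look like the following sketch. It assumes the storage is created through the pool's workerLocalStorage method and read back through toRange after the parallel portion has finished; the function countMultiplesOfThree is hypothetical.

```d
import std.parallelism : parallel, taskPool;
import std.range : iota;

// Counts the multiples of 3 in [0, n) using one counter per thread.
long countMultiplesOfThree(int n)
{
    // One slot per worker thread plus one for the calling thread,
    // each initialized to 0.
    auto counts = taskPool.workerLocalStorage(0L);

    // Parallel portion: every thread updates only its own slot.
    foreach (i; parallel(iota(n)))
    {
        if (i % 3 == 0)
            counts.get += 1;
    }

    // Serial portion: combine the per-thread counters.
    long total = 0;
    foreach (c; counts.toRange)
        total += c;
    return total;
}

void main()
{
    assert(countMultiplesOfThree(100) == 34);
}
```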

Get the current thread's instance. Returns by ref.
Note that calling get from any thread
outside the TaskPool that created this instance will return the
same reference, so an instance of worker-local storage should only be
accessed from one thread outside the pool that created it. If this
rule is violated, undefined behavior will result.

If assertions are enabled and toRange has been called, then this
WorkerLocalStorage instance is no longer worker-local and an assertion
failure will result when calling this method. This is not checked
when assertions are disabled for performance reasons.

@property void get(T val);

Assign a value to the current thread's instance. This function has
the same caveats as its overload.

@property WorkerLocalStorageRange!T toRange();

Returns a range view of the values for all threads, which can be used
to further process the results of each thread after running the parallel
part of your algorithm. Do not use this method in the parallel portion
of your algorithm.

Calling this function sets a flag indicating that this struct is no
longer worker-local, and attempting to use the get method again
will result in an assertion failure if assertions are enabled.

struct WorkerLocalStorageRange(T);

Range primitives for worker-local storage. The purpose of this is to
access results produced by each worker thread from a single thread once you
are no longer using the worker-local storage from multiple threads.
Do not use this struct in the parallel portion of your algorithm.

The proper way to instantiate this object is to call
WorkerLocalStorage.toRange. Once instantiated, this object behaves
as a finite random-access range with assignable, lvalue elements and
a length equal to the number of worker threads in the TaskPool that
created it plus 1.

Creates an instance of worker-local storage, initialized with a given
value. The value is lazy so that you can, for example, easily
create one instance of a class for each worker. For usage example,
see the WorkerLocalStorage struct.

@trusted void stop();

Signals to all worker threads to terminate as soon as they are finished
with their current Task, or immediately if they are not executing a
Task. Tasks that were in queue will not be executed unless
a call to Task.workForce, Task.yieldForce or Task.spinForce
causes them to be executed.

Use only if you have waited on every Task and therefore know the
queue is empty, or if you speculatively executed some tasks and no longer
need the results.

@trusted void finish(bool blocking = false);

Signals worker threads to terminate when the queue becomes empty.

If the blocking argument is true, wait for all worker threads to terminate
before returning. This option might be used in applications where
task results are never consumed, e.g. when TaskPool is employed as a
rudimentary scheduler for tasks which communicate by means other than
return values.

Warning:
Calling this function with blocking = true from a worker
thread that is a member of the same TaskPool that
finish is being called on will result in a deadlock.
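
A minimal sketch of shutting down an explicitly instantiated pool. It assumes the TaskPool constructor overload that takes a worker count; the count of 2 and the function twice are arbitrary. finish is called from the main thread, which is not a member of the pool, so the blocking form cannot deadlock here.

```d
import std.parallelism : TaskPool, task;

// Hypothetical workload used only for illustration.
int twice(int x)
{
    return 2 * x;
}

void main()
{
    // A private pool with two worker threads.
    auto pool = new TaskPool(2);

    auto t = task!twice(21);
    pool.put(t);
    assert(t.yieldForce == 42);

    // Let the workers drain the queue, then wait for them to terminate.
    pool.finish(true);
}
```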

Put a Task object on the back of the task queue. The Task
object may be passed by pointer or reference.

Example:

import std.file;
// Create a task.
auto t = task!read("foo.txt");
// Add it to the queue to be executed.
taskPool.put(t);

Notes:
@trusted overloads of this function are called for Tasks if
std.traits.hasUnsharedAliasing is false for the Task's
return type or the function the Task executes is pure.
Task objects that meet all other requirements specified in the
@trusted overloads of task and scopedTask may be created
and executed from @safe code via Task.executeInNewThread but
not via TaskPool.

While this function takes the address of variables that may
be on the stack, some overloads are marked as @trusted.
Task includes a destructor that waits for the task to complete
before destroying the stack frame it is allocated on. Therefore,
it is impossible for the stack frame to be destroyed before the task is
complete and no longer referenced by a TaskPool.

These properties control whether the worker threads are daemon threads.
A daemon thread is automatically terminated when all non-daemon threads
have terminated. A non-daemon thread will prevent a program from
terminating as long as it has not terminated.

If any TaskPool with non-daemon threads is active, either stop
or finish must be called on it before the program can terminate.

The worker threads in the TaskPool instance returned by the
taskPool property are daemon by default. The worker threads of
manually instantiated task pools are non-daemon by default.

Note:
For a size zero pool, the getter arbitrarily returns true and the
setter has no effect.
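
The defaults described above can be checked with a short sketch. The property name isDaemon is taken from the module's API rather than from the text above, and the pool size of 2 is arbitrary.

```d
import std.parallelism : TaskPool, taskPool;

void main()
{
    // Workers of the global pool are daemon threads by default
    // (a size zero pool also reports true).
    assert(taskPool.isDaemon);

    // Workers of a manually created pool start out as non-daemon threads.
    auto pool = new TaskPool(2);
    assert(!pool.isDaemon);

    // After switching, this pool no longer keeps the program alive.
    pool.isDaemon = true;
    assert(pool.isDaemon);

    // Shutting down explicitly remains the tidy option.
    pool.finish(true);
}
```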

These functions allow getting and setting the OS scheduling priority of
the worker threads in this TaskPool. They forward to
core.thread.Thread.priority, so a given priority value here means the
same thing as an identical priority value in core.thread.

Note:
For a size zero pool, the getter arbitrarily returns
core.thread.Thread.PRIORITY_MIN and the setter has no effect.

@property @trusted TaskPool taskPool();

Returns a lazily initialized global instantiation of TaskPool.
This function can safely be called concurrently from multiple non-worker
threads. The worker threads in this pool are daemon threads, meaning that it
is not necessary to call TaskPool.stop or TaskPool.finish before
terminating the main thread.

These properties get and set the number of worker threads in the TaskPool
instance returned by taskPool. The default value is totalCPUs - 1.
Calling the setter after the first call to taskPool does not change the
number of worker threads in the instance returned by taskPool.
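
A short sketch of the ordering requirement. The property name defaultPoolThreads is taken from the module's API rather than from the text above, and the thread counts are arbitrary.

```d
import std.parallelism : defaultPoolThreads, taskPool;

void main()
{
    // Set the worker count before the global pool is first used.
    defaultPoolThreads = 2;
    assert(taskPool.size == 2);

    // The pool already exists, so a later change does not resize it.
    defaultPoolThreads = 4;
    assert(taskPool.size == 2);
}
```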