Scope

This Technical Specification describes requirements for implementations of an
interface that computer programs written in the C++ programming language may
use to invoke algorithms with parallel execution. The algorithms described by
this Technical Specification are realizable across a broad class of
computer architectures.

This Technical Specification is non-normative. Some of the functionality
described by this Technical Specification may be considered for standardization
in a future version of C++, but it is not currently part of any C++ standard.
Some of the functionality in this Technical Specification may never be
standardized, and other functionality may be standardized in a substantially
changed form.

The goal of this Technical Specification is to build widespread existing
practice for parallelism in the C++ standard algorithms library. It gives
advice on extensions to those vendors who wish to provide them.

2

Normative references

The following referenced document is indispensable for the
application of this document. For dated references, only the
edition cited applies. For undated references, the latest edition
of the referenced document (including any amendments) applies.

ISO/IEC 14882:2017,
Programming Languages — C++

ISO/IEC 14882:2017 is herein called the C++ Standard.
The library described in ISO/IEC 14882:2017 clauses 20-33 is herein called
the C++ Standard Library. The C++ Standard Library components described in
ISO/IEC 14882:2017 clauses 28, 29.8 and 23.10.10 are herein called the C++ Standard
Algorithms Library.

Unless otherwise specified, the whole of the C++ Standard's Library
introduction (C++17 §20) is included into this
Technical Specification by reference.

General

Namespaces and headers

Since the extensions described in this Technical Specification are
experimental and not part of the C++ Standard Library, they should not be
declared directly within namespace std. Unless otherwise specified, all
components described in this Technical Specification are declared in namespace
std::experimental::parallelism_v2.

[ Note:
Once standardized, the components described by this Technical Specification are expected to be promoted to namespace std.
— end note ]

Unless otherwise specified, references to such entities described in this
Technical Specification are assumed to be qualified with
std::experimental::parallelism_v2, and references to entities described in the C++
Standard Library are assumed to be qualified with std::.

Extensions that are expected to eventually be added to an existing header
<meow> are provided inside the <experimental/meow> header,
which shall include the standard contents of <meow> as if by

#include <meow>

Unsequenced execution policy

The class unsequenced_policy
is an execution policy type used as a unique type to disambiguate
parallel algorithm overloading and indicate that a parallel algorithm's
execution may be vectorized, e.g., executed on a single thread using
instructions that operate on multiple data items.

The invocations of element access functions in parallel algorithms invoked with an execution policy of type unsequenced_policy
are permitted to execute in an unordered fashion in the calling thread,
unsequenced with respect to one another within the calling thread.
[ Note:
This means that multiple function object invocations may be interleaved on a single thread.
— end note ]

[ Note:
This overrides the usual guarantee from the C++ Standard, C++17 §4.6 [intro.execution] that function executions do not overlap with one another.
— end note ]

During the execution of a parallel algorithm with the experimental::execution::unsequenced_policy policy, if the invocation of an element access function exits via an uncaught exception, terminate() will be called.

6.3

Vector execution policy

The class vector_policy
is an execution policy type used as a unique type to disambiguate
parallel algorithm overloading and indicate that a parallel algorithm's
execution may be vectorized. Additionally, such vectorization will
result in an execution that respects the sequencing constraints of
wavefront application ([parallel.alg.general.wavefront]). [ Note:
The implementation thus makes stronger guarantees for vector_policy than for unsequenced_policy.
— end note ]

The invocations of element access functions in parallel algorithms invoked with an execution policy of type vector_policy
are permitted to execute in an unordered fashion in the calling thread,
unsequenced with respect to one another within the calling thread,
subject to the sequencing constraints of wavefront application ([parallel.alg.general.wavefront]) for the last argument to for_loop, for_loop_n, for_loop_strided, or for_loop_strided_n.

During the execution of a parallel algorithm with the experimental::execution::vector_policy policy, if the invocation of an element access function exits via an uncaught exception, terminate() will be called.

Parallel algorithms

Wavefront Application

For the purposes of this section, an evaluation is a value computation or side effect of
an expression, or an execution of a statement. Initialization of a temporary object is considered a
subexpression of the expression that necessitates the temporary object.

An evaluation A contains an evaluation B if:

A and B are not potentially concurrent ([intro.races]); and

the start of A is the start of B or the start of A is sequenced before the start of B; and

the completion of B is the completion of A or the completion of B is sequenced before the completion of A.

[ Note:
This includes evaluations occurring in function invocations.
— end note ]

An evaluation A is ordered before an evaluation B if A is deterministically
sequenced before B. [ Note:
If A is indeterminately sequenced with respect to B
or A and B are unsequenced, then A is not ordered before B and B is not ordered
before A. The ordered before relationship is transitive.
— end note ]

For an evaluation A ordered before an evaluation B, both contained in the same
invocation of an element access function, A is a vertical antecedent of B if:

there exists an evaluation S such that:

S contains A, and

S contains all evaluations C (if any) such that A is ordered before C and C is ordered before B,

but S does not contain B, and

control reached B from A without executing any of the following:

a goto statement or asm declaration that jumps to a statement outside of S, or

a switch statement executed within S that transfers control into a substatement of a nested selection or iteration statement, or

a throw [ Note: even if caught — end note ], or

a longjmp.

[ Note:
Vertical antecedent is an irreflexive, antisymmetric, nontransitive relationship between two evaluations.
Informally, A is a vertical antecedent of B if A is sequenced immediately before B or A is nested zero or
more levels within a statement S that immediately precedes B.
— end note ]

In the following, Xi and Xj refer to evaluations of the same expression
or statement contained in the application of an element access function corresponding to the ith and
jth elements of the input sequence. [ Note:
There might be several evaluations Xk,
Yk, etc. of a single expression or statement in application k, for example, if the
expression or statement appears in a loop within the element access function.
— end note ]

Horizontally matched is an equivalence relationship between two evaluations of the same expression. An
evaluation Bi is horizontally matched with an evaluation Bj if:

both are the first evaluations in their respective applications of the element access function, or

there exist horizontally matched evaluations Ai and Aj that are vertical antecedents of evaluations Bi and Bj, respectively.

[ Note:
Horizontally matched establishes a theoretical lock-step relationship between evaluations in different applications of an element access function.
— end note ]

Let f be a function called for each argument list in a sequence of argument lists.
Wavefront application of f requires that evaluation Ai be sequenced
before evaluation Bj if i < j and:

Ai is sequenced before some evaluation Bi and Bi is horizontally matched with Bj, or

Ai is horizontally matched with some evaluation Aj and Aj is sequenced before Bj.

[ Note:
Wavefront application guarantees that parallel applications i and j execute such that progress on application j never gets ahead of application i.
— end note ]

[ Note:
The relationships between Ai and Bi and between Aj and Bj are sequenced before, not vertical antecedent.
— end note ]

7.2

Reductions

Each of the function templates in this subclause ([parallel.alg.reductions]) returns a reduction object
of unspecified type having a reduction value type and encapsulating a reduction identity value for the reduction, a
combiner function object, and a live-out object from which the initial value is obtained and into which the final
value is stored.

An algorithm uses reduction objects by allocating an unspecified number of instances, known as accumulators, of the reduction value
type. [ Note:
An implementation might, for example, allocate an accumulator for each thread in its private thread pool.
— end note ]
Each accumulator is initialized with the object’s reduction
identity, except that the live-out object (which was initialized by the
caller) comprises one of the accumulators. The algorithm passes a
reference to an accumulator to each application of an element-access
function, ensuring that no two concurrently executing
invocations share the same accumulator. An accumulator can be shared
between two
applications that do not execute concurrently, but
initialization is performed only once per accumulator.

Modifications to the accumulator by the application of element
access functions accrue as partial results. At some point before the
algorithm
returns, the partial results are combined, two at a time, using
the reduction object’s combiner operation until a single value remains,
which
is then assigned back to the live-out object. [ Note:
In order to produce useful results, modifications to the accumulator should be limited
to commutative operations closely related to the combiner operation. For example, if the combiner is plus<T>, incrementing
the accumulator would be consistent with the combiner, but doubling it or assigning to it would not.
— end note ]

T shall meet the requirements of CopyConstructible and MoveAssignable.

Returns:

A reduction object of unspecified type having reduction value type T, reduction identity and combiner operation as specified in Table 2, and using the object referenced by var as its live-out object.

Table 2 — Reduction identities and combiner operations

Function                Reduction Identity      Combiner Operation
reduction_plus          T()                     x + y
reduction_multiplies    T(1)                    x * y
reduction_bit_and       (~T())                  x & y
reduction_bit_or        T()                     x | y
reduction_bit_xor       T()                     x ^ y
reduction_min           var                     min(x, y)
reduction_max           var                     max(x, y)

[ Example:
The following code updates each element of y and sets s to the sum of the squares.

Inductions

Each of the function templates in this section returns an induction object of unspecified type having an induction
value type and encapsulating an initial value i of that type and, optionally, a stride.

For each element in the input range, an algorithm over input sequence S computes an induction value from an induction variable
and ordinal position p within S by the formula i + p * stride if a stride was specified or i + p otherwise. This induction value is
passed to the element access function.

An induction object may refer to a live-out object to hold the final value of the induction sequence. When the algorithm using the induction
object completes, the live-out object is assigned the value i + n * stride, where n is the number of elements in the input range.

An induction object with induction value type remove_cv_t<remove_reference_t<T>>,
initial value var, and (if specified) stride stride. If T is an lvalue reference
to non-const type, then the object referenced by var becomes the live-out object for the
induction object; otherwise there is no live-out object.

For the overloads with an ExecutionPolicy, I shall be an integral type
or meet the requirements of a forward iterator type; otherwise, I shall be an integral
type or meet the requirements of an input iterator type. Size shall be an integral type
and n shall be non-negative. S shall have integral type and stride
shall have non-zero value. stride shall be negative only if I has integral
type or meets the requirements of a bidirectional iterator. The rest parameter pack shall
have at least one element, comprising objects returned by invocations of reduction
([parallel.alg.reduction]) and/or induction ([parallel.alg.induction]) function templates
followed by exactly one invocable element-access function, f. For the overloads with an
ExecutionPolicy, f shall meet the requirements of CopyConstructible;
otherwise, f shall meet the requirements of MoveConstructible.

Effects:

Applies f to each element in the input sequence, as described below, with additional
arguments corresponding to the reductions and inductions in the rest parameter pack. The
length of the input sequence is:

n, if specified,

otherwise finish - start if neither n nor stride is specified,

otherwise 1 + (finish-start-1)/stride if stride is positive,

otherwise 1 + (start-finish-1)/-stride.

The first element in the input sequence is start. Each subsequent element is generated by adding
stride to the previous element, if stride is specified, otherwise by incrementing
the previous element. [ Note:
As described in the C++ standard, section [algorithms.general], arithmetic
on non-random-access iterators is performed using advance and distance.
— end note ][ Note:
The order of the
elements of the input sequence is important for determining ordinal position of an application of f,
even though the applications themselves may be unordered.
— end note ]

The first argument to f is an element from the input sequence. [ Note:
If I is an
iterator type, the iterators in the input sequence are not dereferenced before
being passed to f.
— end note ] For each member of the rest parameter pack
excluding f, an additional argument is passed to each application of f as follows:

If the pack member is an object returned by a call to a reduction function listed in section
[parallel.alg.reductions], then the additional argument is a reference to an accumulator of that reduction
object.

If the pack member is an object returned by a call to induction, then the additional argument is the
induction value for that induction object corresponding to the position of the application of f in the input
sequence.

Complexity:

Applies f exactly once for each element of the input sequence.

Remarks:

If f returns a result, the result is ignored.

7.2.5

No vec

Evaluates std::forward<F>(f)(). When invoked within an element access function
in a parallel algorithm using vector_policy, if two calls to no_vec are
horizontally matched within a wavefront application of an element access function over input
sequence S, then the execution of f in the application for one element in S is
sequenced before the execution of f in the application for a subsequent element in
S; otherwise, there is no effect on sequencing.

Returns:

The result of f.

Notes:

If f exits via an exception, then terminate will be called, consistent
with all other potentially-throwing operations invoked with vector_policy execution.
[ Example:

An object of type ordered_update_t<T> is a proxy for an object of type T
intended to be used within a parallel application of an element access function using a
policy object of type vector_policy. Simple increments, assignments, and compound
assignments to the object are forwarded to the proxied object, but are sequenced as though
executed within a no_vec invocation.
[ Note:
The return-value deduction of the forwarded operations results in these operations returning by
value, not reference. This formulation prevents accidental collisions on accesses to the return
value.
— end note ]

The class task_canceled_exception defines the type of objects thrown by
task_block::run or task_block::wait if they detect that an
exception is pending within the current parallel block. See 8.5, below.

The class task_block defines an interface for forking and joining parallel tasks. The define_task_block and define_task_block_restore_thread function templates create an object of type task_block and pass a reference to that object to a user-provided function object.

An object of class task_block cannot be constructed,
destroyed, copied, or moved except by the implementation of the task
block library. Taking the address of a task_block object via operator& is ill-formed. Obtaining its address by any other means (including addressof) results in a pointer with an unspecified value; dereferencing such a pointer results in undefined behavior.

A task_block is active if it was created by the nearest enclosing task block, where “task block” refers to an
invocation of define_task_block or define_task_block_restore_thread and “nearest enclosing” means the most
recent invocation that has not yet completed. Code designated for execution in another thread by means other
than the facilities in this section (e.g., using thread or async) is not enclosed in the task block, and a
task_block passed to (or captured by) such code is not active within that code. Performing any operation on a
task_block that is not active results in undefined behavior.

When the argument to task_block::run is called, no task_block is active, not even the task_block on which run was called.
(The function object should not, therefore, capture a task_block from the surrounding block.)

task_block member function template run

F shall be MoveConstructible. DECAY_COPY(std::forward<F>(f))() shall be a valid expression.

Preconditions:

*this shall be the active task_block.

Effects:

Evaluates DECAY_COPY(std::forward<F>(f))(), where DECAY_COPY(std::forward<F>(f))
is evaluated synchronously within the current thread. The call to the resulting copy of the function object is
permitted to run on an unspecified thread created by the implementation in an unordered fashion relative to
the sequence of operations following the call to run(f) (the continuation), or indeterminately sequenced
within the same thread as the continuation. The call to run synchronizes with the call to the function
object. The completion of the call to the function object synchronizes with the next invocation of wait on
the same task_block or completion of the nearest enclosing task block (i.e., the define_task_block or
define_task_block_restore_thread that created this task_block).

The run function may return on a thread other than the one on which it was called; in such cases,
completion of the call to run synchronizes with the continuation.
[ Note:
The return from run is ordered similarly to an ordinary function call in a single thread.
— end note ]

Remarks:

The invocation of the user-supplied function object f may be immediate or may be delayed until
compute resources are available. run might or might not return before the invocation of f completes.

The wait function may return on a thread other than the one on which it was called; in such cases, completion of the call to wait synchronizes with subsequent operations.
[ Note:
The return from wait is ordered similarly to an ordinary function call in a single thread.
— end note ][ Example:

The define_task_block function may return on a thread other than the one on which it was called
unless there are no task blocks active on entry to define_task_block (see 8.3), in which
case the function returns on the original thread. When define_task_block returns on a different thread,
it synchronizes with operations following the call. [ Note:
The return from define_task_block is ordered
similarly to an ordinary function call in a single thread.
— end note ] The define_task_block_restore_thread
function always returns on the same thread as the one on which it was called.

Notes:

It is expected (but not mandated) that f will (directly or indirectly) call tb.run(function-object).

8.5

Exception Handling

Every task_block has an associated exception list. When the task block starts, its associated exception list is empty.

When an exception is thrown from the user-provided function object passed to define_task_block or
define_task_block_restore_thread, it is added to the exception list for that task block. Similarly, when
an exception is thrown from the user-provided function object passed into task_block::run, the exception
object is added to the exception list associated with the nearest enclosing task block. In both cases, an
implementation may discard any pending tasks that have not yet been invoked. Tasks that are already in
progress are not interrupted except at a call to task_block::run or task_block::wait as described below.

If the implementation is able to detect that an exception has been thrown by another task within
the same nearest enclosing task block, then task_block::run or task_block::wait may throw
task_canceled_exception; these instances of task_canceled_exception are not added to the exception
list of the corresponding task block.

When a task block finishes with a non-empty exception list, the exceptions are aggregated into an exception_list object, which is then thrown from the task block.

The order of the exceptions in the exception_list object is unspecified.

9

Data-Parallel Types

General

The data-parallel library consists of data-parallel types and
operations on these types. A data-parallel type consists of elements of
an underlying arithmetic type, called the element type. The number of elements is a constant for each data-parallel type and called the width of that type.

Throughout this Clause, the term data-parallel type refers to all supported specializations (see 9.3.1) of the simd and simd_mask class templates. A data-parallel object is an object of data-parallel type.

An element-wise operation applies a specified operation
to the elements of one or more data-parallel objects. Each such
application is unsequenced with respect to the others. A unary element-wise operation is an element-wise operation that applies a unary operation to each element of a data-parallel object. A binary element-wise operation is an element-wise operation that applies a binary operation to corresponding elements of two data-parallel objects.

Throughout this Clause, the set of vectorizable types for a data-parallel type comprises all cv-unqualified arithmetic types other than bool.

[ Note:
The intent is to support acceleration through data-parallel
execution resources, such as SIMD registers and instructions or
execution units driven by a common instruction decoder. If such
execution resources are unavailable, the interfaces support a
transparent fallback to sequential execution.
— end note ]

An ABI tag is a type in the std::experimental::parallelism_v2::simd_abi namespace that indicates a choice of size and binary representation for objects of data-parallel type. [ Note:
The intent is for the size and binary representation to depend on the target architecture.
— end note ] The ABI tag, together with a given element type implies a
number of elements. ABI tag types are used as the second template
argument to simd and simd_mask.

Use of the scalar tag type requires data-parallel types to store a single element (i.e., simd<T, simd_abi::scalar>::size() returns 1). [ Note:
scalar is not an alias for fixed_size<1>.
— end note ]

The value of max_fixed_size<T> is at least 32.

Use of the simd_abi::fixed_size<N> tag type requires data-parallel types to store N elements (i.e. simd<T, simd_abi::fixed_size<N>>::size() is N). simd<T, fixed_size<N>> and simd_mask<T, fixed_size<N>> with N > 0 and N <= max_fixed_size<T> shall be supported. Additionally, for every supported simd<T, Abi> (see 9.3.1), where Abi is an ABI tag that is not a specialization of simd_abi::fixed_size, N == simd<T, Abi>::size() shall be supported.

[ Note:
It is unspecified whether simd<T, fixed_size<N>> with N > max_fixed_size<T> is supported. The value of max_fixed_size<T> can depend on compiler flags and can change between different compiler versions.
— end note ]

[ Note:
An implementation can forego ABI compatibility between differently compiled translation units for simd and simd_mask specializations using the same simd_abi::fixed_size<N> tag. Otherwise, the efficiency of simd<T, Abi> is likely to be better than for simd<T, fixed_size<simd_size_v<T, Abi>>> (with Abi not a specialization of simd_abi::fixed_size).
— end note ]

An implementation may define additional extended ABI tag types in the std::experimental::parallelism_v2::simd_abi namespace, to support other forms of data-parallel computation.

compatible<T> is an implementation-defined alias for an ABI tag. [ Note:
The intent is to use the ABI tag producing the most efficient data-parallel execution for the element type T that ensures ABI compatibility between translation units on the target architecture.
— end note ]

[ Example: Consider a target architecture supporting the extended ABI tags __simd128 and __simd256, where the __simd256 type requires an optional ISA extension on said architecture. Also, the target architecture does not support long double with either ABI tag. The implementation therefore defines

compatible<T> as an alias for __simd128 for all vectorizable T, except long double, and

compatible<long double> as an alias for scalar.

— end example ]

native<T> is an implementation-defined alias for an ABI tag. [ Note:
The intent is to use the ABI tag producing the most efficient data-parallel execution for the element type T that is supported on the currently targeted system. For target architectures without ISA extensions, the native<T> and compatible<T> aliases will likely be the same. For target architectures with ISA extensions, compiler flags may influence the native<T> alias while compatible<T> will be the same independent of such flags.
— end note ]

[ Example: Consider a target architecture supporting the extended ABI tags __simd128 and __simd256, where hardware support for __simd256 only exists for floating-point types. The implementation therefore defines native<T> as an alias for __simd256 for vectorizable floating-point types, and as an alias for __simd128 otherwise.

— end example ]
If N is 1, the member typedef type is simd_abi::scalar. Otherwise, if there are multiple ABI tag types that satisfy the constraints, the member typedef type is implementation-defined. [ Note:
It is expected that extended ABI tags can produce better optimizations and thus are preferred over simd_abi::fixed_size<N>.
— end note ]

The behavior of a program that adds specializations for deduce is undefined.

If value is present, the type simd_size<T, Abi> is a BinaryTypeTrait with a BaseCharacteristic of integral_constant<size_t, N> with N equal to the number of elements in a simd<T, Abi> object. [ Note:
If simd<T, Abi> is not supported for the currently targeted system, simd_size<T, Abi>::value produces the value simd<T, Abi>::size() would return if it were supported.
— end note ]

The behavior of a program that adds specializations for simd_size is undefined.

If value is present, the type memory_alignment<T, U> is a BinaryTypeTrait with a BaseCharacteristic of integral_constant<size_t, N> for some implementation-defined N (see 9.3.4 and 9.5.3). [ Note:
value identifies the alignment restrictions on pointers used for (converting) loads and stores for the given type T on arrays of type U.
— end note ]

The behavior of a program that adds specializations for memory_alignment is undefined.

A copy of data with the indicated unary operator applied to all selected elements.

Throws:

Nothing.

template<class U, class Flags> void copy_to(U* mem, Flags) const &&;

Requires:

If the template parameter Flags is vector_aligned_tag, mem shall point to storage aligned by memory_alignment_v<T, U>. If the template parameter Flags is overaligned_tag<N>, mem shall point to storage aligned by N. If the template parameter Flags is element_aligned_tag, mem shall point to storage aligned by alignof(U). If M is not bool, the largest i ∊ [0, M::size()) where mask[i] is true is less than the number of values pointed to by mem.

Effects:

Copies the selected elements as if mem[i] = static_cast<U>(data[i]) for all selected indices i.

Throws:

Nothing.

Remarks:

This function shall not participate in overload resolution unless

is_simd_flag_type_v<Flags> is true, and

either

U is bool and value_type is bool, or

U is a vectorizable type and value_type is not bool.

template<class U> void operator=(U&& x) &&;

Effects:

Replaces data[i] with static_cast<T>(std::forward<U>(x))[i] for all selected indices i.

Remarks:

This operator shall not participate in overload resolution unless U is convertible to T.

Each of these operators shall not participate in overload resolution unless the return type of data @ std::forward<U>(x) is convertible to T.
It is unspecified whether the binary operator, implied by the compound
assignment operator, is executed on all elements or only on the selected
elements.

If the template parameter Flags is vector_aligned_tag, mem shall point to storage aligned by memory_alignment_v<T, U>. If the template parameter Flags is overaligned_tag<N>, mem shall point to storage aligned by N. If the template parameter Flags is element_aligned_tag, mem shall point to storage aligned by alignof(U). If is_simd_flag_type_v<U> is true, for all selected indices i, i shall be less than the number of values pointed to by mem.

Effects:

Replaces the selected elements as if data[i] = static_cast<value_type>(mem[i]) for all selected indices i.

The class template simd is a data-parallel type. The width of a given simd specialization is a constant expression, determined by the template parameters.

Every specialization of simd shall be a complete type. The specialization simd<T, Abi> is supported if T is a vectorizable type and

Abi is simd_abi::scalar, or

Abi is simd_abi::fixed_size<N>, with N is constrained as defined in 9.2.1.

If Abi is an extended ABI tag, it is implementation-defined whether simd<T, Abi> is supported. [ Note:
The intent is for implementations to decide on the basis of the currently targeted system.
— end note ]

If simd<T, Abi> is not supported, the
specialization shall have a deleted default constructor, deleted
destructor, deleted copy constructor, and deleted copy assignment.

[ Example:
Consider an implementation that defines the extended ABI tags __simd_x and __gpu_y. When the compiler is invoked to translate to a machine that has support for the __simd_x ABI tag for all arithmetic types other than long double and no support for the __gpu_y ABI tag, then:

simd<T, simd_abi::__gpu_y> is not supported for any T and has a deleted constructor.

simd<long double, simd_abi::__simd_x> is not supported and has a deleted constructor.

simd<double, simd_abi::__simd_x> is supported.

simd<long double, simd_abi::scalar> is supported.

— end example ]

Default initialization performs no initialization of the elements; value-initialization initializes each element with T(). [ Note:
Thus, default initialization leaves the elements in an indeterminate state.
— end note ]

static constexpr size_t size() noexcept;

Returns:

The width of simd<T, Abi>.

Implementations should enable explicit conversion from and to
implementation-defined types. This adds one or more of the following
declarations to class simd:

[ Example:
Consider an implementation that supports the type __vec4f and the function __vec4f _vec4f_addsub(__vec4f, __vec4f) for the currently targeted system.
A user may require the use of _vec4f_addsub for maximum performance and thus writes:

Constructs an object where the i-th element equals static_cast<T>(x[i]) for all i ∊ [0, size()).

Remarks:

This constructor shall not participate in overload resolution unless

abi_type is simd_abi::fixed_size<size()>, and

every possible value of U can be represented with type value_type, and

if both U and value_type are integral, the integer conversion rank [conv.rank] of value_type is greater than the integer conversion rank of U.

template<class G> simd(G&& gen);

Effects:

Constructs an object where the i-th element is initialized to gen(integral_constant<size_t, i>()).

Remarks:

This constructor shall not participate in overload resolution unless simd(gen(integral_constant<size_t, i>())) is well-formed for all i ∊ [0, size()).

The calls to gen are unsequenced with respect to each other. Vectorization-unsafe standard library functions may not be invoked by gen ([algorithms.parallel.exec]).

template<class U, class Flags> simd(const U* mem, Flags);

Requires:

If the template parameter Flags is vector_aligned_tag, mem shall point to storage aligned by memory_alignment_v<simd, U>. If the template parameter Flags is overaligned_tag<N>, mem shall point to storage aligned by N. If the template parameter Flags is element_aligned_tag, mem shall point to storage aligned by alignof(U). [mem, mem + size()) is a valid range.

Effects:

Constructs an object where the i-th element is initialized to static_cast<T>(mem[i]) for all i ∊ [0, size()).

Remarks:

This constructor shall not participate in overload resolution unless

is_simd_flag_type_v<Flags> is true, and

U is a vectorizable type.

9.3.4

Copy functions

template<class U, class Flags> void copy_from(const U* mem, Flags);

Requires:

If the template parameter Flags is vector_aligned_tag, mem shall point to storage aligned by memory_alignment_v<simd, U>. If the template parameter Flags is overaligned_tag<N>, mem shall point to storage aligned by N. If the template parameter Flags is element_aligned_tag, mem shall point to storage aligned by alignof(U). [mem, mem + size()) is a valid range.

Effects:

Replaces the elements of the simd object such that the i-th element is assigned with static_cast<T>(mem[i]) for all i ∊ [0, size()).

Remarks:

This function shall not participate in overload resolution unless

is_simd_flag_type_v<Flags> is true, and

U is a vectorizable type.

template<class U, class Flags> void copy_to(U* mem, Flags) const;

Requires:

If the template parameter Flags is vector_aligned_tag, mem shall point to storage aligned by memory_alignment_v<simd, U>. If the template parameter Flags is overaligned_tag<N>, mem shall point to storage aligned by N. If the template parameter Flags is element_aligned_tag, mem shall point to storage aligned by alignof(U). [mem, mem + size()) is a valid range.

Effects:

Copies all simd elements as if mem[i] = static_cast<U>(operator[](i)) for all i ∊ [0, size()).

binary_op shall be callable with two arguments of type T returning T, or callable with two arguments of type simd<T, A1> returning simd<T, A1> for every A1 that is an ABI tag type. The results of binary_op(identity_element, x) and binary_op(x, identity_element) shall be equal to x for all finite values x representable by V::value_type.

A data-parallel object initialized with the concatenated values in the xs pack of data-parallel objects: The i-th simd/simd_mask element of the j-th parameter in the xs pack is copied to the return value's element with index i + the sum of the width of the first j parameters in the xs pack.

Math library

For each set of overloaded functions within <cmath>, there shall be additional overloads sufficient to ensure that if any argument corresponding to a double parameter has type simd<T, Abi>, where is_floating_point_v<T> is true, then:

All arguments corresponding to double parameters shall be convertible to simd<T, Abi>.

All arguments corresponding to double* parameters shall be of type simd<T, Abi>*.

All arguments corresponding to parameters of integral type U shall be convertible to fixed_size_simd<U, simd_size_v<T, Abi>>.

All arguments corresponding to U*, where U is integral, shall be of type fixed_size_simd<U, simd_size_v<T, Abi>>*.

If the corresponding return type is double, the return type of the additional overloads is simd<T, Abi>. Otherwise, if the corresponding return type is bool, the return type of the additional overloads is simd_mask<T, Abi>. Otherwise, the return type is fixed_size_simd<R, simd_size_v<T, Abi>>, with R denoting the corresponding return type.

It is unspecified whether a call to these overloads with arguments that are all convertible to simd<T, Abi> but are not of type simd<T, Abi> is well-formed.

Each function overload produced by the above rules applies the indicated <cmath>
function element-wise. The results per element are not required to be
bitwise equal to the application of the function which is overloaded for
the element type.

The behavior is undefined if a domain, pole, or range error
occurs when the input argument(s) are applied to the indicated <cmath> function.

If abs is called with an argument of type simd<X, Abi> for which is_unsigned_v<X> is true, the program is ill-formed.

The class template simd_mask is a data-parallel type with the element type bool. The width of a given simd_mask specialization is a constant expression, determined by the template parameters. Specifically, simd_mask<T, Abi>::size() == simd<T, Abi>::size().

Every specialization of simd_mask shall be a complete type. The specialization simd_mask<T, Abi> is supported if T is a vectorizable type and

Abi is simd_abi::scalar, or

Abi is simd_abi::fixed_size<N>, with N constrained as defined in 9.2.1.

If Abi is an extended ABI tag, it is implementation-defined whether simd_mask<T, Abi> is supported. [ Note:
The intent is for implementations to decide on the basis of the currently targeted system.
— end note ]
If simd_mask<T, Abi> is not supported, the
specialization shall have a deleted default constructor, deleted
destructor, deleted copy constructor, and deleted copy assignment.

Default initialization performs no initialization of the elements; value-initialization initializes each element with false. [ Note:
Thus, default initialization leaves the elements in an indeterminate state.
— end note ]

static constexpr size_t size() noexcept;

Returns:

The width of simd_mask<T, Abi>.

Implementations should enable explicit conversion from and to
implementation-defined types. This adds one or more of the following
declarations to class simd_mask:

explicit operator implementation-defined() const;
explicit simd_mask(const implementation-defined& init);

template<class U> simd_mask(const simd_mask<U, simd_abi::fixed_size<size()>>& x);

Effects:

Constructs an object of type simd_mask where the i-th element equals x[i] for all i ∊ [0, size()).

Remarks:

This constructor shall not participate in overload resolution unless abi_type is simd_abi::fixed_size<size()>.

template<class Flags> simd_mask(const value_type* mem, Flags);

Requires:

If the template parameter Flags is vector_aligned_tag, mem shall point to storage aligned by memory_alignment_v<simd_mask>. If the template parameter Flags is overaligned_tag<N>, mem shall point to storage aligned by N. If the template parameter Flags is element_aligned_tag, mem shall point to storage aligned by alignof(value_type). [mem, mem + size()) is a valid range.

Effects:

Constructs an object where the i-th element is initialized to mem[i] for all i ∊ [0, size()).

Remarks:

This constructor shall not participate in overload resolution unless is_simd_flag_type_v<Flags> is true.

9.5.3

Copy functions

template<class Flags> void copy_from(const value_type* mem, Flags);

Requires:

If the template parameter Flags is vector_aligned_tag, mem shall point to storage aligned by memory_alignment_v<simd_mask>. If the template parameter Flags is overaligned_tag<N>, mem shall point to storage aligned by N. If the template parameter Flags is element_aligned_tag, mem shall point to storage aligned by alignof(value_type). [mem, mem + size()) is a valid range.

Effects:

Replaces the elements of the simd_mask object such that the i-th element is replaced with mem[i] for all i ∊ [0, size()).

Remarks:

This function shall not participate in overload resolution unless is_simd_flag_type_v<Flags> is true.

template<class Flags> void copy_to(value_type* mem, Flags) const;

Requires:

If the template parameter Flags is vector_aligned_tag, mem shall point to storage aligned by memory_alignment_v<simd_mask>. If the template parameter Flags is overaligned_tag<N>, mem shall point to storage aligned by N. If the template parameter Flags is element_aligned_tag, mem shall point to storage aligned by alignof(value_type). [mem, mem + size()) is a valid range.

Effects:

Copies all simd_mask elements as if mem[i] = operator[](i) for all i ∊ [0, size()).

Remarks:

This function shall not participate in overload resolution unless is_simd_flag_type_v<Flags> is true.