The C++ template mechanism would allow one to define a hybrid SoA container class: similar to std::vector, which abstracts a traditional C array, one could implement a wrapper around a T block[N]*:

// scalar context throughout this example
struct vec3 { float x, y, z; };
// vec3 block[N]* pointing to ceil(n/N) elements
hsoa<vec3> vecs(n);
// preferred vector length of vec3 automatically derived
static const int N = hsoa<vec3>::vector_length;
int i = /*...*/;
hsoa<vec3>::block_index ii = /*...*/;
vec3 v = vecs[i];            // gather
vecs[i] = v;                 // scatter
vec3 block[N] w = vecs[ii];  // fetch whole block
hsoa<vec3>::ref r = vecs[i]; // get proxy to a scalar
r = v;                       // pipe through proxy
// for each element
vecs.foreach([](vec3& scalar v) { /*...*/ });
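The snippet above uses the paper's extended syntax (block[N], scalar contexts), but the basic hybrid-SoA layout can be sketched in plain standard C++ too. This is a minimal illustrative sketch, not the paper's actual implementation; all names here (hsoa_vec3, vec3_block) are mine:

```cpp
#include <cstddef>
#include <vector>

// A minimal hybrid-SoA sketch: each block holds N x-, y- and z-lanes
// contiguously, so a SIMD load can pick up N consecutive floats of one
// component. N = 4 is an assumed preferred vector length.
constexpr std::size_t N = 4;

struct vec3 { float x, y, z; };

struct vec3_block {
    float x[N], y[N], z[N];
};

class hsoa_vec3 {
    std::vector<vec3_block> blocks_;
    std::size_t size_;
public:
    explicit hsoa_vec3(std::size_t n)
        : blocks_((n + N - 1) / N), size_(n) {}  // ceil(n/N) blocks

    std::size_t size() const { return size_; }

    // gather: rebuild one AoS element from its SoA lanes
    vec3 get(std::size_t i) const {
        const vec3_block& b = blocks_[i / N];
        std::size_t lane = i % N;
        return {b.x[lane], b.y[lane], b.z[lane]};
    }

    // scatter: spread one AoS element across the SoA lanes
    void set(std::size_t i, vec3 v) {
        vec3_block& b = blocks_[i / N];
        std::size_t lane = i % N;
        b.x[lane] = v.x; b.y[lane] = v.y; b.z[lane] = v.z;
    }

    // apply f to every element through gather/scatter
    template <class F>
    void foreach(F f) {
        for (std::size_t i = 0; i < size_; ++i) {
            vec3 v = get(i);
            f(v);
            set(i, v);
        }
    }
};
```

A real implementation would expose proxy references and whole-block access as in the example above; the point here is just that the container is "essentially a struct that acts like an array".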
Regardless of the other ideas of their C-like language, a similar struct should be added to Phobos once higher-level SIMD support in D is in better shape. Supporting hybrid SoA and a few operations on it would be an important but probably quite short and simple addition to the Phobos collections (it's essentially a struct that acts like an array, with a few simple extra operations).
I think no commonly used language allows SIMD programming that is both very simple and quite efficient (Scala, CUDA, C, C++, C#, Java, Go, and currently Rust too, do not support SIMD programming well. I think Haskell currently doesn't support it well either, but Haskell is very flexible and is compiled by a native compiler, so it may be possible to add such things). So supporting it well in D would be an interesting selling point for D. (Supporting very simple SIMD coding in D will make D more widespread, but this kind of programming will probably remain a small niche.)
Bye,
bearophile

I don't know what that code does. I think the if statement is always
true. Try compiling it in D.
test.d(8): Error: 4 < x must be parenthesized when next to operator &
test.d(8): Error: x <= 8 must be parenthesized when next to operator &
Making that an error was such a good idea.
<g>

I don't know what that code does. I think the if statement is always
true.

No, the code is fine.

Try compiling it in D.
test.d(8): Error: 4 < x must be parenthesized when next to operator &
test.d(8): Error: x <= 8 must be parenthesized when next to operator &
Making that an error was such a good idea.
<g>
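For context on why "the code is fine": in C and C++ the relational operators bind tighter than bitwise &, so 4 < x & x <= 8 parses as (4 < x) & (x <= 8), a valid range check; D instead makes the unparenthesized form an error to force the intent to be explicit. A small illustrative C++ check (the in_range name is mine):

```cpp
// In C/C++, '<' and '<=' bind tighter than bitwise '&', so
// 4 < x & x <= 8 parses as (4 < x) & (x <= 8). Each comparison yields
// 0 or 1, so the bitwise AND has the same truth table as '&&' here.
bool in_range(int x) {
    return (4 < x) & (x <= 8);
}
```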


Actually, I have yet to see any language that has SIMD as part of the language standard, rather than as an extension where each vendor goes its own way.


HLSL, GLSL, Cg? :)
I don't think it's possible, considering that D is designed to plug in to various backends.
D already has what's required to do some fairly nice (by comparison) simd stuff with good supporting libraries.
One thing I can think of that would really improve simd (and not only simd) would be a way to define compound operators.
If the library could detect/hook sequences of operations and implement them more efficiently as a compound, that would make some very powerful optimisations available.
Simple example:
T opCompound(string seq)(T a, T b, T c) if(seq == "* +") { return _madd(a, b, c); }
T opCompound(string seq)(T a, T b, T c) if(seq == "+ *") { return _madd(b, c, a); }
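The same detect-and-fuse idea can be approximated today in C++ with expression templates: a '*' builds a lightweight proxy instead of multiplying, and a following '+' on that proxy lowers to a fused multiply-add. This is an illustrative sketch (f4 and MulExpr are made-up names; std::fma stands in for a real SIMD madd intrinsic):

```cpp
#include <cmath>

struct f4 { float v[4]; };

// Deferred a * b: no work happens yet, just capture the operands.
struct MulExpr { const f4& a; const f4& b; };

inline MulExpr operator*(const f4& a, const f4& b) { return {a, b}; }

// a * b + c  -->  one fused multiply-add per lane
inline f4 operator+(MulExpr m, const f4& c) {
    f4 r;
    for (int i = 0; i < 4; ++i)
        r.v[i] = std::fma(m.a.v[i], m.b.v[i], c.v[i]);
    return r;
}

// c + a * b  -->  the "+ *" compound, operands reordered
inline f4 operator+(const f4& c, MulExpr m) { return m + c; }
```

The library hook fires at the point the compound pattern completes, which is essentially what the hypothetical opCompound above would let you write directly.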

I agree that 'if' is kinda neat, but it's probably not out of the question for future extension. All the other stuff here is possible.
That said, it's not necessarily optimal either, just conveniently written.
The compiler would have to do some serious magic to optimise that; flattening both sides of the if into parallel expressions, and then applying the mask to combine...
I'm personally not in favour of SIMD constructs that are anything less than optimal (but I appreciate I'm probably in the minority here).
(The simple benchmarks of the paper show a 5-15% performance loss compared to handwritten SIMD code.)

The compiler would have to do some serious magic to optimise that; flattening both sides of the if into parallel expressions, and then applying the mask to combine...

I think it's a small amount of magic.
The simple features shown in that paper are fully focused on SIMD programming, so they aren't introducing things that are clearly inefficient.

I'm personally not in favour of SIMD constructs that are anything less than optimal (but I appreciate I'm probably in the minority here).
(The simple benchmarks of the paper show a 5-15% performance loss compared to handwritten SIMD code.)

Right, as I suspected.

15% is a very small performance loss if, for the programmer, the alternative is writing scalar code that is 2 or 3 times slower :-)
The SIMD programmers who can't stand a 1% loss of performance use the intrinsics manually (or write in asm) and ignore everything else.
A much larger population of system programmers wish to use modern CPUs efficiently, but they don't have the time (or the skill, which means their programs are too often buggy) for assembly-level programming. Currently they use smart numerical C++ libraries, use modern Fortran versions, and/or write C/C++ (or Fortran) scalar code, add "restrict" annotations, and take a look at the produced asm, hoping the modern compiler back-ends will vectorize it. This is not good enough, and it's far from a 15% loss.
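That status quo looks roughly like this: write a scalar loop, promise the compiler the pointers don't alias, and hope for vectorization. A minimal sketch, using the non-standard but widely supported __restrict extension (the saxpy name and signature are mine):

```cpp
#include <cstddef>

// Scalar code with a no-alias promise. With __restrict, -O2/-O3 back
// ends can usually turn this loop into packed SIMD mul/add; without
// the annotation they may have to assume x and y overlap and give up.
// __restrict is a GCC/Clang/MSVC extension, not standard C++.
void saxpy(float* __restrict y, const float* __restrict x,
           float a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Whether this actually vectorizes is up to the back end, which is exactly the complaint: the programmer gets no guarantee and has to inspect the generated asm.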
This paper shows a third way, making this kind of programming simpler and approachable for a wider audience, with a small performance loss compared to handwritten code. This is what language designers have been doing for 60+ years :-)
Bye,
bearophile


I don't disagree with you, it is fairly cool!
I can't imagine D adopting those sorts of language features any time soon, but it's probably possible.
I guess the keys are defining the bool vector concept, and some tech to flatten both sides of a vector if statement, but that's far from simple... particularly so if someone puts some unrelated code in those if blocks.
Chances are it offers too much freedom that wouldn't be well used or understood by the average programmer, and that still leaves you in a similar place where it's only particularly worthwhile in the hands of a fairly advanced/competent user.
The main error that most people make is thinking SIMD code is faster by
nature. Truth is, in the hands of someone who doesn't know precisely what
they're doing, SIMD code is almost always slower.
There are some cool new expressions offered here, fairly convenient
(although easy[er?] to write in other ways too), but I don't think it would
likely change that fundamental premise for the average programmer beyond
some very simple parallel constructs that the compiler can easily get right.
I'd certainly love to see it, but is it realistic that someone would take the time to do all of that any time soon when the benefits are controversial? It may even open the possibility for unskilled people to write far worse code.
Let's consider your example above for instance; I would rewrite (given existing syntax):
// vector length of context = 1; current_mask = T
int4 v = [0,3,4,1];
int4 w = 3;               // [3,3,3,3] via broadcast
uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
v += int4(1);             // [1,4,5,2]
// the if block is trivially rewritten:
int4 trueSide = v + int4(2);
int4 falseSide = v + int4(3);
v = select(m, trueSide, falseSide); // [3,7,8,4]
Or the whole thing further simplified:
int4 v = [0,3,4,1];
int4 w = 3; // [3,3,3,3] via broadcast
// one convenient function does the comparison and select accordingly
v = selectLess(v, w, v + int4(1 + 2), v + int4(1 + 3)); // combines the prior few lines
I actually find this more convenient. I also find the if syntax you demonstrate rather deceptive and possibly misleading: 'if' suggests a branch, whereas the construct you demonstrate will evaluate both sides every time. Inexperienced programmers may not really grasp that. Evaluating the true side and the false side inline, and then performing the select serially, is more honest; it's actually what the computer will do, and I don't really see it being particularly less convenient either.
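The data flow of that rewrite can be checked with a scalar emulation; int4, maskLess and select below are illustrative stand-ins for real SIMD types and intrinsics, but the evaluate-both-sides-then-blend pattern is the same:

```cpp
#include <cstdint>

struct int4  { int32_t  v[4]; };
struct uint4 { uint32_t v[4]; };

// lane-wise a < b ? all-ones : all-zeroes
inline uint4 maskLess(int4 a, int4 b) {
    uint4 m;
    for (int i = 0; i < 4; ++i)
        m.v[i] = a.v[i] < b.v[i] ? 0xFFFFFFFFu : 0u;
    return m;
}

inline int4 add(int4 a, int4 b) {
    int4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

// blend: pick the true lane where the mask is set, else the false lane
// (real hardware does this as (t & m) | (f & ~m) or a blend instruction)
inline int4 select(uint4 m, int4 t, int4 f) {
    int4 r;
    for (int i = 0; i < 4; ++i)
        r.v[i] = m.v[i] ? t.v[i] : f.v[i];
    return r;
}
```

Note that both trueSide and falseSide are fully computed before the select, which is exactly the point about 'if' being a misleading name for this construct.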


I think this is far more convenient than any crazy 'if' syntax :) .. It's also perfectly optimal on all architectures I know as well!

You should show more respect for them and their work. Their ideas seem very far from crazy. They have also proved their type system to be sound. This kind of work is light-years ahead of the usual sloppy designs you see in D features, where design holes are found only years later, when sometimes it's too late to fix them :-)
That if syntax (which is integrated in a type system that manages the masks, plus implicit polymorphism that allows the same function to be used in both a vectorized and a scalar context) works with larger amounts of code too, while you are just doing a differential assignment.
Bye,
bearophile


I think I said numerous times in my former email that it's really cool, and certainly very interesting.
I just can't imagine it appearing in D any time soon. We do have some ways to conveniently do lots of that stuff right now, and they make some improvement on other competing languages in the area.
I'd like to see more realistic case studies of their approach, where it significantly simplifies particular workloads.
That if syntax (that is integrated in a type system that manages the masks, plus implicit polymorphism that allows the same function to be used both in a vectorized or scalar context) works with larger amounts of code too, while you are just doing a differential assignment.

And that's likely where it all starts getting very complicated. If the branches start doing significant (and unbalanced) work, an unskilled programmer will have a lot of trouble understanding what sort of mess they may be making.
And as usual, x86 will be the most tolerant, so they may not even notice when profiling.
I've said before, it's very interesting, but it also sounds potentially very dangerous. It's probably also an awful lot of work, I'd wager... I doubt we'll see those expressions any time soon.


The part with respect for one and one's work applies right back at you.
Andrei


The interesting thing about SIMD code is that if you just read the data sheets for SIMD instructions and write some SIMD code based on them, you're going to get lousy results. I know this from experience (see the array op SIMD implementations in the D runtime library).
Making SIMD code that delivers performance turns out to be a highly quirky and subtle exercise, one that is resistant to formalization. Despite the availability of SIMD hardware, there is a terrible lack of quality information on the internet about how to do it right, from people who know what they're talking about.
Manu is on the daily front lines of doing competitive, real world SIMD
programming. He leads a team doing SIMD work. Hence, I am going to strongly
weight his opinions on any high level SIMD design constructs.
Interestingly, both of us have rejected the "auto-vectorization" approach
popular in C/C++ compilers, for very different reasons.

Making SIMD code that delivers performance turns out to be a highly quirky and subtle exercise, one that is resistant to formalization.

I have written some SIMD code, with mixed results, so I understand part of these problems, although my overall experience with such things is limited.
Despite those problems and failures, I think it's important to support computer scientists who try to invent languages that offer medium-level means to write this kind of code :-) Reading and studying CS papers is important.

Manu is on the daily front lines of doing competitive, real world SIMD programming. He leads a team doing SIMD work. Hence, I am going to strongly weight his opinions on any high level SIMD design constructs.

I respect both Manu and his work (and you, Walter, are at the top of my list of programming heroes).

Interestingly, both of us have rejected the "auto-vectorization" approach popular in C/C++ compilers, for very different reasons.

The authors of that paper have rejected it too; it doesn't give enough semantics to the compilers. They have explored a different solution.
Bye,
bearophile


HLSL, GLSL, Cg? :)

I was thinking about general-purpose programming languages, not domain-specific ones.
--
Paulo

It may be useful to have a way to define compound operators for other things (although you can already write expression templates), but this is an optimization that the compiler back end can do. If you compile this code:
float4 foo(float4 a, float4 b, float4 c) { return a * b + c; }
with gdc with flags -O2 -mfma, you get:
0000000000000000 <_D3tmp3fooFNhG4fNhG4fNhG4fZNhG4f>:
   0:   c4 e2 69 98 c1    vfmadd132ps xmm0,xmm2,xmm1
   5:   c3                ret


Right, I suspected GDC might do that, but it was just an example. You can
extend that to many more complicated scenarios.
What does it do on less mature architectures like MIPS, PPC, ARM?


I thought about that before, and it might be nice to have that level of control in the language, but ultimately, like jerro said, I think it would be better suited to the compiler's back-end optimization. Unfortunately I don't think more complex patterns, such as matrix multiplications, are found and optimized by GCC/LLVM... I could be wrong, but these are areas where my hand-tuned code always outperforms basic math code.
I think having that in the back end makes a lot of sense, because your code is easier to read and understand without sacrificing performance. Plus, it would be difficult to map a sequence as complex as matrix multiplication to a single compound operator.
That being said, I do think something similar would be useful in general:
struct Vector
{
    ...
    static float distance(Vector a, Vector b) {...}
    static float distanceSquared(Vector a, Vector b) {...}

    float opSequence(string funcs...)(Vector a, Vector b)
        if (funcs[0] == "Math.sqrt" && funcs[1] == "Vector.distance")
    {
        return distanceSquared(a, b);
    }
}

void main()
{
    auto a = Vector.random( ... );
    auto b = Vector.random( ... );
    // Below is turned into a 'distanceSquared()' call
    float dis = Math.sqrt(Vector.distance(a, b));
}
Since distance requires a 'Math.sqrt()', this pseudo-code could avoid the operation entirely by calling 'distanceSquared()', even if the programmer is a noob and doesn't know to do it explicitly.
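The saving that hook automates can also be applied by hand today: when distances are only being compared, compare the squares and skip the sqrt entirely. A small illustrative C++ sketch (the Vector, distanceSquared and closer names are mine):

```cpp
#include <cmath>

struct Vector { float x, y, z; };

// Squared Euclidean distance: no square root needed.
inline float distanceSquared(Vector a, Vector b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

inline float distance(Vector a, Vector b) {
    return std::sqrt(distanceSquared(a, b));
}

// "is a closer to p than b is?" -- squaring is monotonic for
// non-negative values, so comparing the squares gives the same
// answer without any sqrt.
inline bool closer(Vector p, Vector a, Vector b) {
    return distanceSquared(p, a) < distanceSquared(p, b);
}
```

This is the transformation an experienced programmer makes by hand; the opSequence idea above is about letting the library apply it even when the caller writes the naive form.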