>> Yes, the approach taken in this patch looks really good to me. There>> should be no code differences with your patch, but let's ask HJ for>> his opinion on intrinsics header changes.>> Hi Kirill,>> Can you take a look? Thanks.
Hi guys, I like changes in intrinsics header.
Thanks, K

Hi guys,
Could I ask several questions just to clarify the things up?
1) Does the root problem lay in the fact that even for scalar
additions we perform the addition on the whole vector and only then
drop the higher parts of the vector? I.e. to fix the test from the PR
we need to replace plus on vector mode with plus on scalar mode?
2) Is one of the main requirements having the same pattern for V4SF
and V2DF version?
3) I don't see vec_concat in patterns from your patches, is it
explicitly generated by some x86-expander?
Anyway, I really like the idea of having some uniformity in describing
patterns for scalar instructions, so thank you for the work!
On 6 December 2012 17:42, Kirill Yukhin <kirill.yukhin@gmail.com> wrote:
>>> Yes, the approach taken in this patch looks really good to me. There>>> should be no code differences with your patch, but let's ask HJ for>>> his opinion on intrinsics header changes.>>>> Hi Kirill,>>>> Can you take a look? Thanks.>> Hi guys, I like changes in intrinsics header.>> Thanks, K

On Fri, Dec 7, 2012 at 7:49 AM, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
> Hi guys,> Could I ask several questions just to clarify the things up?>> 1) Does the root problem lay in the fact that even for scalar> additions we perform the addition on the whole vector and only then> drop the higher parts of the vector? I.e. to fix the test from the PR> we need to replace plus on vector mode with plus on scalar mode?
Yes, existing pattern is used to implement intrinsics, and it is
modelled with vector operand 2. But we in fact emit scalar operation,
so we would like to model the pattern with a scalar operand 2. This
way, the same pattern can be used to emit intrinsics _and_ can be used
to optimize the code from the testcase at the same time. Also, please
note that alignment requirements for vector operand and scalar
operands are different.
> 2) Is one of the main requirements having the same pattern for V4SF> and V2DF version?
It is not required, but having macroized pattern avoids pattern
explosion, eases maintenance (it is easier to understand similar
functionality if it is described in some uniform way), and in some
cases, macroization opportunities force author to rethink the RTL
description, making the patterns more "universal".
Uros.

On Fri, 7 Dec 2012, Michael Zolotukhin wrote:
> 1) Does the root problem lay in the fact that even for scalar> additions we perform the addition on the whole vector and only then> drop the higher parts of the vector? I.e. to fix the test from the PR> we need to replace plus on vector mode with plus on scalar mode?
The root problem is that we model the subs[sd] instructions as taking a
128-bit second operand, when Intel's documentation says they take a
32/64-bit operand, which is an important difference for memory operands
(and constants). Writing a pattern that reconstructs the result from a
scalar operation also seems more natural than pretending we are doing a
parallel operation and dropping most of it (easier for recog and friends).
(note: I think the insn was written to support the intrinsic, which does
take a 128-bit argument, so it did a good job for that)
> 2) Is one of the main requirements having the same pattern for V4SF> and V2DF version?
Uros seems to think that would be best.
> 3) I don't see vec_concat in patterns from your patches, is it> explicitly generated by some x86-expander?
It is generated by ix86_expand_vector_set.
> Anyway, I really like the idea of having some uniformity in describing> patterns for scalar instructions, so thank you for the work!
For 2-element vectors, vec_concat does seem more natural than vec_merge.
If we chose vec_merge as the canonical representation, we should chose it
for setting an element in a vector (ix86_expand_vector_set) everywhere,
not just those scalarish operations.
So it would be good to have rth's opinion on this (svn blame seems to
indicate he is the one who chose to use vec_concat specifically for V2DF
instead of vec_merge).

Thanks for the explanation!
By the way, if we decide to have one pattern for V4SF instructions and
another for V2DF, we could try to use recently introduced define_subst
here. It won't reduce number of actual patterns (I mean number of
patterns after iterators and subst expanding), but it could help to
make sse.md more compact.
On 7 December 2012 12:49, Marc Glisse <marc.glisse@inria.fr> wrote:
> On Fri, 7 Dec 2012, Michael Zolotukhin wrote:>>> 1) Does the root problem lay in the fact that even for scalar>> additions we perform the addition on the whole vector and only then>> drop the higher parts of the vector? I.e. to fix the test from the PR>> we need to replace plus on vector mode with plus on scalar mode?>>> The root problem is that we model the subs[sd] instructions as taking a> 128-bit second operand, when Intel's documentation says they take a> 32/64-bit operand, which is an important difference for memory operands (and> constants). Writing a pattern that reconstructs the result from a scalar> operation also seems more natural than pretending we are doing a parallel> operation and dropping most of it (easier for recog and friends).>> (note: I think the insn was written to support the intrinsic, which does> take a 128-bit argument, so it did a good job for that)>>>> 2) Is one of the main requirements having the same pattern for V4SF>> and V2DF version?>>> Uros seems to think that would be best.>>>> 3) I don't see vec_concat in patterns from your patches, is it>> explicitly generated by some x86-expander?>>> It is generated by ix86_expand_vector_set.>>>> Anyway, I really like the idea of having some uniformity in describing>> patterns for scalar instructions, so thank you for the work!>>> For 2-element vectors, vec_concat does seem more natural than vec_merge. If> we chose vec_merge as the canonical representation, we should chose it for> setting an element in a vector (ix86_expand_vector_set) everywhere, not just> those scalarish operations.>> So it would be good to have rth's opinion on this (svn blame seems to> indicate he is the one who chose to use vec_concat specifically for V2DF> instead of vec_merge).>> --> Marc Glisse

On 2012-12-07 02:49, Marc Glisse wrote:
> The root problem is that we model the subs[sd] instructions as taking> a 128-bit second operand, when Intel's documentation says they take a> 32/64-bit operand, which is an important difference for memory> operands (and constants). Writing a pattern that reconstructs the> result from a scalar operation also seems more natural than> pretending we are doing a parallel operation and dropping most of it> (easier for recog and friends).
I agree this is a problem with the current representation.
> For 2-element vectors, vec_concat does seem more natural than> vec_merge. If we chose vec_merge as the canonical representation, we> should chose it for setting an element in a vector> (ix86_expand_vector_set) everywhere, not just those scalarish> operations.
I'd hate to enshrine vec_merge over vec_concat for the benefit of x86,
and to the detriment of e.g. mips. There are plenty of embedded simd
implementations that are V2xx only.
If we simply pull the various x86 patterns into one common form, set
and extract included, does that buy us most of what we'd get for
playing games in combine?
As for your xmmintrin.h changes, I'd like to see a test case that verifies
that _mm_add_ss(a, b) does not add extra insns to extract __B[0].
> +(define_insn "<sse>_vm<plusminus_insn><mode>3<vec_merge_or_concat>"> [(set (match_operand:VF_128 0 "register_operand" "=x,x")> (vec_merge:VF_128> - (plusminus:VF_128> - (match_operand:VF_128 1 "register_operand" "0,x")> - (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))> + (vec_duplicate:VF_128> + (plusminus:<ssescalarmode>> + (vec_select:<ssescalarmode>> + (match_operand:VF_128 1 "register_operand" "0,x")> + (parallel [(const_int 0)]))> + (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm")))> (match_dup 1)> (const_int 1)))]> "TARGET_SSE"> "@> <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2}> v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}"> [(set_attr "isa" "noavx,avx")> (set_attr "type" "sseadd")> (set_attr "prefix" "orig,vex")> (set_attr "mode" "<ssescalarmode>")])
Did this really trigger as a substitution? It's not supposed to have, since
you didn't add (set_attr "replace_vec_merge_with_vec_concat" "yes")...
r~

On 2012-12-07 08:47, Jakub Jelinek wrote:
> That was the older proposal, the current way to trigger it is using> the substitution attr somewhere, typically in pattern name> - <vec_merge_or_concat> in the above case.
Ah right.
("My mind is going, Dave. I can feel it.")
r~

On Fri, 7 Dec 2012, Richard Henderson wrote:
> On 2012-12-07 02:49, Marc Glisse wrote:>> For 2-element vectors, vec_concat does seem more natural than>> vec_merge. If we chose vec_merge as the canonical representation, we>> should chose it for setting an element in a vector>> (ix86_expand_vector_set) everywhere, not just those scalarish>> operations.>> I'd hate to enshrine vec_merge over vec_concat for the benefit of x86,> and to the detriment of e.g. mips. There are plenty of embedded simd> implementations that are V2xx only.>> If we simply pull the various x86 patterns into one common form, set> and extract included, does that buy us most of what we'd get for> playing games in combine?
I'm sorry, could you be more precise? I don't see clearly what you are
suggesting.
> As for your xmmintrin.h changes, I'd like to see a test case that verifies> that _mm_add_ss(a, b) does not add extra insns to extract __B[0].
Yes, good idea, thanks.

On 2012-12-07 09:00, Marc Glisse wrote:
> On Fri, 7 Dec 2012, Richard Henderson wrote:> >> On 2012-12-07 02:49, Marc Glisse wrote:>>> For 2-element vectors, vec_concat does seem more natural than>>> vec_merge. If we chose vec_merge as the canonical representation, we>>> should chose it for setting an element in a vector>>> (ix86_expand_vector_set) everywhere, not just those scalarish>>> operations.>>>> I'd hate to enshrine vec_merge over vec_concat for the benefit of x86,>> and to the detriment of e.g. mips. There are plenty of embedded simd>> implementations that are V2xx only.>>>> If we simply pull the various x86 patterns into one common form, set>> and extract included, does that buy us most of what we'd get for>> playing games in combine?> > I'm sorry, could you be more precise? I don't see clearly what you are suggesting.
Don't change combine?
Have I lost the plot somewhere here?
r~

On Fri, 7 Dec 2012, Richard Henderson wrote:
> On 2012-12-07 09:00, Marc Glisse wrote:>> On Fri, 7 Dec 2012, Richard Henderson wrote:>>>>> On 2012-12-07 02:49, Marc Glisse wrote:>>>> For 2-element vectors, vec_concat does seem more natural than>>>> vec_merge. If we chose vec_merge as the canonical representation, we>>>> should chose it for setting an element in a vector>>>> (ix86_expand_vector_set) everywhere, not just those scalarish>>>> operations.>>>>>> I'd hate to enshrine vec_merge over vec_concat for the benefit of x86,>>> and to the detriment of e.g. mips. There are plenty of embedded simd>>> implementations that are V2xx only.>>>>>> If we simply pull the various x86 patterns into one common form, set>>> and extract included, does that buy us most of what we'd get for>>> playing games in combine?>>>> I'm sorry, could you be more precise? I don't see clearly what you are suggesting.>> Don't change combine?
but change ix86_expand_vector_set and others to generate vec_merge and
have only the vec_merge define_insn in sse.md? I guess it would buy a
large part of it. That's a pretty invasive change, I'll have to try...

On 2012-12-07 09:12, Marc Glisse wrote:
> but change ix86_expand_vector_set and others to generate vec_merge> and have only the vec_merge define_insn in sse.md? I guess it would> buy a large part of it. That's a pretty invasive change, I'll have to> try...
Is it really that invasive? Anyway, it's something worth trying for 4.9...
r~

On Fri, 7 Dec 2012, Richard Henderson wrote:
> On 2012-12-07 09:12, Marc Glisse wrote:>> but change ix86_expand_vector_set and others to generate vec_merge>> and have only the vec_merge define_insn in sse.md? I guess it would>> buy a large part of it. That's a pretty invasive change, I'll have to>> try...>> Is it really that invasive?
No, changing only V2DF, I seem to have the basic pieces in place changing
just 6 patterns in sse.md and a couple functions in i386.c. Now I need to
test it and see how much it affects the generated code...
> Anyway, it's something worth trying for 4.9...
Should I take it that if it ends up looking manageable, you prefer
changing all the V2DF operations to vec_merge, rather than just adapting
the addsd pattern? Won't it complicate things for generic optimizations if
different platforms have different canonical patterns for the same
operation on V2DF? It seems to me that it would be good if we agreed on a
pattern common to all platforms so we could canonicalize as in the last
part of:
http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00243.html
(or possibly the reverse transformation depending on the pattern we settle
on).
(those 2 patches didn't touch the generic simplify-rtx code either, if I
remember just the "don't touch combine" part of your suggestion ;-)
http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00028.html
http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00492.html )