Comments

On Thu, Jun 28, 2012 at 09:17:55AM +0200, Jakub Jelinek wrote:
> I'll look at using MULT_HIGHPART_EXPR in the pattern recognizer and
> vectorizing it as either of the sequences next.
And here is corresponding pattern recognizer and vectorizer patch.
Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
Unfortunately the addition of the builtin_mul_widen_* hooks on i?86 seems
to pessimize the generated code for gcc.dg/vect/pr51581-3.c
testcase (at least with -O3 -mavx) compared to when the hooks aren't
present, because i?86 has more natural support for widen mult lo/hi
compared to widen mult even/odd, but I assume that on powerpc it is the
other way around. So, when both VEC_WIDEN_MULT_*_EXPR and
builtin_mul_widen_* are possible for a particular vectype, how should I
find out which one will be cheaper?
2012-06-28 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/51581
* tree-vect-stmts.c (permute_vec_elements): Add forward decl.
(vectorizable_operation): Handle vectorization of MULT_HIGHPART_EXPR
also using VEC_WIDEN_MULT_*_EXPR or builtin_mul_widen_* plus
VEC_PERM_EXPR if vector MULT_HIGHPART_EXPR isn't supported.
* tree-vect-patterns.c (vect_recog_divmod_pattern): Use
MULT_HIGHPART_EXPR instead of VEC_WIDEN_MULT_*_EXPR and shifts.
* gcc.dg/vect/pr51581-4.c: New test.
Jakub

On 2012-06-28 07:05, Jakub Jelinek wrote:
> Unfortunately the addition of the builtin_mul_widen_* hooks on i?86 seems
> to pessimize the generated code for gcc.dg/vect/pr51581-3.c
> testcase (at least with -O3 -mavx) compared to when the hooks aren't
> present, because i?86 has more natural support for widen mult lo/hi
> compared to widen mult even/odd, but I assume that on powerpc it is the
> other way around. So, how should I find out if both VEC_WIDEN_MULT_*_EXPR
> and builtin_mul_widen_* are possible for the particular vectype which one
> will be cheaper?
I would assume that if the builtin exists, then it is cheaper.
I disagree about "x86 has more natural support for hi/lo". The basic sse2 multiplication is even. One shift per input is needed to generate odd. On the other hand, one interleave per input is required for both hi/lo. So 4 setup insns for hi/lo, and 2 setup insns for even/odd. And on top of all that, XOP includes multiply odd at least for signed V4SI.
I'll have a look at the test case you mention while I re-look at the patches...
r~

On Thu, Jun 28, 2012 at 8:57 AM, Richard Henderson <rth@redhat.com> wrote:
> [snip]
The upper 128-bit of 256-bit AVX instructions aren't a good fit with the
current vectorizer infrastructure.

On 2012-06-28 09:20, Jakub Jelinek wrote:
> Perhaps the problem is then that the permutation is much more expensive
> for even/odd. With even/odd the f2 routine is:
...
> vpshufb %xmm2, %xmm5, %xmm5
> vpshufb %xmm1, %xmm4, %xmm4
> vpor %xmm4, %xmm5, %xmm4
...
> and with lo/hi it is:
> vshufps $221, %xmm2, %xmm3, %xmm2
Hmm. That second sequence incurs a reformatting delay, since vshufps is a
floating-point shuffle applied to integer data.
Last week when I pulled the mulv4si3 routine out to i386.c,
I experimented with a few different options, including that
interleave+shufps sequence seen here for lo/hi. See the
comment there discussing options and timing.
This also shows a deficiency in our vec_perm logic:

  0L 0H 2L 2H    1L 1H 3L 3H
  0H 2H 0H 2H    1H 3H 1H 3H    2*pshufd
  0H 1H 2H 3H                   punpckldq

i.e. the high parts can be gathered without loading the permutation
constants from memory.
r~

On Thu, Jun 28, 2012 at 6:44 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> [snip]
> The upper 128-bit of 256-bit AVX instructions aren't a good fit with the
> current vectorizer infrastructure.
Indeed - the lack of cross-sub-128bit-word operations makes some
vectorizations much more expensive. Initially we added the patterns for
vectorizing the hi/lo and interleave stuff because we didn't want
regressions for vectorizing with 256-bit vectors vs. 128-bit vectors in
the vectorizer testsuite. But now that we have support for vectorizing
with both sizes we could consider no longer advertising those
instructions that don't really exist for 256-bit vectors, or at least
properly model their cost.
Richard.