Comments

Hi,
This patch serves as a proposal to turn on software prefetching by default at -O3.
Software prefetching pass has been introduced into gcc for a long time. It has been
observed that the prefetch pass is quite stable now and can noticeably improve
program performance where locality is a concern. As a result, we think it is time
to enable it under -O3 so that we can get the benefit out of it.
We have collected the data that shows the impact of prefetch on cpu2006 performance
under -O3 with gcc 4.6. There is a ~2% improvement on both integer and floating programs
and there is no apparent degradation. Programs listed below are those that have at least
(+/-)1% variation:
434.zeusmp (+1.77%), 436.cactusADM (+2.4%), 450.soplex (+1.28%),
459.GemsFDTD (+5.48%), 470.lbm (+31.84%), 482.sphnix3 (+1.01%),
458.sjeng (-1.27%), 462.libquantum (+19.23%).
The patch passed bootstrapping with -O3 -g. We have observed 2 failues in regression tests:
(1) gcc.dg/torture/stackalign/setjmp-1.c -O3 -fomit-frame-pointer (internal compiler error)
This is bug 44503. It is a CFG problem associate with _builtin_prefetch and we have had
a fix for it.
(2) gcc.dg/tree-ssa/loop-19.c scan-tree-dump-times optimized "MEM.(base: &|symbol: )a," 2
With prefetching turned on, the memory count is different. We can easily fix the test case after
this patch is accepted.
We realize that prefetch in gcc still has room for improvement. But we think it is good enough
now to be turned on.
Is it acceptable to commit this patch?
Thanks,
Changpeng

On Tue, Jun 15, 2010 at 2:34 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>> Hi,>> This patch serves as a proposal to turn on software prefetching by default at -O3.>> Software prefetching pass has been introduced into gcc for a long time. It has been> observed that the prefetch pass is quite stable now and can noticeably improve> program performance where locality is a concern. As a result, we think it is time> to enable it under -O3 so that we can get the benefit out of it.>> We have collected the data that shows the impact of prefetch on cpu2006 performance> under -O3 with gcc 4.6. There is a ~2% improvement on both integer and floating programs> and there is no apparent degradation. Programs listed below are those that have at least> (+/-)1% variation:> 434.zeusmp (+1.77%), 436.cactusADM (+2.4%), 450.soplex (+1.28%),> 459.GemsFDTD (+5.48%), 470.lbm (+31.84%), 482.sphnix3 (+1.01%),> 458.sjeng (-1.27%), 462.libquantum (+19.23%).>> The patch passed bootstrapping with -O3 -g. We have observed 2 failues in regression tests:> (1) gcc.dg/torture/stackalign/setjmp-1.c -O3 -fomit-frame-pointer (internal compiler error)> This is bug 44503. It is a CFG problem associate with _builtin_prefetch and we have had> a fix for it.> (2) gcc.dg/tree-ssa/loop-19.c scan-tree-dump-times optimized "MEM.(base: &|symbol: )a," 2> With prefetching turned on, the memory count is different. We can easily fix the test case after> this patch is accepted.>> We realize that prefetch in gcc still has room for improvement. But we think it is good enough> now to be turned on.>> Is it acceptable to commit this patch?>
Have you measured performance impact on Intel Core i7?

>Have you measured performance impact on Intel Core i7?
Noy yet. I should have mentioned that the performance data I posted in based
on the test on amd-linux64 system and gcc 4.6 r160766. (which includes my
last change to fix tonto regression).
Thanks,
Changpeng

On Tue, Jun 15, 2010 at 3:10 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>>Have you measured performance impact on Intel Core i7?>> Noy yet. I should have mentioned that the performance data I posted in based> on the test on amd-linux64 system and gcc 4.6 r160766. (which includes my> last change to fix tonto regression).>
Is that rate or speed? I would like to see numbers on Intel Core i7 if you are
proposing enable it by default for x86.
Thanks.

>On Tue, Jun 15, 2010 at 3:10 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:>>>Have you measured performance impact on Intel Core i7?>>>> Noy yet. I should have mentioned that the performance data I posted in based>> on the test on amd-linux64 system and gcc 4.6 r160766. (which includes my>> last change to fix tonto regression).>>
>Is that rate or speed? I would like to see numbers on Intel Core i7 if you are>proposing enable it by default for x86.
Speed run. Hopefully prefetching could benefit all targets that support sw prefetch.
Thanks,
Changpeng

Fang, Changpeng wrote:
>> Is that rate or speed? I would like to see numbers on Intel Core i7 if you are>> proposing enable it by default for x86.
HJ, are you in position to test this on Core i7? Since you work at
Intel, it seems like it might be easier for you to do it than for the
folks at AMD to do it?
Maxim, is this something that you could easily test on some of our
embedded hardware, assuming that the machine descriptions define the
relevant prefetch instructions? For example, I think that ARM, MIPS,
and Power all have theses instructions in the ISA, and we could look at
CoreMark numbers to at least convince ourselves that this does no harm.
Thanks,

On Tue, Jun 15, 2010 at 3:47 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> Fang, Changpeng wrote:>>>> Is that rate or speed? I would like to see numbers on Intel Core i7 if you are>>> proposing enable it by default for x86.>> HJ, are you in position to test this on Core i7? Since you work at> Intel, it seems like it might be easier for you to do it than for the> folks at AMD to do it?
We never tried software prefetch since Intel processors rarely need
it. We will try it. It will take some times.

H.J. Lu wrote:
> We never tried software prefetch since Intel processors rarely need> it. We will try it. It will take some times.
OK.
The other thing we can certainly do, once we get some input about how
well it does on various architectures, is to enable it conditionally.
There is no reason that -O3 has to enable the same things on all
architectures. If prefetching is a win on AMD CPUs, then -O3 should
enable it on AMD CPUs -- but if it hurts Intel CPUs, or MIPS CPUs, or
whatever, then it shouldn't be enabled on those architectures.
So, let's see how generalizable Changpeng's numbers are to other
architectures, and then we can decide where the setting goes. I think
there's no question from the numbers that -O3 should enable this
optimization on AMD CPUs.
Thanks,

Am Mittwoch 16 Juni 2010, 01:03:20 schrieb Mark Mitchell:
> H.J. Lu wrote:> > > We never tried software prefetch since Intel processors rarely need> > it. We will try it. It will take some times.> > OK.> > The other thing we can certainly do, once we get some input about how> well it does on various architectures, is to enable it conditionally.> There is no reason that -O3 has to enable the same things on all> architectures. If prefetching is a win on AMD CPUs, then -O3 should> enable it on AMD CPUs -- but if it hurts Intel CPUs, or MIPS CPUs, or> whatever, then it shouldn't be enabled on those architectures.> > So, let's see how generalizable Changpeng's numbers are to other> architectures, and then we can decide where the setting goes. I think> there's no question from the numbers that -O3 should enable this> optimization on AMD CPUs.
We have also seen an overall win (around +1% for int and float) with
-fprefetch-loop-arrays on s390 and activated it on O3 (see config/s390/s390.c
override_options) for 4.6.
So we (s390) are fine with enabling loop prefetching in general for O3.
Changpeng, can you confirm that -O3 -fno-prefetch-loop-arrays does indeed
deactivate prefetching?
Christian

Hi,
> This patch serves as a proposal to turn on software prefetching by default at -O3.
On S/390 I ran into a problem with the runtime of the prefetching pass when using
aggressive loop unrolling. There is an algorithm with quadratic runtime regarding the
memory references which causes a very long compile time. Christian already opened a
bugzilla for that (PR 44576). It probably would make sense to have a closer look at this
before enabling the pass by default. Perhaps we could simply limit the miss rate
computation to a certain amount of memory references?
Besides of that enabling the prefetching pass by default is ok with me. I think it would
help getting more people into improving it.
Bye,
-Andreas-

Christian Borntraeger wrote:
> It also might be worth to investigate if overriding the parameters per> -mtune=XXX results in an overall win for -fprefetch-loop-arrays. We did> that on s390 since the default values were not ideal
Yes, that might be a good idea for i7.
But, in the meantime, I think we should get a version of the patch that
turns on prefetching on AMD CPUs with -O3. There's no reason to demand
consistency for all CPUs and it clearly benefits the AMD CPUs.
Changpeng, would you please submit a patch that activates this
optimization only with tuning for AMD CPUs?
Thanks,

Mark Mitchell <mark@codesourcery.com> writes:
>> Yes, that might be a good idea for i7.>> But, in the meantime, I think we should get a version of the patch that> turns on prefetching on AMD CPUs with -O3. There's no reason to demand> consistency for all CPUs and it clearly benefits the AMD CPUs.> Changpeng, would you please submit a patch that activates this> optimization only with tuning for AMD CPUs?
On i7 it will be likely still a win if hardware prefetch is disabled
in the BIOS. Most BIOS have a setting like this.
Unfortunately this cannot be easily detected by a program.
-Andi

Hi, Mark:
>> It also might be worth to investigate if overriding the parameters per>> -mtune=XXX results in an overall win for -fprefetch-loop-arrays. We did>> that on s390 since the default values were not ideal
>Yes, that might be a good idea for i7.
>But, in the meantime, I think we should get a version of the patch that>turns on prefetching on AMD CPUs with -O3. There's no reason to demand>consistency for all CPUs and it clearly benefits the AMD CPUs.>Changpeng, would you please submit a patch that activates this>optimization only with tuning for AMD CPUs?
It seems I have a problem in setting prefetching default at -O3 only for AMD CPUs.
When we know it is AMD CPUS after the command line options have been parsed,
we don't know whether -fno-prefetch-loop-arrays has been specified in the command line
or not. If we turn on prefetching at -O3, then we could not explicitly turn it off if we want.
What I want is something like:
if (!OPTION_SET_P (flag_prefetch_loop_arrays))
flag_prefetch_loop_arrays = 1;
An alternative way is parsing -m options before other options.
What do you think?
Thanks,
Changpeng

Am Mittwoch 23 Juni 2010, 00:05:57 schrieb Fang, Changpeng:
> Hi, Mark:> > >> It also might be worth to investigate if overriding the parameters per> >> -mtune=XXX results in an overall win for -fprefetch-loop-arrays. We did> >> that on s390 since the default values were not ideal> > >Yes, that might be a good idea for i7.> > >But, in the meantime, I think we should get a version of the patch that> >turns on prefetching on AMD CPUs with -O3. There's no reason to demand> >consistency for all CPUs and it clearly benefits the AMD CPUs.> >Changpeng, would you please submit a patch that activates this> >optimization only with tuning for AMD CPUs?> > It seems I have a problem in setting prefetching default at -O3 only for AMD CPUs.> When we know it is AMD CPUS after the command line options have been parsed,> we don't know whether -fno-prefetch-loop-arrays has been specified in the command line> or not. If we turn on prefetching at -O3, then we could not explicitly turn it off if we want.> > What I want is something like:> if (!OPTION_SET_P (flag_prefetch_loop_arrays))> flag_prefetch_loop_arrays = 1;
I think having an OPTION_SET_P (or maybe name that FLAG_SET_P) makes a lot of sense and
would match the PARAM_SET_P way of doing things.
Christian

On Wed, 23 Jun 2010, Christian Borntraeger wrote:
> Am Mittwoch 23 Juni 2010, 00:05:57 schrieb Fang, Changpeng:> > Hi, Mark:> > > > >> It also might be worth to investigate if overriding the parameters per> > >> -mtune=XXX results in an overall win for -fprefetch-loop-arrays. We did> > >> that on s390 since the default values were not ideal> > > > >Yes, that might be a good idea for i7.> > > > >But, in the meantime, I think we should get a version of the patch that> > >turns on prefetching on AMD CPUs with -O3. There's no reason to demand> > >consistency for all CPUs and it clearly benefits the AMD CPUs.> > >Changpeng, would you please submit a patch that activates this> > >optimization only with tuning for AMD CPUs?> > > > It seems I have a problem in setting prefetching default at -O3 only for AMD CPUs.> > When we know it is AMD CPUS after the command line options have been parsed,> > we don't know whether -fno-prefetch-loop-arrays has been specified in the command line> > or not. If we turn on prefetching at -O3, then we could not explicitly turn it off if we want.> > > > What I want is something like:> > if (!OPTION_SET_P (flag_prefetch_loop_arrays))> > flag_prefetch_loop_arrays = 1;> > I think having an OPTION_SET_P (or maybe name that FLAG_SET_P) makes a lot of sense and> would match the PARAM_SET_P way of doing things.
We use tri-states for this elsewhere in the compiler.
Richard.

On Wed, 23 Jun 2010, Richard Guenther wrote:
> > > What I want is something like:> > > if (!OPTION_SET_P (flag_prefetch_loop_arrays))> > > flag_prefetch_loop_arrays = 1;> > > > I think having an OPTION_SET_P (or maybe name that FLAG_SET_P) makes a lot of sense and> > would match the PARAM_SET_P way of doing things.> > We use tri-states for this elsewhere in the compiler.
We use a mixture of tri-states and separate *_set variables. When my
patch series for option handling and multilib selection reaches the point
of an options structure replacing separate variables for each option, I
expect to use a separate such structure to record which options have been
set explicitly. That is:
if (!global_options_explicit.opt.prefetch_loop_arrays)
global_options.opt.prefetch_loop_arrays = 1;
(but hopefully using pointers to the relevant structures rather than
hardcoding global_options and global_options_explicit).

>> fprefetch-loop-arrays> ! Common Report Var(flag_prefetch_loop_arrays) Init(2) Optimization
For consistency with other options, I would suggest to use Init(-1)
for an "uninitialized" option. As Joseph said, this should be done
differently in the future, but for now using -1 is the usual setting.
I didn't understand how (or if) Joseph will implement the framework to
set default values. As I understand it, Init(X) should only set the
value X to the Var if it is not set by anything else before. Then, we
can move the default values back to the *.opt files. Otherwise, such a
patch would be an improvement.
Cheers,
Manuel.