Myles Watson <mylesgw at gmail.com> writes:
> I was having trouble with stack corruption. Using memset (C) instead of
> clear_memory(asm) speeds it up by almost a factor of 2 for a 1M region.
> TSC difference with clear_memory 0xFA884D
> TSC difference with memset 0x826742
That's odd. I just recently sent a patch to the list ("ulzma delay")
that did pretty much the opposite, as I was seeing really bad
performance for the C memset function on my Opteron (Istanbul) boxes.
memset would take minutes to do what ran in a handful of ms using "rep
stosb", by all accounts because of instruction cache thrashing.
I see clear_memory was using "stosl", but apart from that it looks
very similar to the variant I ended up with to improve performance.
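For reference, here is a minimal sketch of what a "rep stosb" based memset could look like. This is not the patch itself: the name memset_rep_stosb and the portable byte-loop fallback are made up for illustration, and it assumes a GCC-compatible compiler and that the direction flag is clear on entry (which the x86 ABIs guarantee for C code).

```c
#include <stddef.h>

/* Hypothetical name; illustrative only, not the code from the patch. */
void *memset_rep_stosb(void *dst, int c, size_t n)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	void *d = dst;

	/*
	 * rep stosb stores AL to [(E/R)DI] and advances the pointer,
	 * repeating (E/R)CX times. The whole fill is a single
	 * instruction, so the instruction footprint is tiny.
	 * Assumes DF is clear on entry, as the ABI requires.
	 */
	__asm__ volatile("rep stosb"
			 : "+D"(d), "+c"(n)
			 : "a"(c)
			 : "memory");
#else
	/* Fallback for other architectures: plain byte loop. */
	unsigned char *p = dst;

	while (n--)
		*p++ = (unsigned char)c;
#endif
	return dst;
}
```

A "stosl" variant would store four bytes per iteration instead of one, at the cost of having to deal with lengths that are not a multiple of four.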
Could you see if you experience stack corruption with the "rep stosb"
patch I posted for memset as well? I'd like to see that go in, but of
course it's a problem if it results in a performance degradation on
other platforms. Perhaps we could enable it only for the platforms where
instruction footprint or fetch cost is known to be an issue, i.e. fam10?
--
Arne.