If you haven't removed the -O1 option applied to QBVH, then the bug is not there because optimizations are disabled, not because it has been fixed in gcc.

Jeanphi

No, I commented out the whole line, so QBVH was compiled with the default options (-O3 ...). I specifically checked the compile command, set cmake to be verbose, etc. Furthermore, with the -flto flags the real optimization is performed at the final link step (you must pass -flto -Oxx when linking).
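To make the LTO point concrete, here is a minimal sketch that only builds and prints the command lines; it does not run anything. The file names, the output name and the -O3 level are placeholders I made up for illustration, not the real LuxRender build.

```python
# Sketch only: file names and the -O3 level are placeholders,
# not taken from the actual LuxRender cmake files.
def lto_commands(sources, output, opt="-O3"):
    """Return (compile_cmds, link_cmd) as argv lists; nothing is executed."""
    objects = [src.rsplit(".", 1)[0] + ".o" for src in sources]
    compile_cmds = [
        ["g++", opt, "-flto", "-c", src, "-o", obj]
        for src, obj in zip(sources, objects)
    ]
    # The link step repeats -flto and the optimization level, because with
    # LTO the real optimization is deferred to this final step.
    link_cmd = ["g++", opt, "-flto", *objects, "-o", output]
    return compile_cmds, link_cmd


compile_cmds, link_cmd = lto_commands(["qbvhaccel.cpp", "main.cpp"], "luxconsole")
for cmd in compile_cmds + [link_cmd]:
    print(" ".join(cmd))
```

If the link line lacks -flto or the -O level, the objects may still link, but you should not expect the cross-module optimization to take place.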

Listing all compilers (and especially unofficial gcc variants) will not be easy, and finding old gcc versions will be tricky. I guess that setting and supporting a minimal version would be easier (I guess the oldest still alive is on Mac with "Apple" gcc 4.2.x). Does the bug also affect Windows? Setting -O1 there optimizes for size, which is "roughly" equivalent to -Os, a higher optimization level than -O1 in the gcc world.

The reason I reopened the case was that profiling showed a lot of time is spent inside that file. I thought it might be the best place to gather additional information to try and find the specific bug in the gcc bug tracker. After all, such a big problem must have been noticed/tracked/fixed.

Then we'd know exactly which GCC versions are broken and which are not.

foobarbarian wrote: The reason I reopened the case was that profiling showed a lot of time is spent inside that file.

This is pretty much expected: it is the "ray tracing" part of the code, and you can expect a ray tracer to spend most of its time there. The truth is that LuxRender should spend even more time there: other parts of the code are quite expensive compared to the cost of tracing rays. It is also true that this is a general trend seen in recent years in just about every "ray tracer" (i.e. sampling, shading and the image pipeline are more expensive than in the past).

Note: QBVH is mostly hand-written SSE code; it is nearly as if it were written in assembler, so the compiler optimization level doesn't change performance much.

Dade wrote: Note: QBVH is mostly hand-written SSE code; it is nearly as if it were written in assembler, so the compiler optimization level doesn't change performance much.

This is from memory, as my eval copy of the Intel Suite has run out... but I try to get my facts straight. Give the eval a try if you haven't done so yet: it provides an amazing amount of information while still maintaining a somewhat coherent view. Shame it's not free for GPL work on Windows.

While most of QBVHAccel is hand coded, the Intel compiler is still able to add a sizable boost to QBVHAccel::Intersect over MSVC10 Pro builds: roughly 20% for the whole luxmark benchmark (CPU only), which does not contain any computationally expensive material.

It does so mainly via more clever instruction reordering to keep the pipelines filled. I don't think it did any auto vectorization on its own there.

Still, there is an issue with branch mispredicts in that part of the code: the execution units were far from 100% utilized.

Also, while the code in QBVHAccel::Intersect from luxrays does look uglier, it should offer better performance, as the code will most likely end up with fewer branches.

On a side note, I've now read multiple times that loop unrolling will hurt performance on Sandy-Bridge gen cpus but so far I haven't had a chance to test that myself.

If we talk about a 7 day render for a frame or multiple frames, adding 2% performance means >3 hours saved.
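For anyone who wants to check the arithmetic, a quick sketch. It assumes either that the render takes 2% less time, or that throughput goes up by 2%; both readings of "2% performance" are my interpretation, and both land above 3 hours.

```python
# Worked arithmetic for a 7 day render with a 2% gain.
render_hours = 7 * 24   # 168 hours
gain = 0.02

# Reading 1: the render takes 2% less time.
saved_if_time_shrinks = render_hours * gain

# Reading 2: throughput is 2% higher, so the time is divided by 1.02.
saved_if_throughput_grows = render_hours - render_hours / (1 + gain)

print(f"2% less time:       {saved_if_time_shrinks:.2f} h saved")
print(f"2% more throughput: {saved_if_throughput_grows:.2f} h saved")
```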

J the Ninja wrote: Btw, we use Clang on OS X for the core libs, Apple GCC is only used for the apps themselves, and even that is only because Qt doesn't play nice with Clang.

Just out of curiosity, and as I don't have a Mac near me at home (only at work) to test this myself: did the use of Clang give a measurable performance boost? I've heard that Clang's optimization is not on the same level as recent gcc. BTW, the Mac lags behind with 4.2... And while rebuilding gcc is easy, even on a Mac, it takes a lot of time to get 4.6.1 running on that platform, and getting framework support is a real nightmare when using pure GNU tools.

It did, something like 10% actually, IIRC. Flipping on link-time optimization pushes this up another 3-5% or so, but for some reason the resulting build only works on newer Macs (first gen C2D-era chips can't run it). Might be a bug in Clang slipping SSE4 instructions in or something (no idea really, that explanation just makes some amount of sense).

Dade wrote: Note: QBVH is mostly hand-written SSE code; it is nearly as if it were written in assembler, so the compiler optimization level doesn't change performance much.

I agree, but this is not incompatible with getting a boost from a higher optimization level, since I guess the 2% gain is mostly due to the automatic inlining that -O3 turns on. The code itself is possibly not faster, but the code "flow" may be. And 2% + 2% + 2% + 2% begins to add up to a lot of %.
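As a rough illustration of how such small gains stack up, here is a tiny sketch. It assumes the four 2% gains are independent and multiply; that is an assumption for the sake of the example, not something that was measured.

```python
# Four hypothetical, independent 2% speedups compounded together.
gains = [0.02, 0.02, 0.02, 0.02]

total = 1.0
for g in gains:
    total *= 1.0 + g

print(f"combined speedup: {(total - 1.0) * 100:.2f}%")
```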

I think a profile-guided optimization pass should be even better, and it should be easy to do, as the render process is quite simple: input data -> render engine -> result. The x264 encoder uses this kind of build procedure for its releases and it works well.
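A minimal sketch of what such a profile-guided build could look like with gcc, written as a function that only returns the three command lines (instrumented build, training run, optimized rebuild). The source file names and the scene file are placeholders I invented, and a real LuxRender build would go through cmake rather than a single g++ call.

```python
# Sketch only: sources, binary name and training scene are placeholders.
def pgo_steps(sources, binary, training_args, opt="-O3"):
    """Return the three PGO steps as argv lists; nothing is executed."""
    # 1. Build an instrumented binary that records profile data when run.
    instrumented = ["g++", opt, "-fprofile-generate", *sources, "-o", binary]
    # 2. Run it on representative input so the hot paths get exercised.
    training_run = ["./" + binary, *training_args]
    # 3. Rebuild, letting gcc use the recorded profile to guide optimization.
    optimized = ["g++", opt, "-fprofile-use", *sources, "-o", binary]
    return [instrumented, training_run, optimized]


steps = pgo_steps(["qbvhaccel.cpp", "main.cpp"], "luxconsole", ["scene.lxs"])
for step in steps:
    print(" ".join(step))
```

The quality of the result depends on step 2: if the training scenes do not cover the real hot code paths, the profile will steer the compiler in the wrong direction.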

I will try to make a prototype once the cmake files have been stabilized and/or if I have some time this week. The first step would be to gather a collection of test scenes that covers 80-90% of the real hot code paths, to get accurate results. Since I'm very new to the LuxRender world (and to 3D modeling too), do you have any clue about where to find that and which render settings I should use?

Then I upgraded the toolchain to gcc-4.6.1 and ld-2.21.1 (glibc and cloog-ppl were left intact). I recompiled the Lux dependencies with the new toolchain and these flags (the same flags that were used previously, except that -fPIC was removed):

Each luxconsole binary was in fact slightly smaller with LTO (12-13MB vs 15.5MB); however, performance was also lower: in the range of 84.5-88.0 kS/s (luxtime) and 44.5-46.0 (schoolcorridor) with different sets of flags. Really close, but clearly no improvement over gcc-4.5.2 without LTO.

daidai67, would you be so kind as to show the actual list of flags you're passing for Lux compilation? Or please point out where I failed so miserably.

P.S. I also tried LTO with the dependencies as well, but the final results were almost the same.