Parallella Community

Current status

Re: Current status

Posted: Mon Jul 03, 2017 4:27 pm

by upcFrost

Spent quite a lot of time trying to make the Load/Store optimization pass work as it should. A bunch of bugs might still pop up.

Not exactly sure how to proceed with volatile memory references - on one hand, the compiler should not touch those. On the other hand, merging two loads or stores does not change the logic flow itself (it should be preserved in any case, volatile or not). Maybe I'll just make an additional flag.

Re: Current status

Posted: Wed Jul 05, 2017 3:39 pm

by jar

upcFrost wrote:Not exactly sure how to proceed with volatile memory references - on one hand, the compiler should not touch those. On the other hand, merging two loads or stores does not change the logic flow itself (it should be preserved in any case, volatile or not). Maybe I'll just make an additional flag

Re: Current status

Every access (both read and write) made through an lvalue expression of volatile-qualified type is considered an observable side effect for the purpose of optimization and is evaluated strictly according to the rules of the abstract machine (that is, all writes are completed at some time before the next sequence point). This means that within a single thread of execution, a volatile access cannot be optimized out or reordered relative to another visible side effect that is separated by a sequence point from the volatile access.

4) There is a sequence point after the evaluation of a full expression (an expression that is not a subexpression: typically something that ends with a semicolon or a controlling statement of if/switch/while/do) and before the next full expression.

An example where combining sequenced writes to "volatile" memory would cause trouble is I/O. The sequence of writes to I/O registers is often significant. Converting two 32-bit writes into a single 64-bit write could break the sequencing, depending on the "endianness" of the architecture.

Re: Current status

That question makes more sense, and according to what Gregg posted, it shouldn't be done. I can't think of real code that this might break, but there is probably an edge case somewhere. If you do this, make it an optional flag.

However, if it's NOT a pointer to volatile data, I would appreciate the compiler optimizing it to a single double-word store (if 'p' is 8-byte aligned). This would improve unrolled code where two consecutive results or two inputs can be stored or read in a single instruction.

Re: Current status

Posted: Thu Jul 06, 2017 5:22 pm

by GreggChandler

jar wrote:it shouldn't be done according to what Gregg posted

@jar, are you suggesting that the implementor ignore the specs?

The referenced document clearly articulates a relationship to "sequence points", and as I read the quoted pages, statements/expressions separated by semicolons define sequence points. Optimizing across a "sequence point" does not appear to be per the spec when volatile data is involved. Did I really misinterpret or misread that?

The purpose of specifications is to avoid statements such as

jar wrote: I can't think of ...

Re: Current status

Posted: Thu Jul 06, 2017 9:09 pm

by jar

GreggChandler wrote:

jar wrote:it shouldn't be done according to what Gregg posted

@jar, are you suggesting that the implementor ignore the specs?

I think you misunderstood me. I meant:

jar wrote:[combining two stores to volatile memory into one double word operation] shouldn't be done

I also meant if upcFrost does implement this behavior that he should make it an optional flag so that it doesn't break code.

I feel like you're trying to misrepresent what I said...

jar wrote:there is probably an edge case somewhere

I meant that the spec should be followed because there will be code that breaks, even though I can't think of a specific real-world example.

I believe we're in agreement, Gregg.

Re: Current status

Posted: Tue Jul 11, 2017 3:50 pm

by upcFrost

Callee-saved regs now use paired loads and stores. Optimized the scheduler a bit. In general, matmul-16 currently gives 158 ms against 130 ms on GCC, purely because of suboptimal scheduling. Will work on it tomorrow. Also found a couple of bugs in the Load/Store optimization pass; fixed all except one.

Gregg, about volatile access - that's exactly as @jar said. It is possible to implement (basically by skipping one "if-then" case), but it might break code, and in some cases it definitely will. Still, in some cases it can improve performance quite a bit. I'm planning to add an additional flag to allow it, but it will be set to false by default. So the default behavior is not to touch volatile ldr/str at all, just as the spec requires.

Re: Current status

Posted: Mon Sep 11, 2017 8:53 pm

by upcFrost

Took a long vacation to switch jobs and move to a different country.

On the compiler side - working on vectorization, or strd/ldrd to be precise, plus i64/f64 support. Atm most tests build fine, and arithmetic actually runs a bit faster than on GCC (not by much, as I'm still using precompiled libs from the GCC bundle). I want to fix the matmul-16 test, which fails on memory mapping (sections overlap), before moving on.

Another big question is CI and distribution. Travis has a build time limit of 1 hour, which is not enough to build the full LLVM stack, and a ccache size of 512 MB, which is not enough to use ccache with an LLVM build. Semaphore has a disk quota, which is also too small to fit the build. Will try to look for a solution, or maybe I'll just mail Semaphore or Travis and ask them to raise the quota for my build (iirc they're ok doing it for FOSS projects).