[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

# RFC: Atomic LL/SC loops in LLVM revisited

## Summary

This proposal gives a brief overview of the challenges of lowering to LL/SC
loops and details the approach I am taking for RISC-V. Beyond getting feedback
on that work, my intention is to find consensus on moving other backends
towards a similar approach and sharing common code where feasible. Scroll down
to 'Questions' for a summary of the issues I think need feedback and
agreement.

I'd like to thank James Knight, JF Bastien, and Eli Friedman for being so
generous with their review feedback on this atomics work so far.

## Background: Atomics in LLVM

See the documentation for full details <https://llvm.org/docs/Atomics.html>.
In short: LLVM defines memory ordering constraints to match the C11/C++11
memory model (unordered, monotonic, acquire, release, acqrel, seqcst).
These can be given as parameters to the atomic operations supported in LLVM
IR:

* Fences with the fence instruction
* Atomic load and store with the 'load atomic' and 'store atomic' variants of
the load/store instructions..
* Fetch-and-binop / read-modify-write operations through the atomicrmw
instruction.
* Compare and exchange via the cmpxchg instruction. Takes memory ordering for
both success and failure cases. Can also specify a 'weak' vs 'strong' cmpxchg,
where the weak variant allows spurious failure

For those not familiar, RISC-V is a modular ISA, with standard extensions
indicated by single letters. Baseline 'RV32I' or 'RV64I' instruction sets
don't support atomic operations beyond fences. However the RV32A and RV64A
instruction set extensions introduce AMOs (Atomic Memory Operations) and LR/SC
(load-linked/store-conditional on other architectures). 32-bit atomic
operations are supported natively on RV32, and both 32 and 64-bit atomic
operations support natively on RV64.

AMOs such as 'amoadd.w' implement simple fetch-and-dobinop behaviour. For
LR/SC: LR loads a word and registers a reservation on source memory address.
SC writes the given word to the memory address and writes success (zero) or
failure (non-zero) into the destination register. LR/SC can be used to
implement compare-and-exchange or to implement AMOs that don't have a native
instruction. To do so, you would typically perform LR and SC in a loop.
However, there are strict limits on the instructions that can be placed
between a LR and an SC while still guaranteeing forward progress:

"""
The static code for the LR/SC sequence plus the code to retry the sequence in
case of failure must comprise at most 16 integer instructions placed
sequentially in memory. For the sequence to be guaranteed to eventually
succeed, the dynamic code executed between the LR and SC instructions can only
contain other instructions from the base “I” subset, excluding loads, stores,
backward jumps or taken backward branches, FENCE, FENCE.I, and SYSTEM
instructions. The code to retry a failing LR/SC sequence can contain backward
jumps and/or branches to repeat the LR/SC sequence, but otherwise has the same
constraints.
"""

The native AMOs and LR/SC allow ordering constraints to be specified in the
instruction. This isn't possible for load/store instructions, so fences must
be inserted to represent the ordering constraints. 8 and 16-bit atomic
load/store are therefore supported using 8 and 16-bit load/store plus
appropriate fences.

See Table A.6 on page 187 in the linked specification for a mapping from C/C++
atomic constructs to RISC-V instructions.

## Background: Lowering atomics in LLVM

The AtomicExpandPass can help you support atomics for your taraget in a number
of ways. e.g. inserting fences around atomic loads/stores, or converting an
atomicrmw/cmpxchg to a LL/SC loop. It operates as an IR-level pass, meaning
the latter ability is problematic - there is no way to guarantee that the
invariants for the LL/SC loop required by the target architecture will be
maintained. This shows up most frequently when register spills are introduced
at O0, but spills could theoretically still occur at higher optimisation
levels and there are other potential sources of issues: inappropriate
instruction selection, machine block placement, machine outlining (though see
D47654 and D47655), and likely more.

* ARM and AArch64 expand to LL/SC loops in IR using AtomicExpandPass for O1
and above but use a custom post-regalloc expansion for O0
* MIPS doesn't use AtomicExpandPass, but selects atomic pseudoinstructions
which it expands to LL/SC loops in EmitInstrWithCustomInserter. This still has
the problems described above, so MIPS is in the process of moving towards a
two-stage lowering, with the LL/SC loop lowered after register allocation. See
D31287 <https://reviews.llvm.org/D31287>.
* Hexagon unconditionally expands to LL/SC loops in IR using AtomicExpandPass.

Lowering a word-size atomic operations to an LL/SC loop is typically trivial,
requiring little surrounding code. Part-word atomics require additional
shifting and masking as a word-size access is used. It would be beneficial if
the code to generate this shifting and masking could be shared between
targets, and if the operations that don't need to be in the LL/SC loop are
exposed for LLVM optimisation.

The path forwards is very clearly to treat the LL/SC loop as an indivisible
operation which is expanded as late as possible (and certainly after register
allocation). However, there are a few ways of achieving this.

If atomic operations of a given size aren't supported, then calls should be
created to the helper functions in libatomic, and this should be done
consistently for all atomic operations of that size. I actually found GCC is
buggy in that respect <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86005>.

## Proposed lowering strategy (RISC-V)

Basic principles:
* The LL/SC loop should be treated as a black box, and expanded post-RA.
* Don't introduce intrinsics that simply duplicate existing IR instructions
* If code can be safely expanded in the IR, do it there. [I'd welcome feedback
on this one - should I be taking a closer look at expansion in SelectionDAG
legalisation?]

The above can be achieved by extending AtomicExpandPass to support a 'Custom'
expansion method, which uses a TargetLowering call to expand to custom IR,
including an appropriate intrinsic representing the LL+SC loop.

Atomic operations are lowered in the following ways:

* Atomic load/store: Allow AtomicExpandPass to insert appropriate fences
* Word-sized AMO supported by a native instruction: Leave the IR unchanged and
use the normal instruction selection mechanism
* Word-sized AMO without a native instruction: Select a pseudo-instruction
using the normal instruction selection mechanism. This pseudo-instruction will
be expanded after register allocation.
* Part-word AMO without a native instruction: Shifting and masking that occurs
outside of the LL/SC loop is expanded in the IR, and a call to a
target-specific intrinsic to implement the LL/SC loop is inserted (e.g.
llvm.riscv.masked.atomicrmw.add.i32). The intrinsic is matched to a
pseudo-instruction which is expanded after register allocation.
* Part-word AMO without a native instruction that can be implemented by a
native word-sized AMO: 8 and 16-bit atomicrmw {and,or,xor} can be implemented
by 32-bit amoand, amoor, amoxor. Perform this conversion as an IR
transformation.
* Word-sized compared-and-exchange: Lower to a pseudo-instruction using the
normal instruction selection mechanism. This pseudo-instruction will be
expanded after register allocation.
* Part-word compare-and-exchange: Handled similarly to part-word AMOs, calling
llvm.riscv.masked.cmpxchg.i32.

Scratch registers for these pseudo-instructions are modelled as in ARM and
AArch64, by specifying multiple outputs and specifying an @earlyclobber
constraint to ensure the register allocator assigns unique registers. e.g.:

Note that there are additional complications with cmpxchg such as supporting
weak cmpxchg (which requires returning a success value), or supporting
different failure orderings. It looks like the differentiation between
strong/weak cmpxchg doesn't survive the translation to SelectionDAG right now.
Supporting only strong cmpxchg and using the success ordering for the failure
case is conservative but correct I believe.

In the RISC-V case, the LL/SC loop pseudo-instructions are lowered at the
latest possible moment. The RISCVExpandPseudoInsts pass is registered with
addPreEmitPass2.

The main aspect I'm unhappy with in this approach is the need to introduce new
intrinsics. Ideally these would be documented as not for use by frontends and
subject to future removal or alteration - is there precedent for this?
Alternatively, see the suggestion below to introduce target-independent masked
AMO intrinsics.

## Alternative options

1. Don't expand anything in IR, and lower to a single monolithic
pseudo-instruction that is expanded at the last minute.
2. Don't expand anything in IR, and lower to pseudo-instructions in stages.
Lower to a monolithic pseudo-instruction where any logic outside of the LL/SC
loop is expanded in EmitInstrWithCustomInserter but the LL/SC loop is
represented by a new pseudoinstruction. This final pseudoinstruction is then
expanded after register allocation. This minimises the possibility for sharing
logic between backends, but does mean we don't need to expose new intrinsics.
Mips adopts this approach in D31287.
3. Target-independent SelectionDAG expansion code converts unsupported atomic
operations. e.g. rather than converting `atomicrmw add i8` to AtomicLoadAdd,
expand to nodes that align the address and calculate the mask as well as an
AtomicLoadAddMasked node. I haven't looked at this in great detail.
4. Introducing masked atomic operations to the IR. Mentioned for completeness,
I don't think anyone wants this.
5. Introduce target-independent intrinsics for masked atomic operations. This
seems worthy of consideration.

For 1. and 2. the possibility for sharing logic between backends is minimised
and the address calculation, masking and shifting logic is mostly hidden from
optimisations (though option 2. allows e.g. MachineCSE). There is the
advantage of avoiding the need for new intrinsics.

## Patches up for review

I have patches up for review which implement the described strategy. More
could be done to increase the potential for code reuse across targets, but I
thought it would be worth getting feedback on the path forwards first.

* How do you feel about the described lowering strategy? Am I unfairly
overlooking a SelectionDAG approach?
* How much enthusiasm is there for moving ARM, AArch64, Mips, Hexagon, and
other architectures to use such an approach?
* If there is enthusiasm, how worthwhile is it to share logic for generation
of masks+shifts needed for part-word atomics?
* I'd like to see ARM+AArch64+Hexagon move away from the problematic
expansion in IR and to have that code deleted from AtomicExpandPass. Are
there any objections?
* What are your thoughts on the introduction of new target-independent
intrinsics for masked atomics?

Re: [llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

One detail I hadn't noticed in my previous investigation is that the "compare_exchange_weak" (or our IR "cmpxchg weak") operation is theoretically unsound -- and NECESSARILY so.

As mentioned in this thread, there are architectural guarantees for forward progress of an LLSC loop adhering to certain constraints. If your loop does not meet those constraints, it may well livelock vs another CPU executing the same loop, or even spuriously fail deterministically every time it's executed on its own. This thread is all about a proposal to ensure that LLVM emits LLSC loops such that they're guaranteed to meet the architectural guarantees and avoid the possibility of those bad situations.

There is not to my knowledge any architecture which makes any guarantees that an LLSC standing alone, without a containing loop, will not spuriously fail. It may in fact fail every time it's executed. Of course, the typical usage of compare_exchange_weak is to embed it within a loop containing other user-code, but given that it contains arbitrary user code, there's simply no way we can guarantee that this larger-loop meets the LLSC-loop construction requirements. Therefore, it's entirely possible for some particular compare_exchange_weak-based loop to livelock or to deterministically spuriously fail. That seems poor, and exactly the kind of thing that we'd very much like to avoid...

So, now I wonder -- is there actually a measurable performance impact (on, say, ARM) if we were to just always emit the "strong" cmpxchg loop? I'm feeling rather inclined to say we should not implement the weak variant at all. (And, if others agree with my diagnosis here, also propose to deprecate it from the C/C++ languages...)

Re: [llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

> Great writeup, thanks!
>
> I'll respond to the rest in a bit, but I have a digression first...
>
> One detail I hadn't noticed in my previous investigation is that the
> "compare_exchange_weak" (or our IR "cmpxchg weak") operation is
> theoretically unsound -- and NECESSARILY so.
>
> As mentioned in this thread, there are architectural guarantees for forward
> progress of an LLSC loop adhering to certain constraints. If your loop does
> not meet those constraints, it may well livelock vs another CPU executing
> the same loop, or even spuriously fail deterministically every time it's
> executed on its own. This thread is all about a proposal to ensure that LLVM
> emits LLSC loops such that they're guaranteed to meet the architectural
> guarantees and avoid the possibility of those bad situations.
>
> There is not to my knowledge any architecture which makes any guarantees
> that an LLSC standing alone, without a containing loop, will not spuriously
> fail. It may in fact fail every time it's executed. Of course, the typical
> usage of compare_exchange_weak is to embed it within a loop containing other
> user-code, but given that it contains arbitrary user code, there's simply no
> way we can guarantee that this larger-loop meets the LLSC-loop construction
> requirements. Therefore, it's entirely possible for some particular
> compare_exchange_weak-based loop to livelock or to deterministically
> spuriously fail. That seems poor, and exactly the kind of thing that we'd
> very much like to avoid...

I hadn't noticed that either - excellent point! The RISC-V requirement
for forward progress encompass the LR, the SC, the instructions
between it, and the code to retry the LR/SC. A weak cmpxchg by
definition separates the LR/SC from the code to retry it and so
suffers from the same sort problems that we see when exposing LL and
SC individually as IR intrinsics. I agree with your diagnosis as far
as RISC-V is concerned at least.

Re: [llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

On 13 June 2018 at 23:37, James Y Knight via llvm-dev
<[hidden email]> wrote:
> So, now I wonder -- is there actually a measurable performance impact (on,
> say, ARM) if we were to just always emit the "strong" cmpxchg loop? I'm
> feeling rather inclined to say we should not implement the weak variant at
> all. (And, if others agree with my diagnosis here, also propose to deprecate
> it from the C/C++ languages...)

I'm not sure deprecation is justified. A "Sufficiently Smart Compiler"
could make use of the knowledge that a weak cmpxchg was considered
sufficient by the programmer, and use the weak cmpxchg form if the
surrounding loop can be shown to meet the forward progress guarantees
of the architecture. In LLVM, this could be done by analyzing the
surrounding basic block when expanding the cmpxchg pseudo after
regalloc. It's unlikely it would be worth the hassle, but it does seem
a viable approach.

Re: [llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

> * I'd like to see ARM+AArch64+Hexagon move away from the problematic
> expansion in IR and to have that code deleted from AtomicExpandPass. Are
> there any objections?

I think it would be a great shame and I'd like to avoid it if at all
possible, though I'm also uncomfortable with the current situation.

The problem with late expansion is that even the simplest variants add
some rather unpleasant pseudo-instructions, and I've never even seen
anyone attempt to get good CodeGen out of it (simplest example being
"add xD, xD, #1" in the loop for an increment but there are obviously
more). Doing so would almost certainly involve duplicating a lot of
basic arithmetic instructions into AtomicRMW variants.

Added to that is the fact that actual CPU implementations are often a
lot less restrictive about what can go into a loop than is required
(for example even spilling is fine on ours as long as the atomic
object is not in the same cache-line; I suspect that's normal). That
casts the issue in a distinctly theoretical light -- we've been doing
this for years and as far as I'm aware nobody has ever hit the issue
in the real world, or even had CodeGen go wrong in a way that *could*
do so outside the -O0 situation.

OTOH that *is* an argument for performance over correctness when you
get right down to it, so I'm not sure I can be too forceful about it.
At least not without a counter-proposal to restore guaranteed
correctness.

Re: [llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

On 14 June 2018 at 10:28, Tim Northover via llvm-dev
<[hidden email]> wrote:
>> * I'd like to see ARM+AArch64+Hexagon move away from the problematic
>> expansion in IR and to have that code deleted from AtomicExpandPass. Are
>> there any objections?
>
> I think it would be a great shame and I'd like to avoid it if at all
> possible, though I'm also uncomfortable with the current situation.

Thanks for the reply Tim. It's definitely fair to highlight that ARM
and AArch64 has few documented restrictions on code within an LL/SC
loop when compared to e.g. RISC-V or MIPS. I'm not sure about Hexagon.

> The problem with late expansion is that even the simplest variants add
> some rather unpleasant pseudo-instructions, and I've never even seen
> anyone attempt to get good CodeGen out of it (simplest example being
> "add xD, xD, #1" in the loop for an increment but there are obviously
> more). Doing so would almost certainly involve duplicating a lot of
> basic arithmetic instructions into AtomicRMW variants.

Let's separate the concerns here:
1) Quality of codegen within the LL/SC loop
* What is the specific concern here? The LL/SC loop contains a very
small number of instructions, even for the masked atomicrmw case. Are
you worried about an extra arithmetic instruction or two? Sub-optimal
control-flow? Something else?
2) Number of new pseudo-instructions which must be introduced
* You'd need new pseudos for each word-sized atomicrmw which expands
to an ll/sc loop, and an additional one for the masked form of the
operation. You could reduce the number of pseudos by taking the
AtomicRMWInst::BinOp as a parameter. The code to map the atomic op to
the appropriate instruction is tedious but straight-forward.

> Added to that is the fact that actual CPU implementations are often a
> lot less restrictive about what can go into a loop than is required
> (for example even spilling is fine on ours as long as the atomic
> object is not in the same cache-line; I suspect that's normal). That
> casts the issue in a distinctly theoretical light -- we've been doing
> this for years and as far as I'm aware nobody has ever hit the issue
> in the real world, or even had CodeGen go wrong in a way that *could*
> do so outside the -O0 situation.

That's true as long as the "Exclusives Reservation Granule" == 1
cache-line and you don't deterministically cause the reservation to be
cleared in some other way: e.g. repeatable geenrating a conflict miss
or triggering a trap etc. I don't think any Arm cores ship with
direct-mapped caches so I'll admit this is unlikely.

"""
Forward progress can only be made using LoadExcl/StoreExcl loops if,
for any LoadExcl/StoreExcl loop within a single thread of execution if
both of the following are true:
• There are no explicit memory accesses, pre-loads, direct or indirect
register writes, cache maintenance instructions, SVC instructions, or
exception returns between the Load-Exclusive and the Store-Exclusive.
• The following conditions apply between the Store-Exclusive having
returned a fail result and the retry of the Load-Exclusive:
– There are no stores to any location within the same Exclusives
reservation granule that the StoreExclusive is accessing.
– There are no direct or indirect register writes, other than
changes to the flag fields in APSR or FPSCR, caused by data processing
or comparison instructions.
– There are no direct or indirect cache maintenance instructions,
SVC instructions, or exception returns
"""

> OTOH that *is* an argument for performance over correctness when you
> get right down to it, so I'm not sure I can be too forceful about it.
> At least not without a counter-proposal to restore guaranteed
> correctness.

I suppose a machine-level pass could at least scan for any intervening
loads/stores in an LL/SC loop and check some other invariants, then
warn/fail if they occur. As you point out, this would be conservative
for many Arm implementations.

Re: [llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

> On 14 June 2018 at 10:28, Tim Northover via llvm-dev
> <[hidden email]> wrote:
>> The problem with late expansion is that even the simplest variants add
>> some rather unpleasant pseudo-instructions, and I've never even seen
>> anyone attempt to get good CodeGen out of it (simplest example being
>> "add xD, xD, #1" in the loop for an increment but there are obviously
>> more). Doing so would almost certainly involve duplicating a lot of
>> basic arithmetic instructions into AtomicRMW variants.
>
> Let's separate the concerns here:
> 1) Quality of codegen within the LL/SC loop
> * What is the specific concern here? The LL/SC loop contains a very
> small number of instructions, even for the masked atomicrmw case. Are
> you worried about an extra arithmetic instruction or two? Sub-optimal
> control-flow? Something else?

Oh I see what you're saying, it's the fact that by bypassing
instruction selection we're missing cases where an ADDI could be
selected rather than an ADD, which would potentially free up a
register and save the instruction generated for materialising the
constant. I don't like to see the compiler generate code that's
obviously dumber than what a human would write, but in this case do we
really think there would be any sort of measurable impact on
performance?

Re: [llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

On Thu, 14 Jun 2018 at 13:45, Alex Bradbury <[hidden email]> wrote:
> Oh I see what you're saying, it's the fact that by bypassing
> instruction selection we're missing cases where an ADDI could be
> selected rather than an ADD, which would potentially free up a
> register and save the instruction generated for materialising the
> constant.

Yes.

> I don't like to see the compiler generate code that's
> obviously dumber than what a human would write, but in this case do we
> really think there would be any sort of measurable impact on
> performance?

It's certainly going to be marginal, but then so is the benefit of
late expansion.

There's also the barrier that this genuinely is a place where people
are willing to hand-code assembly, and go rooting around in the
compiler's output to check we're doing what they expect even if they
don't use assembly. The assembly is possibly mostly for historical
reasons, but it's still out there and we want to convince people to
move away from it.

I think I'm going to try to implement that verifier pass you mentioned
and run it across an iOS build.

Re: [llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

Hexagon has a very few restrictions on what will cause loss of a
reservation: those are stores to the same address (a 64-bit granule) or
any sort of exception/interrupt/system call. Other than that the
reservation should stay. The architecture doesn't explicitly guarantee
that though, but in the absence of the elements listed above, a program
with LL/SC can be expected to make progress.

Consequently, the best way for us to handle LL/SC would be to expand
them early and let them be optimized as any other code. The usual
optimization restrictions should be sufficient to prevent introduction
of factors causing a loss of reservation.

With the constraints on LL/SC varying wildly between architectures,
maybe we should have several options available for different targets?

-Krzysztof

On 6/13/2018 10:42 AM, Alex Bradbury wrote:

> # RFC: Atomic LL/SC loops in LLVM revisited
>
> ## Summary
>
> This proposal gives a brief overview of the challenges of lowering to LL/SC
> loops and details the approach I am taking for RISC-V. Beyond getting feedback
> on that work, my intention is to find consensus on moving other backends
> towards a similar approach and sharing common code where feasible. Scroll down
> to 'Questions' for a summary of the issues I think need feedback and
> agreement.
>
> For the original discussion of LL/SC lowering, please refer to James
> Knight's 2016 thread on the topic:
> http://lists.llvm.org/pipermail/llvm-dev/2016-May/099490.html>
> I'd like to thank James Knight, JF Bastien, and Eli Friedman for being so
> generous with their review feedback on this atomics work so far.
>
> ## Background: Atomics in LLVM
>
> See the documentation for full details <https://llvm.org/docs/Atomics.html>.
> In short: LLVM defines memory ordering constraints to match the C11/C++11
> memory model (unordered, monotonic, acquire, release, acqrel, seqcst).
> These can be given as parameters to the atomic operations supported in LLVM
> IR:
>
> * Fences with the fence instruction
> * Atomic load and store with the 'load atomic' and 'store atomic' variants of
> the load/store instructions..
> * Fetch-and-binop / read-modify-write operations through the atomicrmw
> instruction.
> * Compare and exchange via the cmpxchg instruction. Takes memory ordering for
> both success and failure cases. Can also specify a 'weak' vs 'strong' cmpxchg,
> where the weak variant allows spurious failure
>
> ## Background: Atomics in RISC-V
>
> For full details see a recent draft of the ISA manual
> <https://github.com/riscv/riscv-isa-manual/releases/download/draft-20180612-548fd40/riscv-spec.pdf>,
> which incorporates work from the Memory Consistency Model Task Group to define
> the memory model. RISC-V implements a weak memory model.
>
> For those not familiar, RISC-V is a modular ISA, with standard extensions
> indicated by single letters. Baseline 'RV32I' or 'RV64I' instruction sets
> don't support atomic operations beyond fences. However the RV32A and RV64A
> instruction set extensions introduce AMOs (Atomic Memory Operations) and LR/SC
> (load-linked/store-conditional on other architectures). 32-bit atomic
> operations are supported natively on RV32, and both 32 and 64-bit atomic
> operations support natively on RV64.
>
> AMOs such as 'amoadd.w' implement simple fetch-and-dobinop behaviour. For
> LR/SC: LR loads a word and registers a reservation on source memory address.
> SC writes the given word to the memory address and writes success (zero) or
> failure (non-zero) into the destination register. LR/SC can be used to
> implement compare-and-exchange or to implement AMOs that don't have a native
> instruction. To do so, you would typically perform LR and SC in a loop.
> However, there are strict limits on the instructions that can be placed
> between a LR and an SC while still guaranteeing forward progress:
>
> """
> The static code for the LR/SC sequence plus the code to retry the sequence in
> case of failure must comprise at most 16 integer instructions placed
> sequentially in memory. For the sequence to be guaranteed to eventually
> succeed, the dynamic code executed between the LR and SC instructions can only
> contain other instructions from the base “I” subset, excluding loads, stores,
> backward jumps or taken backward branches, FENCE, FENCE.I, and SYSTEM
> instructions. The code to retry a failing LR/SC sequence can contain backward
> jumps and/or branches to repeat the LR/SC sequence, but otherwise has the same
> constraints.
> """
>
> The native AMOs and LR/SC allow ordering constraints to be specified in the
> instruction. This isn't possible for load/store instructions, so fences must
> be inserted to represent the ordering constraints. 8 and 16-bit atomic
> load/store are therefore supported using 8 and 16-bit load/store plus
> appropriate fences.
>
> See Table A.6 on page 187 in the linked specification for a mapping from C/C++
> atomic constructs to RISC-V instructions.
>
> ## Background: Lowering atomics in LLVM
>
> The AtomicExpandPass can help you support atomics for your taraget in a number
> of ways. e.g. inserting fences around atomic loads/stores, or converting an
> atomicrmw/cmpxchg to a LL/SC loop. It operates as an IR-level pass, meaning
> the latter ability is problematic - there is no way to guarantee that the
> invariants for the LL/SC loop required by the target architecture will be
> maintained. This shows up most frequently when register spills are introduced
> at O0, but spills could theoretically still occur at higher optimisation
> levels and there are other potential sources of issues: inappropriate
> instruction selection, machine block placement, machine outlining (though see
> D47654 and D47655), and likely more.
>
> I highly encourage you to read James Knight's previous post on this topic
> which goes in to much more detail about the issues handling LL/SC
> <http://lists.llvm.org/pipermail/llvm-dev/2016-May/099490.html>. The situation
> remains pretty much the same:
>
> * ARM and AArch64 expand to LL/SC loops in IR using AtomicExpandPass for O1
> and above but use a custom post-regalloc expansion for O0
> * MIPS doesn't use AtomicExpandPass, but selects atomic pseudoinstructions
> which it expands to LL/SC loops in EmitInstrWithCustomInserter. This still has
> the problems described above, so MIPS is in the process of moving towards a
> two-stage lowering, with the LL/SC loop lowered after register allocation. See
> D31287 <https://reviews.llvm.org/D31287>.
> * Hexagon unconditionally expands to LL/SC loops in IR using AtomicExpandPass.
>
> Lowering a word-size atomic operations to an LL/SC loop is typically trivial,
> requiring little surrounding code. Part-word atomics require additional
> shifting and masking as a word-size access is used. It would be beneficial if
> the code to generate this shifting and masking could be shared between
> targets, and if the operations that don't need to be in the LL/SC loop are
> exposed for LLVM optimisation.
>
> The path forwards is very clearly to treat the LL/SC loop as an indivisible
> operation which is expanded as late as possible (and certainly after register
> allocation). However, there are a few ways of achieving this.
>
> If atomic operations of a given size aren't supported, then calls should be
> created to the helper functions in libatomic, and this should be done
> consistently for all atomic operations of that size. I actually found GCC is
> buggy in that respect <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86005>.
>
> ## Proposed lowering strategy (RISC-V)
>
> Basic principles:
> * The LL/SC loop should be treated as a black box, and expanded post-RA.
> * Don't introduce intrinsics that simply duplicate existing IR instructions
> * If code can be safely expanded in the IR, do it there. [I'd welcome feedback
> on this one - should I be taking a closer look at expansion in SelectionDAG
> legalisation?]
>
> The above can be achieved by extending AtomicExpandPass to support a 'Custom'
> expansion method, which uses a TargetLowering call to expand to custom IR,
> including an appropriate intrinsic representing the LL+SC loop.
>
> Atomic operations are lowered in the following ways:
>
> * Atomic load/store: Allow AtomicExpandPass to insert appropriate fences
> * Word-sized AMO supported by a native instruction: Leave the IR unchanged and
> use the normal instruction selection mechanism
> * Word-sized AMO without a native instruction: Select a pseudo-instruction
> using the normal instruction selection mechanism. This pseudo-instruction will
> be expanded after register allocation.
> * Part-word AMO without a native instruction: Shifting and masking that occurs
> outside of the LL/SC loop is expanded in the IR, and a call to a
> target-specific intrinsic to implement the LL/SC loop is inserted (e.g.
> llvm.riscv.masked.atomicrmw.add.i32). The intrinsic is matched to a
> pseudo-instruction which is expanded after register allocation.
> * Part-word AMO without a native instruction that can be implemented by a
> native word-sized AMO: 8 and 16-bit atomicrmw {and,or,xor} can be implemented
> by 32-bit amoand, amoor, amoxor. Perform this conversion as an IR
> transformation.
> * Word-sized compared-and-exchange: Lower to a pseudo-instruction using the
> normal instruction selection mechanism. This pseudo-instruction will be
> expanded after register allocation.
> * Part-word compare-and-exchange: Handled similarly to part-word AMOs, calling
> llvm.riscv.masked.cmpxchg.i32.
>
> Scratch registers for these pseudo-instructions are modelled as in ARM and
> AArch64, by specifying multiple outputs and specifying an @earlyclobber
> constraint to ensure the register allocator assigns unique registers. e.g.:
>
> class PseudoCmpXchg
> : Pseudo<(outs GPR:$res, GPR:$scratch),
> (ins GPR:$addr, GPR:$cmpval, GPR:$newval, i32imm:$ordering), []> {
> let Constraints = "@earlyclobber $res,@earlyclobber $scratch";
> let mayLoad = 1;
> let mayStore = 1;
> let hasSideEffects = 0;
> }
>
> Note that there are additional complications with cmpxchg such as supporting
> weak cmpxchg (which requires returning a success value), or supporting
> different failure orderings. It looks like the differentiation between
> strong/weak cmpxchg doesn't survive the translation to SelectionDAG right now.
> Supporting only strong cmpxchg and using the success ordering for the failure
> case is conservative but correct I believe.
>
> In the RISC-V case, the LL/SC loop pseudo-instructions are lowered at the
> latest possible moment. The RISCVExpandPseudoInsts pass is registered with
> addPreEmitPass2.
>
> The main aspect I'm unhappy with in this approach is the need to introduce new
> intrinsics. Ideally these would be documented as not for use by frontends and
> subject to future removal or alteration - is there precedent for this?
> Alternatively, see the suggestion below to introduce target-independent masked
> AMO intrinsics.
>
> ## Alternative options
>
> 1. Don't expand anything in IR, and lower to a single monolithic
> pseudo-instruction that is expanded at the last minute.
> 2. Don't expand anything in IR, and lower to pseudo-instructions in stages.
> Lower to a monolithic pseudo-instruction where any logic outside of the LL/SC
> loop is expanded in EmitInstrWithCustomInserter but the LL/SC loop is
> represented by a new pseudoinstruction. This final pseudoinstruction is then
> expanded after register allocation. This minimises the possibility for sharing
> logic between backends, but does mean we don't need to expose new intrinsics.
> Mips adopts this approach in D31287.
> 3. Target-independent SelectionDAG expansion code converts unsupported atomic
> operations. e.g. rather than converting `atomicrmw add i8` to AtomicLoadAdd,
> expand to nodes that align the address and calculate the mask as well as an
> AtomicLoadAddMasked node. I haven't looked at this in great detail.
> 4. Introducing masked atomic operations to the IR. Mentioned for completeness,
> I don't think anyone wants this.
> 5. Introduce target-independent intrinsics for masked atomic operations. This
> seems worthy of consideration.
>
> For 1. and 2. the possibility for sharing logic between backends is minimised
> and the address calculation, masking and shifting logic is mostly hidden from
> optimisations (though option 2. allows e.g. MachineCSE). There is the
> advantage of avoiding the need for new intrinsics.
>
> ## Patches up for review
>
> I have patches up for review which implement the described strategy. More
> could be done to increase the potential for code reuse across targets, but I
> thought it would be worth getting feedback on the path forwards first.
>
> * D47587: [RISCV] Codegen support for atomic operations on RV32I.
> <https://reviews.llvm.org/D47587>. Simply adds support for lowering fences and
> uses AtomicExpandPass to generate libatomic calls otherwise. Committed in
> rL334590.
> * D47589: [RISCV] Add codegen support for atomic load/stores with RV32A.
> <https://reviews.llvm.org/D47589>. Use AtomicExpandPass to insert fences for
> atomic load/store. Committed in rL334591.
> * D47882: [RISCV] Codegen for i8, i16, and i32 atomicrmw with RV32A.
> <https://reviews.llvm.org/D47882>. Implements the lowering strategy described
> above for atomicrmw and adds a hook to allow custom atomicrmw expansion in IR.
> Under review.
> * D48129: [RISCV] Improved lowering for bit-wise atomicrmw {i8, i16} on RV32A.
> <https://reviews.llvm.org/D48129>. Uses 32-bit AMO{AND,OR,XOR} with
> appropriately manipulated operands to implement 8/16-bit AMOs. Under review.
> * D48130: [AtomicExpandPass]: Add a hook for custom cmpxchg expansion in IR.
> <https://reviews.llvm.org/D48130> Separated patch as this modifies the
> existing shouldExpandAtomicCmpXchgInIR interface. Under review.
> * D48141: [RISCV] Implement codegen for cmpxchg on RV32I.
> <https://reviews.llvm.org/D48131> Implements the lowering strategy described
> above. Under review.
>
> ## Questions
>
> To pick a few to get started:
>
> * How do you feel about the described lowering strategy? Am I unfairly
> overlooking a SelectionDAG approach?
> * How much enthusiasm is there for moving ARM, AArch64, Mips, Hexagon, and
> other architectures to use such an approach?
> * If there is enthusiasm, how worthwhile is it to share logic for generation
> of masks+shifts needed for part-word atomics?
> * I'd like to see ARM+AArch64+Hexagon move away from the problematic
> expansion in IR and to have that code deleted from AtomicExpandPass. Are
> there any objections?
> * What are your thoughts on the introduction of new target-independent
> intrinsics for masked atomics?
>
> Many thanks for your feedback,
>
> Alex Bradbury, lowRISC CIC
>

Re: [llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

Basic principles:* The LL/SC loop should be treated as a black box, and expanded post-RA.

It will probably come as no surprise, but I think this is 100% required for correctness. I'm quite skeptical that it can be correct on any existing architecture, without doing this. My first inclination upon hearing that some architecture may not require such careful handling is that their architecture documentation probably failed to document the careful handling required, not that it's not required.

However, in order to not block this on every architecture maintainer being persuaded, it's a good idea to introduce the new functionality into AtomicExpandPass, and switch architectures over to it as the arch maintainers are convinced it's a good idea.

* If code can be safely expanded in the IR, do it there. [I'd welcome feedbackon this one - should I be taking a closer look at expansion in SelectionDAGlegalisation?

Definitely in favor of IR expansion -- we should do as much in IR as is feasible, because it admits better optimization opportunities. Only the LL/SC-loop should be late-expanded.

The above can be achieved by extending AtomicExpandPass to support a 'Custom'

expansion method, which uses a TargetLowering call to expand to custom IR,including an appropriate intrinsic representing the LL+SC loop

I think it'd be better to sink more of the functionality into AtomicExpandPass itself (rather than adding a "Custom" hook). However, that ties into whether to introduce a common intrinsic that can be used across architectures...

* Part-word AMO without a native instruction that can be implemented by anative word-sized AMO: 8 and 16-bit atomicrmw {and,or,xor} can be implementedby 32-bit amoand, amoor, amoxor. Perform this conversion as an IRtransformation.

+1, and that transform should be in common code in AtomicExpandPass.

* Word-sized compared-and-exchange: Lower to a pseudo-instruction using thenormal instruction selection mechanism. This pseudo-instruction will beexpanded after register allocation.

On RISCV, implementing the whole thing in the pseudo is probably right, since you only really have the inner-loop.

But for other archs like ARMv7, I think it'll probably makes sense to continue to handle a bunch of the cmpxchg expansion in IR. There, the current cmpxchg expansion can be quite complex, but only loop really needs to be a primitive(we'd need two loop variants, both "strex, ldrex, loop" or "ldrex, strex, loop", depending on whether it generates an initial ldrex+barrier first). All the rest -- initial ldrex+barrier, clrex, barriers-- can all remain IR-level expanded.

I'll note that in all cases, both for RISCV and ARM and others, we _really_ would like to be able to have the compare-failure jump to a different address than success. That isn't possible for an intrinsic call at the moment, but I believe it will be possible to make that work soon, due to already ongoing work for "asm goto", which requires similar. Once we have that ability, I don't see any reason why the late-lowering cmpxchg pseudo should have any performance downside vs IR expansion, at least w.r.t. any correct optimizations.

Note that there are additional complications with cmpxchg such as supportingweak cmpxchg (which requires returning a success value), or supportingdifferent failure orderings. It looks like the differentiation betweenstrong/weak cmpxchg doesn't survive the translation to SelectionDAG right now.Supporting only strong cmpxchg and using the success ordering for the failurecase is conservative but correct I believe.

Yep. The distinction between strong/weak is only relevant to LL/SC architectures, where you get to avoid making a loop. AFAIK, all ISAs which have a native cmpxchg instruction implement the strong semantics for it, and since the LL/SC loops are expanded in IR right now, they didn't need the weak indicator in SelectionDAG. While strong doesn't _need_ a success indicator output, as noted above it'll generate nicer code when we can.

In the RISC-V case, the LL/SC loop pseudo-instructions are lowered at thelatest possible moment. The RISCVExpandPseudoInsts pass is registered withaddPreEmitPass2.

The main aspect I'm unhappy with in this approach is the need to introduce newintrinsics. Ideally these would be documented as not for use by frontends andsubject to future removal or alteration - is there precedent for this?Alternatively, see the suggestion below to introduce target-independent maskedAMO intrinsics.

I agree -- it'd be great to somehow name/annotate these intrinsics, such that it's clear that they're a private implementation detail _INSIDE_ the llvm target codegen passes, with no stability guaranteed. Not even "opt" passes should be emitting them, so they should never end up in a bitcode file where we'd need to provide backwards compatibility.

Maybe we can call them something like "llvm.INTERNAL_CODEGEN.f90d461eee5d32a1.masked.atomicrmw.add.i32" (where the random number in the middle is something that changes), and document that nobody must use intrinsics in the INTERNAL_CODEGEN, other than llvm CodeGen.

I think it's definitely worthwhile looking to see if the set of intrinsics for the atomicrmw operations (in particular, the set of additional arguments computed in the pre-loop section, and the return value) are likely going to be the same on the different architectures we have now. If so, I think it's definitely worthwhile making a commonly-named intrinsic. However, even so, it should remain for internal use only, and only implemented on targets which tell AtomicExpandPass to generate it. I'd like it to be common only to enhance code-reuse among the targets, not to provide a user-facing API.

Re: [llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

> * I'd like to see ARM+AArch64+Hexagon move away from the problematic
> expansion in IR and to have that code deleted from AtomicExpandPass. Are
> there any objections?

I think it would be a great shame and I'd like to avoid it if at all
possible, though I'm also uncomfortable with the current situation.

The problem with late expansion is that even the simplest variants add
some rather unpleasant pseudo-instructions, and I've never even seen
anyone attempt to get good CodeGen out of it (simplest example being
"add xD, xD, #1" in the loop for an increment but there are obviously
more). Doing so would almost certainly involve duplicating a lot of
basic arithmetic instructions into AtomicRMW variants.

I think it would be potentially-useful, and difficult, to also add pseudos for the atomicrmw operations with an immediate operand of the range allowed by the corresponding arithmetic operation. That doesn't seem like it'd be difficult. But I rather doubt whether any further refinement beyond that would be useful in practice.

Added to that is the fact that actual CPU implementations are often a
lot less restrictive about what can go into a loop than is required
(for example even spilling is fine on ours as long as the atomic
object is not in the same cache-line; I suspect that's normal). That
casts the issue in a distinctly theoretical light -- we've been doing
this for years and as far as I'm aware nobody has ever hit the issue
in the real world, or even had CodeGen go wrong in a way that *could*
do so outside the -O0 situation

So, there's two separate issues --

1. Deterministic failure of a loop even if it's the only thing running. For example, from register spilling to the same cacheline within the loop. As you note, that has been observed, but only at -O0.

2. Livelock between two such loops on different CPUs. AFAIK this has not been reported, but I expect it would be very difficult to observe, because one CPU will likely receive an OS interrupt eventually, and thus break out of the livelock situation. This would likely exhibit simply as worse performance than expected, which I expect would be very difficult to track down and report.

I'd suspect that generally, most of the violations of the architectural requirements would only cause #2, not #1. It seems entirely possible to me that this change may well be good both for correctness and performance-under-load.

OTOH that *is* an argument for performance over correctness when you

get right down to it, so I'm not sure I can be too forceful about it.
At least not without a counter-proposal to restore guaranteed
correctness.

I'd rather look at it from the other direction -- does using a late-lowered atomicrmw pseudo _actually_ cause any observable performance degradation? I rather suspect it won't.

Now -- cmpxchg, as mentioned before, is trickier, and I could see that the inability to jump to a separate failure block may make a noticeable difference at the moment.

Re: [llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

Hexagon has a very few restrictions on what will cause loss of a
reservation: those are stores to the same address (a 64-bit granule) or
any sort of exception/interrupt/system call. Other than that the
reservation should stay. The architecture doesn't explicitly guarantee
that though, but in the absence of the elements listed above, a program
with LL/SC can be expected to make progress.

Consequently, the best way for us to handle LL/SC would be to expand
them early and let them be optimized as any other code. The usual
optimization restrictions should be sufficient to prevent introduction
of factors causing a loss of reservation.

With the constraints on LL/SC varying wildly between architectures,
maybe we should have several options available for different targets?

The constraints on LL/SC don't seem to vary that wildly between architectures (except Hexagon, I guess!) -- sure, the exact details are different, but they all (the others) have some set of requirements on the structure of their loop which are not generally possible guarantee in a high level abstract representation. Given that, the details of the requirements aren't particularly relevant, they just need to be properly adhered to when writing the target-specific code to generate the machine instructions for the ll/sc loop.

I'm a bit skeptical that Hexagon is truly different in this regards than all the other architectures. But, that said, I don't think it'd be at all problematic to keep the IR-level lowering code around for Hexagon to continue to use, even if we do migrate everything else away.

We definitely need to introduce new options alongside the current options to start with, anyways -- at the very least simply to allow migrating each architecture separately.

Re: [llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

On 15 June 2018 at 12:28, Tim Northover via llvm-dev
<[hidden email]> wrote:
> On Thu, 14 Jun 2018 at 13:45, Alex Bradbury <[hidden email]> wrote:
>> I don't like to see the compiler generate code that's
>> obviously dumber than what a human would write, but in this case do we
>> really think there would be any sort of measurable impact on
>> performance?
>
> It's certainly going to be marginal, but then so is the benefit of
> late expansion.

I feel differently: to my mind there's a lot to gain by late expansion
and relatively little to lose. By expanding only the LL/SC loop late
and potentially making use of future 'asm goto' support in the future
(as James suggests) there should be minimal/no codegen impact. Output
that is guaranteed to meet the platform requirements for forward
progress with a codegen method that is resilient to the introduction
of new in-tree or out-of-tree passes is a huge win to me vs the status
quo. Although the current approach appears to work ok in practice we
know expansion in IR for LL/SC is fundamentally on shaky ground - at
least for archs other than Hexagon.

> There's also the barrier that this genuinely is a place where people
> are willing to hand-code assembly, and go rooting around in the
> compiler's output to check we're doing what they expect even if they
> don't use assembly. The assembly is possibly mostly for historical
> reasons, but it's still out there and we want to convince people to
> move away from it.
>
> I think I'm going to try to implement that verifier pass you mentioned
> and run it across an iOS build.

That would be interesting. My concern with the verifier approach is
that it may not leave the end-user with many options if it fails. A
huge improvement on having the possibility of generating questionable
code with no way of checking of course.

Re: [llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

> Hexagon has a very few restrictions on what will cause loss of a
> reservation: those are stores to the same address (a 64-bit granule) or any
> sort of exception/interrupt/system call. Other than that the reservation
> should stay. The architecture doesn't explicitly guarantee that though, but
> in the absence of the elements listed above, a program with LL/SC can be
> expected to make progress.
>
> Consequently, the best way for us to handle LL/SC would be to expand them
> early and let them be optimized as any other code. The usual optimization
> restrictions should be sufficient to prevent introduction of factors causing
> a loss of reservation.

Thanks Krzysztof, it sounds like Hexagon gives so much freedom for
LL/SC loops that there really isn't a need to be concerned about
expansion in IR. As such, it would clearly make sense to retain that
codepath over the long-term, even if architectures are discouraged
from using it unless they're _really_ sure it's safe for their
architecture.

> With the constraints on LL/SC varying wildly between architectures, maybe we
> should have several options available for different targets?

Hexagon is the first architecutre I've encountered where there are
essentially no restrictions, so that certainly changes things. I was
mainly trying to determine consensus for the ultimate goal. If
everyone agreed that late expansion is the path forwards, we could
plan to remove and deprecate the current early expansion codepath at
some point.

Re: [llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

On 15 June 2018 at 23:03, James Y Knight via llvm-dev
<[hidden email]> wrote:
> However, in order to not block this on every architecture maintainer being
> persuaded, it's a good idea to introduce the new functionality into
> AtomicExpandPass, and switch architectures over to it as the arch
> maintainers are convinced it's a good idea.

Indeed, even if everyone agreed this was a good idea I wasn't
expecting to do this all at once.

>> The above can be achieved by extending AtomicExpandPass to support a
>> 'Custom'
>>
>> expansion method, which uses a TargetLowering call to expand to custom IR,
>> including an appropriate intrinsic representing the LL+SC loop
>
>
> I think it'd be better to sink more of the functionality into
> AtomicExpandPass itself (rather than adding a "Custom" hook). However, that
> ties into whether to introduce a common intrinsic that can be used across
> architectures...

Yes, I'd like to do more in AtomicExpandPass. Adding the 'Custom' hack
was the easiest way of prototyping this, and this thread will
hopefully give good guidance on the level of interest in using this in
a target-independent way.

>> * Word-sized compared-and-exchange: Lower to a pseudo-instruction using
>> the
>> normal instruction selection mechanism. This pseudo-instruction will be
>> expanded after register allocation.
>
>
> On RISCV, implementing the whole thing in the pseudo is probably right,
> since you only really have the inner-loop.
>
> But for other archs like ARMv7, I think it'll probably makes sense to
> continue to handle a bunch of the cmpxchg expansion in IR. There, the
> current cmpxchg expansion can be quite complex, but only loop really needs
> to be a primitive (we'd need two loop variants, both "strex, ldrex, loop" or
> "ldrex, strex, loop", depending on whether it generates an initial
> ldrex+barrier first). All the rest -- initial ldrex+barrier, clrex,
> barriers-- can all remain IR-level expanded.

Good point. As you say, the RISC-V expansion is much more
straight-forward. Although the barrier could be cleared eagerly after
compare failure by an SC to a dummy memory location, I don't currently
intend to do so:
1) GCC also doesn't intend to use such an expansion
2) No existing microarchitectural implementations have been shown to
benefit from this manual eager reservation clearing
3) Sticking to the simplest expansion is a good starting point, and
future microarchitects are most likely to optimise for code that is
out in the wild

> I'll note that in all cases, both for RISCV and ARM and others, we _really_
> would like to be able to have the compare-failure jump to a different
> address than success. That isn't possible for an intrinsic call at the
> moment, but I believe it will be possible to make that work soon, due to
> already ongoing work for "asm goto", which requires similar. Once we have
> that ability, I don't see any reason why the late-lowering cmpxchg pseudo
> should have any performance downside vs IR expansion, at least w.r.t. any
> correct optimizations.

I've seen periodically recurring discussions, but is someone actually
actively working on this?

>> The main aspect I'm unhappy with in this approach is the need to introduce
>> new
>> intrinsics. Ideally these would be documented as not for use by frontends
>> and
>> subject to future removal or alteration - is there precedent for this?
>> Alternatively, see the suggestion below to introduce target-independent
>> masked
>> AMO intrinsics.
>
>
> I agree -- it'd be great to somehow name/annotate these intrinsics, such
> that it's clear that they're a private implementation detail _INSIDE_ the
> llvm target codegen passes, with no stability guaranteed. Not even "opt"
> passes should be emitting them, so they should never end up in a bitcode
> file where we'd need to provide backwards compatibility.
>
> Maybe we can call them something like
> "llvm.INTERNAL_CODEGEN.f90d461eee5d32a1.masked.atomicrmw.add.i32" (where the
> random number in the middle is something that changes), and document that
> nobody must use intrinsics in the INTERNAL_CODEGEN, other than llvm CodeGen.

llvm.internal_use_only.masked.atomicrmw.add.i32 would get the point
across I think.

Is it not possible someone would generate a .bc after AtomicExpandPass
has run? Of course even now there's no guarantee such a file might
work on a future version of LLVM. e.g. the atomics lowering strategy
could change from one release to the next.

>> ## Alternative options
>
> [A bunch of alternatives I don't like, so i'm omitting them]
>
>>
>> 5. Introduce target-independent intrinsics for masked atomic operations.
>> This
>>
>> seems worthy of consideration.
>
>
> I think it's definitely worthwhile looking to see if the set of intrinsics
> for the atomicrmw operations (in particular, the set of additional arguments
> computed in the pre-loop section, and the return value) are likely going to
> be the same on the different architectures we have now. If so, I think it's
> definitely worthwhile making a commonly-named intrinsic. However, even so,
> it should remain for internal use only, and only implemented on targets
> which tell AtomicExpandPass to generate it. I'd like it to be common only to
> enhance code-reuse among the targets, not to provide a user-facing API.

Yes, giving the impression of providing a new user-facing API is my
concern. Particularly as we might define want to have a comprehensive
set of intrinsics but have targets support only the subset they
require for comprehensives atomics support (as some instructions may
go through the usual SDISel route without transformation to
intrinsics).

A common set of intrinsics is probably ok, but there are cases where
you might want a different interface. For instance clearing a
reservation requires an SC to a dummy address on RISC-V. Although we
have no intention of doing this in the RISC-V backend currently,
targets that wanted to implement such an approach might want to have
that dummy address as an argument to the intrinsic.

[llvm-dev] FW: RFC: Atomic LL/SC loops in LLVM revisited

>
> On 15 June 2018 at 12:28, Tim Northover via llvm-dev <llvm-
> [hidden email]> wrote:
> > On Thu, 14 Jun 2018 at 13:45, Alex Bradbury <[hidden email]> wrote:
> >> I don't like to see the compiler generate code that's obviously
> >> dumber than what a human would write, but in this case do we really
> >> think there would be any sort of measurable impact on performance?
> >
> > It's certainly going to be marginal, but then so is the benefit of
> > late expansion.
>
> I feel differently: to my mind there's a lot to gain by late expansion
> and relatively little to lose. By expanding only the LL/SC loop late
> and potentially making use of future 'asm goto' support in the future
> (as James suggests) there should be minimal/no codegen impact. Output
> that is guaranteed to meet the platform requirements for forward
> progress with a codegen method that is resilient to the introduction
> of new in-tree or out-of-tree passes is a huge win to me vs the status
> quo. Although the current approach appears to work ok in practice we
> know expansion in IR for LL/SC is fundamentally on shaky ground - at least for archs other than Hexagon.

This is a concern for MIPS for both correctness and quality of the generated code, and the maintenance burden regarding the number of different expansions required if they are to be expanded at the MC layer.

For example, MIPS branches have delay slots which must be filled. Filling them with nops is suboptimal for both size and performance. MIPSR6 features compact branches which don't have delay slots but do have forbidden slots which prohibit back-to-back branches (and other control flow instructions). microMIPS features compact branches in certain cases. microMIPSR6 only has compact branches but doesn't have forbidden slots.

Handling all those situations is currently done by the MIPS delay slot filler which knows how to reselect instructions.

MIPS needs post-ra expansion to handle the cases of stores from the same process Executing the ll/sc loop and needs the late target specific passes to clean-up the branches.

Thanks,
Simon

>
> > There's also the barrier that this genuinely is a place where people
> > are willing to hand-code assembly, and go rooting around in the
> > compiler's output to check we're doing what they expect even if they
> > don't use assembly. The assembly is possibly mostly for historical
> > reasons, but it's still out there and we want to convince people to
> > move away from it.
> >
> > I think I'm going to try to implement that verifier pass you
> > mentioned and run it across an iOS build.
>
> That would be interesting. My concern with the verifier approach is
> that it may not leave the end-user with many options if it fails. A
> huge improvement on having the possibility of generating questionable
> code with no way of checking of course.
>
> Best,
>
> Alex
> _______________________________________________
> LLVM Developers mailing list
> [hidden email]> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] FW: RFC: Atomic LL/SC loops in LLVM revisited

> (Resending, missed llvm-dev somehow.)
>>
>> On 15 June 2018 at 12:28, Tim Northover via llvm-dev <llvm-
>> [hidden email]> wrote:
>> > On Thu, 14 Jun 2018 at 13:45, Alex Bradbury <[hidden email]> wrote:
>> >> I don't like to see the compiler generate code that's obviously
>> >> dumber than what a human would write, but in this case do we really
>> >> think there would be any sort of measurable impact on performance?
>> >
>> > It's certainly going to be marginal, but then so is the benefit of
>> > late expansion.
>>
>> I feel differently: to my mind there's a lot to gain by late expansion
>> and relatively little to lose. By expanding only the LL/SC loop late
>> and potentially making use of future 'asm goto' support in the future
>> (as James suggests) there should be minimal/no codegen impact. Output
>> that is guaranteed to meet the platform requirements for forward
>> progress with a codegen method that is resilient to the introduction
>> of new in-tree or out-of-tree passes is a huge win to me vs the status
>> quo. Although the current approach appears to work ok in practice we
>> know expansion in IR for LL/SC is fundamentally on shaky ground - at least for archs other than Hexagon.
>
> This is a concern for MIPS for both correctness and quality of the generated code, and the maintenance burden regarding the number of different expansions required if they are to be expanded at the MC layer.

Hi Simon. Although expanding at the MC layer has the nice property of
being as close to just emitting inline ASM as possible, that's
actually not what I'm advocating. If you look at e.g.
https://reviews.llvm.org/D47882 you'll see that as much expansion
takes possible takes place in IR, while the LL/SC loop is expanded in
a very late stage MachineFunctionPass. For the reasons you stated, it
sounds like the Mips backend wouldn't want to perform the last-stage
expansion quite as late as I'm doing in RISC-V, which is a choice each
backend is free to make.

Surely the strategy described is no worse than the scheme in D31287 in
terms of delay slot filling etc, and possibly has minor advantages by
allowing a little more to be expanded in the IR level and thus sharing
a little more code with other backends. Or am I misunderstanding?

To update on this, everything has now landed other than D48141 which
implements cmpxchg for RISC-V and introduces target-independent code
to help lowering part-word cmpxchg. In response to the comments on
this thread and review comments I revised the patches so that as much
lowering as possible is done in AtomicExpandPass, thus maximising the
potential for code reuse across different targets.

I've additionally posted https://reviews.llvm.org/D52234 which updates
the AtomicExpandPass docs. If there is consensus, I think both James
and I would prefer to more strongly encourage that targets avoid
expanding LL/SC loops in IR on the grounds that LLVM has no way to
guarantee that the generated code complies with the architecture's
requirements for forward progress.

I believe it would be worthwhile for the Mips backend to evaluate
using AtomicExpandPass in the same way the RISC-V backend now does. It
wouldn't provide a correctness benefit over the current logic, but
should help reduce the amount of target-specific code for handling
atomics.

Tim: did you get a chance at looking at a verifier pass?

James suggested what the eventual solution for "asm goto" may well
help in cleaning up the suboptimal control-flow around a cmpxchg
lowered using the approach described in this RFC. Does anybody know of
any work in this area?

Re: [llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

> To update on this, everything has now landed other than D48141 which
> implements cmpxchg for RISC-V and introduces target-independent code
> to help lowering part-word cmpxchg. In response to the comments on
> this thread and review comments I revised the patches so that as much
> lowering as possible is done in AtomicExpandPass, thus maximising the
> potential for code reuse across different targets.
>
> I've additionally posted https://reviews.llvm.org/D52234 which updates
> the AtomicExpandPass docs. If there is consensus, I think both James
> and I would prefer to more strongly encourage that targets avoid
> expanding LL/SC loops in IR on the grounds that LLVM has no way to
> guarantee that the generated code complies with the architecture's
> requirements for forward progress.
>
> I believe it would be worthwhile for the Mips backend to evaluate
> using AtomicExpandPass in the same way the RISC-V backend now does. It
> wouldn't provide a correctness benefit over the current logic, but
> should help reduce the amount of target-specific code for handling
> atomics.
>
> Tim: did you get a chance at looking at a verifier pass?

Just wanted to ping this thread given the ongoing Dev Meeting. If
anyone wants to discuss atomics lowering in person, I'd be keen to do
so. https://reviews.llvm.org/D48131 is still awaiting review, but once
landed atomics lowering should be "complete" using this proposed
strategy for the RISC-V backend.

I understand an RFC soon going to be posted for asm goto, which as
James noted may allow more-efficient branching in the cmpxchg case.