Details

According to LangRef "align must be explicitly specified on atomic stores,
and the store has undefined behavior if the alignment is not set to a value
which is at least the size in bytes of the pointee."

It would be useful to detect optimizations introducing this undefined behavior
in the verifier. By default the check is off; it can be enabled with the
verify-atomic-alignment option.

Event Timeline

I like the idea overall, but I think you need a way to figure out if the alignment was put there at the user's request, or if an optimization introduced it. That would allow you to turn on the check all the time, and only detect cases where LLVM messed up, instead of where it's obeying the user's request for under-alignment. Maybe clang could add a metadata node on all misaligned atomics?

IIRC the weird LangRef definition is because it was written before we banned modules without a DataLayout.

It clearly doesn't match reality, though; we generate well-defined code for unaligned atomic ops. The semantics of atomic operations in IR were changed without anyone actually updating any of the documentation (particularly around the generation of __atomic_* calls in the backend). Someone needs to go through and update it.

IIRC the weird LangRef definition is because it was written before we banned modules without a DataLayout.

It clearly doesn't match reality, though; we generate well-defined code for unaligned atomic ops. The semantics of atomic operations in IR were changed without anyone actually updating any of the documentation (particularly around the generation of __atomic_* calls in the backend). Someone needs to go through and update it.

Do you have a pointer to an intentional change in semantics? (i.e. what changes introduced this?)

Thanks for the pointer. I've read through the documentation introduced in that patch, and I'm still left wondering.

If we have, say, a 16-byte aligned CAS and a 16-byte *unaligned* CAS operation on x86-64, I don't see how we can implement them via the atomic library contract. The aligned 16-byte CAS can be implemented via CMPXCHG16B, but the spec for CMPXCHG16B explicitly states that a fault is generated if an unaligned address is used, so we have no choice but to use a lock pattern for the *unaligned* case. That means we can have two overlapping 16-byte accesses (one aligned, one not) and be unable to provide a correct implementation.

To avoid confusion: this problem doesn't exist for accesses of 8 bytes or smaller, due to the availability of the larger CAS; only 16-byte accesses are affected.

I think the strongest guarantee we can provide is a target defined maximum size of atomics which differs based on alignment.
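The size boundary the discussion above hinges on can be queried directly from C++. This is a minimal sketch; the `Pair16` type is hypothetical, standing in for any 16-byte object:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical 16-byte payload, standing in for the 16-byte CAS case above.
struct alignas(16) Pair16 {
    std::uint64_t lo;
    std::uint64_t hi;
};

// 8 bytes and below are lock-free on mainstream 64-bit targets, so aligned
// and unaligned variants never have to mix a lock with a CMPXCHG-style op.
constexpr bool k8ByteLockFree = std::atomic<std::uint64_t>::is_always_lock_free;

// 16 bytes depend on the target (e.g. CMPXCHG16B on x86-64, typically gated
// on -mcx16); this is exactly where the aligned/unaligned split shows up.
constexpr bool k16ByteLockFree = std::atomic<Pair16>::is_always_lock_free;
```

Whether the 16-byte query reports lock-free varies with the compiler and target flags, which mirrors the target-defined maximum size suggested above.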

In general, any lock-based atomic can't overlap with any lock-free atomic operation: the lock-free operation will ignore the lock, so the behavior is unpredictable. That isn't unique to 16-byte operations. That's not really fatal for the intended use of the __atomic functions: atomic variables in C/C++ can't overlap. Yes, it's not consistent with the way the LLVM documentation describes atomic operations.

In general, any lock-based atomic can't overlap with any lock-free atomic operation: the lock-free operation will ignore the lock, so the behavior is unpredictable.

The contract for the atomic library is that it _MUST_ do a lock-free operation on any object for which _any_ compiler in use on the system may have emitted a lock-free operation. That's why libatomic is shipped as a separate shared library with GCC.

E.g., if any compiler is able to emit a lock-free cmpxchg for a particular object, then the __atomic_compare_exchange call must also use cmpxchg, given that same pointer.

As a concrete example: if a compiler sees an 8-byte object which is only guaranteed to be 4-byte-aligned, it may not be able to emit an inline lock-free atomic instruction, and will instead call the library routine. However, it may happen that the object is in fact 8-byte aligned. If so, the library routine MUST use the lock-free instruction, rather than a lock.

This ensures correctness even when other code in the program was able to infer a greater alignment for the object. It also ensures correctness when objects (or shared objects) from multiple compiler versions, or with different CPU targets, are combined.
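That dispatch-on-dynamic-alignment contract can be sketched as follows. This is illustrative only, not libatomic's actual code; `sketch_atomic_load_8` and `g_fallback_lock` are made-up names, and the lock-free path uses the GCC/Clang `__atomic_load_n` builtin:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>

// Hypothetical process-wide fallback lock, for illustration only.
static std::mutex g_fallback_lock;

static bool naturally_aligned(const void* p, std::size_t size) {
    return reinterpret_cast<std::uintptr_t>(p) % size == 0;
}

std::uint64_t sketch_atomic_load_8(const void* p) {
    if (naturally_aligned(p, 8)) {
        // Lock-free path: mandatory whenever any compiler could have emitted
        // an inline lock-free access to this object, even though the caller
        // only *statically* knew a smaller alignment.
        return __atomic_load_n(static_cast<const std::uint64_t*>(p),
                               __ATOMIC_SEQ_CST);
    }
    // Locked fallback for genuinely misaligned objects: correct only if
    // every access to this object also goes through the same lock.
    std::lock_guard<std::mutex> guard(g_fallback_lock);
    std::uint64_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}
```

The key point is that the branch is taken on the pointer's runtime value, not on the alignment the frontend could prove, so the routine stays compatible with inlined lock-free accesses emitted by any compiler.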

In general, any lock-based atomic can't overlap with any lock-free atomic operation: the lock-free operation will ignore the lock, so the behavior is unpredictable. That isn't unique to 16-byte operations. That's not really fatal for the intended use of the __atomic functions: atomic variables in C/C++ can't overlap. Yes, it's not consistent with the way the LLVM documentation describes atomic operations.

Can you cite this please? While technical consistent with everything else on this thread, this seems to differ in spirit. Before enshrining this in the documentation, I'd like to double check.

The contract for the atomic library is that it _MUST_ do a lock-free operation on any object for which _any_ compiler in use on the system may have emitted a lock-free operation.

That's not really relevant. The question is what happens if, on one thread, I perform a 4-byte atomic operation at address p, and then on another thread, I perform a 4-byte atomic operation at address p+2.

Can you cite this please?

The C standard bit follows from the strict aliasing rule (6.5p7). (Yes, it would be nice if it were stated a bit more explicitly in the actual description of atomics, but it's pretty clearly the intent; the description of modification orders for atomic objects doesn't make sense if they could overlap.)

The contract for the atomic library is that it _MUST_ do a lock-free operation on any object for which _any_ compiler in use on the system may have emitted a lock-free operation.

That's not really relevant. The question is what happens if, on one thread, I perform a 4-byte atomic operation at address p, and then on another thread, I perform a 4-byte atomic operation at address p+2.

OK. Certainly no guarantees can be made in such a case.

But, relatedly -- and more subtly -- I also expect there'll be no guarantee of correct behavior if you mix 2-byte and 4-byte naturally-aligned accesses to some address p.

Going further: this is an active field of research, and *architectures* don't provide meaningful guarantees when mixing alignments and / or sizes. I doubt it's useful for LLVM to provide guarantees either. I agree that we want to try to always do lock-free where we can, and be consistent about where we do non-lock-free.

I believe that instead of this patch, the LangRef should be updated to say:

Misaligned atomics in the IR are allowed, and work correctly (but will most likely be inefficient).

This sounds fine. However, the premise of this patch IIUC was that LLVM might add unaligned atomics. Is this something we should worry about? If so, my original comment stands: the verifier should make sure the only unaligned atomic accesses came directly at the user's request.

Unless all accesses to a given byte of memory are done with the same address and size, there can be no guarantee of atomicity. (that is: no overlapping accesses, and no mixed-size accesses)

This is fairly handwavy, but in the right general direction. You have to talk about time or synchronization if you want this statement to be meaningful: e.g. memory locations can get reused, after which it's fine to access them differently, or you can synchronize and then switch access patterns; across such synchronization, alignment changes don't matter.

I believe that instead of this patch, the LangRef should be updated to say:

Misaligned atomics in the IR are allowed, and work correctly (but will most likely be inefficient).

This sounds fine. However, the premise of this patch IIUC was that LLVM might add unaligned atomics. Is this something we should worry about? If so, my original comment stands: the verifier should make sure the only unaligned atomic accesses came directly at the user's request.

Hrm -- that premise now worries me, because it implies that we have optimization passes which are merging atomic operations. I think that's incorrect, *even if* such merging results in aligned accesses.

I believe that instead of this patch, the LangRef should be updated to say:

Misaligned atomics in the IR are allowed, and work correctly (but will most likely be inefficient).

This sounds fine. However, the premise of this patch IIUC was that LLVM might add unaligned atomics. Is this something we should worry about? If so, my original comment stands: the verifier should make sure the only unaligned atomic accesses came directly at the user's request.

Hrm -- that premise now worries me, because it implies that we have optimization passes which are merging atomic operations. I think that's incorrect, *even if* such merging results in aligned accesses.

are likely wrong and need to be reverted. The user asked for N atomic ops of size M, but LLVM is instead emitting 1 atomic op of size N * M.

I'm not sure I agree: those are unordered memsets, which are basically @reames' memsets. I think it's up to him to give them semantics that are useful for his target.

Thank you. I was about to say pretty much the same. In particular, trying to claim that a transform is incorrect and needs to be reverted, based on a proposed reading of a LangRef change which has not been posted for review, much less accepted, seems a definite step too far.

The contract for the atomic library is that it _MUST_ do a lock-free operation on any object for which _any_ compiler in use on the system may have emitted a lock-free operation.

That's not really relevant. The question is what happens if, on one thread, I perform a 4-byte atomic operation at address p, and then on another thread, I perform a 4-byte atomic operation at address p+2.

OK. Certainly no guarantees can be made in such a case.

But, relatedly -- and more subtly -- I also expect there'll be no guarantee of correct behavior if you mix 2-byte and 4-byte naturally-aligned accesses to some address p.

Going further: this is an active field of research, and *architectures* don't provide meaningful guarantees when mixing alignments and / or sizes. I doubt it's useful for LLVM to provide guarantees either. I agree that we want to try to always do lock-free where we can, and be consistent about where we do non-lock-free.

I think there's an important correction that needs to be made here. It's correct to say that not all architectures provide guarantees for mixed alignments and sizes. It is incorrect to say that no architecture does; some architectures do provide clear behaviour in practice.

My general feeling from the above discussion is that we are going to *have to* make some of this target defined. We may be able to provide a guarantee of minimal functionality, but we should not disallow a target from providing stronger guarantees.

I'm not sure I agree: those are unordered memsets, which are basically @reames' memsets. I think it's up to him to give them semantics that are useful for his target.

Thank you. I was about to say pretty much the same. In particular, trying to claim that a transform is incorrect and needs to be reverted, based on a proposed reading of a LangRef change which has not been posted for review, much less accepted, seems a definite step too far.

I must apologize for having given the impression I was actually calling for a revert now. I did not intend that -- we most definitely should not be hasty here! I meant "need to be reverted" in a more general sense.

That said, it's not the case that unordered memset/memcpy should have semantics defined as "whatever's useful to reames". :) They need well-defined semantics at the IR level, compatible with other atomics' semantics. And they have them! -- they are intended to have the same behavior as if you had written a series of <len> / <elementsize> unordered atomic load/store instructions with size <elementsize>.

So, then, the question is: what is safe to do with atomic unordered instructions? Shall we merge these two instructions into one 32-bit store instruction?

We currently do not. On X86, for unordered atomics of those sizes and alignments, I'm pretty sure the above would in fact be a safe and beneficial optimization.

Similarly, merging two 32-bit stores into a (properly 8-byte-aligned) 64-bit store should also be safe -- but only on the Pentium and later. Otherwise, that would be an incorrect transformation, because the 8-byte atomic op ends up needing a libcall, which may not be atomic w.r.t. the 4-byte atomic read/write machine instruction.

My proposal is that at the IR level, we should document that it's unsafe to mix and match sizes on atomic operations, and behave as such in all our IR optimization passes. But, that doesn't prohibit teaching SelectionDAG load/store merging to do the above optimization in target configurations where it's known to be safe to do so.
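The intended element-wise semantics can be sketched in C++, with `memory_order_relaxed` used as the nearest C++ analogue of IR's `unordered` (the two orderings are not identical) and the function name being made up:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Sketch of the element-wise atomic memset semantics described above: the
// intrinsic behaves as len / elementsize independent atomic stores of
// elementsize bytes each, here with a fixed 4-byte element.
void elementwise_atomic_memset32(std::atomic<std::uint32_t>* dst,
                                 std::uint32_t value,
                                 std::size_t len_bytes) {
    // Each 4-byte element is stored atomically, but the stores as a group
    // are not one atomic operation -- a concurrent reader may observe a
    // partially completed fill.
    for (std::size_t i = 0; i < len_bytes / sizeof(std::uint32_t); ++i) {
        dst[i].store(value, std::memory_order_relaxed);
    }
}
```

Under this reading, merging adjacent elements into one wider store changes which access sizes a concurrent observer can race against, which is precisely the question being debated.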

I'm not sure I agree: those are unordered memsets, which are basically @reames' memsets. I think it's up to him to give them semantics that are useful for his target.

Thank you. I was about to say pretty much the same. In particular, trying to claim that a transform is incorrect and needs to be reverted, based on a proposed reading of a LangRef change which has not been posted for review, much less accepted, seems a definite step too far.

I must apologize for having given the impression I was actually calling for a revert now. I did not intend that -- we most definitely should not be hasty here! I meant "need to be reverted" in a more general sense.

Gotcha, moving on...

That said, it's not the case that unordered memset/memcpy should have semantics defined as "whatever's useful to reames". :) They need well-defined semantics at the IR level, compatible with other atomics' semantics.

I agree.

And they have them! -- they are intended to have the same behavior as if you had written a series of <len> / <elementsize> unordered atomic load/store instructions with size <elementsize>.

So, then, the question is: what is safe to do with atomic unordered instructions? Shall we merge these two instructions into one 32-bit store instruction?

We currently do not. On X86, for unordered atomics of those sizes and alignments, I'm pretty sure the above would in fact be a safe and beneficial optimization.

I would argue that we can, and possibly should. This is definitely legal on x86. Profitability is likely, but there might be some corner case, so I'll save that discussion for the patch review if someone wants to implement it!

Similarly, merging two 32-bit stores into a (properly 8-byte-aligned) 64-bit store should also be safe -- but only on the Pentium and later. Otherwise, that would be an incorrect transformation, because the 8-byte atomic op ends up needing a libcall, which may not be atomic w.r.t. the 4-byte atomic read/write machine instruction.

I agree with your facts. This is one of the arguments for me that some of this will have to be target defined and controlled by target hooks.

My proposal is that at the IR level, we should document that it's unsafe to mix and match sizes on atomic operations, and behave as such in all our IR optimization passes. But, that doesn't prohibit teaching SelectionDAG load/store merging to do the above optimization in target configurations where it's known to be safe to do so.

I think what we're debating here is the meaning of "unsafe". If "unsafe" is interpreted as "has target defined behaviour", then I completely agree. If "unsafe" is defined as "result is undefined" (in the UB sense), then I strongly disagree.

Do you have some hesitancy about this being target controlled for some reason? We certainly have precedent for a number of target legality hooks in optimizations already. Is there a concern I'm missing here?

I think what we're debating here is the meaning of "unsafe". If "unsafe" is interpreted as "has target defined behaviour", then I completely agree. If "unsafe" is defined as "result is undefined" (in the UB sense), then I strongly disagree.

At the IR level, we often make operations undefined even when the underlying machine operation has some useful semantics, if we can't provide that across all platforms. For example, float-to-int conversion produces poison on overflow, even though every target with a float-to-int instruction produces some well-defined result. We don't want to introduce target-specific semantics unless we have to. So the question is, what's the right tradeoff here? And if we do decide to distinguish between locked and lock-free operations, how should we represent that?

I haven't really thought it through completely, but I think if we want to distinguish them in IR, it might make sense to attach a "lockfree" bit to atomic operations where correctness depends on the lock-free semantics.

We certainly have precedent for a number of target legality hooks in optimizations already.

Target hooks don't control the semantics of a given operation. We use them to determine whether a transform is profitable, or sometimes whether the backend supports emitting a certain operation. Semantic differences must be encoded directly into the IR; the power of LLVM IR for target-independent optimization is tied to this. So adding a target hook to check whether a given target supports a certain lock-free operation is fine, but adding a target hook to control whether a certain sequence of atomic operations has defined behavior is not okay.

I think what we're debating here is the meaning of "unsafe". If "unsafe" is interpreted as "has target defined behaviour", then I completely agree. If "unsafe" is defined as "result is undefined" (in the UB sense), then I strongly disagree.

At the IR level, we often make operations undefined even when the underlying machine operation has some useful semantics, if we can't provide that across all platforms. For example, float-to-int conversion produces poison on overflow, even though every target with a float-to-int instruction produces some well-defined result. We don't want to introduce target-specific semantics unless we have to. So the question is, what's the right tradeoff here? And if we do decide to distinguish between locked and lock-free operations, how should we represent that?

Again, I would strongly resist any effort to define away overlapping atomics or unaligned atomics. This works today in practice on common architectures. We are reliant on this implementation working as it does today for overlapping atomics. Specifying that away because we happen to not be able to guarantee anything on other platforms is not acceptable.

Now, to be clear, I'm not arguing we can't change *spelling*. If you want to insist that overlapping atomics be done via intrinsics, I'll still argue with you because I think it's a bad design, but it's not a fundamental non-starter like defining away existing target behaviour is.

I haven't really thought it through completely, but I think if we want to distinguish them in IR, it might make sense to attach a "lockfree" bit to atomic operations where correctness depends on the lock-free semantics.

We certainly have precedent for a number of target legality hooks in optimizations already.

Target hooks don't control the semantics of a given operation. We use them to determine whether a transform is profitable, or sometimes whether the backend supports emitting a certain operation. Semantic differences must be encoded directly into the IR; the power of LLVM IR for target-independent optimization is tied to this. So adding a target hook to check whether a given target supports a certain lock-free operation is fine, but adding a target hook to control whether a certain sequence of atomic operations has defined behavior is not okay.

We certainly do this for library calls, and some intrinsics. I can't think of an example in the base IR instruction set at the moment, but I really don't see it being that different.

I think this thread has moved beyond the original review and probably needs to be promoted to llvm-dev for a wider audience. If there are no objections, I vote we close this review as abandoned, and then I'll send a summary email to llvm-dev.

With apologies, I would just like to tack a note into this thread that the entire *field* of formal memory model proofs involving partially-overlapping atomics is a single paper (last I knew, https://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf). The paper says mixed-size, but the key point is not-perfectly-overlapping.

That one paper uncovered surprising new behaviors that don't exist in programs comprising only aligned atomics of matching sizes for all pairs that conflict. For example, because of this new behavior, the presence of misaligned atomics in a program can invalidate the proof that C++ programs with SC fences between every two atomic operations will admit only SC executions. Risking invalidating that proof supports the notion that they should be declared UB in C++ programs in the general sense (i.e. users definitely should not be writing this), though perhaps not UB on every imaginable back-end after optimizations (an argument you have made, I don't disagree with).

TL;DR: it's a bit early to declare that misaligned atomics are well-understood & just work, even where they usually fail to astonish.

With apologies, I would just like to tack a note into this thread that the entire *field* of formal memory model proofs involving partially-overlapping atomics is a single paper (last I knew, https://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf). The paper says mixed-size, but the key point is not-perfectly-overlapping.

That one paper uncovered surprising new behaviors that don't exist in programs comprising only aligned atomics of matching sizes for all pairs that conflict. For example, because of this new behavior, the presence of misaligned atomics in a program can invalidate the proof that C++ programs with SC fences between every two atomic operations will admit only SC executions. Risking invalidating that proof supports the notion that they should be declared UB in C++ programs in the general sense (i.e. users definitely should not be writing this), though perhaps not UB on every imaginable back-end after optimizations (an argument you have made, I don't disagree with).

TL;DR: it's a bit early to declare that misaligned atomics are well-understood & just work, even where they usually fail to astonish.

Again, I would strongly resist any effort to define away overlapping atomics or unaligned atomics. This works today in practice on common architectures. We are reliant on this implementation working as it does today for overlapping atomics.

In LLVM IR, we have some clear answers:

Perfectly-overlapping accesses -- no matter what their alignment -- are safe. We can and do support correct unaligned access via lowering to an appropriate libcall (which may use a lock).

Partially-overlapping accesses of the same size, but offset by some bytes, are definitely not safe (due to potentially mixing locked and native atomics).

A mixture of access sizes where any may be too large to be lock-free is definitely not safe. (same reason)

And then the hard question: A set of mixed-size accesses, where each is naturally-aligned and lock-free.

I would prefer it if we could define away atomicity guarantees in the face of such accesses as well. While we may be able to carve out and define some cases where it is safe, it would be far simpler not to unless there's a great reason to.

Specifying that away because we happen to not be able to guarantee anything on other platforms is not acceptable.

Now, to be clear, I'm not arguing we can't change *spelling*. If you want to insist that overlapping atomics be done via intrinsics, I'll still argue with you because I think it's a bad design, but it's not a fundamental non-starter like defining away existing target behaviour is.

I assume you're now talking about IR generated from a frontend, not simply the internal handling of the element-wise mem intrinsics, correct? Can you explain more about what is reliant on the ability to do overlapping/mixed-size atomic accesses? What kinds of access patterns are you depending on? (e.g. what alignments, sizes, and memory orderings).