#[inline]
fn kind(&self) -> usize {
    // This function is probably going to raise some eyebrows. It
    // determines whether the buffer is stored inline by checking the
    // least significant bit in the `arc` field, and returns the kind.
    //
    // Now, you may notice that `arc` is an `AtomicPtr` and this is
    // accessing it as a normal field without performing an atomic load...
    //
    // Again, the function only cares about the least significant bit, and
    // this bit is set when `Inner` is created and never changed after that.
    // All platforms have atomic "word" operations and won't randomly flip
    // bits, so even without any explicit atomic operations, reading the
    // flag will be correct.
    //
    // This function is performance-critical, as it is called for every
    // operation. Performing an atomic load would mess with the compiler's
    // ability to optimize. Simple benchmarks show up to a 10% slowdown
    // using a `Relaxed` atomic load on x86.
    #[cfg(target_endian = "little")]
    #[inline]
    fn imp(arc: &AtomicPtr<Shared>) -> usize {
        unsafe {
            let p: &u8 = mem::transmute(arc);
            (*p as usize) & KIND_MASK
        }
    }
    #[cfg(target_endian = "big")]
    #[inline]
    fn imp(arc: &AtomicPtr<Shared>) -> usize {
        unsafe {
            let p: &usize = mem::transmute(arc);
            *p & KIND_MASK
        }
    }
    imp(&self.arc)
}

This got flagged by miri because of the transmute from an &UnsafeCell<T> to an &T (which is not okay, because the shared pointer expects to point to something frozen). But that’s not even the problem: The justification for the non-atomic access here just doesn’t hold. The C++11 model and also LLVM’s model say that a read-write race is either UB or yields undef, but it certainly won’t give you correct values for some bits. LLVM’s undef-on-race is explicitly happening on a byte-for-byte basis.

Any thoughts?

@carllerche One thing I am wondering, have you benchmarked using a relaxed load? EDIT: D’oh, you did, sorry for not reading properly. That is somewhat surprising, in particular on x86, but I guess LLVM is just too conservative around relaxed accesses.
Also, from the comment in your code, it is not clear to me whether you are saying that this is not UB, or whether you are saying that this is UB but does not cause trouble in practice. I am convinced that, if anything, it is the latter – and the comment should clearly say so.
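For concreteness, the relaxed-load variant discussed above could look roughly like this (a sketch only: KIND_MASK's value and the Shared placeholder are assumptions here; the field and function names follow the snippet quoted at the top):

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

const KIND_MASK: usize = 0b11; // assumed value, for illustration only

struct Shared; // placeholder for the crate's internal type

struct Inner {
    arc: AtomicPtr<Shared>,
}

impl Inner {
    #[inline]
    fn kind(&self) -> usize {
        // A relaxed load is sufficient for correctness: only the low bits
        // matter, and they never change after construction. This is the
        // variant that reportedly benchmarked up to 10% slower on x86.
        self.arc.load(Ordering::Relaxed) as usize & KIND_MASK
    }
}
```

Besides removing the data race, this also avoids the reference transmute that miri flagged.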

Potentially relevant (including the preceding stuff). TL;DR this kind of trick would also be useful to optimize atomic into non-atomic reference counting when the object hasn’t ever escaped its creating thread, with a dynamic escape check; there is a paper investigating the performance impact for Swift (it is significant), and it seems to basically just assume that it’s sound. (I don’t know whether this is directly relevant to Rust – it’s at least not remotely obvious to me how one could go about adding/requiring those write barriers in library code.)

Are the problems all at the level of memory models? Is there anything where this would cause a problem at the level of hardware?

I believe this sort of trickery could be OK under the Java memory model, which banishes undefined behavior but otherwise does not guarantee much about shared non-volatile memory. The LLVM memory model has a corresponding ordering, unordered, that's weaker than relaxed ordering (which LLVM calls monotonic). It still comes with some optimizer restrictions, though. On the other hand, I am not sure whether the observed slowdown of relaxed access is inherent or just some passes being more conservative than they would have to be.

The read still sees a whole bunch of writes to pick from, at least one of them non-atomic. Some of those writes might happen-after the initial write, but they don’t happen-before the non-atomic read, so both the initial write and the atomic write are still candidates.

It could pick from every single possible write in the history and still be correct as long as bits don’t get randomly set. Every single write ensures to not set the bits that get checked by is_inline_or_static.

I get that. That would require two changes to the LLVM model:

Define data races and reading undef on a per-bit instead of per-byte basis. I do not foresee any problems in principle with this, but could be missing something.

Define non-atomic reads that can read from multiple identical writes to just use that value. So undef only happens where there are at least two different writes that the non-atomic read could pick from. This is the part where I am worried about losing optimizations. I just don’t know them well enough to estimate why the rule here is the way it is right now.

I am approaching this not from the “what does LLVM do” perspective, but from the “what does LLVM promise” perspective. This may seem like language lawyering, but as a formal methods person I am more interested in figuring out if it is possible to write the “contract” between you and the compiler in a way that it covers your use-case, than in determining if LLVM currently happens to do the thing you want it to do. As mentioned above, I cannot currently come up with an optimization that would break your code, but there is a large and interesting space of programs that are UB but are not getting miscompiled (where the latter is based on experimental evidence and being unable to come up with a counter-example, not an in-depth analysis of the compiler), and you provided a very nice example for such a program.

carllerche:

I just don’t see how the code can’t work unless LLVM injects made up values.

The thing is, LLVM is perfectly allowed to inject made-up values if it can show that that doesn’t change program behavior. I gave some examples of that above. And while that sounds like a stupid thing to do, there are circumstances where it can be beneficial to “speculatively” write some value (moving a write out of a conditional branch), and that can look a lot like injecting made-up values (because if we only look at the one execution trace where the speculation is wrong, that speculative write does come out of nowhere).
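To make that concrete, here is a toy illustration of such a speculative write. It is entirely hypothetical, not something rustc or LLVM is claimed to do to this code, and in surface Rust an &mut i32 cannot be raced anyway; it only models what a backend could in principle do to the underlying memory:

```rust
// Before: the location is written only when `cond` holds.
fn original(p: &mut i32, cond: bool) {
    if cond {
        *p = 42;
    }
}

// After a hypothetical speculative-store transformation: the store is
// hoisted out of the branch and undone if the speculation was wrong.
// Single-threaded behavior is identical, but looking only at the trace
// where `cond` is false, the transient 42 is a "made-up" value that a
// racing non-atomic reader could observe.
fn transformed(p: &mut i32, cond: bool) {
    let old = *p;
    *p = 42;
    if !cond {
        *p = old;
    }
}
```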

I’ve too often been surprised by what compilers do to just assume that they won’t do horrendously strange-looking transformations.^^

When executing line 10, we see two possible writes to pick from (the ones from lines 3 and 5). Both have the same value. However, LLVM can optimize the first thread to remove line 3: That write is redundant because it gets overwritten in line 5, and the fact that there is a release write does not matter because lines 3 and 5 are non-atomic accesses.
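The example being referred to is not quoted in this excerpt; the following sketch is only a guess at its shape. The "line 3"/"line 5" stores and the final read are modeled with Relaxed atomics so the sketch itself is race-free; the discussion concerns the same pattern with plain non-atomic accesses:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn demo() -> usize {
    let data = Arc::new(AtomicUsize::new(0));
    let flag = Arc::new(AtomicBool::new(false));
    let writer = {
        let (d, f) = (Arc::clone(&data), Arc::clone(&flag));
        thread::spawn(move || {
            d.store(1, Ordering::Relaxed);    // "line 3": first write of 1
            f.store(true, Ordering::Release); // the release write in between
            d.store(1, Ordering::Relaxed);    // "line 5": same value again
        })
    };
    // "line 10": if the flag is observed, both candidate writes store 1, so
    // the value read is 1 either way. With non-atomic stores, however, the
    // compiler may delete the "line 3" store as redundant (the release in
    // between does not protect a non-atomic store), leaving the read racing
    // with only the "line 5" store.
    let seen = if flag.load(Ordering::Acquire) {
        data.load(Ordering::Relaxed)
    } else {
        0
    };
    writer.join().unwrap();
    seen
}
```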

We can almost, but not quite, get there with bytes. Something not too dissimilar from the above situation can still arise with bytes. Like, when I do

and when the compiler inlines all of that and reuses the same storage for the second buffer, the buf.inner.arc will first be KIND_INLINE and then later KIND_VEC (matching lines 2 and 3 of the example above). If we then send a pointer to the buf away, that will look like lines 4 and 9.

The one thing we do not get, I think, is line 5. We can, I guess, trigger an atomic write to buf.inner.arc by doing something with buf that would make it change the pointer, but I am not familiar enough with this data structure to say what that would be, and I am also not sure if the optimization mentioned above is still legal when line 5 is changed to be atomic (I asked my colleague).

Still, the fact that you can overwrite an existing BytesMut with a new one means that the KIND of one of these pointers can change, if you consider the entire lifetime of the location it is stored into (which you have to). So it’s not like all writes have the same KIND, just the ones that the non-atomic read can choose from – and as the example above shows, that on its own is not sufficient to make your argument work. It might be that BytesMut does something else that would be a sufficient condition, but it’s certainly more complex than we thought.
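A stdlib-only toy of that last observation (the tag constants are made up for illustration and are not the crate's real layout): each individual representation keeps its own kind stable, but the kind held by one storage location changes across its lifetime.

```rust
const KIND_MASK: usize = 0b11;
const KIND_INLINE: usize = 0b01; // assumed inline tag
const KIND_VEC: usize = 0b00;    // assumed: an aligned pointer has zero low bits

fn kind(arc_field: usize) -> usize {
    arc_field & KIND_MASK
}

// One storage location across its lifetime: it first holds an inline-tagged
// word, then (after the buffer stored there is replaced by a vec-backed one)
// an aligned pointer with a different kind.
fn lifetime_of_one_slot() -> (usize, usize) {
    let mut slot: usize = KIND_INLINE; // small buffer stored inline
    let before = kind(slot);
    slot = 0x1000; // stand-in for a real, 4-byte-aligned pointer value
    let after = kind(slot);
    (before, after)
}
```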

You are correct that it is a data race according to the LLVM model. It is assuming that, in practice, no chip will randomly set the bits to random values.

Ehm… what this actually is assuming is that LLVM will never be able to tell that this code has an unconditional data-race in a multi-threaded context and can therefore not be reached along with all code leading to it.

We should definitely file a missed-optimization bug against LLVM with this example (spawning some threads), because all of it should compile to nothing.

We should also find out a way to make @carllerche’s example not have UB and still produce efficient code. The unordered ordering definitely looks worth exploring.

Ehm… what this actually is assuming is that LLVM will never be able to tell that this code has an unconditional data-race in a multi-threaded context and can therefore not be reached along with all code leading to it.

Yes pretty much…

I thought more about it, and there is a way that I could probably restructure the internals of Bytes to avoid the race, but it would take a large chunk of work. Unless, in practice, there is a threat today, I don’t think I’ll be able to prioritize the work.