Details

There is an issue where calls to bzero (memset(), etc) can be eliminated due to an optimizing compiler eliminating the call to bzero() (or memset(), etc) because the arguments to the call are not subsequently used by the function. The compiler can interpret this as "no side effects", and eliminate the call.

Link Time Optimization (LTO) is a problem for several implementations of explicit_bzero(). LTO enabled today with clang, it also works with several versions of GCC, and only the absence of WITH_LLD_IS_LLD=yes in src.conf is keeping LTO from being enabled in the linker with:

This style of implementation in FreeBSD (and by extension OpenBSD) today is flawed, because the compiler/linker is free to look into the
call tree and still eliminate the call. The issue shows up on LTO (today), but is not limited to systems where LTO is in-use.

I'd prefer we don't ditch support for the in-tree gcc compiler to get this. See my suggestions on how to accomplish that.
Sadly, there doesn't appear to be a #define for when the LTO linker is used, I guess because that's link time and not
run time.

I'd move this to after the clang / gcc block that you aded below. I'd condition it on #ifndef NOOPT.
I'd also include the explicit_bzero definition from below.
This will support old compilers that don't do LTO as well as new that can / do.

I don't think that this approach will work, because noinline does not do what you might expect. The compiler will not inline the function, but it can still look inside it. In particular, it will determine that it does not read or write any memory other than the arguments and propagate these attributes to anything that calls it. This will happen even with the ThinLTO model that Google is slowly upstreaming to LLVM.

In particular, if this function is called immediately before a free(), then the compiler is free to elide it and leave the sensitive data in the free list.

Note that the version using memory barriers can be made portable by using the relevant memory barrier function (memory order: sequentially consistent) from <stdatomic.h>.

The simplest way to prevent the compiler eliding the stores would be to have an extern volatile void* variable (defined in an assembly file, so that the compiler can never see it) and so assign the buf variable to it before doing the bzero. The compiler is not permitted to elide volatile stores to variables that it can not guarantee that it can see the entire lifetime of, because they may be device memory. This will guarantee that the compiler believes that the pointer has escaped and so it then may not elide any stores to it. This does not establish a happens-before relationship with any other threads, so the compiler should still be able to emit efficient code for everything.

Hi Jim,
I don't think that this approach will work, because noinline does not do what you might expect. The compiler will not inline the function, but it can still look inside it. In particular, it will determine that it does not read or write any memory other than the arguments and propagate these attributes to anything that calls it. This will happen even with the ThinLTO model that Google is slowly upstreaming to LLVM.
In particular, if this function is called immediately before a free(), then the compiler is free to elide it and leave the sensitive data in the free list.
Note that the version using memory barriers can be made portable by using the relevant memory barrier function (memory order: sequentially consistent) from <stdatomic.h>.
The simplest way to prevent the compiler eliding the stores would be to have an extern volatile void* variable (defined in an assembly file, so that the compiler can never see it) and so assign the buf variable to it before doing the bzero. The compiler is not permitted to elide volatile stores to variables that it can not guarantee that it can see the entire lifetime of, because they may be device memory. This will guarantee that the compiler believes that the pointer has escaped and so it then may not elide any stores to it. This does not establish a happens-before relationship with any other threads, so the compiler should still be able to emit efficient code for everything.

A key design goal throughout this code is avoiding the use of globals that will introduce cache misses / contention. Is there a way to accomplish this goal without introducing an extraneous global write? (I don't think we can mark local variables as extern? Or can we?)

Not really. The really correct solution is to introduce a __builtin_secure_bzero() that the compiler knows it is now allowed to elide. Anything else is trying to trick the compiler into thinking that it doesn't know what is happening.

Another good solution would be to provide a free-and-erase function as part of jemalloc (and make sure that it calls a function that free calls, not free itself.

Hi Jim,
I don't think that this approach will work, because noinline does not do what you might expect. The compiler will not inline the function, but it can still look inside it. In particular, it will determine that it does not read or write any memory other than the arguments and propagate these attributes to anything that calls it. This will happen even with the ThinLTO model that Google is slowly upstreaming to LLVM.

The "noinline" is there to turn off inlining, because inlining ("always_inline") is incompatible with optnone.

clang's VerifyFunctionAttrs() (in ./contrib/llvm/lib/IR/Verifier.cpp) requires that noinline be 'on' when optnone is used.

Please let me know if you think 'optnone' isn't workable.

Note that the version using memory barriers can be made portable by using the relevant memory barrier function (memory order: sequentially consistent) from <stdatomic.h>.

If memory barriers is your preferred approach, then I will rework the patch.

The simplest way to prevent the compiler eliding the stores would be to have an extern volatile void* variable (defined in an assembly file, so that the compiler can never see it) and so assign the buf variable to it before doing the bzero. The compiler is not permitted to elide volatile stores to variables that it can not guarantee that it can see the entire lifetime of, because they may be device memory. This will guarantee that the compiler believes that the pointer has escaped and so it then may not elide any stores to it. This does not establish a happens-before relationship with any other threads, so the compiler should still be able to emit efficient code for everything.

variant is the same, but has flags to force the call to occur even
if the compiler thinks it is pointless.
This makes
.Fn explicit_bzero
useful for clearing memory with sensitive contents like a password.

I'm not sure that NOOPT is needed here. Even turning this into a memset / bzero call should be fine for the compiler to do, if this function is not inlined. It should also be safe for LLVM to replace a call to explicit_bzero with an LLVM memset intrinsic with the volatile flag set, and so generate an optimised memset sequence that can't be removed.

@jim_netgate.com, does this trick work for both clang and gcc (both base and newer)? If so, I think this would be fine to commit. explicit_bzero could even be a macro! (And there might be a corresponding explicit_memset too, but I digress.)