Details

There's apparently a race between fatbin destructors registered by us
and some internal calls registered by CUDA runtime from cudaRegisterFatbin.
Moving fatbin de-registration to atexit() was not sufficient to avoid crash in
CUDA runtime on exit when the runtime was linked statically, but CUDA
kernel was launched from a shared library.

Moving atexit() call to before we call cudaRegisterFatbin appears to work
with both statically and dynamically linked CUDA TUs.

Can this ever end up in a shared library? If yes, please use the normal logic for creating a global destructor. atexit is not very friendly to dlopen...

Yes, it can end up in a shared library. What would be the normal logic in this case?

We used to use regular global destructor, but has even worse issues. Alas, NVIDIA provides no documentation to how compiler-generated glue is expected to interact with CUDA runtime, so we need to guess what it wants.
NVCC-generated glue generates call to atexit(). If we use global destructors, then by the time they are executed, nvidia's runtime has already been deinitialized and our attempt to call it causes the crash.
Deregistering fatbin from atexit() works better, but apparently we still race with the runtime. calling atexit() before we register the fatbin appears to work for all combinations of {static/dynamic, kernel/runtime}.

Depends a bit on the platform, __cxa_atexit on most modern ELF systems, fallback to atexit. If the global dtor is run too late, it smells like a missing library dependency. They are executed in topological order after all.