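This is caused by insufficient huge page memory being available on one or more of the allocated compute nodes. The above error happens more often with jobs that use the "-ss" option of the aprun command. It has been confirmed that the available huge pages are not evenly distributed among the 4 NUMA nodes on a compute node. Because "-ss" restricts each process to memory on its local NUMA node, a shortfall on a single NUMA node can be enough to make the job fail even when the compute node as a whole has huge pages free.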
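
To see how unevenly huge pages are spread across NUMA nodes, a small script along the following lines could be run on a compute node (for example, launched through aprun). This is only a sketch: it assumes the compute node kernel exposes the standard Linux per-node sysfs counters under /sys/devices/system/node and that the huge page size in use is 2 MB (the "hugepages-2048kB" directory). Both are assumptions, not confirmed details of this system, and the function name free_hugepages_per_node is hypothetical.

```python
from pathlib import Path


def free_hugepages_per_node(sysfs_root="/sys/devices/system/node",
                            page_dir="hugepages-2048kB"):
    """Return {node_name: free_huge_page_count} for each NUMA node found.

    sysfs_root and page_dir are assumptions: the standard Linux sysfs
    location and a 2 MB huge page size. Adjust page_dir if the job uses
    a different huge page size.
    """
    root = Path(sysfs_root)
    # Sort numerically so node10 does not come before node2.
    nodes = sorted(root.glob("node[0-9]*"), key=lambda p: int(p.name[4:]))
    counts = {}
    for node in nodes:
        counter = node / "hugepages" / page_dir / "free_hugepages"
        if counter.is_file():
            counts[node.name] = int(counter.read_text().strip())
    return counts


if __name__ == "__main__":
    for name, free in free_hugepages_per_node().items():
        print(f"{name}: {free} free huge pages")
```

A node whose count is much lower than the others is a likely candidate for the failure described above when "-ss" is in use.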

Workaround

The first workaround is to resubmit your batch job so that it launches on a different set of compute nodes. We monitor failed jobs and manually reboot the problem nodes. The second workaround is to omit the "-ss" option from the aprun command in your batch script; this option sometimes has a negative performance impact anyway, especially for hybrid MPI/OpenMP applications.