Bug Description

This was reported by a hardware partner. The system set up is a server with 512GB RAM and an M.2 NVMe drive as the root filesystem/boot device.

Per the customer, when running the certification Memory Stress test (utilizing several stress-ng stressors run in sequence) the system freezes with CPU Soft Lockup errors appearing on console whe the "stack" stressor is run.

So far, this only seems to affect the 4.15 kernel. The tester has tried using the 2.5" SATA SSD as the RootFS/Boot device and the tests pass on all attempts. It is ONLY when using the M.2 NVMe as the root / boot device that the tests cause a lockup. The tester is re-trying now with the 2.5" NVMe device to see if this only occurs with the M.2 NVMe.

The tester has tried this on the following while using the M.2 NVMe as the rootFS/Boot device:
Test run #1 – 16.04.5 at kernel 4.15; Result: Failed stress-ng memory on stack stressor
Test run #2 – 18.04.1 at kernel 4.15; Result: Failed stress-ng memory on stack stressor
Test run #3 – 16.04.5 at kernel 4.4; Result: Passed stress-ng memory test

The stress-ng command invoked at the time the soft lockups occur is this:

'stress-ng -k --aggressive --verify --timeout 300 --stack 0'

This can be reproduced by running the memory_stress_ng test script from the cert suite: