stress-ng man page

stress-ng will stress test a computer system in various selectable ways. It was designed to exercise various physical subsystems of a computer as well as the various operating system kernel interfaces. stress-ng also has a wide range of CPU specific stress tests that exercise floating point, integer, bit manipulation and control flow.

stress-ng was originally intended to make a machine work hard and trip hardware issues such as thermal overruns as well as operating system bugs that only occur when a system is being thrashed hard. Use stress-ng with caution as some of the tests can make a system run hot on poorly designed hardware and also can cause excessive system thrashing which may be difficult to stop.

stress-ng can also measure test throughput rates; this can be useful to observe performance changes across different operating system releases or types of hardware. However, it has never been intended to be used as a precise benchmark test suite, so do NOT use it in this manner.

Running stress-ng with root privileges will adjust out of memory settings on Linux systems to make the stressors unkillable in low memory situations, so use this judiciously. With the appropriate privilege, stress-ng can allow the ionice class and ionice levels to be adjusted; again, this should be used with care.

One can specify the number of processes to invoke per type of stress test; specifying a negative or zero value will select the number of processors available as defined by sysconf(_SC_NPROCESSORS_CONF).

enables more file, cache and memory aggressive options. This may slow tests down, increase latencies and reduce the number of bogo ops as well as change the balance of user time vs system time used depending on the type of stressor being used.

specify the class of stressors to run. Stressors are classified into one or more of the following classes: cpu, cpu-cache, device, io, interrupt, filesystem, memory, network, os, pipe, scheduler and vm. Some stressors fall into just one class. For example, the 'get' stressor is just in the 'os' class. Other stressors fall into more than one class, for example, the 'lsearch' stressor falls into the 'cpu', 'cpu-cache' and 'memory' classes as it exercises all three. Selecting a specific class will run all the stressors that fall into that class only when run with the --sequential option.

Specifying a name followed by a question mark (for example --class vm?) will print out all the stressors in that specific class.

by default, stress-ng will attempt to change the name of the stress processes according to their functionality; this option disables that behaviour and keeps the process name the same as that of the parent process, that is, stress-ng.

by default stress-ng will report the name of the program, the message type and the process id as a prefix to all output. The --log-brief option will output messages without these fields to produce a less verbose output.

output number of bogo operations in total performed by the stress processes. Note that these are not a reliable metric of performance or throughput and have not been designed to be used for benchmarking whatsoever. The metrics are just a useful way to observe how a system behaves when under various kinds of load.

The following columns of information are output:

Column Heading

Explanation

bogo ops

number of iterations of the stressor during the run. This is a metric of how much overall "work" has been achieved in bogo operations.

real time (secs)

average wall clock duration (in seconds) of the stressor. This is the total wall clock time of all the instances of that particular stressor divided by the number of these stressors being run.

usr time (secs)

total user time (in seconds) consumed running all the instances of the stressor.

sys time (secs)

total system time (in seconds) consumed running all the instances of the stressor.

bogo ops/s (real time)

total bogo operations per second based on wall clock run time. The wall clock time reflects the apparent run time. The more processors a system has, the more the workload can be distributed across them, reducing the wall clock time and increasing the bogo ops rate. This is essentially the "apparent" bogo ops rate of the system.

bogo ops/s (usr+sys time)

total bogo operations per second based on cumulative user and system time. This is the real bogo ops rate of the system, taking into consideration the actual execution time of the stressor across all the processors. Generally this will decrease as one adds more concurrent stressors due to contention on cache, memory, execution units, buses and I/O devices.

from version 0.02.26 stress-ng automatically calls madvise(2) with random advice options before each mmap and munmap to stress the vm subsystem a little harder. The --no-advise option turns this default off.
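
As an illustrative sketch (not the actual stress-ng source), applying a randomly chosen advice to a fresh anonymous mapping might look like this:

    #include <stdlib.h>
    #include <sys/mman.h>

    /* map an anonymous region and apply a random madvise(2) advice */
    static void *map_with_random_advice(size_t len)
    {
        static const int advice[] = {
            MADV_NORMAL, MADV_RANDOM, MADV_SEQUENTIAL,
            MADV_WILLNEED, MADV_DONTNEED,
        };
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

        if (p != MAP_FAILED)
            (void)madvise(p, len,
                advice[rand() % (int)(sizeof(advice) / sizeof(advice[0]))]);
        return p;
    }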

Do not seed the stress-ng pseudo-random number generator with a quasi random start seed, but instead seed it with constant values. This forces tests to run each time using the same start conditions, which can be useful when one requires reproducible stress tests.

Do not respawn a stressor if it gets killed by the Out-of-Memory (OOM) killer. The default behaviour is to restart a new instance of a stressor if the kernel OOM killer terminates the process. This option disables this default behaviour.

touch allocated pages that are not in core, forcing them to be paged back in. This is a useful option to force all the allocated pages to be paged in when using the bigheap, mmap and vm stressors. It will severely degrade performance when the memory in the system is less than the allocated buffer sizes. This uses mincore(2) to determine the pages that are not in core and hence need touching to page them back in.
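
A minimal sketch of this technique, assuming the caller supplies a vec array with at least one byte per page:

    #include <unistd.h>
    #include <sys/mman.h>

    /* touch every page that mincore(2) reports as not resident */
    static void page_in(char *buf, size_t len, unsigned char *vec)
    {
        const size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t i;

        if (mincore(buf, len, vec) == 0)
            for (i = 0; i < len / page; i++)
                if (!(vec[i] & 1))                 /* page not resident */
                    *(volatile char *)(buf + i * page);  /* force page in */
    }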

enable stressors that are known to hang systems. Some stressors can quickly consume resources in such a way that they can rapidly hang a system before the kernel can OOM kill them. These stressors are not enabled by default, this option enables them, but you probably don't want to do this. You have been warned.

measure processor and system activity using perf events. Linux only and caveat emptor, according to perf_event_open(2): "Always double-check your results! Various generalized events have had wrong values.". Note that with Linux 4.7 one needs to have CAP_SYS_ADMIN capabilities for this option to work, or adjust /proc/sys/kernel/perf_event_paranoid to below 2 to use this without CAP_SYS_ADMIN.

sequentially run all the stressors one by one for a default of 60 seconds. The number of instances of each of the individual stressors to be started is N. If N is less than zero, then the number of CPUs online is used for the number of instances. If N is zero, then the number of CPUs in the system is used. Use the --timeout option to specify the duration to run each stressor.

set CPU affinity based on the list of CPUs provided; stress-ng is bound to just use these CPUs (Linux only). The CPUs to be used are specified by a comma separated list of CPU numbers (0 to N-1). One can specify a range of CPUs using '-', for example: --taskset 0,2-3,6,7-11

This can only be used when running on Linux and with root privilege. This option starts a background thrasher process that works through all the processes on a system and tries to page in as many pages of those processes as possible. This will cause a considerable amount of thrashing of swap on an over-committed system.

adjust the per process timer slack to N nanoseconds (Linux only). Increasing the timer slack allows the kernel to coalesce timer events by adding some fuzziness to timer expiration times and hence reduce wakeups. Conversely, decreasing the timer slack will increase wakeups. A value of 0 for the timer-slack will set the system default of 50,000 nanoseconds.
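
A minimal sketch of how a process can adjust its own timer slack on Linux:

    #include <sys/prctl.h>

    /* set the current thread's timer slack; 0 restores the default */
    static int set_timer_slack(unsigned long nanoseconds)
    {
        return prctl(PR_SET_TIMERSLACK, nanoseconds, 0, 0, 0);
    }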

show the cumulative user and system times of all the child processes at the end of the stress run. The percentage of utilisation of available CPU time is also calculated from the number of on-line CPUs in the system.

specify a list of one or more stressors to exclude (that is, do not run them). This is useful to exclude specific stressors when one selects many stressors to run using the --class, --sequential, --all or --random options. For example, to run the cpu class stressors concurrently and exclude the numa and search stressors, one could use:
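
    stress-ng --class cpu --all 1 --exclude numa,search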

start N workers that issue multiple small asynchronous I/O writes and reads on a relatively small temporary file using the POSIX aio interface. This will just hit the file system cache and soak up a lot of user and kernel time in issuing and handling I/O requests. By default, each worker process will handle 16 concurrent I/O requests.

start N workers that exercise various parts of the AppArmor interface. Currently one needs root permission to run this particular test. This test is only available on Linux systems with AppArmor support.

start N workers that exercise various GCC __atomic_*() built in operations on 8, 16, 32 and 64 bit integers that are shared among the N workers. This stressor is only available for builds using GCC 4.7.4 or higher. The stressor forces many front end cache stalls and cache references.

start N workers that grow their heaps by reallocating memory. If the out of memory killer (OOM) on Linux kills the worker or the allocation fails then the allocating process starts all over again. Note that the OOM adjustment for the worker is set so that the OOM killer will treat these workers as the first candidate processes to kill.

start N workers that repeatedly bind mount / to / inside a user namespace. This can consume resources rapidly, forcing out of memory situations. Do not use this stressor unless you want to risk hanging your machine.

start N workers that grow the data segment by one page at a time using multiple brk(2) calls. Each successfully allocated new page is touched to ensure it is resident in memory. If an out of memory condition occurs then the test will reset the data segment to the point before it started and repeat the data segment resizing over again. The process adjusts the out of memory setting so that it may be killed by the out of memory (OOM) killer before other processes. If it is killed by the OOM killer then it will be automatically re-started by a monitoring parent process.

do not touch each newly allocated data segment page. This disables the default of touching each newly allocated page and hence avoids forcing the kernel to back the page with real physical memory.

start N workers that binary search a sorted array of 32 bit integers using bsearch(3). By default, there are 65536 elements in the array. This is a useful method to exercise random access of memory and processor cache.
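
A minimal sketch of the kind of lookup being exercised (the helper names here are illustrative):

    #include <stdint.h>
    #include <stdlib.h>

    /* comparator for 32 bit integers, avoiding subtraction overflow */
    static int cmp_i32(const void *a, const void *b)
    {
        int32_t x = *(const int32_t *)a, y = *(const int32_t *)b;
        return (x > y) - (x < y);
    }

    static int32_t *find(const int32_t *sorted, size_t n, int32_t key)
    {
        return bsearch(&key, sorted, n, sizeof(*sorted), cmp_i32);
    }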

start N workers that perform random wide spread memory read and writes to thrash the CPU cache. The code does not intelligently determine the CPU cache configuration and so it may be sub-optimal in producing hit-miss read/write activity for some processors.

start N workers that change the file mode bits via chmod(2) and fchmod(2) on the same file. The greater the value for N then the more contention on the single file. The stressor will work through all the combinations of mode bits.

start N workers exercising clocks and POSIX timers. For all known clock types this will exercise clock_getres(2), clock_gettime(2) and clock_nanosleep(2). For all known timers it will create a 50000ns timer and busy poll this until it expires. This stressor will cause frequent context switching.

start N workers that create clones (via the clone(2) system call). This will rapidly try to create a default of 8192 clones that immediately die and wait in a zombie state until they are reaped. Once the maximum number of clones is reached (or clone fails because one has reached the maximum allowed) the oldest clone thread is reaped and a new clone is then created in a first-in first-out manner, and then repeated. A random clone flag is selected for each clone to try to exercise different clone operations. The clone stressor is a Linux only option.

start N workers that run three threads that use swapcontext(3) to implement the thread-to-thread context switching. This exercises rapid process context saving and restoring and is bandwidth limited by register and memory save and restore rates.

start N stressors that copy a file using the Linux copy_file_range(2) system call. 2MB chunks of data are copied from random locations in one file to random locations in a destination file. By default, the files are 256 MB in size. Data is sync'd to the filesystem after each copy_file_range(2) call.

start N workers exercising the CPU by sequentially working through all the different CPU stress methods. Instead of exercising all the CPU stress methods, one can specify a specific CPU stress method with the --cpu-method option.

load CPU with P percent loading for the CPU stress workers. 0 is effectively a sleep (no load) and 100 is full loading. The loading loop is broken into compute time (load%) and sleep time (100% - load%). Accuracy depends on the overall load of the processor and the responsiveness of the scheduler, so the actual load may be different from the desired load. Note that the number of bogo CPU operations may not be linearly scaled with the load as some systems employ CPU frequency scaling and so heavier loads produce an increased CPU frequency and greater CPU bogo operations.

Note: This option only applies to the --cpu stressor option and not to all of the cpu class of stressors.

note - this option is only useful when --cpu-load is less than 100%. The CPU load is broken into multiple busy and idle cycles. Use this option to specify the duration of a busy time slice. A negative value for S specifies the number of iterations to run before idling the CPU (e.g. -30 invokes 30 iterations of a CPU stress loop). A zero value selects a random busy time between 0 and 0.5 seconds. A positive value for S specifies the number of milliseconds to run before idling the CPU (e.g. 100 keeps the CPU busy for 0.1 seconds). Specifying small values for S leads to small time slices and smoother scheduling. Setting --cpu-load as a relatively low value and --cpu-load-slice to be large will cycle the CPU between long idle and busy cycles and exercise different CPU frequencies. The thermal range of the CPU is also cycled, so this is a good mechanism to exercise the scheduler, frequency scaling and passive/active thermal cooling mechanisms.

Note: This option only applies to the --cpu stressor option and not to all of the cpu class of stressors.
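
A simplified sketch of the busy/idle duty-cycle loop described above (the 100 ms slice here is illustrative and not stress-ng's exact slice handling):

    #include <time.h>
    #include <unistd.h>

    /* busy for load% of a 100 ms slice, then sleep the remainder */
    static void load_slice(int load)
    {
        const long slice_us = 100000;
        const long busy_us = slice_us * load / 100;
        struct timespec t0, t;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        do {                                      /* busy portion */
            clock_gettime(CLOCK_MONOTONIC, &t);
        } while ((t.tv_sec - t0.tv_sec) * 1000000L +
                 (t.tv_nsec - t0.tv_nsec) / 1000L < busy_us);
        usleep((useconds_t)(slice_us - busy_us)); /* idle portion */
    }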

specify a cpu stress method. By default, all the stress methods are exercised sequentially, however one can specify just one method to be used if required. Available cpu stress methods are described as follows:

longdouble

1000 iterations of a mix of long double precision floating point operations

loop

simple empty loop

matrixprod

matrix product of two 128 × 128 matrices of double floats. Testing on 64 bit x86 hardware shows that this provides a good mix of memory, cache and floating point operations and is probably the best CPU method to use to make a CPU run hot.

parity

compute parity using various methods from the Stanford Bit Twiddling Hacks. Methods employed are: the naïve way, the naïve way with the Brian Kernighan bit counting optimisation, the multiply way, the parallel way, and the lookup table ways (2 variations). A sketch of the parallel way follows this method list.

phi

compute the Golden Ratio ϕ using a series

pi

compute π using the Srinivasa Ramanujan fast convergence algorithm

pjw

128 rounds of the hash pjw function on random strings of 128 down to 1 bytes

prime

find all the primes in the range 1..1000000 using a slightly optimised brute force naïve trial division search

psi

compute ψ (the reciprocal Fibonacci constant) using the sum of the reciprocals of the Fibonacci numbers

queens

compute all the solutions of the classic 8 queens problem for board sizes 1..12

union

perform integer arithmetic on a mix of bit fields in a C union. This exercises how well the compiler and CPU can perform integer bit field loads and stores.

zeta

compute the Riemann Zeta function ζ(s) for s = 2.0..10.0

Note that some of these methods try to exercise the CPU with computations found in some real world use cases. However, the code has not been optimised on a per-architecture basis, so may be sub-optimal compared to hand-optimised code used in some applications. They do try to represent the typical instruction mixes found in these use cases.
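
As an illustration of the 'parallel way' used by the parity method, a minimal sketch (not stress-ng's exact code):

    #include <stdint.h>

    /* parity of v: 1 if an odd number of bits are set, else 0 */
    static unsigned int parity32(uint32_t v)
    {
        v ^= v >> 16;
        v ^= v >> 8;
        v ^= v >> 4;
        return (0x6996 >> (v & 0xf)) & 1;
    }

The constant 0x6996 acts as a 16 entry lookup table of nibble parities packed into one word.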

start N workers that each create a daemon that dies immediately after creating another daemon and so on. This effectively works through the process table with short lived processes that do not have a parent and are waited for by init. This puts pressure on init to do rapid child reaping. The daemon processes perform the usual mix of calls to turn into typical UNIX daemons, so this artificially mimics very heavy daemon system stress.

start N workers that send and receive data using the Datagram Congestion Control Protocol (DCCP) (RFC4340). This involves a pair of client/server processes performing rapid connects, sends, receives and disconnects on the local host.

by default, messages are sent using send(2). This option allows one to specify the sending method using send(2), sendmsg(2) or sendmmsg(2). Note that sendmmsg is only available for Linux systems that support this system call.

start N workers that create and remove directory entries. This should create file system meta data activity. The directory entry names are suffixed by a gray-code encoded number to try to mix up the hashing of the namespace.

specify unlink order of dentries, can be one of forward, reverse, stride or random. By default, dentries are unlinked in random order. The forward order will unlink them from first to last, reverse order will unlink them from last to first, stride order will unlink them by stepping through them in a quasi-random stride pattern, and random order will randomly select one of the forward, reverse or stride orders.

start N workers that perform dup(2) and then close(2) operations on /dev/zero. The maximum opens at one time is system defined, so the test will run up to this maximum, or 65536 open file descriptors, whichever comes first.

start N workers that perform various related socket stress activity using epoll_wait(2) to monitor and handle new connections. This involves client/server processes performing rapid connects, sends/receives and disconnects on the local host. Using epoll allows a large number of connections to be efficiently handled, however, this can lead to the connection table filling up and blocking further socket connections, hence impacting on the epoll bogo op stats. For ipv4 and ipv6 domains, multiple servers are spawned on multiple ports. The epoll stressor is for Linux only.

create P child processes that exec stress-ng and then wait for them to exit per iteration. The default is just 1; higher values will create many temporary zombie processes that are waiting to be reaped. One can potentially fill up the process table using high values for --exec-max and --exec.

start N workers continually fallocating (preallocating file space) and ftruncating (file truncating) temporary files. If the file is larger than the free space, fallocate will produce an ENOSPC error which is ignored by this stressor.

specify the size of the fiemap'd file in bytes. One can specify the size as % of free space on the file system or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g. Larger files will contain more extents, causing more stress when gathering extent information.

start N workers that exercise file creation using various length filenames containing a range of allowed filename characters. This will try to see if it can exceed the file system allowed filename length as well as test various filename lengths between 1 and the maximum allowed by the file system.

create P child processes and then wait for them to exit per iteration. The default is just 1; higher values will create many temporary zombie processes that are waiting to be reaped. One can potentially fill up the process table using high values for --fork-max and --fork.

start N workers that generate floating point exceptions. Computations are performed to force and check for the FE_DIVBYZERO, FE_INEXACT, FE_INVALID, FE_OVERFLOW and FE_UNDERFLOW exceptions. EDOM and ERANGE errors are also checked.

start N workers that exercise /dev/full. This attempts to write to the device (which should always get error ENOSPC), to read from the device (which should always return a buffer of zeros) and to seek randomly on the device (which should always succeed). (Linux only).

start N workers that rapidly exercise the futex system call. Each worker has two processes, a futex waiter and a futex waker. The waiter waits with a very small timeout to stress the timeout and rapid polled futex waiting. This is a Linux specific stress option.

start N workers continually writing, reading and removing temporary files. The default mode is to stress test sequential writes and reads. With the --aggressive option enabled and without any --hdd-opts options, the hdd stressor will work through all the --hdd-opts options one by one to cover a range of I/O options.

specify various stress test options as a comma separated list. Options are as follows:

Option

Description

direct

try to minimize cache effects of the I/O. File I/O writes are performed directly from user space buffers and synchronous transfer is also attempted. To guarantee synchronous I/O, also use the sync option.

dsync

ensure output has been transferred to underlying hardware and file metadata has been updated (using the O_DSYNC open flag). This is equivalent to each write(2) being followed by a call to fdatasync(2). See also the fdatasync option.

fadv-dontneed

advise kernel to expect the data will not be accessed in the near future.

fadv-noreuse

advise kernel to expect the data to be accessed only once.

fadv-normal

advise kernel there is no explicit access pattern for the data. This is the default advice assumption.

fadv-rnd

advise kernel to expect random access patterns for the data.

fadv-seq

advise kernel to expect sequential access patterns for the data.

fadv-willneed

advise kernel to expect the data to be accessed in the near future.

fsync

flush all modified in-core data after each write to the output device using an explicit fsync(2) call.

fdatasync

similar to fsync, but do not flush the modified metadata unless metadata is required for later data reads to be handled correctly. This uses an explicit fdatasync(2) call.

iovec

use readv/writev multiple buffer I/Os rather than read/write. Instead of 1 read/write operation, the buffer is broken into an iovec of 16 buffers.

noatime

do not update the file last access timestamp, this can reduce metadata writes.

sync

ensure output has been transferred to underlying hardware (using the O_SYNC open flag). This is equivalent to each write(2) being followed by a call to fsync(2). See also the fsync option.

rd-rnd

read data randomly. By default, written data is not read back, however, this option will force it to be read back randomly.

rd-seq

read data sequentially. By default, written data is not read back, however, this option will force it to be read back sequentially.

syncfs

write all buffered modifications of file metadata and data on the filesystem that contains the hdd worker files.

utimes

force update of file timestamp which may increase metadata writes.

wr-rnd

write data randomly. The wr-seq option cannot be used at the same time.

wr-seq

write data sequentially. This is the default if no write modes are specified.

Note that some of these options are mutually exclusive, for example, there can be only one method of writing or reading. Also, fadvise flags may be mutually exclusive, for example fadv-willneed cannot be used with fadv-dontneed. A sketch of how the sync-related options map onto open(2) flags and explicit sync calls follows this list.
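
For illustration only (the file name and flags here are illustrative, not stress-ng's internals), the dsync and fsync options correspond to code of this shape:

    #include <fcntl.h>
    #include <unistd.h>

    /* O_DSYNC: each write(2) behaves as if followed by fdatasync(2) */
    static int open_dsync(const char *path)
    {
        return open(path, O_WRONLY | O_CREAT | O_DSYNC, 0600);
    }

    /* fsync option: flush data and metadata after each write */
    static void write_fsync(int fd, const void *buf, size_t len)
    {
        if (write(fd, buf, len) == (ssize_t)len)
            (void)fsync(fd);
    }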

start N workers that search an 80% full hash table using hsearch(3). By default, there are 8192 elements inserted into the hash table. This is a useful method to exercise access of memory and processor cache.
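
A minimal sketch of the hsearch(3) pattern being exercised (the table size and key here are illustrative):

    #include <search.h>

    static int probe(void)
    {
        ENTRY e = { .key = "key-0", .data = (void *)0 };
        ENTRY *found;

        if (!hcreate(8192))
            return -1;
        if (!hsearch(e, ENTER))     /* insert */
            return -1;
        found = hsearch(e, FIND);   /* lookup */
        hdestroy();
        return found ? 0 : -1;
    }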

start N workers that stress the instruction cache by forcing instruction cache reloads. This is achieved by modifying an instruction cache line, causing the processor to reload it when a function inside it is called. Currently only verified and enabled for Intel x86 CPUs.

start N workers that perform a mix of sequential, random and memory mapped read/write operations as well as forced sync'ing and (if run as root) cache dropping. Multiple child processes are spawned to all share a single file and perform different I/O operations on the same file.

write N bytes for each iomix worker process, the default is 1 GB. One can specify the size as % of free space on the file system or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g.

start N workers that exercise the system interval timers. This sets up an ITIMER_PROF itimer that generates a SIGPROF signal. The default frequency for the itimer is 1 MHz, however, the Linux kernel will set this to be no more than the jiffy setting, hence high frequency SIGPROF signals are not normally possible. A busy loop spins on getitimer(2) calls to consume CPU and hence decrement the itimer based on the amount of time spent in CPU and system time.
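
A minimal sketch of arming such a timer (the kernel will clamp the effective rate as noted above):

    #include <signal.h>
    #include <sys/time.h>

    static void on_sigprof(int sig) { (void)sig; }

    /* request a SIGPROF every microsecond of consumed CPU time */
    static void start_prof_timer(void)
    {
        struct itimerval it = {
            .it_interval = { 0, 1 },
            .it_value    = { 0, 1 },
        };

        signal(SIGPROF, on_sigprof);
        setitimer(ITIMER_PROF, &it, NULL);
    }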

start N workers that create and manipulate keys using add_key(2) and keyctl(2). As many keys are created as the per user limit allows and then the following keyctl commands are exercised on each key: KEYCTL_SET_TIMEOUT, KEYCTL_DESCRIBE, KEYCTL_UPDATE, KEYCTL_READ, KEYCTL_CLEAR and KEYCTL_INVALIDATE.

start N workers locking, unlocking and breaking leases via the fcntl(2) F_SETLEASE operation. The parent processes continually lock and unlock a lease on a file while a user selectable number of child processes open the file with a non-blocking open to generate SIGIO lease breaking notifications to the parent. This stressor is only available if F_SETLEASE, F_WRLCK and F_UNLCK support is provided by fcntl(2).

start N lease breaker child processes per lease worker. Normally one child is plenty to force many SIGIO lease breaking notification signals to the parent, however, this option allows one to specify more child processes if required.

start N workers that randomly lock and unlock regions of a file using the POSIX advisory locking mechanism (see fcntl(2), F_SETLK, F_GETLK). Each worker creates a 1024 KB file and attempts to hold a maximum of 1024 concurrent locks with a child process that also tries to hold 1024 concurrent locks. Old locks are unlocked on a first-in, first-out basis.

start N workers that randomly lock and unlock regions of a file using the POSIX lockf(3) locking mechanism. Each worker creates a 64 KB file and attempts to hold a maximum of 1024 concurrent locks with a child process that also tries to hold 1024 concurrent locks. Old locks are unlocked on a first-in, first-out basis.

instead of using blocking F_LOCK lockf(3) commands, use non-blocking F_TLOCK commands and re-try if the lock failed. This creates extra system call overhead and CPU utilisation as the number of lockf workers increases and should increase locking contention.
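
A sketch of the non-blocking retry pattern (a simplification, with most error handling elided):

    #include <errno.h>
    #include <unistd.h>

    /* keep retrying a non-blocking lock until it is acquired */
    static int lock_region(int fd, off_t len)
    {
        while (lockf(fd, F_TLOCK, len) < 0)
            if (errno != EACCES && errno != EAGAIN)
                return -1;  /* a real error, not lock contention */
        return 0;
    }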

start N workers that randomly lock and unlock regions of a file using the Linux open file description locks (see fcntl(2), F_OFD_SETLK, F_OFD_GETLK). Each worker creates a 1024 KB file and attempts to hold a maximum of 1024 concurrent locks with a child process that also tries to hold 1024 concurrent locks. Old locks are unlocked on a first-in, first-out basis.

start N workers that linear search an unsorted array of 32 bit integers using lsearch(3). By default, there are 8192 elements in the array. This is a useful method to exercise sequential access of memory and processor cache.

start N workers continuously calling malloc(3), calloc(3), realloc(3) and free(3). By default, up to 65536 allocations can be active at any point, but this can be altered with the --malloc-max option. Allocation, reallocation and freeing are chosen at random; 50% of the time memory is allocated (via malloc, calloc or realloc) and 50% of the time allocations are free'd. Allocation sizes are also random, with the maximum allocation size controlled by the --malloc-bytes option, the default size being 64K. The worker is re-started if it is killed by the out of memory (OOM) killer.

maximum per allocation/reallocation size. Allocations are randomly selected from 1 to N bytes. One can specify the size as % of total available memory or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g. Large allocation sizes cause the memory allocator to use mmap(2) rather than expanding the heap using brk(2).

maximum number of active allocations allowed. Allocations are chosen at random and placed in an allocation slot. Because there is roughly a 50%/50% split between allocation and freeing, typically half of the allocation slots are in use at any one time.

start N workers that perform various matrix operations on floating point values. By default, this will exercise all the matrix stress methods one by one. One can specify a specific matrix stress method with the --matrix-method option.

start N workers that copy 2MB of data from a shared region to a buffer using memcpy(3) and then move the data in the buffer with memmove(3) with 3 different alignments. This will exercise processor cache and system memory.

allocate N bytes per memfd stress worker, the default is 256MB. One can specify the size as % of total available memory or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g.

instead of walking through pages sequentially, select pages at random. The chosen address is iterated over by shifting it right one place and checked by mincore until the address is less than or equal to the page size.

start N workers that lock and unlock memory mapped pages using mlock(2), munlock(2), mlockall(2) and munlockall(2). This is achieved by the mapping of three contiguous pages and then locking the second page, hence ensuring non-contiguous pages are locked. This is then repeated until the maximum allowed mlocks or a maximum of 262144 mappings are made. Next, all future mappings are mlocked and the worker attempts to map 262144 pages, then all pages are munlocked and the pages are unmapped.

start N workers continuously calling mmap(2)/munmap(2). The initial mapping is a large chunk (size specified by --mmap-bytes) followed by pseudo-random 4K unmappings, then pseudo-random 4K mappings, and then linear 4K unmappings. Note that this can cause systems to trip the kernel OOM killer on Linux systems if there is not enough physical memory and swap available. The MAP_POPULATE option is used to populate pages into memory on systems that support this. By default, anonymous mappings are used, however, the --mmap-file and --mmap-async options allow one to perform file based mappings if desired.

start N workers that each fork off 32 child processes, each of which tries to allocate some of the free memory left in the system (while trying to avoid any swapping). The child processes then hint that the allocation will be needed with madvise(2), memset it to zero, and hint that it is no longer needed with madvise before exiting. This produces significant amounts of VM activity and a lot of cache misses, with minimal swapping.

start N workers that attempt to create the maximum allowed per-process memory mappings. This is achieved by mapping 3 contiguous pages and then unmapping the middle page hence splitting the mapping into two. This is then repeated until the maximum allowed mappings or a maximum of 262144 mappings are made.

start N workers continuously calling mmap(2), mremap(2) and munmap(2). The initial anonymous mapping is a large chunk (size specified by --mremap-bytes) and then iteratively halved in size by remapping all the way down to a page size and then back up to the original size. This worker is only available for Linux.

start N stressors that msync data from a file backed memory mapping from memory back to the file and msync modified data from the file back to the mapped memory. This exercises the msync(2) MS_SYNC and MS_INVALIDATE sync operations.

allocate N bytes for the memory mapped file, the default is 256MB. One can specify the size as % of total available memory or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g.

specify the size of the POSIX message queue. The default size is 10 messages and on most Linux systems this is the maximum allowed size for normal users. If the given size is greater than the allowed message queue size then a warning is issued and the maximum allowed size is used instead.

start N workers that spawn child processes and monitor fork/exec/exit process events via the proc netlink connector. Each event received is counted as a bogo op. This stressor can only be run on Linux and with root privilege.

start N cpu consuming workers that exercise the available nice levels. Each iteration forks off a child process that runs through all the nice levels running a busy loop for 0.1 seconds per level and then exits.

start N workers that migrate stressors and a 4MB memory mapped buffer around all the available NUMA nodes. This uses migrate_pages(2) to move the stressors and mbind(2) and move_pages(2) to move the pages of the mapped buffer. After each move, the buffer is written to force activity over the bus, which results in cache misses. This test will only run on hardware with NUMA enabled and more than 1 NUMA node.

start N workers that create as many pipes as allowed and exercise expanding and shrinking the pipes from the largest pipe size down to a page size. Data is written into the pipes and read out again to fill the pipe buffers. With the --aggressive mode enabled the data is not read out when the pipes are shrunk, causing the kernel to OOM processes aggressively. Running many instances of this stressor will force kernel to OOM processes due to the many large pipe buffer allocations.

start N workers that fork off children that execute randomly generated executable code. This will generate issues such as illegal instructions, bus errors, segmentation faults, traps and floating point errors, all of which are handled gracefully by the stressor.

start N workers that perform open(2) and then close(2) operations on /dev/zero. The maximum opens at one time is system defined, so the test will run up to this maximum, or 65536 open file descriptors, whichever comes first.

specifies the size in bytes of each write to the pipe (range from 4 bytes to 4096 bytes). Setting a small data size will cause more writes to be buffered in the pipe, hence reducing the context switch rate between the pipe writer and pipe reader processes. Default size is the page size.

specifies the size of the pipe in bytes (for systems that support the F_SETPIPE_SZ fcntl() command). Setting a small pipe size will cause the pipe to fill and block more frequently, hence increasing the context switch rate between the pipe writer and the pipe reader processes. Default size is 512 bytes.
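
A sketch of resizing a pipe buffer with fcntl(2) (Linux only; the kernel rounds the requested size up to at least a page):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* create a pipe and shrink its buffer to approximately size bytes */
    static int small_pipe(int fds[2], int size)
    {
        if (pipe(fds) < 0)
            return -1;
        return fcntl(fds[1], F_SETPIPE_SZ, size);
    }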

start N workers that iteratively create and terminate multiple pthreads (the default is 1024 pthreads per worker). In each iteration, each newly created pthread waits until the worker has created all the pthreads and then they all terminate together.

create N pthreads per worker. If the product of the number of pthreads by the number of workers is greater than the soft limit of allowed pthreads then the maximum is re-adjusted down to the maximum allowed.

start N workers that randomly seek and perform 512 byte read/write I/O operations on a file with readahead. The default file size is 1 GB. Readaheads and reads are batched into 16 readaheads and then 16 reads.

start N workers that map 512 pages and re-order these pages using the deprecated system call remap_file_pages(2). Several page re-orderings are exercised: forward, reverse, random and many pages to 1 page.

start N workers that exercise the VM reverse-mapping. This creates 16 processes per worker that write/read multiple file-backed memory mappings. There are 64 lots of 4 page mappings made onto the file, with each mapping overlapping the previous by 3 pages and at least 1 page of non-mapped memory between each of the mappings. Data is synchronously msync'd to the file 1 in every 256 iterations in a random manner.

start N workers that repeatedly set the worker to the various available scheduling policies: SCHED_OTHER, SCHED_BATCH, SCHED_IDLE, SCHED_FIFO and SCHED_RR. For the real time scheduling policies a random sched priority is selected between the minimum and maximum scheduling priority settings.

start N workers that exercise the fcntl(2) SEAL commands on a small anonymous file created using memfd_create(2). After each SEAL command is issued the stressor also sanity checks if the seal operation has sealed the file correctly. (Linux only).

start N workers that exercise Secure Computing system call filtering. Each worker creates child processes that write a short message to /dev/null and then exit. 2% of the child processes have a seccomp filter that disallows the write system call and hence the child is killed by seccomp with a SIGSYS. Note that this stressor can generate many audit log messages each time the child is killed.

specify the size of the file in bytes. Small file sizes allow the I/O to occur in the cache, causing greater CPU load. Large file sizes force more I/O operations to the drive, causing more wait time and more I/O on the drive. One can specify the size in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g.

start N workers that perform POSIX semaphore wait and post operations. By default, a parent and 4 children are started per worker to provide some contention on the semaphore. This stresses fast semaphore operations and produces rapid context switching.

start N workers that perform System V semaphore wait and post operations. By default, a parent and 4 children are started per worker to provide some contention on the semaphore. This stresses fast semaphore operations and produces rapid context switching.

start N workers that open and allocate shared memory objects using the POSIX shared memory interfaces. By default, the test will repeatedly create and destroy 32 shared memory objects, each of which is 8MB in size.

specify the size of the POSIX shared memory objects to be created. One can specify the size as % of total available memory or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g.

start N workers that generate SIGRT signals that are read and handled by a child process via a file descriptor set up using signalfd(2). (Linux only). This will generate a heavy context switch load when all CPUs are fully loaded.

start N workers that check if SIGUSR1 signals are pending. This stressor masks SIGUSR1, generates a SIGUSR1 signal and uses sigpending(2) to see if the signal is pending. Then it unmasks the signal and checks if the signal is no longer pending.

start N workers that each spawn off 4 child processes that wait for a SIGUSR1 signal from the parent using sigsuspend(2). The parent sends SIGUSR1 signals to each child in rapid succession. Each sigsuspend wakeup is counted as one bogo operation.

This disables the TCP Nagle algorithm, so data segments are always sent as soon as possible. This stops data from being buffered before being transmitted, hence resulting in poorer network utilisation and more context switches between the sender and receiver.

by default, messages are sent using send(2). This option allows one to specify the sending method using send(2), sendmsg(2) or sendmmsg(2). Note that sendmmsg is only available for Linux systems that support this system call.

start N workers that pass file descriptors over a UNIX domain socket using the CMSG(3) ancillary data mechanism. For each worker, a pair of client/server processes is created; the server opens as many file descriptors on /dev/null as possible and passes these over the socket to a client that reads them from the CMSG data and immediately closes the files.

the default action is to touch the lowest page on each stack allocation. This option touches all the pages by filling the new stack allocation with zeros which forces physical pages to be allocated and hence is more aggressive.

start N workers that use a 2MB stack that is memory mapped onto a temporary file. A recursive function works down the stack and flushes dirty stack pages back to the memory mapped file using msync(2) until the end of the stack is reached (stack overflow). This exercises dirty page and stack exception handling.

select a specific libc string function to stress. Available string functions to stress are: all, index, rindex, strcasecmp, strcat, strchr, strcoll, strcmp, strcpy, strlen, strncasecmp, strncat, strncmp, strrchr and strxfrm. See string(3) for more information on these string functions. The 'all' method is the default and will exercise all the string methods.

start N workers exercising a memory bandwidth stressor loosely based on the STREAM "Sustainable Memory Bandwidth in High Performance Computers" benchmarking tool by John D. McCalpin, Ph.D. This stressor allocates buffers that are at least 4 times the size of the CPU L2 cache and continually performs rounds of the following computations on large arrays of double precision floating point numbers:

Operation

Description

copy

c[i] = a[i]

scale

b[i] = scalar * c[i]

add

c[i] = a[i] + b[i]

triad

a[i] = b[i] + (c[i] * scalar)
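
A bare sketch of one round of these four kernels over double precision arrays a, b and c (a simplification of the stressor's actual loops):

    #include <stddef.h>

    static void stream_round(double *a, double *b, double *c,
                             double scalar, size_t n)
    {
        size_t i;

        for (i = 0; i < n; i++) c[i] = a[i];                  /* copy  */
        for (i = 0; i < n; i++) b[i] = scalar * c[i];         /* scale */
        for (i = 0; i < n; i++) c[i] = a[i] + b[i];           /* add   */
        for (i = 0; i < n; i++) a[i] = b[i] + c[i] * scalar;  /* triad */
    }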

Since this is loosely based on a variant of the STREAM benchmark code, DO NOT submit results based on it; within stress-ng it is intended only to stress memory and compute, NOT to produce accurate tuned or non-tuned STREAM benchmarks whatsoever. Use the official STREAM benchmarking tool if you desire accurate and standardised STREAM benchmarks.

Specify the CPU Level 3 cache size in bytes. One can specify the size in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g. If the L3 cache size is not provided, then stress-ng will attempt to determine the cache size, and failing this, will default the size to 4MB.

start N workers that perform a range of data syncs across a file using sync_file_range(2). Three mixes of syncs are performed, from start to the end of the file, from end of the file to the start, and a random mix. A random selection of valid sync types are used, covering the SYNC_FILE_RANGE_WAIT_BEFORE, SYNC_FILE_RANGE_WRITE and SYNC_FILE_RANGE_WAIT_AFTER flag bits.

start N workers that continually read system and process specific information. This reads the process user and system times using the times(2) system call. For Linux systems, it also reads overall system statistics using the sysinfo(2) system call and also the file system statistics for all mounted file systems using statfs(2).

move data from a writer process to a reader process through pipes and to /dev/null without any copying between kernel address space and user address space using tee(2). This is only available for Linux.

start N workers creating timer events at a default rate of 1 MHz (Linux only); this can create many thousands of timer clock interrupts. Each timer event is caught by a signal handler and counted as a bogo timer op.

start N workers creating timerfd events at a default rate of 1 MHz (Linux only); this can create many thousands of timer clock events. Timer events are waited for on the timer file descriptor using select(2) and then read and counted as a bogo timerfd op.

start N workers that force Translation Lookaside Buffer (TLB) shootdowns. This is achieved by creating up to 16 child processes that all share a region of memory and these processes are spread across the available CPUs. The processes adjust the page mapping settings causing TLBs to be force flushed on the other processors, causing the TLB shootdowns.

start N workers that insert, search and delete 32 bit integers on a binary tree using tsearch(3), tfind(3) and tdelete(3). By default, there are 65536 randomized integers used in the tree. This is a useful method to exercise random access of memory and processor cache.

start N workers that generate write page faults on a small anonymously mapped memory region and handle these faults using the user space fault handling via the userfaultfd mechanism. This will generate a large quantity of major page faults and also context switches during the handling of the page faults. (Linux only).

mmap N bytes per userfaultfd worker to page fault on, the default is 16MB. One can specify the size as % of total available memory or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g.

start N workers that perform various unsigned integer math operations on various 128 bit vectors. A mix of vector math operations are performed on the following vectors: 16 × 8 bits, 8 × 16 bits, 4 × 32 bits, 2 × 64 bits. The metrics produced by this mix depend on the processor architecture and the vector math optimisations produced by the compiler.

create P processes and then wait for them to exit per iteration. The default is just 1; higher values will create many temporary zombie processes that are waiting to be reaped. One can potentially fill up the process table using high values for --vfork-max and --vfork.

start N workers that spawn off a chain of vfork children until the process table fills up and/or vfork fails. vfork can rapidly create child processes and the parent process has to wait until the child dies, so this stressor rapidly fills up the process table.

start N workers continuously calling mmap(2)/munmap(2) and writing to the allocated memory. Note that this can cause systems to trip the kernel OOM killer on Linux systems if there is not enough physical memory and swap available.

specify a vm stress method. By default, all the stress methods are exercised sequentially, however one can specify just one method to be used if required. Each of the vm workers has 3 phases:

1. Initialised. The anonymously mapped memory region is set to a known pattern.

2. Exercised. Memory is modified in a known predictable way. Some vm workers alter memory sequentially, some use small or large strides to step along memory.

3. Checked. The modified memory is checked to see if it matches the expected result.

The vm methods containing 'prime' in their name have a stride of the largest prime less than 2^64, allowing them to thoroughly step through memory and touch all locations just once without touching neighbouring memory cells. This strategy exercises the cache and page non-locality.
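
A sketch of the underlying idea, assuming the stride and the region size share no common factor so every offset is visited exactly once:

    #include <stddef.h>

    /* touch all n bytes exactly once, stepping by a prime stride */
    static void prime_walk(unsigned char *mem, size_t n, size_t prime)
    {
        size_t i, offset = 0;

        for (i = 0; i < n; i++) {
            mem[offset] ^= 0xff;
            offset = (offset + prime) % n;
        }
    }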

Since the memory being exercised is virtually mapped, there is no guarantee of touching page addresses in any particular physical order. These workers should not be used to test that all of the system's memory is working correctly either; use tools such as memtest86 instead.

The vm stress methods are intended to exercise memory in ways to possibly find memory issues and to try to force thermal errors.

Available vm stress methods are described as follows:

Method

Description

all

iterate over all the vm stress methods as listed below.

flip

sequentially work through memory 8 times, each time with just one bit in memory flipped (inverted). This will effectively invert each byte in 8 passes.

galpat-0

galloping pattern zeros. This sets all bits to 0 and flips just 1 in 4096 bits to 1. It then checks to see if the 1s are pulled down to 0 by their neighbours or if the neighbours have been pulled up to 1.

galpat-1

galloping pattern ones. This sets all bits to 1 and flips just 1 in 4096 bits to 0. It then checks to see if the 0s are pulled up to 1 by their neighbours or if the neighbours have been pulled down to 0.

gray

fill the memory with sequential gray codes (these only change 1 bit at a time between adjacent bytes) and then check if they are set correctly.

incdec

work sequentially through memory twice, the first pass increments each byte by a specific value and the second pass decrements each byte back to the original start value. The increment/decrement value changes on each invocation of the stressor.

inc-nybble

initialise memory to a set value (that changes on each invocation of the stressor) and then sequentially work through each byte incrementing the bottom 4 bits by 1 and the top 4 bits by 15.

rand-set

sequentially work through memory in 64 bit chunks setting bytes in the chunk to the same 8 bit random value. The random value changes on each chunk. Check that the values have not changed.

rand-sum

sequentially set all memory to random values and then sum the number of bits that have changed from the original set values.

ror

fill memory with a random pattern and then sequentially rotate 64 bits of memory right by one bit, then check the final load/rotate/stored values.

swap

fill memory in 64 byte chunks with random patterns. Then swap each 64 byte chunk with a randomly chosen chunk. Finally, reverse the swap to put the chunks back to their original place and check if the data is correct. This exercises adjacent and random memory load/stores.

move-inv

sequentially fill memory 64 bits at a time with random values, and then check if the memory is set correctly. Next, sequentially invert each 64 bit pattern and again check if the memory is set as expected.

modulo-x

fill memory over 23 iterations. Each iteration starts one byte further along from the start of the memory and steps along in 23 byte strides. In each stride, the first byte is set to a random pattern and all other bytes are set to the inverse. Then it checks to see if the first byte contains the expected random pattern. This exercises cache store/reads as well as seeing if neighbouring cells influence each other.

prime-0

iterate 8 times by stepping through memory in very large prime strides clearing just one bit at a time in every byte. Then check to see if all bits are set to zero.

prime-1

iterate 8 times by stepping through memory in very large prime strides setting just one bit at a time in every byte. Then check to see if all bits are set to one.

prime-gray-0

first step through memory in very large prime strides clearing just one bit (based on a gray code) in every byte. Next, repeat this but clear the other 7 bits. Then check to see if all bits are set to zero.

prime-gray-1

first step through memory in very large prime strides setting just one bit (based on a gray code) in every byte. Next, repeat this but set the other 7 bits. Then check to see if all bits are set to one.

rowhammer

try to force memory corruption using the rowhammer memory stressor. This fetches two 32 bit integers from memory and forces a cache flush on the two addresses multiple times. This has been known to force bit flipping on some hardware, especially with lower frequency memory refresh cycles.

walk-0d

for each byte in memory, walk through each data line setting them to low (and the others are set high) and check that the written value is as expected. This checks if any data lines are stuck.

walk-1d

for each byte in memory, walk through each data line setting them to high (and the others are set low) and check that the written value is as expected. This checks if any data lines are stuck.

walk-0a

in the given memory mapping, work through a range of specially chosen addresses working through address lines to see if any address lines are stuck low. This works best with physical memory addressing, however, exercising these virtual addresses has some value too.

walk-1a

in the given memory mapping, work through a range of specially chosen addresses working through address lines to see if any address lines are stuck high. This works best with physical memory addressing, however, exercising these virtual addresses has some value too.

write64

sequentially write memory using 32 x 64 bit writes per bogo loop. Each loop equates to one bogo operation. This exercises raw memory writes. Note that memory writes are not checked at the end of each test iteration.

zero-one

set all memory bits to zero and then check if any bits are not zero. Next, set all the memory bits to one and check if any bits are not one.

start N workers that spawn off two children; one spins in a pause(2) loop, the other continually stops and continues the first. The controlling process waits on the first child to be resumed by the delivery of SIGCONT using waitpid(2) and waitid(2).

start N workers compressing and decompressing random data using zlib. Each worker has two processes, one that compresses random data and pipes it to another process that decompresses the data. This stressor exercises CPU, cache and memory.

specify the type of random data to send to the zlib library. By default, the data stream is created from a random selection of the different data generation processes. However, one can specify just one method to be used if required. Available zlib data generation methods are described as follows:

Method

Description

random

segments of the data stream are created by randomly calling the different data generation methods.

start N workers that create zombie processes. This will rapidly try to create a default of 8192 child processes that immediately die and wait in a zombie state until they are reaped. Once the maximum number of processes is reached (or fork fails because one has reached the maximum allowed number of children) the oldest child is reaped and a new process is then created in a first-in first-out manner, and then repeated.

stress-ng was written by Colin King <colin.king@canonical.com> and is a clean room re-implementation and extension of the original stress tool by Amos Waterland <apw@rossby.metr.ou.edu>. Thanks also for contributions from Christian Ehrhardt, James Hunt, Jim Rowan, Joseph DeVincentis, Luca Pizzamiglio, Luis Henriques, Rob Colclaser, Tim Gardner and Zhiyi Sun.

Sending a SIGALRM, SIGINT or SIGHUP to stress-ng causes it to terminate all the stressor processes and ensures temporary files and shared memory segments are removed cleanly.

Sending a SIGUSR2 to stress-ng will dump out the current load average and memory statistics.

Note that the stress-ng cpu, io, vm and hdd tests are different implementations of the original stress tests and hence may produce different stress characteristics. stress-ng does not support any GPU stress tests.

The bogo operations metrics may change with each release because of bug fixes to the code, new features, compiler optimisations or changes in system call performance.