WebAssembly and Dynamic Memory

WebAssembly is a binary instruction format for a stack-based virtual machine. Wasm is designed as a portable target for compilation of high-level languages like C/C++/Rust, enabling deployment on the web for client and server applications.

These high-level languages like C/C++/Rust (+Javascript) deal with different allocations of memory, such as static memory, stack memory and dynamic memory.

Static memory allocation refers to assignment of fixed memory addresses during compile and start-time of an application (.data or .bss).

Stack memory allocation refers to a memory region where data is added or removed per function-scope in a last-in-first-out (LIFO) manner.

Dynamic memory allocation refers to performing manual memory management for dynamic memory allocation in the C programming language via a group of functions in the C standard library, namely malloc, realloc, calloc and free; or the corresponding boxing in Rust.

This document will focus onto the latter one, the dynamic memory and how it is realized by different toolchains compiling C/C++/Rust targeting the WebAssembly architecture/platform.

My intention is find the best toolchain to produce WebAssembly-modules being as portable as possible.

High level source code (C/C++/Rust) is compiled to WebAssembly using toolchains like emscripten, clang, or rustc. Emscripten will translate the C/C++ to JavaScript first and this translating to WebAssembly, whereas clang would translate from C/C++ directly to WebAssembly. (Note: avoid emscripten if your code has to deal with int64-primitives, as Javascript would emulate those using floats).

If developing portable WebAssembly modules/apps it is crucial to understand the management of static and dynamic memory and the influence of the compiler toolchain being used.

The security model of WebAssembly has the important goal to protect users from buggy or malicious modules. This is achieved by sandboxing the code, operating on a so called linear memory only.

The linear memory of WebAssembly can be compared to the virtual memory provided to processes running on Windows or Linux. The linear memory is the sandbox the WebAssembly module may operate on, storing the static data in an indexed region, managing the stack memory and managing the dynamic memory required by the app during runtime.

A WebAssembly module could be a library only or an app; the latter requiring a start-function in the module. An app would instantiate global, static constructors (C++) at start-time, whereas calling a function of an library would not (lemma: avoid global static constructors if you want to translate C++ to WebAssembly).

When instantiating a WebAssembly module, it might operate on a newly created linear memory or it might continue operation on a linear memory provided by the environment.

The management of dynamic memory is provided by so called runtime-libraries such as gnu-libc, musl-libc(embedded), bionic-libc(Android) or win32; each one implementing the functions malloc(size_t), free(ptr) etc using a different strategy. Depending on the use-case and platform, different implementations of the actual memory allocation mechanism exist. Their performance varies in both execution time and required memory.

[dlmalloc] Doug Lea has developed dlmalloc (“Doug Lea’s Malloc”) as a general-purpose allocator, starting in 1987. The GNU C library (glibc) uses ptmalloc, an allocator based on dlmalloc.

[jemalloc] Since FreeBSD 7.0 and NetBSD 5.0, the old malloc implementation (phkmalloc) was replaced by jemalloc, written by Jason Evans. The main reason for this was a lack of scalability of phkmalloc in terms of multithreading. In order to avoid lock contention, jemalloc uses separate “arenas” for each CPU. Experiments measuring number of allocations per second in multithreading application have shown that this makes it scale linearly with the number of threads, while for both phkmalloc and dlmalloc performance was inversely proportional to the number of threads.

[Hoard malloc] Hoard is an allocator whose goal is scalable memory allocation performance. Like OpenBSD’s allocator, Hoard uses mmap exclusively, but manages memory in chunks of 64 kilobytes called superblocks. Hoard’s heap is logically divided into a single global heap and a number of per-processor heaps. In addition, there is a thread-local cache that can hold a limited number of superblocks. By allocating only from superblocks on the local per-thread or per-processor heap, and moving mostly-empty superblocks to the global heap so they can be reused by other processors, Hoard keeps fragmentation low while achieving near linear scalability with the number of threads.

[wee_alloc] The Wee_alloc memory allocator has been designed for WebAssembly, and is available for Rust only. It has a tiny code size footprint, compiling down to only a kilobyte of .wasm code. It has been designed for scenarios where there are a handful of initial, long-lived allocations, after which the code does its heavy lifting without any further allocations. This scenario requires some allocator to exist, but we are more than happy to trade performance for small code size.

These implementations of the dynamic memory framework rely on memory pages (usually each page of size 4KB) mapped into the virtual address space of the process in question.

Within each process the actual implementation of the dynamic memory allocator is managing chunks of memory in an intelligent and efficient way.

In case the dynamic memory allocator might run out of memory, it would perform a platform specific call to request more memory pages being mapped into the address space of the process (aka instance of the wasm-module). Running on top of Linux-Kernel it would be a syscall-function, on other systems it might be the system-call brk(), and WebAssmbly defines the operator grow_memory.

For example the following C application may be translated to a WebAssembly (wasm) module (in fact an app with start function) either using emscripten or using a clang-toolchain (combined with musl-libc)

[emscripten] The emscripten toolchain will produce a wasm-binary importing/exporting the following symbols. emscripten is using an implementation of dlmalloc, which is expecting the WebAssembly-environment is exposing the following functions to request and manage memory pages during runtime.

[musl-libc/clang] The combination of musl-libc and clang-toolchain will produce a wasm-binary importing/exporting the following symbols. Here, musl-libc is providing an implementation of malloc/free being designed for embedded devices, running an Linux-Kernel or executing standalones. Here the WebAssembly environment is expected to provide these functions.

[rustc] Giving Rust a chance, the code for an embedded application being executed in an WebAssembly execution engine might look like the following code, without dependencies to a std-library, and linking against the wee_alloc implementation of the dynamic memory allocator. The intention of this code is to provoke allocation of dynamic memory using boxing (see the example at https://github.com/frehberg/wasm-dyn-mem).

As you might notice there are no explicit dependencies as created by emscripten or musl-libc/clang of the previous examples. Here the Rust implementation of the dynamic memory allocator wee_alloc is relying on the WebAssembly specific operator grow_memory.

Conclusion

In terms of portability the clear winner is the Rust compiler toolchain in combination with the dynamic memory allocator wee_alloc.

Its resulting WebAssembly code has not a single dependency/import to any another symbol/function-name being exposed by the runtime environment, and this lean code might be executed in any WebAssembly environment.

In contrast, when using the toolchains like emscripten or musl-libc/clang these will add dependencies, importing symbols form the WebAssembly runtime to request and manage the memory pages, either depending on those __syscall functions or the function enlargeMemory and more.

These functions may be specific for Javascript environments or custom environments only, and would prevent execution in other WebAssembly environments.

Outlook

Next I want to investigate the combination of Rust and its std-allocator, an implementation of jemalloc