Mirantis OpenStack 7.0: NFVI Deployment Guide — Huge pages

Memory on contemporary computers is addressed in blocks of contiguous virtual memory addresses known as pages. Historically, memory pages on x86 systems have had a fixed size of 4 kilobytes, but today this parameter is configurable to some degree: the x86_32 architecture, for example, supports 4 KB and 4 MB pages, while the x86_64 architecture supports 4 KB, 2 MB, and, more recently, 1 GB pages.

Pages larger than the default size are referred to as “huge pages” or “large pages” (the terms are frequently capitalized). We’ll call them “huge pages” in this document.

Processes work with virtual memory addresses. Each time a process accesses memory, the desired virtual address has to be translated to a physical one by looking it up in a special memory area called the page table, where virtual-to-physical mappings are stored. A hardware cache on the CPU, called the translation lookaside buffer (TLB), is used to speed up these lookups.

The TLB can typically store only a small fraction of the virtual-to-physical page mappings. By increasing the memory page size we reduce the total number of pages that need to be addressed, thus increasing the TLB hit rate. This can lead to significant performance gains when a process does many memory operations. Also, the page table itself may require a significant amount of memory when it needs to store many references to small memory pages. In extreme cases, memory savings from using huge pages may amount to several gigabytes. (For example, see http://kevinclosson.net/2009/07/28/quantifying-hugepages-memory-savings-with-oracle-database-11g.)
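To get a feel for the numbers, the following sketch counts how many pages are needed to map the same amount of memory at each x86_64 page size. The 64 GB figure is only an illustrative assumption, and pages_needed is a hypothetical helper name.

```shell
# Number of pages needed to map a region of memory at a given page size.
# $1 = memory size in kB, $2 = page size in kB; the result is rounded up.
pages_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

mem_kb=$((64 * 1024 * 1024))   # 64 GB, an illustrative figure

echo "4 KB pages: $(pages_needed "$mem_kb" 4)"
echo "2 MB pages: $(pages_needed "$mem_kb" 2048)"
echo "1 GB pages: $(pages_needed "$mem_kb" 1048576)"
```

Fewer pages mean fewer page table entries to keep in memory and fewer mappings competing for the limited number of TLB slots.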

On the other hand, when the page size is large but a process doesn’t use all the page memory, unused memory is effectively lost as it cannot be used by other processes. So there is usually a tradeoff between performance and more efficient memory utilization.

In the case of virtualization, a second level of page translation (between the hypervisor and host OS) causes additional overhead. Using huge pages on the host OS lets us greatly reduce this overhead.

It’s preferable to give a virtual machine with NFV workloads exclusive access to a predetermined amount of memory. No other process can use that memory anyway, so there is no tradeoff in using huge pages. Huge pages are thus the natural option for NFV workloads.

Transparent Huge Pages (THP) are turned on by default in MOS 7.0, but explicit huge pages potentially provide greater performance gains if an application supports them.

Although we tend to think of the hypervisor as KVM, KVM is really just the kernel module; the actual hypervisor is QEMU. That means that QEMU performance is crucial for NFV. Fortunately, it supports explicit usage of huge pages via hugetlbfs, so we don’t really need THP here. Moreover, THP can lead to side effects with unpredictable results, sometimes lowering performance instead of raising it.

Also be aware that when the kernel needs to swap out a THP, the aggregate huge page is first split into standard 4 KB pages. Explicit huge pages are never swapped to disk, which is perfectly fine for typical NFV workloads.

In general, huge pages can be reserved at boot or at runtime (though 1 GB huge pages can only be allocated at boot). Memory generally gets fragmented on a running system, and the kernel may not be able to reserve as many contiguous memory blocks at runtime as it can at boot.
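As a minimal sketch of runtime reservation, assuming the standard procfs control file for pages of the default huge page size, the helper below requests a number of pages and then reads the value back, since the kernel may grant fewer than requested on a fragmented system. The function name and its optional second argument are hypothetical.

```shell
# Request huge pages of the default size at runtime and report how many
# the kernel actually managed to reserve.
# $1 = number of pages to request
# $2 = control file (defaults to the standard procfs location)
reserve_hugepages() {
    ctl="${2:-/proc/sys/vm/nr_hugepages}"
    echo "$1" > "$ctl" || return 1
    granted=$(cat "$ctl")
    if [ "$granted" -lt "$1" ]; then
        echo "requested $1 pages, kernel reserved only $granted" >&2
    fi
    echo "$granted"
}

# Example (run as root on the compute node):
#   reserve_hugepages 1024
```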

For general NFV workloads we recommend using dedicated compute nodes with the major part of their memory reserved as explicit huge pages at boot time. NFV workload instances should be configured to use huge pages. We also recommend disabling THP on these compute nodes. As for the preferred huge page size, the choice depends on the needs of specific workloads. Generally, 1 GB huge pages can be slightly faster, but 2 MB huge pages provide more granularity.
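To confirm whether THP is actually disabled on a node, one option is to read the kernel's sysfs switch, which lists the available modes and marks the active one with square brackets. The sketch below assumes the usual sysfs path; thp_mode is a hypothetical helper name.

```shell
# Print the active THP mode, i.e. the word enclosed in square brackets.
# $1 = sysfs file to read (defaults to the usual kernel location)
thp_mode() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' \
        "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}

# Example (on the compute node):
#   thp_mode    # prints "never" once THP is disabled
```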

Huge pages and physical topology

All contemporary multiprocessor x86_64 systems have a non-uniform memory access (NUMA) architecture. NUMA-related settings will be described in the following sections of this guide, but there are some subtle characteristics of NUMA that affect huge page allocation on multi-CPU hosts, and you should be aware of them when configuring OpenStack.

As a rule, some amount of memory is reserved in the lower range of the memory address space. This memory is used for memory-mapped I/O, and it is usually reserved on the first NUMA cell (the one corresponding to the first CPU) before huge pages are allocated. When allocating huge pages, however, the kernel tries to spread them evenly across all NUMA cells. If there’s not enough contiguous memory in one of the NUMA cells, the kernel will try to compensate by allocating more memory on the remaining cells. When the amount of memory used by huge pages is close to the total amount of free memory, you end up with an uneven huge page distribution across NUMA cells. This is more likely to happen when using 1 GB pages.

Here is an example from a host with 64 gigabytes of memory and two CPUs:

This can have negative consequences. Suppose we use a VM flavor that requires 30 GB of memory in one NUMA cell (or 60 GB in two). One might think that the number of huge pages on this host is enough to run two instances with 30 GB of memory each, or one two-cell instance with 60 GB. In reality, only one 30 GB instance will start: the other one will be one 1 GB page short. A 60 GB, two-cell instance will fail to start altogether with this distribution of huge pages, because Nova will look for a physical host with two NUMA cells that each have 30 GB of memory available, and one of the cells has insufficient memory.
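The reasoning can be checked with a few lines of arithmetic. In the sketch below the split of 31 and 29 pages between the two cells is a hypothetical figure chosen to match the "one page short" case, and cells_that_fit is a hypothetical helper name.

```shell
# Count how many NUMA cells have at least the required number of huge pages.
# $1 = pages needed per cell; remaining arguments = pages available per cell.
cells_that_fit() {
    need=$1
    shift
    fit=0
    for have in "$@"; do
        if [ "$have" -ge "$need" ]; then
            fit=$((fit + 1))
        fi
    done
    echo "$fit"
}

# Hypothetical split of 60 x 1 GB pages across two cells: 31 and 29.
cells_that_fit 30 31 29
```

With an even 30/30 split the same call would report two cells, which is what both the pair of 30 GB instances and the 60 GB two-cell instance need.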

You may want to use an option such as ‘Socket Interleave Below 4GB’ or similar if your BIOS supports it to avoid this situation. This option maps lower address space evenly between the NUMA cells, in effect splitting reserved memory between NUMA nodes.

In conclusion, you should always test to verify the real allocation of huge pages and plan accordingly, based on the results.
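One way to inspect the real distribution is to read the per-node counters that the kernel exposes in sysfs. This is a minimal sketch assuming the standard sysfs layout; list_node_hugepages is a hypothetical helper name, and its optional argument exists only so the function can be pointed at another directory.

```shell
# Print the number of reserved huge pages per NUMA node and page size.
# $1 = sysfs node directory (defaults to the standard location)
list_node_hugepages() {
    base="${1:-/sys/devices/system/node}"
    for f in "$base"/node*/hugepages/hugepages-*/nr_hugepages; do
        [ -r "$f" ] || continue
        node=$(echo "$f" | sed 's|.*/\(node[0-9]*\)/hugepages/.*|\1|')
        size=$(echo "$f" | sed 's|.*/hugepages-\([0-9]*kB\)/.*|\1|')
        echo "$node $size $(cat "$f")"
    done
}

# Example (on the compute node):
#   list_node_hugepages
```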

Enabling huge pages on MOS 7.0

To enable huge pages, you need to configure every compute node where you plan to run instances that will use them. You also need to configure Nova aggregates and flavors before launching huge page-backed instances.

Compute hosts configuration

Below we provide an example of how to configure huge pages on one of the compute nodes. All the commands in this section should be run on the compute nodes that will handle huge pages workloads.

Add huge page allocation parameters to the list of kernel arguments in /etc/default/grub. Note that we are also disabling Transparent Huge Pages in the examples below, because we’re using explicit huge pages to prevent swapping. Add the following to the end of /etc/default/grub:
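A minimal sketch of such a line, assuming 2 MB pages, is shown here. The page count of 2048 is a placeholder rather than a recommendation, and the kernel parameters used are hugepagesz, hugepages and transparent_hugepage.

```shell
# /etc/default/grub (fragment)
# Reserve 2 MB huge pages at boot and disable THP. The page count is a
# placeholder: size it for your own host.
GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX hugepagesz=2M hugepages=2048 transparent_hugepage=never"
```

The new kernel arguments only take effect after the GRUB configuration has been regenerated (with update-grub on Ubuntu-based nodes) and the node has been rebooted.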

Caution: Be careful when deciding on the number of huge pages to reserve. You should leave enough memory for host OS processes (including memory for Ceph processes if your compute node shares the Ceph OSD role) or risk unpredictable results.
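As a rough sizing aid, the page count can be derived from the total memory minus whatever is set aside for the host. The 8 GB host reserve below is purely an assumption for illustration, not a tested recommendation, and hugepages_to_reserve is a hypothetical helper name.

```shell
# Number of huge pages that fit once memory for the host is set aside.
# $1 = total RAM in MB, $2 = MB to leave for the host, $3 = page size in MB
hugepages_to_reserve() {
    echo $(( ($1 - $2) / $3 ))
}

# Hypothetical 64 GB host keeping 8 GB for the OS and other services.
hugepages_to_reserve 65536 8192 2       # 2 MB pages
hugepages_to_reserve 65536 8192 1024    # 1 GB pages
```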

Note: You can’t allocate different amounts of memory to each NUMA cell via kernel parameters. If you need to do so, you have to use the command line or startup scripts. Here is an example in which we allocate ten 1 GB pages on the first NUMA cell and 30 on the second one:
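A sketch of per-cell allocation, assuming the kernel's per-node sysfs counters are used, could look like this. The helper name set_node_hugepages and its optional fourth argument are hypothetical, and whether a given kernel honours such a request for 1 GB pages on a running system depends on the kernel version.

```shell
# Set the number of huge pages of one size on one NUMA node.
# $1 = node index, $2 = page size in kB, $3 = page count
# $4 = sysfs node directory (defaults to the standard location)
set_node_hugepages() {
    base="${4:-/sys/devices/system/node}"
    echo "$3" > "$base/node${1}/hugepages/hugepages-${2}kB/nr_hugepages"
}

# Example matching the text (run as root on the compute node):
#   set_node_hugepages 0 1048576 10
#   set_node_hugepages 1 1048576 30
```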

Nova configuration

To use huge pages, you need to launch instances whose flavor has the extra specification hw:mem_page_size.
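As a sketch of what this looks like with the nova command-line client, the helper below creates a flavor and sets the page size extra spec, which Nova spells hw:mem_page_size. The function name, flavor name and sizes are hypothetical.

```shell
# Create a flavor and mark it as huge page backed.
# $1 = flavor name, $2 = RAM in MB, $3 = disk in GB, $4 = vCPUs,
# $5 = huge page size in kB
create_hugepage_flavor() {
    nova flavor-create "$1" auto "$2" "$3" "$4" &&
    nova flavor-key "$1" set hw:mem_page_size="$5"
}

# Example (hypothetical flavor name and sizes):
#   create_hugepage_flavor m1.small.hpgs 2000 20 2 2048
```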

By default, there is nothing to prevent normal instances with flavors that don’t have the extra spec from starting on compute nodes with reserved huge pages. To avoid this situation, you’ll need to create Nova aggregates for compute nodes with and without huge pages, create a new flavor for huge page-enabled instances, update all the other flavors with this extra spec, and reconfigure the Nova scheduler service to check the extra spec when scheduling instances. Follow the steps below:

From the command line, create aggregates for compute nodes with and without huge pages:
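A minimal sketch of this step with the nova client follows. The aggregate names and the hpgs metadata key are hypothetical, as are the helper function names; substitute your own naming.

```shell
# Create one aggregate for huge page hosts and one for all other hosts,
# and tag each with a metadata key the scheduler can match against.
# Aggregate names and the "hpgs" key are hypothetical.
create_hugepage_aggregates() {
    nova aggregate-create hpgs-aggr &&
    nova aggregate-set-metadata hpgs-aggr hpgs=true &&
    nova aggregate-create normal-aggr &&
    nova aggregate-set-metadata normal-aggr hpgs=false
}

# Add a compute host to an aggregate. $1 = aggregate, $2 = host name
add_host_to_aggregate() {
    nova aggregate-add-host "$1" "$2"
}
```

Under this assumed naming, flavors would then carry a matching key such as aggregate_instance_extra_specs:hpgs=true (or false), and the scheduler would need the AggregateInstanceExtraSpecsFilter among its enabled filters for the metadata to be taken into account.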

If the status is ‘ERROR’, check the log files for lines containing this instance ID. The easiest way to do that is to run the following command on the Fuel Master node:

# grep -Ri <Instance ID> /var/log/docker-logs/remote/node-*

The ‘memoryBacking’ section should show that this instance’s memory is backed by huge pages. You may also see that the ‘cputune’ section reveals so-called ‘pinning’ of this instance’s vCPUs. This means the instance will only run on physical CPU cores that have direct access to this instance’s memory and comes as a bonus from hypervisor awareness of the host physical topology. We will discuss instance CPU pinning in the next section.
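To check this without reading the whole domain definition, the XML can be filtered for a hugepages element inside memoryBacking. This is a sketch assuming libvirt's usual domain XML layout; is_hugepage_backed is a hypothetical helper name and the domain name in the example is a placeholder.

```shell
# Succeed if the libvirt domain XML on stdin declares huge page backing.
is_hugepage_backed() {
    sed -n '/<memoryBacking>/,/<\/memoryBacking>/p' | grep -q '<hugepages'
}

# Example (on the compute node; the domain name is a placeholder):
#   virsh dumpxml instance-00000001 | is_hugepage_backed && echo "huge pages in use"
```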

You may also look at the QEMU process arguments and make sure they contain relevant options, such as:

We can see that the instance uses 1000 huge pages: 1000 pages of 2048 KB each add up to 2000 MB, which matches this flavor’s memory of roughly 2 GB.
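The page count follows directly from the flavor's memory and the page size. The sketch below assumes the flavor has 2000 MB of RAM, which is what 1000 pages of 2048 kB add up to, and it rejects sizes that are not a whole number of pages. The helper name instance_hugepages is hypothetical.

```shell
# Huge pages needed to back an instance.
# $1 = flavor RAM in MB, $2 = page size in kB
# Fails if the RAM is not a whole number of pages.
instance_hugepages() {
    ram_kb=$(( $1 * 1024 ))
    if [ $(( ram_kb % $2 )) -ne 0 ]; then
        echo "RAM of $1 MB is not a multiple of $2 kB" >&2
        return 1
    fi
    echo $(( ram_kb / $2 ))
}

instance_hugepages 2000 2048    # prints 1000
```

The same 2000 MB flavor does not divide evenly into 1 GB pages, so the helper reports an error for that combination.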

Note: It’s possible to use more than one NUMA host cell for a single instance with the flavor key hw:numa_nodes, but you should be aware that multi-cell instances may show worse performance than single-cell instances when the processes inside them aren’t aware of their NUMA topology. See more on this subject in the section about NUMA CPU pinning.