kmalloc()

The kmalloc() function's operation is very similar to that of user-space's familiar malloc() routine, with the exception of the addition of a flags parameter. The kmalloc() function is a simple interface for obtaining kernel memory in byte-sized chunks. If you need whole pages, the previously discussed interfaces might be a better choice. For most kernel allocations, however, kmalloc() is the preferred interface.

The function is declared in <linux/slab.h>:

void * kmalloc(size_t size, int flags)

The function returns a pointer to a region of memory that is at leastsize bytes in length[3]. The region of memory allocated is physically contiguous. On error, it returns NULL. Kernel allocations always succeed, unless there is an insufficient amount of memory available. Thus, you must check for NULL after all calls to kmalloc() and handle the error appropriately.

[3] It may allocate more than you asked, although you have no way of knowing how much more! Because at its heart the kernel allocator is page-based, some allocations may be rounded up to fit within the available memory. The kernel never returns less memory than requested. If the kernel is unable to find at least the requested amount, the allocation fails and the function returns NULL.

Let's look at an example. Assume you need to dynamically allocate enough room for a fictional dog structure:

If the kmalloc() call succeeds, ptr now points to a block of memory that is at least the requested size. The GFP_KERNEL flag specifies the behavior of the memory allocator while trying to obtain the memory to return to the caller of kmalloc().

gfp_mask Flags

You've seen various examples of allocator flags in both the low-level page allocation functions and kmalloc(). Now it's time to discuss these flags in depth.

The flags are broken up into three categories: action modifiers, zone modifiers, and types. Action modifiers specify how the kernel is supposed to allocate the requested memory. In certain situations, only certain methods can be employed to allocate memory. For example, interrupt handlers must instruct the kernel not to sleep (because interrupt handlers cannot reschedule) in the course of allocating memory. Zone modifiers specify from where to allocate memory. As you saw earlier in this chapter, the kernel divides physical memory into multiple zones, each of which serves a different purpose. Zone modifiers specify from which of these zones to allocate. Type flags specify a combination of action and zone modifiers as needed by a certain type of memory allocation. Type flags simplify specifying numerous modifiers; instead, you generally specify just one type flag. The GFP_KERNEL is a type flag, which is used for code in process context inside the kernel. Let's look at the flags.

Action Modifiers

All the flags, the action modifiers included, are declared in <linux/gfp.h>. The file <linux/slab.h> includes this header, however, so you often need not include it directly. In reality, you will usually use only the type modifiers, which are discussed later. Nonetheless, it is good to have an understanding of these individual flags. Table 11.3 is a list of the action modifiers.

Table 11.3. Action Modifiers

Flag

Description

__GFP_WAIT

The allocator can sleep.

__GFP_HIGH

The allocator can access emergency pools.

__GFP_IO

The allocator can start disk I/O.

__GFP_FS

The allocator can start filesystem I/O.

__GFP_COLD

The allocator should use cache cold pages.

__GFP_NOWARN

The allocator will not print failure warnings.

__GFP_REPEAT

The allocator will repeat the allocation if it fails, but the allocation can potentially fail.

__GFP_NOFAIL

The allocator will indefinitely repeat the allocation. The allocation cannot fail.

__GFP_NORETRY

The allocator will never retry if the allocation fails.

__GFP_NO_GROW

Used internally by the slab layer.

__GFP_COMP

Add compound page metadata. Used internally by the hugetlb code.

These allocations can be specified together. For example,

ptr = kmalloc(size, __GFP_WAIT | __GFP_IO | __GFP_FS);

instructs the page allocator (ultimately alloc_pages()) that the allocation can block, perform I/O, and perform filesystem operations, if needed. This allows the kernel great freedom in how it can find the free memory to satisfy the allocation.

Most allocations specify these modifiers, but do so indirectly by way of the type flags we will discuss shortly. Don't worryyou won't have to figure out which of these weird flags to use every time you allocate memory!

Zone Modifiers

Zone modifiers specify from which memory zone the allocation should originate. Normally, allocations can be fulfilled from any zone. The kernel prefers ZONE_NORMAL, however, to ensure that the other zones have free pages when they are needed.

There are only two zone modifiers because there are only two zones other than ZONE_NORMAL (which is where, by default, allocations originate). Table 11.4 is a listing of the zone modifiers.

Table 11.4. Zone Modifiers

Flag

Description

__GFP_DMA

Allocate only from ZONE_DMA

__GFP_HIGHMEM

Allocate from ZONE_HIGHMEM or ZONE_NORMAL

Specifying one of these two flags modifies the zone from which the kernel attempts to satisfy the allocation. The __GFP_DMA flag forces the kernel to satisfy the request from ZONE_DMA. This flag says, with an odd accent, I absolutely must have memory into which I can perform DMA. Conversely, the __GFP_HIGHMEM flag instructs the allocator to satisfy the request from either ZONE_NORMAL or (preferentially) ZONE_HIGHMEM. This flag says, I can use high memory, so I can be a doll and hand you back some of that, but normal memory works, too. If neither flag is specified, the kernel fulfills the allocation from either ZONE_DMA or ZONE_NORMAL, with a strong preference to satisfy the allocation from ZONE_NORMAL. No zone flag essentially says, I don't care so long as it behaves normally.

You cannot specify __GFP_HIGHMEM to either __get_free_pages() or kmalloc(). Because these both return a logical address, and not a page structure, it is possible that these functions would allocate memory that is not currently mapped in the kernel's virtual address space and, thus, does not have a logical address. Only alloc_pages() can allocate high memory. The majority of your allocations, however, will not specify a zone modifier because ZONE_NORMAL is sufficient.

Type Flags

The type flags specify the required action and zone modifiers to fulfill a particular type of transaction. Therefore, kernel code tends to use the correct type flag and not specify the myriad of other flags it might need. This is both simpler and less error prone. Table 11.5 is a list of the type flags and Table 11.6 shows which modifiers are associated with each type flag.

Table 11.5. Type Flags

Flag

Description

GFP_ATOMIC

The allocation is high priority and must not sleep. This is the flag to use in interrupt handlers, in bottom halves, while holding a spinlock, and in other situations where you cannot sleep.

GFP_NOIO

This allocation can block, but must not initiate disk I/O. This is the flag to use in block I/O code when you cannot cause more disk I/O, which might lead to some unpleasant recursion.

GFP_NOFS

This allocation can block and can initiate disk I/O, if it must, but will not initiate a filesystem operation. This is the flag to use in filesystem code when you cannot start another filesystem operation.

GFP_KERNEL

This is a normal allocation and might block. This is the flag to use in process context code when it is safe to sleep. The kernel will do whatever it has to in order to obtain the memory requested by the caller. This flag should be your first choice.

GFP_USER

This is a normal allocation and might block. This flag is used to allocate memory for user-space processes.

GFP_HIGHUSER

This is an allocation from ZONE_HIGHMEM and might block. This flag is used to allocate memory for user-space processes.

GFP_DMA

This is an allocation from ZONE_DMA. Device drivers that need DMA-able memory use this flag, usually in combination with one of the above.

Table 11.6. Listing of the Modifiers Behind Each Type Flag

Flag

Modifier Flags

GFP_ATOMIC

__GFP_HIGH

GFP_NOIO

__GFP_WAIT

GFP_NOFS

(__GFP_WAIT | __GFP_IO)

GFP_KERNEL

(__GFP_WAIT | __GFP_IO | __GFP_FS)

GFP_USER

(__GFP_WAIT | __GFP_IO | __GFP_FS)

GFP_HIGHUSER

(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)

GFP_DMA

__GFP_DMA

Let's look at the frequently used flags and when and why you might need them. The vast majority of allocations in the kernel use the GFP_KERNEL flag. The resulting allocation is a normal priority allocation that might sleep. Because the call can block, this flag can be used only from process context that can safely reschedule (that is, no locks are held and so on). Because this flag does not make any stipulations as to how the kernel may obtain the requested memory, the memory allocation has a high probability of succeeding.

On the far other end of the spectrum is the GFP_ATOMIC flag. Because this flag specifies a memory allocation that cannot sleep, the allocation is very restrictive in the memory it can obtain for the caller. If no sufficiently sized contiguous chunk of memory is available, the kernel is not very likely to free memory because it cannot put the caller to sleep. Conversely, the GFP_KERNEL allocation can put the caller to sleep to swap inactive pages to disk, flush dirty pages to disk, and so on. Because GFP_ATOMIC is unable to perform any of these actions, it has less of a chance of succeeding (at least when memory is low) compared to GFP_KERNEL allocations. Nonetheless, the GFP_ATOMIC flag is the only option when the current code is unable to sleep, such as with interrupt handlers, softirqs, and tasklets.

In between these two flags are GFP_NOIO and GFP_NOFS. Allocations initiated with these flags might block, but they refrain from performing certain other operations. A GFP_NOIO allocation does not initiate any disk I/O whatsoever to fulfill the request. On the other hand, GFP_NOFS might initiate disk I/O, but does not initiate filesystem I/O. Why might you need these flags? They are needed for certain low-level block I/O or filesystem code, respectively. Imagine if a common path in the filesystem code allocated memory without the GFP_NOFS flag. The allocation could result in more filesystem operations, which would then beget other allocations and, thus, more filesystem operations! This could continue indefinitely. Code such as this that invokes the allocator must ensure that the allocator also does not execute it, or else the allocation can create a deadlock. Not surprisingly, the kernel uses these two flags only in a handful of places.

The GFP_DMA flag is used to specify that the allocator must satisfy the request from ZONE_DMA. This flag is used by device drivers, which need DMA-able memory for their devices. Normally, you combine this flag with the GFP_ATOMIC or GFP_KERNEL flag.

In the vast majority of the code that you write you will use either GFP_KERNEL or GFP_ATOMIC. Table 11.7 is a list of the common situations and the flags to use. Regardless of the allocation type, you must check for and handle failures.

Table 11.7. Which Flag to Use When

Situation

Solution

Process context, can sleep

Use GFP_KERNEL

Process context, cannot sleep

Use GFP_ATOMIC, or perform your allocations with GFP_KERNEL at an earlier or later point when you can sleep

Interrupt handler

Use GFP_ATOMIC

Softirq

Use GFP_ATOMIC

Tasklet

Use GFP_ATOMIC

Need DMA-able memory, can sleep

Use (GFP_DMA | GFP_KERNEL)

Need DMA-able memory, cannot sleep

Use (GFP_DMA | GFP_ATOMIC), or perform your allocation at an earlier point when you can sleep

kfree()

The other end of kmalloc() is kfree(), which is declared in <linux/slab.h>:

void kfree(const void *ptr)

The kfree() method frees a block of memory previously allocated with kmalloc(). Calling this function on memory not previously allocated with kmalloc(), or on memory which has already been freed, results in very bad things, such as freeing memory belonging to another part of the kernel. Just as in user-space, be careful to balance your allocations with your deallocations to prevent memory leaks and other bugs. Note, calling kfree(NULL) is explicitly checked for and safe.

Look at an example of allocating memory in an interrupt handler. In this example, an interrupt handler wants to allocate a buffer to hold incoming data. The preprocessor macro BUF_SIZE is the size in bytes of this desired buffer, which is presumably larger than just a couple of bytes.