About Florent Bruneau

We do write complex software, and like everyone doing so, we need a way to structure our source code. Our choice was to go modular. A module is a singleton that defines a feature or a set of related feature and exposes some APIs in order for other modules to access these features. Everything else, including the details of the implementation, is private. Examples of modules are the RPC layer, the threading library, the query engine of a database, the authentication engine, … Then, we can compose the various daemons of our application by including the corresponding modules and their dependencies.

Most modules maintain an internal state and as a consequence they have to be initialized and deinitialized at some point in the lifetime of the program. An internal rule at Intersec is to name the constructor {module}_initialize() and the destructor {module}_shutdown(). Once defined, these functions have to be called and this is where everything become complicated when your program has tens of modules with complex dependencies between them.

Introduction

Back in 2009, Snow Leopard was quite an exciting OS X release. It didn’t focus on new user-visible features but instead introduced a handful of low level technologies. Two of those technologies Grand Central Dispatch (a.k.a. GCD) and OpenCL were designed to help developers benefit from the new computing power of modern computer architectures: multicore processors for the former and GPUs for the latter.

Alongside the GCD engine came a C language extension calledblocks. Blocks are the C-based flavor of what is commonly called a closure: a callable object that captures the context in which it was created. The syntax for blocks is very similar to the one used for functions, with the exception that the pointer star is * replaced by a caret ^. This allows inline definition of callbacks which often can help improving the readability of the code.

In the third post of the memory series we briefly explained locality and why it is an important principle to keep in mind while developing a memory-intensive program. This new post is going to be more concrete and explains what actually happens behind the scene in a very simple example.

This post is a follow-up to a recent interview with a (brilliant) candidate1. As a subsidiary question, we presented him with the following two structure definitions:

1

2

3

4

structfoo_t{

intlen;

char*data;

};

1

2

3

4

structbar_t{

intlen;

chardata[];

};

The question was: what is the difference between these two structures, what are the pros and the cons of both of them? For the remaining of the article we will suppose we are working on an x86_64 architecture.

By coincidence, an intern asked more or less at the same time why we were using bar_t-like structures in our custom database engine.

Introduction

Here we are! We spent 4 articles explaining what memory is, how to deal with it and what are the kind of problems you can expect from it. Even the best developers write bugs. A commonly accepted estimation seems to be around of few tens of bugs per thousand of lines of code, which is definitely quite huge. As a consequence, even if you proficiently mastered all the concepts covered by our articles, you’ll still probably have a few memory-related bugs.

Memory-related bugs may be particularly hard to spot and fix. Let’s take the following program as an example:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

#include <stdio.h>

#define MAX_LINE_SIZE 32

staticconstchar*build_message(constchar*name)

{

charmessage[MAX_LINE_SIZE];

sprintf(message,"hello %s!\n",name);

returnmessage;

}

intmain(intargc,char*argv[])

{

fputs(build_message(argc>1?argv[1]:"world"),stdout);

return0;

}

This program is supposed to take a message as argument and print “hello <message>!” (the default message being “world”).

The behavior of this program is completely undefined, it is buggy, however it will probably not crash. The function build_message returns a pointer to some memory allocated in its stack-frame. Because of how the stack works, that memory is very susceptible to be overwritten by another function call later, possibly by fputs. As a consequence, if fputs internally uses sufficient stack-memory to overwrite the message, then the output will be corrupted (and the program may even crash), in the other case the program will print the expected message. Moreover, the program may overflow its buffer because of the use of the unsafe sprintf function that has no limit in the number of bytes written.

So, the behavior of the program varies depending on the size of the message given in the command line, the value of MAX_LINE_SIZE and the implementation of fputs. What’s annoying with this kind of bug is that the result may not be obvious: the program “works” well enough with simple use cases and will only fail the day it will receive a parameter with the right properties to exhibit the issue. That’s why it’s important that developers are at ease with some tools that will help them to validate (or to debug) memory management.

In this last article, we will cover some free tools that we consider should be part of the minimal toolkit of a C (and C++) developer.

malloc() is not the one-size-fits-all allocator

malloc() is extremely convenient because it is generic. It does not make any assumptions about the context of the allocation and the deallocation. Such allocators may just follow each other, or be separated by a whole job execution. They may take place in the same thread, or not… Since it is generic, each allocation is different from each other, meaning that long term allocations share the same pool as short term ones.

Consequently, the implementation of malloc() is complex. Since memory can be shared by several threads, the pool must be shared and locking is required. Since modern hardware has more and more physical threads, locking the pool at every single allocation would have disastrous impacts on performance. Therefore, modern malloc() implementations have thread-local caches and will lock the main pool only if the caches get too small or too large. A side effect is that some memory gets stuck in thread-local caches and is not easily accessible from other threads.

Since chunks of memory can get stuck at different locations (within thread-local caches, in the global pool, or just simply allocated by the process), the heap gets fragmented. It becomes hard to release unused memory to the kernel, and it becomes highly probable that two successive allocations will return chunk of memories that are far from each other, generating random accesses to the heap. As we have seen in the previous article, random access is far from being the optimal solution for accessing memory.

As a consequence, it is sometimes necessary to have specialized allocators with predictable behavior. At Intersec, we have several of them to use in various situations. In some specific use cases we increase performance by several orders of magnitude.

Developer point of view

In the previousarticles we dealt with memory classification and analysis from an outer point of view. We saw that memory can be allocated in different ways with various properties. In the remaining articles of the series we will take a developer point of view.

At Intersec we write all of our software in C, which means that we are constantly dealing with memory management. We want our developers to have a solid knowledge of the various existing memory pools. In this article we will have an overview of the main sources of memory available to C programmers on Linux. We will also see some rules of memory management that will help you keep your program correct and efficient.

From Virtual to Physical

In the previous article, we introduced a way to classify the memory a process reclaimed. We used 4 quadrants using two axis: private/shared and anonymous/file-backed. We also evoked the complexity of the sharing mechanism and the fact that all memory is basically reclaimed to the kernel.

Everything we talked about was virtual. It was all about reservation of memory addresses, but a reserved address is not always immediately mapped to physical memory by the kernel. Most of the time, the kernel delays the actual allocation of physical memory until the time of the first access (or the time of the first write in some cases)… and even then, this is done with the granularity of a page (commonly 4KiB). Moreover, some pages may be swapped out after being allocated, that means they get written to disk in order to allow other pages to be put in RAM.

As a consequence, knowing the actual size of physical memory used by a process (known as resident memory of the process) is really a hard game… and the sole component of the system that actually knows about it is the kernel (it’s even one of its jobs). Fortunately, the kernel exposes some interfaces that will let you retrieve some statistics about the system or a specific process. This article enters into the depth of the tools provided by the Linux ecosystem to analyze the memory pattern of processes.

Introduction

At Intersec we chose the C programming language because it gives us a full control on what we’re doing, and achieves a high level of performances. For many people, performance is just about using as few CPU instructions as possible. However, on modern hardware it’s much more complicated than just CPU. Algorithms have to deal with memory, CPU, disk and network I/Os… Each of them adds to the cost of the algorithm and each of them must be properly understood in order to guarantee both the performance and the reliability of the algorithm.

The impact of CPU (and as a consequence, the algorithmic complexity) on performances is well understood, as are disk and network latencies. However the memory seems much less understood. As our experience with our customers shows, even the output of widely used tools, such as top, are cryptic to most system administrators.

This post is the first in a series of five about memory. We will deal with topics such as the definition of memory, how it is managed, how to read the output of tools… This series will address subjects that will be of interest for both developers and system administrators. While most rules should apply to most modern operating systems, we’ll talk more specifically about Linux and the C programming language.