Category: Memory

There are two main tools I like to use for anything memory-related in my C++ code. The following is example code I wrote as my implementation of a shared pointer. I think my code works as expected, but I am not sure how much memory, if any, I am leaking.

The output from valgrind shows the entire picture of the virtual memory and how the source files and code are laid out in memory. It detects a loss of 32 bytes of heap memory still in use at exit.

The fix:

The output shown by the Google AddressSanitizer was useful at the micro level. It pointed precisely to the line number, and I was able to apply my first fix for the memory leak, in the virtual destructor, as follows:
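The shared-pointer code and the exact fix are not reproduced here; purely as an illustration of the kind of fix described (making a base-class destructor virtual so that deleting through a base pointer runs the derived destructor), here is a hypothetical sketch:

struct Base {
    // Before the fix the destructor was non-virtual, so deleting a Derived
    // object through a Base* never ran ~Derived() and leaked its buffer.
    virtual ~Base() = default;   // the fix: declare the destructor virtual
};

struct Derived : Base {
    Derived() : buffer(new char[32]) {}       // hypothetical 32-byte allocation
    ~Derived() override { delete[] buffer; }  // now runs when deleted via Base*
private:
    char* buffer;
};

int main() {
    Base* p = new Derived();
    delete p;   // with a virtual ~Base() this correctly invokes ~Derived()
    return 0;
}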

A modern microprocessor pipeline is around 14 stages deep, and program instructions are reordered all the time for optimization purposes.

Linux 2.6 supports full kernel preemption, i.e. a running thread can be suspended at almost any point, even while executing in the kernel.

Atomic types are locations in main memory to which access is exclusive to one thread/process at a time.

Barriers are used to order accesses to memory locations.

Atomic operations are provided at the hardware level in order to make the operations indivisible.

The implementation is highly hardware-dependent; x86 has the strictest rules around memory ordering.

Atomic operations with memory fences prevent reordering of instructions in order to make the operation indivisible.

Atomic operations are expensive because the OS and hardware cannot apply all of their usual optimizations.
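As a small illustration of how a barrier orders accesses, here is a minimal sketch (the data and ready names are made up for the example) in which a release fence in the writer pairs with an acquire fence in the reader:

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;                       // plain, non-atomic payload
std::atomic<bool> ready{false};     // flag used to publish the payload

void producer() {
    data = 42;                                            // write the payload
    std::atomic_thread_fence(std::memory_order_release);  // writes before the fence stay before the flag
    ready.store(true, std::memory_order_relaxed);
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) {}     // spin until published
    std::atomic_thread_fence(std::memory_order_acquire);  // reads after the fence stay after the flag
    assert(data == 42);                                   // guaranteed to observe the write
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}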

The <atomic> header provides various atomic types. The following is a non-exhaustive list of the atomic type aliases and the specializations they correspond to:

atomic_bool: std::atomic<bool>

atomic_char: std::atomic<char>

atomic_schar: std::atomic<signed char>

atomic_uchar: std::atomic<unsigned char>

atomic_int: std::atomic<int>

atomic_uint: std::atomic<unsigned>

atomic_short: std::atomic<short>

atomic_ushort: std::atomic<unsigned short>

atomic_long: std::atomic<long>

atomic_ulong: std::atomic<unsigned long>

atomic_llong: std::atomic<long long>

atomic_ullong: std::atomic<unsigned long long>

atomic_char16_t: std::atomic<char16_t>

atomic_char32_t: std::atomic<char32_t>

atomic_wchar_t: std::atomic<wchar_t>

Operations on Atomic types

These operations take an argument for the memory order, which can be one of std::memory_order_relaxed, std::memory_order_acquire, std::memory_order_release, std::memory_order_acq_rel, std::memory_order_consume, or std::memory_order_seq_cst.

load: a read operation on an atomic type.

store: a write operation on an atomic type.

exchange: a read-modify-write operation on an atomic type; it stores a new value and returns the previous one.

compare_exchange: the compare operations are used as compare_exchange(expected, desired, <optional memory order>); on a successful exchange they return true, otherwise false.

compare_exchange_weak: intended for architectures where a single atomic read-modify-write instruction is not guaranteed; it can fail spuriously, so it is advised to use it in a loop. It has the same effect as compare_exchange_strong on the x86 platform.

compare_exchange_strong: guaranteed not to fail spuriously; it returns false only when the stored value actually differs from the expected value, and true on success.

fetch_ versions of add, sub, and, or, xor, etc.

overloaded operators such as +=, -=, &=, |=, ^=
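A small sketch exercising these operations (the counter variable and the values are illustrative):

#include <atomic>
#include <iostream>

int main() {
    std::atomic<int> counter{0};

    counter.store(5, std::memory_order_relaxed);       // write
    int v = counter.load(std::memory_order_relaxed);   // read (5)
    int old = counter.exchange(7);                     // read-modify-write, returns previous value (5)
    counter.fetch_add(3, std::memory_order_acq_rel);   // counter is now 10

    // compare_exchange_weak may fail spuriously, so it is typically used in a loop:
    int expected = counter.load();
    while (!counter.compare_exchange_weak(expected, expected * 2)) {
        // on failure, 'expected' is updated with the current value and we retry
    }

    std::cout << v << ' ' << old << ' ' << counter.load() << '\n';  // prints: 5 5 20
    return 0;
}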

Lock Based implementation of a Multi-producer, Multi-consumer Queue.

It is important to understand lock-based data structures before implementing lock-free data structures.
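As a reference point, here is a minimal sketch of a lock-based multi-producer, multi-consumer queue using std::mutex and std::condition_variable; the class name and structure are my own for illustration, not taken from any particular library:

#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class LockedQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(m_);   // exclusive access while modifying the queue
            q_.push(std::move(value));
        }
        cv_.notify_one();                           // wake one waiting consumer
    }

    T pop() {                                       // blocks until an item is available
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T value = std::move(q_.front());
        q_.pop();
        return value;
    }

private:
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
};

Any number of producer threads can call push() and any number of consumer threads can call pop() concurrently; the single mutex serializes every operation, which is exactly the contention a lock-free design tries to avoid.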

How is heap memory obtained from the kernel?
How efficiently is memory managed?
Is it managed by the kernel, by a library, or by the application itself?
Can heap memory be exploited?

These questions were on my mind for quite some time, but only recently did I find the time to understand the topic. So here I would like to share my fascination turned into knowledge!! Out there in the wild, many memory allocators are available:

dlmalloc – General purpose allocator

ptmalloc2 – glibc

jemalloc – FreeBSD and Firefox

tcmalloc – Google

libumem – Solaris

…

Every memory allocator claims to be fast, scalable, and memory efficient!! But not every allocator is well suited to our application. A memory-hungry application’s performance largely depends on the allocator’s performance. In this post, I will only talk about the ‘glibc malloc’ memory allocator. In the future, I am hoping to cover other memory allocators. Throughout…

A process is a program in execution. Since Linux represents everything in terms of files, currently running processes are also represented using the /proc file system. /proc/<pid> exposes the process's open files (sockets), pending signals, process state, kernel data structures, threads of execution, and data section.

Processes provide two types of virtualization:

Virtual Processor

Virtual Memory

Processes are created on Linux using the fork() system call. When fork() is called, a new process is created as a child of the process that called fork(). It is called a child because it gets a copy of the resources (data, variables, code, pages, sockets, etc.) of its parent process.

The list of processes is stored in a doubly linked list called the task list. The process descriptor is of type struct task_struct, defined in <linux/sched.h>.

Processes are identified by a PID. The PID is defined by the pid_t data type, an integer whose default maximum value is 32,768. Generally, low PID numbers are used by kernel and system processes and the rest by user processes. PIDs are handed out sequentially; when the counter reaches the maximum limit, it wraps around and allocation resumes from the low end, skipping PIDs that are still in use. The maximum number of PIDs can be increased by setting the value in /proc/sys/kernel/pid_max.

If the return value of the fork() call is 0, the fork succeeded and control is in the child process.

if (pid == 0)
{
printf("child process");
}

If the return value of the fork() call is greater than 0, the fork succeeded and control is in the parent process; the value returned is the child's PID.

else if (pid > 0)
{
printf("parent process");
}

If the return value of fork() is negative, the call failed.

else
{
printf("fork failed");
}
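Putting the three cases together, a minimal complete sketch (with hypothetical output strings) might look like this:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    pid_t pid = fork();

    if (pid == 0) {
        printf("child process\n");                       // control is in the child
    } else if (pid > 0) {
        printf("parent process, child pid %d\n", pid);   // fork() returned the child's PID
        wait(NULL);                                      // reap the child
    } else {
        perror("fork failed");
        return 1;
    }
    return 0;
}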

fork() creates a child process that is a copy of the current task. It differs from the parent only in its PID, its PPID (the parent's PID), and certain resources and statistics (e.g. pending signals) which are not inherited.

Besides the open files, other properties of the parent are inherited by the child:

Process 0 and 1

Process 0 is the ancestor of all processes and is also known as the swapper process. It is built from scratch during boot-up. Process 0 initializes all the data structures needed by the kernel, enables interrupts, and creates the init process, which is process 1. The init process is invoked by the kernel at the end of the bootstrap procedure and is responsible for bringing up the system after the kernel has been bootstrapped. init usually reads the system-dependent initialization files (/etc/rc* files or /etc/inittab and the files in /etc/init.d) and brings the system to a certain state. Process 1 never dies. It is a normal user process, not a system process within the kernel, but it runs with superuser privileges.

State of a Process

A process can be in any one of the following states:

TASK_RUNNING: The process is either executing on a CPU or waiting on a run queue to be executed.

TASK_INTERRUPTIBLE: The process is sleeping/suspended; a signal can wake it up.

TASK_UNINTERRUPTIBLE: The process is sleeping, but a signal delivered to it will not change its state.

TASK_STOPPED: The process's execution has been stopped.

TASK_TRACED: The process is being traced by a debugger.

EXIT_DEAD: The child process is being removed because the parent process issued wait4() or waitpid().

EXIT_ZOMBIE: The child process has terminated, but the parent has not yet issued wait() or waitpid().

Copy on Write – an optimization on fork()

Not all of the data and resources that belong to the parent are copied to the child immediately. Copying is done in steps, as needed; this is an optimization to delay copying data until it is actually required. Another reason is that after calls such as exec(), none of the parent's data is needed, so there should be no need to copy any pages from the parent at all.

Process Switch

During a process switch, the registers are saved in the kernel memory area of the process and loaded back when the process resumes on the CPU. This used to be done using the far jmp assembly instruction on x86. From Linux 2.6 onwards, software is used for context switching instead of hardware context switching; a software context switch is safer and faster.

Scheduling

Processes run on a time-slicing basis, giving the user the illusion of exclusive access to the CPU and of unlimited memory. The scheduling of processes depends on the scheduling policy or algorithm. The scheduling algorithm must schedule in such a way that processes get fast response times and good throughput, with limited use of resources and while honoring priorities.

Linux processes are preemptible. A higher-priority process runs first, interrupting a lower-priority process if one is running. There are the following scheduling classes of Linux processes:

SCHED_FIFO

SCHED_RR

SCHED_NORMAL

Scheduling factors for a process

static priority: between 100 (high priority) to 139 (low priority)

nice value: -20 (high priority) to 19 (low priority)

dynamic priority: between 100 (high priority) to 139 (low priority)

real-time priority: 1 (lowest priority) to 99 (highest priority)

Kernel Preemption

In non-preemptive kernels, kernel code runs until completion. That is, the scheduler cannot reschedule a task while it is in the kernel: kernel code is scheduled cooperatively, not preemptively. Kernel code runs until it finishes (returns to user-space) or explicitly blocks.

The Linux kernel (since 2.6), unlike most other Unix variants and many other operating systems, is a fully preemptive kernel. It is possible to preempt a task at any point, so long as the kernel is in a state in which it is safe to reschedule.

Threads

Linux threads are lightweight processes, i.e. Linux does not make much of a distinction between threads and processes. Threads share resources with the parent, such as heap space, file descriptors, and sockets, but each thread gets its own stack.

Each thread has a TCB, a thread control block, which contains the following:

Thread identifier: a unique id (tid) assigned to every new thread

Stack pointer: points to the thread's stack in the process

Program counter: points to the current program instruction of the thread

State of the thread (running, ready, waiting, start, done)

Thread's register values

Pointer to the process control block (PCB) of the process that the thread lives in

Threads are also created with the clone() system call, like other processes, except that some flags are passed to clone, e.g.:

clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);
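Application code rarely calls clone() directly; a threading library issues a call like the one above under the hood. A minimal std::thread sketch illustrating that threads share the process's heap while each gets its own stack (the variable names are illustrative):

#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::vector<int> shared;              // heap-backed data, visible to both threads
    std::thread t([&shared] {
        int local = 42;                   // lives on this thread's own stack
        shared.push_back(local);          // only this thread touches 'shared' here
    });
    t.join();
    std::cout << shared.size() << '\n';   // prints 1
    return 0;
}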

Address Space of a Process

The address space of a process consists of the linear addresses the process is allowed to use. The linear address space is different for each process. Linear addresses are managed in the form of memory regions. The following are examples of system calls used to manage memory regions:

brk() System call:

brk() is the only one of these functions that is itself a syscall; the other functions are implemented using brk() and mmap().

It allocates and deallocates whole pages, since it acts on a memory region.
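For illustration, here is a small sketch using sbrk(), the C library wrapper around the same program-break mechanism that brk() adjusts; real programs would normally go through malloc/new instead:

#include <stdio.h>
#include <unistd.h>

int main() {
    void* start = sbrk(0);                // current program break (end of the heap region)
    if (sbrk(4096) == (void*)-1) {        // grow the heap by one page
        perror("sbrk");
        return 1;
    }
    void* end = sbrk(0);
    printf("break moved from %p to %p\n", start, end);
    sbrk(-4096);                          // give the page back
    return 0;
}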

System calls a.k.a syscalls:

Syscalls are implemented in the kernel and can be called from a process running in user space with appropriate privileges. There are around 330 system calls in Linux. The Linux API is written in C.
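For illustration, a syscall can also be invoked by number through glibc's syscall() wrapper; the sketch below uses SYS_getpid and should print the same value as the getpid() library wrapper:

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main() {
    long pid1 = syscall(SYS_getpid);   // raw system call by number
    pid_t pid2 = getpid();             // the usual C library wrapper
    printf("%ld %d\n", pid1, (int)pid2);
    return 0;
}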

Process Termination

Process termination can be normal (finishing the intended code execution) or abnormal. Abnormal termination happens for reasons such as the following:

The kernel, the process itself, or another process sends a certain signal.

Segfault: the process tries to access a memory location it does not have the privilege to access.

An abnormal condition causes an exit with a non-zero code.

On process exit, the kernel releases the memory and other resources held by the process, such as file descriptors.

When a parent process dies, its children become orphan processes and init becomes their parent. Sometimes parents do not wait for child processes after fork(); those children become zombie processes after they terminate. A process can wait (block) while its children are in the running state using wait or waitpid.

The wait function can block the caller until a child process terminates, whereas waitpid has an option that prevents it from blocking.

The waitpid function doesn’t wait for the child that terminates first; it has a number of options that control which process it waits for.
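A small sketch of that non-blocking option, assuming a child that simply sleeps: waitpid() with WNOHANG returns 0 immediately while the child is still running, instead of blocking the way wait() does:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    pid_t pid = fork();
    if (pid == 0) {                                     // child: pretend to do some work
        sleep(1);
        return 0;
    }

    int status = 0;
    while (waitpid(pid, &status, WNOHANG) == 0) {       // 0 means: child not finished yet
        printf("child still running, doing other work...\n");
        usleep(200000);                                 // poll every 200 ms
    }
    printf("child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}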

Configuration using tuned

There are profiles available for specific needs, such as network-latency, latency-performance, network-performance, throughput-performance, desktop, and balanced.

Examples of the profiles

Latency Performance Settings in Linux Tuned

Each setting is listed below, followed by its meaning.

force_latency=1

processor state C1

governor=performance

The CPU frequency governor keeps the CPU in a high performance state

energy_perf_bias=performance

Biases the CPU's energy/performance policy toward performance (a higher performance state)

min_perf_pct=100

This comes from the intel_pstate P-state driver. In addition to the interfaces provided by the cpufreq core for controlling frequency, the driver provides sysfs files for controlling P-state selection, added under /sys/devices/system/cpu/intel_pstate. min_perf_pct=100 keeps the minimum P-state at 100% of the available performance.

kernel.sched_min_granularity_ns=10000000

Minimal preemption granularity for CPU-bound tasks

vm.dirty_ratio=10

The process generating dirty data starts writeback itself when dirty pages reach this percentage of memory

vm.dirty_background_ratio=3

Start background writeback (via writeback threads) at this percentage

vm.swappiness=10

The swappiness parameter controls the tendency of the kernel to move processes out of physical memory and onto the swap disk. A value of 0 tells the kernel to avoid swapping processes out of physical memory for as long as possible; a value of 100 tells the kernel to aggressively swap processes out of physical memory and move them to the swap cache.

kernel.sched_migration_cost_ns=5000000

The total time the scheduler will consider a migrated process “cache hot” and thus less likely to be re-migrated

net.core.busy_read=50

This parameter controls the number of microseconds to wait for packets on the device queue for socket reads. It also sets the default value of the SO_BUSY_POLL option.

net.core.busy_poll=50

This parameter controls the number of microseconds to wait for packets on the device queue for socket polls and selects

kernel.numa_balancing=0

disable NUMA balancing

net.ipv4.tcp_fastopen=3

Linux supports configuring both overall client and server support via /proc/sys/net/ipv4/tcp_fastopen (net.ipv4.tcp_fastopen via sysctl). The value is a bit mask: the first bit enables or disables client support (default on), the second bit sets server support (default off), and the third bit sets whether data in a SYN packet is permitted without a TFO cookie option. Therefore, with a value of 1, TFO can be enabled only on outgoing connections (client only); a value of 2 allows TFO only on listening sockets (server only); and a value of 3 enables TFO for both client and server.

Memory Address

Logical address:

This address consists of a segment and an offset, i.e. a distance from the start address of the segment.

Linear address or Virtual address:

This address is a binary number in virtual memory that enables a process to use a location in main memory independently of other processes and to use more space than actually exists in primary storage by temporarily relegating some contents to a hard disk or internal flash drive.

Physical Address:

The address of the memory cells in the computer's RAM.

Need for Virtual Addressing

The main memory (RAM) available for a computer is limited.

Many processes use common code from shared libraries.

Using virtual addressing, the CPU and the kernel give each process the impression that memory is unlimited.

Address Translation

Since two of the three address types mentioned above are virtual, there is a need for address translation from logical to linear and from linear to physical addresses.

For this reason, each CPU contains a piece of hardware called the Memory Management Unit (MMU).

Segmentation Unit: converts the Logical address to Linear.

Paging Unit: converts Linear address to Physical.

The address translation from a linear address is done using two translation tables: the page directory and the page table.

NUMA is a shared memory architecture used in today’s multiprocessing systems. Each CPU is assigned its local memory and can access memory from other CPUs in the system. Local memory access provides the best performance; it provides low latency and high bandwidth. Accessing memory that is owned by the other CPU has a performance penalty, higher latency, and lower bandwidth.