In mid-2012 I acquired my copy of The Linux Programming Interface book. Since then it has never left my desk, because on a daily basis I face some sort of technical challenge that is well explained in the book.

The book is one of the best written I’ve ever read. Despite its 1500 pages, it is so pleasant to read that I’ve read some chapters more than once. This is thanks to its high level of organization, which gives the book short and self-contained chapters, very instructive pictures and concise explanations.

Despite being a book focused on user-level programming, its explanations of the inner workings of libc and the kernel are more enlightening than those of most kernel books.

Another point that draws attention is that Mr. Kerrisk mentions the kernel version in which each feature became available. This is especially useful for embedded programmers, who usually work with outdated Linux versions.

A well-known practice in software engineering is to keep the scope of variables as narrow as possible, preferably close to where they are used. The more restricted a variable’s scope, the easier it is to predict its value. This scope restriction is even more important in concurrent programs, since variables with local scope are less susceptible to race conditions and remove the need for extra synchronization.

Most of the discussion regarding global variables can be extended to statically allocated local variables. Despite not being as harmful as globally visible variables, they share most of their disadvantages.

This article describes a technique to replace global variables. But first let’s argue (once again) why they should be left as a last resort.

Reasons to avoid global data

In most cases, a variable with global scope adds unnecessary complexity to your code, especially when a routine’s behaviour depends on a global variable’s value (which varies over time), leading to what is called a “side effect”.

Here are the main reasons to avoid global variables:

Increase of complexity: the amount of code you must be aware of while working with a global variable grows in direct proportion to the number of routines that use it.

Deterioration of modularity: as stated by Eric S. Raymond in The Art of Unix Programming [1] “The first and most important quality of modular code is encapsulation. Well-encapsulated modules don’t expose their internals to each other. They don’t call into the middle of each others’ implementations, and they don’t promiscuously share global data”.

Reentrancy: when executions of the same routine can overlap, being called from different threads or asynchronous flows in the code (e.g. a signal handler), the routine is subject to reentrancy issues. If it uses global data, one execution may affect the other, and a different order of evaluation may produce distinct results. Reentrant routines are those in which each execution is independent and self-contained, so it does not matter whether they are called concurrently or not. In other words, the order of evaluation does not change the result.
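To make the reentrancy point concrete, here is a minimal sketch that contrasts a routine keeping its result in static storage with one that works only on caller-provided storage. The names format_id_shared and format_id are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Non-reentrant: every call writes to the same static buffer, so
 * overlapping executions (threads, signal handlers) clobber each other. */
const char *format_id_shared(int id)
{
    static char buffer[16];

    snprintf(buffer, sizeof buffer, "id-%d", id);
    return buffer;
}

/* Reentrant: the routine touches only its parameters and locals, so each
 * execution is independent of any other. */
const char *format_id(int id, char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);

    snprintf(out, out_size, "id-%d", id);
    return out;
}
```

With format_id_shared, a second call silently changes the text returned by the first one. With format_id, two calls that use separate buffers never interfere.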

Uncertainty: the intrinsic problem of global variables

Given the following snippet:

/* Global accumulator shared by every routine below. */
int g_sum;

void sum_init(void)
{
    g_sum = 0;
}

void sum_add_hundred(void)
{
    g_sum += 100;
}

void sum_add_thousand(void)
{
    g_sum += 1000;
}

int main(void)
{
    sum_init();
    sum_add_hundred();
    sum_add_thousand();
    return 0;
}

The problem with this approach is that, in each function that changes the value of the global variable, it is not clear which value the variable had prior to the function call. This is probably the worst problem associated with global data: the high level of uncertainty about the variable’s value at a given time. The problem is aggravated when you have nested routines that modify the global variable.

A technique to solve this problem is to create intermediate variables, “passing all data through function arguments and return values” [2]. Instead of modifying a global, the function operates only on local variables, parameters and return values. Basically it consists of the following steps:

1. Replace each global input (the values the function reads) by a parameter;
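As a minimal sketch of this idea applied to the g_sum snippet (the new signatures and the compute_sum name are my own assumptions, not part of the original listing), the running total can travel through parameters and return values instead of a global:

```c
#include <assert.h>

/* Each routine reads its input from a parameter and reports its output
 * through the return value; no global state is involved. */
int sum_add_hundred(int sum)
{
    return sum + 100;
}

int sum_add_thousand(int sum)
{
    return sum + 1000;
}

/* Hypothetical caller: the value of sum is visible at every step. */
int compute_sum(void)
{
    int sum = 0;

    sum = sum_add_hundred(sum);
    assert(sum == 100);          /* the intermediate value is easy to predict */
    sum = sum_add_thousand(sum);
    return sum;
}
```

Since no routine depends on hidden state, each one can be checked in isolation: sum_add_hundred(5) is always 105, no matter what ran before it.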

The keyword const is used to assert that an identifier, once assigned a value, must not have a new value re-assigned to it.

Qualifying a read-only variable or parameter as constant provides many benefits: some are related to code readability, while others are related to the program’s robustness and security. In this post I’ll touch on both and try to convince you why this simple (and often overlooked) programming technique must become a habit.

Pointer to a Constant Value vs. Constant Pointer

It must be clear that const int *p1 is not the same as int * const p2. The first is a pointer to an integer that can point to other integers, but whose pointed value cannot be changed through the pointer, whereas the second is a constant pointer to a changeable integer, that can point to only one variable, whose value can be changed. In other words, p1 is a pointer to a constant, while p2 is a constant pointer.
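The difference is easier to see in code. The sketch below uses hypothetical function names; the commented-out lines are the ones a conforming compiler rejects.

```c
#include <assert.h>

/* p1: pointer to a constant. It may be re-pointed, but the pointed
 * value cannot be changed through it. */
int read_through_pointer_to_const(void)
{
    int a = 1;
    int b = 2;
    const int *p1 = &a;
    int first = *p1;

    /* *p1 = 10; */          /* rejected: the target is read-only via p1 */
    p1 = &b;                 /* allowed: the pointer itself is not const */
    return first + *p1;      /* 1 + 2 */
}

/* p2: constant pointer. It always points to the same variable, whose
 * value can still be changed through it. */
int write_through_const_pointer(void)
{
    int value = 1;
    int *const p2 = &value;

    *p2 = 10;                /* allowed: the target is writable */
    /* p2 = &value; */       /* rejected: the pointer itself is const */
    assert(value == 10);
    return value;
}
```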

Code documentation

The const qualifier is a good way to document your intent about a variable’s behavior.

For example, McConnell [1] suggests using it in a routine’s parameter list. With const, a programmer states which parameters the routine is expected to leave untouched (input parameters) and which it intends to change (output parameters). This kind of documentation can be enforced by the compiler, whereas a comment cannot. For example, in the snippet from CERT [2], if the function implementation is not consistent with the function’s signature, the compiler warns about it.
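A minimal sketch of const documenting input and output parameters follows. It is not the CERT listing itself; copy_bytes is a hypothetical routine written only to illustrate the point.

```c
#include <assert.h>
#include <stddef.h>

/* src is an input parameter (const-qualified), dst is an output parameter.
 * The signature alone tells the reader which buffer may be modified. */
size_t copy_bytes(char *dst, const char *src, size_t count)
{
    assert(dst != NULL && src != NULL);

    for (size_t i = 0; i < count; i++) {
        dst[i] = src[i];
        /* src[i] = 0; */    /* rejected: src points to read-only data */
    }
    return count;
}
```

If the body tried to write through src, the mismatch between the implementation and the signature would be reported at compile time instead of being hidden in a comment.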

Globals and local static variables can also take advantage of the const qualifier, which prevents unwanted modifications of these long-lived variables in hidden places.

Finally, as const tends to taint the code it touches, the compiler will probably warn you about changes to constant values made through dereferences.

Compiler optimizations

Once a variable is qualified as constant, the compiler is allowed to make many optimizations on your code; some improve speed while others improve security and robustness. This section gives an example of a robustness optimization done by a PowerPC compiler.

Once qualified as constant, a global variable is placed in the read-only data section (.rodata), whereas without the const qualifier it is placed in the initialized data section (.data). Being in a read-only section makes it less susceptible to buffer overflows. For example, in the snippet below:

and if one insists on executing the application, it will crash with a segmentation fault. OK, that is not a good thing, but at least you know that the application has a bug and you can track it down with moderate effort.

However, if you leave the string vector as a regular variable, without const-qualifying it:

char string[] = "The importance of const qualifier.\n";

the compiler will silently generate the code (some compilers still warn, but it is not guaranteed in all toolchains), and the strcpy function will overwrite other variables’ memory space, leading to bugs that are hard to track, diagnose and reproduce: the kind that happen only with a very special combination of inputs. In the end, you are hopelessly left without any log or dump.

A routine (a term comprising functions, procedures and macros) is the basic unit of abstraction in the procedural programming paradigm. Routines allow developers to break complex sequences of instructions into simpler pieces. As stated by Steve McConnell in his book: “managing complexity is software’s primary technical imperative” [1], and in my opinion, routines are the main tool to reduce complexity in software developed with procedural languages.

Note: when I refer to the programming concept, I use the term routine, whereas I use the terms function or macro when dealing with the routine’s technical implementation in C.

Software developed in procedural languages like C is built upon many levels of abstraction, each level represented by a set of routines. Routines are the major element provided by the language to deal with complexity. Thus, it is very important to know how to write them properly.

Related to that is the reason to create a routine. Reducing complexity is the single most important one [1], and comes before reusability, portability and performance. Even though those reasons are valid, one should always consider creating a routine if it is going to simplify the code.

My concern here is to discuss some details that make the difference between a good and a poor routine. Consider a good routine to be one that has a low rate of bugs, is robust and, above all, is easy for teammates to read and understand. Of course, a bad routine is the opposite.

Functions

Routines come in two flavors, the ones that return values and the ones that do not. Many books and even languages name them differently: describing functions as routines that return values, mirroring the math term; and procedures as the routines that do not return values, which just perform some specific action. We use the term function to describe the routine implementation in C, not differentiating those terms, even though you can simulate the procedure concept with void functions.

The major goals when writing a function are to achieve functional cohesion [1] and orthogonality [5].

Functional cohesion means that you design a function to do only one thing and do it very well. Why is this important? Because a routine with high functional cohesion is:

Usually shorter, which provides a greater level of modularity and reusability to your project;

More likely to be understood by other developers, because it is smaller, its name hints at everything the routine does, and people have to load less information into their brains to understand what the code does;

Easier to name, since it is simpler to name simple processes than complex ones;

Straightforward to test and debug, given the shorter range of inputs and outputs the routine is expected to receive and provide, respectively;

Less demanding to check, having fewer parameters and return values to scrutinize with asserts;

And much easier to inspect for bugs. Simply put, either the function does its single job right, or it does not.

Orthogonality applies to cooperating functions and is all about reducing the interdependency between them. Modularity is another name for it, but personally I prefer orthogonality, because it conveys a more direct purpose (after all, why are you modularizing your code?). Thus, changing one function should affect as few other functions as possible, or none at all. Avoiding the use of global data within functions (even just reading it) is another way to make your functions orthogonal. Orthogonality is also a close friend of reentrancy.

To get the concepts, just take a look at the snippets below and see for yourself which one makes the overall idea simpler to understand; check the main function and tell what is going on:

Although the second listing looks bigger, and indeed it is, the point of splitting the code into functions is to help you keep your focus and your changes on small pieces of code. Let’s say you found a bug in the data writing. Then you focus your effort on the write routine; you do not need to read all the other stuff to understand where the write is happening, because you know where it is: in the write function.
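The split into small, single-purpose functions can be sketched as follows. This is a hypothetical example with made-up names (read_samples, process_samples, write_result, report), not one of the listings referred to above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Read step: fills the buffer with 1, 2, 3, ... as stand-in input data. */
static size_t read_samples(int *samples, size_t capacity)
{
    for (size_t i = 0; i < capacity; i++)
        samples[i] = (int)i + 1;
    return capacity;
}

/* Process step: only computes; it knows nothing about I/O. */
static long process_samples(const int *samples, size_t count)
{
    long total = 0;

    for (size_t i = 0; i < count; i++)
        total += samples[i];
    return total;
}

/* Write step: the only place where the output format is decided. */
static int write_result(long total, char *out, size_t out_size)
{
    return snprintf(out, out_size, "total=%ld", total);
}

/* The caller reads like a summary of what is going on. */
int report(char *out, size_t out_size)
{
    int samples[4];

    assert(out != NULL && out_size > 0);

    size_t count = read_samples(samples, 4);
    long total = process_samples(samples, count);
    return write_result(total, out, out_size);
}
```

If the output format turns out to be wrong, write_result is the only function that needs to be read and changed.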

Macros

Macros are a kind of fake function. You can write a function-like macro that is processed by the preprocessor and expanded to a set of statements. Macros are not the safest way to implement code, so it is better to avoid them altogether, replacing them with functions wherever possible. A quote from The Practice of Programming [4]:

“There is a tendency among older C programmers to write macros instead of functions for very short computations that will be executed frequently; … The reason is performance: a macro avoids the overhead of a function call. This argument was weak even when C was first defined, a time of slow machines and expensive function calls; today it is irrelevant. With modern machines and compilers, the draw-backs of function macros outweigh their benefits.”

In this book, the authors argue that one of the most serious problems with function-like macros is that a parameter that appears more than once in the definition might be evaluated more than once, and if the argument in the call includes an expression with side effects (e.g. i++), the result is a subtle bug [4].
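The double evaluation is easy to reproduce. The sketch below uses a hypothetical MAX_MACRO and an equivalent max_int function; the asserts document what each one does with an i++ argument.

```c
#include <assert.h>

/* Function-like macro: each parameter is pasted wherever it appears. */
#define MAX_MACRO(a, b) ((a) > (b) ? (a) : (b))

/* Equivalent function: each argument is evaluated exactly once. */
static inline int max_int(int a, int b)
{
    return a > b ? a : b;
}

void macro_pitfall_demo(void)
{
    int i = 5;
    int from_macro = MAX_MACRO(i++, 3);   /* i++ appears twice after expansion */

    assert(from_macro == 6);              /* not 5: the second i++ was returned */
    assert(i == 7);                       /* incremented twice */

    int j = 5;
    int from_function = max_int(j++, 3);  /* j++ is evaluated once */

    assert(from_function == 5);
    assert(j == 6);
}
```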

So, many programmers use macros like functions; however, they do not always keep in mind the subtle differences or, worse yet, they do not even know they are using a macro, since even some standard C library functions can be implemented as macros on some platforms. As a result, your program may develop a hard-to-find bug on one specific platform.

Despite their simplicity, header files raise many doubts. In this post, I will provide some standards and insights on header files.

Edit: Michael Barr also gives insights about the subject in his post [6].

In my opinion, there are two points you must pay attention to when using headers: what to put in the header and how to include a header. The latter seems obvious, but there are a few mistakes you should avoid.

What should I put in a header file?

Include guards are the first thing you should ever write in a header file (if your IDE did not provide them already), see “PRE06-C. Enclose header files in an inclusion guard” at [5] and “The #define Guard” at [4]. They prevent the types and functions of a header file from being included more than once. Just remember not to use guard names that begin with an underscore followed by an uppercase letter, or with two underscores, since identifiers like _DONOTUSE_ and __DONOTUSETOO__ are reserved by the C standard, see “7.1.3 Reserved identifiers” at [1];

Avoid function definitions/implementations in header files, unless you want the function’s code to be placed at every single call site, bloating the binary. Header files should contain only function declarations/prototypes;

Include all the header files on which your header file depends. This is a bit controversial among developers, since the more header files you include, the slower your compilation runs. However, “Rule 2.1 Each header file should be self-contained” at [2] and “Names and Order of Includes” at [4] advocate making header files self-contained, in other words, resolving all types and functions within the header itself by including the files that declare them. I strongly agree with this, because resolving the types and functions within the header file saves other developers from the burden of solving “undefined reference” and “implicit declaration” kinds of errors. If you declare a variable or function of bool type within your header, include <stdbool.h> there; if you also use a uint32_t, include <stdint.h> too, and so on. In other words, include in the header exactly what you use there, no more, no less;

Avoid cyclic inclusions: if a header file A.h depends on a type defined in B.h, and B.h includes A.h, you have a cyclic inclusion. There may be a few more header files in the middle of the cycle, but you get the idea. This kind of inclusion takes some time to solve, which is why I prefer to always use completely defined types, resolved in the header file. The sooner you identify such cycles, the easier it is to solve them;
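Putting these recommendations together, a header could look like the sketch below. The file names, the NETWORK_H guard and the network_if_t type are hypothetical; both files are shown in a single block only to keep the example compilable.

```c
/* ---- network.h (hypothetical header) ---- */
#ifndef NETWORK_H          /* guard name without leading underscores */
#define NETWORK_H

#include <stdbool.h>       /* bool is used below */
#include <stdint.h>        /* uint32_t is used below */

typedef struct {
    uint32_t address;
    bool     is_up;
} network_if_t;

/* Declaration only; the definition lives in network.c. */
bool network_is_up(const network_if_t *iface);

#endif /* NETWORK_H */

/* ---- network.c (hypothetical implementation) ---- */
/* In a real project, #include "network.h" would be the first line here. */
#include <assert.h>
#include <stddef.h>

bool network_is_up(const network_if_t *iface)
{
    assert(iface != NULL);
    return iface->is_up;
}
```

The header includes exactly what it uses (<stdbool.h> and <stdint.h>), so any file can include it without worrying about inclusion order.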

How should I include a header file?

Do not use OS shortcuts (e.g. ‘.’ and ‘..’ on Unix) in include statements (from “Names and Order of Includes” at [4]). It is ugly and makes clear to your teammates your zero knowledge of build automation! And even if pride, code readability and elegance do not concern you, keep in mind that OS shortcuts bind your file to a specific directory, turning any attempt to reorganize directories and maintain Makefiles into a painful experience. By convention, the build automation tool (e.g. make) should be responsible for finding header files across your project (see “Header Files” at [3]); thus, when you move files around, you only need to update the Makefile, not the hundreds of files you moved;

Do not use absolute paths in include statements (from “Header Files” at [3]). The same advice as for OS shortcuts applies here: absolute paths are another way to bind your file to a directory. For example, if you adopted the absolute path because you have more than one header file with the same name, rename the header file or use flags in your Makefiles instead of including it through its absolute directory;

Include the header file of your implementation first (see “Names and Order of Includes” at [4]). In a module called network.c, the first include statement must be network.h. This rule ensures that your header file is self-contained, because your implementation file will not compile until you resolve all the header’s dependencies within the header itself. It is preferable to break your compilation now than to leave the problem to be solved by the other twelve developers who will use your module;

In this post I’ll present a good practice that avoids memory leaks and segmentation faults, among other memory- and resource-related issues.

The idea is very simple: it consists of releasing a resource at the same scope/level where you acquired it.

By resource, I mean a memory block, a thread, a file, or any other limited resource that is not freed automatically.

Following this simple pattern, you can match the pairs of opposite operations (e.g. malloc/free, fopen/fclose, among others) more easily. Because the release code is closer to the acquire code, it is easier to ensure that you always release a resource (giving you less chance of running out of a precious resource) and that you do not try to use a resource that no longer exists (avoiding segmentation faults).

The Pragmatic Programmer book provides a good discussion about resource balancing in many programming languages, but here we restrict ourselves to C.
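As a minimal sketch of the pattern (the function names are hypothetical), the routine below acquires a buffer with malloc and releases it with free in the same scope, while its helper only uses the buffer:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Helper: uses the buffer, but neither allocates nor frees it. */
static void fill_squares(unsigned long *values, size_t count)
{
    for (size_t i = 0; i < count; i++)
        values[i] = (unsigned long)i * (unsigned long)i;
}

/* The buffer is acquired and released in this scope, so the malloc/free
 * pair can be matched at a glance. */
bool sum_of_squares(size_t count, unsigned long *total)
{
    assert(total != NULL);

    *total = 0;
    if (count == 0)
        return true;                      /* nothing to acquire */

    unsigned long *values = malloc(count * sizeof *values);
    if (values == NULL)
        return false;                     /* allocation can fail: report it */

    fill_squares(values, count);
    for (size_t i = 0; i < count; i++)
        *total += values[i];

    free(values);                         /* released where it was acquired */
    return true;
}
```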

Postponed buffer allocation

Sometimes you don’t have all the information required to acquire the resource up front (for example, you don’t know the required quantity in advance). Because of that, the resource ends up being acquired in one scope and released in another, which prevents you from following the advice stated above.

One way to solve this problem is to hide the resource management inside a structure. The idea is called encapsulation, the same one applied in object-oriented languages.

This technique provides a way to postpone the buffer allocation by encapsulating it inside a structure. A different approach based on linked lists is presented at [2], but a simpler technique is shown here.

The autobuffer structure has two members: a pointer to a character array and the size of that array, to avoid buffer overflows. Three functions are provided to manipulate the structure:

autobuffer_t * autobuffer_alloc(void), which is a fake allocation function: it does not allocate the required memory for the internal buffer. It just initializes the structure that holds the buffer.

bool autobuffer_append(autobuffer_t *buffer, char *data, size_t data_size) is where the real job is done. Once you know the amount of memory you need, you simply append the data to the buffer; at this point the required amount of memory can be allocated and the data copied into it.

bool autobuffer_free(autobuffer_t *buffer) is the opposite of autobuffer_alloc(void); however, this is not a fake deallocation. This function really does free all the allocated memory, including the internal buffer.

The following code snippet illustrates this technique. Error handling was left out to shorten the code.
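What follows is a minimal sketch of the technique, reconstructed from the description above rather than copied from an original listing. The member names, the return types of read_user_data and print_user_data, and the fixed text standing in for user input are my own assumptions; a few basic checks were kept so that the block is safe to run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *data;   /* internal buffer, NULL until the first append */
    size_t size;   /* number of bytes currently stored */
} autobuffer_t;

/* "Fake" allocation: only the structure is created, not the buffer. */
autobuffer_t *autobuffer_alloc(void)
{
    autobuffer_t *buffer = malloc(sizeof *buffer);

    if (buffer != NULL) {
        buffer->data = NULL;
        buffer->size = 0;
    }
    return buffer;
}

/* The real allocation happens here, once the required size is known. */
bool autobuffer_append(autobuffer_t *buffer, char *data, size_t data_size)
{
    assert(buffer != NULL && data != NULL);

    if (data_size == 0)
        return true;

    char *grown = realloc(buffer->data, buffer->size + data_size);
    if (grown == NULL)
        return false;                 /* the old buffer is still valid */

    memcpy(grown + buffer->size, data, data_size);
    buffer->data = grown;
    buffer->size += data_size;
    return true;
}

/* Frees the internal buffer and the structure itself. */
bool autobuffer_free(autobuffer_t *buffer)
{
    if (buffer == NULL)
        return false;
    free(buffer->data);
    free(buffer);
    return true;
}

/* Knows how much data there is (fixed text stands in for user input). */
bool read_user_data(autobuffer_t *b)
{
    char input[] = "hello, world\n";

    return autobuffer_append(b, input, sizeof input - 1);
}

/* Uses the data without caring where or when it was allocated. */
void print_user_data(autobuffer_t *b)
{
    assert(b != NULL);
    if (b->size > 0)
        fwrite(b->data, 1, b->size, stdout);
}
```

A caller such as main would call autobuffer_alloc, pass the buffer to read_user_data and print_user_data, and finally call autobuffer_free, so that acquisition and release stay at the same level.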

In this code you can clearly see that we use the buffer in one place and create it in another. The read_user_data(autobuffer_t *b) function is where we know the amount of data we need, but print_user_data(autobuffer_t *b) is where we use it. We could pass a buffer allocated in main with an arbitrary size, but that is a kind of “guessing” way to do things, because main is not the place where you really know the required size.

Postponed file opening

The example previously shown handles buffer allocation and deallocation, however you could use the same technique to handle file opening and closing.

In this case, instead of postponing the buffer allocation in order to know the required size, you could postpone the file opening until you know its name or path, for example. And considering that you would like to use this file in many parts of your code, it should have a wider scope than the one in which it was opened.
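The same encapsulation can be sketched for files. The autofile_t type and its three functions below are hypothetical names that mirror the autobuffer interface; they are an illustration of the idea, not an existing API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    FILE *stream;   /* NULL until autofile_open succeeds */
} autofile_t;

/* "Fake" acquisition: the structure exists, but no file is open yet. */
autofile_t *autofile_alloc(void)
{
    autofile_t *file = malloc(sizeof *file);

    if (file != NULL)
        file->stream = NULL;
    return file;
}

/* The real acquisition, done once the path is finally known. */
bool autofile_open(autofile_t *file, const char *path, const char *mode)
{
    assert(file != NULL && path != NULL && mode != NULL);

    if (file->stream != NULL)
        return false;               /* already open */
    file->stream = fopen(path, mode);
    return file->stream != NULL;
}

/* Closes the file (if it was opened) and frees the structure. */
bool autofile_free(autofile_t *file)
{
    bool closed = true;

    if (file == NULL)
        return false;
    if (file->stream != NULL)
        closed = (fclose(file->stream) == 0);
    free(file);
    return closed;
}
```

As with the buffer, the caller that creates the structure is also the one that frees it, even though the file itself is opened somewhere deeper in the call chain.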

This is the first of a series of posts I’m planning to write about good programming practices, focused on the C programming language.

Introduction

Assert is a very powerful tool, often underestimated by programmers. Its main purpose is to ensure that some parts of your program are correct.

Assert is usually a macro and has the following signature:

assert(condition_that_must_be_true);

If you made an assumption while coding and you want to make sure that it is right, you can (and probably should) use an Assert. If your assumption was correct, the program will run and you become more confident about its behavior. On the other hand, if your assumption was incorrect, the program will fail and you will know exactly where and why it failed, which gives you the data to fix the problem. As stated in [5]:

“Many bugs originate from making the wrong assumption about what conditions that should be true when writing the code.”

Key note: Asserts are used to check for errors that should NEVER occur, not to check for errors that CAN happen.

As noted by [2], error handling techniques usually check for bad input data, whilst Asserts check for bugs in code.

Good places to use

Here are some checks for which Assert is very useful:

Valid range of values. If you expect that your algorithm’s result falls within a specific range of values, you can use an Assert. However, if this value is from an external source (like user input), you must use an error handling technique.

Specific values. Sometimes we want to ensure that the result of a computation is a specific value, or to implement test cases in which a given input value must result in a given output value.

Design by contract. Design by contract is a technique described in [1] where distinct parts of code agree on specific data values. These agreements are the contracts. The caller’s obligations to the routine are preconditions, and the routine’s obligations to the caller are the postconditions. Using Asserts, we can check preconditions by testing parameter values, whereas we can check postconditions by verifying that the results of the routine are within expected values.

Non-null pointers, when they MUST be non-null. If in some parts of your software you expect a valid pointer, leaving to the caller routine the responsibility to validate that pointer, this is a good opportunity to validate the correctness of your logic. However, it is important to note that Asserts are not the right tool to verify the validity of pointers returned by dynamic memory allocation functions, like malloc. These kinds of pointer checks are more suitable for error handling techniques.
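These checks can be combined in a single routine. The sketch below is a hypothetical example (average_percent is a made-up name) that asserts its preconditions, including a non-null pointer and a valid range, and its postcondition:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical routine: average of percentages computed elsewhere in the
 * program. Its contract is that every sample is within [0, 100]. */
int average_percent(const int *samples, size_t count)
{
    /* Preconditions: violations are bugs in the caller, not bad input. */
    assert(samples != NULL);
    assert(count > 0);

    long sum = 0;
    for (size_t i = 0; i < count; i++) {
        assert(samples[i] >= 0 && samples[i] <= 100);   /* valid range */
        sum += samples[i];
    }

    int result = (int)(sum / (long)count);

    /* Postcondition: what the routine promises to the caller. */
    assert(result >= 0 && result <= 100);
    return result;
}
```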

Bad places to use

As said previously, Asserts are intended to check for errors that should never happen; in other situations, error handling techniques are more suitable. The rule of thumb is to use Assert in your own code, to validate the correctness of your implementation, not the correctness of libraries and other sources.

Here are some places where error handling techniques are more suitable:

Returns from system calls and library functions. Library functions and system calls usually return error codes if they fail, so you must expect and handle their failure properly. Asserting the success of these functions is overly optimistic. Whether these functions succeed or fail is not under your control.

User input data. So, if library functions are likely to provide bad data to our code, what about input from users? They are even more likely to provide bad data, so we must expect and handle it properly, either by giving feedback to the user or by assuming default values.

Network input data. Data from the network can be corrupted in many ways. As with any other external input, we must handle the arrival of bad data.

File input data. No matter how reliable you consider your hardware and file system to be, storage can fail. If you work on embedded systems using flash memory, this kind of error will show up more often.

Out of system resources or memory. When dynamic allocation functions (like malloc()) fail, or any other system resource runs out (e.g. file descriptors, threads, etc.), your code must handle it according to your error handling scheme. Asserting that these operations succeed is incorrect, because they can fail.
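The contrast with the previous section can be sketched as follows. Both routines are hypothetical examples: they handle failures that can legitimately happen (bad user input, lack of memory) with return values, and reserve Assert for bugs in the calling code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* User input: anything can arrive here, so it is validated and rejected
 * with an error return instead of an Assert. */
bool parse_percent(const char *input, int *out)
{
    assert(input != NULL && out != NULL);   /* a NULL here is a caller bug */

    char *end = NULL;
    errno = 0;
    long value = strtol(input, &end, 10);

    if (end == input || *end != '\0' || errno == ERANGE)
        return false;                       /* not a clean number */
    if (value < 0 || value > 100)
        return false;                       /* out of the accepted range */

    *out = (int)value;
    return true;
}

/* Memory: malloc can fail at run time, so its result is checked and the
 * failure is reported to the caller. */
char *duplicate_text(const char *text)
{
    assert(text != NULL);

    size_t length = strlen(text) + 1;
    char *copy = malloc(length);

    if (copy == NULL)
        return NULL;
    memcpy(copy, text, length);
    return copy;
}
```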

Conclusion
Like any other programming technique, Asserts have their advantages and disadvantages. Their conscious use can improve your implementation’s correctness, but the developer must remember that there are places where they are not suitable, and an error handling technique should be used instead.

Many of us know that we should always compile at the highest warning level available, so that the compiler can warn us about possible mistakes. However, I have noticed that many people simply ignore the warnings and say: “don’t worry, this warning is not important”. For me, my code must compile free of all -Wall and -Wextra warnings when developing with GNU compilers.

For example, people usually underestimate the signed and unsigned comparison warning. Here is a code snippet that demonstrates why we cannot ignore this warning.

Since we assigned exactly the same constant to both variables, why are they different? It all comes down to the range those variables have when interpreted. For example, a signed variable with 8 bits ranges from -128 to 127, or [-2^7, +2^7-1], whereas the unsigned version ranges from 0 to 255, or [0, 2^8-1].

The signed version uses the leftmost bit of the most significant byte to track the sign, while the unsigned version doesn’t have to track any sign, thus using all bits to represent the number. That’s why the binary 11111111 means two different things, depending on how it is interpreted.
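A minimal sketch of this effect follows. The helper names are hypothetical, and the sketch assumes 8-bit bytes and a two’s complement representation, where -1 is stored as 11111111.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Value seen when the byte is read as signed. */
int signed_view(signed char value)
{
    return value;                    /* -1 stays -1 */
}

/* Value seen when the same byte is read as unsigned. */
int unsigned_view(signed char value)
{
    assert(CHAR_BIT == 8);           /* assumption of this sketch */
    return (unsigned char)value;     /* -1 becomes 255 */
}

/* In a mixed comparison such as lhs < rhs, the compiler converts the
 * signed operand to unsigned. The cast spells that conversion out. */
bool less_than_after_conversion(int lhs, unsigned int rhs)
{
    return (unsigned int)lhs < rhs;
}
```

Here less_than_after_conversion(-1, 1u) is false, because -1 converted to unsigned int becomes the largest unsigned value. This is exactly the situation the signed and unsigned comparison warning is trying to point out.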