Obscure C

The C language is relatively small compared to other modern computer languages. Completely specifying it, along with its standard library, takes only about 550 pages. Doing the same for Java or C++ would require an entire bookshelf rather than a single book. However, even though the language is small enough to be easily comprehended, it has some dark corners. The purpose of this article is to explore some of them.

Since C is used in a wide variety of applications, the "dialect" in use varies. This means that some readers may already be quite familiar with many of the following items. However, which items count as "non-obscure" should vary from person to person, so hopefully there are at least a few here you don't know about.

1) Pointers are not unsigned integers

Pointers are not unsigned integers. That much is fairly obvious: you can't dereference an unsigned integer, even if it holds an address. However, many people assume that properties of integers carry over to pointers, even when they don't. Unfortunately, this can result in security issues.

A common problem is bounds-checking an access to a buffer. A simple solution might look like:

However, this doesn't work either. The problem is that in C, pointer comparisons and pointer arithmetic are only defined within the same underlying object, so the compiler may assume that pointer arithmetic never overflows. The second check can be optimized away, since it only triggers under undefined behavior. The gcc compiler did exactly this, breaking a security check in the Linux kernel. As an aside, the proper way to perform the check is simply:

if (len >= LEN) errx(1, "Overflow!\n");

This isn't the only pointer arithmetic that can be a little tricky. Another case is:

int x;
int *p = &x;
uintptr_t u = p - NULL;

Code like the above might appear in macros attempting to do address-space manipulation, so it isn't quite as strange as it first appears. The problem is that, again, it is undefined behavior. NULL doesn't point into any valid object, so we are subtracting two pointers into different objects, which is undefined behavior. The compiler is allowed to set u to any value it pleases.

These rules for C pointers mean that no object can be larger than half the address space. (Otherwise, pointer arithmetic within it could overflow the signed ptrdiff_t type.) Another limitation is that no object can end at address (uintptr_t)-1, i.e. 0xffffffff on 32-bit machines. You are always allowed to form a pointer one past the end of a C object, and that isn't possible there due to the overflow. Again, this doesn't affect most programmers. However, in embedded work, where every byte counts, such subtleties matter.

2) Loops in the C pre-processor

The C preprocessor is a simple text-replacing macro implementation. It isn't very complex, and deliberately limits its own scope to avoid extra complexity. Macros are only expanded once per phase, so arbitrary computation via recursion is hindered. However, wouldn't it be nice to have loops controlled by macros? You could programmatically cause varying amounts of text substitution.

The obvious way, exploiting the relationship between recursion and iteration, doesn't work, since recursion is disallowed. However, that doesn't stop us from being tricky. The key idea is that a text file is allowed to #include itself!

A new problem is that a macro definition can't really refer to previous definitions of itself, so we can't update a loop counter in the obvious way. That in turn can be surmounted with binary logic. We can manually #define and #undef macros to implement an adder from raw binary gates. (It's possible to hide some of this ugliness with other macros, but that would obscure what's going on here, so we won't do it.)

3) strxfrm()

The C standard library has quite a few functions in it. Some of them are used more than others. Some are quite obvious from their name, so if you run into them you can derive what they should do. However, others are relatively obscure.

strxfrm() is one of the obscure ones. Its definition isn't particularly helpful. Quoting from the C99 standard section 7.21.4.5:

The strxfrm function transforms the string pointed to by s2 and places the resulting string into the array pointed to by s1. The transformation is such that if the strcmp function is applied to two transformed strings, it returns a value greater than, equal to, or less than zero, corresponding to the result of the strcoll function applied to the same two original strings. No more than n characters are placed into the resulting array pointed to by s1, including the terminating null character. If n is zero, s1 is permitted to be a null pointer. If copying takes place between objects that overlap, the behavior is undefined.

Okay... so it transforms a string in some poorly-described way, but then refers to the equally obscure function strcoll(). That second function is related to locales. Different cultures have different ideas about ordering strings. Their alphabets vary, and new characters can be placed in different orders with respect to the 26 ASCII letters. The strcoll() function lets you compare two strings in a locale-sensitive way, so that they end up in the right order when sorted.

However, that doesn't really explain the need (and use) for the strxfrm() function. The key is efficiency. The strcoll() function must convert both arguments every time it compares them, and that conversion may be slow. If you are sorting a large table of internationalized strings, the speed impact can be important: transforming each string once up front with strxfrm() costs O(n) conversions, while calling strcoll() inside the sort costs O(n log n).

If you never do internationalization work, you may never run into this function. Even if you do, only if you care about performance in specific types of algorithms will you need to know about this function. It's there if you need it. The rest of us can safely ignore it, relegating it to a dark corner of the language.

4) Integer Division

C is a low level language, and details of the underlying hardware matter. This means that C code can be very fast, but it does expose C programs to some dark corners. One such issue is exposed by integer division.

The effect of division is described in section 6.5.5, paragraphs 5 and 6, of the C99 standard:

The result of the / operator is the quotient from the division of the first operand by the second; the result of the % operator is the remainder. In both operations, if the value of the second operand is zero, the behavior is undefined.

When integers are divided, the result of the / operator is the algebraic quotient with any fractional part discarded. (This is often called "truncation towards zero".) If the quotient a/b is representable, the expression (a/b)*b + a%b shall equal a.

So the above states that division by zero is undefined behavior. However, there is one other case not explicitly mentioned. The problem is that C supports three different representations of signed integers: sign-magnitude, ones' complement, and two's complement. The first two have no other undefined cases for division. However, the third does. Guess which one your hardware probably uses?

Two's complement division has a hidden undefined case: INT_MIN / -1 causes integer overflow. The mathematical result is INT_MAX + 1, which isn't representable in the integer type. You might expect it to wrap around to INT_MIN again... but since signed overflow is undefined, anything can happen. On some machines you'll get a hardware exception.

Thus, in addition to checking for division by zero, security-conscious code also needs to test for the INT_MIN / -1 case.

5) Bit Shifting

Bit shifts on unsigned integers work as you might expect. Shifting left is like multiplication, and right like division. Signed integers are a different story.

Shifting leftwards works until you overflow. Once overflow occurs, we are again in the world of undefined behavior. The compiler can do anything, including ignoring the possibility. Thus you cannot left shift into the sign bit! (Checking the sign bit is a really fast comparison with zero, or overflow flag test on some hardware.) If you want to use the special properties of manipulating the sign bit with shifts, you first need to cast to unsigned to do the work, and then cast back when done. Unsigned overflow is well-defined, avoiding all the problems.

Right-shifting signed integers is also slightly tricky. It isn't division by powers of two, as in the unsigned case. Non-negative integers are divided as you might expect, but negative ones exercise implementation-defined behavior. This isn't quite as bad as undefined behavior: the result is defined by your compiler vendor. However, different compilers can choose different things, which makes it almost as annoying.

Fortunately, most compiler vendors choose sane defaults for this, with a right shift of a signed integer acting as an "arithmetic shift". The sign bit will stay constant, yet propagate its state rightwards. This is quite useful, allowing simplified masking operations.

Since the common case for the right shifts is to behave in a certain way, many people assume that that behavior is portable. Unfortunately, it isn't.

6) K&R Declarations

C predates its standardization by the ANSI organization, so some really old C code doesn't quite look like its more modern descendants. One thing in particular that has changed is the way functions are declared. Now, the types of the arguments are specified within the parentheses. In pre-ANSI code, they came after them.

You'll typically only run into this when maintaining really old code. (Which might have macros allowing both forms of declarations to co-exist.) However, there is one case in modern C code where you might want to use this archaic style. The problem occurs when dealing with variable length arrays passed as parameters.

C is a lingua franca of programming languages, and has to talk to many others. One common interface is with FORTRAN, where it is quite common to pass arrays as parameters to functions. The problem is that there are two ways to do it. The first works nicely:

int varray(int a, int b[a])
{
return b[0] + b[a - 1];
}

In the above, the length of the array is passed first, and the array second. Thus we can use the length in the definition of the array. What happens when the order is the other way around?

int varray(int a[b], int b)
{
return a[0] + a[b - 1];
}

Unfortunately, this doesn't compile. We can't use b before it is defined in the list of function parameters. This is where the old K&R syntax comes to the rescue:

int varray(a, b)
int b;
int a[b];
{
return a[0] + a[b - 1];
}

Notice how we can reorder the type declarations now so that a is specified after b.

The above is nice, but how do we pre-declare such a function, so that others can call it? This is another obscure part of C99:

int varray(int a[*], int b);

The square brackets with the asterisk inside represent a variable length array. You can only use this syntax in pre-declarations, which is the only place it is needed. Even then, it is only required in the special case where the array is passed before its length.

7) typedef

The typedef keyword is a little tricky. It is actually much more flexible than many coding styles admit. You can put it pretty much anywhere in a type definition, since it acts like a "storage class", like static or extern.

The first three of these are legal, and do as you might expect. The last isn't allowed though.

A more complex case might look like:

struct test {
int x;
} const typedef volatile bar;

Here we declare a struct tag "test", take the const volatile version of that type, and call it "bar". Reordering the three keywords const, typedef and volatile won't change the meaning. Neither will moving any (or all) of those keywords before the "struct". C syntax is very flexible.

Unlike C++, struct tags live in a separate "namespace" from type names. Thus the three x's refer to different things. The first is a struct tag. The second is a field of that struct. The final x is the name of a variable. Also notice how we are allowed to make the struct member volatile. We can even add an arbitrary number of "const" or "volatile" keywords to the variable declaration. (Not that you'd really want to; beyond the first, they don't change anything.)

8) Goto labels and case statements

Goto is very powerful in C. It is the only construct that has full function scope. Everything else depends on bracing for scoping. Thus you can use goto to jump into places you might not expect. You can jump in between an else and its corresponding braced block. You can similarly jump right before the braced block attached to a while, for or do loop. These tricks can be used to make powerful iteration macros.

The switch statement isn't quite as flexible, but it is close. Case labels are only restricted to be somewhere within the braces of the switch, which isn't much of a restriction. Just like goto labels, they ignore scoping, and can be placed in "strange" places. The relatively well-known "Duff's Device" pattern depends on this.

Note how the "break" keyword isn't so accommodating. It always binds to the innermost enclosing construct; in Duff's Device, that is the while rather than the switch. It isn't too difficult to imagine cases where this might get confusing. (Pun intended.)

9) The conditional operator

The conditional operator allows you to convert an if statement into an expression. This can simplify some code. However, what happens when the two alternatives are differently typed?

As you might expect, incommensurable types are not allowed. Arithmetic types get converted to the "larger" type, as with other binary operations. The subtle cases occur when you have pointers to differently qualified types.

The result is typed with all the qualifiers of the two cases. Even if one case is "impossible":

void conditional(const int *x)
{
int y;
*(1 ? &y : x) = 1;
}

The above doesn't compile. Even though we are technically only accessing y, the type of the conditional expression is a pointer to a constant integer; the constness leaks in from the declaration of x. This can affect some macros which might otherwise be nicely optimized away. You can of course make the types equal by using casts. The comma operator is also sometimes useful in this situation.

10) Array Magic

What does the following do?

int a[10];
a[0] = 0;
a[0][a] = 1;

The first line is obvious. It declares an array that holds ten integers. The second line is also clear: it sets the first such integer to zero. The tricky bit is the last line. It uses the fact that x[y] is the same as *(x + y), which equals *(y + x), and thus y[x] in C. What it is "really" doing is:

a[a[0]] = 1;

Which is quite a bit clearer. However, that isn't the point of this entry. What does that line do?

Well... obviously, since a[0] is zero, it sets a[a[0]], which is a[0], to be equal to 1. Right?

Nope. The above is technically undefined behavior in C99. (C11 fixes it to do what you might expect.) Why is it undefined? The tricky bit is that we are modifying an lvalue at the same time as we are using it. The expression is similar to the more obviously wrong:

i = i++;

where we are adding one to i at the same time as using it as the destination of the assignment.

The problem is that in the expression a[a[0]] = 1, there don't seem to be two conflicting accesses happening. So what is going on? The issue is that in C, the [] array access operation isn't a sequence point. Even though you might expect some ordering between calculating which array member to use and the use of that array entry, C99 imposes no such constraint.

In short, we evaluate a[0], and then use that evaluation to work out which a[] location to modify. However, that evaluation can occur _after_ we have done the modification. (Which seems to make no sense.) However, there are cases where this weird atemporal strangeness matters:

Imagine some hardware with memory where reads are destructive. In such a device all reads must be later written back to memory to avoid problems. What happens with the above construct? Well... we read a[0]. We then write 1 to that location. We then re-write the old value into a[0] because the C compiler "knows" that no concurrent modification has occurred. The result is that a[0] stays as zero. Oops.

Another case is when your compiler has a very good optimizer. It can notice that a[0] cannot be modified by that line, and thus cache its value. If a later line assigns a[0] to some other variable, then the compiler can optimize that assignment to a set-to-zero. Undefined behavior is a strange beast, and with optimizations its effects can be wide-reaching.

Summary

C has some dark corners. Most of them aren't really relevant to most programmers. However, if you deal with security you should know about the integer overflow issues. Low-level programmers should know about pointer arithmetic and its interaction with address spaces. Of course, if you are interested there is always more to learn. How many of these did you know about?

Comments

Resuna said...

Passing arrays (and other non-atomic objects such as structs) to functions was an evil and wicked extension to the language. Trying to make inter-language calls portable is a vain quest, since the semantics of how languages other than C pass parameters is by definition undefined in the C language itself. On small address space systems, it was common for Fortran to stuff parameters in static locations (including inline with the function) rather than a stack.

And dealing with objects larger than half the address space used to be reasonably common, back when there was barely room for a decent sized edit buffer even in a split-I&D executable on a PDP-11. You just have to be careful.

And speaking of evil and wicked, what ANSI C did to typecasts of signed values was scary. It used to be saner in most implementations.

jrl said...

In the first code snippet len is not initialized - wouldn't you want to initialize it to LEN? In 4th p is also not initialized. Excuse me if I am too slow to catch up with the examples and proposed solutions, but the problem definition is not valid for me, making it hard to understand the write-up. Btw other than that - nice article!

sfuerst said...

jrl: Insert the obvious initialization of "len" to the length you want to use between its definition and use. There should perhaps be an ellipsis denoting the missing block of code there. Note that len can be any size (not just equal to LEN), which is the whole point of checking its value.

Dave said...

I think it would be helpful to the reader to point out that in C89 (and consequently C++98) the result of integer division is implementation-defined when either of the operands is negative.

Truncation towards zero has only become required behavior in C99/C++11.

jrl said...

@sfuerst

Wouldn't the following be more readable?

---
void foo(unsigned int len) {

char buf[LEN];
char *buf_end = buf + LEN;

if (buf + len >= buf_end) errx(1, "Buffer Overflow!\n");

// ...

}

Unfortunately, function foo might not work correctly. If the len argument is too big, then the addition can overflow and the inequality will not hold. The obvious fix is something like:
---

I've observed that making code snippets more 'real-life' makes the reader more happy and article more interesting. E.g. rewriting the above into sth like the following:

---
char lookup_character(unsigned int idx) {

char buf[LEN] = { /* ... */ };
char *buf_end = buf + LEN;

if (buf + idx >= buf_end) errx(1, "Buffer Overflow!\n");

return buf[idx];
}

Unfortunately, lookup_character may fail if the idx (...)
--

Mind that I'm not trying to be picky, just giving you feedback so you can write more articles that I can read with more joy on my face :-) Peace.