EWONTFIX

NULL considered harmful

04 Jul 2013 03:25:02 GMT

The C and C++ languages define the macro NULL, widely taught as the
correct way to write a literal null pointer. The motivations for using
NULL are well-meaning; largely they come down to fact that it
documents the intent, much like how a macro named FALSE might better
document boolean intent than a literal 0 would do. Unfortunately,
use of the NULL macro without fully understanding it can lead to
subtle bugs and portability issues, some of which are difficult for
compilers and static analysis tools to diagnose.

Despite it being superceded by the 2011 standard, I'm going to quote
C99 because it's what I'm most familar with, and I suspect most
readers are in the same situation. 7.17 specifies the NULL macro as:

NULL

which expands to an implementation-defined null pointer constant;

6.3.2.3 Pointers in turn defines null pointer constant as:

An integer constant expression with the value 0, or such an
expression cast to type void *

Constant expressions shall not contain assignment, increment,
decrement, function-call, or comma operators, except when they are
contained within a subexpression that is not evaluated.

An integer constant expression shall have integer type and shall
only have operands that are integer constants, enumeration
constants, character constants, sizeof expressions whose results are
integer constants, and floating constants that are the immediate
operands of casts. Cast operators in an integer constant expression
shall only convert arithmetic types to integer types, except as part
of an operand to the sizeof operator.

What's important to realize is that this allows a great deal of
freedom with regard to how an implementation defines NULL. Some
examples:

In particular, the set of possible types NULL could have includes
all integer types and void *. Fortunately, since these are all null
pointer constants, in most contexts using NULL will get you The
Right Thing no matter which type NULL might have. Pointer
comparisons, assignments to pointers, the ?: operator, and so forth
all treat null pointer constants in special ways such that the type of
the null pointer constant mostly doesn't matter.

The main place the type of NULL does become an issue is with
variadic (or non-prototyped) functions. Consider these examples:

The first is wrong, in general, because the %p specifier for
printf requires an argument of type void *. The second example is
always wrong because the execl function requires arguments of type
char *, and NULL can never have type char *.

One might object that, despite being formally incorrect, the above
should work, especially if NULL is defined with the typical C
definition of ((void *)0). Assuming also that pointers of different
types are treated identically by the calling convention, having type
void * instead of char * would not matter when calling execl.
(Arguably it may not matter anyway, since C specified that char *
and void * have the same representation, but it's not entirely clear
how this applies to arguments to variadic functions.)

Things get a lot uglier with C++. In C++, NULL was traditionally
defined as plain 0, on account of the fact that C++ has no implicit
conversion from void * to other pointer types. (Note that the
current C++ standard also allows NULL to be defined as nullptr,
and some non-conforming implmentations define it in other ways, such
as __null.) So in many historical C++ implementations, if you pass
NULL to a variadic function, bad things might happen.

And bad things did happen. We were encountering several
hard-to-track-down crash bugs in certain applications (including
evince and audacity) built against musl,
which at the time defined NULL as 0 for C++ programs, and it
turned out the source of the problem was that the C++ code was making
calls to variadic C functions using NULL as a sentinel. The
original report and
further details are available in the mailing list archives for musl.

The crashes that were happening were on 64-bit archs (at the time, for
us that meant just x86_64) and were, if I recall, "heisenbugs":
inserting additional code to inspect what was going on made some or
all of them go away. As it turns out, while x86_64 (and most 64-bit
archs) uses entire registers or 64-bit stack slots for all arguments,
even ones smaller than 64-bit, there's no guarantee in the ABI that
the unused upper bits will be cleared by the caller. So, when the
caller passed 0 as an int to a variadic function, the upper 32
bits of the stack slot contained whatever junk happened to be there
before, and the callee's va_arg, attempting to read an argument of
type void *, pulled the junk in as the upper bits of the pointer.
Thus, manifestation of the bug was highly dependent both on the code
executed just prior to the variadic call, and the compiler's choice of
how to setup the arguments (e.g. push versus mov).

As far as what we would do about the issue, I've been on the fence for
a long time about the relative merits of trying to work around
incorrect, non-portable applications versus actively breaking them and
pressuring their maintainers to fix them. However, the widespread
incorrect use of NULL seemed overwhelming, and developers seemed
very reluctant to fix the issue. Fortunately, we came up with what I
believe is a really good solution on musl's side: replacing this:

(Note that musl assumes a type model where long and pointers have
the same size; this is also assumed by Linux and is fundamental to the
way system calls work, so it's not an unreasonable assumption for us.)

By making this change, all programs that assume they can pass NULL
as a pointer argument to variadic functions (sentinel role or
otherwise) continue to work as if NULL were defined as ((void
*)0). However, GCC does not consider 0L a valid "sentinel" value
(since it has the wrong type) for functions declared with the
sentinel
attribute.
So what we have achieved is making incorrect programs work, but
produce warnings where the code is incorrect.

One initial concern I had with this approach is that the matter of
whether NULL has pointer or integer type might be observable to
programs in such a way that the change from ((void *)0) to 0L
could break some incorrect C programs. However, after some of the
smart folks over at Stack Overflow tried
to find ways a program might detect whether NULL is defined with
integer or pointer
type
and failed, I deemed the change safe, and our developers and users
have not run into any problems since.

It's been some time now (about 6 months) since the issue was raised
and addressed in musl, but I just recently reported the sentinel
issue
in Busybox and submitted a patch to fix it.
While getting the patch accepted was painless, I mention this anecdote
as a possible explanation for why the NULL sentinel bug is so
prevalent: after the patch was accepted, I received several follow-up
emails
chipping in with misconceptions about NULL. The impression I was
left with is that everybody seems to think NULL is a much simpler
topic than it really is.

So what would I recommend programmers do with NULL? After all this
trouble with variadic function arguments, I've just basically stopped
using NULL completely. A simple 0 is shorter and usually clear when
initializing or returning a pointer, and for null pointer checks, I
usually use the ! operator rather than ==NULL anyway. However, I
really don't have any opinion on what others should do; from a
correctness standpoint, what matters is simply not making
assumptions about the type of NULL. Anywhere the type of the
expression matters, you really need to use a cast, which you could
apply to NULL, or simply to 0.

As a final remark, I don't think this article would be complete
without mentioning a similar
article,
albeit much shorter and less detailed, written by Jens Gustedt over
two years ago.