Date: Wed, 09 Jan 2013 12:02:29 +0100
From: John Spencer <maillist-musl@...fooze.de>
To: musl@...ts.openwall.com
CC: Luca Barbato <lu_zero@...too.org>, basile@...nsource.dyc.edu
Subject: NULL
glibc defines NULL as __null: a magic variable supplied by GCC and
compatibles which always has pointer context.
musl defines NULL to 0 in C++.
this is correct per the standard, but breaks a lot of software on 64bit
archs, because it promotes to int.
for example there are 140 variadic functions in glib/gtk, most of them
depend on a so-called sentinel value (nullptr end marker).
examples:
http://developer.gnome.org/gobject/2.35/gobject-The-Base-Object-Type.html#g-object-newhttp://developer.gnome.org/gobject/2.35/gobject-The-Base-Object-Type.html#g-object-get
any C++ program that uses those variadic functions and uses NULL as the
sentinel, will invoke UB and consequently segfault on musl, with a very
hard-to-debug backtrace (you cannot inspect variadic arguments in gdb
(unless you know exactly what kind of asm gcc creates and how to read
it), but exactly those arguments will end up being bogus).
the first such bug i encountered (in audacity) costed me several hours
of debugging distributed over 2 days.
note that most example code involving these variadic functions is
written in C, and happily uses NULL, as the problems surrounding NULL
are not exactly well-known.
http://gustedt.wordpress.com/2010/11/07/dont-use-null/
so the C++ guys basically just copypaste existing C code snippets (which
are theoretically broken, but work in any real world implementation),
and since it works in their environment they assume the code is correct.
after having fixed audacity, i tackled the segfaults encountered in evince.
here not even evince itself is broken, but also poppler, on which it
depends.
i've fixed the bugs in evince since, but still need to tackle the UB in
poppler.
frankly speaking i'm not very willing to go into the trouble of
debugging any C++ gtk application, and reporting the bugs i found.
when i discussed this with evince developers, they were asking all the
time "why does it work for me ? why didn't it break for anyone ?"
i told them that in C++ NULL expands to 0 and it worked because of luck.
after discussing with them for about half an hour, i got them to the
point where they agreed to merge a patch if i come up with one.
had i told them that i use another libc which does things differently
regarding NULL, i am not sure if i could have convinced them to change
their code.
at this point the issue in evince is not fixed, i still need to create a
patch based on their current git master, subscribe to their ML, send
them the patch and hope that they haven't changed their mind.
so for me, there are 3 options how to deal with issue in the future:
1) go into the trouble of debugging all future C++ apps i will port,
potentially spending hours to fix each of them, and then even more hours
discussing this topic on their mailing lists. since sabotage will very
likely only support a small subset of these programs, there will remain
a lot of others that stay broken.
the gentoo guys or whoever is going to port them will have a *very* hard
time figuring out why it crashes for them on musl.
this approach is in my eyes insane, as it will need a huge investment of
time and nerves, and the number of broken C++ apps is theoretically
infinite.
(as a side note: when i googled for g_signal_emit, i found multiple bug
reports to multiple projects, of which the backtraces showed exactly the
kind of varargs UB that i encountered.
all of them were not fixed, but merely worked around. apparently the
developers of these projects didnt find the real cause of their bugs.
this is not unlikely, as it is very hard to find out whats going on.
https://bugs.launchpad.net/compiz/+bug/932382https://trac.transmissionbt.com/ticket/1135 ... )
2) change musl so it is compatible with those apps. this would mean:
#if defined(__GNUC__) && defined(__cplusplus__)
#define NULL __null
#elif defined (__cplusplus__)
#define NULL 0
#else
#define NULL (void *) 0 /* for C code */
#end
this change is the easiest solution: any problem will be magically fixed.
3) create some kind of code analysis tool that will scan C++ code for
usage of NULL that emits warnings for each occurence of NULL used in a
variadic function call. the results could then be used to created
automated patches and used in distros based on musl and sent to the
corresponding package maintainers.
i suspect it is possible to convince the authors of the "cppcheck"
program to add such a check, as i alreaded contacted them in the past
and they seemed very cooperative.
this could be tackled as a side-project even if we decide for option 2.
i'm welcoming any comments on the issue. i'm especially interested what
the gentoo developers think.