The C++ Sucks Series: the quest for the entry point

Suppose you run on the x86 and you don't like its default FPU settings. For example, you want your programs to dump core when they divide by zero or compute a NaN, having noticed that on average, these events aren't artifacts of clever numerical algorithm design, but rather indications that somebody has been using uninitialized memory. It's not necessarily a good idea for production code, but for debugging, you can tweak the x86 FPU thusly:
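A minimal sketch of such a function for Linux/gcc, assuming glibc's `<fpu_control.h>` and its `_FPU_GETCW`/`_FPU_SETCW` macros. It touches only the x87 control word; on x86-64, where compilers usually do floating point with SSE, the MXCSR register would need the same treatment (glibc's `feenableexcept` is one way to get that):

```c
/* Linux/gcc on x86 only: relies on glibc's <fpu_control.h>. */
#include <fpu_control.h>

/* Unmask the "divide by zero" and "invalid operation" x87 exceptions,
   so that such an operation raises SIGFPE instead of quietly
   producing an infinity or a NaN. */
void fpu_setup(void)
{
    fpu_control_t cw;
    _FPU_GETCW(cw);          /* read the current x87 control word */
    cw &= ~_FPU_MASK_ZM;     /* clear the mask bit: trap on division by zero */
    cw &= ~_FPU_MASK_IM;     /* clear the mask bit: trap on invalid operation */
    _FPU_SETCW(cw);          /* write the modified control word back */
}
```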

So you call this function somewhere during your program's initialization sequence, and sure enough, computations producing NaN after the call to fpu_setup result in core dumps. Then one day someone computes a NaN before the call to fpu_setup, and you get a core dump the first time you try to use the FPU after that point. Because that's how x86 maintains its "illegal operation" flags and that's how it uses them to signal exceptions.

The call stack you got is pretty worthless as you're after the context that computed the NaN, not the context that got the exception because it happened to be the first one to use the FPU after the call to fpu_setup. So you move the call to fpu_setup to the beginning of main(), but help it does not. That's because the offending computation happens before main, somewhere in the global object construction sequence. The order in which global objects from different translation units are constructed is left unspecified by the C++ standard. So if you'll kindly excuse my phrasing – where should we shove the call to fpu_setup?

If you have enough confidence in your understanding of the things going on (as opposed to entering hair-pulling mode), what you start looking for is the REAL entry point. C++ is free to suck and execute parts of your program in "undefined" (random) order, but a computer still executes instructions in a defined order, and whatever that order is, some instructions ought to come first. Since main() isn't the real entry point in the sense that stuff happens before main, there ought to be another function which does come first.

One thing that could work is to add a global object to each C++ translation unit, and have its constructor call fpu_setup(); one of those calls ought to come before the offending global constructor – assuming that global objects defined in the same translation unit will be constructed one after another (AFAIK in practice they will, although in theory the implementation could, for example, order the constructor calls by the object name, so they wouldn't). However, this can get gnarly for systems with non-trivial build process and/or decomposition into shared libraries. Another problem is that compilers will "optimize away" (throw away together with the side effects, actually) calls to constructors of global objects which aren't "used" (mentioned by name). You can work around that by generating code "using" all the dummy objects from all the translation units and calling that "using" code from, say, main. Good luck with that.

The way I find much easier is to not try to solve this "portably" by working against the semantics prescribed by the C++ standard, but instead rely on the actual implementation, which usually has a defined entry point, and a bunch of functions known to be called by the entry point before main. For example, the GNU libc uses a function called __libc_start_main, which is eventually called by the code at _start (the "true" entry point containing the first executed instruction, AFAIK; I suck at GNU/Linux and only know what was enough to get by until now.) In general, running `objdump -T <program> | grep start` (which looks for symbols from shared libraries – "nm <program>" will miss those) is likely to turn up some interesting function. In these situations, some people prefer to find out from the documentation, others prefer to crawl under a table and die of depression; the grepping individuals of my sort are somewhere in between.

Now, instead of building (correctly configure-ing and make-ing) our own version of libc with __libc_start_main calling the dreaded fpu_setup, we can use $LD_PRELOAD – an env var telling the loader to load our library first. If we trick the loader into loading a shared library containing the symbol __libc_start_main, it will override libc's function with the same name. (I'm not very good at dynamic loading, but the sad fact is that it's totally broken, under both Windows and Unix, in the simple sense that where a static linker would give you a function redefinition error, the dynamic loader will pick a random function of the two sharing a name, or it will call one of them from some contexts and the other one from other contexts, etc. But if you ever played with dynamic loading, you already know that, so enough with that.)

Here's a __libc_start_main function calling fpu_setup and then the actual libc's __libc_start_main:
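A minimal sketch of what such an override might look like, not the original listing. It assumes glibc's traditional seven-argument `__libc_start_main` signature, and it looks up the real function with `dlsym(RTLD_NEXT, ...)` rather than with `dlopen` on a hard-coded libc path. The typedef names are made up for the example:

```c
/* fplib.c: sketch of an LD_PRELOAD-able override (glibc, x86 Linux). */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* needed for RTLD_NEXT */
#endif
#include <dlfcn.h>
#include <fpu_control.h>
#include <stdlib.h>

/* The FPU tweak described earlier: trap on division by zero and on
   invalid operations (x87 control word only). */
static void fpu_setup(void)
{
    fpu_control_t cw;
    _FPU_GETCW(cw);
    cw &= ~(_FPU_MASK_ZM | _FPU_MASK_IM);
    _FPU_SETCW(cw);
}

/* Hypothetical helper types spelling out the argument list. */
typedef int (*main_fn)(int, char **, char **);
typedef int (*start_main_fn)(main_fn, int, char **,
                             void (*)(void), void (*)(void),
                             void (*)(void), void *);

/* Called by the code at _start instead of libc's function of the same name. */
int __libc_start_main(main_fn main_func, int argc, char **ubp_av,
                      void (*init)(void), void (*fini)(void),
                      void (*rtld_fini)(void), void *stack_end)
{
    fpu_setup();  /* runs before any constructor of the program itself */

    /* Find the next definition in load order, i.e. the real libc one. */
    start_main_fn real =
        (start_main_fn)dlsym(RTLD_NEXT, "__libc_start_main");
    if (!real)
        abort();

    /* Propagate all arguments unchanged. */
    return real(main_func, argc, ubp_av, init, fini, rtld_fini, stack_end);
}
```

Older glibc versions keep `dlsym` in a separate libdl, so linking the library there may also need `-ldl`.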

Pretty, isn't it? Most of the characters are spent on spelling the arguments of this monstrosity – not really interesting since we simply propagate whatever args turned up by grepping/googling for "__libc_start_main" to the "real" libc's __libc_start_main. dlopen and dlsym give us access to that real __libc_start_main, and /lib/libc.so.6 is where my Linux box keeps its libc (I found out using `ldd <program> | grep libc`).

If you save this to a fplib.c file, you can use it thusly:

gcc -shared -fPIC -o fplib.so fplib.c -ldl
env LD_PRELOAD=./fplib.so <program>

And now your program should finally dump core at the point in the global construction sequence where NaN is computed.

This approach has the nice side-effect of enabling you to "instrument" unsuspecting programs without recompiling them s.t. they run with a reconfigured FPU (to have them crash if they compute NaNs, unless of course they explicitly configure the FPU themselves instead of relying on what they get from the system.) But there are niftier applications of dynamic preloading, such as valgrind on Linux and .NET on Windows (BTW, I don't know how to trick Windows into preloading, just that you can.) What I wanted to illustrate wasn't how great preloading is, but the extent to which C++, the language forcing you to sink that low just to execute something at the beginning of your program, SUCKS.

Barf.

Corrections - thanks to the respective commenters for these:

1. Section 3.6.2/1 of the ISO C++ standard states that “dynamically initialized [objects] shall be initialized in the order in which their definition appears in the translation unit”. So at least you have that out of your way if you want to deal with the problem at the source code level.

2. Instead of hard-coding the path to libc.so, you can pass RTLD_NEXT to dlsym.

32 comments ↓

Good stuff. I'd be interested to know how to do this on Windows; I'm going to research that at some point, and maybe I'll post about it on my blog if and when I succeed.

I'm currently doing battle with a static library that uses global objects with non-trivial constructors that allocate memory, which doesn't interact too nicely with our memory manager. We're currently using lazy initialization of the memory manager, but shutting it down and checking for leaks on exit is problematic – not even using atexit() works, since sometimes some global objects manage to register their destructors to run after the memory manager shuts down and *BOOM*.

The man himself wrote a whole section about the problem of global object initialization order in "The Design and Evolution of C++" but doesn't really offer a solution.

I realize you have to continue serving your function as a high level C++ critic but your post really doesn't have much to do with C++, though, does it? What you want is better control over the linker/loader which is really an OS thing. I'm not even sure if you can cover all cases because some other process might have already loaded some of the dynamic libraries.

You might want to look into the GNU linker's --wrap function. It looks like it does exactly what you want, though I don't know that it will work in a low level function like you are looking for. If it *does* work then you at least can launch without the LD_PRELOAD env var stuff.

"info ld invocation options" to bring up the info page then search for "wrap".

Dynamic loading is not broken. It has well defined semantics. Read up on dlopen() and dlsym() in the Single Unix Specification v2 or v3. For example, there is no need to dlopen() libc explicitly in your code, you should just use dlsym() with RTLD_NEXT as handle.

@queisser: "I realize you have to continue serving your function as a high level C++ critic but your post really doesn’t have much to do with C++, though, does it? What you want is better control over the linker/loader which is really an OS thing."

No, C++'s ability to execute static constructors before main() really does suck, and IMO really does exacerbate the problem of finding the "real entry point" of a non-trivial program. It's true that even C programs execute a lot of code before main() — code that initializes the heap, sets up stdin and stdout, and whatnot — but that code is generally written by trusted sources(TM) and can basically be ignored when debugging. C++ allows Joe Random Programmer to insert code before main(), which is far, far worse.

Now, I'd argue that the answer is Don't Do That. If yosefk had had the foresight and (probably more critically) authority to enforce a project-wide coding standard that forbids the definition of any static object with a constructor or destructor, then he wouldn't have had to hack around the problem this way. (Hindsight is 20/20, yeah.)

Regarding global constructors + custom memory manager: yeah, been there, too. Since it was on an embedded target, I ended up adding the memory manager initialization as another hack to the already hacked libc startup code.

Regarding RTLD_NEXT – thanks for the tip. Regarding "dynamic loading not being broken" – "defined behavior" isn't the opposite of "broken behavior". When I say "broken", I mean (1) that it's not "The Right Thing" (and there would be some hubris here if we weren't discussing something as trivial as detecting redefinition, where The Right Thing is damn easy to define), and (2) the fact that actual compilers out there generating shared objects produce output compatible with the spec of shared objects doesn't make that output compatible with the spec of their source language (try throwing a C++ exception from one .so file and catch it in a caller function located in another .so file and you'll get the idea.)

Regarding this not being a C++ issue – as mentioned above, it is. Do you have a problem doing something "at the [real] beginning" of your C or Lisp or Python program?

Regarding my presumed duty to criticize C++ – um, how do I put this. I get to use the fucking shit a lot. When I no longer do, people will have to get C++ hate elsewhere.

> assuming that global objects defined in the same
> translation unit will be constructed one after another
> (AFAIK in practice they will, although in theory the
> implementation could, for example, order the
> constructor calls by the object name, so they
> wouldn’t)

Section 3.6.2/1 of the ISO C++ standard states, that "dynamically initialized [objects] shall be initialized in the order in which their definition appears in the translation unit"

I don't know about Linux, but dyld on Mac OS X lets you declare a function with the "constructor" attribute, i.e.

void do_something(void) __attribute__((constructor))

It'll get called by the dynamic linker before entering main(). But I'm not sure how it gets called in relation to C++'s static constructors. They get called after your program's image has been loaded, since you can initialize globals in said constructors.

Would it have been possible for you to do the FPU tweaks and then re-exec(2)? If the FPU control word is set on a per-address space basis, that should do the trick.

@Damien: interesting, I didn't think about either. I'd guess __attribute__((constructor)) simply adds the address of the function to the .init section, so it gets called at some undefined point during the pre-main initialization sequence. Regarding exec(2) – I don't know whether the FPU mask is supposed to survive that, but it's a pretty violent measure – for example, if someone prints before main(), the text will be printed twice, etc. That is, it's basically OK to do this only if you assume that the program's init sequence is "tame" enough – in which case you wouldn't have the trouble of taming it in the first place, or something.

Why not add a floating point operation to the end of fpu_setup, like x=1.0+2.0; (with appropriate un-optimization settings) Then at least if you have a pending exception you will get your core dump when fpu_setup is called, not at some unknown later time when another function tries to do f.p.?

You'd still have to debug your constructors without benefit of core dumps on NaNs, but seems like fair trade for not having to trick the loader into doing something it doesn't want to do, and how many constructors need to do f.p., anyway?

@Matt: It's a good idea to amend fpu_setup with an fp operation to save the head scratching when the program fails upon its first attempt to use fp elsewhere. However, I vigorously reject the claim that debugging global constructors is reasonably easy without core dumps at the point of failure :) Seriously, a 1/5M-1M LOC program uses what, 500-2000 translation units? Each can instantiate globals, which can have constructors calling constructors ad nauseam, and this shit can depend on getenv or files or the command line (accessible via stuff like the /proc/ file system.) How am I going to shovel through all that, and where do I even start?

Of course I don't recommend to use the LD_PRELOAD shite in production environments, only for debugging.

Instead of doing it dynamically, you can do it statically. Compile your fpu_setup in a separate .o adding

asm(".section .init\ncall fpu_setup");

to it, then make sure you pass this as the first thing to the linker. Nothing is guaranteed by the language, of course, but the C run-time it's built on is pretty reliable.

Dynamic linking semantic of ELF is horrible indeed, Windows is much better though. It does not silently pick up the first definition: each dll is in its own namespace and you explicitly specify from which dll you want your symbol.

And btw, inserting static constructor into each .cpp file (that includes some header file) is the classic trick used to initialize iostream library "before anything else", as you're allowed to use it from any static constructor.
This relies on the standard initialization order in one translation unit and absence of "optimizations". C++ compilers are not allowed to optimize away any constructors, static or not, as hell knows what it can do inside. The only thing it can do is "inline and dissolve".

I've had the pleasure to witness the iostream initialization trick under the unfortunate circumstances of running on butt-slow targets (RTL simulators.) Fun.

However, I distinctly remember gcc "optimizing" away global constructors – which meant I had to give up on "automatic registration" (where you have a global object that adds itself to some map before main to register a library with a framework – you know the drill.) That only worked if the global was defined in a .cpp whose .o was passed directly to the linker; archiving the .o into a .a caused it to be optimized away. (Now that I think of it, perhaps "touching" the global in the library code would help.) This seemed completely broken exactly because static constructors can have side effects, which in this case they actually did, so it seemed like a broken build, but I double-checked and found no way around this.

If it works in .o, but not in .a it obviously has nothing to do with gcc optimizing it. gcc has created the code, you can see it in .o, so it's done its job. It does not know or care what you're doing with .o.

This is a standard static linker behaviour. Were you putting static constructor in its own .o inside .a? This won't work, because linker won't pick .o from .a, unless something it already picked needs some symbol, defined in that .o. This has nothing to do with what constructor does or whether it's "used" at a language level (e.g. you can use it from the same .o, it won't help).
Nothing was ever updated for C++ in this scheme, and in fact it's not even clear how it should work, because .a has semantics of a bunch of independent .o files that can be picked at will, not an all-or-nothing module. As is typical for C++, no one cares, you're supposed to figure this out yourself.

iostream trick works because it only needs to initialize itself if something uses it. As it inserted itself into all .os, if any of them gets picked the code will run. If not, this means no one's going to use iostream in this program. If you have a header file that's included into every .cpp (that uses floating point operations) in your program, and you're willing to recompile the whole thing, you could add fpu_setup in static constructor in that header. Same thing.

On the original subject. In presence of dynamic libraries _start is no longer the entry point. Now it's buried inside dynamic linker. You can exploit this with preloading much simpler. When ld.so loads dynamic library it calls library's _init function. You can either try using this directly or just write a static constructor. That is, static constructor in preloaded library runs before anything in your program.
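A minimal sketch of that suggestion in C, assuming GCC's `constructor` attribute (the C counterpart of a static constructor) and glibc's `<fpu_control.h>`; the file and function names are hypothetical. When the library is preloaded, the loader runs its initializers before the executable's own global constructors, though not necessarily before those of other shared libraries, and no libc symbol has to be overridden:

```c
/* preinit.c: sketch of a preloadable library that reconfigures the x87
   FPU from a constructor (GCC/Clang extension, glibc on x86 Linux). */
#include <fpu_control.h>

/* Run by the dynamic loader when this library is initialized, which
   happens before control reaches the executable's startup code. */
__attribute__((constructor))
static void preload_fpu_setup(void)
{
    fpu_control_t cw;
    _FPU_GETCW(cw);                          /* read the x87 control word */
    cw &= ~(_FPU_MASK_ZM | _FPU_MASK_IM);    /* unmask div-by-zero, invalid */
    _FPU_SETCW(cw);                          /* write it back */
}
```

It would be built and used the same way as fplib.c, for example `gcc -shared -fPIC -o preinit.so preinit.c` followed by `env LD_PRELOAD=./preinit.so <program>`.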

And while at it, valgrind does not use preloading (it wouldn't be able to do half of what it's doing), it uses dynamic binary compilation.

By "gcc optimized it away", I didn't mean "gcc the compiler, as opposed to the assembler and the linker, did it in an optimization pass as opposed to the linkage phase", I meant "the GNU C++ implementation" – the whole toolchain which is supposed to implement C++, which in this particular case it doesn't.

valgrind uses preloading in order to begin its dynamic binary compilation.

Is it possible to write a stub program that calls fpu_setup() and then just exec(2)s the main program? That seems easier, and since it's the same PID to the OS it seems like the FPU settings should survive.

Perhaps the FPU CSR would survive exec() – didn't check it; actually the asm(".section .init\ncall fpu_setup") trick mentioned in a comment above is probably the easiest way to do it, as that way you simply have the FPU configured before application code is called, as things should be, without having to tweak the way someone runs the code.

Actually what you should have done is… wait for it… not do floating point calculations in global constructors. I don't know why this trips up so many people. Global initialization order is undefined like you said, so why are you doing anything in global constructors in the first place? Make them global pointers and new them at the beginning of main, problem solved.

This is not a problem with C++, this is a problem of BAD code. Code heavily depending on global objects is bad – period.
Even worse if there is no documentation of which global objects are constructed – if there is, simply add breakpoints in those constructors, and you will find your bug.

And if you really need global objects, don't use any global objects with constructors – rather call an init-function at the beginning of main.
If you don't have control over main, have only 1 global object with constructor, which constructs the others in defined order AND document it.

You'd be surprised how much effort was expended by the C++ community to make the order of initialization and destruction sequences deterministic and "correct" to some approximation (in the lack of an explicitly defined order as well as an inclination to think that such order should exist at all). So it's not "just" a bad code problem, it's a language-specific cultural problem interacting with a loose language spec.

On the other hand, if you like C++ and the one of its subcultures you're dealing with the most, then I'm sincerely happy for you.

There is an elegant solution to this, I think I read it in Alexandrescu's Modern C++ Design.
You need a .h file that's included from all the .cpp files (with VisualC++ you typically have this anyway, because of using precompiled headers).
Make a class which increments a static variable in its constructor. When this is first incremented, do the initialization you need. Make a static global variable of this class at the beginning of the header file. (This will result in each translation unit having a copy, but your initialization code will only run once and as the first thing, because within translation units the order of initialization is the order of declaration.)

I know this method; I wouldn't call it "elegant", but to each his own. One "objective" issue with it though is, it doesn't work if you use a library with global constructors and you have its header files and object code but no source code. Whereas in C (or Python or a whole lot of other languages) you simply have your first executing statement, the second etc. etc. even if you can't recompile all the code that you're using from source.