A definitive treatise on coping with DLL hell (in general, not just in the Windows world whence the name came) would be nice.

DLL hell nowadays, and in the Unix world, is what you get when a single process loads and runs (or tries to) two or more versions of the same shared object, at the same time, or when multiple versions of the same shared object exist on the system and the wrong one (from the point of view of a caller in that process) gets loaded. This can happen for several reasons, and when it does the results tend to be spectacular.

Typically DLL hell can result when:

– multiple versions of the same shared object are shipped by the same product/OS vendor as an accident of development in a very large organization, or of political issues;
– multiple versions of the same shared object are shipped by the same product/OS vendor as a result of incompatible changes made in various versions of that shared object without corresponding updates to all consumers of that shared object shipped by the vendor (this is really just a variant of the previous case);
– a third party ships a plug-in that uses its own copy of a shared object, and that copy conflicts with the one shipped by the vendor of the product that hosts the plug-in, or such a conflict arises later when the vendor begins to ship that shared object (this is not uncommon in the world of open source, where some project becomes very popular and eventually every OS must include it).

At first glance the obvious answer is to get all developers, at the vendor and at third parties, to ship updates that remove the conflict by ensuring that a single version, shipped by the vendor, will be used. In practice this can be really difficult because: a) there are too many parties to coordinate with, none of whom budgeted for DLL hell surprises, and none of whom appreciate the surprise or want to do anything about it when another party could act instead; b) agreeing on a single version of the object may require a lot of development to ensure that all consumers can use the chosen version; c) there is always the risk that future consumers of the shared object will want a new, backwards-incompatible version of it, which means that DLL hell is never-ending.

Ideally libraries should be designed so that DLL hell is reasonably survivable. But this too is not necessarily easy, and requires much help from the language run-time or run-time linker/loader. I wonder how far such an approach could take us.

Consider a library like SQLite3. As long as each consumer’s symbol references to SQLite3 APIs are bound to the correct version of SQLite3, then there should be no problem, right? I think that’s almost correct, just not quite. Specifically, SQLite3 relies on POSIX advisory file locking, and if you read the comments on that in the src/os_unix.c file in SQLite3 sources, you’ll quickly realize that yes, you can have multiple versions of SQLite3 in one process, provided that they are not accessing the same database files!

In other words, multiple versions of some library can co-exist in one process, provided that there’s no implied and unexpected shared state between them that could cause corruption.
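
The locking hazard can be reproduced without SQLite3 at all. The sketch below is in Python for brevity (fcntl.lockf wraps the same fcntl(2) record locks) and lets two descriptors for one file stand in for two copies of a library in one process. Copy A takes a lock, copy B merely closes its own descriptor, and POSIX semantics then drop A's lock as well. The function name and the use of a forked child as the observer are illustrative choices, not anything taken from the SQLite3 sources.

```python
import fcntl
import os
import tempfile


def lock_survives(close_other_fd):
    """Return True if copy A's lock is still held after copy B is done.

    Two descriptors for one file stand in for two copies of a library.
    """
    tmp_fd, path = tempfile.mkstemp()
    os.close(tmp_fd)  # no locks are held yet, so this close is harmless
    fd_a = os.open(path, os.O_RDWR)  # opened by "copy A"
    fd_b = os.open(path, os.O_RDWR)  # opened by "copy B"
    try:
        # Copy A takes an exclusive POSIX advisory lock on the whole file.
        fcntl.lockf(fd_a, fcntl.LOCK_EX | fcntl.LOCK_NB)
        if close_other_fd:
            # Copy B closes its own descriptor. POSIX releases every lock
            # this process holds on the file, including copy A's.
            os.close(fd_b)
            fd_b = -1
        # Locks are per-process, so only another process can observe them.
        pid = os.fork()
        if pid == 0:
            code = 2  # unexpected failure in the child
            try:
                fd = os.open(path, os.O_RDWR)
                try:
                    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                    code = 0  # acquired: the parent's lock is gone
                except OSError:
                    code = 1  # refused: the parent's lock is intact
            finally:
                os._exit(code)
        _, status = os.waitpid(pid, 0)
        return os.WEXITSTATUS(status) == 1
    finally:
        os.close(fd_a)
        if fd_b != -1:
            os.close(fd_b)
        os.unlink(path)
```

On a POSIX-conforming system, lock_survives(False) should report that the lock is intact and lock_survives(True) that it has been lost, even though copy A never touched its own descriptor.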

What sorts of such implied, unexpected shared state might there be? Objects named after the process’ PID come to mind, for example (pidfiles, …). And POSIX advisory file locking (see above). What else? Imagine a utility function that looks through the process’ open file descriptors looking for ones that the library owns — oops, but at least that’s not very likely. Any process-local namespace that is accessible by all objects in that process will provide a source of conflicts. Fortunately thread-specific keys are safe.
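
As a minimal sketch of the PID-named-object case (in Python for brevity): two copies of a hypothetical library each create a pidfile named after the process, and the second creation fails because both derive the same name. The `tag` parameter is an assumption of this sketch, showing one possible mitigation in which each copy adds its own name and version to the object name. The function names and the "-libfoo-N" tags are made up for illustration.

```python
import os
import tempfile


def claim_pidfile(directory, tag=""):
    """Exclusively create a pidfile named after this process.

    `tag` is a hypothetical per-copy suffix (library name plus version).
    """
    path = os.path.join(directory, f"{os.getpid()}{tag}.pid")
    # O_EXCL makes the create fail if the name is already taken.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(f"{os.getpid()}\n")
    return path


def pidfile_collision_demo():
    """Return (untagged copies collided, tagged copies got distinct paths)."""
    with tempfile.TemporaryDirectory() as d:
        claim_pidfile(d)  # copy A
        try:
            claim_pidfile(d)  # copy B: same PID, therefore the same name
            collided = False
        except FileExistsError:
            collided = True
        a = claim_pidfile(d, tag="-libfoo-1")
        b = claim_pidfile(d, tag="-libfoo-2")
        return collided, a != b
```

The same reasoning applies to any name derived only from process-wide facts: unless the name also encodes which copy of the library owns the object, two copies will compute the same one.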

DLL hell is painful, and it can’t be prevented altogether. Perhaps we could produce a set of library design guidelines that developers could follow to produce DLL hell-safe libraries. The first step would be to make sure that the run-time can deal. Fortunately the Solaris linker provides “direct binding” (-B direct) and “groups” (-B group and RTLD_GROUP), so that between the two (and run-path and the like) it should be possible to ensure that each consumer of some library always gets the right one (provided one does not use LD_PRELOAD). Perhaps between linker features, careful coding and careful use, DLL hell can be made survivable in most cases. Thoughts? Comments?

5 Responses to “DLL hell”

One of our linker aliens just reminded me that -Bgroup is not what’s important here — -Bdirect is what matters.

That is, -Bdirect prevents accidental interposition, which is what helps in DLL hell avoidance.

I suspect that in the Linux world the standard response to DLL hell is to update the libraries involved to the latest, and that’s that. We try hard to avoid backwards incompatible changes, and that, sadly, is something that helps get Solaris into DLL hell situations.

@RC: Can you post pointers to recent arguments for static linking over dynamic shared objects? I’m quite curious, for I thought the matter was quite settled.

There are a number of reasons why dynamic linking is preferable to static linking, such as:

– you need dynamic shared objects if you want dynamic, pluggable interfaces (anything that uses dlopen(3C)), else every time you add a plugin you must rebuild all your applications!
– bug fixes can be made to the shared objects without having to re-link all their consumers
– can’t have interposition via LD_PRELOAD without having a run-time linker and dynamic shared objects
– address space randomization — can’t do that with statically linked binaries, not without re-linking them every time they are run
– the fact that symbol tables are needed for run-time linking greatly facilitates tools like truss (strace) and DTrace

and so on. Dynamic linking does have some costs, but IMO its benefits outweigh them.

Disk space might once have been a motivation for dynamic linking, but we’ve long since discovered many benefits to having dynamic linking.

Surely DLLs also provide the means to share pages and reduce physical RAM requirements – if code is compiled as relocatable and built into two different executables, then it will have to consume real RAM for each executable being used (albeit all instances of each executable can page off the same image).

The ‘bug fix’ thing is a double edged sword. Or not, I guess, depending on whether you care about regression testing.

Well, we do care about regression testing. We build things nightly, and run enormous numbers of tests on every build, and all pushes/putbacks require testing (often just unit testing — depends on the size of the fix and whether a suitable testsuite exists). Moreover, if a bugfix introduces another bug, well, it’s a supported product, and we’ll fix that bug.

Consider the alternative: bugs introduced by bug fixes might go undetected for months or years in a world in which everything is linked statically, because for the bugfix to get tested others would first have to re-link or, more likely, re-build their software. Worse, security bugs in libraries would go unfixed because, let’s face it, third parties will not be on top of bug fixes in others’ static libraries, and they won’t re-link/re-build in a timely manner. That is not only because the chances that they’d be responsive to others’ security bugs are low, but also because the whole point of static linking, from a third party vendor’s point of view, is to increase stability: if they re-link often, then they must re-qualify often, out of the same fear that bug fixes beget bugs. Third parties that feel that way will be no more motivated to re-link than they are to adopt dynamic linking.

And on top of that we’d be giving up on plug-in architectures. Seriously? No more Firefox plug-ins, no more PAM modules, no more plug-ins of any kind? In a world of REST APIs, mashups, various HLLs on the server-side, browsers and JavaScript, browser plug-ins galore, etcetera, we’re really going to be switching back to a no-plug-ins, static linking architecture?

I just don’t see that happening. Dynamic linking, if anything, will become a minor detail in the brave new world of "web 2.0", and static linking will remain in the museum, or as an internal detail of build systems, exactly where it belongs. Dynamic linking is everywhere in the world of high level languages. I’ve yet to see a call for the Ruby, Perl, Java, Erlang, and/or Python equivalent of static linking; why should C be stuck in the 1970s?