Finding Thread-unsafe Code

One problem that I have had on a number of occasions when developing Unix software is libraries that use non-reentrant code being called from threaded programs. For example, a function such as strtok() is implemented with a static variable so that subsequent calls can operate on the same string. Calling it from a threaded program may result in a SEGV (for example if thread A calls strtok() and then frees the memory before thread B makes a second call to strtok()). Another problem is that a multithreaded program may have multiple threads performing operations on data of different sensitivity levels, for example a threaded milter may operate on email destined for different users at the same time. In that case use of a library call which is not thread safe may result in data being sent to the wrong destination.
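As a rough illustration of why the static variable matters, here is a minimal sketch that walks two strings in lock-step, the way calls from two threads might interleave. It uses strtok_r() with one save pointer per string, so neither walk disturbs the other; doing the same with strtok() would mix the two up because there is only one hidden pointer. The function name paired_tokens is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L   /* for strtok_r() and strdup() */
#include <stdlib.h>
#include <string.h>

/* Tokenise two strings in alternation and return the number of positions
 * at which both strings still yielded a token, or -1 if memory runs out.
 * Each string has its own save pointer, so the interleaved calls are safe. */
int paired_tokens(const char *a, const char *b, const char *delims)
{
    /* strtok_r() modifies its input, so work on private copies. */
    char *copy_a = strdup(a);
    char *copy_b = strdup(b);
    if (copy_a == NULL || copy_b == NULL) {
        free(copy_a);
        free(copy_b);
        return -1;
    }

    char *save_a = NULL;
    char *save_b = NULL;
    int pairs = 0;

    char *tok_a = strtok_r(copy_a, delims, &save_a);
    char *tok_b = strtok_r(copy_b, delims, &save_b);
    while (tok_a != NULL && tok_b != NULL) {
        pairs++;
        tok_a = strtok_r(NULL, delims, &save_a);
        tok_b = strtok_r(NULL, delims, &save_b);
    }

    free(copy_a);
    free(copy_b);
    return pairs;
}
```

The state that strtok() hides in a static variable is carried explicitly in save_a and save_b, which is what makes the _r variant usable from several threads at once.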

One potential solution is to use a non-threaded programming model (i.e. a state machine or multiple processes). State machines don’t work with libraries based on a callback model (e.g. libmilter), can’t take advantage of the CPU power available in a system with multiple CPU cores, and require asynchronous implementations of DNS name resolution. Multiple processes will often give less performance and are badly received by users who don’t want to see hundreds of processes in ps output.

So the question is how to discover whether a library that is used by your program has code that is not reentrant. Obviously a library could implement its own functions that use static variables – I don’t have a solution to this. But a more common problem is a library that uses strtok() and other libc functions that aren’t reentrant, simply because they are more convenient. Trying to examine the program with nm and similar tools doesn’t seem viable, as libraries tend to depend on other libraries and it’s not uncommon to have 20 shared objects being linked in at run-time. There is also the potential problem of code that isn’t called: if library function foo() happens to call strtok() but I only call function bar() from that library, then even though it resolves the symbol strtok at run-time it shouldn’t be a problem for me.

Instead of using Bruce’s code I wrote a minimal implementation of the same concept which searches the section 3 man pages installed on the system for functions which have a _r variant. In addition to that list of functions I added some functions from his list which did not have a _r variant. That way I got a list of 72 functions compared to the 40 that Bruce uses. Of course with my method the number of functions that are intercepted will depend on the configuration of the system used to build the code, but that is OK: if the man pages are complete then that will cover all functions that can be called from programs that you write.
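The selection rule itself is simple enough to sketch in a few lines of C. This is only an illustration of the idea, not the code I actually used: it assumes the function names have already been collected (for example from the file names of the section 3 man pages), and the names find_unsafe and is_reentrant_name are invented for this example. It also skips duplicate names in the input.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if `name` ends in "_r" and has something before the suffix. */
static int is_reentrant_name(const char *name)
{
    size_t len = strlen(name);
    return len > 2 && strcmp(name + len - 2, "_r") == 0;
}

/* Scan `names` and store in `out` every function that also appears with
 * a "_r" suffix. Duplicate input names are recorded only once.
 * Returns the number of entries written (never more than out_cap). */
size_t find_unsafe(const char *const names[], size_t n,
                   const char *out[], size_t out_cap)
{
    size_t found = 0;

    for (size_t i = 0; i < n && found < out_cap; i++) {
        if (is_reentrant_name(names[i]))
            continue;               /* the _r variants themselves are fine */

        int duplicate = 0;
        for (size_t k = 0; k < found; k++) {
            if (strcmp(out[k], names[i]) == 0) {
                duplicate = 1;
                break;
            }
        }
        if (duplicate)
            continue;

        size_t base_len = strlen(names[i]);
        for (size_t j = 0; j < n; j++) {
            /* Look for exactly names[i] followed by "_r". */
            if (is_reentrant_name(names[j]) &&
                strlen(names[j]) == base_len + 2 &&
                strncmp(names[j], names[i], base_len) == 0) {
                out[found++] = names[i];
                break;
            }
        }
    }
    return found;
}
```

Functions that are unsafe but have no _r variant are not found this way, which is why some names still have to be added by hand.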

Now there is one significant disadvantage to my code: the case where unsafe functions are called before child threads are created. Such code will be aborted even though in production it won’t cause any problems. One thing I am idly considering is writing code to parse the man pages for the various functions so it can use the correct parameters for proxying the library calls with dlsym(RTLD_NEXT, function_name), which would allow calls made before any threads exist to be passed through to the real function. The other option would be to hand code each of the 72 functions (and use more hand coding for each new library function I wanted to add).
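To show what the hand-coded option might look like for a single function, here is a rough sketch for strtok() only. It is not the code described in this post. It assumes glibc, and it assumes that intercepting pthread_create() is an acceptable way to notice that threads have been started; the names threads_started, strtok_fn and pthread_create_fn are invented for the example.

```c
#define _GNU_SOURCE               /* for RTLD_NEXT on glibc */
#include <dlfcn.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Set once the program has asked for its first thread. */
static atomic_bool threads_started;

typedef int (*pthread_create_fn)(pthread_t *, const pthread_attr_t *,
                                 void *(*)(void *), void *);
typedef char *(*strtok_fn)(char *, const char *);

/* Interposed pthread_create(): record that threading has begun, then
 * hand over to the real implementation later in the symbol search order.
 * The first lookup happens while the process still has a single thread. */
int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start)(void *), void *arg)
{
    static pthread_create_fn real_create;

    if (real_create == NULL)
        real_create = (pthread_create_fn)dlsym(RTLD_NEXT, "pthread_create");
    if (real_create == NULL)
        abort();

    atomic_store(&threads_started, true);
    return real_create(thread, attr, start, arg);
}

/* Interposed strtok(): harmless before threads exist, fatal afterwards. */
char *strtok(char *str, const char *delim)
{
    static strtok_fn real_strtok;

    if (atomic_load(&threads_started)) {
        fputs("strtok() called after thread creation\n", stderr);
        abort();                  /* leaves a core dump to inspect */
    }

    /* Only reached while single-threaded, so the lazy lookup is safe. */
    if (real_strtok == NULL)
        real_strtok = (strtok_fn)dlsym(RTLD_NEXT, "strtok");
    if (real_strtok == NULL)
        abort();

    return real_strtok(str, delim);
}
```

Something like “gcc -shared -fPIC -o thread.so wrapper.c -ldl” should build it as a shared object (wrapper.c is a placeholder file name, and -ldl is only needed on older glibc versions). Every additional function needs its own wrapper with the right prototype, which is exactly the hand coding that parsing the man pages would avoid.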

To run my code you simply compile the shared object and then run “LD_PRELOAD=./thread.so ./program_to_test” and the program will abort and generate a core dump if the undesirable functions are called.

I’d take issue with the statement that non-threaded programs can’t take advantage of CPUs and are slower than threaded ones. :-)

Many HPC codes are non-threaded MPI programs, and they take full advantage of multiple cores. In fact we’ve seen one particular code available in both a threaded SMP version and a pure MPI version and found that the MPI version makes far more efficient use of an SMP system (scales better) than the SMP version.

Regarding threading being faster, Tridge already made a convincing case back in 2004 (mentioned in passing at a LUV meeting way back when it was in the Telstra building) that threads are always slower than processes, aside from when the OS is very broken.

On my system I needed to remove the duplicate function names before compiling the library, so I added “sort -u” to the code. Actually you might want to merge the $OTHERS into the list before sorting and removing duplicates.

Chris: That’s interesting for HPC, but not of much use for what I’m doing. I have to work with the standard Milter libraries which don’t support such things.

Bill: Good point, that’s one of the many rough edges in my code. I trimmed the OTHERS list, it’s the entries from Sun’s list that didn’t appear when I searched the man pages – mostly functions with no reentrant versions. Your idea is much better.