I have analyzed the problem further and determined its exact cause.
I am hoping that Red Hat can provide a fix now that the cause is
understood. The details are below.
Problem: Nested "dlopen()" calls from a statically built application
will cause a segmentation fault.
Example: A statically built application a.out does a dlopen() of
libfoo1.so. In turn, libfoo1.so does a dlopen() of libfoo2.so. The
second dlopen(), the one for libfoo2.so, will cause a segmentation fault.
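As a rough illustration of the call pattern described above, here is a minimal sketch of a helper that loads a library and calls one entry point, checking for errors along the way. The helper name demo_call_in_library and the entry-point name used in the note below are hypothetical; they are not taken from the attached reproducer.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Load a shared library and call one void(void) entry point from it.
   Returns 0 on success, -1 if the library or the symbol cannot be found.
   Hypothetical helper for illustration only. */
int demo_call_in_library(const char *library, const char *symbol)
{
    void *handle = dlopen(library, RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen(%s) failed: %s\n", library, dlerror());
        return -1;
    }

    dlerror();  /* clear any stale error state before calling dlsym */
    void (*entry)(void) = (void (*)(void)) dlsym(handle, symbol);
    const char *err = dlerror();
    if (err != NULL || entry == NULL) {
        fprintf(stderr, "dlsym(%s) failed: %s\n", symbol,
                err != NULL ? err : "symbol is NULL");
        dlclose(handle);
        return -1;
    }

    entry();
    dlclose(handle);
    return 0;
}
```

In the scenario above, the a.out would call something like demo_call_in_library("libfoo1.so", "foo1_entry"), and the (hypothetical) foo1_entry inside libfoo1.so would make the same kind of call for libfoo2.so. The crash described here happens inside that second dlopen(), before any error can be returned to the caller.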
Cause: The segmentation fault occurs in the dynamic loader ld.so in
the function _dl_catch_error() [elf/dl-error.c] due to an
uninitialized function pointer GL(dl_error_catch_tsd) which, after
macro expansion, is really _rtld_local._dl_error_catch_tsd
[sysdeps/generic/ldsodefs.h]. Thus, the question becomes, why isn't
GL(dl_error_catch_tsd) being initialized during the second dlopen()?
Keep in mind that I'm picking on GL(dl_error_catch_tsd) because that
is where the segmentation fault occurred. There are likely other
variables in the _rtld_local structure that may be uninitialized as well.
An explanation follows for both the statically built case, which
crashes, and the dynamically built case, which works.
Application Built Statically (segmentation fault)
-------------------------------------------------
For libc.a, the GL(dl_error_catch_tsd) macro expands to the variable
shown below [elf/dl-tsd.c]
# ifndef SHARED
...
void **(*_dl_error_catch_tsd) (void) __attribute__ ((const)) =
&_dl_initial_error_catch_tsd;
...
#endif
Thus, libc.a has an initialized copy of _dl_error_catch_tsd which
points to the _dl_initial_error_catch_tsd routine.
# nm -A /usr/lib64/libc.a | grep error_catch_tsd
/usr/lib64/libc.a:dl-error.o: U _dl_error_catch_tsd
/usr/lib64/libc.a:dl-tsd.o:0000000000000000 D _dl_error_catch_tsd
/usr/lib64/libc.a:dl-tsd.o:0000000000000000 T _dl_initial_error_catch_tsd
Also in libc.a, the _dl_catch_error function is defined, which is the
routine in which the segmentation fault occurs.
# nm -A /usr/lib64/libc.a | grep dl_catch_error
/usr/lib64/libc.a:dl-deps.o: U _dl_catch_error
/usr/lib64/libc.a:dl-error.o:0000000000000000 T _dl_catch_error
/usr/lib64/libc.a:dl-open.o: U _dl_catch_error
/usr/lib64/libc.a:dl-libc.o: U _dl_catch_error
For libc.so, none of the symbols mentioned above are defined.
The a.out has the symbols because it was compiled with libc.a.
Thus, the first call to dlopen( libfoo1.so ) resolves its symbols
from the a.out address space. That is, it calls the _dl_catch_error
routine in the a.out address space which, in turn, accesses the
_dl_error_catch_tsd function pointer in the a.out address space which
was initialized with the address of the _dl_initial_error_catch_tsd
routine, which also exists in the a.out address space.
By the way, the reason I know what address space things are coming
from is because I put "_dl_printf" statements in the "glibc" sources
and compared the addresses that were printed at runtime with the
addresses shown in "/proc/<pid>/maps".
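The same address-to-mapping comparison can be done from inside the process. The following is a minimal, Linux-specific sketch that looks up which line of /proc/self/maps contains a given address. The function name demo_find_mapping is hypothetical, and this is not the instrumentation I actually added to the glibc sources.

```c
#include <stdio.h>
#include <string.h>

/* Find the /proc/self/maps line whose address range contains addr.
   Copies that line into out (of size outlen) and returns 1, or returns 0
   if no mapping matches or the file cannot be read.
   Assumes each maps line fits in the 512-byte buffer. */
int demo_find_mapping(const void *addr, char *out, size_t outlen)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL)
        return 0;

    unsigned long target = (unsigned long) addr;
    char line[512];
    int found = 0;

    while (fgets(line, sizeof line, maps) != NULL) {
        unsigned long start, end;
        /* Each line begins with "start-end" in hexadecimal. */
        if (sscanf(line, "%lx-%lx", &start, &end) != 2)
            continue;
        if (target >= start && target < end) {
            line[strcspn(line, "\n")] = '\0';   /* drop trailing newline */
            snprintf(out, outlen, "%s", line);
            found = 1;
            break;
        }
    }
    fclose(maps);
    return found;
}
```

Passing the address of a function or variable shows whether it lives in the a.out, in ld.so, or in some other mapped object, since the matching line ends with the path of the backing file.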
The second call to dlopen( libfoo2.so ) tries to resolve its symbols
from the ld.so (loader) address space.
Before I continue, let me say a few words about ld.so. During the
compilation of the loader, the GL(dl_error_catch_tsd) macro expands
to _rtld_local._dl_error_catch_tsd [sysdeps/generic/ldsodefs.h], a
totally different variable than the one in libc.a. That is,
GL(dl_error_catch_tsd) expands to a different variable in libc.a than
in ld.so, as can be seen in the code snippet shown below from
"sysdeps/generic/ldsodefs.h":
#ifndef SHARED
# define EXTERN extern
# define GL(name) _##name
#else
# define EXTERN
# ifdef IS_IN_rtld
# define GL(name) _rtld_local._##name
# else
# define GL(name) _rtld_global._##name
# endif
As you can see, during the compilation of libc.a, which is NOT
SHARED, GL(dl_error_catch_tsd) becomes _dl_error_catch_tsd. In the
compilation of ld.so, GL(dl_error_catch_tsd) expands to
_rtld_local._dl_error_catch_tsd. The reason I mention this is that
the loader code cannot use libc.a's variable: the two are completely
different objects.
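To make the effect of the token pasting concrete, here is a small self-contained sketch. All names in it (demo_counter, demo_rtld_local, GL_STATIC, GL_RTLD) are hypothetical stand-ins. glibc defines a single GL macro whose expansion depends on the build, whereas this sketch defines both variants side by side so they can be compared in one file.

```c
/* Hypothetical stand-in for the loader's private structure. */
struct demo_rtld { int demo_counter; };

int demo_counter = 1;                        /* plays the "libc.a" variable */
struct demo_rtld demo_rtld_local = { 2 };    /* plays the "ld.so" structure */

/* Same idea as GL(name): paste a prefix onto the name. */
#define GL_STATIC(name) demo_##name
#define GL_RTLD(name)   demo_rtld_local.demo_##name

/* Expands to: demo_counter */
int demo_read_static(void) { return GL_STATIC(counter); }

/* Expands to: demo_rtld_local.demo_counter */
int demo_read_rtld(void) { return GL_RTLD(counter); }
```

The same source text, GL(counter) in spirit, ends up reading two unrelated objects, so initializing one of them says nothing about the other.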
Anyway, back to the second call to dlopen( libfoo2.so ). This is
going to call the _dl_catch_error routine in ld.so's address
space. The problem is that, for the loader, GL(dl_error_catch_tsd)
gets initialized in dl_main [elf/rtld.c], but dl_main only gets
called for shared applications, not during a dlopen. Therefore,
GL(dl_error_catch_tsd) never gets initialized and, when it is
referenced in _dl_catch_error [elf/dl-error.c], it contains the value
0 (a NULL pointer), which causes a segmentation fault.
So, why does the first dlopen( libfoo1.so ) execute routines in the
a.out, while the second dlopen( libfoo2.so ) executes routines in
ld.so?
The reason is that when the a.out calls dlopen() it uses the dlopen
statically linked in from libdl.a. When the first library calls
dlopen() it gets resolved to the one in the pulled-in libdl.so.
That's because the a.out does NOT have a ** dynamic symbol table **
(separate from externals and debug symbols), so the first library
can't hook back to the dlopen() in the a.out. Thus it must use the
one pulled in from libdl.so.
Application Built With Shared Libraries (works)
-----------------------------------------------
In the case where the a.out is built with shared libraries, the
ld.so's (loader) dl_main [elf/rtld.c] routine is called which will
initialize GL(dl_error_catch_tsd), so we don't get a segmentation
fault since the variable is properly initialized.
Conclusion
----------
One possible fix would be to put a check in either _dl_catch_error
[elf/dl-error.c] or dlerror_run [elf/dl-libc.c] to see if we are in
the loader code and if dl_main has NOT been called. If we are in the
loader code and dl_main has not been called, then we need to
initialize GL(dl_error_catch_tsd) and other needed variables so that
we don't get a segmentation fault due to uninitialized variables.
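A minimal sketch of the kind of guard I have in mind follows. It is only a model of the idea and not glibc code: the structure, the function names, and the fallback routine are all hypothetical, and the real loader would need to initialize more than this one pointer.

```c
#include <stddef.h>

/* Hypothetical stand-in for the loader's private state (_rtld_local). */
struct demo_rtld_local {
    void **(*error_catch_tsd)(void);
};

/* Static storage is zero-initialized, so the pointer starts out NULL,
   just as in the case where dl_main was never called. */
static struct demo_rtld_local demo_local;

static void *demo_tsd_slot;

/* Fallback routine, playing the role of _dl_initial_error_catch_tsd. */
static void **demo_initial_error_catch_tsd(void)
{
    return &demo_tsd_slot;
}

/* Install the fallback if nobody initialized the pointer.
   Returns 1 if it had to initialize, 0 if the pointer was already set. */
int demo_ensure_initialized(void)
{
    if (demo_local.error_catch_tsd == NULL) {
        demo_local.error_catch_tsd = demo_initial_error_catch_tsd;
        return 1;
    }
    return 0;
}

/* Guarded call: never dereferences a NULL function pointer. */
void **demo_error_catch_slot(void)
{
    demo_ensure_initialized();
    return demo_local.error_catch_tsd();
}
```

Without the call to demo_ensure_initialized(), the first use of demo_error_catch_slot() would jump through a NULL function pointer, which is the failure mode seen in _dl_catch_error.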
I will be adding a small reproducer for this problem shortly.
Rigoberto Corujo

Created attachment 104377
Reproducer for the problem where nested dlopen()'s cause segmentation fault
Untar this file and compile with the "compile.sh" script.
Set LD_LIBRARY_PATH to your working directory.
Run the "a.out"

dlopen support in statically linked apps is very limited; it is not
meant to be a general-purpose library loader for any kind of library.
Its role is just to support NSS modules (built against the same
libc that they later run on).
dlopen from within the dlopened libraries is definitely not supported.
If libnss_ldap.so.* calls dlopen, then the bug is in that library.
For NSS purposes there is _dl_open_hook through which libraries
that call __libc_dlopen/__libc_dlsym/__libc_dlclose can use the
loader in the statically linked binary.

Using any NSS functionality in statically linked applications is only
supportable if nscd is used. Without nscd you are on your own. We
will not and *can not* handle anything else.
I don't think it makes any sense to keep this bug open. It is an
installation problem if nscd is not running.

Ulrich,
Are you saying that "service nscd start" would prevent the
segmentation fault from occurring? I just tried that with the initial
reproducer that I provided (the one that calls initgroups()) and I
get the same results (segmentation fault). Have you guys been
successful in running my reproducer with nscd?
As a follow-up to Jakub's comment, I just want to add that it is
actually "libsasl.a" that is doing the dlopen().
The "libnss_ldap.so" library links against "libldap.a".
The "libldap.a" links against "libsasl.a".
If the solution to this problem is to run nscd, then so be it. But,
there must be more to it than that because, like I said before, I
don't see a difference. I need some clarification, because I
understood Jakub to mean that what was going on was illegal but
Ulrich seems to suggest that this should work as long as nscd is
running.
Also, if dlopen'ing a shared library from a dlopen'ed library is not
allowed, then it would be beneficial to put a check in "glibc" so
that an error is returned to the calling dlopen() rather than letting
a segmentation fault occur.
Rigoberto

> I just tried that with the initial
> reproducer that I provided (the one that calls initgroups()) and I
> get the same results (segmentation fault). Have you guys been
> successful in running my reproducer with nscd?
That is impossible unless the program cannot communicate with the nscd
and falls back on using NSS itself or you hit a different problem.
There has been at one point a change in the protocol but I don't think
there are any such binaries out there.
Run the program using strace and, if necessary, start nscd by hand
with -d -d -d (three -d) added to the command line. It won't fork
then, and it will print lots of information.

Ulrich,
I followed your instructions. Every time I run my "a.out" there is
output from "nscd", so there is communication going on. The
segmentation fault is still occurring.
Can you confirm that you have indeed run my reproducer that calls
initgroups() and have not had a segmentation fault?
The man page for "nscd" states that it is used to cache data. I'm
not sure why running this daemon would solve my problem?
Rigoberto

> Can you confirm that you have indeed run my reproducer that calls
> initgroups() and have not had a segmentation fault?
Which reproducer calls initgroups? There is only one attachment
and this is code which uses dlopen() for other purposes than NSS.
This is not supported. If it breaks, you keep the pieces.
Run your application which uses NSS and make sure there are no other
dlopen calls in the statically linked code. Use strace to see what is
going on.
> The man page for "nscd" states that it is used to cache data. I'm
> not sure why running this daemon would solve my problem?
It's not the caching part which is interesting here, it's the "nscd
takes care of using the LDAP NSS module" part. All the statically
linked application has to do is to communicate the request via a
socket to nscd and receive the result. No NSS modules involved on the
client side. Which is why I say that if you still see NSS modules
used, something is wrong.
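One quick way to check the precondition described above is to see whether the nscd socket accepts connections at all. The following is a minimal sketch that only tests reachability of a UNIX stream socket; it does not speak the nscd protocol. The function name is hypothetical, and the path /var/run/nscd/socket mentioned in the note below is an assumption about a typical installation.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Return 1 if a UNIX stream socket at path accepts a connection, else 0.
   Hypothetical helper for illustration only. */
int demo_unix_socket_reachable(const char *path)
{
    struct sockaddr_un addr;

    /* Reject paths that do not fit in sun_path (including the NUL). */
    if (strlen(path) >= sizeof addr.sun_path)
        return 0;

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return 0;

    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);

    int ok = connect(fd, (struct sockaddr *) &addr, sizeof addr) == 0;
    close(fd);
    return ok;
}
```

Calling demo_unix_socket_reachable("/var/run/nscd/socket") before the lookup would show whether a client could hand the request to nscd at all. A result of 0 means the client has no choice but to fall back to loading NSS modules itself.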
One possibility is that you use services other than passwd, group, or
hosts. Is this the case? These services are currently not supported
in nscd. There is usually no need for this since plain files are
enough (/etc/services etc don't change).
So, please make sure your code does not use dlopen() for anything but
NSS and that after starting nscd either it is used or only
libnss_files is used.

Ulrich,
Either I'm misunderstanding you, you're misunderstanding me, or we're
both misunderstanding each other. Please take a look at the very
first entry I made to this bugzilla. Would you please compile and
run the code as I described and then tell me whether you see the same
problem I'm seeing? This problem has nothing to do with any
application that I'm writing. The second reproducer, which I had
attached, was merely to show what is happening under the covers in an
easy to understand way. The first reproducer, which I embedded
directly into the text I entered, is at the heart of the problem.
Please take a look at that and then we can continue our discussion.
Rigoberto

Why don't you just attach the data I'm looking for? Yes, your code
uses initgroups and this cannot fail if nscd is used. Which is why I
ask for the strace output related to the initgroups call and the
actual crash.
Since I do not believe that you can continue to see the same crash
with and without nscd (unless there is something broken in nscd) I
also asked for other places you might use dlopen (explicitly or
implicitly).
So, run strace.
FWIW, with a FC3t2 system I have no problem using the LDAP NSS module
from the statically linked executable but this is pure luck. What is
important is that once nscd runs, no NSS module is used.

Created attachment 104426
output of the strace with the statically built a.out
The LDAP database contains only one user "johndoe" as well as the group
"johndoe". Running the "id johndoe" command verifies that communication with
the slapd server is good. The "nscd -d -d -d" is also running. Communication
with it also appears to be good. I will attach the output of "nscd -d -d -d"
shortly.

Comment on attachment 104427
output of the "nscd -d -d -d"
The "nscd -d -d -d" is started freshly. The "strace a.out" is immediately run.
The output of "nscd" is shown. The "a.out" is still getting a segmentation
fault.

I see what is going on. The initgroups calls do not try to use nscd at
all but instead use the NSS modules directly. This is fatal in this
situation.
We might be able to get some code changes into one of the next RHEL3
updates but there is not much we can do right now, except to question
why you have to link statically. This has nothing but disadvantages.

Ulrich,
I, like you, work for support. You work for Red Hat support and I
work for HP support. Our XC (Extreme Clusters) product is based on
Red Hat Linux. One of our customers had asked us to document how to
configure LDAP. While configuring LDAP, I found that "mysqld" did
not start when LDAP was configured. After further analysis, I found
that mysqld was linked statically and called initgroups(). To work
around the mysqld problem we simply used a non-static version of
mysqld. However, this was a concern to me because there may be other
packages, or customer written applications, which could potentially
run into this problem. So, I had to get to the bottom of the
situation and find out why statically built applications which called
initgroups() would seg fault. This has led to this conversation that
you and I have been having. As you can see, it is not I who is
developing statically linked applications, but I am concerned that
customers who do develop statically linked applications and turn on
LDAP may run into this problem.
At the very least, for the short term, that second dlopen() should
return an error and not seg fault. Maybe errno could be set to EPERM
(operation not permitted) or something along those lines.
So, we are leaving this as a "to be fixed in a future release",
correct?
Rigoberto

I'm reassigning this bug to glibc and marking it as an enhancement.
This is what it is: NSS simply isn't supported in statically linked
applications. The summary has been changed to reflect the status.
If you are entitled to support for these kinds of issues you should
bring this issue up with your Red Hat representative so that it can be
added to IssueTracker. If you don't know what this is then you are
likely not entitled and you might want to consider getting appropriate
service agreements.

> At the very least, for the short term, that second dlopen() should
> return an error and not seg fault.
No, since there are situations when it works. NSS in statically
linked code is simply an "if it breaks you keep the pieces" thing: if
it works you can be very happy; if not, you'll have to find another
way. I cannot prevent people from having at least the opportunity to
get it to work.
> So, we are leaving this as a "to be fixed in a future release",
> correct?
Yes. I'll keep this bug open so that once we have code for this, I
can announce it. Whether we can use this code in future RHEL3
updates is another issue.

I added support for caching initgroups data in the current upstream
glibc. Backporting the changes to RHEL3 is likely not going to happen
since the whole program changed dramatically since the fork of the
sources for RHEL3. If it is essential, contact your representative
for support from Red Hat. I am closing this bug since the improvement
has been implemented.