Due to a faulty fan, one CPU overheated and brought the system down. On
restart, fsck indicated that some filesystem corruption had occurred.
On startup, gdm would not start. After entering my username in the
console, the login prompt came back without giving me the opportunity to
enter my password. The logical next step was to boot in single user
mode.
In single user mode, it quickly became apparent that a few programs
segfault, among them su, apache, gdm, smbd, nmbd, cron, pppd, and login.
Almost everything else superficially seems to work, with a few
exceptions.
So I tried to find out what these programs could have in common, apart
from creating tasks with a different user than the one under which they
are run. I suspected that they all depended on a library whose file the
crash corrupted. So off I went with ldd. Apart from the omnipresent
libc6 (without which not much does anything at all), the prime suspect
was libcrypt. It seems that anything that uses libcrypt crashes the
moment it calls it. I say "it seems" only because the strace output did
not let me be more conclusive, though that may simply be because I am
not familiar with strace.
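For anyone wanting to repeat the ldd survey, here is a minimal sketch of
how it could be scripted. The function names ldd_names and common_libs
are made up for illustration, and the binary paths in the usage comment
are assumptions, not the actual locations on my system.

```shell
# Hypothetical helper: extract library names from ldd output on stdin.
# ldd prints lines such as
#   libcrypt.so.1 => /lib/libcrypt.so.1 (0x4001e000)
# and only lines with "=>" in the second field are kept.
ldd_names() {
    awk '$2 == "=>" { print $1 }'
}

# Hypothetical helper: for the binaries given as arguments, count how
# many of them link against each shared library, most common first.
common_libs() {
    for bin in "$@"; do
        ldd "$bin" 2>/dev/null | ldd_names | sort -u
    done | sort | uniq -c | sort -rn
}

# Usage (paths are assumptions, adjust as needed):
#   common_libs /bin/su /bin/login /usr/sbin/cron /usr/sbin/pppd
```

A library that every segfaulting program links against, and that the
working programs mostly do not, would be the first candidate to check.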
I observed one exception: makepasswd. strace shows it calling something
from libcrypt, but it does its job with no problem. I compared
/lib/libcrypt.so.1 between the broken server and another machine with
the same OS, and the file sizes were identical, although identical sizes
do not rule out corrupted contents. So I have no proof that libcrypt is
guilty, and my hunch about it may be completely wrong.
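A stricter comparison than file size would be a checksum taken on both
machines. This is only a sketch: it assumes md5sum is installed and
still works on the broken server, and the name checksum_of is made up
for illustration.

```shell
# Hypothetical helper: print only the MD5 checksum of a file, so the
# value from one machine can be compared with the value from the other.
checksum_of() {
    md5sum "$1" | awk '{ print $1 }'
}

# Usage, to be run on each machine in turn:
#   checksum_of /lib/libcrypt.so.1
#   checksum_of /lib/libc.so.6
```

If the two machines print different values for the same file and the
same package version, that file's contents differ even though the sizes
match.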
Here is an example of strace output. The program studied is "login"
(the one that generates the console login prompt).
It begins with calls in
/lib/libcrypt.so.1
/lib/libpam.so.0
/lib/libpam_misc.so.0
/lib/libdl.so.2
Then, on the sane system it goes like the following. It's the same on
the broken system, except that the memory addresses are not the same.
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\230\327"...,
1024) = 1
fstat64(3, {st_mode=S_IFREG|0755, st_size=1170492, ...}) = 0
old_mmap(NULL, 1187296, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) =
0x4005c000
mprotect(0x40174000, 40416, PROT_NONE) = 0
old_mmap(0x40174000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED,
3, 0x1
old_mmap(0x4017a000, 15840, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANO
close(3) = 0
munmap(0x40016000, 40843) = 0
On the broken machine, the last line is different. The address is the
same, but the second argument (the length) is not:
munmap(0x40016000, 35897) = 0
At this point, login on the broken machine segfaults:
--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++
I don't know if that detail is relevant, but since some (but not all) of
the segfaulting programs end the same way, I thought it might be.
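Since the two traces differ mostly in memory addresses, one way to
compare them is to mask the addresses and diff what is left. This is a
rough sketch: the name mask_addresses and the log file names are
assumptions for illustration.

```shell
# Hypothetical helper: replace every hexadecimal address in strace
# output (read on stdin) with the placeholder ADDR, so that two traces
# can be diffed without address noise.
mask_addresses() {
    sed 's/0x[0-9a-f][0-9a-f]*/ADDR/g'
}

# Usage (file names are assumptions):
#   strace -o sane.log login        # on the sane machine
#   strace -o broken.log login      # on the broken machine
#   mask_addresses < sane.log > sane.masked
#   mask_addresses < broken.log > broken.masked
#   diff sane.masked broken.masked
```

Other arguments, such as the munmap length, are left untouched, so
differences like that one still show up in the diff.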
On the sane system, here is the beginning of what follows in the strace
after the point where it has segfaulted on the broken system.
brk(0) = 0x80546dc
brk(0x8054704) = 0x8054704
brk(0x8055000) = 0x8055000
getuid32() = 0
ioctl(0, SNDCTL_TMR_TIMEBASE, {B38400 opost isig icanon echo ...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE, {B38400 opost isig icanon echo ...}) = 0
brk(0x8057000) = 0x8057000
readlink("/proc/self/fd/0", "/dev/pts/2", 4095) = 10
socket(PF_UNIX, SOCK_STREAM, 0) = 3
connect(3, {sin_family=AF_UNIX, path="/var/run/.nscd_socket"}, 110) = -1
ENOENT
close(3) = 0
open("/etc/nsswitch.conf", O_RDONLY) = 3
I thought it might give some context.
If anyone has read this far, thank you. At this point, I am somewhat out
of my depth, to say the least. Any hint that can help me pin down the
cause of my misery is more than welcome.
And yes, I do have backups of my data, but not of the operating system.