NOTE: Since I'm not sure whether this will interest Red Hat or XFree86 more,
I've sent it to both. That is why I've included references to the Red Hat
packages involved.
VERSION: Red Hat package: XFree86-SVGA-3.3.6-33
R6.3, public-patch-3
The problem was initially discovered in the XF86_SVGA server contained in the
above Red Hat package. It is also present in XF86_SVGA compiled from the
X336-src-x.tgz sources, both with and without the following fixes:
fix-01-r128, fix-02-svr4,
fix-03-mmap, fix-04-s3trio3d2x,
fix-05-s3trio3d, fix-06-s3trio3d2x,
fix-07-s3trio64v2gx+netfinity,
fix-08-s3savage_ix+mx.
CLIENT MACHINE and OPERATING SYSTEM: i386/Red Hat Linux 7.0
Dell Cpt laptop, Intel Celeron 333 MHz processor,
kernel 2.2.17.
The kernel is unpatched and compiled from source, not a Red Hat RPM.
PCMCIA version 3.1.21 is also compiled from source.
DISPLAY TYPE: Neomagic NM2360 driving internal LCD panel
(Chipset forced to NM2200 in XF86Config)
WINDOW MANAGER: None -- see later.
COMPILER: gcc version 2.96 20000731 (Red Hat Linux 7.0)
AREA: xc/lib/font/fc
SYNOPSIS:
fs_handle_unexpected() [in xc/lib/font/fc/fserve.c] can call
_fs_eat_rest_of_error() [in .../fsio.c] with an FSFpeRec "conn" structure in
which the field trans_conn is NULL.
This ultimately leads to TRANS(Read)() [_FontTransRead()] [in
xc/lib/xtrans/Xtrans.c] attempting to dereference a NULL pointer, followed by
"Caught signal 11" and a server crash.
DESCRIPTION:
A brief description of my setup...
I'm running XFree86 4.0.1 as shipped in Red Hat 7.0. Red Hat supplies both the
new 4.0.1 XFree86 server and the older individual servers from XFree86 3.3.6.
Due to some (very minor) glitches with the newer server on my system, I'm
still running the XF86_SVGA server from 3.3.6.
I'm also running VNC (version 3.3.3r2, with some modifications, installed from
source) and kdm (from KDE 1.1.2, Red Hat package kdebase-1.1.2-48).
I log in using kdm, and my .xsession script runs vncviewer in full-screen mode
to simulate an X session that I can share to other displays as I move around.
Fonts are provided by xfs (from 4.0.1, Red Hat package XFree86-xfs-4.0.1-1).
This setup worked correctly for several weeks until I installed the Microsoft
web TrueType fonts (http://www.microsoft.com/typography/fontpack/) to be
served by xfs.
Now, if the laptop is suspended for more than about 5-10 minutes, the X server
will probably crash shortly after the system is resumed.
The specifics of the problem...
I've traced the execution path using a combination of the call trace and
judicious insertion of printf() calls. I can get as far as fs_wakeup() [in
xc/lib/font/fc/fserve.c], although I'm unsure where it is called from
(perhaps as a callback function?).
At some point after the resume, fs_wakeup() is called and, not finding a
matching block record for the data it reads from the connection, it calls
fs_handle_unexpected() [in the same file].
----->8 Snip 8<-----
static void
fs_handle_unexpected(conn, rep)
    FSFpePtr    conn;
    fsGenericReply *rep;
{
    if (rep->type == FS_Event && rep->data1 == KeepAlive) {
        fsNoopReq   req;
        /* ping it back */
        req.reqType = FS_Noop;
        req.length = SIZEOF(fsNoopReq) >> 2;
        _fs_add_req_log(conn, FS_Noop);
        _fs_write(conn, (char *) &req, SIZEOF(fsNoopReq));  /* <-- NO ERROR CHECK HERE. */
    }
    /* this should suck up unexpected replies and events */
    _fs_eat_rest_of_error(conn, (fsError *) rep);
}
----->8 Snip 8<-----
fs_handle_unexpected() finds that this is a "KeepAlive" event and attempts to
send a Noop back to the font server. This involves calling _fs_write() [in
.../fsio.c].
When _fs_write() calls _FontTransWrite() [in xc/lib/xtrans/Xtrans.c], the
call fails, setting errno to EPIPE. _fs_write() then calls
_fs_connection_died() [in .../fserve.c], which (amongst other things) sets
conn->trans_conn to NULL. _fs_write() then sets errno to EPIPE and returns -1
to signal the error.
Crucially, fs_handle_unexpected() doesn't check the return value of
_fs_write() and goes on to call _fs_eat_rest_of_error() [in .../fsio.c] with
"conn" containing a NULL pointer in the field trans_conn.
----->8 Snip 8<-----
void
_fs_connection_died(conn)
    FSFpePtr    conn;
{
    if (!conn->attemptReconnect)
        return;
    conn->attemptReconnect = FALSE;
    fs_close_conn(conn);
    conn->time_to_try = time((Time_t *) 0) + FS_RECONNECT_WAIT;
    conn->reconnect_delay = FS_RECONNECT_WAIT;
    conn->fs_fd = -1;
    conn->trans_conn = NULL;        /* <--- HERE. */
    conn->next_reconnect = awaiting_reconnect;
    awaiting_reconnect = conn;
}
----->8 Snip 8<-----
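To make the state change concrete, here is a minimal, self-contained model of the failure path described above. It is a sketch, not the XFree86 code: `model_conn`, `model_write()` and `model_connection_died()` are hypothetical stand-ins for FSFpeRec, _fs_write() and _fs_connection_died(), and the failing transport write is simulated by a flag rather than a real socket.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the transport handle (XtransConnInfo). */
typedef struct { int dummy; } model_trans;

/* Hypothetical stand-in for FSFpeRec: only the fields that matter here. */
typedef struct {
    int          fs_fd;
    model_trans *trans_conn;
    int          attempt_reconnect;
    int          peer_gone;   /* simulates the font server having gone away */
} model_conn;

/* Mirrors the relevant part of _fs_connection_died(): the descriptor is
 * invalidated and the transport handle is dropped (set to NULL). */
static void model_connection_died(model_conn *conn)
{
    if (!conn->attempt_reconnect)
        return;
    conn->attempt_reconnect = 0;
    conn->fs_fd = -1;
    conn->trans_conn = NULL;
}

/* Mirrors the error branch of _fs_write(): on failure the connection is
 * torn down, errno is set to EPIPE and -1 is returned to the caller. */
static int model_write(model_conn *conn, size_t size)
{
    if (conn->peer_gone) {
        model_connection_died(conn);
        errno = EPIPE;
        return -1;
    }
    return (int) size;
}
```

After `model_write()` fails, the caller still holds the same `conn` pointer, but `conn->trans_conn` is now NULL; in the real server that is the value fs_handle_unexpected() goes on to hand to _fs_eat_rest_of_error().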
_fs_eat_rest_of_error() simply calls _fs_drain_bytes() [in the same file],
passing on "conn" with its NULL pointer.
_fs_drain_bytes() calls _fs_read() [in the same file] to read the data from the
connection, once again passing on "conn".
_fs_read() calls TRANS(Read) [_FontTransRead()] in xc/lib/xtrans/Xtrans.c,
passing conn->trans_conn (i.e. NULL) as the first parameter.
----->8 Snip 8<----- [From _fs_read()]
while ((bytes_read = _FontTransRead(conn->trans_conn,
                                    data, (int) size)) != size) {
----->8 Snip 8<-----
_FontTransRead() then dereferences this NULL pointer, and the server catches
signal 11 (SIGSEGV).
----->8 Snip 8<-----
int
TRANS(Read) (ciptr, buf, size)      /* <-- ciptr is NULL. */
...
{
    return ciptr->transptr->Read (ciptr, buf, size);
}
----->8 Snip 8<-----
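Independently of fixing the caller, a defensive check at the bottom of the read path would turn this NULL dereference into an ordinary I/O error. The sketch below is hypothetical: `model_ciptr`, `model_transport`, `guarded_read()` and `fake_read()` are illustrative names, not part of Xtrans, and whether such a guard belongs in TRANS(Read) itself or in _fs_read() is a judgement call for the maintainers.

```c
#include <errno.h>
#include <stddef.h>

struct model_ciptr;

/* Hypothetical stand-in for the transport method table. */
typedef struct {
    int (*Read)(struct model_ciptr *ciptr, char *buf, int size);
} model_transport;

/* Hypothetical stand-in for XtransConnInfo. */
typedef struct model_ciptr {
    model_transport *transptr;
} model_ciptr;

/* Same dispatch as TRANS(Read), but refuses a NULL connection instead of
 * dereferencing it: the caller sees -1 with errno == EPIPE, just as it
 * would for any other dead socket. */
static int guarded_read(model_ciptr *ciptr, char *buf, int size)
{
    if (ciptr == NULL || ciptr->transptr == NULL) {
        errno = EPIPE;
        return -1;
    }
    return ciptr->transptr->Read(ciptr, buf, size);
}

/* Trivial transport used only to demonstrate the normal path. */
static int fake_read(model_ciptr *ciptr, char *buf, int size)
{
    (void) ciptr;
    for (int i = 0; i < size; i++)
        buf[i] = 'x';
    return size;
}
```

With the guard in place, a call such as `guarded_read(NULL, buf, 4)` returns -1 instead of faulting, so the existing error handling in the caller gets a chance to run.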
REPEAT BY:
This sequence seems to cause the crash every time on my system (although the
crash can also happen when the exact sequence is not followed).
Initially: we've logged in using kdm and are running a normal X session under
Xvnc, with vncviewer running full screen so we can see it.
[Window Maker 0.62.1, Red Hat package WindowMaker-0.62.1-14, is running under
Xvnc; no window manager is running under XF86_SVGA (just vncviewer), and
nothing else.]
1. Run xscreensaver to lock the X display with the
   "xscreensaver-command -lock" command
   (XScreenSaver 3.25, Red Hat package xscreensaver-3.25-4).
   Note that xscreensaver is locking Xvnc, NOT XF86_SVGA.
2. Press a key so that xscreensaver prompts for a password.
3. While the password dialog is displayed, press Fn+Suspend to put the laptop
   into suspend mode.
4. Go and have a cup of coffee.
   Note: this step is important! You must wait at least 10 minutes (say 15 for
   good measure). If you resume immediately, it won't crash.
5. After a 15-minute delay, press the power button. The password dialog will
   pop back up (unless the panel is set to blank after 10 minutes, in which
   case press Ctrl or a similar key to wake it up).
6. Type your password and press Enter, and XF86_SVGA will crash with the
   "Caught signal 11." message.
   Xvnc will be fine: you can restart the X server, reconnect to Xvnc with
   vncviewer, and your Xvnc session will be unaffected.
SAMPLE FIX:
This fix adds an error check so that fs_handle_unexpected() checks the return
value of _fs_write() and returns immediately if it is -1 (thus skipping the
call to _fs_eat_rest_of_error()).
It also adds a few warning messages, so you know that the problem is still
there, although it is now handled more gracefully.
----->8 Snip 8<-- (diff -c xc/lib/font/fc/fserve.c xc.new/lib/font/fc/fserve.c)
*** xc/lib/font/fc/fserve.c Wed Jun 11 13:08:41 1997
--- xc.new/lib/font/fc/fserve.c Sun Nov 26 14:49:27 2000
***************
*** 1,4 ****
--- 1,5 ----
  /* $TOG: fserve.c /main/49 1997/06/10 11:23:56 barstow $ */
+ /* Modified to prevent a seg.fault crash -- R.Kay (26-11-00) */
  /*
  Copyright (c) 1990 X Consortium
***************
*** 92,97 ****
--- 93,99 ----
        (pci)->descent || \
        (pci)->characterWidth)
+ #include <stdio.h> /* So we can print some warnings -- RKAY */
  extern FontPtr find_old_font();
***************
*** 1214,1220 ****
          req.reqType = FS_Noop;
          req.length = SIZEOF(fsNoopReq) >> 2;
          _fs_add_req_log(conn, FS_Noop);
!         _fs_write(conn, (char *) &req, SIZEOF(fsNoopReq));
      }
      /* this should suck up unexpected replies and events */
      _fs_eat_rest_of_error(conn, (fsError *) rep);
--- 1216,1229 ----
          req.reqType = FS_Noop;
          req.length = SIZEOF(fsNoopReq) >> 2;
          _fs_add_req_log(conn, FS_Noop);
!         /* If _fs_write fails, conn->trans_conn will be NULL and calling
!          * _fs_eat_rest_of_error will eventually cause a segfault in
!          * _FontTransRead() -- RKAY */
!         if (_fs_write(conn, (char *) &req, SIZEOF(fsNoopReq)) == -1) {
!             fprintf(stderr, "Warning: _fs_write failed in "
!                     "fs_handle_unexpected.\n");
!             return;
!         }
      }
      /* this should suck up unexpected replies and events */
      _fs_eat_rest_of_error(conn, (fsError *) rep);
----->8 Snip 8<------------------------
----->8 Snip 8<-- (diff -c xc/lib/font/fc/fsio.c xc.new/lib/font/fc/fsio.c)
*** xc/lib/font/fc/fsio.c Fri Jul 23 14:42:00 1999
--- xc.new/lib/font/fc/fsio.c Sun Nov 26 14:49:38 2000
***************
*** 1,5 ****
--- 1,6 ----
  /* $XConsortium: fsio.c,v 1.37 95/04/05 19:58:13 kaleb Exp $ */
  /* $XFree86: xc/lib/font/fc/fsio.c,v 3.5.2.2 1999/07/23 13:22:20 hohndel Exp $ */
+ /* Modified to prevent a seg.fault crash -- R.Kay (26-11-00) */
  /*
   * Copyright 1990 Network Computing Devices
   *
***************
*** 457,462 ****
--- 458,467 ----
          } else if (ECHECK(EINTR)) {
              continue;
          } else {            /* something bad happened */
+             /* RKAY */
+             if (ECHECK(EPIPE))
+                 fprintf(stderr, "Warning: EPIPE while writing to font "
+                         "server.\n");
              _fs_connection_died(conn);
              ESET(EPIPE);
              return -1;
----->8 Snip 8<------------------
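The control flow the fserve.c hunk changes can be modelled in isolation. In this hypothetical, heavily simplified sketch, `handle_unexpected()` and `handle_unexpected_unpatched()` stand in for fs_handle_unexpected() with and without the patch, `send_noop()` for _fs_write(), and `drain()` for _fs_eat_rest_of_error(). Instead of crashing, `drain()` counts how often it would have been reached with a NULL trans_conn, which is the case that kills the real server.

```c
#include <stddef.h>

typedef struct {
    void *trans_conn;    /* NULL once the connection has died */
    int   write_fails;   /* simulates _fs_write() hitting EPIPE */
    int   null_drains;   /* times drain() ran with trans_conn == NULL */
} sketch_conn;

/* Stand-in for _fs_write(): on failure, drop the handle and return -1. */
static int send_noop(sketch_conn *conn)
{
    if (conn->write_fails) {
        conn->trans_conn = NULL;
        return -1;
    }
    return 0;
}

/* Stand-in for _fs_eat_rest_of_error(): the real code would end up in
 * TRANS(Read) and dereference trans_conn; here we only record it. */
static void drain(sketch_conn *conn)
{
    if (conn->trans_conn == NULL)
        conn->null_drains++;
}

/* Original logic: the keep-alive reply's return value is ignored. */
static void handle_unexpected_unpatched(sketch_conn *conn, int is_keepalive)
{
    if (is_keepalive)
        send_noop(conn);
    drain(conn);
}

/* Patched logic: bail out as soon as the keep-alive reply fails. */
static void handle_unexpected(sketch_conn *conn, int is_keepalive)
{
    if (is_keepalive && send_noop(conn) == -1)
        return;
    drain(conn);
}
```

In this model the unpatched handler reaches `drain()` with a NULL handle whenever the write fails, while the patched one never does; when the write succeeds, both behave identically.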
Of course, there remains the question of why writing to the font server fails
after a resume. I haven't had a chance to investigate that aspect of the
problem.
However, whatever the answer may be, a problem with the font server shouldn't
cause the X server to segfault.
One side effect of the above patch is that, after the averted crash,
XF86_SVGA starts consuming ~90% of the CPU.
This appears to be because the mechanism for calculating the timeouts for the
Select() calls in WaitForSomething() [xc/programs/Xserver/os/WaitFor.c] now
decides on a timeout of 0, so XF86_SVGA runs in a busy loop. Aside from that,
it works fine (if the fan hadn't switched on, I wouldn't have noticed).
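One way a zero timeout can arise is sketched below. This is an assumption, not something verified against WaitFor.c or fserve.c: suppose the font-server code limits the select() timeout so the server wakes no later than the pending reconnect deadline, conn->time_to_try. Once that deadline is in the past and nothing ever performs the reconnect, the clamped timeout stays at 0 on every pass. `reconnect_timeout()` is a hypothetical helper illustrating only that arithmetic.

```c
#include <time.h>

/* Hypothetical helper, not taken from WaitFor.c or fserve.c: how many
 * seconds the server may sleep in select() before the next font-server
 * reconnect attempt falls due. A deadline in the past clamps to 0. */
static long reconnect_timeout(time_t now, time_t time_to_try)
{
    if (time_to_try <= now)
        return 0;                       /* overdue: poll immediately */
    return (long) (time_to_try - now);
}
```

In this model, if time_to_try is set once by _fs_connection_died() and never advanced, every call after the deadline returns 0, which matches the busy loop described above.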
There appears to be a function, _fs_try_reconnect() [xc/lib/font/fc/fserve.c],
that looks like it should re-establish the font server connection.
The only place I can find where it is called is fs_wakeup(). However, if the
connection has died and _fs_connection_died() has been called, as in this
case, then conn->fs_fd == -1, so fs_wakeup() returns immediately and the call
to _fs_try_reconnect() is never reached.
----->8 Snip 8<-----
if (conn->fs_fd == -1)
    return FALSE;
----->8 Snip 8<-----
I tried sleeping for 10 seconds and then calling _fs_try_reconnect() in
fs_handle_unexpected(), but conn->trans_conn was still NULL. I guess an
examination of xfs would be a good idea.
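If the analysis of fs_wakeup() is right, one possible direction is to service a due reconnect before the early return on fs_fd == -1, so that a dead connection still gets its retry. The sketch below is a hypothetical, simplified model of that ordering only: `wake_conn`, `wakeup()` and `try_reconnect()` are illustrative names, and `try_reconnect()` merely counts attempts. Given the experiment above, a reconnect attempt may still fail on the xfs side, so this would address reachability, not the underlying xfs issue.

```c
#include <time.h>

typedef struct {
    int    fs_fd;           /* -1 after the connection died */
    time_t time_to_try;     /* next reconnect attempt is due at this time */
    int    awaiting;        /* nonzero while on the reconnect list */
    int    reconnect_calls; /* how many attempts were made */
} wake_conn;

/* Hypothetical reconnect: a real one would reopen the transport and,
 * on success, restore fs_fd; here it only records the attempt. */
static void try_reconnect(wake_conn *conn)
{
    conn->reconnect_calls++;
}

/* Ordering sketch: handle a due reconnect first, then perform the early
 * return on fs_fd == -1. Returns 1 if there was data to process. */
static int wakeup(wake_conn *conn, time_t now, int readable)
{
    if (conn->awaiting && conn->time_to_try <= now)
        try_reconnect(conn);
    if (conn->fs_fd == -1)
        return 0;
    return readable;
}
```

With this ordering, a connection marked dead still has its retry attempted once the deadline passes, while live connections are handled exactly as before.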
Hopefully this is of some use,
R.Kay

Thanks very much for the debugging session and analysis, and for the patch as
well. I believe we now have a fix for 4.x-based servers and will be testing it
soon. I will also try out your patch soon. If by chance you've come across any
more information or patches, please feel free to submit them, and I will try
to get a fix out ASAP. Sorry for the delayed response. I'm playing catch-up
with an inherited bug report pile from XFree86, and hope to get caught up
sometime this year. ;o)
This bug is a duplicate of Bug #17991 and countless others I won't mention;
however, I'm not marking it as a duplicate, as it is the most detailed report
of the bunch. Thanks again.