User Story

Much of the feedback from Netscape 6
(http://client/mojo/feedback/newsgroup_feedback_overview.html) indicates that
there are hangs that many people are hitting. We currently have no mechanism
for collecting data about hangs. If we could trap these hangs into talkback, we
would be able to get
information that helps us fix them. Even just getting people we can contact to
reproduce problems would be an enormous help.
I talked this over with Chris Saari and he believes it's straightforward to trap
at least one class of hangs by having a timer that works just like the busy
cursor does on the Mac. After some set number of seconds in which we have not
returned to the event loop, this timer fires and either drops into talkback (if
we're very sure it's always the right thing to do) or puts up a dialog asking
the user how they'd like to proceed (if we think that's safe or we're not always
sure).
Chris doesn't know how to get into talkback, but can help with hooking up the
event stuff. Someone would need to own the UI if we did a dialog. It would
help to be able to tell talkback this report was due to a hang rather than
relying on the user to say that in their comments.

This mechanism will work, except we won't get a stack trace. With our qfa
component we can trigger an artificial incident and bring up the customized
talkback dialog box (using the talkback server UI). By doing this, the data
will be maintained by the talkback servers. Note: the customized Talkback UI is
not available for Mac.

I think it would work XP (cross-platform), assuming your hung app was still
going through the event loop. If not, it might still work on Windows and/or
Linux if we're processing timers asynchronously (I'm not familiar with how
timers are implemented there), i.e. they're not processed as part of the event
loop. If you're hung at some interrupt level higher than timers, well, your
life sucks. I highly doubt that though.

In the event processing loop, you store the current time. In the timer
callback, you check the machine's tick count and compare it against that stored
value. If more than, say, 10 seconds have passed between the last time you were
in the event loop and the current time during the timer callback, you may wish
to consider doing something about it.
This only catches bugs that stop you from going through the event loop. It is
entirely possible to make the app appear hung, yet still be going through the
event loop. I cite my many 0.9.1 command dispatching/handling bugs.
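
The tick-count comparison described above can be sketched roughly as follows.
This is only an illustration in Python, not Mozilla code: the class name
HangWatchdog, its method names, and the FakeClock helper are all made up for
the example, and the 10 second threshold is just the figure suggested above.

```python
import time


class HangWatchdog:
    """Flags a possible hang when the event loop has not checked in recently.

    Hypothetical sketch: none of these names come from the Mozilla code base.
    """

    def __init__(self, threshold_seconds=10.0, clock=time.monotonic):
        self.threshold_seconds = threshold_seconds
        self.clock = clock
        self.last_loop_time = clock()

    def note_event_loop_pass(self):
        # Call this once per trip through the event loop.
        self.last_loop_time = self.clock()

    def seconds_since_loop(self):
        return self.clock() - self.last_loop_time

    def timer_fired(self):
        # Call this from the timer callback. True means "we have not been
        # through the event loop for longer than the threshold".
        return self.seconds_since_loop() > self.threshold_seconds


class FakeClock:
    """Hand-driven clock so the logic can be exercised without waiting."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


if __name__ == "__main__":
    clock = FakeClock()
    watchdog = HangWatchdog(threshold_seconds=10.0, clock=clock)
    clock.now = 5.0
    print(watchdog.timer_fired())   # still within the threshold
    clock.now = 11.0
    print(watchdog.timer_fired())   # no event loop pass for 11 seconds
    watchdog.note_event_loop_pass()
    print(watchdog.timer_fired())   # the event loop checked in again
```

As noted above, a check like this only sees hangs that keep the app out of the
event loop, and it only helps if the timer callback itself can still run.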

Are you storing the system time? There may be scenarios that
we aren't thinking of where this might cause problems:
* What about sleep functions on laptops?
The app will be suspended but "current time" will keep going.
* What if I reset the time on my machine?
Sudden time jumps might trigger a crash. Daylight savings time?
Maybe we could use alecf's mozilla timer service instead of system time.
If this goes in, we should make it tunable/turn-off-able with a pref.
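
To make the clock concern and the pref idea concrete, here is a small Python
sketch. Everything in it is an assumption for illustration: the pref names
(enabled, threshold_seconds) are hypothetical, and the timestamps are made-up
numbers standing in for a wall clock that the user moved forward by an hour
versus a monotonic tick count. It does not settle the laptop sleep question,
since whether a monotonic clock keeps counting while the machine is suspended
depends on the platform.

```python
from dataclasses import dataclass


@dataclass
class HangDetectorPrefs:
    # Hypothetical prefs, not real Mozilla pref names.
    enabled: bool = True              # master on/off switch
    threshold_seconds: float = 10.0   # tunable timeout


def looks_hung(last_loop_time, now, prefs):
    """Return True if the gap since the last event loop pass is too long."""
    if not prefs.enabled:
        return False
    elapsed = now - last_loop_time
    if elapsed < 0:
        # The clock went backwards (for example the user reset the time),
        # so the gap tells us nothing.
        return False
    return elapsed > prefs.threshold_seconds


if __name__ == "__main__":
    prefs = HangDetectorPrefs()

    # One real second passes, but the user sets the wall clock ahead an hour.
    wall_before, wall_after = 1000.0, 1000.0 + 3600.0 + 1.0
    tick_before, tick_after = 50.0, 51.0

    print(looks_hung(wall_before, wall_after, prefs))   # false alarm
    print(looks_hung(tick_before, tick_after, prefs))   # no alarm
    print(looks_hung(wall_before, wall_after,
                     HangDetectorPrefs(enabled=False)))  # pref turned off
```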

In my experience, the trunk nightlies currently have more hangs than usual, so implementing this is more important than it has been for a while.
I would suggest, rather than trying to identify when the CPU is too busy for too long or the cursor is in some mode or another, just catching a KILL signal (as distinct from a normal term signal, or whatever they call it on Windows). If the app is hung, eventually someone has to kill it. Windows XP already does this and offers to send reports to MS.
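
For what it's worth, here is a rough Python sketch of hooking a termination
signal on a POSIX system. One caveat I'm sure of: POSIX does not let a process
catch SIGKILL itself, so the sketch hooks SIGTERM and merely shows that
installing a handler for SIGKILL is refused. The function names are made up,
and a real handler would hand off to the crash reporter rather than append to
a list. It is POSIX-only and has to run in the main thread.

```python
import os
import signal

received = []


def on_terminate(signum, frame):
    # Stand-in for "hand the incident to the crash reporter".
    received.append(signum)


def install_terminate_hook():
    """Route SIGTERM to on_terminate; returns the previous handler."""
    return signal.signal(signal.SIGTERM, on_terminate)


def can_catch(signum):
    """Report whether the OS lets us install a handler for signum."""
    try:
        previous = signal.signal(signum, on_terminate)
    except (OSError, ValueError, RuntimeError):
        return False
    # Put things back the way they were.
    if previous is None:
        previous = signal.SIG_DFL
    signal.signal(signum, previous)
    return True


if __name__ == "__main__":
    install_terminate_hook()
    os.kill(os.getpid(), signal.SIGTERM)   # simulate someone killing the app
    print(received)
    print(can_catch(signal.SIGTERM))
    print(can_catch(signal.SIGKILL))
```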