From Moon at STONY-BROOK.SCRC.Symbolics.COM Tue Dec 23 02:36:00 1986
From: Moon at STONY-BROOK.SCRC.Symbolics.COM (David A. Moon)
Date: Dec 22 86 20:36 EST
Subject: Kludging around the KS's flakey disk
In-Reply-To: <861222020255.2.SRA@WHORFIN.LCS.MIT.EDU>
Message-ID: <861222203625.1.MOON@EUPHRATES.SCRC.Symbolics.COM>
    Date: Mon, 22 Dec 86 02:02 EST
    From: Rob Austein

    I just spent five minutes looking at the DISK code. Based on that
    wealth of experience, it appears to me that if I were to bring up the KS
    with UNSAFE+1 JFCL'd out, it would stop doing a BUGPAUSE every hour or
    so.

    Somebody tell me why I shouldn't do this, before I wear out the $P keys
    on the KS's console....

The idea was that if the disk goes unsafe, is reset, and goes unsafe
again within one second, it is probably about to explode and the fire
department should be called. It looks like the code jumps to UNSAFE
for a lot of different reasons, not all of which are the drive going
unsafe. The drive also goes unsafe for a lot of different reasons,
some of more consequence than others.

I agree that if you JFCL out the BUGPAUSE it won't do it any more, so if
you think this won't do irreparable harm to the disk, go ahead.
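
The heuristic described above can be sketched roughly as follows. This is
only an illustration in Python, not the ITS DISK code: the names
UnsafeMonitor, on_unsafe and pause_enabled are hypothetical, and the
one-second window is taken from the description above. Setting
pause_enabled to False loosely models patching the BUGPAUSE out with a
JFCL (a no-op), so that the drive is always just reset.

```python
RETRY_WINDOW_SECONDS = 1.0  # "unsafe again within one second"


class UnsafeMonitor:
    """Hypothetical model of the 'unsafe twice in a row' check."""

    def __init__(self, window=RETRY_WINDOW_SECONDS, pause_enabled=True):
        self.window = window
        self.pause_enabled = pause_enabled  # False ~ BUGPAUSE patched out
        self.last_unsafe = None             # time of the previous unsafe event

    def on_unsafe(self, now):
        """Return 'pause' if this event follows the previous one within
        the window (and pausing is enabled); otherwise return 'reset'."""
        previous = self.last_unsafe
        self.last_unsafe = now
        repeated = previous is not None and (now - previous) < self.window
        if repeated and self.pause_enabled:
            return "pause"
        return "reset"


monitor = UnsafeMonitor()
first = monitor.on_unsafe(0.0)    # first event: just reset the drive
second = monitor.on_unsafe(0.5)   # again within one second: pause
third = monitor.on_unsafe(5.0)    # long after the previous event: reset

patched = UnsafeMonitor(pause_enabled=False)
patched.on_unsafe(0.0)
patched_second = patched.on_unsafe(0.5)  # pause disabled: reset regardless
```

Note that the sketch treats every jump to the check as a drive-unsafe
event; as the message says, the real code reaches UNSAFE for other
reasons too, which is one reason the pause may fire more often than
intended.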
From sra at XX.LCS.MIT.EDU Mon Dec 22 08:02:00 1986
From: sra at XX.LCS.MIT.EDU (Rob Austein)
Date: Dec 22 86 02:02 EST
Subject: Kludging around the KS's flakey disk
Message-ID: <861222020255.2.SRA@WHORFIN.LCS.MIT.EDU>
I just spent five minutes looking at the DISK code. Based on that
wealth of experience, it appears to me that if I were to bring up the KS
with UNSAFE+1 JFCL'd out, it would stop doing a BUGPAUSE every hour or
so.
Somebody tell me why I shouldn't do this, before I wear out the $P keys
on the KS's console....
From ALAN at AI.AI.MIT.EDU Mon Dec 22 05:17:45 1986
From: ALAN at AI.AI.MIT.EDU (Alan Bawden)
Date: Dec 21 86 23:17:45 EST
Subject: RP06s eat dead bears
In-Reply-To: Msg of Sun 21 Dec 86 13:52:25 EST from David Vinayak Wallace
Message-ID: <133398.861221.ALAN@AI.AI.MIT.EDU>
    Date: Sun, 21 Dec 86 13:52:25 EST
    From: David Vinayak Wallace

    AI hung in a fashion I'd never seen on the KS's: disk accesses would
    hang forever. Pages in core were easily accessible; ITS ran fine
    unless you tried to touch the disk. I could create a job and run some
    instructions in low memory, but when I tried to do a .CALL OPEN ITS
    hung.

You must have been lucky. Our RP06's have been causing ITS to hang in this
way ever since day 1.

    I went upstairs and the system console said:
    DSK: UNIT #1 CAME BACK ONLINE
    DSK: UNIT #0 CAME BACK ONLINE
    DSK: UNIT #1 CAME BACK ONLINE
    and some status registers. For all three, ER1= 40000 which is Drive
    Unsafe.

Unsafe almost always comes on. The interesting bits were the ones in ER3,
which according to the crash dump were (according to the crufty
documentation) "AC power low", "DC power low" and "Spare" (!).

    I dumped it to CRASH DSKOFL, not that I think it will help any.

Why? I managed to get the ER3 bits out of it.

    Notice that the crash file was written out but the dates were not set
    on the file?

DSKDMP never sets write dates, because it doesn't know how to tell the
time. The DMPCPY program (which TARAKA runs when the system boots) sets
the date on any file it suspects is a crash dump. In this case, DMPCPY
wasn't able to set the date either, because you had cold booted the
machine, so ITS didn't know what time it was when DMPCPY ran.

    ... By the way, should these be going to BUG-ITS or KS-ITS -- I can
    never tell any more.

You sent this to the right place. KS-ITS is almost -never- the right place
to send a Bug Report.
From ALAN at AI.AI.MIT.EDU Mon Dec 22 04:34:01 1986
From: ALAN at AI.AI.MIT.EDU (Alan Bawden)
Date: Dec 21 86 22:34:01 EST
Subject: Gee, that's not his host's name
In-Reply-To: Msg of Sat 20 Dec 86 13:22 EST from Ramin Zabih
Message-ID: <133383.861221.ALAN@AI.AI.MIT.EDU>
    Date: Sat, 20 Dec 86 13:22 EST
    From: Ramin Zabih

    Typing :FINGER on AI just produced this output:
    ...
    RDZ Ramin Zabih F T23 <>: 709 x8827 RDZ, Zvona (Chaos)
    It seems that someone is confused about the name of the 3600 I'm using...

RDZ's host has a short name of "NULL". He's been expecting the name "NULL"
to break some program ever since he named it that. I presume that's why he
mailed this bug report to Bug-LISP.

Nothing's broken actually. I was just hacking him...
From GUMBY at AI.AI.MIT.EDU Sun Dec 21 19:52:25 1986
From: GUMBY at AI.AI.MIT.EDU (David Vinayak Wallace)
Date: Dec 21 86 13:52:25 EST
Subject: No subject
Message-ID: <133197.861221.GUMBY@AI.AI.MIT.EDU>
AI hung in a fashion I'd never seen on the KS's: disk accesses would
hang forever. Pages in core were easily accessible; ITS ran fine unless
you tried to touch the disk. I could create a job and run some instructions
in low memory, but when I tried to do a .CALL OPEN ITS hung.
I went upstairs and the system console said:
DSK: UNIT #1 CAME BACK ONLINE
DSK: UNIT #0 CAME BACK ONLINE
DSK: UNIT #1 CAME BACK ONLINE
and some status registers. For all three, ER1= 40000 which is Drive Unsafe.
I dumped it to CRASH DSKOFL, not that I think it will help any. Notice
that the crash file was written out but the dates were not set on the file?
I cold-booted just in case -- ITS seems to be running fine now. By the way,
should these be going to BUG-ITS or KS-ITS -- I can never tell any more.
david
From RDZ at AI.AI.MIT.EDU Sat Dec 20 19:22:00 1986
From: RDZ at AI.AI.MIT.EDU (Ramin Zabih)
Date: Dec 20 86 13:22 EST
Subject: Gee, that's not my host's name
Message-ID: <861220132243.9.RDZ@NULLSTELLENSATZ.AI.MIT.EDU>
Typing :FINGER on AI just produced this output:
-User- --Full name-- Jobnam Idle TTY -Console location-
___005 < [not logged in] HACTRN 23.T05 906 x1729 CENT, OAF
KWH Ken Haase HACTRN *:**.T15 Net site PREP (Chaos)
DPH Daniel Huttenlocher HACTRN 46.T16 723 x8843 Alan, DPH
RDZ Ramin Zabih F T23 <>: 709 x8827 RDZ, Zvona (Chaos)
It seems that someone is confused about the name of the 3600 I'm using...
From Alan at AI Fri Dec 19 08:54:39 1986
From: Alan at AI (Alan at AI)
Date: Dec 19 86 02:54:39 EST
Subject: OK, I just saw it happen again.
Message-ID: <132522.861219.ALAN@AI.AI.MIT.EDU>
For most of yesterday (Thursday the 18th) COMSAT on MC was catatonic. Our
guess is that it was stuck in a JOB device wait (waiting for the DQ
device). As soon as Alan looked at the situation COMSAT started running
again, so probably something he did caused COMSAT to get PCLSR'd out of the
system call for the first time all day, and the second time the timing
screw did not occur.
The right thing is for someone to fix the last bug in the JOB/BOJ code.
A quick fix the COMSAT maintainers might consider is to take an occasional
%PIRLT interrupt to keep its interactions with DQ lubricated.
A better fix would be for Alan to finish up the improved Domain Demon
interface, so that COMSAT can use it instead, and not be subject to this
particular class of ITS bug.
From ALAN at AI.AI.MIT.EDU Thu Dec 18 08:57:19 1986
From: ALAN at AI.AI.MIT.EDU (Alan Bawden)
Date: Dec 18 86 02:57:19 EST
Subject: No subject
Message-ID: <132118.861218.ALAN@AI.AI.MIT.EDU>
TTYSET on a TTY opened as a device (rather than as a console)
clobbers the wrong TTYST* words!
From ALAN at AI.AI.MIT.EDU Thu Dec 4 20:00:04 1986
From: ALAN at AI.AI.MIT.EDU (Alan Bawden)
Date: Thu, 4 Dec 86 14:00:04 EST
Subject: more fukt
In-Reply-To: Msg of Thu 4 Dec 86 03:20:35 EST from Pandora B. Berman
Message-ID: <126472.861204.ALAN@AI.AI.MIT.EDU>
    Date: Thu, 4 Dec 86 03:20:35 EST
    From: Pandora B. Berman

    ... maybe lester didn't fix the disk hard enough.

I don't think it is related. This is a problem we have had ever since we
went to two drives. The symptom is that for no apparent reason a drive
interrupts you and reports that it has just recently come back online.
Both drives do it. We have no idea why they do this. We also have no idea
why the code that we put in to recover from this doesn't work. (Perhaps we
need to reset the drive harder when this happens.) Luckily this doesn't
seem to happen all that often, but it has been the most common reason for
AI crashes since we got the new drive.

It also seems that if one drive is in the middle of a transfer, and you
tell the other drive to do something, the first drive will interrupt you
and complain that you shouldn't bother it while it is busy. We simply
ignore these complaints, which seems to work just fine. (Like it's
happened 470 times since AI came up 10 hours ago...)
From CENT at AI.AI.MIT.EDU Thu Dec 4 09:20:35 1986
From: CENT at AI.AI.MIT.EDU (Pandora B. Berman)
Date: Thu, 4 Dec 86 03:20:35 EST
Subject: more fukt
Message-ID: <126255.861204.CENT@AI.AI.MIT.EDU>
it happened again. dumped to CRASH;FUCKED AGAIN.
maybe lester didn't fix the disk hard enough.
From CENT at AI.AI.MIT.EDU Thu Dec 4 07:19:00 1986
From: CENT at AI.AI.MIT.EDU (Pandora B. Berman)
Date: Thu, 4 Dec 86 01:19:00 EST
Subject: &^*^&$%!!
Message-ID: <126198.861204.CENT@AI.AI.MIT.EDU>
things were hanging all over, and alan diagnosed that a disk had briefly
gone offline, and AI had not quite recovered correctly. only thing i
could do, really, was to lift switch 0 and reload. crash dump (i think)
to CRASH;FUCKME HARDER, a fine traditional name.