[This message started off life as a followup to the "fsck finds DUPs" thread,
but I wanted to check a few things out before sending it. That thread now
seems to have died off, but this one has just started, so I'll add my 2c here
instead ...]
> does anyone else see a lot of duplicated inodes after during a lot of
> disk access? (a lot == building the world or something equally as large.)
I have been experiencing problems like these since the beginning of this year,
when we got some new 486-based PCs for a lab.
Initially I was convinced that the problem must lie with my hardware, since no
one else on the mailing lists was reporting similar problems. Now, after some
further testing, and this recent flurry of similar sounding reports, I'm not so
sure.
The problem I am seeing is that a disk sector from one part of the disk
occasionally ends up getting written to an incorrect location. I've used fsdb
(which doesn't come with NetBSD, but which I compiled from our 4.4 sources) to
confirm this.
If the overwritten sector is a file data block you might never notice a
problem, since fsck doesn't see anything wrong. Alternatively, if the affected
file was a binary, it might mysteriously start dumping core whereas a freshly
compiled binary works fine. Or you might see compilations failing due to
syntax errors, and on investigation find a C source file with part of some
other random file (always a multiple of 512 bytes) embedded in the middle.
I've experienced all of the above :-(
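
For anyone wanting to check a suspect file against a known-good copy, here is
a rough sketch of a sector-by-sector comparison. It is only an illustration,
not the tool used for the findings above: the function names are made up, and
the 512-byte sector size is an assumption taken from the sizes mentioned.

```python
SECTOR_SIZE = 512  # assumed sector size, matching the 512-byte multiples above


def differing_sectors(good: bytes, suspect: bytes, sector_size: int = SECTOR_SIZE):
    """Return the indices of sectors in which the two byte strings differ."""
    length = max(len(good), len(suspect))
    bad = []
    for index, offset in enumerate(range(0, length, sector_size)):
        # Slices past the end of the shorter input are shorter or empty,
        # so a length mismatch also shows up as a differing sector.
        if good[offset:offset + sector_size] != suspect[offset:offset + sector_size]:
            bad.append(index)
    return bad


def differing_sectors_in_files(good_path, suspect_path):
    """Hypothetical convenience wrapper: compare two files on disk."""
    with open(good_path, "rb") as good, open(suspect_path, "rb") as suspect:
        return differing_sectors(good.read(), suspect.read())
```

If the damage really is whole misplaced sectors, the reported indices should
form runs that start and end on sector boundaries of the underlying device,
though file offsets only line up with disk sectors when the file's blocks do.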
But if both affected sectors contained inodes, fsck will report many DUP
blocks. The pattern of duplicates I see seems to confirm the theory that they
were caused by one sector of inodes overwriting another.
The problem mostly seems to occur during relatively sustained and intensive
disk i/o, such as when doing a "make build". But recently, in order to get
more information, I wrote a simple perl script that kept doing random cp's,
cmp's and rm's until an error was detected. This script triggers the problem
fairly reliably after a few hours of running. Curiously, however, it never
seems to trigger the inode->inode type of error ... it's always a
datablock->datablock error, whereas "make build" causes either kind about
equally often.
Some other possibly relevant details follow:
My script copies files and checks the results using the perl "system" function
with a command string like "(cp ...; if cmp ...; then ... ) &". The "&" causes
some degree of multitasking. When I ran it in single user mode, with no
backgrounding (to minimise "concurrent" disk access) it ran for ~ 12 hours
without error.
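
The copy/compare/remove cycle described above could look roughly like the
sketch below. This is not the original perl script, just a Python
approximation under stated assumptions: the function names, the temporary
file naming and the fixed iteration count are all invented for illustration.
It also runs strictly sequentially, so it corresponds to the single-user,
no-backgrounding run rather than the "&" variant.

```python
import filecmp
import os
import random
import shutil
import tempfile


def stress_once(sources, workdir, rng):
    """Copy one random source file, compare the copy, then remove it.

    Returns True if the copy matched the original byte for byte.
    """
    src = rng.choice(sources)
    fd, dst = tempfile.mkstemp(dir=workdir)
    os.close(fd)
    try:
        shutil.copyfile(src, dst)                      # the "cp" step
        return filecmp.cmp(src, dst, shallow=False)    # the "cmp" step
    finally:
        os.remove(dst)                                 # the "rm" step


def run_stress(sources, workdir, iterations, seed=None):
    """Repeat the cycle; return the index of the first mismatch, or None."""
    rng = random.Random(seed)
    for i in range(iterations):
        if not stress_once(sources, workdir, rng):
            return i
    return None
```

To approximate the backgrounded version, several such loops would have to be
started at once as separate processes working in the same filesystem.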
I've been having these problems with versions of netbsd-current from both
before and after the new filesystem code was integrated. I've now run 'fsck
-c2' on all my filesystems, and the problem still occurs.
My hardware is a 486DX2/66 with 16MB of memory, a micronics mpower-4
motherboard with on-board localbus mach32 video and an on-board localbus ide
controller, and two 340MB conner cp30344 ide drives. The problem also occurs
on a pc with the same motherboard but a 486/33, 8MB of memory and a single
240MB conner drive. And I've also recently discovered that it occurs on a
40MHz 386 with a non-localbus isa no-name ide controller.
Turning off caching (both internal and external) and a bios option called
"fast ide" didn't make any difference.
Having got to the stage where I could reasonably reliably reproduce the problem
under NetBSD, I installed OS/2 on the 486/33, and ran a slightly modified (to
cope with unix - os/2 differences) perl script on it. It ran happily over a
weekend without error.
I'm still not 100% convinced that the problems I am seeing are the same as are
being reported here, since up until now other reports have only concerned
duplicate block errors ... no one else seems to be seeing more general file
corruption.
Duncan