We had another kernel oops, but unfortunately I was not able to
diagnose this one; the oops itself never made it into the logfile. I
did see two kinds of log messages. About 1000 of these:
Feb 10 05:56:55 rin kernel: is_tree_node: node level 4562 does not match to the expected one 1
Feb 10 05:56:55 rin kernel: drbd(43,1):vs-5150: search_by_key: invalid format found in block 28538659. Fsck?
Feb 10 05:56:55 rin kernel: drbd(43,1):vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [10763889 10764043 0x0 SD]
Feb 10 05:56:55 rin kernel: is_tree_node: node level 4562 does not match to the expected one 1
Feb 10 05:56:55 rin kernel: drbd(43,1):vs-5150: search_by_key: invalid format found in block 28538659. Fsck?
Feb 10 05:56:55 rin kernel: drbd(43,1):vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [10763889 10764044 0x0 SD]
Feb 10 05:56:55 rin kernel: is_tree_node: node level 4562 does not match to the expected one 1
Feb 10 05:56:55 rin kernel: drbd(43,1):vs-5150: search_by_key: invalid format found in block 28538659. Fsck?
Feb 10 05:56:55 rin kernel: drbd(43,1):vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [10763889 10764045 0x0 SD]
And another 100 or so of these:
Feb 10 05:42:57 rin kernel: drbd(43,1):vs-13075: reiserfs_read_inode2: dead inode read from disk [2050566 2050611 0x0 SD ]. This is likely to be race with knfsd. Ignore
Feb 10 05:42:57 rin kernel: drbd(43,1):vs-13075: reiserfs_read_inode2: dead inode read from disk [2050566 2050612 0x0 SD ]. This is likely to be race with knfsd. Ignore
Feb 10 05:42:57 rin kernel: drbd(43,1):vs-13075: reiserfs_read_inode2: dead inode read from disk [2050566 2050613 0x0 SD ]. This is likely to be race with knfsd. Ignore
Of the first kind, the 'invalid format' messages, we saw about 20 last
night around 7pm, then nothing until 5:40 this morning, when 1000+ of
them came in. In the MIDDLE of this huge spate of messages you see the
second kind, the nfs race condition messages. The whole batch starting
around 5:40 came in quick succession; basically a constant spew until
5:56, when everything came to a screeching halt.
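For anyone who wants to check the timing on their own logs, here is a rough sketch of how the per-minute counts above could be tallied. It is only an illustration, not something I ran for this report: the function names (tally, report) are made up, and the regex assumes the syslog line layout shown in the excerpts above. Lines without a reiserfs "vs-NNNN" code, such as the is_tree_node ones, are simply skipped.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpts above:
#   Feb 10 05:56:55 rin kernel: drbd(43,1):vs-5150: search_by_key: ...
# "stamp" captures the timestamp down to the minute; "code" captures the
# reiserfs message code (vs-5150, vs-13070, vs-13075, ...).
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+kernel:\s+"
    r".*?(?P<code>vs-\d+):"
)


def tally(lines):
    """Count kernel messages per (minute, reiserfs code) pair."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # lines without a vs-NNNN code are ignored
            counts[(m.group("stamp"), m.group("code"))] += 1
    return counts


def report(path):
    """Print one line per (minute, code) pair found in the file at path."""
    with open(path, errors="replace") as f:
        # Sorting is plain string order, which is only chronological
        # within a single day of a single month.
        for (minute, code), n in sorted(tally(f).items()):
            print(minute, code, n)
```

Calling report() on the syslog file (the path depends on the syslog configuration) should make a burst like the 5:40 to 5:56 one stand out as a run of consecutive minutes with large counts.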
I know the most likely culprit appears to be an underlying disk
problem, but I think that is unlikely. The underlying block devices
are hardware RAID volumes, not just disks, and nothing on the RAID
side reports any problem. Also, there are no log messages at all
indicating disk problems. Only drbd complained, and drbd is what
panicked the system with a null pointer deref.
Any ideas?
Brian