I'm having an odd issue with one of the machines I remotely administer. Recently, its /home partition started developing filesystem errors that prevent it from being mounted at boot. Instead, the machine drops to an emergency login prompt, and I have to walk someone through logging in and running fsck -y on the partition. It seems to need this on every reboot now. I tried reformatting the partition with

Code:

mke2fs -t ext4 -c -c /dev/sda7

to scan for bad blocks, but it didn't find any. I suppose it might be the superblock(?) that's bad, but I would think that would've been detected too.

So I have 2 questions:
1. What could be causing this/how to prevent it?
2. Can the init scripts be configured to keep booting, even if /home fails to mount, so that I can at least ssh into the box?

Remember that corruption doesn't necessarily come from the disk. Just like any other computer, garbage in, garbage out. Your CPU could be emitting garbage for the disk to write, or perhaps your RAM has amnesia causing your CPU to write bad data to the disk.

I would think that the init scripts should keep on booting without /home, but since ~/.ssh lives on /home for many users, it would still be hard to ssh in (especially if root login is disabled).
_________________
Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed to be advocating?

I think if it was a kernel or RAM issue, I'd see more problems than just this. Then again, I first saw this problem shortly after upgrading to gentoo-sources-3.8.13. I haven't had any issues with the root fs (also ext4), but it gets less I/O.

The init scripts fail at running fsck on /home, and drop to an emergency login: "Welcome to (none).(none)" or something. The hostname isn't even set yet. Conceivably, the network and sshd could be started, and I could login as root (via pubkey of course).
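One thing worth checking here is how /home is listed in /etc/fstab, since the sixth field (the fsck pass number) decides whether the filesystem is checked at boot at all; a missing or zero value means "do not check". The snippet below is only a sketch: `fstab_passno` is a made-up helper name, and whether the init scripts on this particular box would then carry on to start the network and sshd is an assumption that would need testing.

```shell
#!/bin/sh
# Hypothetical helper: print the fsck pass number (sixth fstab field)
# for a mount point. A missing field is treated as 0, which fstab(5)
# defines as "do not check this filesystem at boot".
fstab_passno() {
    # $1 = mount point, $2 = fstab path (defaults to /etc/fstab)
    awk -v mp="$1" '
        $1 ~ /^#/ { next }                      # skip comment lines
        $2 == mp  { print ($6 == "" ? 0 : $6) } # default to 0 if absent
    ' "${2:-/etc/fstab}"
}

# Example (read-only, safe to run as a normal user):
#   fstab_passno /home
```

Setting that field to 0 for /home would trade the automatic check for a boot that does not stop on it, so the fsck would then have to be run by hand over ssh.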

It does seem to be kernel-related: I just ran into the same problem on a different machine with the same kernel version. I rolled back the kernel on the original box to 3.6.11 and the problem seems to have gone away (I'll upgrade the kernel when I can get to the machine in person). The second box has been upgraded to 3.10.7; I'll have to see if that helps.

Though I had other issues with gentoo-sources-3.8.13, I have not seen the corruption issue on my ext4 machines.

Crud, it happened again, this time on 3.6.11. Seems to start when there's an unclean shutdown. Running fsck manually has it remove some deleted inodes - nothing critical yet, but it's only a matter of time until valuable files get lost. I thought using a journalling fs was supposed to help with that? Maybe I'm just doing it wrong.

Uh... No. Even with a journalling filesystem, just shutting down the machine abruptly (like cutting power) is not proper.

Journalling filesystems will *help*, but they do not prevent corruption. A proper shutdown is still needed.

If you must have a system that can handle this, it helps if cached writes are flushed to disk as quickly as possible. It will reduce performance but will help against corruption.
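On ext4, one knob for this is the `commit=` mount option, which sets how often (in seconds) the journal is committed; the default is 5. As a rough sketch, the helper below (the name `commit_interval` is made up for illustration) reads a /proc/mounts-style table and reports whether a mount point has an explicit commit interval set.

```shell
#!/bin/sh
# Hypothetical helper: print the commit= value for a mount point from a
# /proc/mounts-style table, or "default" if the option is not listed.
commit_interval() {
    # $1 = mount point, $2 = mounts table (defaults to /proc/mounts)
    awk -v mp="$1" '
        $2 == mp {
            n = split($4, opts, ",")
            val = "default"
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^commit=/) val = substr(opts[i], 8)
            print val
        }
    ' "${2:-/proc/mounts}"
}

# Example:
#   commit_interval /home
# Shortening the interval at runtime would be something like (as root):
#   mount -o remount,commit=1 /home
```

The value 1 is only an example; shorter intervals mean more frequent journal writes, so the performance cost depends on the workload.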

Since shorter commit times are just a hack to limit the damage, I cannot condone them as a "solution". The journalling filesystem is already helping with the problem a bit as it is (unless you somehow disabled the journal), but abrupt shutdowns are still not right.
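To rule out the "disabled journal" case, the feature list printed by `tune2fs -l` can be checked for `has_journal`. A minimal sketch, assuming the usual "Filesystem features:" line in that output (the function name `has_journal` is just for illustration):

```shell
#!/bin/sh
# Hypothetical helper: succeed if `tune2fs -l` style output on stdin
# lists has_journal among the filesystem features.
has_journal() {
    grep '^Filesystem features:' | grep -qw has_journal
}

# Example (needs read access to the device, typically root):
#   tune2fs -l /dev/sda7 | has_journal && echo "journal enabled"
```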

The question going through my head: why is the power going out so frequently that this is needed?

If it's due to laziness, people will need to learn how to shut down normally.
If it's due to unstable power, a UPS, or perhaps a laptop configured to do a clean shutdown, is highly recommended. That is a "proper" solution.

How frequent is frequent? Also, what is the function of the machine? Is it writing to disk constantly? A disk that is mostly just read should not suffer as much corruption from unclean shutdowns.

Remember, even with these faster commit options, if the power goes out while a commit is in progress, you will still have problems.