
2505-9050, or, eight digits of doom

So the POWER6 (which the Apple Network Server 500 is subbing in for) did indeed blow its system backplane. Unfortunately it appears to have taken the RAID 5 array with it. There is data still in the auxiliary battery-backed cache, but the cache directory is apparently hosed and SRN 2505-9050 popped up in the logs, indicating it does not recognize the cache data as belonging to the array. The associated MAP 3131 to resolve it strongly suggests I'm more hosed than a fire truck at a gay pride parade, with ugly words like "data loss" and "delete then recreate array."

Diagnostics shows the disks are there, and recognized as belonging to the array, but the array itself is listed as Failed.

My IBM tech friends aren't sure what to do with it either, but I thought I'd ask here in case anyone is a POWER Systems god. The system was relatively quiescent the day it went bad, so there shouldn't be a LOT in that write cache, mostly log files (the 57B7/8 card pair has a modest 175MB of write cache). This is AIX, so these are all JFS2 file systems.

The way I figure it, /, /usr and /opt haven't seen much action since Obama's first term. They should be clean of writes; there shouldn't be anything in the write cache for those partitions. /tmp is expendable. /home is backed up daily, so I can restore it. The logs in /var are almost certainly toast, but the database hasn't been written to in several weeks, mail is backed up several times a day, and the web server and gopher server are backed up daily and weekly respectively. I don't care about the journal volume or the paging volume, since I already assume those are fried.

So, given this, my thought is to just reclaim the write cache and wipe it, and fsck and hope for the best. The drives still organize themselves into an array, just a failed one. The journaling should keep the file system in a sane state, even if I've lost some writes.
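To make the reasoning concrete, here is a toy model of the failure and the proposed repair. This is purely illustrative: the class names and state strings are my own invention, not the controller's actual firmware logic.

```python
# Toy model of the failure and the proposed repair. Purely illustrative:
# these class names and state strings are hypothetical, not the
# controller's actual firmware logic.

class CacheDirectory:
    """Battery-backed write cache: pending writes tagged with an array ID."""
    def __init__(self, array_id, dirty_blocks):
        self.array_id = array_id
        self.dirty = dirty_blocks      # block number -> data not yet on disk

class Controller:
    def __init__(self, array_id, cache):
        self.array_id = array_id
        self.cache = cache
        # SRN 2505-9050: the cache contents don't look like they belong to
        # the attached array, so the controller fails the array "safe".
        self.array_state = "Optimal" if cache.array_id == array_id else "Failed"

    def reclaim_cache(self):
        # Discard the unrecognized cache. The queued writes are lost for
        # good, but the member drives themselves are untouched.
        lost = len(self.cache.dirty)
        self.cache = CacheDirectory(self.array_id, {})
        self.array_state = "Optimal"
        return lost

# A hosed cache directory ("??") presented to array "A1":
ctrl = Controller("A1", CacheDirectory("??", {7: b"log", 9: b"log"}))
assert ctrl.array_state == "Failed"    # what diagnostics shows now
lost_writes = ctrl.reclaim_cache()     # the proposed repair
assert ctrl.array_state == "Optimal" and lost_writes == 2
# After this, a journal replay (fsck) on each JFS2 filesystem brings the
# metadata back to a consistent state, minus those lost writes.
```

The point of the sketch: reclaiming the cache only sacrifices the writes that were still queued, and JFS2's journal replay should leave the on-disk metadata consistent afterward.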

Or do you think I'll be rebuilding the server from scratch and backups?

Seems like you're covered either way. It doesn't take long to install AIX so why not try to get the array back, but have your CDs ready if you're in for a reinstall. Or maybe this was a call for bets and we should all get those in? In which case, what's the spread?

At $WORK our POWER systems all use the volume manager only, no RAID, which threw our consultant for a loop. Then again, we also don't have a RAID controller refusing to believe all its parts belong together.

Most of our POWER systems don't even have RAID controllers, but one of the 520s does - I found this out because its battery died. How does one go about managing such a thing? It's not like an LSI card in a PC server where I press a key combo at boot time - or at least I don't know what to press.

I have a mksysb image of the OS, patched up the way I like it, though the system has probably picked up some minor modifications that aren't in that .iso. So I'd rather not rebuild from scratch if I can avoid it, though it does look like my backup strategy is at least adequate ...

What I'm trying to find out is whether anything *other* than write cache is in the RAID controller, and how often it gets written out. I'd hate for it to queue up a full 175MB of data before it tries to emit any, and I'd *really* hate for it to be caching things like the superblock.
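That worry can be put in concrete terms: a write-back cache only bounds the data at risk by its capacity, so the worst case is everything accumulated since the last destage. A minimal sketch (this is not the 57B7/8's real flush policy, which isn't documented; the threshold-triggered flush is an assumption):

```python
# Toy write-back cache: illustrates worst-case data at risk. This is NOT
# the 57B7/8's actual flush policy; the forced-destage-when-full rule is
# an assumption for illustration.
CACHE_CAPACITY = 175 * 1024 * 1024     # 175 MB, per the card pair

class WriteBackCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.dirty = 0                 # bytes acked to the host, not yet on disk

    def write(self, nbytes):
        if self.dirty + nbytes > self.capacity:
            self.flush()               # forced destage once the cache fills
        self.dirty += nbytes

    def flush(self):
        self.dirty = 0                 # destage everything to the disks

cache = WriteBackCache(CACHE_CAPACITY)
for _ in range(1000):
    cache.write(4096)                  # light logging workload, 4K writes
# On a quiescent system, the data at risk is only what accumulated since
# the last flush - far below the 175 MB ceiling:
assert cache.dirty == 1000 * 4096     # ~4 MB dirty, not 175 MB
```

Under this model, a mostly idle system losing its cache should lose megabytes of log writes at worst, not the full 175MB - which is the bet being made above.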

On the p520 and I think p720, the planar can be either "flat" or RAID. There is an enablement card set that you can install as an option (that's the 57B7/8 in mine) that enables the planar backplane RAID and provides the cache.

SMS (and for that matter, Open Firmware) does not know how to manage RAID controllers; a crashed array will in fact be completely invisible to SMS and won't even show up when you list devices. You have to boot from the AIX diagnostics CD to view the status of the RAID array and attempt repair. This is not the same as an AIX install set. Incredibly, you can download an ISO of that from here, at least until IBM puts that under support-contract-only too:
https://www-304.ibm.com/webapp/set2/sas ... /home.html

What's the status of the RAID component devices look like? Is the whole array showing as failed because too many component devices are thought to be failed? If so, does the management interface let you manually/forcibly re-enable the component devices?

I have zero experience on POWER systems or their RAID controllers. But I used to have an IBM ServeRAID 6M in a PC server. Just about any unclean shutdown of the system (but particularly those that happened as a result of bad RAM) would cause the 6M to think that two devices in its RAID-5 array had died. Booting from the ServeRAID CD and using its interface to change the drives' status from "Faulty" back to "On-line" was all that was needed. The data on the disk(s) was all there so the controller didn't re-sync/re-build the array. So I had no data loss at all. In my case, it was just a confused controller but a healthy array.

Again, I have no idea how your POWER system compares to my ServeRAID 6M, but your situation (hardware failure leading to a crash leading to a hosed array) sounds just like what I experienced. It might be worth a try if the tools allow it, and if you can accept the fact that it might not work at all.

The component drives show up as RWProtected in AIX diagnostics, not Failed -- only the array itself appears as Failed. This is the designation the SAS RAID uses for a drive that is still part of an array, but the array has failed "safe," so the controller will lock it down until repairs are effected. See, for example,
http://pic.dhe.ibm.com/infocenter/power ... ration.htm

I'm going to test the individual drives too, but I very much doubt that's the problem. As near as I can tell, you can't set the status of individual drives manually, at least not in AIX or System i; the controller has to be "happy" first before it will release the protection lock.
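The relationship described above can be boiled down to a two-line state mapping (the state names are from the diagnostics output; the mapping itself is my reading of it, not anything from IBM's documentation):

```python
# Hypothetical sketch of the state relationship: drive protection follows
# the array state, so there is no per-drive override like ServeRAID's
# "Faulty" -> "On-line" toggle.
ARRAY_TO_DRIVE_STATE = {
    "Failed":  "RWProtected",   # array failed "safe" -> drives locked down
    "Optimal": "Active",        # controller happy -> protection released
}

def drive_state(array_state):
    return ARRAY_TO_DRIVE_STATE[array_state]

assert drive_state("Failed") == "RWProtected"
assert drive_state("Optimal") == "Active"
```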

At least it's good to know that wiping/reclaiming the controller cache is a recoverable operation, if it comes to that. That's not at all clear from IBM's technical documentation. JFS2's resiliency as a file system is probably a big part of why.

And sorry the info I posted wasn't useful. I had hoped that there might be some similarity between IBM's PC ServeRAID cards and the ones for their POWER systems. What I hadn't realized (and really should have) was that your system was a generation or two newer (using SAS) than the old ServeRAID 6M (U320 SCSI) that I'd used. I hope I didn't waste too much of your time with my irrelevant ramblings.