Friday, September 12, 2008

My Triumphant Return To Work, And How My New Assertiveness With The Users Paid Off

I got back from Canada a couple of weeks ago late on a Monday night, took two more days off to decompress and get my head straight, then returned to work on the Thursday determined to Get Things Done.

A sad mistake.

I had left for Canada happy in the knowledge that although people were still dicking about and not making decisions that needed to be made, they would have two weeks while I was gone in which to send a memo containing the project-starting phrase "yes, you have permission to pull out one cable from the old disk arrays and plug it into the new arrays". It will come as little surprise to everyone that when I opened up my mail client I had a digital avalanche of missives concerning the project, and that not one contained the magic project de-stalling words I was longing to hear.

The Chief of the Users was now more upset that her training machine was hors de combat, a situation necessitated by the fact that the so-called "new arrays" were doing duty as training machine disk storage and we had no extra disks to spare, as I previously dribbled on about in this posting. So while the disk upgrade for the production server was underway (or not underway, pending permission to recable) there would be no test machine. I had only explained this twelve trillion times to everyone concerned. The Chief of the Users had written asking when she could have her training machine back.

I replied that as soon as the migration was done, getting the training machine up and running was priority number one so could I please recable?

She responded by asking for a timetable for the return of the training machine.

I replied that it was impossible to put a hard date on that since I couldn't even start the task unless I had finished the other one, which involved recabling so could I have permission to do that please?

She riposted by asking if giving permission would speed up the return of the training machine.

I replied that I wasn't making myself clear. Without permission to recable nothing whatsoever could be done. No production machine upgrade. No array swap. No training machine rebuild.

She called a meeting.

At this meeting, the first order of business was for the vendors (three of them, teleconferencing in on what sounded like a baked bean can linked to the telephone network with a length of knotted hairy string) to deny hotly that my Boss's Boss's plan for a sort of clever data walk from one array to the other could be done. This was something of a surprise since I had spent a couple of hours on the phone to them the day before I left for Canada confirming that it would work in every detail. Now they claimed that not only would it not work, but if I disabled any of the spooler files (the first step in the Boss's Boss's Plan) the metadata would self-destruct so quickly and completely that were I to immediately re-enable the spooler file I had just disabled, none of the information in it would be usable.

I had to admit that this was a masterpiece of data fragility. How their programmers had arranged for the internal map the software had of the spooled data to evaporate spontaneously with no visible activity to account for it I'll never know. No doubt this is some sort of data protection anti-hacker requirement levied on the vendor by the government. Would that the mortgage companies had the same policy when it came to the tapes packed with my personal details they keep losing track of. I digress.

The vendors came up with a plan remarkably like my original one for those bits of the file system, so I let it go. My Boss's Boss was another matter, and he resolutely insists that everyone is being stupid and that he clearly remembers Larry the Consultant doing this very thing. I've taken to avoiding him in our cube-farm. Since he is over six feet tall, he can see over most of the cube walls so there is some skill involved in escaping detection. It passes the time, though I am resolved never again to seek emergency refuge in the "recyclables" hamper. The office staff have evidently become lax in proper disposal protocols of late and I nearly disemboweled myself on the chassis of an eviscerated workstation some clod had dumped in with the greenbar continuous fan-fold paper when I executed an ad-hoc detection escape leap last Tuesday morning.

Back to the meeting.

We reached the bit about unplugging one cable from each array and cross connecting the production server to both. This caused much consternation in their camp and not inconsiderable amounts of frustration in mine.

I explained that there were two cabled paths to each array originally. These provided a fault-tolerant path between the server and the arrays. However, I said, only four arrays can be placed on a loop, so in order to connect all eight to the production machine so we can copy data, file systems and so forth between them, we will need two loops, one on each interface, or as we Solaris geeks say "HBA". This will require that I disconnect the redundant paths and disable the hubs they connect to.
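For anyone who wants the arithmetic behind that spelled out, here is a minimal sketch. The function name `loops_needed` is mine, invented for illustration; the numbers are the ones from this post (eight arrays, four to a loop, or three to a loop if you believe the older documentation).

```python
import math


def loops_needed(arrays: int, max_per_loop: int) -> int:
    """Smallest number of loops that can hold the given number of arrays."""
    # Hypothetical helper: a loop can carry at most max_per_loop arrays,
    # so round the quotient up to the next whole loop.
    return math.ceil(arrays / max_per_loop)


# Eight arrays at four per loop, the limit I quoted in the meeting.
four_per_loop = loops_needed(8, 4)

# Eight arrays at three per loop, the limit in the nine-year-old documentation.
three_per_loop = loops_needed(8, 3)

print(four_per_loop, three_per_loop)
```

At four to a loop the answer is two loops, which is what the plan assumed. At three to a loop it becomes three, which is why the disagreement over that limit matters.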

"Is it safe?" asked the Chief of the Users.

"Perfectly" I responded with confidence. "The wires are designed to be safely removable. If they weren't, they wouldn't be any good for fault tolerance would they?"

We bounced this around for a while, then it was agreed that it all sounded ok and that everyone agreed permission could be safely given. Said permission itself was carefully not given. I would come in the following Tuesday, defined by the Chief of the Users as a "quiet period", call her and she would give verbal permission.

Tuesday rolled round, I called but the Chief of the Users was not in. I left a message and sat down with an egg sandwich and a comic book technical manual, and around 10-ish I got a call nervously giving permission for the cable disconnect and reconnect ops.

"Don't worry" I reassured the Chief of the Users. "It will be completely impact-free for you and your team. You won't even notice it when I pull out the plug".

I trotted down to the network room, fired up a console and got busy disconnecting wires after first tracing them very carefully. Wouldn't do to pull out the wrong wire.

It was a great triumph.

Pulling out the wire caused no problems, and the system seamlessly began using only the one wire to contact its disks. If only I had stopped there I could have avoided three days of death threats and insults.

Much emboldened, I grabbed the other array's cable and plugged it into the hub that normally provided the fail-over path. I issued a couple of commands to provoke the system into recognizing the "new" disks, which it did, and returned to my desk to continue the migration from there.

When I sat down at my workstation I saw a couple of puzzling things on the GUI display for the file system virtualizer software (which makes the disks look like one big pool of space and adds yet more resilience and fault tolerance). The GUI is reputedly buggy at the level we have, so I ran some traditional reports to see what was what. I got back two status lines for each disk, one that said everything was fine, one that said everything was severely broken. I had a think, and it suddenly hit me that what was being reported was not the disk names, but some sort of node name used to virtualize the dual path scheme that had been in place only ten minutes ago.

It made sense. The file system was telling the server that when it went through cable A, the files were all where they should be, but when it went through cable B there was no information of any kind, which it thought was a crashed disk. It was the only way to explain the weird double status lines and it needed fixing fast. There was no way of telling which cable would be used at any one time, and successive requests for information (computers often issue several internal requests to honour what looks like one request from the desktop) would return intermittent disk-crash emulations of an entirely undesired kind. Action was called for.
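A toy model of that failure mode, for the curious. This is only a sketch under loud assumptions: the names `PATH_STATUS`, `query_disk` and `run_requests` are invented, and picking a cable at random is my stand-in for "no way of telling which cable would be used", not a description of how the multipathing software actually chooses.

```python
import random

# Hypothetical model: one disk node name, reachable over two cables.
# Cable A still leads to the real disks; cable B now leads to a
# different array, which looks like a crashed disk.
PATH_STATUS = {"A": "ok", "B": "failed"}


def query_disk(rng: random.Random) -> str:
    """Status seen by a single internal request over whichever cable is picked."""
    path = rng.choice(sorted(PATH_STATUS))  # assumption: unpredictable choice
    return PATH_STATUS[path]


def run_requests(count: int, seed: int = 0) -> list:
    """Issue count requests and collect the status each one saw."""
    rng = random.Random(seed)
    return [query_disk(rng) for _ in range(count)]


results = run_requests(1000)
print(results.count("ok"), "requests saw healthy disks")
print(results.count("failed"), "requests saw a crashed disk")
```

Nothing is physically wrong with the disks in this model, yet some requests still come back as failures. Taking cable B out of the choice (pulling the cable, or turning off the multipathing) makes every request agree again.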

I could pull out the cable again, or I could disable the so-called Dynamic Multi-Pathing software, something I should have done anyway but in all the rush had forgotten about. I quickly located my training course notes for the virtualizing software and put fingers to keys, which is when my Boss's Boss hoved into theater, sat down and demanded an explanation.

I gave him the ten second version, which earned me "I don't understand", so I did the two minute blow-by-blow account of my mistake and my plan for fixing it. My Boss's Boss, for reasons that escape me entirely, chose to believe that I had simply not grasped the way Solaris names disks (something so basic I had it down before I joined the department) and gave me an extended lecture, paced for the simpleton I obviously was, on how it all worked. I gibbered as the GUI display once again refreshed with garbage instead of a pictorial representation of 144 disks and begged in a shrill falsetto to be allowed to disable the multipathing software. Permission was denied, and an alternate plan of calling the virtualizing software vendor for an opinion was suggested.

At which point the virtualizer crashed, taking with it a major application of much importance.

His work done, my Boss's Boss lumbered off to do whatever he does when he's not slowing things down to a manageable speed, leaving me with a mess of titanic proportions. I yelled for the Original Admin, who wandered over, listened to the history of what went wrong, and suggested that we do a reconfiguration reboot of the server. This would fix everything, he promised.

Unfortunately, the server thought it should add a bunch of fixes and patches before it shut down, and it took about thirty minutes to finish doing that. Then the virtualizer started reporting an error.

"Let it run. It will clear itself" said The Original Admin.

"I dunno" I replied. "It is reporting an error with the same component every time. I think it's looping".

And it was, but the Original Admin wouldn't believe it until another half hour had passed, at which point he showed me the secret "shut down anyway" command.

The machine rebooted, finding all the disks where they should be and sorting out that it should only use one wire for one set and the other for the other set. Once again I begged to be allowed to shut down the dynamic multipath software, and once again I was told not to do so.

It took about four hours to get everything working again. All sorts of other stuff broke and vendors had to be called. The Chief of the Users called me a bad name over the phone, then sent me an e-mail to confirm I had received her insult, which included a pronunciation guide and the etymology of the words she chose. The vendor called me a bad name. I informed him the Chief of the Users's bad name was not only more insulting, but more grammatical. The vendor called me another bad name. My boss called me a bad name via e-mail, and cc'ed the entire department. When I went to grab a sandwich around 3 pm, people on the street I didn't even know called me bad names.

It was not my finest hour.

Two days later a disk failed in a bizarre and permanent fashion, and since no-one was around to stop me I finally disabled the multipathing software. Since then I've had no unfixable problems.

In the interests of spicing up things, there is documentation, nine years old but unredacted as far as I have been able to discover, that says with these "new" arrays you can only stack three arrays on a loop. The Original Administrator disagrees with this, but cannot back up his stance other than to insist "It works". Interesting times are they not, Mr Fong?

2 comments:

The problem is that I am quite good at deducing what is going on in a system I can see in motion, but have inadequate in-depth knowledge as far as Solaris is concerned. Had I had the confidence to just ignore the "advice" while I was getting it, things would have gone down a different path. It is very frustrating to know what needs doing but to be at a total loss when it comes to knowing how to go about it. I've been here before, but Mr Brain was more agile in '84 than now. Lodging information in there so it sticks is a very tiresome process. In '84 I would just cram the information in, drink a half-pint of rum that evening and let everything percolate overnight.