Archive for the ‘cluster’ Category

OK, this is part III of the rather long story. As shown in the previous parts of the series, the problem is really tricky: we cannot run anything after losing storage (you cannot read binaries if you don’t have storage). So how does HACMP/PowerHA deal with it?

If you lose all storage Virtual FibreChannel connections, this is going to be reported as a loss of quorum for a VG in AIX’s error reporting facility. In theory this error should be propagated to the HACMP/PowerHA failover mechanism so that the RG is evacuated from the affected node. This behaviour is sometimes called “selective failover behavior”. The technical details are pretty interesting, as they show the lack of real, solid integration between the AIX kernel and HACMP/PowerHA. The primary source of information is the AIX RAS reporting facility (see “man errpt” for more info) and the errdemon process. Errdemon takes the proper actions based on the configuration of the “errnotify” ODM objects. The installation of PowerHA adds special hooks so that messages about failing hardware/OS/probes (in this case the loss of quorum on VGs) are propagated to the HACMP/PowerHA scripts. This can be verified as follows:
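To make the hook concrete, here is a minimal Python sketch that parses ODM stanza text of the kind `odmget errnotify` prints and picks out the entries whose method points into the PowerHA tree. The sample stanza is illustrative only: the clreserror path comes from this post, while the en_name, en_label and VG name values are assumptions, not captured output.

```python
import re

# Illustrative stanza: field names follow the errnotify ODM class, but the
# values (en_name, en_label, "datavg") are assumptions, not real output.
SAMPLE = '''
errnotify:
        en_name = "clSF1"
        en_persistenceflg = 1
        en_label = "LVM_SA_QUORCLOSE"
        en_class = "H"
        en_resource = "LVDD"
        en_method = "/usr/es/sbin/cluster/diag/clreserror $9 $1 $6 $8 datavg"
'''


def parse_stanzas(text):
    """Parse ODM stanza text into a list of dicts, one per object."""
    objects = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        # A stanza header is an unindented line ending with a colon.
        if not line[0].isspace() and line.rstrip().endswith(":"):
            current = {"_class": line.strip()[:-1]}
            objects.append(current)
            continue
        match = re.match(r"\s+(\w+)\s*=\s*(.*)$", line)
        if match and current is not None:
            key, value = match.groups()
            current[key] = value.strip().strip('"')
    return objects


def powerha_hooks(objects):
    """Keep only objects whose notify method lives under the PowerHA tree."""
    prefix = "/usr/es/sbin/cluster/"
    return [o for o in objects if o.get("en_method", "").startswith(prefix)]


if __name__ == "__main__":
    for hook in powerha_hooks(parse_stanzas(SAMPLE)):
        print(hook["en_label"], "->", hook["en_method"])
```

On a real node the input would be the output of `odmget errnotify` saved to a file, rather than the hand-written sample.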

.. so the method of recovering from the loss of SAN storage (also on SAN-booted hosts) seems to be the script /usr/es/sbin/cluster/diag/clreserror. Of course, in that particular case this script is also located on the SAN…

And rest assured that this script calls a lot of other scripts, which of course can be unavailable if the rootvg is on the same physical storage as the affected VG.

There are actually two findings here. The first one is that if you lose all SAN-based hdisks, you are going to be flooded with thousands of entries in the errpt facility. Those can go undetected by errdemon because the in-memory log buffer overflows. The workaround for this first case seems to be trivial: just enlarge the error log buffer. This is documented here:

Additionally, there seems to be some value in mirroring some of the LVs on the affected VGs. This might add some stability to the detection of the loss of LVM quorum, i.e. this shows a properly mirrored LVM loglv4 across 2 hdisks (PVs)…
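As a reminder of what “losing quorum” means, here is a small Python sketch of the quorum rule. It assumes the standard AIX VGDA layout (two copies on a single-disk VG, two plus one on a two-disk VG, one per disk otherwise) and that quorum checking is enabled on the VG; the function names are made up for this illustration.

```python
def vgda_counts(num_disks):
    """VGDA copies per disk, assuming the standard AIX LVM layout."""
    if num_disks == 1:
        return [2]
    if num_disks == 2:
        return [2, 1]
    return [1] * num_disks


def has_quorum(num_disks, failed):
    """True if strictly more than half of the VGDA copies are still readable.

    `failed` is a set of 0-based disk indexes that are unreachable.
    """
    counts = vgda_counts(num_disks)
    total = sum(counts)
    alive = sum(c for i, c in enumerate(counts) if i not in failed)
    return alive * 2 > total


if __name__ == "__main__":
    # Two-disk VG: losing the second disk keeps quorum, losing the first does not.
    print(has_quorum(2, {1}), has_quorum(2, {0}))
    # Losing every disk (all SAN paths gone) always loses quorum.
    print(has_quorum(4, {0, 1, 2, 3}))
```

With every PV behind the same failed SAN paths, no disk layout helps: the VG loses quorum and the error has to be noticed by errdemon.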

The second finding is that even with all those changes, it has a very high probability of failing (i.e. the PowerHA RG move won’t work). Personally, I consider the risk so high that it is nearly a guarantee. The only proper solution to this problem that I am able to see is to add a special handler to the err_method in the errdemon code, something like err_method = KILL_THE_NODE. This KILL_THE_NODE should be implemented internally by the always-running errdemon process. The process should run with its memory protected from swapping (something like mlockall())… because currently it is not running that way.

The first scenario we performed was to disable 100% of the MPIO storage paths to the active HACMP node by un-mapping the Virtual FibreChannel adapters (vadapters) on both VIOS protecting the active node (LPAR). On both VIOS we ran the following command:

$ vfcmap -vadapter vfchostXYZ -fcp

where vfchostXYZ was the server-side (or better, VIOS-side) vadapter handling FC I/O traffic for this LPAR. The result? After some time, the LPAR with active HACMP/PowerHA Resource Groups on it (jkwha001d) evicted itself from the HACMP cluster, and the old passive node jkwha002d became the active one (Oracle started there). The root cause on the jkwha001d LPAR is the following:

As you can see, AIX 6.1 generated a “SYSTEM DUMP”, which is actually a system panic indicator. Normally AIX saves the memory image of its running kernel to the Logical Volumes configured with the “sysdump” type and then reboots. This allows investigating why the machine crashed, but here you won’t see anything. You’ve lost FC storage connectivity (even to rootvg), so it couldn’t even save the dump. So how does it know about it? Probably the state of the LPAR crash can be saved somewhere in the POWERVM firmware area. It’s one of the RAS things in AIX/POWER. So far, so good: HACMP/PowerHA was able to recover services…

OK, but we wanted real proof, so we performed a double-check. We suspected that VIOS might have some magic way to communicate with the LPAR. We wanted to exclude that factor, so we simulated disconnecting the FC at the storage level. The initial state was a stable HACMP cluster, with the RG active on jkwha002d (and jkwha001d being passive); all MPIO paths to the Netapp cluster (storage1, storage2) were reported as “Enabled” by “lspath”. The igroup term in Netapp parlance means “initiator group”, and it is responsible for things like LUN masking. If you remove access for some FC WWPNs to a LUN, you end up in a situation in which AIX has hdisks pointing to non-existing SCSI adapters and the AIX SCSI stack gets a lot of errors (FC is just a wrapper around SCSI).
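The effect of LUN masking can be pictured with a toy Python model. This is only a sketch of the concept, not of any Netapp API: the WWPNs and the LUN path are made-up placeholders, and only the igroup name jkwha002d_boot is taken from this post.

```python
def visible_paths(igroups, host_wwpns, lun_map):
    """Return {lun: [wwpn, ...]} listing host initiators still allowed per LUN.

    igroups:    {igroup_name: set of initiator WWPNs}
    host_wwpns: WWPNs of the host's (virtual) FC adapters
    lun_map:    {lun_path: igroup_name}
    """
    result = {}
    for lun, igroup_name in lun_map.items():
        allowed = igroups.get(igroup_name, set())
        result[lun] = sorted(w for w in host_wwpns if w in allowed)
    return result


if __name__ == "__main__":
    # Hypothetical WWPNs and LUN path, for illustration only.
    host = ["c0:50:76:00:00:00:00:01", "c0:50:76:00:00:00:00:02"]
    luns = {"/vol/boot/jkwha002d_rootvg": "jkwha002d_boot"}

    before = {"jkwha002d_boot": set(host)}
    print(visible_paths(before, host, luns))

    # Removing every initiator from the igroup masks the LUN completely:
    # the host keeps its hdisk definitions but has no working path left.
    after = {"jkwha002d_boot": set()}
    print(visible_paths(after, host, luns))
```

An empty path list for a LUN corresponds to the state in which every I/O to the matching hdisk fails on the AIX side.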

On the 2nd storage controller (storage2) the igroup jkwha002d_boot was controlling access to the OS level LUNs (rootvg, etc):

The same igroup was present on the 1st storage system, controlling the remaining LUNs (yes, this is an active-active configuration, where reads/writes are managed by AIX LVM on a VG consisting of LUNs on two controllers):

.. and HACMP failover won’t work. The active LPAR node (jkwha002d) is going to end up in a zombie state. If you had SSH sessions open beforehand, everything would indicate that it is working: ls on /etc works, and some commands and utilities do too. But that is only because some things are cached in AIX’s filesystem cache. Everything else is going to fail with I/O errors, and lspath will cry too (it won’t display all MPIO paths as failed, but this is a story for another post). Examples:

but what’s most interesting is that Oracle will ask AIX to write to the alert log file, and the new entries will be readable by commands like “tail”, while the “cat” command won’t work (you won’t be able to read the whole alert log file because you don’t have I/O!). What’s even more interesting is that you won’t see those messages after rebooting (after kernel memory is gone)! If you don’t have I/O, how are you going to write/fsync this file?

Another thing is that the active HACMP node will still be active; it will keep handling Resource Groups, etc. Failover won’t happen. A possible solution to this problem would be a kernel-based check that verifies that at least /etc is accessible. Why kernel-based? Because you have to have some form of heartbeat in memory (like an AIX kernel module or an uncachable binary always present in RAM, running with real-time scheduling priority) that would continuously perform that check. If it failed several times, it should reboot the node (which would trigger Resource Group failover to the 2nd node in the cluster).
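The control logic of such a heartbeat is simple, and a minimal Python sketch is given here. It only illustrates the counting and the trigger: a real implementation would have to be a memory-locked binary or a kernel extension (which Python is not), and its probe would have to bypass the filesystem cache. The class name, the probe and the action are all hypothetical.

```python
class StorageWatchdog:
    """Count consecutive failed storage probes and fire an action at a limit.

    probe:   callable returning True when storage answered, False otherwise
             (an OSError raised by the probe is also counted as a failure)
    on_fail: callable invoked once the limit is reached, e.g. a node reboot
    """

    def __init__(self, probe, on_fail, max_failures=3):
        self.probe = probe
        self.on_fail = on_fail
        self.max_failures = max_failures
        self.failures = 0

    def tick(self):
        """Run one probe; return True if the failure action was triggered."""
        try:
            ok = bool(self.probe())
        except OSError:
            ok = False
        if ok:
            self.failures = 0  # any success resets the counter
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.on_fail()
            return True
        return False


if __name__ == "__main__":
    # Simulated probe results: storage disappears after the first check.
    results = iter([True, False, False, False])
    wd = StorageWatchdog(lambda: next(results),
                         lambda: print("would reboot the node now"))
    for _ in range(4):
        wd.tick()
```

The important design point is that the probe must not be satisfied from cache, otherwise the watchdog would see the same “everything still works” illusion as the open SSH sessions.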

Note: typical HACMP scripting is, at least in theory, not enough. Even if it forced running /sbin/halt, how can you be sure that /sbin/halt and all the required libc contents are in memory?

Together with Jedrzej we’ve exposed a rather interesting weakness in the IBM PowerHA 5.5 solution (in the old days it was called HACMP). Normally you would assume that in case of a major cataclysm such as *complete* storage disappearance on the active node, PowerHA or AIX has an internal mechanism to prevent downtime by switching the services to the next active node (as defined in PowerHA policies/configuration). This starts to be really interesting when we talk about SAN BOOT-ed AIX LPARs. As everybody knows, any form of assumption is bad (this is starting to be my mantra), and as we show here, avoiding this pitfall requires SOME PLANNING AND CONFIGURATION to avoid ending up with a long downtime….

No posts on my blog for a long time; I need to change that. So I was trying to get the MAA (Maximum Availability Architecture by Oracle) lab in shape again for writing my Master of Science thesis…

Somewhere around January/February this year:

Primary VM RAC nodes prac1, prac2, prac3 are working again, but database db3 is not (unable to archive to db3dg on srac1, srac2). The main root cause was that experiments in April 2009 with the log_archive_min_succeed_dest=2 setting caused loss of sync between primary and standby.

The problematic thing is that after failover/switchover, due to the differences between primary (64-bit) and standby (32-bit), I have to invalidate & recompile all PL/SQL packages (very time-consuming on old hardware, and it seems that Broker is unable to handle that case):

Also I’ve upgraded Grid Control (OMS) to 10.2.0.5 (next step will be to 11g), and the OMS database repository to 11.1.0.6 (from 10.1.0.4).

Next I deployed 32-bit clusterware (11.1) on those nodes, and played a little bit with OCR corruptions [metalink note ID 399482.1] after hitting mysterious listener outages (OCR corruption was not the cause; it was a permission issue on a single directory – doh!)

Created clustered ASM, and created a 32-bit RAC database named “db4.lab1”.

Building MAA for “db4.lab1”: the plan is to create DataGuard for it on the srac1, srac2 VMs (32-bit too, they already host DataGuard for “db3.lab1”). But this one is going to use DataGuard Broker to get Fast-Start Failover working (2-node primary RAC with 2-node standby RAC).

Extending the primary RAC to prac9 (to be created), so as to have a 3-node primary RAC for “db4.lab1” protected by DataGuard Broker with a 2-node standby RAC.

11/04/2009:
1) Finally got some time for cleaning up Grid Control (dropping ora2 and ora3). Secured all agents (on VMs: gc, prac1, prac2). I’ve also cleaned up XEN dom0 (from quadvm). These VMs are not needed anymore. db3.lab (RAC on prac1, prac2) is in GC. Installed 10.2.0.5 32-bit agent on srac1 (single node standby).
2) Testing redo apply on the single-node RAC standby for differences in Standby Redo Logs processing (verification performed by using read-only mode).
3) LNS (ASYNC=buffers_number in LOG_ARCHIVE_DEST_2 parameter) performance fun.
Prepared srac2 for future RAC extension (to two nodes: srac1, srac2). Also installed GC agent on srac2 (10.2.0.5).
4) prac3: cloning and adding it into the Clusterware prac_cluster (addNode.sh from prac2). Deploying GC 10.2.0.5 agent on this node (prac1 and prac2 are both 10.2.0.4, in future I’ll try to upgrade it via GC). Later manually creating +ASM3 and db33 instances (redo, undo, srvctl, etc.). It means that I have 3 node primary RAC
5) srac2: Plan is to add it to the srac_cluster and make it 2 node standby RAC. +ASM2 was running, but more work is needed (mainly registrations in CRS/OCR).
6) The Flash Recovery Area on the standby ASM diskgroup +DATA1 was exhausted (thus MRP0 died), so I performed a full RMAN backup with archivelogs to QUADVM dom0’s NFS and afterwards deleted the archivelogs to reclaim some space. On the SRAC standby I changed the archivelog deletion policy (in RMAN) and then restarted MRP0.
Unfortunately I’ve lost my RAID5 array on synapse (dom0 hosting srac_cluster: srac1, srac2; it’s an old HP LH 6000R server). 2 drives have failed, so my standby RAC is doomed until I rebuild synapse on new SCSI drives (to be ordered).
UPDATE: I’ve verified the backups of my srac1 and srac2 VMs, but the backups of the ASM diskgroup +DATA1 failed. My OCR and voting disks are lost as well. It will be real fun & a challenge to recover this standby RAC environment (this will also be pretty much like restoring a non-DataGuarded RAC environment after a site crash). I believe I won’t have to rebuild my standby from the primary, because I backed up this standby earlier. The OCR hopefully can be restored from the Clusterware auto-backup location.

26/01/2009: Reading about migration of single instance ASM to full clustered ASM/RAC. Experiments with NOLOGGING and RMAN recovery on xeno workstation (db TEST).

27/01/2009: I’ve managed to migrate to full working RAC for db3.lab1 {nodes prac1 and prac2} with ASM storage (ASM migration done using DBCA; RAC migration performed by using rconfig). Deployed GC agent on prac2.

The picture below shows what I’m planning to deploy in my home lab in order to prepare better for OCP certification. It can be summarized as a full Maximum Availability Architecture implementation… Grid Control is being used to increase productivity, but I don’t want to integrate Oracle VM into the stack, just systems and databases:

17/01/2008: Installation and fight with Grid Control on VM gc. Preparing VM linux template named 01_prac1 from which other machines are going to be cloned (simple as recursive “cp” in dom0).

18/01/2008: Installation & fight with Grid Control after I dropped the “emrep” database (main GC repository database). This happened while I was playing with the database “misc1” cloned from “emrep”. I didn’t read the message while running “DROP DATABASE ..” from RMAN and I sent both to /dev/null, because the DBID was the same for the original one and the “misc1” clone. The primary reason was that I wanted misc1 cloned from emrep, but it failed. Did I say that I’ve also deleted the backups? After a new, successful fresh installation of GC, I’ve taken a full backup (from XEN dom0) of the 01_gc VM for future purposes. I’m starting to regret that I haven’t used LVM in dom0 for snapshot/fast-backup purposes…

20/01/2008: Setting up the 09_ora1 virtual machine from VM template 01_prac1. Installing a single 11g database named “db1.lab1” with a dedicated tablespace & user for sysbench version 0.4.8 (0.4.10 doesn’t work with Oracle).

23/01/2008: Cloning ora2 from ora1. Changing hostnames, IPs (trick: the same ethernet MACs but on different XEN bridges, changes performed from console:)). Uninstalling Agent10g, removing the db on ora2. Setting up a DNS server on quadvm (in XEN dom0) for the whole lab. Installing the GC agent on ora2 – something is wrong… The GC console doesn’t pick up new targets (primarily I’m looking for the “Host” component). The agent itself is discovered by GC…. starting from the beginning (rm -rf 08_ora2) and so on…
Finally got ora3 up and running instead of ora2. Then it turned out that Laurent Schneider had blogged about a Metalink note in which the agent removal procedure is described. So finally I’ve got ora2 back in GC (with gc, ora1 and ora3).

The next step was setting up host prac1 for the single-instance non-RAC ASM database “db3.lab1”. First, clusterware was installed. I wanted the 32-bit version, because my future standby RAC hardware is only 32-bit capable, but it appears that I would have to install 32-bit userland RPMs, so I decided to try, in the long term, a 64-bit primary RAC with a 32-bit standby RAC… The Grid Control agent was also installed on prac1.

24/01/2008: Raised RAM for 01_prac1 to 1.6GB from 1.2GB (due to frequent swapping; I want my 11g memory_target for that fresh db3.lab1 database to be left at 450MB). Successful migration of ASM storage from /dev/hde to /dev/hdi (simulating storage migration – thanks to ASM it is straightforward. I’m going to blog about it in my next post).

I definitely need a rest(!). It’s my priority one. The problem is that I’m addicted to DOING something…

On 08.02.2008 I successfully got my Bachelor in Science. Basically we have implemented cluster using Solaris, Solaris Cluster, Oracle, Linux Virtual Servers, Apache2 and JBoss (I had to use Oracle DataGuard instead of RAC… as all of it was implemented in-home, see below for diagram). I’ll probably release webpanel for Solaris Jumpstart+FLARs (MySQL, PHP) some day under GPL. It was one of add-on projects for that engineering work.

Since about 15.02.2008 I’ve been studying for a Master of Science in Computer Science, on the Data Processing Technologies speciality track (emphasis is put on all database-related stuff).

Quick intro for non-Oracle people out there… Oracle DataGuard is a High Availability solution for Oracle Database. For thousands of pages of documentation, concept guides, troubleshooting and HOWTOs about it, please visit docs.oracle.com

As of May I’m very busy architecting & implementing a cluster for Java Enterprise Edition on commodity hardware (mainly x86_32 based) for my engineering work – to obtain the BEng title. Our subject is:
“Web service based on scalable and highly available J2EE application cluster”. We have a team of 4 people, in which I’m responsible for all kinds of systems/hardware scaling/clusters/load balancing/databases/networking/tuning of everything. What kind of portal we are creating is to be decided by the developers (it will likely be some kind of Web 2.0 portal).
The rest of the team is dedicated to J2EE programming. We are mainly playing with technology.
Currently the rock-solid core cluster architecture looks like this:

We are utilizing:

Load balancers: Linux Virtual Servers with DirectRouting on CentOS5 (configured as a part of Redhat Cluster Suite)

SNMPv2 (LVS, OSes, JBoss, Oracle) to monitor everything with a single (self-written) Java application which graphs everything in real time.

As this is a basic configuration with the database as a single point of failure, in September I’m going to set up DataGuard for Oracle. I’m also testing more advanced scale-up. Currently I’m in the process of setting up Solaris Cluster with Oracle RAC 10gR2 on iSCSI storage provided by a third node based on Solaris Nevada with an iSCSI target, to test Transparent Application Failover. I’ve been scratching my head over this one for a while now. Yeah, it is real hardcore… moreover, that’s not the end of the story: Disaster Recovery with some other interesting bits of technology is going to be implemented later on… all on x86_32 commodity hardware. We are also going to put C-JDBC (Sequoia project) under stress…