Fault Tolerance

We have a client site running an in-house PRO-IV application with 30 users.
They are looking to improve their resilience to hardware failures (there has been one failure in the last five years).

Over the past few months the client has studied various alternatives for replacing the present server and the now outdated Oracle database (version 7.4.2). The brief is to implement a more scalable and fault-tolerant solution on a more recent version of Oracle, presumably Oracle 9i.

They have short-listed some potential solutions with regard to fault tolerance. Besides the technological differences between them, there is also an element of cost.

Option 1
A hot backup server with identical configuration to the production server, in terms of the PRO-IV application and the Oracle db. In the event of a failure, they would switch to the backup server and the users would lose any data input that day.

Option 2
Implement Oracle Failsafe on two server nodes sharing a common storage medium. In the event of a node failure, the Oracle instance on the failed node is replicated and executed on the passive node. This solution involves disconnecting users from the failed node and automatically reconnecting them to the backup node. We understand that this procedure abruptly disconnects the PRO-IV clients, so once reconnection is activated on the backup node, their previous sessions need to be manually removed (killed).

Option 3
Implement Oracle Real Application Clusters (RAC). Again this involves two server nodes and a shared storage medium. In the event of a node failure, there should be a seamless transfer of operations to the other node without any downtime. However, I presume the link between the PRO-IV client and kernel would be lost? The client is currently on PRO-IV V4, due to move to V5 later this year.

Does anyone have option 2 or option 3 live on a client site? Any comments on how they would perform with PRO-IV?

Another alternative would be to use a heartbeat setup, available for most Unix flavours including Linux. This would be the equivalent of option 1 (hot). It works by both machines appearing to have the same IP address and processing everything identically (PROIV, Oracle, etc.). On a failure, the heartbeat element routes IP packets to the remaining machine, so providing a continuous service.
Licences are available for disaster recovery purposes from both PROIV and Oracle - contact your account managers for more information.
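
As a rough illustration of the heartbeat idea described above, here is a minimal Python sketch of the decision the surviving machine makes. It is not the code of any particular heartbeat package: the class name, the `timeout_s` setting and the `owns_service_ip` flag are all hypothetical, and a real setup would re-point the shared IP address at the operating system level rather than flip a flag.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Tracks the peer's heartbeats and decides when to take over the service IP."""

    timeout_s: float               # hypothetical: silence longer than this means the peer is down
    last_seen: float = 0.0         # time of the most recent heartbeat from the peer
    owns_service_ip: bool = False  # True once this machine has claimed the shared IP

    def beat(self, now: float) -> None:
        """Record a heartbeat received from the peer at time `now`."""
        self.last_seen = now

    def check(self, now: float) -> bool:
        """Claim the shared IP if the peer has been silent for longer than the timeout."""
        if not self.owns_service_ip and now - self.last_seen > self.timeout_s:
            # A real implementation would re-route the IP address here.
            self.owns_service_ip = True
        return self.owns_service_ip


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.beat(10.0)
print(monitor.check(12.0))  # peer seen 2 s ago, within the timeout
print(monitor.check(13.5))  # peer silent for 3.5 s, so take over
```

The timeout is the main design choice: too short and a brief network glitch triggers a needless failover, too long and users wait longer before service resumes.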

Thanks for the reply. A Unix solution is not a runner, as the customer's experience and skills are mainly NT/Windows.

The client site is currently NT & ORACLE 7, moving in the coming month to Win2000 & ORACLE 9. As part of this upgrade we are now considering what the best failsafe procedure is for a Win2000 & ORACLE 9 configuration.

Whether it is a Failsafe, Cluster or Hotserver solution, there are licence implications.

What we'd really like to know is the experience of other sites in implementing the best solutions for system recovery in the event of system failures.

Is everyone out there who is using PRO-IV with NT/Win and ORACLE generally relying on a Standby Hotserver as the solution for system recovery, with the re-entry of data?

Obviously I don't know anything about your application. However, unless it is mission-critical in some way or non-availability is potentially very costly it would be unusual (in my limited experience) to use Failsafe or Cluster technology to support 30 users.

I would have guessed initially that all you need is an Oracle Standby database. Provided this is geographically separated (as is usual), it also provides for disaster recovery in the case of natural disasters, something that should be provided for all critical applications. A Standby database can be almost up-to-date if you have decent comms; there is no need to lose a day's input - what kind of fault tolerance is that?

Of course it's your problem to keep all the software (OS, database, application etc..) on the standby server(s) in precise sync with the 'primary' server(s). Do not underestimate the work involved (This can also apply to the multiple nodes of a cluster in some circumstances). I'm not clear if your ProIV app is running on the same machine as the database.

I know of a few ProIV applications running on Oracle Parallel FailSafe (OPFS). These systems are two-node clusters where one node is active (primary) and one passive (secondary). Oracle is (normally) up on both nodes but applications only connect to whichever node is currently the primary. The ProIV applications do not run on the database cluster but on separate machines. All the examples I'm aware of use Unix for both the database and application servers; I haven't heard of any on Windows yet.

With OPFS, database failover is fully automatic but ProIV applications have to be restarted and users have to reconnect. The applications in question were specifically designed to fail cleanly and to be recoverable/restartable with respect to in-flight transactions and all batch/offline processing.
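
To illustrate what "recoverable/restartable" can mean for batch processing, here is a minimal Python sketch. It is an assumption about the general technique, not a description of how the applications mentioned above were built: the function `run_batch`, the `done` set and the `flaky` step are hypothetical names, and in a real system the record of completed items would be stored in the database in the same transaction as the work itself.

```python
from typing import Callable, Iterable, List, Set


def run_batch(items: Iterable[str], done: Set[str], process: Callable[[str], None]) -> None:
    """Process each item once, skipping anything already recorded as done."""
    for item in items:
        if item in done:
            continue          # completed before the failure, do not repeat it
        process(item)         # may raise if the database connection is lost
        done.add(item)        # recorded only after the step has succeeded


# Simulate a failover part-way through the batch, then a restart.
log: List[str] = []
done: Set[str] = set()


def flaky(item: str) -> None:
    """Hypothetical step that fails the first time it reaches item 'c'."""
    if item == "c" and "c-failed" not in log:
        log.append("c-failed")
        raise ConnectionError("database connection lost")
    log.append(item)


batch = ["a", "b", "c", "d"]
try:
    run_batch(batch, done, flaky)
except ConnectionError:
    pass                      # the application stops cleanly
run_batch(batch, done, flaky)  # the restart resumes at 'c'; 'a' and 'b' are not redone
print(log)
```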

You should always consider independently the connection between the users and ProIV and the connection between ProIV and the database. Note ProIV does not support Oracle Transparent Application Failover (TAF) and there are no plans to do so as far as I am aware. In any case, TAF is not effective with read-write transactions.

I am involved in some ProIV work with Oracle Real Application Clusters (RAC) but today this is largely aimed at improving scalability rather than availability. Be aware that RAC provides the foundation for applications that can 'seamlessly' handle database connection failures but that your application must be designed and coded to do this (the ProIV kernel isn't). RAC cannot make your application more reliable or do anything about the connection(s) between your application and online users - it's not a magic bullet!

We currently run Glovia 5.2 and Oracle 8i on OFS (Windows 2000). We have a two-node, shared-nothing configuration. We run our distribution center software on one node and Glovia on the other. The setup works well enough for us, but we have one large problem that you may need to know about: PRO-IV requires a second license for the stand-by node. This was very costly for us.

We are aware that there are additional PRO-IV licence costs for Standby/Cluster Servers. This is not our main concern; a lot of software suppliers charge extra for licences for standby/cluster servers. PRO-IV have been very accommodating with temporary licences for testing and other tasks.

I'd be looking seriously at option 3 - Oracle 10g on RAC with at least 2 nodes in the cluster, and a (cold) standby in the usual far-off place to avoid the issue of site loss. RAC is beginning to get much more serious traction in the market and allows you/your client to do some wonderful things. The New Zealand Stock Exchange just consolidated 21 separate Oracle DB's into a single instance running on RAC and got significant (a claimed factor of 1000 for some queries) performance benefits.

And then run your PRO-IV app somewhere else (ie not on the Oracle cluster) and if you're brave, cluster the PRO-IV servers. Run the 9i client on the PRO-IV servers and you should have no problem with comms to the DB. You'll need decent (Gb) ethernet between PRO-IV servers and the Oracle cluster, but in my opinion they should be separate anyway. And this also assumes that pretty much your *entire* app uses exclusively Oracle (no PRO-ISAM other than bootstraps).

RAC does appear to allow you to use 'cheap' hardware (that is, 10s not 100s of $K/box) and stories in Australia seem to be good too. And 10g's admin capabilities are far in advance of what 8i or 9i have offered in the past. I'm not sure about whether you'd want to go to their Linux solution, or if your client's already a 64-bit Oracle customer it would be better to stick to 64-bit Oracle - and that means clustered Sun boxes, for example. I'll be looking with interest at Solaris 10 on the new Sun/Opteron servers when Oracle releases on the Solaris x86 platform 'in the near future'.

I wouldn't go near the Oracle FailSafe solutions - it's not Oracle's future direction; their new (i.e. RAC) stuff definitely is.

All this assumes that your client has a reasonable budget - RAC isn't cheap to implement - but in a high-volume environment where you need very high availability (<1 failure in 5 years is pretty high, I think), I don't think you can go past it.

We went LIVE last week. In recent months, lots of new development while system testing the new kit/system sort of delayed us, but business comes first. Generally things went smoothly: we just imported the dump made from ORACLE 7 into ORACLE 9 and it worked first time. "Still fingers crossed".

The client site went with option 2 - the system is as described by the "E O H" poster
"I know of a few ProIV applications running on Oracle Parallel Failsafe (OPFS). These systems are two-node clusters where one node is active (primary) and one passive (secondary). Oracle is (normally) up on both nodes but applications only connect to whichever node is currently the primary. The ProIV applications do not run on the database cluster but on separate machines. "

utilising pro-iv 5.5, ORACLE9i and Win2003.

ORACLE 9i was the customer decision as they have other business applications on this DB. We cannot just move to 10g.

The keys to a number of the primary/master tables are allocated via a counters/keys file which is pro-isam.
We built a solution so that, in the event of a fail-over, the counters/keys file on the secondary node is automatically synchronised with the DB before users connect via the secondary node. Generally there are a few minutes of user downtime if fail-over from the primary to the secondary node occurs.
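
For readers wondering what synchronising a counters/keys file with the DB might involve, here is a minimal Python sketch. It is only one plausible approach, not necessarily what was built on this site. It assumes each counter holds the last key allocated for a table, and that the highest key actually present in each table has already been read from the database; the function name and the table names are hypothetical.

```python
from typing import Dict


def resync_counters(file_counters: Dict[str, int], db_max_keys: Dict[str, int]) -> Dict[str, int]:
    """Return counters that are never behind the highest key already in the database.

    file_counters: last allocated key per table, as found in the counters file.
    db_max_keys:   highest key currently stored per table (e.g. from SELECT MAX(key)).
    """
    synced = dict(file_counters)
    for table, highest in db_max_keys.items():
        # Never move a counter backwards; only catch up with the database.
        synced[table] = max(synced.get(table, 0), highest)
    return synced


# The counters file on the secondary node is stale for ORDERS but not for INVOICES.
stale = {"ORDERS": 100, "INVOICES": 50}
in_db = {"ORDERS": 120, "INVOICES": 40}
print(resync_counters(stale, in_db))
```

Taking the larger of the two values means a stale file can never cause a key that is already in the database to be handed out again.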

Query 1) Has anyone easily converted a large application from using a pro-isam counters/keys file to using an ORACLE solution for the allocation of the keys? I did see a somewhat related post, but it seems to relate to keys for files that also had "user" in the key, or to solutions where the data entry was made to work-files to avoid contention/record locking. It has been a tiring few months, so the more detail anyone can supply on how to do this, the better.

Query 2) From the above, you have a picture of how our application is structured. Our client's h/w and s/w environment provides them with the "feature" to action a 'hot-mode' backup of the ORACLE DB. Of course, in the current environment with both pro-isam and the ORACLE DB in use, backing up an active ORACLE DB will result in the DB not being in sync with the pro-isam files. Again, we have a solution that refreshes the pro-isam file.
The application is generally an 8-6 system, with nightly batch processing occurring after the last user logs out.
My preference is for the backup of the pro-isam files and the ORACLE DB to occur after the nightly batch, but the processing time sometimes does not leave a large enough window for both batch and backup to occur.

Has anyone out there got any experience of hot back-ups on an ORACLE 9i DB?
Is there a need to architect/design your functionality in any particular style to accommodate hot back-ups?

Thanks for taking time to read this, would really like to hear from anyone who has done anything similar to the above requests.

George, we are not on a cluster but we are on Oracle 10g. We converted our Pro-Isam key file to Oracle sequences. It is very simple to do and you have no record locking or file locking. Oracle caches as many next sequence numbers as needed. Ours are cached at around 50 numbers.
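
To show what caching sequence numbers implies for key allocation, here is a small Python simulation. It is a sketch of the general behaviour only, not Oracle's implementation: the class `CachedSequence` and its methods are hypothetical. The point it illustrates is that numbers reserved in the cache but not yet used are skipped after an instance failure, so keys stay unique but can have gaps.

```python
class CachedSequence:
    """Simulates a sequence that reserves numbers in blocks of `cache`.

    For comparison, an Oracle sequence with a cache of 50 would be created with
    something like (the sequence name is hypothetical):
        CREATE SEQUENCE order_key_seq START WITH 1 INCREMENT BY 1 CACHE 50;
    """

    def __init__(self, start: int = 1, cache: int = 50) -> None:
        self.cache = cache
        self.persisted_next = start  # first number not yet reserved; survives a crash
        self._next = start           # next number to hand out (in memory only)
        self._limit = start          # end of the reserved block, exclusive (in memory only)

    def nextval(self) -> int:
        """Hand out the next number, reserving a new block when the cache is empty."""
        if self._next >= self._limit:
            self._next = self.persisted_next
            self._limit = self.persisted_next + self.cache
            self.persisted_next = self._limit
        value = self._next
        self._next += 1
        return value

    def crash(self) -> None:
        """Simulate an instance failure: the in-memory cache is lost."""
        self._next = self._limit = self.persisted_next


seq = CachedSequence(start=1, cache=50)
before = [seq.nextval() for _ in range(3)]  # uses 1, 2, 3 from the cached block 1..50
seq.crash()                                 # 4..50 were reserved but never used
after = seq.nextval()                       # allocation resumes at the next block
print(before, after)
```

If the application requires gap-free numbering (for invoice numbers, say), this behaviour matters; for surrogate keys it is usually harmless.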
