Planning for Availability

Rightsizing Availability

To plan availability of systems and applications, assess the availability
needs of the user groups that access different applications. For example,
external fee-paying users and business partners often have higher quality
of service (QoS) expectations than internal users. Thus, it may be more acceptable
to internal users for an application feature, application, or server to be
unavailable than it would be for paying external customers.

The following figure illustrates the increasing cost and complexity
of protecting against decreasingly probable events. At one end of the continuum,
a simple load-balanced cluster can tolerate localized application, middleware,
and hardware failures. At the other end of the scale, geographically distinct
clusters can protect against major catastrophes that affect an entire data
center.

To realize a good return on investment, it often makes sense to identify the
availability requirements of individual features within an application. For example,
it may not be acceptable for an insurance quotation system to be unavailable
(potentially turning away new business), but brief unavailability of the account
management function (where existing customers can view their current coverage)
is unlikely to turn away existing customers.

Using Clusters to Improve Availability

At the most basic level, a cluster is a group
of application server instances—often hosted on multiple physical servers—that
appear to clients as a single instance. This provides horizontal scalability
as well as higher availability than a single instance on a single machine.
This basic level of clustering works in conjunction with the HTTP
load balancer plug-in, which accepts HTTP and HTTPS requests and forwards
them to one of the instances in the cluster. The ORB and integrated JMS brokers
also perform load balancing to application server clusters. If an instance
fails, becomes unavailable (due to network faults), or becomes unresponsive,
requests are redirected only to the remaining available instances. The load
balancer can also recognize when a failed instance has recovered and
redistribute load accordingly.
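
The redirect-and-recover behavior described above can be sketched in a few
lines of Python. This is only an illustration of the general idea, not the
load balancer plug-in's actual algorithm or API: the class name
RoundRobinBalancer, its methods, and the instance names are all hypothetical,
and round-robin selection is an assumption made for simplicity.

```python
class RoundRobinBalancer:
    """Hypothetical balancer: rotates over instances, skipping unhealthy ones."""

    def __init__(self, instances):
        self.instances = list(instances)   # fixed rotation order
        self.healthy = set(self.instances)  # instances currently accepting requests
        self._next = 0                      # index of the next candidate

    def mark_down(self, instance):
        # Called when an instance fails or stops responding.
        self.healthy.discard(instance)

    def mark_up(self, instance):
        # Called when a failed instance is detected as recovered.
        if instance in self.instances:
            self.healthy.add(instance)

    def pick(self):
        # Try each instance at most once, starting at the rotation point.
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances available")
```

After mark_down("i2"), pick() returns only the other instances; after
mark_up("i2"), the recovered instance rejoins the rotation.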

Adding Redundancy to the System

One way to achieve high availability is to add hardware and software redundancy
to the system. When one unit fails, the redundant unit takes over. This is also
referred to as fault tolerance. In general, to maximize availability, identify
and remove every possible single point of failure in the system.
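
As a rough illustration of why redundancy helps, the following Python sketch
computes the combined availability of several redundant units. It assumes
that units fail independently and that failover is instantaneous and always
succeeds. Neither assumption is stated in this guide and real systems
rarely satisfy them exactly, so treat the result as an upper bound. The
function name is hypothetical.

```python
def combined_availability(unit_availability, units):
    """Availability of `units` redundant units, assuming independent failures.

    The system is unavailable only when every unit is down at the same time,
    so the combined availability is 1 - (1 - a) ** n.
    """
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be between 0 and 1")
    if units < 1:
        raise ValueError("units must be at least 1")
    return 1.0 - (1.0 - unit_availability) ** units
```

Under these assumptions, two units that are each available 99% of the time
give a combined availability of about 99.99%.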

Identifying Failure Classes

The level of redundancy is determined by the failure classes (types of failure) that the
system needs to tolerate. Some examples of failure classes are:

System process

Machine

Power supply

Disk

Network

Building fires or other preventable disasters

Unpredictable natural catastrophes

Duplicated system processes tolerate single system process failures and,
when the duplicates run on separate machines, single machine failures.
Attaching the mirrored (paired) machines to different power supplies
tolerates single power failures. Keeping the mirrored machines in separate
buildings tolerates a single-building fire. Keeping them in separate
geographical locations tolerates natural catastrophes such as earthquakes.

Using HADB Redundancy Units to Improve Availability

Using HADB Spare Nodes to Improve Fault Tolerance

Using spare nodes improves fault tolerance. Although spare nodes are not mandatory, they provide maximum
availability.

Planning Failover Capacity

Failover capacity planning means deciding how many additional servers
and processes you need to add to the Application Server deployment so that
in the event of a server or process failure, the system can seamlessly recover
data and continue processing. If your system gets overloaded, a process or
server failure might result, causing response time degradation or even total
loss of service. Preparing for such an occurrence is critical to successful
deployment.

For example, consider a system with two machines running one Application
Server instance each. Together, these machines handle a peak load of 300 requests
per second. If one of these machines becomes unavailable, the system will
be able to handle only 150 requests per second, assuming an even load
distribution between the machines. Therefore, half the requests during peak
load will not be served.
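
The arithmetic in this example can be sketched in Python as follows. The
function names are hypothetical, and the sizing helper assumes that each
machine's maximum throughput is known (150 requests per second here, taken
from the even split in the example) and that the load is distributed evenly.

```python
import math

def remaining_capacity(peak_rps, machines, failed=1):
    """Requests per second still served after `failed` machines go down,
    assuming the peak load was spread evenly across all machines."""
    per_machine = peak_rps / machines
    return per_machine * (machines - failed)

def machines_needed(peak_rps, per_machine_rps, tolerated_failures=1):
    """Machines required to serve the peak load even after
    `tolerated_failures` machines have failed."""
    return math.ceil(peak_rps / per_machine_rps) + tolerated_failures

print(remaining_capacity(300, 2))    # capacity left after one of two machines fails
print(machines_needed(300, 150))     # machines needed to survive one failure
```

With the figures from the example, one failure leaves 150 requests per second
of capacity, and a third machine would be needed to keep serving the full
peak load of 300 requests per second after a single machine failure.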