This article presents the best practices for high availability, security, and scalability that most commonly affect the success of a Sun ONE Portal Server software solution. In addition, the article includes guidelines for creating a Sun ONE Portal Server software solution that can be easily supported. This article is ideal for the advanced reader.


When proposing a technical solution for a specific problem, the first step
is to collect functional and nonfunctional requirements. Generally, these
requirements fall into the following categories:

Performance

Risk

Cost

Schedule

Most of the time, especially in complex systems, such as portals where
content is aggregated from many different sources, these requirements conflict
with each other. Usually security and ease of use call for different approaches.
A very secure site can be hard to use because it requires complicated,
hard-to-remember passwords, or there are stringent session and inactivity
timeouts. In addition, availability and performance sometimes conflict. For
example, to provide session failover, it is necessary to keep the session in
sync on all of the servers. This synchronization adds some delay to each
transaction.

The process of creating a solution involves understanding the trade-offs
between all conflicting requirements and deciding what is more important for a
successful implementation. This article presents some architectural guidelines
that are frequently applied to Sun™ ONE Portal Server 6 software
implementations and will help you to identify and understand potentially
conflicting requirements in the performance and risk categories. After you
understand these categories, you should be able to include the cost and schedule
requirements when you define the final solution.

This article presents the best practices for high availability, security, and
scalability that most commonly affect the success of a Sun ONE Portal
Server software solution. In addition, the article includes guidelines for
creating a Sun ONE Portal Server software solution that can be easily
supported.

Before you read this article, you should have a detailed technical
understanding of the Sun ONE Portal Server 6 software and the Sun ONE Portal
Server Secure Remote Access components, such as the Gateway, the Netlet, the
Rewriter proxies, and the search engine. Also, an in-depth knowledge of the
embedded Sun ONE software products (for instance, the Sun™ ONE Directory
Server, the Sun™ ONE Web Server, and the Sun™ ONE Identity Server)
is required.

High Availability

Delivering high service levels is a top priority for all Sun ONE Portal
Server software implementations. You can determine the availability of a system
by using a simple equation, as shown in FIGURE 1.

As the equation shows, if you decrease the downtime of the system, you can
increase the availability of the system. However, when you measure the downtime,
you must measure the total amount of time the system is unavailable, which
should include the planned downtime (for example, maintenance, backups, and
repairs to the system) and the unplanned downtime (for example, system or
network failures). Some studies show that planned downtime can account for up to
80 percent of the total time a system is unavailable.
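
The relationship between downtime and availability can be illustrated with a
minimal sketch. It assumes the common definition of availability as the
fraction of a measurement period during which the system is usable; the
function name and the sample figures are illustrative and are not taken from
the article.

```python
def availability(total_hours: float, planned_down: float, unplanned_down: float) -> float:
    """Return availability as a fraction of the total period (assumed definition)."""
    # Both planned and unplanned outages count as downtime.
    downtime = planned_down + unplanned_down
    if total_hours <= 0 or downtime > total_hours:
        raise ValueError("downtime must fit within a positive total period")
    return (total_hours - downtime) / total_hours


# Hypothetical year: 8760 hours, 40 hours of planned and 10 hours of
# unplanned downtime.
print(f"{availability(8760.0, 40.0, 10.0):.4%}")
```

In this hypothetical case, planned maintenance accounts for most of the lost
time, which is why reducing it can matter more than hardening the system
against failures.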

Thus, when you are architecting a solution, you must consider both the
planned and unplanned downtime to ensure that you create a highly available
solution. Availability is affected by all of the components in the system, such
as the following infrastructure components:

Hardware

Network

Operating system

Applications

In addition to these infrastructure components, availability is also affected
by people and processes, so when you are architecting a highly available
solution, you must ensure that the people who will be supporting the solution
have the proper training and skill sets, and you must ensure that clearly
defined processes are in place to support the system.

For background information on the concept of availability, refer to
"Availability - What It Means, Why It's Important, and How to Improve
It" (Sun BluePrints™ OnLine, 1999).

In reference to availability, system types can be defined in four ways:
noncritical, task critical, business critical, and mission critical. The
noncritical system type is a basic system that has no requirements for
availability. If the system goes down, it can be repaired in a matter of days
without affecting users. This type of system is not important to the discussion
of availability in this article.

Task-Critical Systems

Unlike the noncritical system, the task-critical system does have
availability requirements. If the system goes down, users are affected, and
system performance might suffer. The best way to achieve the
availability levels required for this kind of system is by using redundancy of
services. To optimize the usage of the system resources, all of the redundant
components should be active (that is, they should not be in standby mode).
Replication, load balancing, and service redundancy must be used to achieve this
goal. FIGURE 2 shows the basic design of a task-critical system.

Gateway Server Availability

As FIGURE 2 shows, in this architecture, there are at least two gateway
servers that are front-ended by a load balancer so that all of the
requests are spread across the gateways. The load balancer must also be
configured to detect failures in the gateways. If a gateway fails, then the load
balancer sends all of the requests to the surviving gateway.
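
The behavior described above can be sketched as a round-robin balancer that
skips gateways marked as failed. This is an illustration only: the class and
the gateway names are hypothetical, and a real load balancer would detect
failures through its own health checks rather than explicit calls.

```python
from itertools import count


class LoadBalancer:
    """Round-robin balancer that skips gateways marked unhealthy."""

    def __init__(self, gateways):
        self.gateways = list(gateways)
        self.healthy = set(self.gateways)
        self._counter = count()

    def mark_down(self, gateway):
        # Called when a health check detects a failed gateway.
        self.healthy.discard(gateway)

    def mark_up(self, gateway):
        # Called when a known gateway passes its health check again.
        if gateway in self.gateways:
            self.healthy.add(gateway)

    def route(self):
        """Return the next healthy gateway, or raise if none survive."""
        for _ in range(len(self.gateways)):
            candidate = self.gateways[next(self._counter) % len(self.gateways)]
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy gateway available")


lb = LoadBalancer(["gw1", "gw2"])
first_two = [lb.route(), lb.route()]            # requests spread across both
lb.mark_down("gw1")                             # simulated gateway failure
after_failure = [lb.route() for _ in range(3)]  # all go to the survivor
```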

The gateway is a stateless process, so if a gateway fails, all of the
sessions associated with that gateway can be redirected to the other gateway.
Users will not perceive any downtime because the Portal Server session is
maintained on the Portal Server nodes, not in the gateways.

When you use gateways, it is likely that the resource servers that are being
accessed through the gateways will be on a private network that is protected by
a firewall. In this case, you might want to use a web proxy to access these
resource servers so that a single hole is open in the firewall. Even though the
Sun ONE Portal Server software includes a rewriter proxy, it is not a fully
functional web proxy server. For example, it does not support caching, the
Internet Cache Protocol (ICP), or URL filtering. The Sun ONE Web Proxy Server
software, by contrast, is a reliable, inexpensive, and highly configurable web
proxy server that, in addition to these features, provides generic protocol
support for firewall traversal by using SOCKSv5.

Portal Server Instances

In the architecture depicted in FIGURE 2, the Portal Server instances
are installed on the Sun ONE Web Server, which serves as the web container. The Sun ONE Web
Server software does not support replication of user sessions across instances,
so when a Portal Server instance goes down, all of the sessions maintained on
that instance are lost. The same is true if the web container used is an
application server, such as the Sun ONE Application Server Standard Edition,
that does not support session failover.

To increase availability of the Portal Server, you can create multiple
instances of the Sun ONE Web Server on the same machine or have multiple
instances on multiple machines. In this way, the number of users affected by a
Portal Server instance failure is minimized. Affected users have to log in
again on one of the surviving instances.

If an instance fails, the gateway detects the failure and reroutes the
requests to one of the surviving instances. If you are not using the Sun ONE
Portal Server Secure Remote Access software, you must have load balancers to
perform the functions of the gateways, and the load balancers have to detect the
failure of a Portal Server instance and send the requests to one of the
surviving instances.

Directory Server Availability

Another important component of the Sun ONE Portal Server software solution is
the Directory Server where the user and services profiles are stored. To remove
this component as a single point of failure, you can use the Sun ONE Directory
Server software's multi-master replication (MMR) configuration or the
Sun™ Cluster software framework. Because of the loosely consistent
replication mechanism of the Sun ONE Directory Server software, it is possible,
albeit very unlikely, that an update can be lost. If a system failure occurs
right after a change has been accepted by one master, but before the change is
replicated to the second master, it is possible that the change will be lost,
and there is no easy way to detect this fact.

In some very demanding environments, the possibility of losing an update
might not be acceptable. In this case, the best option is to use the Sun Cluster
software to achieve high availability of the Directory Server. The use of the
Sun Cluster software can increase the availability of the system, but the
configuration, maintenance, and monitoring of this environment require more
specialized knowledge and well-defined operational processes.

When you are installing the Portal Server software, you can only specify one
LDAP server, and this server must be a master LDAP server because the
installation process is affected by the propagation delay of the LDAP
replication process. To add multiple LDAP masters or to point the Sun ONE
Identity Server to use a consumer after the installation, you must edit the
serverconfig.xml file and add a Server element for each
additional LDAP server. The following example shows the format of the server
entries:
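
As a hedged sketch, entries of this kind could be generated with a short
script such as the one below. The attribute names (name, host, port, and type),
the "SIMPLE" value, and the host names are assumptions made for illustration;
check them against the Server element that the installer already wrote to your
serverconfig.xml file before editing it.

```python
import xml.etree.ElementTree as ET


def build_server_entries(hosts, port=389, conn_type="SIMPLE"):
    """Build one <Server> element per LDAP host and return them as strings.

    The attribute names used here are assumptions, not a documented schema.
    """
    entries = []
    for index, host in enumerate(hosts, start=1):
        entry = ET.Element("Server", {
            "name": f"Server{index}",
            "host": host,
            "port": str(port),
            "type": conn_type,
        })
        entries.append(ET.tostring(entry, encoding="unicode"))
    return entries


# Hypothetical host names; list them in the order they should be tried.
for line in build_server_entries(["ldap1.example.com", "ldap2.example.com"]):
    print(line)
```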

The Identity Server uses the first entry as the LDAP server for all requests
for service, role, organization, and user profiles. If that LDAP server fails,
the Identity Server fails over to the next server in the list. There is no
round-robin or failback between the LDAP servers, so if you want to design a
solution in which all of the LDAP servers are used evenly, you will have to use
the Sun ONE Directory Proxy Server software. A load balancer cannot be used
because the Sun ONE Identity Server software uses a pool of connections that are
kept open and are reused. The same is true for the LDAP and membership
authentication modules and for the Policy Configuration service. They can use
primary and backup LDAP servers, but you have to add the failover servers after
the installation by using the administration console.
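
The difference between this ordered failover and load balancing can be seen in
a minimal sketch. The class is hypothetical and models only the selection
behavior described above (first entry preferred, no round-robin, no failback);
what the Identity Server does when the list is exhausted is not covered by the
article, so raising an error here is an assumption.

```python
class OrderedFailover:
    """Use the first server until it fails, then move down the list for good."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self.servers = list(servers)
        self.current = 0  # index of the server currently in use

    def active(self):
        return self.servers[self.current]

    def report_failure(self):
        """Advance to the next server; earlier entries are not revisited."""
        if self.current + 1 >= len(self.servers):
            # Assumption: the sketch simply gives up when the list runs out.
            raise RuntimeError("no LDAP server left to fail over to")
        self.current += 1
        return self.active()


pool = OrderedFailover(["ldap-master-1", "ldap-master-2"])
before = pool.active()          # every request goes to the first entry
after = pool.report_failure()   # the second entry takes over after a failure
```

Because the second server receives no traffic until the first one fails, the
load is never shared evenly, which is the gap that a directory proxy fills.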

Planned Downtime

With the architecture shown in FIGURE 2, there is
redundancy of services, so most of the unplanned downtime can be minimized or
eliminated. However, the planned downtime is still an issue. For instance, if
the Portal Server software must be updated, services could be affected. If the
upgrade or patch includes changes to the Sun ONE Directory Server software
schema used by the Sun ONE Identity Server software, all of the software
components must be stopped to update the information stored in the Directory
Server.

In addition, the Solaris™ Operating System (Solaris OS) patch
installation process does not work if the application services are enabled.
Thus, you must shut down all of the services, patch the system, then bring the
system back online. In some environments, the downtime incurred during the patch
process might not be acceptable. But, with a highly available solution with
duplicate services and components, you can use a phased approach for maintaining
the system. For instance, you could remove one Portal Server node from the
production configuration and upgrade it. Then, you could remove one of the
gateways from the production configuration and upgrade it. Afterwards, you could
integrate that silo back into the production configuration and repeat the
process for the other silo, resulting in an upgrade with minimal interruption of
service.

In theory, you would not have downtime; however, because of the architecture
of the Portal Server software, it is not possible to just remove one Portal
Server instance from the active configuration without affecting some users
because there is no way to prevent a user from logging in to the server and to
keep the active sessions untouched. Thus, you must create a mechanism to prevent
users from logging in to the server while the gateways still process requests
from the already-authenticated users. This can be accomplished by using a
custom-developed authentication module.
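
The decision logic of such a module can be sketched as follows. This is not an
Identity Server API: the class and method names are hypothetical, and the
sketch shows only the idea of refusing new logins while continuing to serve
sessions that already exist.

```python
class DrainableLogin:
    """Reject new logins while draining, but keep serving open sessions."""

    def __init__(self):
        self.draining = False
        self.sessions = set()

    def login(self, user):
        if self.draining:
            return False  # the user must authenticate on another instance
        self.sessions.add(user)
        return True

    def handle_request(self, user):
        # Already-authenticated users are served regardless of drain state.
        return user in self.sessions

    def logout(self, user):
        self.sessions.discard(user)

    def safe_to_stop(self):
        # The instance can be taken offline once no sessions remain.
        return self.draining and not self.sessions


node = DrainableLogin()
node.login("alice")
node.draining = True                     # maintenance window begins
new_login_ok = node.login("bob")         # refused while draining
still_served = node.handle_request("alice")
node.logout("alice")
can_stop = node.safe_to_stop()
```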

Business-Critical Systems

The third type of system is the business-critical system. For this type of
system, availability is a critical requirement because if the system goes down,
it could lead to lost revenue, lost productivity, and customer dissatisfaction.
FIGURE 3 shows the typical configuration of a business-critical system,
which builds on the architecture described for the task-critical system and
includes all of its benefits.

To enhance the availability of the system, an application server is used as a
web container for all of the Identity Server and Portal Server services. This
configuration is needed to use the HTTP session failover features of the
Application Server, which maintains a database of all of the sessions that are
created in the system. That database is accessible to all of the Portal
Server instances in the configuration. Thus, if one Portal Server instance
fails, the gateways redirect all of the requests to the surviving Portal Server,
which can confirm that the session is still valid, so the user's work
continues without interruption.

In the architecture depicted in FIGURE 3, the availability of the system
is much higher than that of the architecture discussed in the "Task-Critical
Systems" section. However, it is not possible to achieve absolute session
failover. Depending on how an application server instance fails and how and
where user sessions are stored and replicated, newly created individual user
sessions might not have been written to the common database, so they might be
lost.
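
The following minimal simulation shows why failover is not absolute. It assumes
a simple model in which each instance holds new sessions locally until they are
written to the shared store; the class names and the explicit flush step are
hypothetical and do not describe how any particular application server
persists sessions.

```python
class SharedSessionStore:
    """Stand-in for the session database shared by all instances."""

    def __init__(self):
        self.saved = {}


class PortalInstance:
    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.pending = {}  # sessions created locally but not yet persisted

    def create_session(self, session_id, user):
        self.pending[session_id] = user

    def flush(self):
        """Persist pending sessions to the shared store."""
        self.store.saved.update(self.pending)
        self.pending.clear()

    def crash(self):
        """Simulate a failure: anything not yet flushed is gone."""
        self.pending.clear()

    def is_valid(self, session_id):
        return session_id in self.pending or session_id in self.store.saved


store = SharedSessionStore()
a, b = PortalInstance("a", store), PortalInstance("b", store)
a.create_session("s1", "alice")
a.flush()
a.create_session("s2", "bob")   # created just before the failure
a.crash()
survived = b.is_valid("s1")     # persisted, so instance b accepts it
lost = not b.is_valid("s2")     # never persisted, so it cannot be recovered
```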

The Sun ONE Portal Server software, version 6.2, supports BEA's WebLogic
6.1 and the Sun™ ONE Application Server 7.0 Enterprise Edition for session
failover. When using BEA's application server, the WebLogic Cluster
software is required to create an environment in which the sessions are
replicated; however, the Sun ONE Portal Server Secure Remote Access software
does not work with the WebLogic Cluster software. Thus, it is not possible to
implement a highly available solution with the WebLogic software if the gateways
are also required.

For the business-critical system, upgrades have minimal impact. Either
Portal Server node can be taken out of the production configuration without
affecting the users because the sessions are in the shared database and because
requests will be handled by the available Portal Server node.

Mission-Critical Systems

The fourth type of system is the mission-critical system. For the
mission-critical system, failures could have catastrophic results for an
organization (for example, loss of life or serious injury, significant loss of
money, serious inability to conduct business, or serious operational chaos).
Most mission-critical systems are custom built using special hardware and
software, such as fault-tolerant computers.