1 Monitoring and Tuning Oracle Fusion Applications

This chapter discusses how to find the information you need to examine so you can tune your system. It describes how to monitor and tune the database and Oracle Fusion Applications, and how to troubleshoot common problems.

1.1 Introduction

Every system of hardware and installed applications is different. Even though Oracle Fusion Applications are written and installed using industry-standard best practices, you can tailor your system to improve how it supports your environment.

To tune your system, however, you need to locate and examine data. This chapter explains what data to examine and which tools to use to gather it.

1.2 Monitoring and Tuning Oracle Fusion Applications

In general, most default settings in Oracle Fusion Applications are already tuned.

These guidelines are provided to help ensure your Oracle Fusion Applications instance runs optimally. Note that all metrics listed are from Oracle Enterprise Manager Cloud Control.

Monitor the key host metrics, shown in Table 1-1, to ensure the underlying server hosts are healthy. Rather than constantly checking the metric values, you can set up alert thresholds in Cloud Control and receive notification when thresholds are exceeded. For more information, see "Creating Monitoring Templates" in the Oracle Fusion Applications Administrator's Guide.

Monitor the key component metrics, such as WebLogic server metrics, to ensure each component is healthy.

Monitor the number of incidents and logs to ensure the application is configured properly and not constantly wasting resources generating error messages. Review log levels to ensure they are not set too low. See "Troubleshooting Oracle Fusion Applications Using Incidents, Logs, QuickTrace, and Diagnostic Tests" in the Oracle Fusion Applications Administrator's Guide for more information.

Monitor the database to ensure it is operating optimally. Follow the guidelines in Chapter 3, "Tuning the Database," to make sure that statistics are being collected.

Table 1-1 Key Host Metrics

Metric Category | Metric Name | Warning Threshold | Critical Threshold | Comments
--------------- | ----------- | ----------------- | ------------------ | --------
Disk Activity | Disk Device Busy | >80% | >95% |
Filesystems | Filesystem Space Available | <20% | <5% |
Load | CPU in I/O wait | >60% | >80% |
Load | CPU Utilization | >80% | >95% |
Load | Run Queue (5 min average) | >2 | >4 | The run queue is normalized by the number of CPU cores.
Load | Swap Utilization | >75% | >90% |
Load | Total Processes | >15000 | >25000 |
Load | Logical Free Memory % | <20 | <10 |
Load | CPU in System Mode | >20% | >40% |
Network Interfaces Summary | All Network Interfaces Combined Utilization | >80% | >95% |
Switch/Swap Activity | Total System Swaps | >3 | >5 | Value is per second.
Paging Activity | Pages Paged-in (per second) | | | The combined value of Pages Paged-in and Pages Paged-out should be <=1000.
Paging Activity | Pages Paged-out (per second) | | | See the comment for Pages Paged-in.

1.2.1 How to Analyze Host Metrics

These suggestions describe further analysis to undertake when a metric value exceeds its threshold. The commands provided are for the Linux operating system unless otherwise noted.

When Packet Error Rate Is Beyond Threshold

The normal cause is a misconfiguration between the host and the network switch; a bad network card or cabling can also cause this error. Run /sbin/ifconfig to identify which interface is reporting packet errors. Contact the network administrator to ensure the host and the switch are using the same data rate and duplex mode.

Otherwise, check if cabling or the network card is faulty and replace as appropriate.

When Packet Loss Rate Is Beyond Threshold

The normal cause of this error is network saturation or bad network hardware.

Run lsof -Pni | grep ESTAB to determine which network paths are generating the problem.

Then run mtr <target host> or ping <target host> and look for packet loss on that segment.

When CPU Utilization Is Beyond Threshold

Windows: Open the Task Manager, click the Processes tab and click the CPU column to sort the processes based on CPU usage.

If top processes are WebLogic Server JVM processes, conduct a basic WebLogic Server health check. That is, review logs to see if there are configuration errors causing excessive exceptions, and review metrics to see if the load has increased. Use JVMD for a more detailed analysis.

If top processes are Oracle processes, use Enterprise Manager to look for high load SQL.

High system CPU use is also frequently related to various device failures. Run dmesg | less and look for repeated messages about errors on some particular device, and also have hardware support personnel check the hardware console to see if there are any errors reported.

When Filesystem Usage Is Beyond Threshold

The normal cause is an application that is logging excessively or leaving behind temporary files.

Run lsof -d 1-99999 | grep REG | sort -nrk 7 | less to see currently open files sorted by size from largest to smallest. Investigate the large files.

Run du -k /mount_point_running_out_of_space > /tmp/sizes to get the space used by each directory under the mount point. This may take a long time. While it runs, periodically run sort -nr /tmp/sizes to find the directories using the most space and investigate those first.
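As a sketch, the du and sort steps above can also be combined into a single pipeline; the mount point /u01 is illustrative, not a prescribed path:

```shell
# List the 20 largest directories (sizes in KB) under a mount point.
# Replace /u01 with the filesystem that is running out of space.
du -k /u01 2>/dev/null | sort -nr | head -20
```

The 2>/dev/null suppresses permission-denied noise when run as a non-root user; run as root for a complete picture.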

When Total Processes Is Beyond Threshold

The normal cause is runaway code or a stuck NFS filesystem.

Linux: Run ps aux. If many processes are in status D, run df to check for stuck mounts.

Windows: Run Task Manager, click the Processes tab, and check the list of running processes.

If there are hundreds or thousands of processes of a particular program, determine why.

Run ps -eo pid,nlwp,cmd | sort -nrk 2 | head to look for processes with many threads.

When Disk Device Busy Is Beyond Threshold

Check for disk drive failure.

Linux: As root, check /var/log/messages* and /var/log/mcelog to see if there are any error messages indicating disk failure. For a RAID array, the disk controller needs to be checked. The commands will be specific to the controller manufacturer.

Windows: Run perfmon and look at the Alert logs. Run chkdsk to check for disk failure.

Look for processes that are using the disk. From a shell window, execute ps aux | grep ' D' several consecutive times to look for processes in state D (uninterruptible sleep, usually waiting on I/O).

1.2.2 How to Check for Network Connectivity Issues

Poor performance is a major indicator of network connectivity problems.

1.2.3 How to Analyze WebLogic Server Metrics

These metrics indicate whether the WebLogic Server is in a healthy state. Performance may degrade if any metric exceeds its threshold.

Table 1-2 describes the WebLogic Server metrics you should monitor in Cloud Control. See the "Creating Monitoring Templates" section in the Oracle Fusion Applications Administrator's Guide to create a monitoring template.

Table 1-2 WebLogic Server Metrics

Metric Category | Metric Name | Warning Threshold | Critical Threshold | Comments
--------------- | ----------- | ----------------- | ------------------ | --------
Datasource Metrics | Connections in Use | >250 | >400 |
Datasource Metrics | Connection Requests that Waited (%) | >10% | >20% |
Datasource Metrics | Connection Creation Time (ms) | | |
JVM Garbage Collectors | Garbage Collector - Percent Time spent (elapsed) | >10% | >20% |
JVM Metrics | Heap Usage | >90% | >98% |
Response | Status | | =Down | This provides instance availability.
Server Servlet/JSP Metrics | Request Processing Time (ms) | >10s | >15s |
Server Work Manager Metrics | Work Manager Stuck Threads | >5 | >10 |
JVM Threads | Deadlocked Threads | >2 | >5 |
Module Metrics By Server | Active Sessions | | |

When CPU Usage On Host Is Beyond Threshold and WebLogic Server Process Is Identified as Top CPU Consumer

Examine the % Time spent in the GC metric to see if JVM is doing excessive GC (>60 percent). If so, follow the process for diagnosing WebLogic Server heap pressure.

Look for incident creation rate and error logs and see if something is triggering a massive amount of logging/errors.

In JVMD, select the CPU state filter and look at top methods. Look for threads that are consistently in a CPU state.

When There Is a Spike in Active Web Sessions

Check access logs to see if there is a spike in the number of users.

Check if there are stuck threads, which could cause users to log in again.

Check session distribution across WebLogic Server managed servers and see if there is a problem with the load balancer.

Check session timeout in web.xml, and see if it is too high or too low.
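For reference, the servlet session timeout is declared in web.xml as shown in this sketch; the 35-minute value is illustrative, not an Oracle-prescribed setting:

```xml
<session-config>
  <!-- Timeout in minutes. Too high retains idle sessions and wastes heap;
       too low forces users to log in again mid-task. -->
  <session-timeout>35</session-timeout>
</session-config>
```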

When There Are Stuck Threads On the System

Get the ECID from the stuck thread error in the WebLogic Server log.

From the Request Monitor, search for the ECID and get details from JVMD.

Alternatively, use JVMD to search for stuck threads and see the timing breakdown.

A stuck thread will also result in an incident with a JFR recording. Use JRMC to analyze the recording.

When There Are Deadlocks Detected On the System

In JVMD, inspect the threads that are in a blocked state.

Deadlocked threads will normally also be reported as stuck threads in the WebLogic Server log. Use the Request Monitor to search for the ECID and drill down into JVMD to show the blocking thread.

When Request Processing Time Is Beyond Threshold

Examine the % Time spent in GC metric to see if the JVM is doing excessive garbage collection.

Look for incident create rate and error logs and see if something is triggering a massive amount of logging/errors.

In JVMD, look at the thread states and see where most processing time is going.

Check the metric Garbage Collection - Invocation Time (ms) under the JVM Garbage Collectors metric category. If you run many managed server instances on the same host, you may be able to reduce time spent in garbage collection by reducing the number of garbage collector threads in each JVM. The default is based on the number of CPUs, which can be too high when multiple active JVMs run on the same machine.

In those cases, if you are using JRockit, add the -XXgcThreads=4 option when starting the JVM. To add the option, edit the DOMAIN_HOME/bin/fusionapps_start_params.properties file, look for -Xgc:genpar, and add the -XXgcThreads=4 option after it (for example, -Xgc:genpar -XXgcThreads=4). The value 4 directs the JVM to use four threads to perform garbage collection. You can try values from 4 up to the number of CPU cores and observe whether the % Time spent in GC metric improves. For other platforms, see Section 1.3, "Tuning Platforms for Oracle Fusion Applications."

When Percent Time Spent in GC Is Beyond Threshold

Check the session count. If there is a sudden surge of sessions due to user load, the JVM could be short on heap. Increase heap if possible, or add additional managed server instances.

Look at the stuck threads count. Stuck threads can increase the number of active sessions, as users may launch new sessions hoping for a faster response.

Look at the incident creation rate and error logs and see if something is triggering a massive amount of logging/errors. The incident creation/logging operations could be causing a high amount of object creation and garbage collection stress.

Generate a heap dump using JVMD and analyze the top retainer of memory.

Use JRMC to connect and extract a JFR recording. Examine the Memory panel and allocation details to see what is doing a lot of allocations.

When Percent Connection Requests Waiting Is Beyond Threshold

Examine the number of sessions and request rate, and see if there is a spike in the load that would account for an increased demand for connections.

In JVMD, see where time is spent. For example, requests could be running longer due to slow SQLs (and retain the connection longer). In that case, identify and tune slow SQLs.

1.2.4 How to Analyze Oracle HTTP Server Metrics

Check request throughput to see if load has increased. If the increased load is expected and CPU and memory resources on the OHS host have not exceeded threshold, consider increasing ServerLimit/MaxClients and ThreadsPerChild in httpd.conf.

Check request process time on both OHS and underlying WebLogic Server to see if requests are taking longer. If WebLogic Server response time is increasing, check the key metrics for the WebLogic Server.

If possible, ensure the client browser cache is enabled to reduce the number of requests submitted.

Check OHS Response Code Metrics. If there is a sudden increase of HTTP 4xx errors or HTTP 5xx errors, check the health of the underlying WebLogic Servers.

Check and increase the minimum and maximum spare threads for Oracle HTTP Server.

In the httpd.conf file located in instance_home/config/ohs/<ohs_name>/httpd.conf:

Increase MaxSpareThreads to 800.

Increase MinSpareThreads to 200.
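The corresponding httpd.conf fragment would resemble this sketch (the worker MPM wrapper is an assumption about how your file is organized; match your file's existing MPM section):

```apache
<IfModule mpm_worker_module>
   MinSpareThreads 200
   MaxSpareThreads 800
</IfModule>
```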

When Request Processing Time for a Virtual Host Exceeds Threshold

Check the key host metrics to ensure the OHS host is healthy.

For each URL requested, OHS will first check DocumentRoot before passing the request to WebLogic Server. Check the utilization and health of the disk to which the DocumentRoot is pointing. If it is an NFS mount, check the health of the NFS mount point.

Check the key metrics for the underlying WebLogic Server(s) and see if they are healthy.

OHS accesses /tmp for each POST request, so check the performance of the /tmp filesystem.

1.2.5 How to Analyze Oracle Business Intelligence Server Metrics

These metrics provide an indication of whether the Oracle Business Intelligence Server is in a healthy state.

1.2.6 How to Analyze Oracle Identity Management Metrics

Use Cloud Control to monitor the Oracle Internet Directory and Oracle Identity Manager databases. For information on creating monitoring templates in Cloud Control to obtain metrics, see the "Creating Monitoring Templates" section in the Oracle Fusion Applications Administrator's Guide. See Table 1-4 for Oracle Identity Manager metrics.

1.2.7 How to Analyze Key Enterprise Scheduler Metrics

The metrics shown in Table 1-6 provide an indication of whether the Enterprise Scheduler instance is performing well. See the "Creating Monitoring Templates" section in the Oracle Fusion Applications Administrator's Guide.

Table 1-6 Key Enterprise Scheduler Metrics

Metric Category | Metric Name | Warning Threshold | Critical Threshold | Comments
--------------- | ----------- | ----------------- | ------------------ | --------
Completed Job Summary | Average Elapsed Time (ms) | | | You can define different thresholds for different job names.
Long Running Job | Elapsed Time (ms) | | |
WorkAssignment Metrics aggregated across Group Members | Average Wait Time for Requests in Ready State (seconds) | | |

When the Value of Average Elapsed Time for the Completed Jobs Is Higher Than Expected

Check the key host and WebLogic Server metrics and see if any component that could be involved in processing batch jobs is in an unhealthy state.

1.3 Tuning Platforms for Oracle Fusion Applications

Set Oracle Identity Management Log Levels to SEVERE

Description: Oracle Identity Management stack WebLogic Server log levels are too fine-grained and need to be set to Severe.

Solution: In all WebLogic Servers in the Oracle Identity Management domain, change log levels to SEVERE. This is a two-part process.

Part 1: Manually edit the logging.xml file in each server directory.

Edit the logging.xml file that is in each server directory of the Oracle Identity Management Domain domain, such as OAM_Server1, OIM_Server1, and SOA, and set level='SEVERE' for all log_handlers and loggers. The path to each logging.xml file will resemble:

DOMAIN_HOME/config/fmwconfig/<servername>
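As a sketch of the Part 1 edit (the handler and logger names shown here are illustrative; your logging.xml will contain many entries, and every log_handler and logger should receive level='SEVERE'):

```xml
<log_handlers>
  <!-- 'odl-handler' is an illustrative name: set level='SEVERE' on each handler. -->
  <log_handler name='odl-handler' level='SEVERE'
               class='oracle.core.ojdl.logging.ODLHandlerFactory'/>
</log_handlers>
<loggers>
  <!-- Set level='SEVERE' on each logger element as well. -->
  <logger name='oracle' level='SEVERE'/>
</loggers>
```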

Part 2: Edit the log levels in the Oracle WebLogic Server Administration Console:

Log in to the console (http://hostname:port/console).

Click the Servers link.

Click the desired server.

Click the Logging tab.

Scroll down and click the Advanced link.

In the Message destination(s) section, change the log levels to Severe.

Tune Oracle Internet Directory Connection and Process Settings

Solution: Change orclmaxcc to 10 and tune the number of OID processes:

Name the sample script config_oid_tuning.ldif. Set cn=oid1 to your component name; in a multi-component environment, change this accordingly. Set orclserverprocs to the number of CPU cores on the OID server.
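A sketch of such an LDIF script follows. The DN assumes the standard OID server-configuration entry layout (cn=<component>,cn=osdldapd,cn=subconfigsubentry), and the orclserverprocs value of 4 is illustrative:

```ldif
dn: cn=oid1,cn=osdldapd,cn=subconfigsubentry
changetype: modify
replace: orclmaxcc
orclmaxcc: 10
-
replace: orclserverprocs
orclserverprocs: 4
```

Apply it with ldapmodify against the OID instance; host, port, and credentials below are placeholders: ldapmodify -h oid_host -p 3060 -D cn=orcladmin -w password -f config_oid_tuning.ldif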

Add this entry to the config.xml file in ./oid/user_projects/domains/oid_domain/config/ and the ./oim/user_projects/domains/oim_domain/config/ directories for each WebLogic Server in the Oracle Identity Management domain:

To set the access log format, add this string to the httpd.conf file in the /u01/ohsauth/ohsauth_inst/config/OHS/ohs1 path.

LogFormat "%h %l %u %t \"%r\" %>s %b %D %{X-ORACLE-DMS-ECID}o" common

Increase Policy Cache Timeout

Description: By default, entries in the security policy cache time out every 12 hours. When these entries time out, sporadic slowness may be experienced because they need to be repopulated. To avoid this, increase the timeout value.

Solution:

Open the DOMAIN_HOME/config/fmwconfig/jps-config.xml file.

Add the following entry to the <serviceInstance name="policystore.ldap" provider="policystore.provider"> section and select a timeout value (in milliseconds):
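A sketch of the resulting stanza is shown below. The property name follows the OPSS LDAP policy store convention and the value 86400000 ms (24 hours) is illustrative; confirm the exact property name for your release before applying it:

```xml
<serviceInstance name="policystore.ldap" provider="policystore.provider">
  <!-- existing properties retained -->
  <!-- Cache timeout interval in milliseconds; 86400000 ms = 24 hours. -->
  <property name="oracle.security.jps.ldap.policystore.refresh.interval"
            value="86400000"/>
</serviceInstance>
```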

1.4 Tuning Oracle HTTP Server

This section lists several Oracle HTTP Server configuration changes that may improve performance. These settings are the defaults if your environment is newly provisioned. If your environment is upgraded, you must apply these setting changes manually.

Follow these steps to tune the Oracle HTTP Server.

Avoid restarts of httpd-worker processes by increasing MaxSpareThreads and MinSpareThreads.

These restarts affect the recreation of connections and threads in Oracle HTTP Server processes during varying load patterns and could negatively affect performance. The recommendation is to increase the minimum and maximum spare threads for Oracle HTTP Server to 200 and 800 respectively.

Edit the httpd.conf file located in instance_home/config/OHS/<ohs_name>/httpd.conf so it resembles:
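A minimal sketch of the relevant directives (the MPM wrapper is an assumption about how your httpd.conf is organized):

```apache
<IfModule mpm_worker_module>
   MinSpareThreads 200
   MaxSpareThreads 800
</IfModule>
```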

Set FileCaching OFF in the <instance home>/config/OHS/<ohs name>/mod_wl_ohs.conf file.

By default, the Oracle HTTP Server WebLogic plug-in first writes the content of any POST request larger than 2 KB to the /tmp directory. In some cases, if there are many concurrent accesses to /tmp, or if the underlying disks are busy servicing other processes on the same host, you may notice periodic spikes in response time.

Setting FileCaching to OFF in the mod_wl_ohs.conf file, as shown here, will disable this step and should eliminate the spikes and improve performance.
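A sketch of the change follows; whether your plug-in parameters live at the top level of mod_wl_ohs.conf or inside a <Location> block depends on your configuration, so place the directive alongside your existing parameters:

```apache
<IfModule weblogic_module>
   FileCaching OFF
</IfModule>
```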

Note that there is one scenario where it may be beneficial to leave FileCaching ON. If there are many clients connecting through a slow network, it may be beneficial to first write the POST data to /tmp. Otherwise, resources in Oracle HTTP Server and WebLogic Server will be tied up waiting for the POST data to arrive, and would not be able to service other requests. If you notice response times degrading after turning off file caching, you may be running into this scenario and you should reverse the setting.

Increase the ThreadsPerChild setting from 50 to 250. Oracle HTTP Server processes maintain shared resources, such as connection pools to back-end servers, and memory for various purposes such as storing a cache of static files, and storing server information by pinging servers to obtain a dynamic server list from the cluster.

It would be efficient to have more threads created within a process so that they can effectively share the common resources. Setting more threads per process reduces the memory footprint and improves the efficiency of using the connection pool to the back-end servers. To create more threads within a process, change the details in the <instance home>/config/ohs/<ohs name>/httpd.conf file, as shown here.
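The relevant httpd.conf change is sketched below (the worker MPM wrapper is an assumption; if you change overall capacity, adjust ServerLimit and MaxClients together with it):

```apache
<IfModule mpm_worker_module>
   ThreadsPerChild 250
</IfModule>
```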

By default, a connection between an Oracle HTTP Server and a WebLogic Server is closed if it is idle for 20 seconds.

Since there is a cost in re-establishing these connections, it is beneficial to increase this timeout to 5 minutes. To do this, the mod_wl_ohs.conf and config.xml files for the target WebLogic Server domains need to be changed.

Edit the <instance home>/config/OHS/<ohs name>/mod_wl_ohs.conf file so it resembles this example:
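A sketch of the mod_wl_ohs.conf side of the change: KeepAliveSecs is the WebLogic proxy plug-in's idle-connection timeout, and 300 seconds corresponds to the five minutes recommended above. Verify the matching WebLogic-side keep-alive setting in config.xml for your release:

```apache
<IfModule weblogic_module>
   KeepAliveEnabled ON
   KeepAliveSecs 300
</IfModule>
```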

Add this line to the <instance home>/config/OHS/<ohs name>/moduleconf/mod_deflate.conf file.

SetEnvIfNoCase Request_URI \.(swf)$ no-gzip dont-vary

The .swf files are already compressed; there is no need for Oracle HTTP Server to compress them again.

Set the Expires header for Business Intelligence static resources. Adding the HTTP Expires header for static resources related to business intelligence improves performance by allowing the browser to locally cache those artifacts instead of making repeated requests for them.
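One way to do this is with Apache's mod_expires; the MIME types and one-week lifetime below are illustrative choices, not Oracle-prescribed values:

```apache
<IfModule mod_expires.c>
   ExpiresActive On
   ExpiresByType image/png "access plus 1 week"
   ExpiresByType image/gif "access plus 1 week"
   ExpiresByType text/css "access plus 1 week"
   ExpiresByType application/x-javascript "access plus 1 week"
</IfModule>
```

Longer lifetimes reduce repeat requests further but delay clients from seeing updated artifacts; pick a value that matches your patching cadence.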