The idea to create this new thread about OBIEE 11g Performance Tuning as it is a real challenge and very few people even the experts have limitation to cater all tuning aspects. This is only because a monster seating in front wheel which is driver of all stuff called “Weblogic” and magnitude of knowledge with evolve of Weblogic 11g has certain limitations.

We are going to explore nuts and bolts of all possible tuning aspects which essentially can help a lot of people. The scope of this blog considers tuning OBIEE 11g(11.1.1.7 )Linux platform which is more stable and more reliable. We have huge # of customers accessing analytics everyday with high # of queries. The Customer base is huge each day 24×7 accessible across every part of the universe. The reports and pages rendering needs to run very fast without any performance concern at-least that is the expectation. Having this in mind we started tuning 11g platform and performed regression test 5 times in row. Below are the tuning what we did so far in different layers of the application. Overall it was a great testing effort conducted ! We did performance testing with 1000 concurrent user ramped up in the system within 1 minutes. We use performing testing tool to emulate user login using a software and browse through across different dashboard pages and the logout mimicking users activities.

Lets see what are the different layers of Performance tuning we have done on our platform:

Operating System(OS) / Hardware Level Tuning

We have 3 Virtual servers in Horizontal cluster and each of them running only 1 instance. No Vertical clustering. All in Highly available platform with Fail-over and Load balancing capability and has future scalability opportunity. Note those servers are not Physical yet. If you are planning to do Physical that will give more dedicated power and Oracle recommend this but as of now if you plan to do with Virtual it will give you similar capability if you mastered it with proper level of tuning.

Assign Proper Hardware resource: Initially we had 3 CPU Cores but with the increase # of loads on CPU we increased that to 8 Virtual CPU’s to balance out the load. We have Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz and each server has 128 GB of RAM.

Change TCP Max Sync: When the server is heavily loaded or has many clients with bad connections with high latency, it can result in an increase in half-open connections which is called “tcp_max_syn_backlog” . To give CPU enough room to handle too many open files and to avoid OutofMemory error while creating new native threads we have tweaked couple of below parameter in /etc/security/limits.conf file. Let root edit file to include below: Also check the parameter using ulimit -n and ulimit -u command in Linux.
* soft nofile 131072
* hard nofile 131072
* soft nproc 131072
* hard nproc 131072

Change TCP Fin Timeout: cat /proc/sys/net/ipv4/tcp_fin_timeout , this will give you the existing #. If the # is 60 make it 30 . Reducing the value of this entry, TCP/IP can release closed connections faster, providing more resources for new connections.

You might thing of to change: /proc/sys/net/ipv4/tcp_keepalive_time to be less than 7200. Follow this to make both the changes on TCP level is updated in system:

I have face a very weird scenarios where we had to touch Cisco router settings to Allow Urgent Flag(Disable Clear Urgent Flag) to have better communication between DB server and OBIEE servers. We had issue where Usage Tracking causing trouble to render one prompts .. ! Sounds weird ! yes its true , it was a connection issue between UT table in DB and UI causing trouble for one of the multi-select prompt never rending and hanging forever. Also I have seen issues where Scheduler process has severe intermittent connection issue with DB (RCU Schemas). Both of them are addressed by this Cisco Firewall level changes below:

Middle Tier (Weblogic EM/Console) Layer Tuning

1) In EM against data source Increase # of Connection Pool to 150. This includes all the data source added there as per below image. For us we haven’t touch default 50 and you might want to do that. Increasing this value could cause potential impact on Database as it will try to open that many open cursors and processes to the database server.

2) We have tuned 64 bit JVM’s. In JRockit JVM (R28.x), the heap grows faster than before. The JVM also ensures that the heap size grows up to the maximum Java heap size (-Xmx) before an OutOfMemory error is thrown.So we have tweaked some JVM related parameter under below file: [MiddlewareHome]/user_projects/domains/bifoundation_domain/bin/setOBIDomainEnv.sh

3) Tune Analytics (WebLogic Server app plug-in) Connection pool:

There is a connection pool between WebLogic Server analytics app and OBIPS, and the default value is 128 inadequate for a large number of concurrent users which is typically expected in a BIEE system with high users concurrency. When the number of connections reaches the maximum limit, any new requests are kept waiting. Hence, it is recommended to increase this pool to 512 for your BIEE system to support more concurrent users.

This is very important as highlighted it in RED. It is also known as HOGGING Thread . In a high user volume testing environment seeing this STUCK Thread issue is pretty common in log files. We need some exceptional level of tuning to get rid of this STUCK thread issue.

WebLogic Server automatically detects when a thread in an execute queue becomes “stuck.” Because a stuck thread cannot complete its current work or accept new work, the server logs a message each time it diagnoses a stuck thread. A thread might get stuck due to various reasons.

For example: When large BI report is running and the time it takes to complete is say 800 seconds, then, as the default stuck thread timing is 600 seconds in WebLogic Server, the thread allocated for that query waits for 600 seconds and goes to stuck state.

As best practice we have changed basic Thread Parameters as below under Console. And this needs to be done all the Managed servers clustered participating nodes including Admin server.

In our environment we have faced this issue multiple times and we took couple of thread dump when STUCK thread occurs and BI Server stops responding and stop accepting any new connections:

A typical thread dump has below details and observe that there are lot of TIMED WAITING events:

Below are 3 important parameters which tell you the server health plus Hogging thread count . If you see Hogging thread count is potentially high then you are in trouble with STUCK Thread . This below snapshot is very ideal managed server nodes with no Hogging/Stuck thread at all.

However this is most notorious STUCK thread issue . See the Count of thread waiting here. You must be in trouble in such scenarios:

Memory consumption will tell you no issue with CPU / RAM . See this for 3 clustered nodes. Each has 128 GB memory though we are using hardly 15% of it and CPU is 50% free almost every time. So its not an Hardware bottleneck for sure. Network bottleneck ??? May be ! right now in our optical fiber network channel I am assuming its not an issue.

When you see Hogging thread count exceeds 20+ you will get below kind of error in dashboard page for all sessions already In and no new sessions will be able to open connection and will throw login error:

Now see the way you can get details of which request is Hogging . It will not tell you exact dashboard and report name but will tell you which application part/URI request at very high level. When more and more such request piled up in Middle-tier it will results a unresponsive server and application will not be able to respond which potentially leads to server crash.

More such requests you can find here . All became STUCK or HOGGING thread .

See JDBC Data source performance as well:

Below shows 2 nodes of 4 node Exadata RAC performance so STUCK Thread is not a bottleneck due to Database and very less usage/load has been observed in DB nodes.

HOW TO TRACE STUCK THREAD and SEE where is the bottleneck:

Always leverage EM to see the performance in sessions plus performance in GET and POST request.

Use WLS EM to find out the load of various components:

Try to use Performance URL like below to get the online details during the regression test execution http://<server:port>/analytics/saw.dll?Perfmon

Load this , analyzed this to get some alarming #’s which will tell you precisely where the problem lies.

<CatalogXml> <!– Remove from the cache everything older than N minutes –> <MaxAgeMinutes>240</MaxAgeMinutes> <MaxLastAccessedSeconds>14400</MaxLastAccessedSeconds> </CatalogXml> <Query> <MaxEntries>5000</MaxEntries> <!– AbsoluteMaxEntries is the enforced maximum number of entries. When this maximum is reached –> <!– subsequent queries will fail until the maximum is no longer exceeded. –> <AbsoluteMaxEntries>20000</AbsoluteMaxEntries> <!– CruiseEntries is amount of entries the OracleBI Presentation server tries to maintain in its cache. –> <CruiseEntries>3000</CruiseEntries> <!– Forces the cache to attempt to remove an old entry when MaxEntries is exceeded. –> <ForceLRU>true</ForceLRU> </Query> </Cache>

As we know that Starting point of OBIEE Authentication is “Authentication Init block”. We have found minor lags on query performance because it was taking more than expected time and ~7-8 seconds. Since our Init block query is heavy with almost 20 joins the concurrent user login could cause a issue for us. So we moved the entire query to Database procedure , use TABLE functions and pass the :USER to query only and get the variables populated by the returned value of the functions. We have seen 50% performance improvement in this process as it was leveraging the DB cache feature a great extent

During 10g to 11g upgrade RPD connection pools are not changed. However best practice is to separate connection pools for init block and connection pools used for Datawarehouse query. So we have segregated one connection pool to two and then assign 30% of Datawarehouse connection pool to Init block totaling ~ 760 connections to 1 RPD spread across 5 databases.

Apply patch at server level for “ADF Warning” which will reduce ADF warning in logs

Make LOGLEVEL for users not more than 0.

Web/Front Layer Tuning

Bandwidth Savings: 1) Enabling HTTP compression can have a dramatic improvement on the latency of responses. By compressing static files and dynamic application responses, it will significantly reduce the remote (high latency) user response time.
2) Improves request/response latency: Caching makes it possible to suppress the payload of the HTTP reply using the 304 status code. Minimizing round trips over the Web to re-validate cached items can make a huge difference in browser page load times.

This uses a web accelerator mechanism to compress HTTP files (static and dynamic content) and add caching mechanism to help rendering the UI faster. Also in URI and Content compressor the packets transmits the web reduced to 10x in number of bytes and increase the usability of network traffic and faster hand-off between clients and server request. See this and learn more about HTTP Compression: