Main menu

Monthly Archives: March 2016

Post navigation

I have been to many clients for the past few years and one of common issues is the excessive logon activities. From what I observe, excessive logon activities are mainly from two areas.

1) The first one is the logon storm. This behavior usually happens in a database that is shared by multiple applications. As poorly performed queries from one application can easily have impact on the performance of other applications, other application teams might feel not have enough db resource dedicating to their application and use their own ways to get the resource they needs by firing up hundreds of threads within a tiny time window, logon to db, do some insert and update work, and logoff. This one can be easily spotted from OEM by a huge spike in Active Sessions. I am not going to discuss this one in this blog, but talking about the second category.
2) The second category is almost unnoticeable, continuous logon activities that comes in a way like memory leak. Database performs ok, but over time audit table AUD$ grows at a much faster rate than it should be. At the time when AUD$ table is so big, any insert or query activities on AUD$ table can be easily seen as one of top 10 queries on OEM. Just like the house surrounded by the water. When want to get out, it is too late. This issue is usually caused by the design or configuration issue on application side. This blog discusses how to identify this issue from database side and provide useful information for application side to track down the cause of the issue.

How to identify whether we have excessive logon activities?
First, goto OEM’s Performance Home page and pay attention to the Logon activities at the bottom of the pages. If you see a constant line with logons > 0 all the time, it means something is not right in the logon, especially if the database is small and not many users are using it. The following screenshot shows a logon rate of 10 all the time. We could have logon rate of 10 for certain time with a pattern of up and down, but sustain this number all the time usually means issue in logon activities.

Next, I usually run the following script to capture who are the top contributors of this excessive logon activities. It shows the users with logon frequency > 100 for the past 15 minutes.

The above result shows APP_RTL01_ORD user has the largest number of logon/logoff activities and has about 12,000+ logon activities from two hosts, kprodw447 and kprodw448.

What’s the logon activities look like?
This kind of excessive logon session usually has a very short life: logon, executes one or a few queries, and then gone. So I have to run the following query multiple times within a few seconds to create my own snapshot. For this test, I am checking APP_RTL01_ORD user and kprodw447 host.

The following is the execution result. We can see only the first two sessions (SID 1329 and 2330) are in the all of three snapshots while the rest of sessions are gone. If those sessions are gone, it means the old sessions were closed and new sessions are opened. So I picked the 2nd snapshot in the middle and used the sessions with audid column is red to do further analysis.

You might notice there is audsid column in the query. The reason why I includes it in the query is that there is no SID and Serial# columns in SYS.AUD$ table. AUDSID is Auditing Session ID, a unique identifier for the user session. Confusingly, this column is called AUDSID in V$SESSION, but SESSIONID in SYS.AUD$. When a new user session is created, the SYS.AUDSES$ sequence is incremented using NEXTVAL and the new value is assigned to the session as its AUDSID. In case of sessions connecting as ‘internal’, AUDSID is set to 0 and SYS.AUDSES$ sequence is not incremented. The common used query select userenv(‘SESSIONID’) from dual is actually getting this AUDSID.

How long each logon session last?
Although I feel the session has short life, I don’t know the exact time for the session. So next step, I need to pull the logon and logoff records from dba_audit_trail. I used the audsid information captured from the last step and run the following query.

The result proves my assumption is right. All of these session shows the same pattern: logon and logoff within a second.

What kind of queries used in these short life sessions?
In the 2nd step, I marked those queries in red and let me check out the content and execution history of the queries. As usual, I used Kerry‘s dplan.sql and awr_plan_change.sql to get plan content and execution information.

Among these queries, pay more attention to SQL bunvx480ynf57 in red above. I remembered when I run my check out logon activities scripts at around 2:30pm on March 1, there was 15,000 logon for the past 15 minutes. The above screenshot of SQL execution history shows the total number of execution of SQL bunvx480ynf57 is around 28,000 executions in a 30 minute snapshot window for the same time period. So it seems there is a strong correlation between this query and frequent logon activity. To verify my assumption, I recheck the total logon/logoff activities between 2pm and 3pm today on Mar 2.

Yes, my assumption is right. There are 43,600 (34,000 + 9,600) logon/logoff between 2pm and 3pm on Mar 2 while the total number of execution of SQL bunvx480ynf57 during the same timeframe is 19,000 + 26,000 = 45,000. These two numbers are very close.

Based on this finding, I would say the main purpose of this frequent session logon/logoff thing is related to SQL bunvx480ynf57, or SELECT 1 FROM DUAL. For SELECT 1 FROM DUAL query, it is usually used at application layer to keep session alive if the session is idle for too long and avoid session to time out. In addition, the action for LOGOFF is For LOGOFF BY CLEANUP, it usually means the connection on client-side vanishes without a clean disconnect. It’s the client-process that “dies abruptly” and then database PMON process cleans up these abandoned connections. Therefore, it seems to me that this keep alive logic does not work as expected at application layer. Anyway, this is a purely an application design or configuration issue.

Additional Note:
My client’s DBA, Rick, also brought up another nice point about the possibility of this issue at application layer: another common use for ‘SELECT 1 FROM DUAL;’ is to ‘test’ connections. However this model is falling out of favor. The combination of the excessively high login/logout rate and the most frequent sql listed could be caused by improper connection pooling config OR usage. Example of bad usage would be issuing an explicit CLOSE() in the code instead of a RELEASE_CONNECTION() or something better.

When Test Connections on Reserve is enabled, the server instance checks a database connection prior to returning the connection to a client. This helps reduce the risk of passing invalid connections to clients.

However, it is a fairly expensive operation. Typically, a server instance performs the test by executing a full-fledged SQL query with each connection prior to returning it. If the SQL query fails, the connection is destroyed and a new one is created in its place. A new and optional performance tunable has been provided in WebLogic Server 9.x within this “test connection on reserve” feature. The new optional performance tunable in 9.x allows WebLogic Server to skip this SQL-query test within a configured time window of a prior successful client use (default is 10 seconds). When a connection is returned to the pool by a client, the connection is timestamped by WebLogic Server. WebLogic Server will then skip the SQL-query test if this particular connection is returned to a client within the time window. Once the time window expires, WebLogic Server will execute the SQL-query test. This feature can provide significant performance boosts for busy systems using “test connection on reserve”.