Changing the above DB parameters also helped my clients achieve record transaction volumes they had never seen before. But yesterday was not at all a high-transaction day, and we hadn't performed the machine restart. From morning onwards, ADDM was showing high concurrency, TCP socket waits, and latch contention.

And the node was evicted as usual at the busy business hour, at 5:40pm, due to a heartbeat fatal error.

When one of the nodes was being evicted, the other node's alert log showed this:

1) You seem to be getting some advantage from the restart, but that doesn't make sense because by restarting the database you are emptying the shared pool and buffer cache, meaning all SQL will need to be reparsed when it comes to the server, which will add to your latch contention.

2) You mention RAC. Have you put any time into application partitioning? RAC is really good for READ-READ operations and not too bad at WRITE-READ operations, but pretty awful at WRITE-WRITE operations. By that I mean when both sides of the cluster are doing writes to the same tables. Why? Because they invariably start pinging blocks back and forward over the interconnect. I've never seen a good implementation of RAC where application partitioning has not been implemented. By that I mean, try and focus each node on different aspects of the work. You can do this using services with preferred nodes. If a node goes down, the service will still run on the other nodes, but while it is up, it will make sure the nodes aren't fighting with each other over resources.
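As a rough sketch of that (the database, instance and service names here are made up), on 10g/11g you could create such a service with srvctl, giving it a preferred and an available instance:

    # Hypothetical names: database ORCL with instances ORCL1 and ORCL2.
    # The "batch" service prefers ORCL1 and fails over to ORCL2.
    srvctl add service -d ORCL -s batch -r ORCL1 -a ORCL2
    srvctl start service -d ORCL -s batch

Point the batch side of the application at the "batch" service in its connect descriptor, and give your other workloads their own services in the same way.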

3) A little googling on "High TCP Socket KGAS" suggests this is a problem with your network, or the TCP server your database is communicating with. Are you doing something like lots of emails with UTL_SMTP, connecting to a slow mail server? Either that, or you just don't have enough network bandwidth to the server? You really need to find out what sessions are causing this event and what work they are doing to cause it.
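As a minimal sketch of that hunt, assuming the event shows up as "TCP Socket (KGAS)" as it does on 10g/11g:

    -- Which sessions are waiting on TCP right now, and what SQL are they running?
    SELECT inst_id, sid, serial#, username, program, sql_id
    FROM   gv$session
    WHERE  event = 'TCP Socket (KGAS)';

Once you have the sql_id values, you can see exactly what those sessions are doing and why they are talking over TCP.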

4) Cursor Sharing : This is typically down to not using bind variables. Adding more shared pool will not help this. It will just give you a massive shared pool full of unique statements. If you can't correct the code you should consider using the CURSOR_SHARING parameter to force bind variables. Be *careful* though. This can have a negative impact also. You have to test it properly.
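If you do test it, you can trial the setting in a single session before committing the whole instance to it:

    -- Trial it on one session first.
    ALTER SESSION SET cursor_sharing = FORCE;

    -- Only after proper testing, consider the instance-wide change.
    ALTER SYSTEM SET cursor_sharing = FORCE SCOPE=BOTH;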

5) Contention for latches : You don't say which ones, so I'm guessing this is library cache and down to too many hard parses again. Sort out the bind variables and things will get better.
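A quick sanity check on the parse activity is to compare these two statistics, accumulated since instance startup:

    -- A high ratio of hard parses to total parses points at missing bind variables.
    SELECT name, value
    FROM   v$sysstat
    WHERE  name IN ('parse count (total)', 'parse count (hard)');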

Partitioning is not the simplest thing to discuss because the answer will nearly always be, "It depends!".

Questions:

1) How is the table typically used? If access is always via UK or PK, then the size of the table is not a big factor in access speed. Binary chops of indexes are very quick!

2) What are you hoping to improve by use of the partitioning? Performance or management?

3) If you want to improve performance, then partition pruning will only ever come into play if the partition key is in the predicates of the SQL being fired at the table. If not, then performance may be worse. (There's a small sketch of this after these questions.)

4) Have you done load testing using different partitioning schemes, and different index partitioning schemes on those partitioned tables?

Remember, there is no one-size-fits-all solution. If it were that simple the database would do it automatically...
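To illustrate point 3 with a made-up example (the table and column names are hypothetical), pruning only kicks in when the partition key appears in the predicate:

    -- Hypothetical range-partitioned table keyed on a date column.
    CREATE TABLE txn (
      txn_id   NUMBER,
      txn_date DATE,
      cust_id  NUMBER
    )
    PARTITION BY RANGE (txn_date) (
      PARTITION p_q1 VALUES LESS THAN (TO_DATE('2011-04-01','YYYY-MM-DD')),
      PARTITION p_q2 VALUES LESS THAN (TO_DATE('2011-07-01','YYYY-MM-DD'))
    );

    -- Prunes: the partition key is in the predicate, so only one partition is read.
    SELECT * FROM txn WHERE txn_date >= TO_DATE('2011-05-01','YYYY-MM-DD');

    -- No pruning: every partition has to be scanned.
    SELECT * FROM txn WHERE cust_id = 42;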

1. The table I am talking about is a very large table, used mainly for bulk inserts and selects. Access is mainly via an FK. The problem with the partitioning I talked about is that it contributes to CONCURRENCY waits.

2. By the use of partitioning, I want to reduce the concurrency and ultimately improve the performance.

3. Here I made a mistake: the partitioning is done on a SYSDATE column, which is not the FK column used in the SQL statements.

4. I will do it ASAP.

Now I am asking a very generic question: to reduce the concurrency waits, and thereby improve the performance, would you suggest HASH PARTITIONING?

Honestly speaking, you have opened my eyes to many ideas. I wasn't able to get off to a good start, but your replies to my posts have helped me make a better one.

What are the concurrency issues you are seeing? Is it hot blocks that are being transferred over the interconnect? If so, then there are several things you can do to reduce this...

- Reverse key indexes reduce the likelihood of consecutive rows being indexed in the same block.
- A smaller block size reduces the number of rows stored in the same block.
- Your HASH partitioning may help here, but don't underestimate the impact of having to calculate the hash for every SQL statement that interacts with the table.
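As hedged sketches of the first and third points, reusing the hypothetical names from the earlier example:

    -- Reverse key index: byte-reverses the key, so sequence-generated
    -- values no longer pile into the same right-hand leaf block.
    CREATE INDEX txn_id_ix ON txn (txn_id) REVERSE;

    -- Hash partitioning: spreads rows (and therefore insert contention)
    -- across partitions based on a hash of the key.
    CREATE TABLE txn_hashed (
      txn_id  NUMBER,
      cust_id NUMBER
    )
    PARTITION BY HASH (txn_id) PARTITIONS 8;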

I got a query to find out the hot blocks for the session, and learnt that most of the tables have no indexes, many are not analyzed, and they are frequently accessed, mainly for inserts and selects. Please have a look at the query below.

The objective of the SQL I shared with you is to find the hot blocks. I have now removed the aggregate functions as you suggested, but the modified query is taking a lot of time; even after waiting for 15 minutes it returns no result, so ultimately I terminate it using "Ctrl+C".

My objective is to find which segment is the victim of a given wait event, say "gc buffer busy".
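For what it's worth, here is a minimal sketch of that kind of lookup, assuming the usual 10g/11g convention that p1 is the file# and p2 the block# for this event:

    -- Map sessions currently waiting on "gc buffer busy" (or its 11g
    -- acquire/release variants) to the segment owning the block.
    SELECT e.owner, e.segment_name, e.segment_type, COUNT(*) AS waiters
    FROM   gv$session s, dba_extents e
    WHERE  s.event LIKE 'gc buffer busy%'
    AND    e.file_id = s.p1
    AND    s.p2 BETWEEN e.block_id AND e.block_id + e.blocks - 1
    GROUP  BY e.owner, e.segment_name, e.segment_type
    ORDER  BY waiters DESC;

Be warned that DBA_EXTENTS is itself expensive to query, which may be exactly why your modified query runs for 15 minutes. Filtering on a specific file_id first, or using the CURRENT_OBJ# column of V$ACTIVE_SESSION_HISTORY instead, is usually far cheaper.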

Today ADDM reported:

Finding: Global Cache Service Processes (LMSn) in other instances were not processing requests fast enough.

Recommendations:
Action: Increase throughput of the Global Cache Service (LMSn) processes. Increase the number of Global Cache Service processes by increasing the value of the parameter "gcs_server_processes". Alternatively, if the host is CPU bound, consider increasing the OS priority of the Global Cache Service processes.
Rationale: The value of parameter "gcs_server_processes" was "2" during the analysis period.

Inter-instance messaging was consuming significant database time on this instance.

As you suggested in your first reply to my post:

"2) You mention RAC. Have you put any time into application partitioning? RAC is really good for READ-READ operations and not too bad at WRITE-READ operations, but pretty awful at WRITE-WRITE operations. By that I mean when both sides for the cluster are doing writes to the same tables. Why? Because they invariably starting pinging blocks back and forward over the interconnect. I've never seen a good implementation of RAC where application partitioning has not been implemented. By that I mean, try and focus each node on different aspects of work. You can do this using services with a preferred nodes. If the node goes down, it will still work on the other nodes, but while it is up, it will make sure the nodes aren't fighting with each other over resources.

3) A little googling on "High TCP Socket KGAS" suggests this is a problem with your network, or the TCP server your database is communicating with. Are you doing something like lots of emails with UTL_SMTP, connecting to a slow mail server? Either that, or you just don't have enough network bandwidth to the server? You really need to find out what sessions are causing this event and what work they are doing to cause it."

I spoke with my network team to enquire about the network bandwidth issues. They do admit to bandwidth problems at times, but they say there are no TCP errors. As I don't understand much about networks and interconnects, I would request your help.

There are two approaches to improving the performance of the interconnect:

1) Make the interconnect faster, like buying a faster interconnect such as InfiniBand, or using jumbo frames to reduce the impact of large block sizes over small frames. Your network guys should understand this point.
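As a hedged illustration on Linux (the interface name is made up, and the switch has to support the larger MTU end to end), jumbo frames are just a bigger MTU on the interconnect NICs:

    # Check the current MTU of the (hypothetical) private interconnect NIC.
    ip link show eth1

    # Raise it to a jumbo frame size of 9000 bytes.
    ip link set eth1 mtu 9000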

2) Reduce the load on the interconnect, so any interconnect performance issues are not such a big deal. This is where application partitioning comes in.

So, my first suggestion to you would be to identify distinct workloads in your system. Create a service for each of them, each with a preferred node, and try to partition your application, thereby reducing the traffic over the interconnect. As I said previously, I've never seen a successful implementation of RAC when people have ignored application partitioning.

I do not believe the gcs_server_processes setting is relevant. Look at the docs for this. The default should be adequate. Making it bigger will make more processes available, which will all be trying to do comms over a slow network. That is never going to help.

As suggested, we have implemented jumbo frames on our network (interconnects) and saw definite improvements up to last Friday. But today was a high-transaction business day, as per the expectations and calculations made by my client. At around 5:30pm, when total sessions reached around 1200 on node 1 and 1349 on node 2, node 2 was evicted by node 1 with a heartbeat fatal error. I am still seeing intermittent spikes in total sessions as well as active sessions.

A node eviction will probably not be caused by that. It is more likely that something else is the cause...

If the CPU usage stays consistently high for a long period of time, the clusterware can go screwy. Why? Because the clusterware relies on the network. Networking involves the CPU. If the CPU is being hammered, the network processing on the server may slow down enough to make the cluster think the heartbeat is broken, hence node eviction. This is only one scenario of course.
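For reference, the window the clusterware allows before declaring the network heartbeat dead can be checked with crsctl (10g/11g syntax; don't change it without Oracle Support's blessing):

    # Show the CSS misscount: seconds of missed network heartbeats
    # tolerated before a node is evicted.
    crsctl get css misscount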

Questions:

1) With 1000+ sessions connecting to the database, are you using dedicated or shared server? I would have thought that many user processes would not be a great idea, so you should consider testing shared server to see if the overall footprint of your connections is reduced.
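On that first question, a quick way to see what you currently have (sessions connected via shared servers show SERVER = NONE while they are idle):

    -- Count sessions by server model on each instance.
    SELECT inst_id, server, COUNT(*) AS sessions
    FROM   gv$session
    GROUP  BY inst_id, server
    ORDER  BY inst_id, server;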

2) What is actually connecting to your server? Are your applications using connection pooling? If so, how many connections do they spawn? If you check out Graham Wood's talk on connection pooling, you will see you shouldn't have more than 1-10 connections per core hitting the database. Beyond that, the performance tails off badly. Connection pooling is designed to reduce the footprint of connections, reducing wasted resources on your database servers.

As per your suggestions, we have seen that there are no interconnect issues after the implementation of jumbo frames between the two DB instances. However, as you mentioned, the network also depends on the CPU, and heavy clogging of the CPU by active waiting sessions results in poor network performance and hence eviction; this is still happening on high-transaction days.

Our observation: it is not <1> a high number of sessions per node, but <2> a high number of sessions with a high number of waiting active sessions (waiting for CPU), that leads to heartbeat failure and thus node eviction.

But we are facing another peculiar issue after the introduction of jumbo frames, and this one is unpredictable and has nothing to do with high sessions (per node) or a high number of waiting active sessions. The problem is that one of the two instances ends up with hardly any sessions, say just 1 or 2, while the other instance takes all the load (say 1010, 998 or 1300 sessions, etc.). What we are doing is a "SHUTDOWN ABORT" of the instance with few sessions, followed by STARTUP. After waiting a few minutes we see that the total number of sessions on that instance gradually increases and eventually the session load gets balanced.