We are having very annoying connection issues. Every 2-3 days the JDBC driver gets stuck while executing a request to the Oracle database. This, of course, only occurs in the production environment; we have never experienced this issue in either our pre-production environment or our test environment. When this happens we have no choice but to reboot the server.

Every 2-3 days the application gets stuck while executing a request. Looking into the problem, we always find that:
- We have to reboot the whole server to get the application back up and running.
- The ojdbc driver has no errors in its log (we have installed the debug version and set the log level to WARN).
- On the Oracle server we have the following error:

- The firewall between the database and the server is also used in pre-production. Therefore, I would expect to have the same problem if the firewall were the cause.
- We have set up all of the following timeouts, with no effect:
  - oracle.net.CONNECT_TIMEOUT=3000
  - oracle.net.READ_TIMEOUT=10000
- The application is composed of roughly 10 jobs that query the database regularly (every second for some jobs, every hour for others).

We have been having this problem for months now, and we are running out of ideas. Does anyone have suggestions to help us investigate this problem? Or can anyone point out where the problem could be?

Thanks for posting the version and app information, that is helpful. But what version of Java are you using?

But when you post, you not only need to describe the problem; you also need to be very specific about what steps you took to diagnose it and what indications you saw that relate to it.
>
the jdbc driver gets stuck
. . .
the application gets stuck
>
Sorry - but neither of these has any meaning at all. You need to describe what the symptoms are and how you came to this conclusion.
>
When this happens we have no choice but to reboot the server.
. . .
We have to reboot the whole server
>
Same with this - who said you have to reboot the server? Maybe you do and maybe you don't, but how did you reach this conclusion? What indicators did you examine to see what the server itself was doing?

Which server are you talking about? The virtual server? The database server? The Tomcat instance?

Your description isn't clear about how many physical and logical servers even exist. Is this one physical server that also hosts an embedded VM server? Or two physical servers: one with the database and one running a VM of some sort? Whose VM are you using - Oracle's?
>
On the Oracle server we have the following error :
Fatal NI connect error 12170
TNS-12535: TNS : le délai imparti à l'opération est écoulé
>
Did you search the web for references to that error? If so, did you find the forum article that suggests this?

You might also want to check MOS Doc 1286376.1 (Fatal NI Connect Error 12170, 'TNS-12535: TNS:operation timed out' Reported in 11g Alert Log)
{quote}
The posted error message is not in English. This is an international forum, and posts should be in English so people don't have to conduct their own search to understand them. That message is:
{quote}
ORA-12535: TNS: operation timed out
{quote}
When and where does this error appear? In the alert log? Before or after you reboot the server?
You show this
{quote}
- We have set up all of the following timeouts, with no effect:
- oracle.net.CONNECT_TIMEOUT=3000
{quote}
But again - you are not specific about what you did. Where and how did you make these changes? On the client, or on the server (and which server)?
Did you try setting the SQLNET.INBOUND_CONNECT_TIMEOUT as suggested in the article?
http://davidalejomarcos.wordpress.com/2011/01/18/removing-fatal-ni-connect-error-xxxx-from-your-alert-log/
Timeouts, in and of themselves, are not necessarily a problem. The problem might be that the application code using the connection isn't written in a way that it can detect the timeout and so just waits forever for a reply that isn't going to arrive. This could indicate a problem in the app code that needs to be addressed.
Which raises the question of whether it is the client or the server that is causing the timeout. With Tomcat, Spring, Hibernate and a DBCP connection pool, there are a lot of candidates just on the client. Can you tell from any of the logs what DB-related activity is happening just before the apparent 'freeze'?
In short, for a problem like this that happens sporadically, you need to collect as much information as possible using tracing, higher log levels and the like to try to isolate where the problem is occurring. It could be as simple as the connection pool dishing out a stale connection that isn't alive anymore.
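A pool-level defense against stale connections with Tomcat's DBCP pool is to validate a connection before handing it out. The fragment below is a hypothetical context.xml Resource (the resource name, URL, credentials and pool sizes are illustrative, not the poster's actual configuration); `validationQuery` plus `testOnBorrow` makes the pool discard connections that are already dead at borrow time:

```xml
<!-- Hypothetical Tomcat DBCP Resource; all names and values are illustrative. -->
<Resource name="jdbc/appDS" auth="Container" type="javax.sql.DataSource"
          driverClassName="oracle.jdbc.OracleDriver"
          url="jdbc:oracle:thin:@dbhost:1521:ORCL"
          username="app" password="changeit"
          maxActive="10" maxWait="5000"
          validationQuery="SELECT 1 FROM DUAL"
          testOnBorrow="true"
          connectionProperties="oracle.net.CONNECT_TIMEOUT=3000;oracle.net.READ_TIMEOUT=10000"/>
```

Note that borrow-time validation only catches connections that are already dead; it does not protect a connection that dies mid-query.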

Are you running out of database connections? If so, do you get a connection (from the connection pool), use it, and close it as quickly as possible using try/catch/finally blocks, or are you holding onto connections for the duration of a person's login session (which is bad)?
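As a minimal sketch of the borrow-use-close discipline described above: `FakeConnection` is a hypothetical stand-in for the `java.sql.Connection` a DBCP pool hands out (for a pooled connection, `close()` returns it to the pool rather than tearing it down), so the example runs without a database.

```java
// Sketch of borrow-use-close with try-with-resources.
// FakeConnection is a hypothetical stand-in for a pooled java.sql.Connection.
class FakeConnection implements AutoCloseable {
    boolean closed = false;
    String query(String sql) { return "1 row"; }   // stand-in for executeQuery
    @Override public void close() {
        closed = true;                             // a real pooled close() returns the connection
        System.out.println("connection returned to pool");
    }
}

public class PoolUsage {
    public static void main(String[] args) {
        FakeConnection conn = new FakeConnection();
        // try-with-resources guarantees close() runs even if query() throws,
        // which is what keeps the pool from leaking connections.
        try (FakeConnection c = conn) {
            System.out.println(c.query("SELECT 1 FROM DUAL"));
        }
        System.out.println("closed=" + conn.closed);
    }
}
```

The same shape applies to the real `Connection`, `PreparedStatement` and `ResultSet`, each declared in the try header so they close in reverse order.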

I suggest writing the SQL statement string to the log file just before every read from or write to the database. This way, when it dies, you can look at the log and see the last SQL statement that was running. You can also log the start and end time of each SQL statement.
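A minimal sketch of that logging idea, with a `Callable` standing in for the real JDBC call (the class name, method name and SQL text are illustrative): after a hang, the last "SQL start" line without a matching "SQL end" line identifies the statement that never returned.

```java
import java.util.concurrent.Callable;

// Log each SQL statement before execution and its elapsed time after.
public class TimedSqlLogger {
    static <T> T runLogged(String sql, Callable<T> jdbcCall) throws Exception {
        long start = System.currentTimeMillis();
        System.out.println("SQL start: " + sql);
        try {
            return jdbcCall.call();
        } finally {
            // finally runs even on exceptions, so a missing "end" line means a hang
            long elapsed = System.currentTimeMillis() - start;
            System.out.println("SQL end (" + elapsed + " ms): " + sql);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical query; a real caller would pass () -> stmt.executeQuery(...)
        int rows = runLogged("SELECT COUNT(*) FROM jobs", () -> 42);
        System.out.println("rows=" + rows);
    }
}
```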

Because it's happening in production only, I suspect it may be caused by the heavy load of users. You may have a concurrency issue (two or more users accessing a non-thread-safe function at the same time).

You also might have a memory leak. A profiler may help you find the issue.

Lastly, you can put log statements at key locations throughout your code and see which one was successfully written to the log file before the application hung.
Then put additional log statements near there and run it again until it locks up. This way, you can track down the problem.

942253 wrote:
and we have never experienced this issue in either our pre-production environment or our test environment.

Do those boxes match what is in production?
Do you run those boxes in such a way that they simulate the production system?
Do the databases used for testing have similar types and volumes of data?

- The ojdbc driver has no errors in its log (we have installed the debug version and set the log level to WARN).

Presumably your problem is that there is a call to the database and it never returns which eventually results in a timeout.

The firewall between the database and the server is also used in pre-production. Therefore, I would expect to have the same problem if the firewall were the cause.

Reasonable conclusion; however, did you examine the rules? And by 'you' I mean you yourself, rather than taking the word of the IT guy.
And if the data/load is not the same, then it doesn't preclude a firewall problem.

We have been having this for months now, and we are running out of ideas now

One solution that will at least make things a bit more regular: reboot the server every night. Obviously that doesn't fix the problem, but at least it stops the server from going down at random times while you see if you can replicate the issue.

You might consider making your pre-production setup your production setup for a while to see if the problem goes away. Alternately, you might consider swapping the production server(s) with your pre-production server(s) one at a time to see if the problem goes away.

The application is based on 15 Quartz jobs that regularly connect to the database (some every second, others every hour). To execute jobs, Quartz manages a pool of threads; each time a job needs to be executed, Quartz takes an idle thread from the pool and runs the job on it.

When I said "the application gets stuck executing a request", I mean that one of these threads is stuck waiting for a function call that never returns. When I take a thread dump of the JVM, I get this:

"org.springframework.scheduling.quartz.SchedulerFactoryBean#0_Worker-16" - Thread t@33
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at oracle.net.ns.Packet.receive(Packet.java:300)
at oracle.net.ns.DataPacket.receive(DataPacket.java:106)
at oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:315)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:260)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:185)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:102)
at oracle.jdbc.driver.T4CSocketInputStreamWrapper.readNextPacket(T4CSocketInputStreamWrapper.java:124)
at oracle.jdbc.driver.T4CSocketInputStreamWrapper.read(T4CSocketInputStreamWrapper.java:80)
at oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1137)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:290)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:192)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:207)
at oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:884)
at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1167)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1289)
at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3584)
at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3628)
- locked <65c752> (a oracle.jdbc.driver.T4CConnection)
at oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:1493)
at org.apache.tomcat.dbcp.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
at org.apache.tomcat.dbcp.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
at org.hibernate.jdbc.AbstractBatcher.getResultSet(AbstractBatcher.java:208)
at org.hibernate.loader.Loader.getResultSet(Loader.java:1953)
at org.hibernate.loader.Loader.doQuery(Loader.java:802)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:274)
at org.hibernate.loader.Loader.doList(Loader.java:2533)
at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2276)
at org.hibernate.loader.Loader.list(Loader.java:2271)
at org.hibernate.loader.hql.QueryLoader.list(QueryLoader.java:452)
at org.hibernate.hql.ast.QueryTranslatorImpl.list(QueryTranslatorImpl.java:363)
at org.hibernate.engine.query.HQLQueryPlan.performList(HQLQueryPlan.java:196)
at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1268)
at org.hibernate.impl.QueryImpl.list(QueryImpl.java:102)
....

The thread is waiting for java.net.SocketInputStream.socketRead0() to return, which it never does, nor does it time out. From this stack trace I drew the conclusion (which may be wrong) that the driver is stuck executing a request to the database.
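For completeness, the same information as the jstack-style dump quoted above can be collected from inside the JVM, e.g. by a watchdog job, using the standard ThreadMXBean API; a minimal sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Dump every thread's name, state and stack trace from inside the JVM,
// so a stuck worker can be spotted without shelling out to jstack.
public class ThreadDumper {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            System.out.println("\"" + info.getThreadName() + "\" " + info.getThreadState());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```

A thread stuck in socketRead0 shows up as RUNNABLE, exactly as in the dump above, because from the JVM's point of view it is executing native code, not waiting on a lock.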

When this problem occurs, at the very same time, the Oracle database logs the error "ORA-12535: TNS: operation timed out" to alert.log. I am not saying that this timeout is the problem, but it certainly indicates that something is wrong.

The parameter oracle.net.CONNECT_TIMEOUT=3000 is set in the connection pool parameters (in Tomcat's context.xml), which are passed to the ojdbc driver. If I set this value to something really small, I do get a connection timeout. The same goes for oracle.net.READ_TIMEOUT=1000: if I execute a request that takes longer than the specified timeout, I do get a timeout exception. Therefore, I presume that these two parameters are used properly by the driver.

Are we running out of database connections? No; I asked the DBA twice to check. Is the pool running out of DB connections? I believe not, for two reasons:
- We have a "borrow timeout" on the pool, which would throw a timeout exception if this happened.
- In the thread dump, we would see the thread stuck waiting for a connection to be released, or something similar.

The question I am asking myself is: how come the driver never reaches a timeout or detects that something is wrong? The Oracle database does detect the inactivity ("ORA-12535: TNS: operation timed out" in alert.log).
If the driver threw an exception, it would at least prevent the application's thread from getting stuck.
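The driver can only time out a blocked socketRead0 if a read timeout is actually applied to the socket (SO_TIMEOUT), which is presumably what oracle.net.READ_TIMEOUT controls. The standalone sketch below demonstrates the underlying mechanism with plain java.net sockets: a read with SO_TIMEOUT set fails fast instead of blocking forever.

```java
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    public static void main(String[] args) throws Exception {
        // A local server that accepts the connection but never sends a byte,
        // mimicking a peer (or firewall) that silently drops the reply.
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("localhost", server.getLocalPort());
             Socket accepted = server.accept()) {
            client.setSoTimeout(500);               // 500 ms read timeout (SO_TIMEOUT)
            try {
                client.getInputStream().read();     // would block forever without SO_TIMEOUT
                System.out.println("unexpected data");
            } catch (SocketTimeoutException e) {
                System.out.println("read timed out as expected");
            }
        }
    }
}
```

If no such timeout reaches the socket, the read blocks indefinitely, which is exactly the RUNNABLE-in-socketRead0 state shown in the thread dump.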

Note: for the moment we don't have a heavy load in production. We do have SQL requests executed regularly, but they return very few records.

942253 wrote:
The parameter oracle.net.CONNECT_TIMEOUT=3000 ...The same works for oracle.net.READ_TIMEOUT=1000

Without tracking those down specifically, those values look wrong.

I am presuming the units are seconds for those. If not, then they are too low. The comments still apply either way.

Establishing a connection should take very little time. A read should take longer. How much longer depends on what you are doing.

In a modern data center establishing a connection should take less than a second. You certainly wouldn't want to wait 50 minutes for a connection failure.

If something in the database is deadlocking, then you can get a timeout. The server itself has parameters that affect how long something can wait. A poorly designed schema and/or query could take a very long time to process if there were a lot of data. If that were the case, then the variance in input data over several days could trigger the problem and thus produce the intermittent failure.

>
The parameter oracle.net.CONNECT_TIMEOUT=3000 is set in the connection pool parameters (in Tomcat's context.xml), which are passed to the ojdbc driver. If I set this value to something really small, I do get a connection timeout. The same goes for oracle.net.READ_TIMEOUT=1000: if I execute a request that takes longer than the specified timeout, I do get a timeout exception. Therefore, I presume that these two parameters are used properly by the driver.
>
You never responded to the question I asked earlier -

Did you try setting the SQLNET.INBOUND_CONNECT_TIMEOUT as suggested in the article?
http://davidalejomarcos.wordpress.com/2011/01/18/removing-fatal-ni-connect-error-xxxx-from-your-alert-log/