Suddenly getting "Too Many Open Files" error

Dear All,
I have java listener which keep reading for incoming data to the particular port. Most of the times it works fine only at times it gets too many open files error. Below is my codes what could be wrong? Thank you.

What operating system are you running on? Do you have the entire stack trace?

This error normally comes up when you deplete your quota of the maximum number of file handles that can be open at a given time, which in turn happens if you don't close the file/socket streams you have opened. Looking at your code, it seems that you never close the InputStream retrieved from the client socket (the variable 'r'), which might be what is causing these errors. A minimal sketch of the fix follows.
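For illustration, here is a sketch of the close-everything pattern (Java 5 style, so no try-with-resources; the variable names mirror your snippet, everything else is assumed):

    import java.io.*;
    import java.net.Socket;

    class HandlerSketch {
        static void handle(Socket socketConn1) {
            BufferedReader r = null;
            PrintWriter w = null;
            try {
                r = new BufferedReader(new InputStreamReader(socketConn1.getInputStream()));
                w = new PrintWriter(socketConn1.getOutputStream(), true);
                // ... read from r, write to w ...
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                // Close *everything*, exception or no exception.
                if (r != null) try { r.close(); } catch (IOException ignored) {}
                if (w != null) w.close(); // PrintWriter.close() does not throw
                try { socketConn1.close(); } catch (IOException ignored) {}
            }
        }
    }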

A few more things:

Split your logic into methods instead of pushing the entire thing into just a single method

Instead of concatenating queries, use PreparedStatement, which is immune to SQL injection. For repeated execution it is also faster, since once a statement is "prepared" there is no re-compilation cost for the query (see the sketch after this list).

Spawning threads on demand is expensive; use thread pools. If you are using Java >= 5, look into ExecutorService.

Spawning connections on demand is also expensive; use connection pooling if your profiling suggests that the application spends a lot of time "preparing" database resources.
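To make the last three points concrete, here is a rough sketch (the port, table and column names are invented for illustration; this is not your actual code):

    import java.net.ServerSocket;
    import java.net.Socket;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class ServerSketch {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(20); // threads are reused
            ServerSocket server = new ServerSocket(9000); // assumed port
            while (true) {
                final Socket client = server.accept();
                pool.execute(new Runnable() { // instead of new Thread(...).start()
                    public void run() {
                        // parse input from client, hit the database, close resources...
                    }
                });
            }
        }

        // PreparedStatement: parameters are bound, never concatenated into the SQL.
        static void insertReading(Connection con, String deviceId, String data)
                throws SQLException {
            PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO readings (device_id, data) VALUES (?, ?)");
            try {
                ps.setString(1, deviceId);
                ps.setString(2, data);
                ps.executeUpdate();
            } finally {
                ps.close();
            }
        }
    }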

Dear SOS,
My OS is Fedora Linux. I do not know how to get the entire stack trace printed. My program actually runs inside a wrapper program, so anything I System.out.println goes into the wrapper's log file. Do you think I should do the same for the stack trace? Another thing, regarding closing the input stream: I read that if w is closed then it automatically takes care of r too. I don't know how to split the logic into different methods; they are all related to each other and I find it quite difficult to split. Yes, I am using Java 5. I will look into ExecutorService and the other pooling too. Sorry, I am quite new to all this pooling etc.

My OS is Fedora Linux. I do not know how to get the entire stack trace printed.

Just make sure that you don't catch exceptions and gobble them (i.e. do nothing with them). Either log the error using Log4J's FATAL logging level or at least do an e.printStackTrace(). Also, make sure that when you start this application on your *nix box, you don't redirect the error stream to /dev/null, which would gobble up any stack traces that might have been printed.
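As a tiny sketch of what a non-gobbling catch block looks like (risky() is just a placeholder for your socket/database work):

    import java.io.IOException;

    class LoggingSketch {
        void doWork() {
            try {
                risky(); // placeholder for your socket/database work
            } catch (IOException e) {
                // Never leave this block empty; at minimum print the trace.
                e.printStackTrace();
                // Or, assuming Log4J is on the classpath:
                // logger.fatal("I/O error in listener", e);
            }
        }
        void risky() throws IOException { /* ... */ }
    }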

If there was no exception stack trace, how did you know the error was 'too many open files'? Also, how are you starting the application on your *nix box? As a background process?

So anything I System.out.println goes into the wrapper's log file. Do you think I should do the same for the stack trace?

Exception stack traces are printed to the *error* stream instead of the standard output stream, so just make sure that your wrapper program captures both the standard output and error streams (STDOUT and STDERR).

Another thing, regarding closing the input stream: I read that if w is closed then it automatically takes care of r too.

I don't think so, since nothing along those lines is written in the Javadocs for the Socket class. I'd rather play it safe and close *all* resources which were opened; this means closing the socket, the input stream and the output stream.

I don't know how to split the logic into different methods; they are all related to each other and I find it quite difficult to split.

If you haven't done any re-factoring before, of course you'd find it difficult. I'd suggest looking at your application as a collection of components:
1) A component which parses the incoming user input
2) A database component which retrieves data using the user input
3) A component which renders output to the user

You can also wrap up all that resource-closing code in a single utility method to avoid repetitive boilerplate code. A sketch of such a utility follows.
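Something along these lines (a sketch; note that on Java 5 the Socket class does not implement Closeable, so it needs its own overload):

    import java.io.Closeable;
    import java.io.IOException;
    import java.net.Socket;

    class IOUtil {
        // Close anything Closeable (streams, readers, writers), ignoring
        // failures that happen during close itself.
        static void closeQuietly(Closeable c) {
            if (c != null) {
                try { c.close(); } catch (IOException ignored) {}
            }
        }
        // Socket only implements Closeable from Java 7 onwards.
        static void closeQuietly(Socket s) {
            if (s != null) {
                try { s.close(); } catch (IOException ignored) {}
            }
        }
    }

Then every finally block reduces to three calls: IOUtil.closeQuietly(r); IOUtil.closeQuietly(w); IOUtil.closeQuietly(socketConn1);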

Follow the above steps (at least the closing of all resources), load test your application (an automated script which hits your application continuously; a sketch follows) and see if you get the same problem again. Ideally, the default number of open file descriptors allowed (which I guess is 4096) should be good enough for your case.
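A minimal load-test client could look like this (the host, port and payload are assumptions; adjust to your setup):

    import java.io.PrintWriter;
    import java.net.Socket;

    class LoadTestSketch {
        public static void main(String[] args) throws Exception {
            for (int i = 0; i < 1000; i++) {
                Socket s = new Socket("localhost", 9000); // assumed host/port
                PrintWriter w = new PrintWriter(s.getOutputStream(), true);
                w.println("test-payload-" + i); // dummy data
                w.close();
                s.close();
                if (i % 100 == 0) System.out.println("sent " + i);
            }
        }
    }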

Dear Sos,
Ok, let me explain how I run this. I am using the YAJSW wrapper from this site: http://yajsw.sourceforge.net/. So yes, it runs in the background. Below is a sample from the wrapper's log file. Each time, I am actually writing to a text file too. Is that the problem?

OK, I just looked at the source code of the Socket classes and it seems that closing any one of those three (InputStream, OutputStream, Socket) terminates the connection, so that really shouldn't be a problem in your case. But the error message you posted really implies that your system has run out of file descriptors to allocate. If you run into this problem again, make sure you get the entire stack trace for it (as per my previous suggestion of using e.printStackTrace(System.out)).

Another thing you should do is find out the current open file descriptor limit on your machine and see the number of file descriptors used when your server process is running. For all we know, it might be a genuine issue of 'low defaults'. I'm no pro-system admin so you might also want to consult your system administrator for the same.

Dear Sos,
Actually, in Linux there are a few different file descriptor commands; one is ulimit -n. So in my case, which one is the correct way to determine what my program is using? OK then, about the closing: I will leave w.close() to take care of the rest as it is.

Dear Sos,
Thank you for the links. According to your previous reply, you said closing any one of them will be sufficient, so I think for now I will leave that and concentrate on e.printStackTrace(System.out). I have put e.printStackTrace(System.out) into each of my catch statements. Do you think it will be printed into the wrapper log file?

Yeah, based on the behaviour of the log file you posted, it seems only STDOUT is captured, which means you can't use the plain e.printStackTrace() (since that prints to STDERR).

Re-throwing the exceptions, handling them in a single place and using a logging library like Log4J would be a good idea, but maybe leave those things for when you have time. For now, just concentrate on capturing the stack traces and monitoring the file descriptor usage/limit on your system.

Dear Sos,
Ok, it happened again. Below are my latest code and the log file entry which shows where the error is happening. What I notice is that it shows the error at commServer.main(commServer.java:29), meaning it refers to this line: Socket socketConn1 = serverSocketConn.accept();. So what do you think the problem is?

Basically it means that the process has reached the limit for its open file descriptors and the server can no longer accept connections from the client.

So basically now it boils down to answering the following questions:
- How much is the current limit (ulimit -n) per process? (you can also verify this by running Java code which simply keeps opening files and never closes them)
- How many simultaneous connections are you getting from the clients?
- Are you sure your code is written such that all resources are closed even when there is an exception?

Dear Sos,
1. When I run ulimit -n I get 1024. I don't get you; what do you mean by running Java code that keeps opening files and never closing them?
2. The problem is I can't give you a fixed answer for the simultaneous connections, as the clients are GPS devices which normally send data at an interval of 1 minute. Roughly, we now have around 300 devices pointing to this server.
3. Are you saying that for each exception I should run this w.close()?

1. When I run ulimit -n I get 1024. I don't get you; what do you mean by running Java code that keeps opening files and never closing them?

1024 should be good enough given that you have just 300 clients. I suspect a resource leak here, given that this problem with file handles occurs only after running your application for a certain amount of time.
Regarding the test code, what I meant was to write simple code which recursively traverses a directory, opens each file and never closes the file handle. Just print out a counter after opening each file. After the "Too many open files" exception is thrown, look at the last counter that was printed; that gives you the maximum number of file handles your process is allowed to create. A sketch follows.
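Something like this (as a shortcut it just re-opens the same file instead of traversing a directory, which consumes descriptors all the same; /etc/hosts is only an assumed readable file):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    class FdLimitSketch {
        public static void main(String[] args) {
            List<FileInputStream> keepOpen = new ArrayList<FileInputStream>();
            int count = 0;
            try {
                while (true) {
                    // Each successful open consumes one file descriptor.
                    keepOpen.add(new FileInputStream("/etc/hosts"));
                    System.out.println("open handles: " + (++count));
                }
            } catch (IOException e) {
                System.out.println("Hit the limit after " + count + " opens:");
                e.printStackTrace(System.out);
            }
        }
    }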

3. Are you saying that for each exception I should run this w.close()?

No, what I'm trying to say is: make sure you close *all* resources eventually, exception or no exception.

I would recommend that you carry out a small experiment: deploy your server code and monitor the current count of file handles used by your process. Hit your application with a client request, then monitor the used handle count again. Ideally it should be the same as before the server started receiving requests. If the number of used file handles keeps increasing in proportion to the number of requests, you have a leak. The sketch below shows one way to watch the count from inside the JVM.
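On Linux, /proc/self/fd lists the descriptors open in the current process, so the JVM can report its own usage (Linux-specific; an assumption that you stay on Fedora):

    import java.io.File;

    class FdCountSketch {
        // Number of file descriptors currently open in *this* JVM (Linux only).
        static int openFdCount() {
            String[] fds = new File("/proc/self/fd").list();
            return fds == null ? -1 : fds.length;
        }

        public static void main(String[] args) {
            System.out.println("open fds: " + openFdCount());
        }
    }

You could log openFdCount() before and after each client connection to see whether the number drifts upwards.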

Dear Sos,
Ok, about the directory traversal, I will go and try running code of that type. What type of resource is leaked? Is there any method to detect it?
1. I ran this command to find the maximum open file descriptors: "more /proc/sys/fs/file-max" gives 1180130 (which I think is quite big, right?).
2. The command "more /proc/sys/fs/file-nr" shows me 2176 0 1180130 (this result never changes; I monitored it for quite some time).
3. "lsof | wc -l" shows me a different result, ranging from 3960 to 3970.

I am a bit new to this monitoring. I hope you can give me some guidance; do you notice anything abnormal here?

Dear Sos,
Sorry, I ran the command more /proc/sys/fs/file-nr again. There are changes: once it was 1196 0 1180130, and another time it was 2304 0 1180130. In addition, lsof | wc -l gives a higher number, in the range 3970 to 3990. Hope this gives a better indication now.

System administrators are pretty good at these monitoring things, so if you can grab one to help you out, that would be the fastest route. Using a forum for communication/debugging has a visible disconnect, as you can see.

Anyways, AFAIK, more /proc/sys/fs/file-max gives you the *total* number of file handles allowed by the *OS*. It is more likely that you are hitting a per-process limit here (ulimit -n). Also, running just lsof | wc -l is of little use, since it counts the file handles for *all* processes visible to you. Find out the PID (process identifier) of your process using the ps ux command, then do: lsof | grep YOUR_PID | wc -l . This gives you the number of file handles used by *your* Java process.

Regarding monitoring: you need to start monitoring when your server process starts and look for changes as clients request services from your server. The number of used file handles *should* go up (at least by 1) when a client connects to your server. Like I mentioned in my previous post, once a client has disconnected, the number of file handles in use should ideally drop back to what it was when the server started. If it doesn't, then you are leaking resources.

Additionally, I ran this command: netstat -nat | grep 9000 | awk '{print $6}' | sort | uniq -c | sort -n, and what I notice is that just before the server crashes, the number of TIME_WAIT entries keeps increasing. Then, browsing the log file, I noticed quite a number of the error below. Can this error also be one of the causes of the problem? Must I close the connection in this catch?

FINEST|24190/0|10-12-14 20:12:33|MyError:IOException has been caught in in the main first try
FINEST|24190/0|10-12-14 20:12:33|java.net.SocketException: Connection reset
FINEST|24190/0|10-12-14 20:12:33| at java.net.SocketInputStream.read(SocketInputStream.java:168)
FINEST|24190/0|10-12-14 20:12:33| at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
FINEST|24190/0|10-12-14 20:12:33| at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
FINEST|24190/0|10-12-14 20:12:33| at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
FINEST|24190/0|10-12-14 20:12:33| at java.io.InputStreamReader.read(InputStreamReader.java:167)
FINEST|24190/0|10-12-14 20:12:33| at java.io.BufferedReader.fill(BufferedReader.java:136)
FINEST|24190/0|10-12-14 20:12:33| at java.io.BufferedReader.read(BufferedReader.java:157)
FINEST|24190/0|10-12-14 20:12:33| at ConnectionHandler.run(commServer.java:85)
FINEST|24190/0|10-12-14 20:12:33| at java.lang.Thread.run(Thread.java:619)

Ok, I ran the command and below are the results. The problem is I don't know which PID belongs to my Java application; is it 3073 or 3096?

PID 3073 runs the JAR file wrapper.jar whereas PID 3096 runs the main class WrapperJVMMain. So the question is, is the code written by you packaged in the wrapper.jar file or wrapped up in the WrapperJVMMain class? That would be the answer to your question.

Regarding the stack trace, that probably means that the connection was severed/reset by the client and hence reading any more from the SocketInputStream is causing an exception.

Do you think I am lacking any close operation in my code?

Your code is a maze of if...else statements, so it's difficult to tell whether any connections/resources escape closure when an exception is thrown. There are now two things IMO which you can try out:
1) Run the code *without* the wrapper and see if it shows the same behaviour. Why do you require a wrapper anyway?
2) Remote debug your application. Make the necessary changes to the startup options of your Java process, remote debug using Eclipse and you should be able to work out which resources aren't getting closed.

Dear Sos,
Actually, I am also not too sure about the flow of the wrapper. I would guess that my application runs in wrapper.jar and is then run in the WrapperJVMMain class.
1. The reason I run it in the wrapper is that I need Java to run as a service. So what other solution do you suggest? The problem is that the server is in a remote location, so I use PuTTY for all my operations.
2. What do you mean by remote debugging? I am not too clear on how Eclipse will help here.

Another question: why is there at times not a single TIME_WAIT, but then suddenly the TIME_WAIT count grows and crashes the application? Is there anything causing the TIME_WAIT to happen and grow? Thank you.

Actually, I am also not too sure about the flow of the wrapper. I would guess that my application runs in wrapper.jar and is then run in the WrapperJVMMain class.

Then you need to grab someone who knows how the wrapper works (the person who implemented this, or maybe ask in the wrapper tool's forums).

1. The reason I run it in the wrapper is that I need Java to run as a service. So what other solution do you suggest? The problem is that the server is in a remote location, so I use PuTTY for all my operations.

The simplest possible option would be to spawn the process in the background:

nohup java -cp .:somejar.jar -Dother=options SomeClasss &

2. What do you mean by remote debugging? I am not too clear on how Eclipse will help here.

You can use Eclipse to remote debug your application: whenever a request comes in, you can set a breakpoint in your server code and step through the entire code path as it handles the client request. Eclipse also has a facility for attaching breakpoints to *specific* exceptions; just add breakpoints for IOException and SocketException and you should be good to go. Search the net for tutorials; a sample launch command is below.
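On Java 5, starting the JVM with the debug agent enabled would look something along these lines (port 8000 is an arbitrary choice):

java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000 -cp . commServer

Then, in Eclipse, create a "Remote Java Application" debug configuration pointing at your server's host and that port.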

Another question: why is there at times not a single TIME_WAIT, but then suddenly the TIME_WAIT count grows and crashes the application? Is there anything causing the TIME_WAIT to happen and grow?

As per the man pages, TIME_WAIT is the state where the socket has been closed but is still waiting to handle packets in the network. Saying "suddenly the TIME_WAIT grows and crashes" doesn't help here; we need exact numbers. After how many client requests? How many TIME_WAIT after starting the server? How many TIME_WAIT after handling a single client? Try to test things on a per-client basis; that gives you a much better chance of finding out where the problem is and how many file descriptors are being leaked.

Dear Sos,
Ok, I will ask the wrapper guy again.
1. Ok, I will keep the nohup method as a possible replacement; nohup can also write to log files, right? I will read up on that first.
2. Eclipse: I will also read up on that first.
3. The exact number is 230 TIME_WAIT when the system crashes. I am monitoring with this command: netstat -nat | grep 9000 | awk '{print $6}' | sort | uniq -c | sort -n. I notice that when the established connections are below 200 there is no problem at all; it runs for many hours without issues. The moment they increase to, say, 216, TIME_WAIT starts increasing; otherwise no TIME_WAIT shows up in the command at all. So can this be due to the increase in clients? Must I increase the system file descriptor limit? Thank you.

3. The exact number is 230 TIME_WAIT when the system crashes. I am monitoring with this command: netstat -nat | grep 9000 | awk '{print $6}' | sort | uniq -c | sort -n. I notice that when the established connections are below 200 there is no problem at all; it runs for many hours without issues. The moment they increase to, say, 216, TIME_WAIT starts increasing; otherwise no TIME_WAIT shows up in the command at all. So can this be due to the increase in clients? Must I increase the system file descriptor limit? Thank you.

OK, let's go over this again. *How* exactly are you testing this? Are you running the stats on your existing deployment? If so, don't do that.

Do a fresh build of your code on a dev box or some other location for the purpose of testing. Start your server. Look at the number of file handles used by all "java" processes by doing: lsof | grep java | wc -l . Look at the number of TIME_WAIT entries now. Let's call these numbers "set 1".

After this setup is done, create a client which hits your server and requests some data. Only a single client. Now collect the same two statistics as mentioned above (call this "set 2"). They will obviously be higher than the numbers collected previously. Now wait for some time, around 5-10 minutes, and check the same two statistics again ("set 3").

set 3 now should be the same as set 1. If it isn't, you have got a leak.

Dear Sos,
Yes, I was running the stats on the same machine. Ok, let me be open with you: the problem is that we have over 300 GPS devices fitted on vehicles and running. Pointing them to another IP would be a very long process, and pointing them back to the original server would be another long process, which we cannot quite afford given the time frame. So what other solution or workaround do you have? I don't understand what you mean when you say create a client and blast the server with it?

OK, my point was to deploy your application on a dev/test box and use a *dummy* client which simulates a GPS device. Surely you can do that, no? Basically it means running both applications side by side: the existing one used by the devices as usual, plus the new deployment which you can use for testing (a sketch of such a dummy client follows). Don't you have a spare box running the same OS/configuration?
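A dummy device can be a trivial loop along these lines (the host, port and payload format are made up; substitute whatever your devices actually send):

    import java.io.PrintWriter;
    import java.net.Socket;

    class DummyGpsDevice {
        public static void main(String[] args) throws Exception {
            while (true) {
                Socket s = new Socket("test-box", 9000); // assumed host/port
                PrintWriter w = new PrintWriter(s.getOutputStream(), true);
                w.println("DEVICE01,12.9716,77.5946"); // fake position report
                w.close();
                s.close();
                Thread.sleep(60 * 1000); // devices report once a minute
            }
        }
    }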

Also, the *simplest* hack here would be to just bump up the file descriptor limit and cross your fingers hoping that things work out. I guess the syntax is something along the lines of ulimit -n 4096, but confirm this with your sys-admin. If there is a genuine bug in the code, you would still get a "Too many open files" error at some point in time; it would just take a bit longer to reproduce.

Dear Sos,
Ok, at last I managed to create a dummy device using the telnet method. First I tested on my existing box and then on a new box with a different Java version.
Below are the lsof | grep java | wc -l values.

1. Original box
Connection without sending data
a. With just the box listening (41)
b. With a single client connected (42)
c. When the single client closes the connection it goes down to (41)
Connection with data being sent
a. With just the box listening (41)
b. With a single client connected (42)
c. When data is sent (46)
d. When the single client closes the connection it goes down to (45)
Connection with data being sent (multiple clients sending data)
a. With just the box listening (41)
b. With a single client connected, each additional client grows it by 1 (42, 43, etc.)
c. When data is sent by each client it grows by 1 (46, 47, etc.)
d. When all the clients have closed their connections it drops to (45)

What surprises me is that when data has been sent it drops to 45; otherwise it drops to 41. Is it because of the MySQL connection and the text file being operated on? When data is sent, this part of the code runs: while ((m=r.read()) != -1), and then the data is written to the text file and the database is updated and inserted into as well.

2. Testing new box
Connection without sending data
a. With just the box listening (31)
b. With a single client connected (32)
c. When the single client closes the connection it goes down to (31)
Connection with data being sent
a. With just the box listening (31)
b. With a single client connected (32)
c. When data is sent (36)
d. When the single client closes the connection it goes down to (35)
Connection with data being sent (multiple clients sending data)
a. With just the box listening (31)
b. With a single client connected, each additional client grows it by 1 (32, 33, etc.)
c. When data is sent by each client it grows by 1 (36, 37, etc.)
d. When all the clients have closed their connections it drops to (35)

What can you conclude from this basic test? The last value, either 41 or 45 on the old box, remains after 5 or 10 minutes. The above test results are from running purely with Java. Then I tested with the wrapper; the flow of results is the same except that you have to add 166 to the above values for the original box. Thank you.

The increased number of file handles even after the entire client-server communication is done is a bit troubling. That shouldn't be happening.

OK, now one more test. For the time being, forget about the original box, since we now have a dummy server deployment which we can use for testing. Now, instead of printing out the *number* of file handles, print out the entire data and redirect it to a file. Steps:
1) Start the server and don't request any data for the time being.
2) Do: lsof | grep java > before.txt
3) Test the connection with data being sent; preferably send the data from your local box so that you don't end up using file descriptors on the server box for your test client.
4) After a single client transaction (send/receive) is complete, wait for some time (around 5 minutes).
5) Now do: lsof | grep java > after.txt
6) Compare the two files (e.g. with diff before.txt after.txt); what do you see?