java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:51)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
at org.glassfish.grizzly.nio.transport.TCPNIOTransport.flushByteBuffer(TCPNIOTransport.java:1252)

Note that it is easier to reproduce if the request goes over the internet, but it doesn't happen as frequently when using localhost; not sure about another box on the local network.

I have also seen the same errors in the logs when launching JWS applications via JnlpDownloadServlet. To my understanding the JWS container does a similar thing: it requests the HTTP headers of a jar but then aborts the download if the jar is in the JWS cache and hasn't been modified.

Setting as blocker as it can make the server unusable, and one cannot control how http clients interact with the server.

There is a server that sends us chunked requests (by POST); our application on glassfish takes these packages and processes them. But somehow, some of the packages are not processed by the application, and there is no log activity about these packages. However, when we capture the network activity we can see that the request has reached the server.
We tried the "-Dcom.sun.enterprise.web.connector.grizzly.enableSnoop=true" parameter. Even with this parameter, there is no log entry for the lost request.
here is a lost request:
Host: 192.168.100.10:49205
Transfer-Encoding: chunked

you can find the capture attached.
Frames 7824 and 7825 are the same request that the server sends to all subscribed clients. For this example there are two clients subscribed: the GF application and the Tomcat application. As you can see, at frame 7824 Tomcat responded to the request, but at frame 7825 GF did not respond to the request.
regards,

I have an nginx http server load balancing a Glassfish cluster composed of 2 instances in different machines. These instances serve a Jersey application composed of several RESTful webservices.

After some time running, I noticed a great increase in CPU load in one of the instances, even though client request throughput remained the same. Looking at the server threads I see that two of them are constantly running, being the cause of this increase.

Checking a heap dump I see that both of these threads are running selectors with postponed TCPNIOConnection tasks, which in turn do not seem to be processed correctly.

After turning on Grizzly logs I see lots of entries like the one below:

Where the first IP address is the Glassfish instance and the second the nginx load balancer.

Checking the nginx machine I see no open sockets with the listed number.

My theory is that nginx closed the socket for some reason (timeout?) and now Glassfish is unable to connect to it. The task will keep running and consuming resources until a server restart (as selectors with postponed tasks will call non-blocking selectNow() method).
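For illustration only, the spin that theory implies looks roughly like this (a hypothetical sketch of a selector loop, not Grizzly's actual code; the class and method names are made up):

// Hypothetical sketch: why a selector with a postponed task that never
// completes can peg a CPU. While the postponed-task list is non-empty the
// loop uses the non-blocking selectNow() instead of the blocking select(),
// so it never parks.
import java.nio.channels.Selector;
import java.util.ArrayDeque;
import java.util.Queue;

class SpinningSelectorSketch {
    private final Queue<Runnable> postponedTasks = new ArrayDeque<>();

    void runLoop(Selector selector) throws Exception {
        while (true) {
            if (postponedTasks.isEmpty()) {
                selector.select();    // blocks until I/O is ready
            } else {
                selector.selectNow(); // returns immediately
                Runnable task = postponedTasks.peek();
                // If the task can never make progress (e.g. the peer socket
                // is already gone), it stays queued and this loop spins at 100% CPU.
                if (taskCompleted(task)) {
                    postponedTasks.remove();
                }
            }
            // ... process selected keys ...
        }
    }

    private boolean taskCompleted(Runnable task) {
        return false; // placeholder for the stuck write/connect task
    }
}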

I added more logging to investigate the problem.
Can you pls. apply the patched jar [1] and enable FINEST logging for org.glassfish.grizzly.nio.DefaultSelectorHandler
After you reproduce the problem - pls. share the server.log* files.

Ok, the event happened again and another thread is looping. Unfortunately due to the heavy log rotation, I could not get the beginning of it. Server log can be downloaded below but it's mostly composed of the same entries.

Can you pls. apply this patch [1].
Please enable FINEST logging for "org.glassfish.grizzly.nio.AbstractNIOAsyncQueueWriter". The logging you changed earlier could be disabled.
Now you'll see even more logging messages

As far as I can see all the entries in the first log file in the sequence - server.log_2014-10-06T22-12-38 - were logged before the utilisation went up. Then around 2014-10-06T22:13:16.417+0200 in the log file - server.log_2014-10-06T22-13-16 - the problem starts appearing.

First signs of the issue I can see are in server.log_2014-10-07T10-22-36 It took a while for me to realise that the issue had reappeared, so there are plenty more log files, but they just seem to repeat the same entry.

Yesterday I applied the patch to one of our dev servers and to one of our production servers. The dev server is still running smoothly, but the issue has reappeared on the production server. It is possible that the dev server just doesn't see enough traffic for the issue to have been triggered again. In the past the dev server has also sometimes run smoothly for a day without the issue appearing.

I can turn on logging on the production servers, but unfortunately restarts have to be done between 22:00 and 06:00 CAT if I have to patch again. For the moment I'm going to ask someone in my team to try and reproduce on the dev server.

We've now succeeded in reproducing this issue "at will" on one of our dev servers. The setup is a very basic nginx proxy that forwards http requests to a GF4.1 server running a simple web application with at least one page that takes longer than 1s to process. The nginx configuration for the hostname being forwarded looks as follows:

By setting the connect, send and read timeouts to 1 second each, it is guaranteed that nginx will treat every request for a page that takes longer than 1 second to respond as a timeout. With this configuration we can trigger the issue consistently by requesting a simple page from a servlet that has Thread.sleep(2000).
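For reference, the slow page is nothing more than a servlet along these lines (a minimal sketch; the servlet name and URL pattern are arbitrary, and the 2000 ms sleep simply exceeds the 1 second nginx proxy timeouts):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet("/slow")
public class SlowServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        try {
            Thread.sleep(2000); // longer than the 1 second nginx timeouts, so nginx gives up
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        resp.getWriter().println("done");
    }
}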

EDIT: It turns out that it is not quite as simple to reproduce this issue as I thought. This morning we were able to easily reproduce on the test server, but after a restart of the test GF server and the nginx proxy server we cannot reproduce again with the simple Thread.sleep servlet. If we do manage to consistently reproduce it I'll post the web app here.

@afcarv, we're running nginx 1.4.6 and 1.6 in front of the two servers where this issue most frequently occurs. We have not tried proxy_ignore_client_abort yet. I will test that tonight and report back.

We've not been able to trigger this issue again on the dev server since I loaded the patch. It is still a bit inconclusive, though, because there were other times when we could also not reliably reproduce the issue on the dev server within 24 hours. We'll keep putting some effort into this on the dev server today and tomorrow, and if everything seems good I'll give this a try on one of our production servers from Thursday night.

I'm going to try the patch on one of our production servers in a couple of hours from now. Although we've not been able to directly trigger the issue on our production servers in the past, the issue has shown up consistently within a few minutes of starting up any of the production servers, presumably because of the higher volumes of traffic. I should therefore be able to say tomorrow, with quite a lot of certainty, whether the latest patch has solved the problem for us.

After having closely monitored all the servers where we've loaded the latest patch I'm happy to report that the issue hasn't reappeared on any of these servers. We are seeing a great improvement in CPU utilisation as well as a great improvement in the number of requests timing out between our nginx proxies and app servers.

Thought this was due to the pile of TCPNIOConnection tasks that kept accumulating, so I didn't mind it too much, but it looks like this is still happening, even after the patch, so I'm not sure if it's related to this issue now. Haven't tried to change max queue size yet.

I then took a thread dump which shows all threads in http-listener-1 in a state like the below:

"http-listener-1(31)" daemon prio=10 tid=0x00007ff19422b000 nid=0x6c61 waiting on condition [0x00007ff1f6eed000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000006036fd360> (a java.util.concurrent.LinkedTransferQueue)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:735)
at java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:644)
at java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1137)
at org.glassfish.grizzly.threadpool.FixedThreadPool$BasicWorker.getTask(FixedThreadPool.java:105)
at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:557)
at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:545)
at java.lang.Thread.run(Thread.java:744)

They seem to be waiting on LinkedTransferQueue take() method, which appears to be not responding (waiting for some item to arrive in the queue?). I also took a memory dump that I can look into if someone points me in the right direction.
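For what it's worth, a worker parked inside LinkedTransferQueue.take() is simply waiting for the next element to be offered; take() blocks until something arrives. A minimal standalone illustration (unrelated to Grizzly's own classes):

import java.util.concurrent.LinkedTransferQueue;

class TakeBlocksUntilOffered {
    public static void main(String[] args) throws InterruptedException {
        LinkedTransferQueue<String> queue = new LinkedTransferQueue<>();

        Thread worker = new Thread(() -> {
            try {
                // Parks here (WAITING on Unsafe.park), just like the
                // http-listener-1 threads in the dump, until an item arrives.
                String task = queue.take();
                System.out.println("got " + task);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        Thread.sleep(1000);  // the worker stays parked while the queue is empty
        queue.put("work");   // offering an element unparks it
        worker.join();
    }
}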

Thought this could be related to GRIZZLY-1711 but after applying the latest version (2.3.17-SNAPSHOT) this behavior is still happening.

You could configure this to -1 or unlimited, but this will likely lead to other issues such as running out of memory.

It would appear as though tasks are being added to this queue faster than they can be processed. I would investigate why this is occurring. If you could take a number of thread dumps over a period of time in which we capture this issue occurring, that would be useful.

Furthermore, it may be worth increasing the number of http request processing threads, as this will allow more tasks to be handled concurrently off the http thread task queue, so that it does not reach the 4096 limit so quickly. What is the current value of the number of http request processing threads?
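If it helps, the current value can usually be read and changed with asadmin; the dotted name below assumes the default http-thread-pool and may need adjusting for your configuration:

asadmin get configs.config.server-config.thread-pools.thread-pool.http-thread-pool.max-thread-pool-size
asadmin set configs.config.server-config.thread-pools.thread-pool.http-thread-pool.max-thread-pool-size=64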

The issue you have here is very dependent on the number of http requests being pushed to the http task pool and how quickly these requests are being handled, so appropriate tuning of the http parameters may resolve this issue. If this fails, then looking at the state of all threads running over a period of time should give you a hint as to why requests are taking so long to process.

I don't think it's a configuration issue as the environment is able to handle much heavier loads - it may be due to specific and temporary environmental reasons (network latency, database slowdown, etc) but you'd think it would resume work as soon as this is normalized. This does not happen (waited for some hours to be sure), even if the consumer is stopped and there's no inbound traffic anymore - the only way to bring it back online is with a server/cluster restart. I did notice a lot of connections in the CLOSE_WAIT state in the server, though.

Haven't tried to raise the queue limit or remove it yet, but I suspect it would only delay the occurrence or eventually reach an OOM error.

I see no obvious reason for this looking at the thread pool dump, but will post it later for analysis. Thanks!

Please find attached the full thread dump and the server log file. The dump was taken about 30 minutes after the issue manifested itself - you can see all threads in the http-listener-1-kernel pool reach the queue limit and stop responding. http-thread-pool contains 32 threads; there are 8 acceptor threads.

Interestingly, http-listener-2 (SSL) continues to respond normally. http-listener-1 hung and required a server restart. The configuration is a 2-instance cluster with an nginx load balancer in front; there are Jersey applications deployed serving RESTful webservices. The configuration can handle about 5x the average load on the server - no increased throughput was observed leading to the issue, which persisted after stopping all inbound traffic.

I will schedule a cron job to take a thread dump every 5 minutes to check on any odd behavior leading to the issue - thanks!
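Something along these lines should do as a crontab entry (the asadmin path and log location are illustrative):

*/5 * * * * /opt/glassfish/bin/asadmin generate-jvm-report --type=thread >> /var/log/glassfish/thread-dumps.log 2>&1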

I see lots of instances of TCPNIOConnection (1500+) in what appears to be a closing state; as the latest snapshot was applied I can see these with a CloseReason of "LOCALLY". May explain the connections in the CLOSE_WAIT state I saw? Is it possible that nginx closed the socket before a FIN packet could be sent from the server, and now it is not able to end the connection properly? Not sure if this could be just a consequence of some other issue, though.

We're using b13 (couldn't find the tag for it so I selected the closest), but I believe it's the same version - 2.3.15.

Currently, 2.3.17-SNAPSHOT is applied.

By the way - we found out that what triggers this behavior is a DB performance degradation, causing the requests to queue and eventually reach the limit. No additional info on why the queue isn't being reduced, though.

Could you take a number of thread dumps, say every 5 minutes, so that we can compare the different thread dump files leading up to the hang? Also, could you send us the contents of the <transports> element, which should contain the different tcp parameters used by Grizzly?

I had some monitoring in place but we've been focusing on fixing the db degradation so the issue won't be triggered - so far we've been successful, so it looks unlikely I'll be able to collect more data in the production system.

What I'm going to try next is to create a sample application and configuration that will be able to recreate the issue - doesn't look too hard; just a sample RESTful webservice that performs a db query with an artificial delay in the response, a small db connection pool, and a sample JMeter test script with http requests that will time out before the app has a chance to respond. Should replicate the conditions pretty closely (don't know if nginx in front plays a role in this or not, so will try without it first).
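The webservice part of that sample would be roughly the following (a sketch; the path and the fixed sleep stand in for the artificially delayed db query):

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/slow-query")
public class SlowQueryResource {

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public String query() {
        try {
            // Stand-in for a db call that responds slower than the client's
            // timeout, so the JMeter/nginx side gives up before the response is written.
            Thread.sleep(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "result";
    }
}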

We're facing very frequent out-of-memory problems in our Glassfish installation of an enterprise web application.

The application is running in a cluster of 18 JVMs. The initial heap size was 1 GB, which resulted in the JVMs going down almost every hour; we have increased the heap size to 1.5 GB, which makes the JVMs crash every 3-4 hours.

The heap dump does not show any application objects at all whenever the JVMs crash, and shows the same objects from Glassfish/Grizzly consuming the majority of the memory all the time. The object below is found to consume more than 65% of the heap in most of the heap dumps.

762,277,600 bytes (52.71 %) of Java heap is used by 5,632,379 instances of org/glassfish/grizzly/memory/HeapMemoryManager$TrimmableHeapBuffer

The following error is also seen in the logs frequently (which is due to a malformed URL having && in the request query string?). Can this be a potential cause of the problem? Running a 12-hour load test using a similar URL does not produce out-of-memory errors in non-production systems.

GRIZZLY0155: Invalid chunk starting at byte [387] and ending at byte [387] with a value of [null] ignored]]

The error is not reproducible in non-production environment but heap dumps from production can be provided if required.

Are there any known memory leaks in Glassfish 4.0 (Build 89)/Grizzly 2.3.1?

What does the object HeapMemoryManager$TrimmableHeapBuffer consist of? Is there any way this can be avoided if there are no known/feasible fixes for Glassfish or Grizzly?

Additional Information (may not be relevant)

Each JVM is fronted by an Apache 2.4.9 HTTP Server using mod_jk to communicate with glassfish, which are in turn balanced by a Netscale load balancer.

Thanks oleksiys. Does this patch contain any fixes as well, or is it just to gather additional logging information? We just want to make sure before promoting it to the production system. It has been tested successfully in the non-production environment.

Note: The JVM time zone is one hour behind the timestamps that are visible in the images of heap size and permgen size. Another observation is that the JVM crashes just when it tries to resize the permgen space. The max permgen space for the JVM is set to 512 MB. Can this be part of the problem, and should we try setting the min and max perm size to the same value to avoid the permgen resize?
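If we do try that, I assume it would just be a matter of pinning PermGen with JVM options such as the following (sizes matching our current 512 MB maximum; colons escaped as asadmin create-jvm-options requires):

asadmin create-jvm-options "-XX\:PermSize=512m"
asadmin create-jvm-options "-XX\:MaxPermSize=512m"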

The logs look fine; the only piece of information missing (from the very latest log) is thread http-listener-3(44). From the thread dump I see this thread is constantly reading data, but I don't see any logging activity from this thread in the server.log* files.
For now it looks like the problem might be caused by a huge HTTP POST request, which represents form parameters.

Please let me know if you need any additional data. I am going to reduce the maximum post size for the glassfish instances to 1-2 MB and will enable the option to print the request size in HTTP Server (apache) to see if there was a big POST request leading up to the crash.

I enabled the request size directive in HTTP Server. The maximum size of GET or POST data is 171 KB leading up to the latest crash. Is that POST size big enough for the JVM to consume 1.5 GB of memory, and should I try reducing this size?

A common pattern I have noticed is that there is one request that always takes more than 50 minutes to process with a return code of 502 leading up to the crash. The size of this request is very small (131 bytes). This request can be seen starting at 26-Sep-14 10:26:34, row number 26721 in the spreadsheet. The request completed eventually with a return code of 502 when the instance was killed and restarted. I hope this might give some extra details about the problem.

When the instance is healthy, the same request takes only 1-2 seconds.

so far so good. I patched one node with three instances on it and the instances have been very stable (never used more than 550 MB of heap) while the other instances that have not been patched are still crashing. I will update on Monday evening as we do not have comparable traffic over weekends as weekdays.

Alexis - I have slowly rolled out this patch to our entire environment and the JVM instances have been very stable following the patch. The patch has resolved the problem permanently and this issue can be closed.

Very curious to know why we were not able to recreate the problem in our pre-prod application using load tests though.

At least, we'd be interested in knowing how we could inject, say, our own transport
classes (integrating spring http-invoker), and we'd also be glad to know whether
there's any chance it would get into GF if we provided a patch for it.

Looks like this is an interesting feature request. Thanks for filing it as such.
If I understand it correctly, you need some standard servlets that will tunnel
EJB calls via HTTP. This is useful for all the applications that have EJB
components, but don't have Web components that front end those EJB components.

IMO, JBoss makes you do too many configuration changes for this to be achieved.

> IMO, JBoss makes you do too many configuration changes for this to be achieved.
I totally, fully, wholeheartedly agree. This configuration is a nightmare.

What I would dream of would be to be able to configure a glassfish out of the
box using, say, a particular setup-firewallconstrained[cluster].xml and be done
(providing the right jndi property for access).

More generally, for me, this feature request is about being able to reduce the
number of needed ports, to simplify a firewall configuration. And more
importantly, to get rid of the dynamic ports and be able to specify fixed ones. Kind of
like the UnifiedInvoker as it exists in JBoss.

If you think it'd be a good thing to file another FR and link it to this one,
please just let me know.
Thanks.

I know it's not exactly what you asked for, but given that you're separating
the client and server by a firewall, implying that they're more loosely
coupled, have you considered using JAX-WS to expose your EJBs as web services?
Yes, it's a somewhat different programming model, but it might be a better
match for your environment.

One of the features we have been discussing for GFv3 is port unification for all protocols.
This would mean that HTTP, WS, and IIOP could all be handled over the same port.
Does this solve the requirement, or do you need more than just the same port?
We do occasionally see requests to allow tunneling of RMI-IIOP requests over HTTP,
but so far this has not been sufficiently common to consider implementing it.

Of course, you can also look at invoking EJBs directly from the HTTP path,
as others have discussed.

> This would mean that HTTP, WS, and IIOP could all be handled over the same port.
> Does this solve the requirement, or do you need more than just the same port?
That's not exactly the same thing, but it would definitely be a damned good
thing for convincing people here about Java EE servers, and more precisely about
the GF part.

So, I guess this could be sufficient.

> We do occasionally see requests to allow tunneling of RMI-IIOP requests over
> HTTP, but so far this has not been sufficiently common to consider implementing it.
Do you see it as an interesting/acceptable request but don't have time to
implement it, or is it just not gonna be accepted even if we try to provide a
patch for it that'd suit you?

Well, programming in JAX-RS was already suggested as a solution, and in that case there is nothing that the server has to do; the application needs to change.

The most concrete comment about a solution was the use of port unification to ensure that IIOP traffic is on the same port as http. While this is not exactly an http tunnel, it might be enough for some scenarios.

I am going to request the grizzly-kernel team to add a pointer to the port unification documentation so that anyone referring to this issue has a handy reference on how to configure IIOP traffic on the same port as http.

If port unification is not sufficient, given the availability of the JAX-RS programming model and better integration between EJB and JAX-RS, the most prudent option is to change the application.

After adding the document pointer, please assign this back to orb sub-component.

The new emptyString support in Servlet 3.0 breaks the ContainerMapper mapping
algorithm. We patched the Mapper to execute in compatible mode for the 3.0
release, but we must fix it properly in ContainerMapper for the 3.0 release.

This occurs while there is no user interaction on the web page, but the page does make ajax calls and is connected via Atmosphere using websockets. This also occurs right at startup (although less frequently, I believe) when there are no web browsers pointing at the app server.

The application has several war applications as well as a main ear application. The wars do web service calls to the ejb level web services (all on the same app server). So when the user is using the application there is a web session to the war and the war makes web service calls.

I am wondering if a web page without a session was trying to do ajax calls – although this doesn't explain the same stack trace on startup. Anyway, I hope some of this helps.

I'll update if I can come up with more examples of what makes this happen.

Okay, the exception during startup was actually from the same cause. It is the Atmosphere connection from a web browser – I had not found and closed all tabs before restarting the app server. Now I do not get this on startup.

The session and SSO / realm are most certainly expired – I do not know if that would contribute to this issue. I will work on figuring out more about this and post back when I have more. Hopefully it is benign, but I have not seen these in the logs on gf 3.1.2.2, so it is a bit troubling.

We have a GlassFish server hosted in a datacenter, running a JEE application, with a desktop application deployed via Java Web Start.

When the JWS client downloads the jars, we get very frequent network dropouts, mainly with jar files > 3 MB, where the JWS client reports "EOF: Unexpected end of ZLIB input Stream" and the server reports the following exception.

java.io.IOException: java.lang.InterruptedException
at org.glassfish.grizzly.utils.Exceptions.makeIOException(Exceptions.java:81)
at org.glassfish.grizzly.http.io.OutputBuffer.blockAfterWriteIfNeeded(OutputBuffer.java:958)
at org.glassfish.grizzly.http.io.OutputBuffer.write(OutputBuffer.java:682)
at org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:355)
at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:342)
at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:161)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1793)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at jnlp.sample.servlet.DownloadResponse$FileDownloadResponse.sendRespond(DownloadResponse.java:257)
at jnlp.sample.servlet.JnlpDownloadServlet.handleRequest(JnlpDownloadServlet.java:187)
at jnlp.sample.servlet.JnlpDownloadServlet.doGet(JnlpDownloadServlet.java:120)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.apache.catalina.core.StandardWrapper.service(

Clients connecting over a wireless internet connection get these errors much more frequently.

The client application talks to the server via http.

If I do a ping -t to the server, the number of lost packets is not that big, and sometimes there is no packet loss at all when this exception happens.

Dropouts seem to last no longer than one second, yet the number of failed downloads or calls seems to be too big and causes too many errors in the application.

If this is not a bug, I would appreciate it if someone could tell me whether there is any setting we can configure on glassfish to enable http clients to continue reading after a dropout, or to change the timeouts, retries, or any other configuration to make the communication more robust.

When I try to access a web application through IE6 on glassfish 3.1.2.2, the following phenomenon happens:
If a keepAlive-timeout happens between "qin" and "qout", the connection will be broken. You can see from ※1 in the printed log that the connection has been broken.
BTW, when we access through IE6 after the first connection was broken, another connection will be created (※2) and the response returns as normal. However, when we access the web application through IE8 after the first connection was broken, the second connection can't be created, so the log for ※2 can't be printed in this situation; the connection will be broken here and the right result can't be returned.

Above all, the question is whether this phenomenon is an IE bug or a glassfish internal issue.

I think the connection shouldn't be broken when the keepAlive-timeout happens between "qin" and "qout".

In addition, when the above phenomenon happened on glassfish 2.1.1, the following log was printed to
the server.log; the content of the log is the same as in issue GLASSFISH-20622:

server.log
-----------------------------------
java.lang.NullPointerException
at org.apache.coyote.tomcat5.OutputBuffer.addSessionCookieWithJvmRoute(OutputBuffer.java:689)
at org.apache.coyote.tomcat5.OutputBuffer.doFlush(OutputBuffer.java:370)
at org.apache.coyote.tomcat5.OutputBuffer.close(OutputBuffer.java:336)
at org.apache.coyote.tomcat5.CoyoteResponse.finishResponse(CoyoteResponse.java:594)
at org.apache.coyote.tomcat5.CoyoteAdapter.afterService(CoyoteAdapter.java:354)
at com.sun.enterprise.web.connector.grizzly.DefaultProcessorTask.postResponse(DefaultProcessorTask.java:625)
at com.sun.enterprise.web.connector.grizzly.DefaultProcessorTask.doProcess(DefaultProcessorTask.java:612)
at com.sun.enterprise.web.connector.grizzly.DefaultProcessorTask.process(DefaultProcessorTask.java:889)
at com.sun.enterprise.web.connector.grizzly.DefaultReadTask.executeProcessorTask(DefaultReadTask.java:341)
at com.sun.enterprise.web.connector.grizzly.DefaultReadTask.doTask(DefaultReadTask.java:263)
at com.sun.enterprise.web.connector.grizzly.DefaultReadTask.doTask(DefaultReadTask.java:214)
at com.sun.enterprise.web.connector.grizzly.TaskBase.run(TaskBase.java:267)
at com.sun.enterprise.web.connector.grizzly.WorkerThreadImpl.run(WorkerThreadImpl.java:118)
-----------------------------------

==============Here is the background===============
Below is the logic trace when using glassfish3.1.2.
Basically it is similar to glassfish2.1, with two threads.
First, the <Grizzly-kernel-thread> thread built the connection "conn" (marked with ★1), then it went to "qin" (marked with ★2);
after that another thread, <http-thread-pool>, was woken up and executed the sequence "qout" -> "recv" -> "send" (marked with ▲1, ▲2, ▲3).
Then after 15 seconds (keepalivetimeout) the connection was disconnected, "disc" (marked with ★3).

When I was debugging these programs, I found that if the keepalivetimeout (15 seconds) is up, it can "disc" (★3) directly after "qout" (▲1),
and it can also "disc" (★3) after "recv" (▲2).
==============Background ends===============

My first question: is it correct that after "qout" or "recv" the connection can still be disconnected because the keepalivetimeout (15 sec) is up?

And when I was reproducing the situation, after "recv" the connection was disconnected (keepalivetimeout is up); IE6 can build a second connection, finish the request
and return "200". The strange thing is that when I use IE8, there is no second connection!! I was so confused. My second question is: what makes IE6 and IE8 different?
BTW, I got the same result in both glassfish3.1.2 and glassfish2.1.

The keep-alive timeout should be applied only in cases when the connection is idle (waiting for the next request to come).
And the 15 seconds starts counting once the request/response is processed.
Are you sure it's the keep-alive timeout we're talking about?

Do you mean the 15 seconds starts counting from 8 and lasts from 8 to 9,
or starts counting from 9 and lasts from 9 to 10 ("conn" of the second request)?
My opinion is that the 15 seconds starts counting from 1 and lasts from 1 to 9 (from the source logic);
do you agree?

In this case I have the following situation: while the request is being processed and is not yet
completed (for example in step "recv"), the connection is disconnected.

Just to be clear, we're talking about HTTP keep-alive; it means the HTTP (TCP) connection is not getting closed between requests. Probably that's what you meant in your description, but I just wanted to make sure we're on the same page... so at steps 9. and 18. no real HTTP (TCP) connection termination happens.

Coming back to the steps you listed, confirm that keep-alive timeout is ticking between 9. and 10.

If you observe the timeout during read/write - it must be different timeout then.
First of all I'd recommend to switch to Glassfish 3.1.2.2 and check if the issue is still there.

>Coming back to the steps you listed, confirm that keep-alive timeout is ticking between 9. and 10."
>If you observe the timeout during read/write - it must be different timeout then.
Ok, it seems we are talking about a different timeout.

Here is part of the source code from glassfish (Grizzly); the keep-alive timeout I was talking about is
"idleLimit" below, in "com.sun.grizzly.http.SelectorThreadKeyHandler.java".

I have debugged several times and I am pretty sure this timeout occurs between read/write.
Can you try to debug from your side?

And I also listed part of the source code of "qin", "qout", "recv" and "send", marked with ★ below.
Here are the debug steps:
1. Put breakpoints at ●1, ●2, ★recv
2. When a request comes, the <http-thread-pool> thread will stop at ●2; wait for 30 seconds (default timeout), then the <Grizzly-kernel-thread> thread will come up and stop at ●1
3. Keep on debugging <http-thread-pool>, and after stepping over ★recv, go to <Grizzly-kernel-thread> and step over ●1
4. Disconnect the debugger

We recently noticed an issue where our GlassFish server, after running successfully for several hours, would suddenly peg one of the CPUs at 100%. Our application becomes unresponsive during this time. After restarting, the problem will eventually happen again (usually after several hours).

I ran this command to see what the threads were doing:

asadmin generate-jvm-report --type=thread

In the resulting output, one thread looked highly suspicious (consuming orders of magnitude more CPU time than any other thread):

Attaching a patch which enables the selector spin workaround for all operating systems.
Can you pls. apply this patch on GF 3.1.2.2 (rename this file to grizzly-framework.jar and copy it over the existing file w/ the same name in the glassfishv3/glassfish/modules folder)?
Pls. let me know if you still see this issue w/ the patch applied; if yes - pls. check and attach the GF server.log file.

I created an async servlet for users to upload files; that way the input stream is buffered without having to load the whole file in memory, as very big files can be uploaded.
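The servlet is essentially the following (a trimmed-down sketch rather than the exact code; the target path is a placeholder, and the commented-out setTimeout call marks the async timeout, which defaults to 30 seconds in many containers and is a separate knob from the tcp read/write timeouts):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.servlet.AsyncContext;
import javax.servlet.ReadListener;
import javax.servlet.ServletInputStream;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet(urlPatterns = "/upload", asyncSupported = true)
public class UploadServlet extends HttpServlet {

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        final AsyncContext ctx = req.startAsync();
        // ctx.setTimeout(0); // async timeout (often 30 s by default), independent
        //                    // of the tcp read/write timeouts
        final ServletInputStream in = req.getInputStream();
        final OutputStream out = Files.newOutputStream(Paths.get("/tmp/upload.bin")); // placeholder path

        in.setReadListener(new ReadListener() {
            private final byte[] buffer = new byte[8192];

            @Override
            public void onDataAvailable() throws IOException {
                // Copy only the bytes that are ready, so the whole file is never held in memory.
                while (in.isReady()) {
                    int n = in.read(buffer);
                    if (n < 0) {
                        return; // end of stream; onAllDataRead will be called
                    }
                    out.write(buffer, 0, n);
                }
            }

            @Override
            public void onAllDataRead() throws IOException {
                out.close();
                ctx.complete();
            }

            @Override
            public void onError(Throwable t) {
                // This is where the InterruptedByTimeoutException from the report shows up.
                try { out.close(); } catch (IOException ignored) { }
                ctx.complete();
            }
        });
    }
}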

When trying out to upload a 4.6GB file, after 30 seconds, the servlet enters the onError(Throwable t) method with the exception:

Severe: java.nio.channels.InterruptedByTimeoutException
at org.apache.catalina.connector.InputBuffer.disableReadHandler(InputBuffer.java:324)
at org.apache.catalina.connector.Request.asyncTimeout(Request.java:4418)
at org.apache.catalina.connector.Request.processTimeout(Request.java:4469)
at org.apache.catalina.connector.Request.access$000(Request.java:156)
at org.apache.catalina.connector.Request$6.onTimeout(Request.java:4300)
at org.glassfish.grizzly.http.server.Response$SuspendTimeout.onTimeout(Response.java:2131)
at org.glassfish.grizzly.http.server.Response$DelayQueueWorker.doWork(Response.java:2180)
at org.glassfish.grizzly.http.server.Response$DelayQueueWorker.doWork(Response.java:2175)
at org.glassfish.grizzly.utils.DelayedExecutor$DelayedRunnable.run(DelayedExecutor.java:158)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Warning: Context path from ServletContext: differs from path from bundle: /
WARN: WELD-000715: HttpContextLifecycle guard not set. The Servlet container is not fully compliant.

Now, AFAIK, because it's an NIO timeout exception, just changing the tcp read and write timeouts to 1 hour should avoid the issue; but still, the exception happens exactly 30 seconds later, as if the timeout configurations were being ignored.

I saw previous issues about grizzly ignoring the read and write timeouts, but they were closed as fixed in mid-2013, so it shouldn't be an issue now.

A http worker thread goes into a busy conversation with Apache's mod_jk. Both the mod_jk thread and http-listener-1 pool thread are sitting at high CPU (together at around 100% for my single CPU test setup) and the listener appears never to return to the thread pool.

The conversation is as follows (captured using strace on the high CPU glassfish thread):
glassfish -> mod_jk: 0x41 0x42 0x00 0x03 0x06 0x1f 0xfa (Send me up to 8186 bytes of the body)
mod_jk -> glassfish: 0x12 0x34 0x00 0x00 (end of body)
This sequence of messages is repeated very very rapidly, suggesting that the glassfish side is not handling the end of body message from mod_jk.
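Decoded against standard AJP13 framing, the two messages are a GET_BODY_CHUNK request and an empty (end-of-body) body packet. A small illustrative decoder (not taken from the Grizzly sources):

// Decodes the two packets from the strace capture using standard AJP13 framing:
// container -> web server packets start with 'A' 'B', web server -> container
// packets start with 0x12 0x34, each followed by a 2-byte payload length.
public class AjpPacketDecode {
    public static void main(String[] args) {
        decode(new int[] {0x41, 0x42, 0x00, 0x03, 0x06, 0x1f, 0xfa}); // glassfish -> mod_jk
        decode(new int[] {0x12, 0x34, 0x00, 0x00});                   // mod_jk -> glassfish
    }

    static void decode(int[] p) {
        String direction = (p[0] == 0x41 && p[1] == 0x42)
                ? "container -> web server" : "web server -> container";
        int length = (p[2] << 8) | p[3];
        if (length == 0) {
            // A zero-length body packet from mod_jk means "end of request body".
            System.out.println(direction + ": empty body chunk (end of body)");
        } else if (p[4] == 6) {
            // Prefix code 6 is GET_BODY_CHUNK; the next two bytes are the maximum size requested.
            int requested = (p[5] << 8) | p[6];
            System.out.println(direction + ": GET_BODY_CHUNK, up to " + requested + " bytes");
        } else {
            System.out.println(direction + ": prefix code " + p[4] + ", payload length " + length);
        }
    }
}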

The request is to an asynchronous JAX-RS method via the @Suspended annotation on an AsyncResponse parameter. The body of the message is JSON and is being deserialised in the container into another method parameter.

I think the following is occurring:

The browser sends a post request with a (in this case) 39260 byte body, but cancels it. This closes the spdy stream and causes mod_spdy to report an end of file to mod_jk internally within apache

Meanwhile glassfish begins to process the request. It has not read the whole request at this time. It invokes jackson to read the request which in turn reads from the InputStream that AjpHandlerFilter provides (see AjpHandlerFilter.handleRead in the stack trace).

The filter chain then appears to mishandle the AJP end of file message and instead continues to request more data. I didn't see AjpHandlerFilter in the read stack trace I captured so maybe it isn't intercepting correctly.

Restarting either apache or glassfish resolves the problem. This bug may be related to https://java.net/jira/browse/GLASSFISH-21202 which seems to have a similar trigger but results in memory growth rather than the high cpu and thread exhaustion I am seeing.

I have included stack traces for the write and read phases of the conversation as captured by jstack. I have also included strace output for the beginning of the problem, and to show how it recovers once Apache is restarted:
Thread 12324: (state = IN_NATIVE)

If you don't mind, just a procedural question: Is there any way I could trace this binary patch to a change in the source tree? Part of the open source policy of the company I work for is that I need to show how I would build the software from source if required.

Grizzly suffers performance degradation when setSoLinger and setReuseAddress
start throwing the following exceptions:

[#|2009-01-26T00:33:56.325-0800|WARNING|sun-appserver9.1|javax.enterprise.system.container.web|_ThreadID=17;_ThreadName=SelectorReaderThread-8084;_RequestID=11ae0030-c392-4217-8408-cfa7efe0a879;|setSoLinger
exception
java.net.SocketException: Invalid argument
at sun.nio.ch.Net.setIntOption0(Native Method)
at sun.nio.ch.Net.setSocketOption(Net.java:261)
at sun.nio.ch.SocketChannelImpl.setOption(SocketChannelImpl.java:166)
at sun.nio.ch.SocketAdaptor.setIntOption(SocketAdaptor.java:296)
at sun.nio.ch.SocketAdaptor.setSoLinger(SocketAdaptor.java:331)
at
com.sun.enterprise.web.connector.grizzly.SelectorThread.setSocketOptions(SelectorThread.java:1893)
at
com.sun.enterprise.web.connector.grizzly.SelectorReadThread.registerNewChannels(SelectorReadThread.java:93)
at
com.sun.enterprise.web.connector.grizzly.SelectorReadThread.startEndpoint(SelectorReadThread.java:121)
at
com.sun.enterprise.web.connector.grizzly.SelectorThread.run(SelectorThread.java:1223)

#]

[#|2009-01-26T00:33:56.327-0800|WARNING|sun-appserver9.1|javax.enterprise.system.container.web|_ThreadID=17;_ThreadName=SelectorReaderThread-8084;_RequestID=11ae0030-c392-4217-8408-cfa7efe0a879;|setReuseAddress
exception
java.net.SocketException: Invalid argument
at sun.nio.ch.Net.setIntOption0(Native Method)
at sun.nio.ch.Net.setSocketOption(Net.java:261)
at sun.nio.ch.SocketChannelImpl.setOption(SocketChannelImpl.java:166)
at sun.nio.ch.SocketAdaptor.setBooleanOption(SocketAdaptor.java:286)
at sun.nio.ch.SocketAdaptor.setReuseAddress(SocketAdaptor.java:399)
at
com.sun.enterprise.web.connector.grizzly.SelectorThread.setSocketOptions(SelectorThread.java:1910)
at
com.sun.enterprise.web.connector.grizzly.SelectorReadThread.registerNewChannels(SelectorReadThread.java:93)
at
com.sun.enterprise.web.connector.grizzly.SelectorReadThread.startEndpoint(SelectorReadThread.java:121)
at
com.sun.enterprise.web.connector.grizzly.SelectorThread.run(SelectorThread.java:1223)

Currently the ssl element in the domain.xml exposes an attribute called ssl2-enabled. This attribute has no real impact on the runtime (at least with the Sun and Apple JDKs) as ssl2 isn't supported by JSSE.

This has two main points of impact:
1) grizzly-config
2) documentation

We need to come to a final decision on deprecation or removal and take the appropriate actions on each item.

I changed the setting configs.config.server-config.network-config.network-listeners.network-listener.http-listener-1.port=4848 via the GUI admin console, and after that the GUI console is no longer available, and neither is the GF instance. Why is it possible to change the port to one that is already occupied by another process?
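For reference, the same dotted name can be set back from the command line, assuming asadmin can still reach the DAS (otherwise the port attribute can be edited directly in domain.xml while the domain is stopped); the port below is just an example of a free one:

asadmin set configs.config.server-config.network-config.network-listeners.network-listener.http-listener-1.port=8080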

I can't reproduce the issue, but I get many, many ClientAbortExceptions:
It seems to be a client-side communication interruption, but
the server.log is filled with these, becoming unreadable.

org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe
at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:430)
at com.sun.grizzly.util.buf.ByteChunk.flushBuffer(ByteChunk.java:458)
at com.sun.grizzly.util.buf.ByteChunk.append(ByteChunk.java:380)
at org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:455)
at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:442)
at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:160)
at org.apache.catalina.servlets.DefaultServlet.copy(DefaultServlet.java:2010)
at org.apache.catalina.servlets.DefaultServlet.serveResource(DefaultServlet.java:1040)
at org.apache.catalina.servlets.DefaultServlet.doGet(DefaultServlet.java:466)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:668)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
at org.apache.catalina.core.StandardWrapper.service(StandardWrapper.java:1550)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:343)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:217)
at filter.AdwFilter.doFilter(AdwFilter.java:206)
...
...
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:89)
at sun.nio.ch.IOUtil.write(IOUtil.java:60)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
at com.sun.grizzly.util.OutputWriter.flushChannel(OutputWriter.java:108)
at com.sun.grizzly.util.OutputWriter.flushChannel(OutputWriter.java:76)
at

When Grizzly throws an "Invalid URI character encoding" exception, the URI is part of the stack trace, but the HTTP request info isn't saved in the access log.
This is a problem if the request URI makes it obvious that the requester is trying an exploit/vulnerability.
Without an access log entry, there is no way of seeing the IP/hostname of the requester to identify the source of this attack attempt.

After trying to replicate in a VM with the suggested build, a similar error is not thrown.
To be specific, here is the stack trace of the URI decoding issue which is not being logged in the access log.

[#|2012-01-26T07:50:40.472+0100|WARNING|glassfish3.1|com.sun.grizzly.config.GrizzlyServiceListener|_ThreadID=23;_ThreadName=Thread-1;|Internal Server error: /../../../../../../../../boot.ini
java.io.IOException: Invalid URI character encoding
at com.sun.grizzly.util.http.HttpRequestURIDecoder.decode(HttpRequestURIDecoder.java:101)
at com.sun.enterprise.v3.services.impl.ContainerMapper.service(ContainerMapper.java:185)
at com.sun.grizzly.http.ProcessorTask.invokeAdapter(ProcessorTask.java:822)
at com.sun.grizzly.http.ProcessorTask.doProcess(ProcessorTask.java:719)
at com.sun.grizzly.http.ProcessorTask.process(ProcessorTask.java:1013)
at com.sun.grizzly.http.DefaultProtocolFilter.execute(DefaultProtocolFilter.java:225)
at com.sun.grizzly.DefaultProtocolChain.executeProtocolFilter(DefaultProtocolChain.java:137)
at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:104)
at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:90)
at com.sun.grizzly.http.HttpProtocolChain.execute(HttpProtocolChain.java:79)
at com.sun.grizzly.ProtocolChainContextTask.doCall(ProtocolChainContextTask.java:54)
at com.sun.grizzly.SelectionKeyContextTask.call(SelectionKeyContextTask.java:59)
at com.sun.grizzly.ContextTask.run(ContextTask.java:71)
at com.sun.grizzly.util.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:532)
at com.sun.grizzly.util.AbstractThreadPool$Worker.run(AbstractThreadPool.java:513)
at java.lang.Thread.run(Thread.java:619)

Exactly.
Because the exception is thrown on URI decode, Grizzly gives up there and nothing is written to the access log.
This becomes very problematic when you have a rogue client trying some exploit, as is the case here.