Virtual services stop responding when left unconsumed for more than 2 days after a certain transaction count

We have a few virtual services with a capacity value of 5, and every Monday we observe that they have gone down. They receive hits through Friday and are left untouched over the weekend; when they are hit again the following Monday, they are unresponsive.

Note:

Even though the status of each VS is 'Running', the services remain unresponsive during the above period.

A restart was required to make them responsive again.

What we tried?

We ran keep-warm sample tests that hit those virtual services automatically every 2 hours, 24x7. After that, the problem no longer occurred.
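For readers who want to reproduce the keep-warm workaround outside DevTest, here is a minimal sketch of the idea in Python. It is not the test suite used above; the URL in SERVICE_URLS and the function names are hypothetical placeholders, and it assumes the virtual services are reachable over plain HTTP.

```python
import http.client
import time
import urllib.error
import urllib.request

# Hypothetical endpoint list; replace with the real virtual service URLs.
SERVICE_URLS = ["http://localhost:8001/ping"]


def probe(url, timeout=10):
    """Return True if the service answers with any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # An HTTP error status still proves the listener is responding.
        return True
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, malformed reply, etc.
        return False


def keep_warm_once(urls, probe_fn=probe):
    """Probe every URL once and return the ones that did not respond."""
    return [url for url in urls if not probe_fn(url)]


def keep_warm_forever(urls, interval_seconds=2 * 60 * 60):
    """Probe all services every interval (2 hours, as in the workaround)."""
    while True:
        for url in keep_warm_once(urls):
            print(f"unresponsive: {url}")
        time.sleep(interval_seconds)
```

Calling keep_warm_forever(SERVICE_URLS) from a scheduled job would approximate the 2-hour cadence. The probe function is injectable so the loop can be exercised without a live service.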

What is required from the team?

The above was only a tactical way of identifying and confirming the cause. Since the trial has confirmed it, it would be helpful if the team could suggest a permanent fix that maintains service continuity without relying on the keep-warm test cases.

What type of services are experiencing the stoppage? HTTP, JMS, TCP, ….

Have you had an opportunity to look at the vse.log and the VS_<service>.log for the window of time in which the services stop working? I would be curious whether any exceptions or unusual events appear in the logs.

Also, since you indicate this often occurs over a weekend, I would check whether a security scan was executed on the server. If you can get a look at the scan logs you might see the point at which.

If these are JMS-based services, check whether the JMS provider server was bounced. When this occurs, connections from the virtual services can be lost and a restart is required.

2. Have you had an opportunity to look at the vse.log and the VS_<service>.log during the window of time that the services stop working? I would be curious if there are any exceptions or unusual events occurring in the log. - I had a look at the logs below:

1. vse.log - No exceptions in the log; I could only see that the status of the service is shown as down.

<html><body><title>Bad Request</title><p>The following line is not a valid HTTP request:<p><pre>

3. VS_<service>.log - No requests or events were logged in this file during this period.

I observed the same behavior in the log dumps for all 4 weeks (only on Sundays after 6 pm).

Similar copies of the same service, deployed on the same servers, are running fine. The only difference is the type of application consuming each copy of the virtual service.

I have linked a related error query for reference. Please see the ticket for more details:

Thank you for the details. The link you included shows that the service's Listen step is throwing a null pointer exception in a JSON Path filter.

My suspicion is that the filter throws the null pointer exception, and DevTest traps it and interprets it as an invalid request, even though the request itself might be valid and simply lack the JSON element that the JSON Path queries.

If that assumption is true, it is likely that the If Environment Error assertion in the Listen step is trapping the exception. That assertion is most likely branching to End the Test, which may be what shuts down the service.
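To illustrate the distinction drawn here between a malformed request and a valid request that merely lacks the queried element, the following is a small Python sketch of a null-safe path lookup. It is not DevTest's filter implementation; the function name and the dotted-path syntax are assumptions made for illustration only.

```python
import json


def lookup(payload, path):
    """Follow a dotted path through a JSON payload.

    Returns None instead of raising when the payload is empty, is not
    valid JSON, or does not contain the requested element.
    """
    try:
        node = json.loads(payload) if payload else None
    except json.JSONDecodeError:
        return None
    for key in path.split("."):
        # Stop as soon as a step of the path cannot be followed.
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node
```

A caller can then treat a None result as "element not present" and carry on, rather than letting an exception escape and be counted as an error. Note that an explicit JSON null value also comes back as None in this sketch.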

The above case applies to one of the virtual services, which throws errors and goes offline by itself once the error count rises past a certain number.

But there are other virtual services that run without errors (not going offline, no exceptions in the logs) and still stop responding if they are left untouched over the weekend or after a certain number of transactions.

If the services are stopped and restarted from the Portal, they respond as expected.

The capacity value is also set to handle the transaction counts, based on the formula recommended in the CA documentation.

Are you seeing errors in the Portal when the VS is stopped? If so, note that by default the error limit for a VS is 3, set with the property lisa.vse.max.hard.error. When a VS throws more than 3 errors, it stops automatically by design.
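As a toy model of the behavior described here (not DevTest code), the sketch below counts errors and stops once the count exceeds a limit of 3. The class and attribute names are hypothetical, and the exact point at which the real product stops a service is an assumption taken from the wording above.

```python
class ErrorBudget:
    """Toy model: a service stops once its error count exceeds a limit."""

    def __init__(self, max_hard_errors=3):
        self.max_hard_errors = max_hard_errors
        self.errors = 0
        self.running = True

    def record_error(self):
        """Count one error and stop the service if the limit is exceeded."""
        if not self.running:
            return
        self.errors += 1
        if self.errors > self.max_hard_errors:
            self.running = False
```

Under this model a service can look healthy for days and then stop on the request that pushes the count over the limit, which fits a pattern of going offline "after a certain number" of errors.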

I think you are right. I have some services where I use JSON 2.0 as the Data Protocol in the request filter. I will have to investigate whether requests are being made without any JSON data, resulting in the filter returning NULL.

I am having the same problem: I get connection refused. I opened a support ticket with CA and they provided a patch. But after applying the patch, all ports on the VSE were blocked, so I removed the patch file, and after rebooting our Linux server the services came back. The services are still becoming unresponsive randomly, though.

Perhaps the below is completely unrelated (most likely), but I thought I would mention a situation in which a virtual service becomes unresponsive without any noticeable error message:

That would be the case where the virtual service model is not OOTB but has been adapted to query a database as part of the logic for providing a dynamic response. In some situations the JDBC connection pool runs out of instances, and it looks like no error messages are logged in that case.

This problem can be fixed by adding a line such as lisa.jdbc.pool.maxPoolSize=300 to the VSE's local.properties file.

As mentioned, I am not sure whether the above is related to your problem, and I can't explain how a service would become unresponsive when unused over the weekend. But if your VSM does JDBC queries, it might be something to try.
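To show why an exhausted pool can look like a silent hang, here is a generic Python sketch of a fixed-size pool. It is only an illustration: TinyPool and its methods are hypothetical, and how DevTest's JDBC pool actually behaves when it runs out of connections is an assumption.

```python
import queue


class TinyPool:
    """A fixed-size pool of placeholder connections."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout=None):
        # With timeout=None this blocks indefinitely when the pool is
        # empty: the caller simply waits, and nothing is logged.
        # With a timeout, queue.Empty is raised instead.
        return self._free.get(timeout=timeout)

    def release(self, conn):
        """Return a connection to the pool."""
        self._free.put(conn)
```

If some code path acquires connections and never releases them, every later caller waits in acquire. From the outside that appears as a service that is 'Running' but does not respond.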

Since you have the full stack trace it would likely be helpful to them.

A concurrent modification exception usually means that a list had an element added or removed while an iterator was traversing it.

Presumably, TextHTTPServer was iterating over a list of endpoints (getEndpoints, line 292) at the time the exception was thrown. The exception implies that something removed one of the endpoints while the iteration was taking place. I would have expected the removal process to be synchronized, and probably to live outside the method that iterates the list, but you would probably need product engineering to take a look.
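The failure mode described above can be shown with a rough Python analogue. Python has no ConcurrentModificationException, but a dict raises RuntimeError when its size changes during iteration, which is the same fail-fast idea. The function names and the endpoints dict are hypothetical and say nothing about how TextHTTPServer is actually written.

```python
def close_all_unsafe(endpoints):
    """Remove entries while iterating the same dict; raises RuntimeError."""
    for name in endpoints:
        del endpoints[name]


def close_all_safe(endpoints):
    """Iterate over a snapshot of the keys so removal cannot disturb the loop."""
    for name in list(endpoints):
        del endpoints[name]
```

Iterating over a snapshot is one common remedy; guarding both the iteration and the removal with the same lock is another. Which one fits the product code is a question for product engineering.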