when a test's heart beats it should also throw up (dump stack of all threads)

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None
Lucene Fields: New

Description

We've had numerous cases where tests were hung, but the "operator" of that particular Jenkins instance struggled to properly get a stack dump of all threads and, e.g., accidentally killed the process instead (rather awful that the same powerful tool, "kill", can be used both to get stack traces and to destroy the process...).

Is there some way the test infra could do this for us, e.g. when it prints the HEARTBEAT message?


Dawid Weiss
added a comment - 27/Oct/12 07:40

I vaguely remember people on the mailing list expressed concerns even about the heartbeat messages, so throwing stack traces at everyone would probably be a bit too much. I see a few options.

1) Stack traces are dumped at test/suite timeout, so you could simply override the default and lower this timeout locally.

For tests only:
-Dtests.timeout=5000!

For entire suites:
-Dtests.timeoutSuite=10000!

Note the '!' at the end, which means the property should take precedence over annotations. The scale is milliseconds.

2) The above workaround has an obvious problem: the stack trace is not dumped repeatedly, which can be extremely useful in telling which parts of the code a thread was locked in. In fact, the randomized runner once HAD a snippet of code that did this automatically: on timeout it would probe the stack traces of all threads a few times at random intervals and then print the 'diverging' paths for each thread (along with the constant part). I removed it because it wasn't used by anybody.

So... I could add yet another option that would dump stack traces on heartbeats, but it would be turned off by default, which I think is sensible. This option could also take a number meaning roughly: "start dumping stack traces after the N-th heartbeat".

Let me know if (2) sounds good to you.
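The stack-dump part of option (2) can be sketched with plain JDK calls. This is a minimal illustration, not the actual randomized-runner code; the class and method names are hypothetical, and the heartbeat plumbing that would invoke it is omitted:

```java
import java.util.Map;

// Sketch of what a heartbeat-triggered dump could do: capture and
// render the stack of every live thread in this JVM, in a format
// resembling the usual "kill -QUIT" thread dump.
public class StackDumper {
    public static String dumpAllStacks() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            // One header line per thread: name and current state.
            sb.append('"').append(t.getName()).append("\" ")
              .append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllStacks());
    }
}
```

Probing this a few times at random intervals and diffing the results would recover the 'diverging paths' behavior described above.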


Dawid Weiss
added a comment - 27/Oct/12 07:43

Oh, forgot: both props mentioned above are from the randomized testing package, so you'd have to add them to common-build.xml and pass them to the <junit4:junit4> task. We could probably just replace all the props with a propertyset based on a regexp so that all tests.* properties are passed automatically... but then it wouldn't be so explicit, so maybe it's actually better the way it is.

Robert Muir
added a comment - 27/Oct/12 12:52
I vaguely remember people on the mailing list expressed concerns even about the heartbeat messages so throwing stack traces at everyone would be probably a bit too much.
But we have tests totally hanging and taking minutes of time. So I think their concerns can wait in line! On the other hand, if we like the test hangs and want to keep them...


Dawid Weiss
added a comment - 27/Oct/12 20:01

This is trickier than I thought. The problem is that heartbeats are in fact generated on the master side (not in the forked JVM), based on the activity in the event stream for a particular forked JVM (that's why it was possible to generate a heartbeat from an "unknown" location in the test execution timeline). The naming is confusing here; blame me.

To generate a stack trace of the forked JVM we'd need to do it remotely (and know its PID) or do it locally (and propagate it back to the master). This isn't a trivial refactoring; I'll need some time to do it. In the meantime I'll release the fix for the shutdown hooks issue, which should help in getting those stalled runs to complete (with a suite timeout on forked JVMs and corresponding stack traces).


Dawid Weiss
added a comment - 27/Oct/12 20:33

Funnily, that's the hardest part about it... there's just no API to get the PID of a forked process. I think I've seen a patch to JDK 1.8 somewhere that adds it, but for now there's just none. It'd have to be platform-specific trickery to get it.

Robert Muir
added a comment - 07/Dec/12 12:51

http://jenkins.sd-datasolutions.de/job/Lucene-Solr-4.x-Linux/3089/console was another example of this.

Alternatively, we could try to do something like hack up the parent process (ant) to respond to some signals in a way that helps prevent operator errors.


Uwe Schindler
added a comment - 07/Dec/12 13:01

Actually, getting to the step of printing the PID in the heartbeat message would be really useful!

The PID is accessible! The parent process cannot get it (the PID of the forked process), but the child JVM can get its own PID using the javax.management API. I have to look up example code. So the child JVM could print it on starting up.


Uwe Schindler
added a comment - 07/Dec/12 13:03

Here it is:

There exists no platform-independent way that is guaranteed to work in all JVM implementations. ManagementFactory.getRuntimeMXBean().getName() looks like the best (closest) solution. It's short, and probably works in every implementation in wide use.

On Linux it returns a value like 12345@localhost (12345 being the process id). Beware, though, that according to the docs there are no guarantees about this value:

"Returns the name representing the running Java virtual machine. The returned name string can be any arbitrary string and a Java virtual machine implementation can choose to embed platform-specific useful information in the returned name string. Each running virtual machine could have a different name."
(from http://stackoverflow.com/questions/35842/how-can-a-java-program-get-its-own-process-id )
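This approach can be sketched as follows. The defensive parse is needed precisely because the name's format is unspecified; the Pid class and parsePid helper are hypothetical names for illustration:

```java
import java.lang.management.ManagementFactory;

// Sketch: derive this JVM's PID from the RuntimeMXBean name, which on
// common JVMs looks like "12345@hostname". The format is NOT
// guaranteed by the spec, so parse defensively and return -1 when the
// part before '@' is missing or not a number.
public class Pid {
    static long parsePid(String runtimeName) {
        int at = runtimeName.indexOf('@');
        if (at <= 0) return -1;
        try {
            return Long.parseLong(runtimeName.substring(0, at));
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        String name = ManagementFactory.getRuntimeMXBean().getName();
        System.out.println("runtime name: " + name + ", pid: " + parsePid(name));
    }
}
```

A forked JVM could print this once at startup, giving the operator a PID to aim "kill -QUIT" at.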

Uwe Schindler
added a comment - 07/Dec/12 13:09 - edited

Maybe instead of printing "J0" and "J1", ... just always print this identification string in all messages? It is platform specific but defined to be unique, so much better than J0, J1, ...


Dawid Weiss
added a comment - 07/Dec/12 13:18

I can probably print those identifiers when a VM doesn't respond (and is heartbeating) and once at start? We don't know how long these will be, and because they're variable length they can become pretty verbose.