Activity

Tom White
added a comment - 03/Jun/11 18:45 Passing node-specific scripts (which is what is needed to pass each node its IP addresses) would need a modification to jclouds, since ComputeService#runScriptOnNodesMatching runs the same script on all nodes. Calling this API for each node in turn with a different script is inefficient since jclouds queries for all the running nodes and filters each time.

Tom White
added a comment - 06/Jun/11 05:19 > I would use ScriptBuilder.addEnvironmentVariableScope for this.
I hadn't seen this before. We could use this along with runScriptOnNode() to pass IP addresses, etc. Is this what you had in mind?
One possible downside to this approach is that "whirr run-script" wouldn't be able to set these environment variables, so any scripts that rely on them could not be run that way.

Andrei Savu
added a comment - 06/Jun/11 21:02 > We could use this along with runScriptOnNode() to pass IP addresses, etc. Is this what you had in mind?
Exactly.
> One possible downside to this approach is that using "whirr run-script" wouldn't be able to set these environment variables
We could rewrite "whirr run-script" in terms of runScriptOnNode.
Actually I think it would be useful to have a runScriptOnNode method on the ClusterController that would set all the required environment variables and we could call that by using an executor service.
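
A minimal sketch of the idea discussed above: prepend node-specific environment variables to a shared script, then run one script per node through an executor service. Everything here is hypothetical illustration. The class PerNodeScripts, the NodeRunner interface (a stand-in for a call such as runScriptOnNode) and the variable name PRIVATE_IP are assumptions, not Whirr or jclouds API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch; none of these names come from Whirr or jclouds. */
class PerNodeScripts {

  /** Prepends one "export KEY='value'" line per entry to a shared script body. */
  static String withEnvironment(Map<String, String> env, String scriptBody) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> e : env.entrySet()) {
      // Single-quote the value and escape embedded single quotes for the shell.
      String quoted = "'" + e.getValue().replace("'", "'\\''") + "'";
      sb.append("export ").append(e.getKey()).append('=').append(quoted).append('\n');
    }
    return sb.append(scriptBody).toString();
  }

  /** Stand-in for the real per-node call (assumed, not shown here). */
  interface NodeRunner {
    String run(String nodeId, String script) throws Exception;
  }

  /** Runs one script per node in parallel; results keep the input node order. */
  static Map<String, String> runOnAll(Map<String, Map<String, String>> envByNode,
      String scriptBody, NodeRunner runner, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      Map<String, Future<String>> futures = new LinkedHashMap<>();
      for (Map.Entry<String, Map<String, String>> node : envByNode.entrySet()) {
        final String id = node.getKey();
        final String script = withEnvironment(node.getValue(), scriptBody);
        Callable<String> task = () -> runner.run(id, script);
        futures.put(id, pool.submit(task));
      }
      Map<String, String> results = new LinkedHashMap<>();
      for (Map.Entry<String, Future<String>> f : futures.entrySet()) {
        // get() blocks until that node's script has finished or failed.
        results.put(f.getKey(), f.getValue().get());
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }
}
```

Because each node gets its own script text, the shared body stays identical and only the exported variables differ per node.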

Andrei Savu
added a comment - 07/Jun/11 15:06 I've done some work on this one (trying to rewrite "whirr run-script") and I believe I've discovered a bug in jclouds.
It seems like runScriptOnNode ignores the new credentials provided as RunScriptOptions.
I believe the error should be fixed in BaseComputeService.runScriptOnNode and the code should be similar to BaseComputeService.TransformNodesIntoInitializedScriptRunners.apply

Tom White
added a comment - 07/Jun/11 19:12 > We could rewrite "whirr run-script" in terms of runScriptOnNode
Yes, this would work.
Rather than adding methods to ClusterController, I think we need a RunScriptClusterAction which encapsulates this logic. This would also address WHIRR-324.

Tom White
added a comment - 15/Aug/11 21:30 This patch shows the changes needed to StatementBuilder to set environment variables with the instance data (such as IP address). It still needs work to update the service scripts to use the variables.

Tom White
added a comment - 03/Sep/11 01:32 Updated version of the patch with an update to ZooKeeper, which removes some code. The patch is not ready for commit since it's not working reliably for me yet.

Tom White
added a comment - 04/Oct/11 18:50 I updated the patch to the latest trunk. The ZK integration test passes on EC2 and Rackspace. However, the following message gets printed to the console, even though there is no problem starting the service:
Error running script:
aborting: jclouds-script-1317748237929 did not start
I'm not sure why this is happening - it seems to be in the jclouds "forget" bash function.
Andrei - it would be great if you could take a look at the patch or take it over. Thanks!

Karel Vervaeke
added a comment - 05/Oct/11 08:03 I have seen the 'did not start' error before while working on run-script for BYON.
The problem occurred with scripts with a very short execution time. Worked around it by adding 'sleep 2' to those scripts, or setting wrapInInitScript=false.

Andrei Savu
added a comment - 05/Oct/11 13:25 > The problem occurred with scripts with a very short execution time.
Do you think we should open an issue in jclouds for this? Adrian? I will try to add a test in jclouds to replicate the failure.

Andrei Savu
added a comment - 05/Oct/11 22:10 I've decided that I'm going to add a 'sleep 4' statement as a workaround. Unfortunately the zookeeper integration tests are failing for me on ec2 with this change. Will try again tomorrow morning.

Paul Baclace
added a comment - 07/Oct/11 00:00 Any time sleep is used for consistency the amount of time to pre-wait should be a tunable parameter because it is a heuristic that depends on system load. For instance, on EC2, different regions at different times might require different values to have 99% probability of being ready.
Moreover, if no retries are implemented yet for a particular operation, then it is even more important to be tunable.
The tunable property should be suggested in the error msg that is seen when the operation fails.
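
A minimal sketch of what such a tunable could look like, assuming the wait is appended to the script as a sleep statement. The class PreWait and the property name example.script.pre-wait-seconds are hypothetical and are not existing Whirr or jclouds settings. The default of 4 seconds mirrors the 'sleep 4' workaround mentioned earlier in this thread.

```java
/** Hypothetical sketch; the property name is an assumption, not a Whirr setting. */
class PreWait {
  static final String PROPERTY = "example.script.pre-wait-seconds"; // hypothetical name
  static final int DEFAULT_SECONDS = 4;

  /** Parses the configured wait, falling back to the default when absent or invalid. */
  static int seconds(String configured) {
    if (configured == null) {
      return DEFAULT_SECONDS;
    }
    try {
      int value = Integer.parseInt(configured.trim());
      return value >= 0 ? value : DEFAULT_SECONDS;
    } catch (NumberFormatException e) {
      return DEFAULT_SECONDS;
    }
  }

  /** Appends the sleep so that a very short script runs for at least that long. */
  static String withPreWait(String script, int seconds) {
    return script + "sleep " + seconds + "\n";
  }

  /** Builds an error message that names the tunable, so users know what to adjust. */
  static String failureMessage(String cause, int seconds) {
    return cause + " (waited " + seconds + "s; consider raising " + PROPERTY + ")";
  }
}
```

A caller could read the value with PreWait.seconds(System.getProperty(PreWait.PROPERTY)) and pass the result to both withPreWait and failureMessage.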

Andrei Savu
added a comment - 07/Oct/11 00:07 In this patch I have updated all services to use the env variables. All integration tests pass on Rackspace.
Next: test on aws, look into reliability issues as Paul suggested (thanks!).
Feedback is highly appreciated.

Tom White
added a comment - 11/Oct/11 00:13 +1 looks good.
I agree with Paul that sleep times should be tunable in general, but in this case it's a local process (i.e. not calling out to other instances) and it only serves to suppress an error message as far as I can tell. So it's worth adding until it's fixed in jclouds, since the message is misleading for users, but I'm not sure if it needs to be tunable in this case.

Andrei Savu
added a comment - 11/Oct/11 15:05 Updated patch for current trunk and added parallel configure script execution on all machines. I have executed all integration tests on cloudservers-us and hbase on ec2.
@Adrian How is ComputeService.runScriptOnNode exposing ssh authentication issues? Is it a good idea to retry execution on IllegalStateException?