Composable services

Details

Description

The current design does not support composable services, so you can't, for example, run ZooKeeper on the same cluster as Hadoop (a better example would be running Flume agents on a Hadoop cluster). We should make it easy to do this.

Tom White
added a comment - 07/Dec/10 15:10 +1 to checking URLs (with a HEAD request?). WHIRR-123 will cover this.
> I didn't explicity set whirr.run-url-base or whirr.client-cidrs. Do we need to?
Not necessarily, but I often change whirr.run-url-base when I'm testing new scripts. And whirr.client-cidrs is needed from some networks.

looks like a whirr version error in runurl on the remote host due to not previously running mvn install. Do you think we could sanity check runurls before building against them? ex. checkExists(url, "url must exist: "+url); I can see in some cases this may be overbearing, so maybe even use assertions, so that in test cases we get the info? assert head(url) : "url must be reachable: "+url;

Adrian Cole (Inactive)
added a comment - 07/Dec/10 13:37 +1 to the patch
My troubles were related to something different:
looks like a whirr version error in runurl on the remote host due to not previously running mvn install. Do you think we could sanity check runurls before building against them? ex. checkExists(url, "url must exist: "+url); I can see in some cases this may be overbearing, so maybe even use assertions, so that in test cases we get the info? assert head(url) : "url must be reachable: "+url;

I re-applied the patch and ran zookeeper and hadoop integration tests - both passed. Are you using an overridden whirr.run-url-base? Or do you need to set whirr.client-cidrs? Do the tests work without the patch applied?

Can you rerun with the tearDown() method in the relevant test removed, so you can log in to see what's going on? I opened WHIRR-152 to make this kind of thing easier to diagnose in the future.

Tom White
added a comment - 07/Dec/10 00:45 I re-applied the patch and ran zookeeper and hadoop integration tests - both passed. Are you using an overridden whirr.run-url-base? Or do you need to set whirr.client-cidrs? Do the tests work without the patch applied?
Can you rerun with the tearDown() method in the relevant test removed, so you can log in to see what's going on? I opened WHIRR-152 to make this kind of thing easier to diagnose in the future.
Thanks for the review.

Adrian Cole (Inactive)
added a comment - 06/Dec/10 23:18 +1 much cleaner design than before 117 and significantly easier to read. It reads like a story.. I like it a lot.
note that I'm getting the following error on the zk test:
2010-12-06 23:07:16,971 WARN [org.apache.zookeeper.ClientCnxn] (main-SendThread(ec2-50-16-88-214.compute-1.amazonaws.com:2181)) Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
neverending dots on cassandra
and the following on hadoop
Warning: Permanently added 'ec2-50-16-80-56.compute-1.amazonaws.com,50.16.80.56' (RSA) to the list of known hosts.
channel 2: open failed: connect failed: Connection refused
I suspect there's something awry in the firewall rules. Otherwise looks good!

Tom White
added a comment - 06/Dec/10 22:02 Tests pass on Rackspace with a minor change. Also re-instated deprecated
{Cassandra,Hadoop,ZooKeeper}
Service classes (with no code except the getName() method, since the code is now in the ClusterActionHandler implementations), so that the CLI works, which I tested manually.

Thanks for the feedback, Adrian - it's very helpful! I've updated the patch with your suggestions, and a few more changes:

I've introduced a ClusterActionEvent object, to make the ClusterActionHandler be in line with the standard listener pattern, like ServletContextListener.

Good point about runurl - I've generalized to Statement (from jclouds). I hadn't seen all the work you've done on scriptbuilder in jclouds - it looks very useful, and something we could use more in Whirr (e.g. the ability to build files via Statements.createFile), so the ability to use any Statement is a good improvement.

The Hadoop role names are now backwards compatible (nn, jt, dn, tt), but not yet fully composable since there is more work to be done it splitting the cloud-side scripts into individual roles. This can be tackled in a future JIRA.

ClusterAction implementations are now in a new package o.a.whirr.cluster.actions.

DestroyClusterAction is new; now all the common actions are implemented as ClusterActions.

ScriptBasedClusterAction is new, and is an implementation of ClusterAction for actions that run scripts on nodes. In the previous patch this was built into ClusterAction, which was too tightly coupled - e.g. a PalletClusterAction would use Pallet directly and not the jclouds script running machinery.

One area that could be improved is how the cluster objects are updated in ScriptBasedClusterAction. Actions and handlers may update the cluster object, and the logic for propagating the change to other events is a little delicate (although encapsulated in ScriptBasedClusterAction). Since this is essentially an internal API it could be improved later.

I've done some testing and so far everything passes (on EC2). I'm going to do some more tests on Rackspace.

Tom White
added a comment - 06/Dec/10 18:12 Thanks for the feedback, Adrian - it's very helpful! I've updated the patch with your suggestions, and a few more changes:
I've introduced a ClusterActionEvent object, to make the ClusterActionHandler be in line with the standard listener pattern, like ServletContextListener.
Good point about runurl - I've generalized to Statement (from jclouds). I hadn't seen all the work you've done on scriptbuilder in jclouds - it looks very useful, and something we could use more in Whirr (e.g. the ability to build files via Statements.createFile), so the ability to use any Statement is a good improvement.
The Hadoop role names are now backwards compatible (nn, jt, dn, tt), but not yet fully composable since there is more work to be done it splitting the cloud-side scripts into individual roles. This can be tackled in a future JIRA.
ClusterAction implementations are now in a new package o.a.whirr.cluster.actions.
DestroyClusterAction is new; now all the common actions are implemented as ClusterActions.
ScriptBasedClusterAction is new, and is an implementation of ClusterAction for actions that run scripts on nodes. In the previous patch this was built into ClusterAction, which was too tightly coupled - e.g. a PalletClusterAction would use Pallet directly and not the jclouds script running machinery.
One area that could be improved is how the cluster objects are updated in ScriptBasedClusterAction. Actions and handlers may update the cluster object, and the logic for propagating the change to other events is a little delicate (although encapsulated in ScriptBasedClusterAction). Since this is essentially an internal API it could be improved later.
I've done some testing and so far everything passes (on EC2). I'm going to do some more tests on Rackspace.

We can start with typed interface and in the future move to annotations like @Phase("configure"), etc.

For implementation, I'd suggest switching to a statementlist List<Statement> this is useful since we can type statements. For example InstallJava and then we can check the list to see if it is already there, etc. Also, it decouples the means to perform the command (runurl).

Adrian Cole (Inactive)
added a comment - 03/Dec/10 19:37 I like the event-based system. I'd probably go all the way and follow http://download.oracle.com/javaee/6/api/javax/servlet/ServletContextListener.html
We can also consider parallels to things like ServletFilters. One cool thing about a filter approach is that you can inspect the others in the chain.
We can start with typed interface and in the future move to annotations like @Phase("configure"), etc.
For implementation, I'd suggest switching to a statementlist List<Statement> this is useful since we can type statements. For example InstallJava and then we can check the list to see if it is already there, etc. Also, it decouples the means to perform the command (runurl).
Good stuff. it is easier to comment seeing the code.

There is an implementation for each role: e.g. one for zk, one for nn, jt, etc (although in the current patch it was simpler to do hadoop-master and hadoop-worker). In effect, it turns the Service implementation inside out, since the roles don't start the cluster themselves, instead they can hook into the lifecycle of a cluster being started, configured, etc.

The two pre-defined actions are bootstrap and configure (like Pallet). (This change will make it easier to implement WHIRR-88, amongst other things.) So the beforeAction method for the bootstrap action for zk would use the passed-in runUrlBuilder to add the runurls of the install scripts:

The configure step is more complicated, it uses the passed-in cluster object to work out the zk ensemble that is passed to the runurl as a parameter. It also sets the firewall up.

More complex services which have interacting roles can coordinate via the cluster object. For example, the handler for a Hadoop datanode can find the namenode from the cluster object, and use it to configure the datanode instances.

With this approach there is a single way to launch clusters - there is a single Service object that knows which handler to use for each of the roles in the template. Handlers register themselves using the java.util.ServiceLoader pattern, just like for Service and ServiceFactory (which will be replaced by this new approach). Look at the ClusterAction classes to see how this works in more detail.

The composability comes about because the RunUrlBuilder can compose runurls from different roles into a single script, de-duping as necessary. So if a cluster had zk and hadoop instances, then the previous snippet plus

This assumes that the runurls can co-exist. Currently the zookeeper install clobbers /etc/rc.local, which needs to be fixed.

Overall, this is quite a radical change, but it actually makes the code for each service shorter, since the boilerplate for launching instances is moved into a common class. I've managed to get the integration tests passing, but I haven't updated the CLI to use the new classes.

Tom White
added a comment - 03/Dec/10 05:18 I've fleshed out an implementation for this, and am looking for some review comments, especially on general direction. The basic idea is to implement service roles with a callback interface:
public abstract class ClusterActionHandler {
public abstract String getRole();
public Cluster beforeAction( String action, ClusterSpec clusterSpec, Cluster cluster,
RunUrlBuilder runUrlBuilder) {}
public Cluster afterAction( String action, ClusterSpec clusterSpec, Cluster cluster) {}
}
There is an implementation for each role: e.g. one for zk, one for nn, jt, etc (although in the current patch it was simpler to do hadoop-master and hadoop-worker). In effect, it turns the Service implementation inside out, since the roles don't start the cluster themselves, instead they can hook into the lifecycle of a cluster being started, configured, etc.
The two pre-defined actions are bootstrap and configure (like Pallet). (This change will make it easier to implement WHIRR-88 , amongst other things.) So the beforeAction method for the bootstrap action for zk would use the passed-in runUrlBuilder to add the runurls of the install scripts:
runUrlBuilder.addRunUrl( "sun/java/install" );
runUrlBuilder.addRunUrl( "apache/zookeeper/install" );
The configure step is more complicated, it uses the passed-in cluster object to work out the zk ensemble that is passed to the runurl as a parameter. It also sets the firewall up.
More complex services which have interacting roles can coordinate via the cluster object. For example, the handler for a Hadoop datanode can find the namenode from the cluster object, and use it to configure the datanode instances.
With this approach there is a single way to launch clusters - there is a single Service object that knows which handler to use for each of the roles in the template. Handlers register themselves using the java.util.ServiceLoader pattern, just like for Service and ServiceFactory (which will be replaced by this new approach). Look at the ClusterAction classes to see how this works in more detail.
The composability comes about because the RunUrlBuilder can compose runurls from different roles into a single script, de-duping as necessary. So if a cluster had zk and hadoop instances, then the previous snippet plus
runUrlBuilder.addRunUrl( "sun/java/install" );
runUrlBuilder.addRunUrl( "apache/hadoop/install" );
would be equivalent to
runUrlBuilder.addRunUrl( "sun/java/install" );
runUrlBuilder.addRunUrl( "apache/zookeeper/install" );
runUrlBuilder.addRunUrl( "apache/hadoop/install" );
since the sun/java/install script is only run once.
This assumes that the runurls can co-exist. Currently the zookeeper install clobbers /etc/rc.local, which needs to be fixed.
Overall, this is quite a radical change, but it actually makes the code for each service shorter, since the boilerplate for launching instances is moved into a common class. I've managed to get the integration tests passing, but I haven't updated the CLI to use the new classes.

As a start we could rename Service to ClusterManager (or similar). It would have the same interface, but implementations would be free to start more than one service in a cluster. It would be up to the ClusterManager to determine how to do this. E.g. PalletClusterManager would use a more Pallet-like spec (see WHIRR-125), and ZooKeeperClusterManager (etc) would do what we have today.

Tom White
added a comment - 02/Nov/10 18:16 As a start we could rename Service to ClusterManager (or similar). It would have the same interface, but implementations would be free to start more than one service in a cluster. It would be up to the ClusterManager to determine how to do this. E.g. PalletClusterManager would use a more Pallet-like spec (see WHIRR-125 ), and ZooKeeperClusterManager (etc) would do what we have today.
Thoughts?