Details

Description

Most Mahout features require running several jobs in sequence. This can be done via the command line or using one of the driver classes. Running and configuring a Mahout job from Java requires either using the driver's static methods or creating a String array of parameters and passing it to the job's main method. If we could instead configure jobs through a Java bean or factory, configuration would be type-safe and easier to use with DI frameworks such as Spring and Guice.

I have added a patch where I factored a KMeans MapReduce job plus a configuration Java bean out of KMeansDriver.buildClustersMR(...)

The KMeansMapReduceConfiguration takes care of setting up the correct values in the Hadoop Configuration object and initializes defaults. I copied the config keys from KMeansConfigKeys.

The KMeansMapReduceJob contains the code for the actual algorithm: it runs all iterations of KMeans and returns the KMeansMapReduceConfiguration, which contains the cluster path of the final iteration.

I would like to extend this approach to other Hadoop jobs, for instance the job for creating points in KMeansDriver, but I first want some feedback on this.

One of the benefits of this approach is that it becomes easier to chain jobs. For instance, we can chain Canopy to KMeans by connecting the output dir of Canopy's configuration to the input dir of the configuration of the next KMeans job in the chain. Hadoop's JobControl class can then be used to connect and execute the entire chain.
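To make the chaining concrete, here is a minimal plain-Java sketch. JobConfig is a stripped-down, hypothetical stand-in for beans like CanopyConfiguration and KMeansConfiguration; in the real setup, Hadoop's JobControl would execute the connected jobs:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of chaining jobs via configuration beans: each
// job's input dir is wired to the previous job's output dir.
public class JobChainSketch {

    static class JobConfig {
        final String inputPath;
        final String outputPath;

        JobConfig(String inputPath, String outputPath) {
            this.inputPath = inputPath;
            this.outputPath = outputPath;
        }
    }

    // Build a chain where each job reads from the previous job's output.
    static List<JobConfig> chain(JobConfig first, String... outputPaths) {
        List<JobConfig> jobs = new ArrayList<>();
        jobs.add(first);
        JobConfig prev = first;
        for (String out : outputPaths) {
            prev = new JobConfig(prev.outputPath, out);
            jobs.add(prev);
        }
        return jobs;
    }

    public static void main(String[] args) {
        // Canopy writes canopies; KMeans reads them as its input.
        JobConfig canopy = new JobConfig("/data/points", "/out/canopies");
        List<JobConfig> pipeline = chain(canopy, "/out/kmeans");
        System.out.println(pipeline.get(1).inputPath);  // prints /out/canopies
    }
}
```

The chain is pure data here; executing it would be JobControl's job.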

This approach can be further improved by turning the configuration bean into a factory for creating MapReduce or sequential jobs. This would probably remove some duplicated code in the KMeansDriver.
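As a rough sketch of the factory idea (all names here, ClusteringAlgorithm, KMeansConfig and createAlgorithm, are illustrative and not an existing API):

```java
// Hypothetical sketch: the config bean doubles as a factory that builds
// either the MapReduce or the sequential variant of the algorithm, so
// the selection logic lives in one place instead of in the driver.
public class ConfigFactorySketch {

    interface ClusteringAlgorithm {
        String describe();  // stand-in for a run(...) method
    }

    static class KMeansConfig {
        final int maxIterations;
        final boolean runSequential;

        KMeansConfig(int maxIterations, boolean runSequential) {
            this.maxIterations = maxIterations;
            this.runSequential = runSequential;
        }

        // Factory method: callers never pick the implementation class.
        ClusteringAlgorithm createAlgorithm() {
            if (runSequential) {
                return () -> "sequential k-means, " + maxIterations + " iterations";
            }
            return () -> "mapreduce k-means, " + maxIterations + " iterations";
        }
    }

    public static void main(String[] args) {
        System.out.println(new KMeansConfig(10, false).createAlgorithm().describe());
        // prints: mapreduce k-means, 10 iterations
    }
}
```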

Activity

Sean Owen
added a comment - 22/Feb/11 01:15

I think I understand your patch. You're leaving KMeansDriver as the shell with which to run it from the command line, but introducing one more layer of abstraction between it and running Hadoop MapReduces so that it can be invoked programmatically. Sounds fine to me.

My only bit of feedback then is about naming. We unfortunately have some conflicting naming here for the command-line class that runs MapReduces and implements Tool. It's a "*Job" in some places and a "*Driver" in other places. (Anyone prefer one convention? I could JIRA that too.)

To avoid deepening the confusion, consider renaming KMeansMapReduceJob to something that doesn't end in either of those.

Frank Scholten
added a comment - 22/Feb/11 19:41 - edited

Updated and expanded the patch. Renamed KMeansMapReduceJob to KMeansMapReduceAlgorithm and added KMeansSequentialAlgorithm. These implementations also create the points mapping by default, based on the runClustering flag. The KMeansConfiguration can be used for both implementations.

Robin Anil
added a comment - 22/Feb/11 21:06

Indeed! This removes a whole lot of parameters from the function; the code looks a lot nicer now. I would like to have the configuration object serialized and deserialized in one step in the job/mapper, or merged with the Hadoop configuration object in some generic way, maybe via a base MahoutConfigBase class. All of these are nice-to-haves, but if you can, I would really appreciate such a change.

Before committing, though, I will wait for the full change to all jobs so that the code is not in an uneven state.

Isabel Drost-Fromm
added a comment - 23/Feb/11 10:08

Robin, I agree with your concern about avoiding an uneven state in the codebase. Given the anticipated amount of work that has to go into this, would it make sense to track these changes in a separate branch to avoid the "one huge patch that touches everything at once" problem?

Frank Scholten
added a comment - 25/Feb/11 16:49

Added a patch for the canopy jobs. This time I also moved the logic for composing output paths (canopies and points) from the CanopyDriver into the configuration object.

The config keys are now moved to CanopyConfiguration and CanopyConfigKeys is removed. Some keys are different because I renamed output to outputBasePath, to make it clear that the canopy and points outputs are relative paths under this outputBasePath.

Frank Scholten
added a comment - 25/Feb/11 17:08

Robin: maybe I now understand what you mean about serializing the config. At the moment the mappers and reducers still need to access values in the Configuration object via the config keys. Is it possible to turn the (KMeans|Canopy)Configuration into a simple POJO, have it implement Writable, serialize it inside the Configuration, and deserialize it in the mapper and reducer? Or does this have performance implications or other consequences?

We could maybe add a method to (KMeans|Canopy)Configuration

public Configuration asConfiguration() { ... }

where it serializes itself inside a Configuration and then returns it.
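To make the intent concrete, here is a minimal plain-Java sketch of that round trip. A Map<String, String> stands in for Hadoop's Configuration (the real version would call conf.set(...) or serialize a Writable), and KMeansConfig, asConfiguration and fromConfiguration are illustrative names, not the actual patch's API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the asConfiguration() idea: the bean serializes
// itself under its config keys and is rebuilt on the mapper/reducer side,
// so no code outside the bean touches the raw keys.
public class ConfigRoundTripSketch {

    static final String NUM_CLUSTERS_KEY = "kmeans.numClusters";
    static final String CLUSTER_PATH_KEY = "kmeans.clusterPath";

    static class KMeansConfig {
        final int numClusters;
        final String clusterPath;

        KMeansConfig(int numClusters, String clusterPath) {
            this.numClusters = numClusters;
            this.clusterPath = clusterPath;
        }

        // Serialize this bean into the (stand-in) Configuration.
        Map<String, String> asConfiguration() {
            Map<String, String> conf = new HashMap<>();
            conf.put(NUM_CLUSTERS_KEY, Integer.toString(numClusters));
            conf.put(CLUSTER_PATH_KEY, clusterPath);
            return conf;
        }

        // Deserialize in the mapper/reducer setup.
        static KMeansConfig fromConfiguration(Map<String, String> conf) {
            return new KMeansConfig(
                Integer.parseInt(conf.get(NUM_CLUSTERS_KEY)),
                conf.get(CLUSTER_PATH_KEY));
        }
    }

    public static void main(String[] args) {
        KMeansConfig original = new KMeansConfig(20, "/clusters/iteration-5");
        KMeansConfig restored =
            KMeansConfig.fromConfiguration(original.asConfiguration());
        System.out.println(restored.numClusters + " " + restored.clusterPath);
        // prints: 20 /clusters/iteration-5
    }
}
```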

Isabel Drost-Fromm
added a comment - 28/Feb/11 11:03

Robin: putting my Apache hat on - I know how easy GitHub makes collaboration, but it would be nice to keep development inside our project. So until Apache supports git r/w access, I was only wondering whether an svn branch would provide any benefit ...

Ted Dunning
added a comment - 28/Feb/11 16:05

Isabel,

I find that keeping large patches up to date with only an SVN branch is infeasible. Thus, I opt to use git privately and use the SVN interface to push changes back to SVN when committing.

Once I am doing that, why not share my git repository so that others can comment on the work in progress? I still will have to be careful about what code I incorporate, but that is the responsibility of a committer in any case.

Hopefully, this should become irrelevant soon since Apache is making rapid progress on supporting git.

Frank Scholten
added a comment - 28/Feb/11 22:06

I started a MAHOUT-612 branch at https://github.com/frankscholten/mahout/tree/MAHOUT-612 and added the K-Means v2 and Canopy patches.

Robin: OK, I'll look into the serialization issue for K-Means and Canopy next.

Frank Scholten
added a comment - 11/Mar/11 15:30

I added K-Means config serialization code at https://github.com/frankscholten/mahout/tree/MAHOUT-612 - see the 'kmeans-serialization' tag.

Robin: is this close to what you had in mind?

Robin Anil
added a comment - 22/Mar/11 18:45

Sorry about the late reply.

The serialize and deserialize methods can be made stricter using an interface, and maybe renamed to make them explicit. Maybe an interface with "SerializeInConfiguration() GetFromConfiguration()" methods, to make things strictly uniform.

It looks good otherwise.

A small nit: the autogenerated Equals() and Hashcode() are OK, but do you see them being used? As keys in HashMaps? You can choose to ignore them if you wish (or throw an exception). IMO a config object's primary purpose is to serialize and deserialize its members.

Robin

Sean Owen
added a comment - 04/Apr/11 08:34

May I start submitting the patches? Are the "v2" and "canopy" patches ready to go?

What's the thinking about whether this will be considered done by 0.5 in a few weeks, or should it be an ongoing piece of work for the next release?

Frank Scholten
added a comment - 04/Apr/11 10:18

These patches are outdated. This will indeed be an ongoing piece of work and won't be done by 0.5.

What's the thinking on how to include the work? Robin said: "But, before committing I will wait for the full change to all Jobs so that code is not in un-even state."

I would prefer to submit a patch per job configuration: one for K-means, one for Canopy, and so on. The code will be in an uneven state, true, but this will save a lot of effort merging changes from trunk later on, considering how actively Mahout is being developed.

Sean Owen
added a comment - 04/Apr/11 10:56

I think one big patch is preferable. It avoids the risk that the patch can't be completed for some reason, and it should be about as much work, including merge conflicts. But I don't think it's a big deal if you'd like to do it piece by piece too.

Ian Helmke
added a comment - 08/Jul/11 20:50

Frank, are you still making changes here? Benson and I are looking to continue/complete the beanification of these jobs. Just wondering if you'd made any progress on it.

Frank Scholten
added a comment - 08/Jul/11 23:10

Cool that you're interested! I recently rebased my changes locally on a Mahout 0.5 branch, and I'm making the Canopy configuration consistent with the KMeans configuration w.r.t. serialization and coding style. This is taking some time as I'm fixing a bunch of Canopy unit tests. I will push this to GitHub soon.

After this I think it's important that SparseVectorsToSequenceFile is refactored, since it's almost always needed for clustering jobs.

Frank Scholten
added a comment - 12/Jul/11 23:06

Just pushed a new branch to GitHub, https://github.com/frankscholten/mahout/tree/MAHOUT-612-0.5 , rebased on 0.5 with one commit containing both the KMeans and Canopy config. Next up is SparseVectorsFromSequenceFiles.

Frank Scholten
added a comment - 31/Oct/11 20:50

Putting this in the backlog for now. As much as I like the idea of this patch, improving parts of the clustering code I work with regularly has a higher priority for me. So far the kmeans, canopy and seq2sparse jobs have been refactored to have a bean configuration. If you want to help with this, check out the GitHub repo at https://github.com/frankscholten/mahout/tree/MAHOUT-612-0.5

Grant Ingersoll
added a comment - 08/Nov/11 12:27

It seems like we shouldn't have to wait for the whole thing to be done on this. Forward progress towards where we want to go is better than no progress.