Details

Description

A snippet from the javadoc gives the idea:

/**
* General-purpose driver class for Mahout programs. Utilizes org.apache.hadoop.util.ProgramDriver to run
* main methods of other classes, but first loads up default properties from a properties file.
*
* Usage: run on Hadoop like so:
*
* $HADOOP_HOME/bin/hadoop -jar path/to/job org.apache.mahout.driver.MahoutDriver [classes.props file] shortJobName \
* [default.props file for this class] [over-ride options, all specified in long form: --input, --jarFile, etc]
*
* TODO: set the Main-Class to just be MahoutDriver, so that this option isn't needed?
*
* (note: using the current shell script, this could be modified to be just
* $MAHOUT_HOME/bin/mahout [classes.props file] shortJobName [default.props file] [over-ride options]
* )
*
* Works like this: by default, the file "core/src/main/resources/driver.classes.props" is loaded, which
* defines a mapping between short names like "VectorDumper" and fully qualified class names. This file may
* instead be overridden on the command line by having the first argument be some string of the form *classes.props.
*
* The next argument to the Driver is supposed to be the short name of the class to be run (as defined in the
* driver.classes.props file). After this, if the next argument ends in ".props" / ".properties", it is taken to
* be the file to use as the default properties file for this execution, and key-value pairs are built up from that:
* if the file contains
*
* input=/path/to/my/input
* output=/path/to/my/output
*
* Then the class which will be run will have its main called with
*
* main(new String[] { "--input", "/path/to/my/input", "--output", "/path/to/my/output" });
*
* After all the "default" properties are loaded from the file, any further command-line arguments are taken in,
* and over-ride the defaults.
*/

Could be cleaned up, as it's kinda ugly with the whole "file named in .props" convention, but it gives the idea. It really helps cut down on repetitive long command lines, and lets defaults be put in props files instead of locked into the code.

Activity


Drew Farris
added a comment - 20/Feb/10 17:26

This is pretty nice; it gets to the point where relying on shell history or ad-hoc mechanisms to manage command lines kills me, and this is a nice solution.

I've quickly skimmed the patch but haven't tried it out yet. I see the TODO in there regarding short vs. long arguments. Do you have any thoughts on how to support single-dash arguments? Things like the arguments supported by GenericOptionsParser could be set in the properties file too.

Jake Mannix
added a comment - 20/Feb/10 19:07

The TODO refers to an issue that I think is there, but am not sure about: what does GenericOptionsParser do if you have a command line input like this:

programName --input foo.txt -i bar.txt

where --input is the long argument name for -i as the short name? Which one wins? Is it deterministic?


Drew Farris
added a comment - 20/Feb/10 21:54

{blockquote}
What does GenericOptionsParser do if you have a command line input like this:

programName --input foo.txt -i bar.txt

where --input is the long argument name for -i as the short name? Which one wins? Is it deterministic?
{blockquote}

In most cases it really depends on the implementation; sometimes GenericOptionsParser isn't even being used. In Mahout's case it's likely to be commons-cli2 that's actually doing the parsing, and I don't know how it would behave in this case. I'll take a look.

GenericOptionsParser simply handles things like -conf and -Dprop=value that control hadoop configurations, job settings and the like, and then hands the rest back to the caller. In many cases in Mahout, GenericOptionsParser isn't used at all, which reduces the control one has over a job's behavior. IIRC, Sean and Robin have made some progress towards eliminating these cases with the AbstractJob class.


Jake Mannix
added a comment - 20/Feb/10 22:17

So this current patch will totally take -conf / -Dprop=value type stuff and pass it directly on into the program in the usual way, with the only difference being that these arguments could also be in a properties file, as long as they're using the exact same form, which would make for ugly props files as is:

if you wanted to not have to type:

$MAHOUT_HOME/bin/mahout myClassShortName -DmyProp=value

you would currently need to have, in your props file:

DmyProp = value

which looks kinda silly, but would work. Oh wait, no it wouldn't: it would end up with a command line which would do "-DmyProp value", not "-DmyProp=value". To get the latter, we'd need an even uglier thing with the current patch:

"DmyProp=value"=

which would get interpolated into -DmyProp=value on the internal command line. Super ugly.

I've got a modified version of this I can upload in a bit which takes care of the short-name/long-name arguments thing by a bit of a kludge, with props files which would look like this:

i | input = foo/path

which is to be interpreted as: if on the command line the user says "-i bar/path" OR "--input baz/path", they override the "foo/path" in the props file. If the line in the props file has no "|" separating two options, it's assumed to be prepended with "--".

Still doesn't remove the ugliness of -Dprop=value, though. Not sure how best to handle that one. What kind of props file syntax would tell it "take these key-value pairs and do '-key value', and do these other ones as '-Dkey=value'"? I guess just having the 'D' there would be a good signal? It could then just take

i | input = foo/path
DmyProp = propValue

and translate that into a command line like: progName -i foo/path -DmyProp=propValue

That would work and be not completely, horribly ugly. Not great, though.
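The props-to-argv translation sketched in the comment above could look roughly like the following. This is a hypothetical illustration, not the actual patch code; the class and method names (PropsToArgvSketch, toArgv) are invented for this sketch, and the "D" prefix is used as the signal for -Dkey=value options, exactly as proposed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the props-to-argv translation discussed above;
// not the actual MahoutDriver code.
public class PropsToArgvSketch {

  // Each entry maps a props-file key like "i | input" or "DmyProp"
  // to its value; returns the argv that would be passed to main().
  public static List<String> toArgv(Map<String, String> props) {
    List<String> argv = new ArrayList<>();
    for (Map.Entry<String, String> e : props.entrySet()) {
      String key = e.getKey().trim();
      String value = e.getValue().trim();
      if (key.startsWith("D")) {
        // "DmyProp = value" becomes the single token "-DmyProp=value"
        argv.add("-" + key + "=" + value);
      } else {
        // "i | input = foo" is passed using the long name: "--input foo";
        // a key with no "|" is its own long name.
        String[] names = key.split("\\|");
        String longName = names[names.length - 1].trim();
        argv.add("--" + longName);
        argv.add(value);
      }
    }
    return argv;
  }
}
```

With the example props from the comment ("i | input = foo/path", "DmyProp = propValue"), this would produce ["--input", "foo/path", "-DmyProp=propValue"].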


Jake Mannix
added a comment - 21/Feb/10 06:01

Better version. Javadocs updated in the patch to reflect the way it works:

/**
* General-purpose driver class for Mahout programs. Utilizes org.apache.hadoop.util.ProgramDriver to run
* main methods of other classes, but first loads up default properties from a properties file.
*
* Usage: run on Hadoop like so:
*
* $HADOOP_HOME/bin/hadoop -jar path/to/job org.apache.mahout.driver.MahoutDriver \
* [--classesFile|-cf <file>] [--defaultsFile|-df <file>] shortJobName [over-ride opts]
*
* or for local running:
*
* $MAHOUT_HOME/bin/mahout run [--classesFile|-cf <file>] [--defaultsFile|-df <file>] shortJobName [over-ride opts]
*
* Works like this: by default, the file "core/src/main/resources/driver.classes.props" is loaded, which
* defines a mapping between short names like "VectorDumper" and fully qualified class names. This file may
* instead be overridden on the command line by specifying --classesFile|-cf <classesFile>.
*
* The default properties to be applied to the program run are pulled, by default, from
* "core/src/main/resources/<shortJobName>.props", unless --defaultsFile|-df <file> is specified on the cmdline.
* The format of the default properties files is as follows:
*
* i|input = /path/to/my/input
* o|output = /path/to/my/output
* m|jarFile = /path/to/jarFile
* # etc - each line is shortArg|longArg = value
*
* The next argument to the Driver is supposed to be the short name of the class to be run (as defined in the
* driver.classes.props file).
*
* Then the class which will be run will have its main called with
*
* main(new String[] { "--input", "/path/to/my/input", "--output", "/path/to/my/output" });
*
* After all the "default" properties are loaded from the file, any further command-line arguments are taken in,
* and over-ride the defaults.
*
* So if your core/src/main/resources/driver.classes.props looks like so:
*
* org.apache.mahout.utils.vectors.VectorDumper = "vecDump"
*
* and you have a file core/src/main/resources/vecDump.props which looks like
*
* o|output = /tmp/vectorOut
* s|seqFile = /my/vector/sequenceFile
*
* and you execute the command line:
*
* $MAHOUT_HOME/bin/mahout run vecDump -s /my/otherVector/sequenceFile
*
* then org.apache.mahout.utils.vectors.VectorDumper.main() will be called with arguments:
* { "--output", "/tmp/vectorOut", "-s", "/my/otherVector/sequenceFile" }
*/
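The short-name lookup and dispatch described in this javadoc could be sketched as follows. This is a hypothetical illustration, not the actual MahoutDriver implementation: the class name (ShortNameSketch) and both method names are invented here, and the reflective main() invocation mimics ProgramDriver-style dispatch under that assumption.

```java
import java.lang.reflect.Method;
import java.util.Properties;

// Hypothetical sketch of the short-name resolution and dispatch
// described in the javadoc above; not the actual MahoutDriver code.
public class ShortNameSketch {

  // driver.classes.props maps fully qualified class name -> "shortName";
  // find the class whose (possibly quoted) short name matches.
  public static String resolveClassName(Properties classes, String shortName) {
    for (String className : classes.stringPropertyNames()) {
      String value = classes.getProperty(className).replace("\"", "").trim();
      if (value.equals(shortName)) {
        return className;
      }
    }
    throw new IllegalArgumentException("Unknown short name: " + shortName);
  }

  // Invoke the resolved class's main(String[]) reflectively, the way a
  // ProgramDriver-style dispatcher would hand off control.
  public static void runMain(String className, String[] args) throws Exception {
    Class<?> clazz = Class.forName(className);
    Method main = clazz.getMethod("main", String[].class);
    main.invoke(null, (Object) args);
  }
}
```

For example, resolving "vecDump" against a props entry `org.apache.mahout.utils.vectors.VectorDumper = "vecDump"` would return the fully qualified class name, which runMain would then load and invoke.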


Jake Mannix
added a comment - 21/Feb/10 06:06

This patch modifies the mahout shell script to add the "run" command, which invokes this driver class.

It also more nicely takes shortName definitions from either core/src/main/resources/driver.classes.props or the "-cf configFile" location, and runs the class specified by shortName using props specified in core/src/main/resources/shortName.props or whatever is given as "-df defaultPropsFile".

Also takes options in the file of the form "DsomeOpt = optionVal" and passes those into the program as "-DsomeOpt=optionVal" as well.

Not sure how well it works on Hadoop yet, but the command line seems to work for the one class I've got a props file for (TestClassifier).

Jake Mannix
added a comment - 22/Feb/10 22:19

Fancy new version. Run as follows:

Set your $MAHOUT_CONF_DIR to a directory where you will have your own overrides (or, if unset, it defaults to ./core/src/main/resources). In that directory, there should be a file called "driver.classes.props" with contents like so:

org.apache.mahout.utils.vectors.VectorDumper = "vecDump"
org.apache.mahout.utils.clustering.ClusterDumper = "clusty"
org.apache.mahout.utils.SequenceFileDumper = "seqDump"
org.apache.mahout.clustering.kmeans.KMeansDriver = "kmeans"
org.apache.mahout.clustering.canopy.CanopyDriver = "canopy"
org.apache.mahout.utils.vectors.lucene.Driver = "luceneVecs"
org.apache.mahout.text.SequenceFilesFromDirectory = "dirToSeq"
org.apache.mahout.text.WikipediaToSequenceFile = "wikToSeq"
org.apache.mahout.classifier.bayes.TestClassifier = "TestClassifier"

Etc. The right-hand side can be whatever you want, but whatever it is determines where MahoutDriver will look for a default properties file. For example:

$MAHOUT_HOME/bin/mahout run wikToSeq

would look for the file $MAHOUT_CONF_DIR/wikToSeq.props and, in that file, take each line and transform it into command-line arguments for WikipediaToSequenceFile, using logic as follows: on each line of wikToSeq.props, there is a key-value pair:

i | input = my/wiki/input/path
o | output = my/output/path
c | categories = my/wikiCategories/file
e | exactMatch = true
all = true

The part of the key before the vertical bar is the short name of the argument to pass, and the second part is the long name. If there is only one, they are assumed to be the same.

You can also pass Hadoop options here, like

Djava.io.tmpdir = /var/tmp/mahout

which would lead to the program being called with "-Djava.io.tmpdir=/var/tmp/mahout" passed in.

Jake Mannix
added a comment - 22/Feb/10 22:22

Oh, I forgot to finish my sentence which began "run as follows...": once you've got default property files in your $MAHOUT_CONF_DIR, you can run like so:

$MAHOUT_HOME/bin/mahout run wikToSeq

and that's it. If you want to override the options in your wikToSeq.props file, just pass them in on that same command line above, and they override as desired.

If this can be tested out and debugged, this patch is ready for committing, and significantly improves the command line experience.
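The override behavior described here (defaults from the props file, command-line flags winning when they match either the short or the long name) could be sketched like this. It is a hypothetical illustration, not the patch code; OverrideSketch and merge are names invented for the sketch, and cliArgs is assumed to be simple alternating flag/value pairs.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the override step discussed above: defaults come
// from the props file, then command-line flags replace any matching default.
public class OverrideSketch {

  // defaults: "short|long" key -> default value, e.g. "s|seqFile" -> "/my/file".
  // cliArgs: alternating flag/value pairs, e.g. ["-s", "/other/file"].
  public static Map<String, String> merge(Map<String, String> defaults, List<String> cliArgs) {
    Map<String, String> merged = new LinkedHashMap<>(defaults);
    for (int i = 0; i + 1 < cliArgs.size(); i += 2) {
      String flag = cliArgs.get(i).replaceFirst("^-+", "");  // strip "-" or "--"
      for (String key : defaults.keySet()) {
        for (String name : key.split("\\|")) {
          if (name.trim().equals(flag)) {
            merged.put(key, cliArgs.get(i + 1));  // command line wins over the default
          }
        }
      }
    }
    return merged;
  }
}
```

With the vecDump example from the javadoc, merging `{"o|output": "/tmp/vectorOut", "s|seqFile": "/my/vector/sequenceFile"}` with `["-s", "/my/otherVector/sequenceFile"]` keeps the output default and replaces the seqFile value.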


Robin Anil
added a comment - 22/Feb/10 22:31

The help comments are missing from the mahout/bin script. Scroll up that file and you will see a pretty-printed help string. Just add the MahoutDriver description and possibly a wiki link there. Otherwise looks good to commit. I haven't checked the full functionality yet; if anyone else wants to take a look, please do so quickly.

Drew Farris
added a comment - 23/Feb/10 04:08

Did some testing; here's a patch to clean some of these things up, plus a couple of questions:

Could we load the default driver.classes.props from the classpath? If it was loaded that way, the default would work regardless of where the mahout script is run from (it currently only works if ./bin/mahout is run, not ./mahout, for example) and regardless of whether we're running from a binary release or the dev environment. (included in patch)

Something else I noticed is that the 'mahout' script doesn't add the classes in $MAHOUT_HOME/lib/*.jar to the classpath. This breaks the binary release in that it can't run anything, e.g.:

./mahout vectordump
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/cli2/OptionException
Caused by: java.lang.ClassNotFoundException: org.apache.commons.cli2.OptionException

(fixed in patch)

Using -core in the context of a dev build should work properly, but leaving out -core will cause the script to error unless run in the context of a release; this is the way it should work, right?

Also wondering what the purpose of adding the job jars to the classpath is? (removed in patch)

Also added a help message for the 'run' argument.

Does executing './mahout run --help' hang for anyone else, or is it something specific to my environment? (didn't track this one down)

Robin Anil
added a comment - 23/Feb/10 07:17

Including the job jar is much cleaner than adding all deps. Plus, there is nothing more to configure to execute it on top of Hadoop.

BTW, how is Hadoop execution done using the shell script? i.e.

hadoop jar mahout-examples-0.3.job o.a.m...DictionaryVectorizer --input ..... args


Drew Farris
added a comment - 23/Feb/10 13:38

{blockquote}
including the job jar is much cleaner than adding all deps. Plus there is nothing more to configure to execute it on top of hadoop..
{blockquote}

The job files work fine with 'hadoop jar', but putting the job files on the classpath will not automatically include the dependencies they contain (e.g. commons-cli2) on the classpath: the dependencies need to be added separately (see the ClassNotFoundException case described above).

{blockquote}
BTW. How is hadoop execution done using shell script?
{blockquote}

If HADOOP_CONF_DIR is set, it should be picked up by the jobs, but I don't think that means jar/jobfile execution works properly. I suspect this needs modifications to make that possible.


Drew Farris
added a comment - 23/Feb/10 14:04

{blockquote}
BTW. How is hadoop execution done using shell script? i.e.
{blockquote}

It looks like something like the following would do the trick:

/bin/mahout -core org.apache.hadoop.util.RunJar /path/to/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver TestClassifier

We could probably provide a 'runjob' case that appends 'org.apache.hadoop.util.RunJar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver', but perhaps this could be used in every case that 'run' is called?


Jake Mannix
added a comment - 23/Feb/10 17:54

Hey Drew, thanks for looking at this. The problems you saw are probably what are known as "bugs".

{blockquote}
Did some testing, here's a patch to clean some of these things up + a couple questions:
Could we load the default driver.classes.props from the classpath? If it was loaded that way the default would work regardless of where the mahout script is run from (it currently only works if ./bin/mahout is run, not ./mahout for example) and regardless of whether we're running from a binary release or the dev environment. (included in patch)
{blockquote}

YES! We should indeed load from the classpath. My most recent version of this patch (which isn't posted, because it conflicts with yours; I'm trying to resolve that now) changes it so that you just supply a single directory in which driver.classes.props and the shortName.props files are located.

{blockquote}
Something else I noticed is that the 'mahout' script doesn't add the classes in $MAHOUT_HOME/lib/*.jar to the classpath. This breaks the binary release in that it can't run anything, e.g.:
./mahout vectordump
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/cli2/OptionException
Caused by: java.lang.ClassNotFoundException: org.apache.commons.cli2.OptionException
(fixed in patch)
{blockquote}

This wasn't a problem with my patch, right? That was an issue of the mahout script in trunk itself?

{blockquote}
Using -core in the context of a dev build should work properly, but leaving out -core will cause the script to error unless run in the context of a release - this is the way it should work, right?
{blockquote}

What is the -core option for? I've never used it; how does it work?

{blockquote}
Also added a help message for the 'run' argument.
{blockquote}

Where did you add that?

{blockquote}
Does executing './mahout run --help' hang for anyone else or is it something specific to my environment? (didn't track this one down)
{blockquote}

The --help option I didn't have in there; you added it. Do you know where it's hanging?


Jake Mannix
added a comment - 23/Feb/10 18:08

Ok, Drew, got your patch in diff mode against mine finally.

So you already added the ability to load via classpath, right? If we merge that way of thinking with what I'm currently working on (having a configurable "MAHOUT_CONF_DIR" which is used for all these props files), we could just have the mahout shell script add MAHOUT_CONF_DIR to the classpath (the way you already have it adding the hardwired core/src/main/resources directory), and then it would work that way.

New patch merging yours with mine forthcoming.
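Loading driver.classes.props from the classpath, as proposed above, could look roughly like this. A hypothetical sketch, not the patch code: the class and method names are invented, and it assumes the convention of returning an empty Properties when the resource is absent so the caller can fall back to another location.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical sketch of loading driver.classes.props via the classpath,
// as discussed above; not the actual patch code.
public class ClasspathPropsSketch {

  // Returns the parsed properties, or an empty Properties if the resource
  // is not on the classpath (the caller can then fall back to a file path).
  public static Properties load(String resourceName) throws IOException {
    Properties props = new Properties();
    InputStream in = Thread.currentThread()
        .getContextClassLoader()
        .getResourceAsStream(resourceName);
    if (in != null) {
      try {
        props.load(in);
      } finally {
        in.close();
      }
    }
    return props;
  }
}
```

With MAHOUT_CONF_DIR on the classpath ahead of core/src/main/resources, a plain `load("driver.classes.props")` would then pick up the user's overrides first, which is the merged behavior the two comments above converge on.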


Drew Farris
added a comment - 23/Feb/10 18:42

{blockquote}
This wasn't a problem with my patch, right? That was an issue of the mahout script in trunk itself?
{blockquote}

Yes, it was a problem with the script in trunk. I believe this was due to the fact that the job files were on the classpath instead of all of the dependency jars. Adding the job files to the classpath does not add the dependency jars they contain to the classpath as well. So, no, you didn't add this, but it should be fixed (and is in the patch).

{blockquote}
What is the -core option for? I've never used it, how does it work?
{blockquote}

When you're running bin/mahout in the context of a build, the -core option is used to tell it to use the build classpath instead of the classpath used for a binary release. This just follows the pattern established (by Doug?) in the hadoop and nutch launch scripts.

{blockquote}
Also added a help message for the 'run' argument.
{blockquote}

Near line 72 in bin/mahout (this is different from the --help question I had):

echo "  seq2sparse  generate sparse vectors from a sequence file"
echo "  vectordump  dump vectors from a sequence file"
echo "  run         run mahout tasks using the MahoutDriver, see: http://cwiki.apache.org/MAHOUT/mahoutdriver.html"

{blockquote}
So you already added the ability to load via classpath, right? If we merge that way of thinking with what I'm currently working on (having a configurable "MAHOUT_CONF_DIR" which is used for all these props files), we could just have the mahout shell script just add MAHOUT_CONF_DIR to the classpath (the way you already have it adding the hardwired core/src/main/resources directory) and then it would work that way.
{blockquote}

Yep, that should do it; as long as MAHOUT_CONF_DIR appears before src/main/resources, we should be good to go. It should be added outside of the section of the script that determines if -core has been specified on the command line.


Jake Mannix
added a comment - 23/Feb/10 20:22

{blockquote}
Something else I noticed is that the 'mahout' script doesn't add the classes in $MAHOUT_HOME/lib/*.jar to the classpath. This breaks the binary release in that it can't run anything, e.g.:
Also wondering what the purpose of adding the job jars to the classpath is? (removed in patch)
{blockquote}

When I run locally now, not using -core, I get this failure:

/bin/mahout vectordump -s wiki-sparse-vectors-out/vectors/part-00000
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/mahout/utils/vectors/VectorDumper

This appears to be because your patch has CLASSPATH set to add on things like $MAHOUT_HOME/mahout-*.jar, which doesn't exist after I've done "mvn install". Is there another maven target I need to use to generate the release jars in $MAHOUT_HOME?


Drew Farris
added a comment - 23/Feb/10 20:28

Jake, the basic idea is that you would always use -core when executing from within a build, but you would not use -core when executing in the context of a binary release.

The binary release, built using mvn -Prelease, lands in target/mahout-0.3-SNAPSHOT.tar.gz; untar that and try running bin/mahout from the directory that's created, and that should work fine without -core.

Jake Mannix added a comment - 23/Feb/10 20:54

Jake, the basic idea is that you would always use -core when executing from within a build, but you would not use -core when executing in the context of a binary release.

Hmm... ok. I'm a little reticent about running -core when testing, because I'm not really testing what the release run will be like - I like the idea of having a single set of dependencies (jars, not classes directories) which are used locally, and the .job when hitting a remote hadoop cluster. Maybe I'm just not familiar with the -core option and its use.

So far, I've always run by this process:

1. make code/config changes
2. run "mvn clean install" (sometimes with -DskipTests if I'm doing rapid iterations)
3. run "mahout <command> args", OR: hadoop jar examples/target/mahout-examples-{version}.job <classname> args

The last step, as you've noted, is because I'm not sure that the script actually lets HADOOP_CONF_DIR get passed through the mahout shell script when running on the hadoop cluster - but maybe that's just a config issue in my case? It also means that the default properties idea still doesn't work on hadoop, unless the default properties files are pushed to the classpath.

Maybe a kludgey way to do it would be for the script to grab the properties files from MAHOUT_CONF_DIR, unzip the release job jar, push them into it, re-jar it back up, and then give it to hadoop - then those files would be available on the classpath of the running job on the remote cluster.

What is the right way to run a job with some additional (runtime) files added to the job's classpath? Is there some cmdline arg to "hadoop" that I'm forgetting?

Drew Farris added a comment - 23/Feb/10 21:12

Hmm... ok. I'm a little reticent about running -core when testing, because I'm not really testing what the release run will be like - I like the idea of having a single set of dependencies (jars, not classes directories) which are used locally, and the .job when hitting a remote hadoop cluster. Maybe I'm just not familiar with the -core option and its use.

Ahh, I see where you're coming from - so without -core, you're suggesting that mahout pick up the jar files in the target directories if they exist? I think it is fine to modify the non-core classpath to include these; they won't be present in the release build anyway.

The last step, as you've noted, is because I'm not sure that the script actually lets HADOOP_CONF_DIR get passed through the mahout shell script when running on the hadoop cluster - but maybe that's just a config issue in my case? It also means that the default properties idea still doesn't work on hadoop, unless the default properties files are pushed to the classpath.

Are any of the default properties files used beyond MahoutDriver, which executes locally and sets up the job? Do these files need to be distributed to the rest of the cluster? As noted above, I think the proper way to run MahoutDriver in the context of a distributed job is to do something like:

./bin/mahout org.apache.hadoop.util.RunJar /path/to/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver TestClassifier

I suspect we could easily modify the mahout script and shorten this to:

./bin/mahout runjob TestClassifier

I can look at this a little closer tonight, so if you have an updated patch for me to work on/test in a few hours, definitely post it. I'd be happy to make any changes you're interested in.

What is the right way to run a job with some additional (runtime) files added to the job's classpath? Is there some cmdline arg to "hadoop" that I'm forgetting?

FWIW, GenericOptionsParser (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/GenericOptionsParser.html) provides a way to do this with -files, -libjars and -archives.

Jake Mannix added a comment - 23/Feb/10 22:12

Ahh, I see where you're coming from - so without -core, you're suggesting that mahout pick up the jar files in the target directories if they exist? I think it is fine to modify the non-core classpath to include these; they won't be present in the release build anyway.

Cool, yeah, that makes sense.

Are any of the default properties files used beyond MahoutDriver, which executes locally and sets up the job? Do these files need to be distributed to the rest of the cluster? As noted above, I think the proper way to run MahoutDriver in the context of a distributed job is to do something like:

./bin/mahout org.apache.hadoop.util.RunJar /path/to/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver TestClassifier

I suspect we could easily modify the mahout script and shorten this to:

./bin/mahout runjob TestClassifier

Cool, so why not just check to see if $HADOOP_CONF_DIR is set - if it is, do "runjob" as described; if it's not, do "run" locally.

FWIW, GenericOptionsParser provides a way to do this with -files, -libjars and -archives.

Now of course, I guess I don't really need the files to get onto the job's classpath on the cluster - they just need to be on the classpath of the locally running JVM which is invoking MahoutDriver.main(). So I was doing more work than was necessary. This is easy to do: just add MAHOUT_CONF_DIR to the classpath and we're good to go.

Drew Farris added a comment - 23/Feb/10 22:23

Cool, so why not just check to see if $HADOOP_CONF_DIR is set - if it is, do "runjob" as described; if it's not, do "run" locally.

Yes, ok - that should work, because I believe you can use RunJar to launch anything even if it isn't a mapreduce job. No need for classpath setup in this case either - all you need to do is point to the examples job. We might be able to take advantage of this elsewhere.

Drew Farris added a comment - 24/Feb/10 03:47

It doesn't appear that the following command works as intended:

./bin/mahout org.apache.hadoop.util.RunJar /path/to/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver TestClassifier

The following seems to be the appropriate way to achieve what we're trying to do here:

hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver TestClassifier

Any thoughts on whether it makes sense to work the latter form into the mahout script? It won't pull the necessary config files for MahoutDriver in from a path outside of the job file unless HADOOP_CLASSPATH is set to include those directories, but I haven't had a chance to verify that.

Jake Mannix added a comment - 24/Feb/10 03:48

Ok, new patch.

This one works in one of two ways. If you have $MAHOUT_CONF_DIR defined (there are some dummy files living in the newly created top-level directory "conf", moving away from core/src/main/resources), then you can just run:

$MAHOUT_HOME/bin/mahout run svd

and it should read your properties in $MAHOUT_CONF_DIR/svd.props and run (locally).

The other way it can work (and actually does, at least on my setup) is running on hadoop:

$HADOOP_HOME/bin/hadoop jar path/to/mahout.job org.apache.mahout.driver.MahoutDriver svd

And again, $MAHOUT_CONF_DIR/svd.props is read locally before the job is launched off to the hadoop cluster.

I have not yet been able to get the shell script to automagically issue RunJar as the command, passing MahoutDriver and the remaining args after it, so that you would never need to run hadoop's shell script at all - although that would be great to have working.

Also not yet in this patch: actually defaulting MAHOUT_CONF_DIR to the correct place in both dev mode and release mode; and I haven't modified the pom to package up the new conf dir and put it in the distribution.
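The defaults-then-overrides behavior described above (a .props file expanded into long-form arguments, with command-line options winning over file defaults) can be sketched with nothing but the JDK. This is an illustrative sketch, not MahoutDriver's actual code; the method name buildArgs and the merge-by-option-name strategy are my assumptions:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

public class PropsToArgs {
  // Builds main()-style args from a props file, letting command-line
  // option/value pairs override any defaults with the same option name.
  static String[] buildArgs(Reader propsSource, String[] overrides) throws IOException {
    Properties props = new Properties();
    props.load(propsSource);
    // TreeMap just gives a stable (sorted) argument order for this sketch
    Map<String, String> merged = new TreeMap<>();
    for (String key : props.stringPropertyNames()) {
      merged.put("--" + key, props.getProperty(key));
    }
    // command-line pairs like "--output /other/path" override the defaults
    for (int i = 0; i + 1 < overrides.length; i += 2) {
      merged.put(overrides[i], overrides[i + 1]);
    }
    List<String> args = new ArrayList<>();
    for (Map.Entry<String, String> e : merged.entrySet()) {
      args.add(e.getKey());
      args.add(e.getValue());
    }
    return args.toArray(new String[0]);
  }

  public static void main(String[] argv) throws IOException {
    String svdProps = "input=/path/to/my/input\noutput=/path/to/my/output\n";
    String[] args = buildArgs(new StringReader(svdProps),
        new String[] {"--output", "/overridden/output"});
    // prints: --input /path/to/my/input --output /overridden/output
    System.out.println(String.join(" ", args));
  }
}
```

The class being driven would then simply have its main() called with the merged array, exactly as the javadoc in this issue describes.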

Jake Mannix added a comment - 24/Feb/10 04:03

Our comments crossed in the ether!

Any thoughts on whether it makes sense to work the latter form into the mahout script? It won't pull the necessary config files for MahoutDriver in from a path outside of the job file unless HADOOP_CLASSPATH is set to include those directories, but I haven't had a chance to verify that.

You're right - I did indeed set my HADOOP_CLASSPATH to include $MAHOUT_CONF_DIR, which allowed this to work; otherwise it would not. This should be done by the script.

Ideally, yes - it's ugly, but if $MAHOUT_HOME/bin/mahout just sets $HADOOP_CLASSPATH to include $MAHOUT_CONF_DIR (or $MAHOUT_HOME/conf if that variable is not set) and then executes $HADOOP_HOME/bin/hadoop jar ..., it should work.

Jake Mannix added a comment - 24/Feb/10 06:33

Ok, now we're getting somewhere. This one a) has the ability to properly handle "mahout run -h" or "mahout run --help", helpfully spitting out the list of classes with shortNames which MahoutDriver has been told about in driver.classes.props, and b) more importantly, it can, in both a release environment and a dev environment, do:

./bin/mahout run kmeans [options]

If $MAHOUT_CONF_DIR is set and points to a place with the right files, then the default properties are loaded from there (overridden by the [options] given above).

If both $HADOOP_HOME and $HADOOP_CONF_DIR are set, then this actually prepends $MAHOUT_CONF_DIR to $HADOOP_CLASSPATH, so that the following:

$HADOOP_HOME/bin/hadoop jar [path to examples.job] o.a.m.driver.MahoutDriver kmeans [options]

actually works: it gets the default properties loaded and overridden as necessary, running your job on the hadoop cluster.

If one of those variables is not specified (TODO: if $HADOOP_HOME is specified but $HADOOP_CONF_DIR is not, guess a default of $HADOOP_HOME/conf, I suppose), then the assumption is to run locally.

Previous behavior still works, from what I can tell - you can still do:

$MAHOUT_HOME/bin/mahout kmeans --output kmeans/out --input input/vecs -k 13 --clusters tmp/foobar

so we're backwards compatible with the old way.

Now the question is: do we want to be? Or do we want to trim down the shell script to always use MahoutDriver, get rid of all of the 'elif [ "$COMMAND" =' stuff, and just have $CLASS be MahoutDriver, passing it $COMMAND as the first argument?

Then the command line would be exactly the same as before, except you could also load up your $MAHOUT_CONF_DIR/<shortName>.props files with whatever defaults you wanted to use.
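The "mahout run -h" listing described above essentially amounts to reading driver.classes.props as a java.util.Properties and printing the shortName-to-class table. A stdlib sketch of that idea - the two sample entries and the simple shortName=className format here are illustrative, and the real file's format may differ:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

public class DriverHelp {
  // Loads a shortName -> fully qualified class name table in the spirit of
  // driver.classes.props; a TreeMap keeps the help listing alphabetized.
  static TreeMap<String, String> loadTable(Reader source) throws IOException {
    Properties props = new Properties();
    props.load(source);
    TreeMap<String, String> table = new TreeMap<>();
    for (String shortName : props.stringPropertyNames()) {
      table.put(shortName, props.getProperty(shortName));
    }
    return table;
  }

  public static void main(String[] args) throws IOException {
    String sample = "kmeans=org.apache.mahout.clustering.kmeans.KMeansDriver\n"
        + "vectordump=org.apache.mahout.utils.vectors.VectorDumper\n";
    // "mahout run -h" would then print something like:
    for (Map.Entry<String, String> e : loadTable(new StringReader(sample)).entrySet()) {
      System.out.println(e.getKey() + " : " + e.getValue());
    }
  }
}
```

Each entry would also be registered with Hadoop's ProgramDriver so the shortName can be dispatched to the class's main().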

Drew Farris added a comment - 24/Feb/10 12:46

This sounds great. I will take it for a spin when I'm in front of a computer. My take is that the old if/elses in the script are now redundant. As long as one can use MahoutDriver to run both classes that have been aliased to short names and classes specified using the full name, I say let's get rid of them.

Drew Farris added a comment - 24/Feb/10 17:59

Jake, this is looking really great.

Here's a partial patch that includes modifications to bin/mahout and MahoutDriver. It removes the separate 'command' option from the original script and delegates everything to MahoutDriver, so things like the following work:

./mahout testclassifier
./mahout --help

It also sets MAHOUT_CONF_DIR to MAHOUT_HOME/conf if MAHOUT_CONF_DIR is not set, and if no args are specified it prints the same output as --help.

One potential TODO from this would be to launch arbitrary classes when no matching program name is specified, but I need to dig into ProgramDriver to understand how it works before I can contribute something like that.

Hope this is helpful.

Jake Mannix added a comment - 24/Feb/10 18:08

Awesome, Drew - I'll check it out.

One potential TODO from this would be to launch arbitrary classes when no matching program name is specified, but I need to dig into ProgramDriver to understand how it works before I can contribute something like that.

Yeah, I was thinking about that over breakfast - an easy hack is this: while the driver.classes.props file is being read, keep track of whether you've found an exact match on args[0]; once all of driver.classes.props has been read and you haven't found a match, just do a Class.forName(args[0]) and add it to the ProgramDriver with its full name as the "shortName". The rest of the program will then work (and would even still work with default properties files - if you put com.mycompany.MyClass.props in $MAHOUT_CONF_DIR, it'll read that for defaults).

I'll see if I can add that to your patch later today. I think if that's working, we should be looking good to commit and see who else wants to play with it and test it out.
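The breakfast hack above - check the shortName table for an exact match on args[0], and only fall back to Class.forName when nothing matches - can be sketched like this. The names are illustrative (not MahoutDriver's actual code), and JDK classes stand in for Mahout drivers:

```java
import java.util.HashMap;
import java.util.Map;

public class DriverLookup {
  // Resolves a program name: first via the shortName -> class-name table
  // (from driver.classes.props); failing that, treats the name itself as
  // a fully qualified class name.
  static Class<?> resolve(Map<String, String> shortNames, String programName)
      throws ClassNotFoundException {
    String className = shortNames.get(programName);
    if (className == null) {
      // no shortName match: assume the user passed a fully qualified name
      className = programName;
    }
    return Class.forName(className);
  }

  public static void main(String[] args) throws Exception {
    Map<String, String> table = new HashMap<>();
    table.put("list", "java.util.ArrayList");
    System.out.println(resolve(table, "list").getName());              // via shortName
    System.out.println(resolve(table, "java.util.HashMap").getName()); // full-name fallback
  }
}
```

An unknown name that is also not a loadable class would surface as a ClassNotFoundException, which is a reasonable place to print the help listing instead.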

Jake Mannix added a comment - 25/Feb/10 02:51

Ok, new patch, with the modification that indeed you have the ability to just run "$MAHOUT_HOME/bin/mahout <classname> [args]" and it still works. And if <classname>.props exists on the classpath, it'll get used for defaults. w00t, as the kids say.

I've added the conf directory to the patch (you'd not kept it in your patch, Drew). There are a bunch of empty files in there, except some of them have commented-out properties in the right format. cleaneigen.props, for example:

#ci|corpusInput =
#ei|eigenInput =
#o|output =

This helps users see what they can store in here, and in what format.
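Assuming the keys above follow a short|long option-name convention (my reading of "ci|corpusInput" - the short flag paired with the long flag; the actual convention may differ), expanding such a key to its long-form argument is a small amount of parsing:

```java
public class PropKeyParser {
  // Splits a key like "ci|corpusInput" into its long-option form;
  // keys without a '|' are treated as already being the long name.
  static String longOption(String key) {
    int bar = key.indexOf('|');
    String longName = (bar >= 0) ? key.substring(bar + 1) : key;
    return "--" + longName;
  }

  public static void main(String[] args) {
    System.out.println(longOption("ci|corpusInput")); // --corpusInput
    System.out.println(longOption("o|output"));       // --output
    System.out.println(longOption("input"));          // --input
  }
}
```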

Jake Mannix added a comment - 25/Feb/10 02:53

Let's release this - others want to try it out? We need documentation for it too, obviously, but see how it runs on other jobs. It should work on Hadoop too, as this ticket / comment thread indicates.

Drew Farris added a comment - 26/Feb/10 03:49

Had a chance to take this out for a spin tonight. It is working very well. I did some k-means runs using the script, starting with the 20newsgroups collection as text files, both locally and on a cluster. I think it is good to go - can we commit? I'd be happy to handle it if we have sufficient consensus.

There are a couple of modifications I've made to the maven assemblies to include all of this in the binary and source releases properly (adding the conf directory, setting the executable bit on the mahout script, etc). While I was at it, I cleaned up the bin assembly process so that the releases should build faster too. Should I commit those, open another issue, or re-post them as part of this patch?

Jake Mannix added a comment - 26/Feb/10 06:19

Drew, do you have a patch with your last changes? If I can try them out too, to verify that they work on more than one system, I think we can commit this.

Should I commit those, open another issue, or re-post them as part of this patch?

I'd say that should be a separate issue - it should be small enough to mark for 0.3 and commit separately.

Grant Ingersoll added a comment - 26/Feb/10 11:29

Just capturing something longer term here, no need to block anything. One of the things I'd love to have is some basic "experiment management" capabilities. I can imagine a mode where things like input parameters are all written into files and organized along with the output, such that it is easy to keep track of all the different ways things get run over time. Seems like this script with default property files could be part of that solution.

Jake Mannix added a comment - 02/Mar/10 18:11

Checked in a version of this which works; not sure if it had the most updated stuff from Drew in it. I'll check out the MAHOUT-311 patch to see if there's a bit more of the assembly stuff to get in too.