Drew Farris
added a comment - 09/Feb/10 04:25 Moved to utils based on discussion on the dev list. This can be committed as-is, or I can take a closer look at Robin's suggested refactoring (counters, sequence file output) and submit a revised patch later

Robin Anil
added a comment - 08/Feb/10 15:39
In the map/reduce job, try incrementing a counter when you encounter a gram and another counter when you encounter an error, instead of throwing errors.
Make the output a SequenceFile instead of TextOutputFormat?
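A minimal sketch of the counter idea, with a plain enum and map standing in for Hadoop's Counter facility (the 0.19 API's reporter.incrCounter(...)); the counter names are illustrative, not the patch's actual ones:

```java
import java.util.EnumMap;
import java.util.Map;

/**
 * Sketch: instead of throwing on a bad gram, bump a counter and keep going.
 * In the real map/reduce job these would be Hadoop Counters; a plain enum
 * and an EnumMap stand in for them here.
 */
public class GramCounters {
  // Hypothetical counter names; the real job would define its own.
  public enum Count { NGRAM_TOTAL, ERROR }

  private final Map<Count, Long> counters = new EnumMap<>(Count.class);

  public void increment(Count c) {
    counters.merge(c, 1L, Long::sum);
  }

  public long get(Count c) {
    return counters.getOrDefault(c, 0L);
  }

  /** Process one candidate gram: count it, or count the error instead of throwing. */
  public void collect(String gram) {
    if (gram == null || gram.isEmpty()) {
      increment(Count.ERROR);   // previously: throw new IllegalArgumentException(...)
      return;
    }
    increment(Count.NGRAM_TOTAL);
    // ... emit the gram to the output collector here ...
  }
}
```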

Robin Anil
added a comment - 08/Feb/10 15:28 I will just run this and start thinking about integrating this with DictionaryVectorizer. Move it to util and upload the new patch. I will commit first and then start refactoring.

Drew Farris
added a comment - 29/Jan/10 14:49 Hey Drew, I'm not much of a maven guy - what's the maven-foo you use to get this running?
Jake, my mistake, I should have included updated docs when I updated the patch.
I eliminated the code to read plain text files from a directory, so you would need to begin by producing a SequenceFile&lt;Text,Text&gt; (document id, document) as input. Robin's utility in mahout-examples, o.a.m.text.SequenceFilesFromDirectory, can do this. Run the following from the 'examples' directory:
mvn -e exec:java -Dexec.mainClass="org.apache.mahout.text.SequenceFilesFromDirectory" -Dexec.args="--parent (...input directory..) --outputDir (..output directory..) --charset UTF-8"
Once you have the sequence file to use as input, run the following from the 'examples' directory as well.
mvn -e exec:java -Dexec.mainClass="org.apache.mahout.nlp.collocations.llr.CollocDriver" -Dexec.args="--input (..path-to-input..) --output (..path-to-output..) -w"
Once the driver class is run, the collocations will be in (output-directory)/colloc/part-00000 as plaintext. They can be sorted by LLR score using the same sort command I included in the previous comment above (I have a question about this below).
FWIW, I need to re-submit the patch to clean up some of the pom changes now that MAHOUT-215 has been applied.
A couple follow up questions while you're looking at this:
Currently I just dump the results of the second pass to a file, not sorted by LLR score. I could sort by LLR by sending it through another pass with an identity mapper and a reducer but I suspect that's probably pretty inefficient. Is there a better way to sort the output of the second pass by LLR?
If I only wanted to emit the top 1-10% of the collocs (user configurable), how would I tell the reducer to stop emitting results at a certain point (or is yet another pass needed to achieve something like this?)
Would it be better to emit a SequenceFile&lt;LongWritable,Text&gt; instead of a text file as the output from the final pass?
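For what it's worth, the ordering the "sort by LLR" pass would need can be sketched in plain Java. A real map/reduce sort would map (ngram, score) to (score, ngram) and let the shuffle sort by key with a descending comparator; this is just the in-memory equivalent, with illustrative class and field names:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/**
 * In-memory sketch of sorting collocations by descending LLR score,
 * the ordering a score-keyed second map/reduce pass would produce.
 */
public class LlrSort {

  /** An (ngram, LLR score) pair, standing in for the job's output records. */
  public static final class Scored {
    public final String ngram;
    public final double llr;

    public Scored(String ngram, double llr) {
      this.ngram = ngram;
      this.llr = llr;
    }
  }

  /** Order collocations by descending LLR, highest-scoring first. */
  public static List<Scored> sortByLlrDesc(List<Scored> collocs) {
    List<Scored> out = new ArrayList<>(collocs);
    out.sort(Comparator.comparingDouble((Scored s) -> s.llr).reversed());
    return out;
  }
}
```

Taking the top 1-10% could then be a truncation of this sorted order, once the total count is known.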

Jake Mannix
added a comment - 28/Jan/10 08:25 Hey Drew, I'm not much of a maven guy - what's the maven-foo you use to get this running?
mvn clean install
followed by
mvn -e exec:java -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" -Dexec.args=" -input src/test/resources/article --colloc target/colloc --output target/output -w"
? Returns with some classpath sadness for me, probably because I haven't set it up correctly...
Caused by: java.lang.ClassNotFoundException: org.apache.mahout.colloc.CollocDriver

Ted Dunning
added a comment - 24/Jan/10 03:24
Each of the existing attributes are serializable which suggests that there might be a short path to Writable for these, I'll see if Avro offers any shortcuts here. A subclass of IndexWriter may indeed be a very quick route to this.
Perhaps. But the fact that Avro data is self-describing and highly portable is probably as important as most other considerations. There are also lots of concessions to Java in the Lucene document that are very different from the concessions to good serialization that we need.
I also wonder if flipping this around and implementing a LuceneIndexInputFormat may be another option.
Yes. Or at least a conversion system for spilling an index into SequenceFiles.

Drew Farris
added a comment - 23/Jan/10 22:10 Ted,
Thanks for the advice, I'll take a look at doing something in line with the structure you describe. It may be useful to re-use the classes introduced with Lucene's new TokenStream API, where essentially each Token is really a collection of different types of attributes. Each of the existing attributes is serializable, which suggests that there might be a short path to Writable for these; I'll see if Avro offers any shortcuts here. A subclass of IndexWriter may indeed be a very quick route to this.
I also wonder if flipping this around and implementing a LuceneIndexInputFormat may be another option.

Ted Dunning
added a comment - 23/Jan/10 01:06
Drew,
I think that what we really need is a serialized form for something a lot like a Lucene document, i.e. SequenceFile<key=docId, value=map<fieldname, list<token>>> where token is either a string or a string and a token number or a string and a token number and an offset (i.e. record(string, position, offset) with nullable fields). One really clever integration point would be to build an implementation of a Lucene IndexWriter that would emit these serialized documents.
My guess is that Avro would be a useful tool in implementing this. It should be quite easy to emit simpler forms of this from any of our current code.
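Ted's value format could be sketched as an Avro schema along these lines; the record and field names are illustrative, not an actual Mahout schema:

```json
{
  "type": "record",
  "name": "TokenizedDocument",
  "fields": [
    {"name": "docId", "type": "string"},
    {"name": "fields", "type": {
      "type": "map",
      "values": {"type": "array", "items": {
        "type": "record",
        "name": "Token",
        "fields": [
          {"name": "term", "type": "string"},
          {"name": "position", "type": ["null", "int"], "default": null},
          {"name": "offset", "type": ["null", "int"], "default": null}
        ]
      }}
    }}
  ]
}
```

The union-with-null types give the nullable position and offset fields, so the same schema covers the plain-string, string+position, and string+position+offset token variants.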

Drew Farris
added a comment - 22/Jan/10 23:34 Decoupling the tokenization logic from the ngrammer is a great idea. I think Ted sort of alluded to this in a recent post on another subject as well. Here, the input would really be a stream of tokens. It makes a great deal of sense to do the analysis offline.
I'll take a shot at providing a SequenceFile&lt;DocId, Document&gt; -> SequenceFile&lt;DocId, TokenStream&gt; utility and update this code to take input from there. I suspect it would also be handy to be able to easily string a bunch of these processing steps (document files -> SequenceFile&lt;DocId, Doc&gt; -> SequenceFile&lt;DocId, TokenStream&gt;) together into a chain, but that's a larger issue. What does anyone else think?
Also, I wanted to ask: does it make more sense to emit a SequenceFile&lt;LongWritable, Text&gt; (or Gram) instead of a text file? There has been some recent traffic on the list about this wrt line endings, and I suspect it may be more reasonable to provide the former as output (or both!?)

Zaki Rahaman
added a comment - 22/Jan/10 22:44 Thanks Drew. Some overdue feedback from testing this baby out...
My use case may differ from what's more typical, but basically I had 10MM+ semi-processed strings (lowercased, no punctuation, etc.) that I wanted to do my analysis on, not a large document or a series of smaller documents. As a result, I found using the Lucene Analyzers in the NGramCollector to be horribly inefficient (the job was taking hours to get 1% complete on my modest 7-node cluster). Instead I used some custom logic to generate the n-grams and pass them to the collector. After this modification, the entire set of jobs ran in a little less than an hour even on a larger dataset.
I know Robin mentioned something about an Ngram generator in the Bayes classifier, I should check it out.
I guess I would advocate splitting the module as it exists into a couple of different pieces:
1. A general purpose tool for tokenizing/analyzing a document/documents (or in my case, 'documents' consisting of strings). I know there are some existing tools for vectorizing text and indices, so maybe this is something similar, but the basic idea would be to go from document --> analyzed text
2+3 (These can remain as-is in the module or be split): the n-gram generator/counter and the LLR pieces.
Maybe this is all crazy talk, but it would seem to me refactoring/extracting the pieces out like this would prove useful and allow for code reuse in other modules (other NLP-type modules, elsewhere in the project).

Drew Farris
added a comment - 22/Jan/10 21:58 Thanks for the review Isabel, here's an updated patch.
CollocMapper, Line 66: If I read your implementation correctly, this means that documents are always read fully into memory, right?
Yes, you are correct. In addition to what Ted and Jake have already said, reading parts of documents is a little tricky because there would have to be some overlap to ensure that the collocations around the split were picked up properly. Perhaps there will be an opportunity to do something like this once the collocs are windowed based on sentence boundaries?
Gram, Line 192: You can omit the "else" clauses, in case the "if" already returns its result to the caller, however this is a question of style."
I prefer the form you describe, easier to read.
I was wondering, why in line 177 you did not write "this.position != other.position"?
I'm not certain about what you're referring to here, before edits, line 177 is in the middle of the write() method.
NGramCollector, Line 47 (and a few others): Shouldn't we avoid using deprecated apis instead of suppressing deprecation warnings?
I've removed the SuppressWarnings("deprecated"). The only reason they are there is that I've used the hadoop 0.19 api here and I tend to get distracted by the deprecation warnings.
This brings up a question of policy, however. Should I be using the newest hadoop api for new work in Mahout such as this, or should we continue to make efforts to retain backwards compatibility with older hadoop installations? I assumed the policy was the latter, but relish the opportunity to learn the new api – I just didn't think Mahout was ready to commit to it yet.
LLRReducer, Line 143: How about making the method package private if it should be used in unit tests only anyway?
Line 106: Would be nice to have an additional counter for the skipped grams.
done and done, great idea to use the counter to track skipped grams.

Isabel Drost-Fromm
added a comment - 21/Jan/10 17:34
I am not worried about them at this point.
Also not very worried - probably should have indicated that basically everything I found could be filed as "trivial, minor or style question only"...

Ted Dunning
added a comment - 21/Jan/10 17:12
Are we really worried about individual documents which are bigger than hundreds of MB?
I am not worried about them at this point. Even in the most skewed distributions, this would be very, very rare. How many people on LinkedIn have 100 million friends?
The quadratic cost of unwindowed collocation detection would make this infeasible first. For windowed collocation, it isn't that hard to scan the input, but reading the whole shebang isn't such a bad thing.
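The windowed scan Ted mentions can be sketched in a few lines of plain Java: each token is paired only with the tokens inside a fixed window ahead of it, so the cost is O(n · window) rather than the quadratic all-pairs cost. This is illustrative, not the patch's actual code:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of windowed co-occurrence generation: pair each token with the
 * tokens within `window` positions after it.
 */
public class WindowedPairs {
  public static List<String> cooccurringPairs(List<String> tokens, int window) {
    List<String> pairs = new ArrayList<>();
    for (int i = 0; i < tokens.size(); i++) {
      // Only look `window` tokens ahead, never the whole document.
      for (int j = i + 1; j <= i + window && j < tokens.size(); j++) {
        pairs.add(tokens.get(i) + " " + tokens.get(j));
      }
    }
    return pairs;
  }
}
```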

jakes old (non-committer) account
added a comment - 21/Jan/10 14:31 I can try and take a look at this later today / tonight.
regarding
CollocMapper, Line 66: If I read your implementation correctly, this means that documents are always read fully into memory, right? So we would assume to only run the ngramCollector over documents that fit into main memory and unable to process larger documents. I am wondering whether this is an issue at all, and if so, whether there is any way around that.
Are we really worried about individual documents which are bigger than hundreds of MB?

Isabel Drost-Fromm
added a comment - 21/Jan/10 12:45 First of all, thanks for the patch. The code looks good so far, patch applies cleanly and builds w/o problems. Some initial comments and questions I had when reading it:
CollocMapper, Line 66: If I read your implementation correctly, this means that documents are always read fully into memory, right? So we would assume to only run the ngramCollector over documents that fit into main memory and unable to process larger documents. I am wondering whether this is an issue at all, and if so, whether there is any way around that.
Gram, Line 192: You can omit the "else" clauses, in case the "if" already returns its result to the caller, however this is a question of style. I was wondering, why in line 177 you did not write "this.position != other.position"?
NGramCollector, Line 47 (and a few others): Shouldn't we avoid using deprecated apis instead of suppressing deprecation warnings?
LLRReducer, Line 143: How about making the method package private if it should be used in unit tests only anyway?
Line 106: Would be nice to have an additional counter for the skipped grams.
I agree with you that things like sentence boundary detection and more sophisticated tokenization should be left as work for an additional issue.
Jake, would be great, if you could have a closer look to verify that this is about the pipeline you had in mind in the referenced e-mail threads and mention anything that might still be missing.

Drew Farris
added a comment - 16/Jan/10 20:50 Log-likelihood collocation identifier in patch form. This puts itself in o.a.m.nlp.collocations.llr
I think there are some improvements that can be made, but if possible it would be nice to review and commit this version, then add on to it later through additional patches. More specifically, I'd like to see this:
include the ability to avoid forming collocations around sentence boundaries and other boundaries per: http://www.lucidimagination.com/search/document/d259def498803ffe/collocation_clarification#29fbb050cf5fa64
work for non-whitespace-delimited languages, e.g. anything an analyzer can produce tokens for.
I removed the ability to read in files from a directory, Robin's document -> sequence file work fits into this well.
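For reference, the LLR score at the heart of this identifier (Dunning's G² over the 2x2 contingency table of ngram counts) can be sketched in a few lines of plain Java. This is the standard formulation and may differ in detail from the patch's LLRReducer:

```java
/**
 * Sketch of Dunning's log-likelihood ratio for a 2x2 contingency table.
 * k11 = count of the collocation, k12/k21 = counts of each word appearing
 * without the other, k22 = count of everything else.
 */
public class Llr {
  // x * ln(x), with the 0 * ln(0) = 0 convention.
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    long n = k11 + k12 + k21 + k22;
    double rowEntropy = xLogX(k11 + k12) + xLogX(k21 + k22);
    double colEntropy = xLogX(k11 + k21) + xLogX(k12 + k22);
    double matEntropy = xLogX(k11) + xLogX(k12) + xLogX(k21) + xLogX(k22);
    return 2.0 * (matEntropy - rowEntropy - colEntropy + xLogX(n));
  }
}
```

Independent counts score near zero; words that co-occur far more often than chance score high, which is what the sort in the final pass ranks on.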

Drew Farris
added a comment - 14/Jan/10 05:54 Thanks for taking a look and providing some great feedback Robin.
Here's a new version that includes the following changes:
Now runs from a SequenceFile&lt;Text,Text&gt; of ID -> DOC by default. Tested on some medium-sized collections of 10k and 100k files using Robin's directory-to-sequence-file util.
Analyzer is now configurable from the command-line via the --analyzerName option
Using a Writable implementation instead of strings to move data around. No more parsing, splitting, concatenating
Improved the handling of the output directory, output from passes are written to subdirectories of this directory, so no need to specify multiple output directories any longer.
After 'mvn clean install' a sample can be run like so:
mvn -e exec:java -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" -Dexec.args="--input src/test/resources/article --output target/output -w -t"
I'd like to get this into patch form as a next step + get all of the license headers on the code here, but I'm not sure where it should live in terms of project/package names, etc. Any thoughts?
Also, I'm looking for feedback on the algorithm implementation – this version differs from the one I presented on the list in that the implementation tracks the part of the original n-gram that the sub-part appears in (head, tail). I'm not 100% sure this is necessary or even correct.
Also, it's a bummer to have to create an analyzer subclass just to provide an implementation with a no-argument constructor. Has anyone considered making use of a DI framework with Mahout? I know Grant has mentioned options such as Spring or Guice. Anyone have any strong objections to pulling one of those in as a dependency?

Robin Anil
added a comment - 11/Jan/10 05:03
Try to stick to SequenceFile&lt;Text,Text&gt; (docid => content) for input and leave it to the user to generate this file. There is a SequenceFileFromDirectory class in examples in my patch which does this conversion and writes SequenceFiles on HDFS directly.
Also, take a look at the TermCountMapper, where I have parameterized the Lucene Analyzer through the conf.
If you need to pass a tuple as input or output, check out the StringTupleWritable class instead of appending stuff to Text or splitting it.
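The conf-based Analyzer parameterization Robin describes (and the reason the no-argument constructor Drew complains about is needed) boils down to reflection on a class name stored in the job configuration. A plain-Java sketch, with a Map standing in for the Hadoop Configuration and a hypothetical key name:

```java
import java.util.Map;

/**
 * Sketch: store a class name in the configuration, instantiate it by
 * reflection in the mapper's setup. Requires a public no-arg constructor,
 * which is why analyzer subclasses without one need a wrapper.
 */
public class ConfInstantiation {
  // Hypothetical key; the real job would define its own constant.
  public static final String ANALYZER_CLASS = "analyzer.class.name";

  public static Object instantiateFromConf(Map<String, String> conf, String key) {
    String className = conf.get(key);
    try {
      return Class.forName(className).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Could not instantiate " + className, e);
    }
  }
}
```

A DI framework would remove the no-arg constructor restriction, at the cost of a new dependency; this reflection approach is what keeps the constraint in place.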