I'm looking to teach myself more about NLP. I started with NLTK and I see that it has potential to eventually become something I could get paid for.

From there, my journey continued to reading this blog post by Matthew Honnibal. So I'm interested in writing my own NLP algorithms.

The main issue I have is with licenses, specifically for training data. I don't want to ship software to clients that has any licensing issues.

I've narrowed my search down to three corpora: the Brown corpus (license), the CoNLL2000 corpus, and the portion of the Penn Treebank that NLTK provides on their downloads page. I understand that each of these downloads has a small snippet about licenses, but none of them fully describes which commercial use cases are supported. They only briefly state that the data is free for academic or learning purposes.

Perhaps it's a non-issue?

If I extract statistics from these words, is that a derivative work also under the same copyright? Perhaps I can train my new tagger on the results of an older tagger?
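
To make the question concrete, here is a minimal sketch of what "extracting statistics" might look like. It assumes a tiny hand-written tagged sample (not taken from any of the corpora above), and the helper names `tag_counts` and `most_frequent_tag` are hypothetical. The only thing kept after "training" is a table of per-word tag counts, not the text itself.

```python
from collections import Counter, defaultdict

# Hypothetical hand-written sample of (word, tag) pairs; a real run would
# read these pairs from a tagged corpus instead.
tagged_sample = [
    ("the", "DT"), ("dog", "NN"), ("runs", "VBZ"),
    ("the", "DT"), ("run", "NN"), ("run", "VB"), ("run", "NN"),
]


def tag_counts(tagged_words):
    """Count how often each tag was seen for each (lowercased) word."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return counts


def most_frequent_tag(counts, word):
    """Return the tag seen most often for `word`, or None if unseen."""
    tags = counts.get(word.lower())
    return tags.most_common(1)[0][0] if tags else None


counts = tag_counts(tagged_sample)
print(most_frequent_tag(counts, "run"))
```

Whether a table like `counts` is itself covered by the corpus license is exactly what I'm unsure about.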

I know that the CoNLL2000 corpus is derived from the Brill tagger. Since the Brill tagger is in the public domain, does that necessarily mean that all of the data it produces is also public domain?

@philshem I think part of the problem is the lack of clear licensing for any given corpus. I can find shitloads of text that is packaged as a corpus for NLP usage, but almost none of them explicitly state how they can be used in a paid project.
– Farley Knight, Jul 6 '15 at 18:20

Why not simply contact the authors of the corpora (if that's the plural) and ask if they allow commercial use? Maybe they never thought it would be of interest to non-scientists.
– Suzana, Jul 8 '15 at 0:13

1 Answer

After doing some searching, it seems that the Moby Project is in the public domain, and it includes a POS corpus. However, it's simply a dictionary, so it doesn't help with disambiguating words that have multiple POS. It's also not encoded in ASCII, so it's hard to read when opened in a text editor. It will obviously require pre-processing before it can be useful.
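
As a rough idea of the pre-processing involved, here is a minimal sketch. It rests on assumptions I have not verified against every copy of the file: that each entry is `word<delimiter>codes`, that the delimiter is the single byte 0xD7 ("×" in Latin-1), and that each letter in `codes` is one part of speech. The `POS_CODES` mapping is partial and illustrative, and `parse_moby_pos` is a hypothetical helper name.

```python
from collections import defaultdict

# Assumed delimiter between word and POS codes: byte 0xD7, "×" in Latin-1.
DELIMITER = "\u00d7"

# Partial, assumed mapping from single-letter codes to readable names.
# Unknown codes are passed through unchanged.
POS_CODES = {"N": "noun", "V": "verb", "A": "adjective", "v": "adverb"}


def parse_moby_pos(raw: bytes) -> dict:
    """Turn raw Moby-style POS bytes into {word: [pos, ...]}."""
    lexicon = defaultdict(list)
    text = raw.decode("latin-1")  # every byte is valid in Latin-1
    for line in text.splitlines():  # handles \r, \n and \r\n line endings
        if DELIMITER not in line:
            continue  # skip blank or malformed lines
        word, codes = line.rsplit(DELIMITER, 1)
        for code in codes.strip():
            lexicon[word].append(POS_CODES.get(code, code))
    return dict(lexicon)


# Two made-up entries in the assumed format, separated by carriage returns.
sample = b"run\xd7VN\rquick\xd7Av\r"
lexicon = parse_moby_pos(sample)
print(lexicon)
```

Note that the result lists every possible tag for a word without saying which one applies in a given sentence, which is the limitation mentioned above.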

According to the users in that thread, if you train a machine learning algorithm, the vectors / weights / what-have-you that come out of that training are considered a "derivative work". Hence one can download a copy of any corpus, train and tune the algorithm, and use that trained algorithm to make money.

That said, if you don't use enough training data, the algorithm could easily reproduce the same data it was given for training. The example they gave was with images. If you train an algorithm to recognize images and also generate images, and it generates images that are very similar to a copyrighted work, then it might not be covered as a "derivative work".