Thursday, December 29, 2011

This post assumes you have already installed Leiningen and are comfortable with the programmer's editor of your choice.

Starting a new project

Run lein new hello-nlp from the command line. This command creates a new directory (hello-nlp). Navigate into that new directory and open the file project.clj. You are going to see something like this:
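A minimal sketch of the generated file, assuming a Leiningen 1.x project template; the version strings and the description placeholder are assumptions and may differ on your machine:

```clojure
;; project.clj as a Leiningen 1.x template typically generates it.
;; The project version and the Clojure version below are assumed values.
(defproject hello-nlp "1.0.0-SNAPSHOT"
  :description "FIXME: write description"
  :dependencies [[org.clojure/clojure "1.3.0"]])
```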

We are going to use the Apache Foundation's OpenNLP library with the help of Lee Hinman's Clojure library interface (and this post is based on Hinman's tutorial). Searching Clojars for “opennlp” gives various results, so we picked the first (ending with 0.1.7). The information page contains everything you might want to know: the location of the GitHub repo and a short code snippet for Leiningen users, [clojure-opennlp "0.1.7"]. Copy and paste the code into project.clj as follows:
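With the dependency added, project.clj might look like the sketch below. Only the clojure-opennlp vector comes from the information page; the project version, the description and the Clojure version are assumptions carried over from a typical generated file:

```clojure
;; project.clj with the OpenNLP wrapper added as a dependency.
;; Everything except [clojure-opennlp "0.1.7"] is an assumed default.
(defproject hello-nlp "1.0.0-SNAPSHOT"
  :description "Experimenting with OpenNLP from Clojure"
  :dependencies [[org.clojure/clojure "1.3.0"]
                 [clojure-opennlp "0.1.7"]])
```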

Now your project.clj knows everything and is ready to serve you. The command lein deps downloads the dependencies (e.g. the clojure-opennlp library) and puts them on your classpath. Have a look at the lib directory in your project directory and you'll see the jar files.

The core
Now navigate into the hello-nlp/src/hello_nlp/ directory. You'll find a core.clj file there. Open it in your editor.

You'll see something like this:

To “enable” OpenNLP, modify the file:
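A minimal sketch of the modified namespace declaration, assuming the generated file contains only (ns hello-nlp.core) and that the wrapper's functions live in the opennlp.nlp namespace, as in Hinman's README:

```clojure
;; core.clj: pull the OpenNLP wrapper functions into this namespace.
;; opennlp.nlp is the namespace name used in Hinman's documentation.
(ns hello-nlp.core
  (:use opennlp.nlp))
```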

You need a few additional files. Make a models directory in hello-nlp and download the pre-trained models from here (http://opennlp.sourceforge.net/models-1.5/). In this post, we are using English models, but feel free to change to another one. You need the Sentence Detector (en-sent.bin), Tokenizer (en-token.bin) and the POS Tagger (en-pos-maxent.bin).

Now we can add user-defined functions to core.clj. In the example, we made a sentence detector (get-sentences), a tokenizer (tokenize) and a POS tagger (pos-tag) based on the downloaded models.
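The three functions can be sketched as follows. The make-* constructors are the ones described in Hinman's README; the relative paths assume the model files sit in hello-nlp/models/ and that the code runs from the project root:

```clojure
(ns hello-nlp.core
  (:use opennlp.nlp))

;; Each make-* call loads a pre-trained model file and returns a function.
;; The paths are assumptions: models/ directly under the project root.
(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize (make-tokenizer "models/en-token.bin"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
```

From a REPL started in the project directory, a call such as (pos-tag (tokenize "Clojure is fun.")) should return the tokens paired with their part-of-speech tags, provided the model files are found.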

Monday, November 28, 2011

As we expressed in our previous post, we'd like to experiment with Clojure. Let us emphasize again: we are NOT developing a new library; we just believe that using Clojure in linguistic computing might be fruitful. In order to prove this assumption (or refute it), we are going to try some tools out, and summarize and share our experiences as blog posts. Here is our tentative road-map.

Friday, November 25, 2011

We received emails from interested folks who are new to Clojure. We hope they can find enough information about setting up a convenient environment for working with us so that they can provide us feedback. Here we give them a few tips. Please share your experiences in the comments!

Installing Clojure requires some expertise, which means you should be comfortable with your operating system. The easiest way to run Clojure is to download the clojure.jar file and use the java -cp clojure.jar clojure.main command from the command line. However, this isn't the most effective way. A search engine will turn up instructions for installing Clojure on your platform. Ubuntu users can find everything in Clojure on Ubuntu. Please note that the clojure GitHub repo has moved to https://github.com/clojure/clojure and clojure-contrib has also been split into individual repos, so don't follow the description literally!

You'll also need to install Leiningen. Why? As you can read on its repo, “Working on Clojure projects with tools designed for Java can be an exercise in frustration. With Leiningen, you just write Clojure”. We are going to use Java tools, and the Clojars community repository provides us with these tools. Using Leiningen to include various Java libraries in our projects looks very tedious (have a look at the sample file), but taking some time before getting into coding can give us goodies like the Stanford parser, OpenNLP and WEKA.

We won't speak about version control, but using version control is a good housekeeping technique. If you are new to this topic and haven't committed yourself to a tool yet, have a look at git and GitHub, and read the Git Community Book.

First of all, we are NOT proposing a new framework/library here! Our main goal is to examine what Clojure offers to linguists. Although more and more linguistics departments offer courses in statistics and probability theory, the vast majority of students graduate with some background in discrete maths, mostly taught in an implicit way through a class in syntax and/or semantics (and the same is true for philosophy education). Using computer programs to test our scientific ideas is becoming a common practice in sciences, and this is true for linguists too. Stefan Th. Gries distinguishes linguistic computing from computational linguistics; following him, we think linguistic computing will become a common methodology used in the language sciences.

So, what's the difference between computational linguistics and linguistic computing? Well, there is no clear boundary! We'd say computational linguistics (or natural language processing) is a kind of applied science and engineering, and as such it is more “goal oriented”. Norvig's recent critique of Chomsky shows that commercial success is a measure of ideas, but despite the proliferation of statistical methods linguists are still doing research on rule based systems like HPSG, minimalism, etc., and new interdisciplinary research themes have emerged like Parikh's idea of the social software (and game theoretic semantics and dynamic epistemic logic, among others). But what is “pure” research today can become applied research tomorrow. To foster communication between pure and applied research, between linguistic computing and computational linguistics, we need a lingua franca.

As Clojure is the Lisp for the JVM, it is a convenient language for linguists. In the not-so-distant past, Touretzky wrote his Gentle Introduction to Symbolic Computation, an excellent book for beginners in the humanities. Gazdar and Mellish's Natural Language Processing in X (where X stands for Prolog, Lisp or Pop11) is a good introduction to finite state techniques, grammars and parsing, and it even has a chapter on question answering. We don't deny that these techniques are old, but they are still part of the well-educated linguist's body of knowledge. Also, although Norvig's PAIP is a real gem, one cannot argue against the “old” AI paradigm without seeing the past, and those ideas are still important for linguists, philosophers and cognitive scientists. Logic programming is a natural companion to functional programming. The basic techniques of computational linguistics can be expressed in logic programs, and although they have their computational limitations, these little programs have unquestionable educational value.

Porting the classics into Clojure is not a novel idea; some Google searching shows that people are turning classic Lisp books like PAIP or the Structure and Interpretation of Computer Programs into modern Clojure. The core.logic library opens up the possibility of doing the same with the Prolog literature.

The most common argument against NLTK is that you can't use mature, industry standard tools like the GATE framework, Stanford CoreNLP, and OpenNLP. Clojure's Java interoperability solves this problem. If you are into machine learning, Weka, MALLET, etc. are at your service. The Incanter package provides an R-like statistical library.
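To give a feel for what Java interoperability looks like in practice, here is a minimal sketch that uses only a class shipped with the JDK (java.text.BreakIterator) rather than any of the libraries named above. The function name split-sentences is our own, chosen for illustration:

```clojure
(import 'java.text.BreakIterator 'java.util.Locale)

(defn split-sentences
  "Splits text into sentences using the JDK's BreakIterator.
  A rule-based splitter, not a trained model like OpenNLP's."
  [text]
  (let [it (BreakIterator/getSentenceInstance Locale/ENGLISH)]
    (.setText it text)
    ;; Walk the boundaries: each [start, end) span is one sentence.
    (loop [start (.first it)
           end   (.next it)
           acc   []]
      (if (= end BreakIterator/DONE)
        acc
        (recur end (.next it) (conj acc (.trim (subs text start end))))))))
```

Calling (split-sentences "Clojure runs on the JVM. Java libraries are one import away.") returns a vector of sentence strings. The same pattern of importing a class, calling static methods with a slash and instance methods with a leading dot applies to any Java library on the classpath.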

With these tools in your hand, you can test your ideas in a language that's very close to what you learned about formal languages. Using Java libraries is like using rapid prototyping material when you are a marble sculptor. And as the end result of your work can be shared with computational linguists, you can get more feedback, and even help, from the greater community.

That's why we think that Clojure lx is an idea worth exploring. We'd like to test ourselves! Can we use Clojure to express our simple ideas? How easy is it to use Java libraries for a project? If you would like to join us, please send an email to zoltan.varju(at)gmail.com. We welcome everyone: linguists and Clojure hackers, philosophers, digital humanists, everyone who is interested!