Friday, October 29, 2010

I work in natural language processing, programming in C# (.NET 3.5) and Python. My work includes classification, named entity recognition, sentiment analysis and information extraction. Both C# and Python are great languages, but I do have some unmet needs. I investigated whether any new languages would help. I only looked at minimal languages that would be simple to learn. The 3 top contenders were: Clojure, Go and Cython. Both Clojure and Go have innovative approaches to non-locking concurrency. This is my first impression of working with these languages.

For contrast let me start by listing the features of my current languages.

C#

While many features of C# are not directly related to NLP, they are very convenient. C# has some NLP libraries: SharpNLP is a port of OpenNLP from Java, and Lucene has also been ported. The ports lag behind the Java implementations, but still give a good foundation.

Python

Python is an elegant scripting language, with a strong focus on simplicity.

NLTK is a great NLP library

Lots of open source math and science libraries

PyDev is a good development environment

Good MongoDB library

Great for rapid development

Issues

It is interpreted and not very fast

Problems with the GIL-based threading model

C# vs. Python and unmet needs

I was not sure what language I would prefer to work with. I suspected that C# would win out with all its advanced features. Due to demand for fast turnaround, I ended up doing more work in Python, and have been very happy with that choice. I have a lot of scripts that can be piped together to create new applications, with the help of the very fast and flexible MongoDB.
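A minimal sketch of what one pipeable script in this style could look like. It reads raw text lines from standard input and writes one JSON record per line to standard output, so it can sit in the middle of a shell pipeline. The function name `tokenize_stream`, the record fields and the regular-expression tokenizer are assumptions for illustration, not the actual scripts; a real stage might load the records into MongoDB instead of printing them.

```python
import json
import re
import sys

# Hypothetical tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize_stream(lines):
    """Yield one JSON string per non-empty input line."""
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # skip blank lines; line numbers still follow the input
        record = {"line": number, "tokens": TOKEN_RE.findall(text)}
        yield json.dumps(record)


if __name__ == "__main__" and not sys.stdin.isatty():
    # Only act as a filter when input is piped in, for example
    # (hypothetical file names):
    #   cat corpus.txt | python token_filter.py | python next_stage.py
    for output in tokenize_stream(sys.stdin):
        print(output)
```

Because every stage speaks line-delimited JSON, stages can be rearranged or replaced without touching the others.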

I do have some concerns about Python moving forward:

Will it scale if I get really large amounts of text?

Will speed improve on multi-core processors?

Will it work with cloud computing?

Part-of-speech tagging is slow
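One common way around the GIL for CPU-bound work like tagging is to use processes instead of threads. The sketch below assumes the documents can be tagged independently of each other. `tag_document` is a toy stand-in, not a real tagger, and all function names are hypothetical.

```python
from multiprocessing import Pool


def tag_document(text):
    """Toy stand-in for a slow, CPU-bound step such as POS tagging."""
    return [(word, "NN" if word[:1].isupper() else "X") for word in text.split()]


def tag_corpus(documents):
    """Serial baseline: tag one document after another."""
    return [tag_document(text) for text in documents]


def tag_corpus_parallel(documents, workers=2):
    """Spread documents over worker processes; each has its own interpreter and GIL."""
    with Pool(processes=workers) as pool:
        return pool.map(tag_document, documents)
```

Call `tag_corpus_parallel` from under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that spawn fresh worker processes. Whether this actually pays off depends on how the per-document work compares with the cost of sending text to the workers and results back.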

Java

Java is a modern object-oriented language. Like C# it is a programming platform:

Has most NLP libraries: OpenNLP, Mahout, Lucene, WEKA

It is fast

Great development environments: Eclipse and NetBeans

You can do almost any task in it

Great database support with JDBC and Hibernate

Many web development frameworks

Good GUI toolkits: Swing and JavaFX

Good concurrency with threading library

Issues

Functional style programming is clumsy

Working with MongoDB is clumsy

Java code is verbose

I would not hesitate using Java for NLP, but my company is not a Java shop.

Clojure

Clojure was released in 2007. It is a right-sized LISP: not very big like Common LISP, and not very small like Scheme.

Clojure is minimal in the sense that it is built on around 10 primitive programming constructs. The rest of the language is constructed with functions and macros written in Clojure.

Once I got Clojure installed it was easy to work with and program in. Most of the good features of Python also apply to Clojure: it is minimal and has batteries included. Still, I think that Python is a simpler language than Clojure.

Use case
Clojure is a good way to script the extensive Java libraries, for rapid development. It has more natural interaction with MongoDB than Java.

Clojure OpenNLP

The clojure-opennlp project is a thin Clojure wrapper around OpenNLP. It came with all the corpora used as training data for OpenNLP nicely packaged, and it works well. You can script OpenNLP about as tersely as NLTK, from an interactive REPL.

I tried it in both Eclipse and NetBeans. They seem roughly equal in number of features. I had a little better luck with the Eclipse version.

clojure-opennlp uses a Maven build system but has a nontraditional directory layout. This caused problems for both Eclipse and NetBeans; both took some configuration.

Eclipse Counterclockwise
The Counterclockwise instructions for labrepl mostly worked for installing clojure-opennlp.
When you are done, you have to go in and add the example directory to the source directories under the project properties.

NetBeans Enclojure
I imported the project. I had to move the Clojure file from the example directory to a different location to get it to work.

Maven plugins for Clojure
The standard Maven directory layout has several advantages, e.g. if you want to mix Java and Clojure code. I set up my own Maven POM configuration file, based on examples from other Clojure Maven projects. They used Clojure plugins for Maven, which I could not get to work. Eventually I ripped these plugins out and was left with a very plain POM file that worked.

Go / Golang

Go was announced in November 2009. It was created at Google to deal with multicore and networked machines. It feels like a mixture of Python and C. It is a very clean and minimal language.

It is fast

Good standard library

Excellent support for concurrency

It is trivial to write your own load balancer

Issues

The Eclipse IDE is in an early stage

Debugger is not working

The Windows port has just been released and is not finished

It was hard to find the right Go Windows port; there are several Go Windows port projects with no code.

Use cases
I currently have a problem when downloading a lot of HTML pages and parsing them to a tree structure. This does not have the best support in C#. I found a library that translates HTML to XHTML, and then I can use LINQ to process it. The library is not documented, not very fast, and fails for some HTML files.

Go comes with an HTML library that parses HTML5. It is simple to write a program with some goroutines (Go's lightweight threads) that download the files and others that parse them into a DOM tree structure.
I would use Golang for loading large amounts of text in a cloud computing environment.

Cython

Cython was released in July 2007. It is a static compiler for writing Python extension modules in a mixture of Python and C.

Process for using Cython

Start by writing normal Python code

Find modules that are too slow

Add static types

Compile it with Cython using a setup.py build script

This produces compiled modules that can be used with normal Python
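The second step, finding what is too slow, can be done with the standard library profiler before any Cython is involved. This is a minimal sketch: `count_bigrams` is a hypothetical hot loop standing in for real NLP code, and `profile_call` is a helper name invented here.

```python
import cProfile
import io
import pstats


def count_bigrams(tokens):
    """Plain Python hot loop: a hypothetical candidate for static typing in Cython."""
    counts = {}
    for first, second in zip(tokens, tokens[1:]):
        counts[(first, second)] = counts.get((first, second), 0) + 1
    return counts


def profile_call(func, *args):
    """Run func under cProfile; return its result and a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)  # show the top 5 entries
    return result, buffer.getvalue()


if __name__ == "__main__":
    tokens = "the cat sat on the mat and the cat slept".split() * 1000
    _, report = profile_call(count_bigrams, tokens)
    print(report)
```

Functions that dominate the report are the ones worth moving into a `.pyx` file and annotating with static types. The compile step is then typically a short setup.py that passes the `.pyx` file to `cythonize` from `Cython.Build`.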

Issues

It is still more complex than normal Python code

You need to know C to use it

I was surprised how simple it was to get it working under both Windows and Linux. I did not have to mess with make files or configure the compilers. Cython integrates well with NumPy and SciPy. This expands the programming tasks you can do with Python substantially.

Use cases
Speed up slow POS tagging.

My previous language experience

Over the years I have experimented with a long list of non-mainstream languages: functional, declarative, parallel, array, dynamic and hybrid languages. Many of these were frustrating experiences. I would read about a new language and get very excited. However, this would often be the chain of events:

Download the language

Install Cygwin

Find out how the language's build system works

Try to find a version of the GCC compiler that will compile it

Get the right version of Emacs installed

Try to get the debugger working under Emacs

Start programming from scratch since the libraries are sparse

Burn out

You only have so much mental capacity, and if you do not use a language you forget it. Only Scala made it into my toolbox.

Do Clojure, Go or Cython belong in your programmer's toolbox?

Clojure, Go and Cython are all simple languages. They are easy to install and easy to learn, and they all have big standard libraries so you can be productive in them right away. This is my first impression:

Clojure is a good way to script the extensive Java libraries, for rapid application development and for AI work.

Go is a great language but it is still rough around the edges. There are not any big NLP libraries written for Go yet. I would not try to use it for my main NLP tasks.

Cython was useful right away for my NLP work. It makes it possible to do fast numerical programming in Python without too much trouble.

Comments:

Just FYI: I added your blog (by the Clojure label) to Planet Clojure, so if you write more about this language, all posts will be fetched into it.

Regarding the language itself: I often use Clojure for prototyping, and the very easy work with Java libraries is useful. I'm also interested in NLP-related topics, although I'm only starting my journey into this world, so it could be interesting to see more work in this area in Clojure.
About project management: for Clojure it could be much easier to use Leiningen instead of Maven, especially for simple projects. Although, I personally use Maven to build complex projects.

I do not understand why you say you need to know C to use Cython. I use both (and know both reasonably well), but I have never had the impression that I really needed to know my C to get great effect with Cython.

I agree with Alex; use Leiningen instead of Maven. The layout of clojure-opennlp is actually a pretty conventional Clojure setup. If you have Leiningen, you can use "lein uberjar" to make a standalone jar for use in Eclipse or NetBeans; this is much easier than mucking around with Maven.

About Me

My interests are natural language processing, machine learning, programming language design, artificial intelligence and science didactic.
Author of an open source image processing project called ShapeLogic: https://github.com/sami-badawi/shapelogic-scala.
I have worked in NLP for several years, but spent many years working in the cubicles, at: Goldman Sachs with market risk, Fitch / Algorithmics with operational risk, BlackRock with mortgage backed securities, DoubleClick with Internet advertisement infrastructure, Zyrinx / Scavenger with game development. I have a master of science in mathematics and computer science from University of Copenhagen. For work I have been using these programming languages: Scala, Python, Java, C++, C, C#, F#, Mathematica, Haskell, JavaScript, TypeScript, Clojure, Perl, R, Ruby, Slang, Ab Initio (ETL), VBA. Plus many more programming languages for play.