Tuesday, July 12, 2011

I do natural language processing in C# (.NET 3.5) and Python. My work includes classification, named entity recognition, sentiment analysis and information extraction. Both C# and Python are great languages, but I have some unmet needs. I am investigating whether there are any new languages that would help.

It should be possible to make a super language that has the elegance of Python, but without these shortcomings.

My first Scala experience

In 2006 I thought Scala was this super language. It is very advanced; you can call any Java libraries from Scala, including all the open source libraries. But I ran into a list of problems with Scala:

The Scala IDE was far behind Eclipse Java

Scala is quite a complex language

The Java libraries and the functional programming libraries were badly integrated

There was no Scala REPL or interpreter like the one in Python

Scala was stable enough for use, but it did not improve my productivity, so after some months I went back to using Python as my scripting language.

Python's weakness

Recently I had to make a small text processing application that end users could use directly. This was not the best fit for Python. Normally my Python programs have no GUI and are controlled by command line parameters.

I had 2 Python options:

Make simple GUI using TkInter
TkInter is a Python wrapper of Tk, the cross platform GUI toolkit. It is pretty crude by modern GUI standards, but would have been good enough. However, trying to install all the Python libraries that I needed on the end user's machine would be setting myself up for a maintenance nightmare.

Wrap code in web application
I could wrap a web interface around it. But the application uses a lot of memory, and I would have to maintain a web application.

I had a hard 1-week deadline for the task and both of these options looked unappealing. I needed something else...

My first F# application

I took a chance on F#, and managed to learn enough F# to finish the program by my 1 week deadline.

There is no GUI builder for F# in Visual Studio, but it was pretty easy to hand code a simple WinForms GUI to wrap around the core code. It was not pretty, but you could give it to an end user. The whole application ended up being one 40KB executable file, and it was very fast. F# had actually filled a niche that Python does not fill so well.

There were also problems: I wrote the whole application from scratch, while in Python I would have been able to use NLTK, write the code faster and get better results.

All in all this was a very good experience. I thought that F# would be a good supplement to Python. It would give me both raw speed when I need it and good connectivity with C#, ASP.NET, WPF and Microsoft Office.

Functional programming benefits

Functional programming is a great fit for my NLP work.

I have a lot of different text sources: database, flat file, directory, public RESTful web application services.

I need many operations building on other operations: Bigram finder, POS tagger, named entity recognizer.

I create different reports: database, CSV, Excel.

In functional languages you can take any combination of these operations and easily pipe them together while getting good compiler support. This does not fit so well with object oriented programming, where you are more concerned with encapsulation.
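As a rough illustration of this piping style, here is a minimal sketch in Python. It is not code from the application described above: the stage names (`from_lines`, `tokenize`, `bigrams`, `to_csv`) and the `pipe` helper are hypothetical, and the tokenizer is deliberately naive. Python also does not give the compile-time checking that F# or Scala would; the type hints only stand in for it.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator, List, Tuple


def pipe(*stages: Callable) -> Callable:
    """Compose stages left to right: pipe(f, g)(x) == g(f(x))."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)


def from_lines(text: str) -> Iterator[str]:
    """Hypothetical text source: yield the non-empty lines of a string."""
    return (line for line in text.splitlines() if line.strip())


def tokenize(lines: Iterable[str]) -> Iterator[List[str]]:
    """Naive tokenizer: lowercase each line and split on whitespace."""
    return (line.lower().split() for line in lines)


def bigrams(sentences: Iterable[List[str]]) -> Iterator[Tuple[str, str]]:
    """Bigram finder: yield adjacent token pairs within each sentence."""
    for tokens in sentences:
        yield from zip(tokens, tokens[1:])


def to_csv(pairs: Iterable[Tuple[str, str]]) -> str:
    """Hypothetical report stage: one 'first,second' row per bigram."""
    return "\n".join(f"{first},{second}" for first, second in pairs)


# Source, operations and report are combined without any stage knowing the others.
bigram_report = pipe(from_lines, tokenize, bigrams, to_csv)

if __name__ == "__main__":
    print(bigram_report("The cat sat\nthe dog ran"))
```

Swapping the source for a database reader or the report for an Excel writer would only mean replacing one stage in the call to `pipe`, as long as the neighbouring stages agree on what is passed between them.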

F# impression

F# is the first compiled language I tried that is comparable to Python in simplicity and elegance. It has a real Pythonic feel:

F# is fast

Simple and elegant

Good development environment in Visual Studio 2010

Best concurrency support of any language I have seen

Good database support

Good MongoDB library

Simple to combine F# with C# or VB.NET for ASP or WPF

Good REPL

Issues

Runs best under Windows

For an IDE you really need Visual Studio 2008 or 2010, and that costs at least $700

F# can be compiled and run in the shell from SharpDevelop 4.0 and 4.1, but you do not get the same productivity

The math libraries under .NET are not as good as NumPy and SciPy

The NLP libraries are better under Python

Scala revisited

After the success with F# I was very curious about why it had gone so much better than my first experience with Scala.

I looked at an F# and Scala cheat sheet and thought they looked remarkably similar. I watched a few screencasts and found no obvious problems. I bought the book Programming in Scala, Second Edition. It turned out to be a very interesting computer science book and I read all 852 pages. Scala still looked good.

I installed the Scala Eclipse plugin and wrote some code. Both the language and the IDE have come a long way during the last 5 years:

Hadoop is the Java based open source version of MapReduce. To use Hadoop natively you have to write your code in a JVM language like Java or Scala.

Hadoop Streaming extends a limited version of Hadoop to work with programs written in other programming languages, as long as they work like UNIX pipes that read from stdin and write to stdout.
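To make the stdin/stdout contract concrete, here is a minimal word count sketch in Python. Word count is only an illustration, not a task from this post, and the names `map_lines`, `reduce_lines` and `run` are hypothetical. The sketch assumes the default Hadoop Streaming conventions: each line is a tab separated key and value, and the reducer receives its input sorted by key.

```python
import sys
from itertools import groupby
from typing import Callable, Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Mapper: emit 'word<TAB>1' for every whitespace separated token."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reduce_lines(lines: Iterable[str]) -> Iterator[str]:
    """Reducer: sum the counts per word. Assumes the input is sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run(stage: Callable[[Iterable[str]], Iterator[str]],
        source=sys.stdin, sink=sys.stdout) -> None:
    """Behave like a UNIX filter: read lines from source, write results to sink."""
    for out in stage(source):
        sink.write(out + "\n")
```

A mapper script would end with `run(map_lines)` and a reducer script with `run(reduce_lines)`. The same pair can be tried locally without Hadoop by piping the mapper output through `sort` into the reducer.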

There is a Python wrapper for Hadoop Streaming called Dumbo. Python is around 10 times slower than Java and Dumbo only gives you a limited version of Hadoop, so if you are trying to do NLP on massive amounts of data this might not solve your problems.

Scala is fast and will give you full access to run native Hadoop.

Microsoft's version of MapReduce is called Dryad, or LINQ to HPC. It is not officially released yet, but F# works well with Windows Azure.

NLP and other languages

Let me finish by giving a few short comparisons of F# and Scala with other languages:

Clojure vs. Scala

Clojure is a LISP dialect that is also running on the JVM, and it is the other big functional language running there. Clojure has some distinct niches for NLP:

Clojure better

Language understanding

Formal semantics: taking text and translating it to first-order logic

Artificial intelligence tasks

Scala better

It is easy to write fast Scala code

Smaller learning curve coming from Java

I tried Clojure recently and was very impressed; but more of my work falls in the category that would benefit from Scala.

Java vs. Scala

Java better

Better IDE tools and support

Better GUI builders

Great refactoring support

Many more programmers that know Java

Scala better

Terser code

Closures

First class functions

More expressive language

C# vs. F#

C# better

Better IDE tools and support

Better GUI builders

There are a lot more programmers that know C#

Better LINQ to SQL support

F# better

Terse code

Better support for concurrency, Async, continuations

More productive for NLP

Conclusion

F# and Scala are similar hybrid functional object oriented languages.

For years I have periodically tried functional programming languages to see if they were ready for mainstream corporate computing; and they were not. With the recent spread of functional features into object oriented languages I thought that real functional programming languages would soon be forgotten.

I was pleasantly surprised by how well F# and Scala work now. Functional languages are finally coming of age and becoming useful in mainstream corporate computing. They are stable enough, and they have niches where they are more productive than object oriented languages like C# and Java.

I really enjoy programming in F# and Scala; they are a very good fit for natural language processing and cloud computing. For bigger NLP projects I now prefer to use F# or Scala over C# or Java.

For GUI and web programming the object oriented languages still rule. Stick with C# or Java if the NLP part is small or if the GUI or web interface is the dominant part.

Java and C# are also improving, e.g. by adding more functional features. Many working programmers are well served by just waiting for Java 8 or C# 5. But functional programming is here to stay. Rejoice...

There is also an F# plug-in for MonoDevelop if you find yourself developing on Linux or Mac and want to use F#:

http://functional-variations.net/monodevelop/

Thanks for the article. This is a really great and informative comparison.

I should also point out that IKVM.NET makes it really easy to use Java libraries in your .NET projects. I have used big Java libraries a few times and been really happy with how well they integrate into my .NET apps. So, you could still use F# if you wanted without having to give up the Java libraries.

One of the reasons that I prefer the CLR is that I find it gives me access to all the great .NET libraries plus all the great Java ones. For example, I wrote a reporting module for a website a couple of years ago that used the HTML Agility Pack (a great .NET HTML parser) and Flying Saucer (a fantastic Java based CSS parser/renderer) to create on the fly PDF reports from dynamically generated HTML pages and CSS templates. I even deployed the whole thing on Linux/Apache using Mono. It worked really well.

I am obviously more of a .NET guy but I have been thinking for a while that Scala would eventually win me over to the JVM.

F# scared me away at first but lately I have been liking it more and more. So maybe the CLR has me for a while longer.

Clojure really interests me as well but I never seem to get around to it. The JVM is obviously its primary home but the CLR version looks quite good (unlike the CLR version of Scala).

What about writing your UI in C# and doing the work part in F#? This could be easily done by creating class libraries for the different parts. One of the big advantages of the .NET platform is being able to match the language to the functionality desired.

Hi Sami, you wrote that in 2006 Scala had no REPL, but you do not mention that a very useful REPL exists now.

Then, comparing Scala and F#, you credit F# with "fantastic concurrency". I don't know F#, so I've no idea where F# is better than Scala's actor approach. But you definitely should take it into account for Scala when comparing it with Java.

Then Scala's alleged "complexity" is widely discussed. Throwing that in without any further comment doesn't help that discussion, because OTOH you describe Scala as "More orthogonal, reusing the same constructs", which from some perspectives reduces complexity, e.g. compared with Java.

In the end, what I did not understand is: "For GUI and web programming the object oriented languages still rules."

For GUI I am with you. For Web I do not see the point, as there are currently so many approaches to functional web services: first of all Node.js, v8cgi and Rhino-for-webapps (all JavaScript), then regarding Scala the Unfiltered library or the http package in Scalaz, to mention just a few. As web is always semantically response = webservice(request), and as web services live from parallelism and other things where FP is strong (e.g. map/reduce algorithms), I can't see where OO shines in this area.

You are making a lot of good detailed points; and I agree with what you say. I had to keep the blog post short and readable. My focus was the great progress of FP seen from an NLP viewpoint.

Yes, functional programming should work fine for web applications. I worked with ASP.NET using Visual Studio 2008, and Microsoft has put a tremendous amount of work into making it easy to use. So my statement was more about the web tools than the merit of FP. I have not tried any of the Scala web libraries, so maybe they have caught up with ASP.NET on Visual Studio.

Good post. You have mentioned Azure for F# and Hadoop (via EC2 or one's own cloud) for Scala. Well, Google's cloud, GAE, is also on the Scala side. For the web libraries, there is the Lift web framework for Scala. Here is the Scala Lift framework on Google App Engine: http://lift-example.appspot.com/

About Me

My interests are natural language processing, machine learning, programming language design, Artificial Intelligence and science didactic.
This blog started during my work on an open source software image processing project called ShapeLogic.
I have worked in NLP for several years, but spent many years working in the cubicles, at: Goldman Sachs with market risk, Fitch / Algorithmics with operational risk, BlackRock with mortgage backed securities, DoubleClick with Internet advertisement infrastructure, Zyrinx / Scavenger with game development. I have a master of science in mathematics and computer science from University of Copenhagen. For work I have been using these programming languages: Scala, Python, Java, C++, C, C#, F#, Mathematica, Haskell, JavaScript, Perl, R, Ruby, Slang, Ab Initio (ETL), VBA. Plus many more programming languages for play.