November 24, 2010

Incanter, a statistical environment for Clojure - Interview with its creator David Edgar Liebke

This week we interviewed David Edgar Liebke, the creator of Incanter (a statistical and graphics environment for the JVM). David is a developer and statistician working for Clojure/core at Relevance Inc. He has a B.S. in cognitive science (UC San Diego), an M.S. in applied mathematics and statistics (Georgetown), and an M.B.A. (UC Irvine). He's got a nice blog, Data Sorcery with Clojure, and you can find him on Twitter as @liebke.

David Edgar Liebke: Yes, the lisp family of languages has historically been thought of primarily as a tool for symbolic, rather than numeric, computation. I had in fact spent a lot of time programming traditional AI systems in lisp as an undergraduate and again, years later, when I was learning how to program automated theorem provers. But lisps are extraordinarily good general-purpose programming languages, and their functional approach, combined with the interactive development style that their dynamic type systems and REPLs make possible, suits the typical data analysis workflow, which involves a great deal of non-numerical work: transforming raw data into something that can be stuffed in a matrix. I think this is why both R and Lisp-Stat have their roots in lisp. Lisp-Stat was, as is obvious from its name, implemented in a lisp, but more surprisingly R is also built on a lisp-like engine written in C.

Clojure combines the power of lisp with the enormous selection of libraries found in the Java/JVM ecosystem, including the libraries that I built Incanter on, such as the Colt numeric library from CERN and Parallel Colt, an extension that provides multiprocessor support, the JFreeChart charting library, the Processing visualization library, LaTeX and PDF rendering libraries, MongoDB libraries, MS Excel file parsers, and on and on.

In addition to the large ecosystem, Clojure has a powerful set of concurrency primitives and a growing set of parallel computation functionality that greatly reduce the pain associated with writing programs that can exploit multi-core architectures.

How seamless is the integration of those tools?

Remarkably seamless. This turns out to be one of Clojure's killer features: the ability to provide concise, dynamic access to existing Java libraries.

As reasons for creating Incanter, you cite two publications: Back to the Future, which deals with the problems of scalability and R, and the Lisp-Stat issue of the Journal of Statistical Software, which summarizes the lessons learned from the Lisp-Stat project. Why do these things matter? Why should we care about scalability?

Scalability matters because the diversity and volume of data available to analyze are growing at a phenomenal pace. The ability either to pull in data from divergent data sources or to embed your computation in the systems where this data lives will become increasingly important, and Clojure is an excellent fit for either approach.

Linguistics is facing a paradigm shift as it becomes more and more data-intensive. Bender and Good, in their white paper A Grand Challenge for Linguistics: Scaling Up and Integrating Models, argue that we should considerably scale up our databases. Most of us took the advice and learned Python and/or R and some sort of database (MySQL, though MapReduce implementations are also becoming popular). What can Clojure and Incanter offer to linguists? Why should we consider using them?

I think learning either Python or R is worthwhile; R has become the lingua franca of statistical computing, and Python's NumPy and SciPy libraries are very powerful. Language choice is frequently a function of library availability, so if what you need to do depends on functionality supported in either R or NumPy/SciPy, then those are the obvious choices.

But I think Clojure is a better general-purpose language than R and a better language for multi-core programming than Python, and it has access to a broader set of data sources than either through the libraries available within the JVM ecosystem.

I've heard examples of it being used both in standalone mode, to perform exploratory data analysis and chart generation, and as an embedded library within a larger system, to perform custom calculations and generate data visualizations.