db.find returns documents in the collection which satisfy the query pattern
specified by the bson document kv. If you use the keyword :all, db.find will return all documents. If you're looking for all documents with a specific (key value) pair, you'd specify those as the query pattern.

In a document-oriented database like mongodb the fundamental data element is the key-value pair.
Key-value pairs in languages like javascript or python have a separate representation like:

{ key : value }

In lisp a natural enough equivalent is the association list or, alternatively, the dotted list.
Since the javascript expression also creates a key-value object, I wanted to mimic the same semantics in lisp.
In order to stay as close as possible to the javascript syntax, and also to create a distinct type
I could switch on when dispatching generic function calls, I created the two types.

To get cl-mongo to run with clisp I needed to make only one fairly minor change in the networking code.

When you send an insert to mongodb, it does not send a response. When you send a query request, you obviously do get a response back. Since I'm using the same socket connection for writes and reads, I need some mechanism in place to wait for a response when one is expected. A blocking read on the socket is out of the question; that would destroy performance.
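The idea can be sketched in Python (this is an illustration of the mechanism, not the actual cl-mongo networking code): instead of a blocking read, poll the socket with a timeout and only read once data is actually available.

```python
import select
import socket

def wait_for_response(sock, timeout=0.1, max_tries=50):
    """Poll the socket until data is available instead of issuing a
    blocking read. Returns the raw bytes, or None if nothing arrives."""
    for _ in range(max_tries):
        readable, _, _ = select.select([sock], [], [], timeout)
        if readable:
            return sock.recv(4096)
    return None

# Demo with a local socket pair standing in for the database connection.
a, b = socket.socketpair()
b.sendall(b"reply")
print(wait_for_response(a))  # b'reply'
```

For an insert, where no response is expected, you simply never call the wait; for a query you poll until the reply shows up or you give up.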

All of this advice is good, but I've been using slime with sbcl and emacs for years and I don't want to start from scratch just to add clojure. In addition, rather than hand things off to a tool like elpa, I'd like to install things myself, so I get to understand how the various pieces work together.

I'm going to show how you too can use the current cvs head for slime, and the current git repos for clojure, clojure-contrib and swank-clojure, to run clojure with slime. I'll provide a few helpful links to get more information on slime and swank. As it turns out, there's currently a bit of incompatibility between clojure and the slime package, but it's minor and easy to work around.

I revamped the way connections to a mongo database are handled in cl-mongo. In the new implementation each connection is referenced by a unique name. Each connection is stored in a connection registry. Database calls in cl-mongo default to using the :default connection. The connection parameters used by the default connection for host, port and db are accessible through the mongo-default-.. special variables.

defgeneric mongo (&key host port db name) gets the connection referred to by the :name keyword. The default for :name is :default. If no connection with that name exists, a new connection will be created.

defun mongo-show () will show all connections currently in the registry.

defgeneric mongo-swap (left right) will swap two connections. This is useful if you want to use a different default connection, or want to change the parameters on an existing named connection.

defgeneric mongo-close (name) is used to close a connection. The special keyword :all can be used to close all connections in the registry.

I want to be able to keep the documentation of cl-mongo current when I am adding new features to the package.

The original documentation was generated 'by hand': I went through the code and made sure I documented the exported classes, methods and functions. This is not really satisfactory. Ideally you want the documentation and the code tightly integrated. Javadoc-style documentation, which can be generated from source code annotations, would be ideal.

There are various lisp packages available which will pull out the source code comments for the public interface defined in the packages file. One such package is Edi Weitz's documentation-template. This package generates an HTML file with the API description based on embedded comments, which is exactly what I was looking for. The package does have its quirks/features. It doesn't seem to like embedded HTML or markdown formatting in the lisp code comments, so the API descriptions appear somewhat 'flat'. In addition, I can't define an order for the way the API components are presented, so things jump around a bit. It also hard-codes licensing information and URLs which are not appropriate for me. I dealt with this by taking the generated HTML file and searching and replacing the licensing information and URLs.

The next challenge was to integrate the HTML file generated by documentation-template with the README.md. This I accomplished by stripping out the HTML header and appending the resulting HTML file to a markdown formatted file.

The result doesn't look bad and more importantly gives me a way to easily keep my documentation up-to-date.

The purpose of this post is to claim my blog on technorati. For that to work I need to include the claim code they provided to me in this little email:

This is an automatically-generated email.

Thank you for submitting your blog claim on Technorati. Technorati will need to verify that you
are an author of the site http://www.mohegan-skunkworks.com/ by looking for a unique code.
We have just assigned the claim token 2624RGHKBMNQ to this claim.
Please visit http://technorati.com/account/ for more details, including how to use the claim token.
Thank you.

mongo is a scalable, high-performance, open source, schema-free, document-oriented database. I was introduced to mongo at the new-york mysql meetup. Two things made mongo look attractive: inter-operability and document centric storage.

I'm familiar with the elephant persistence framework in lisp. However, elephant objects are not readable (as far as I know) in languages other than lisp. That makes inter-operating with other platforms difficult. A traditional rdbms requires some sort of schema if you want to use it effectively. Mongo, on the other hand, is optimized for the kind of free-form document storage I'm looking for.

Mongo comes with a set of drivers, but a lisp driver was missing. This looked like a good project to get better acquainted with lisp and mongo.

So I set out to try to write one and the result is cl-mongo. At this stage it's close to having the capabilities I'm looking for, but cl-mongo is obviously a work in progress.

Friday I attended the 4th Machine Learning Symposium organized by the New York Academy of Sciences (NYAS).

The Symposium program consisted of four main talks given by local experts in the area of machine learning, interspersed with four graduate student talks, a poster session and a terrific lunch.

Since I'm not really hindered by any overwhelming expertise in this area, I'll confine myself to a few breezy impressions of the main talks.

The first one was given by Bob Bell from AT&T Research, a member of the team which won the Netflix prize.

What made the contest challenging was not only the huge size of the data set but also the fact that 99% of the data was missing. In addition there were significant differences between training and test data. Regardless of whether a 10% improvement of a movie rating system should be worthy of a million dollar prize, it provided a great way to test classifiers against real world data.

One thing that stood out for me was that a relatively small number of users was responsible for 'most' of the ratings. He mentioned that they identified one user responsible for 5400 ratings on one particular day. This could be a data error on the Netflix side, where the time stamp was somehow misapplied. On the other hand, it sounds like someone was trying to deliberately affect a large swath of ratings.

The final classifier incorporated breakthroughs made by different teams in the earlier stages of this multi-year competition. One such breakthrough was to consider the genres of the movies someone has previously rated when determining future recommendations. That must seem rather obvious in retrospect. The other was a clever technique called collaborative filtering, which takes into account the time-dependency of people's movie preferences.

An ensemble of previously validated classifiers was used to construct the final classifier, and the calculation to get the final result submitted to Netflix took almost a month, primarily because a power failure forced a restart of the calculation engine. In fact the use of an ensemble of classifiers was mentioned as one of the main lessons learned from the contest. The other was the power of matrix factorization (i.e. treating users and preferences as independent parameters and using a matrix to link the two) as a computational tool.

My requirements are simple: I want to be able to use the usual cast of mathematical symbols inlined in my main text, as well as format large equation blocks. In this post I'll compare and contrast inline html with the online editors provided by SITMO, CodeCogs and Textify.

To save you the trouble of having to wade through miles of text, I'll start off with my

Conclusions

Use HTML for inlining symbols and equations. Although those will never look as good in html as in Latex, the overall format of your text will suffer less.

For equation blocks you have the choice between two online editors: SITMO and CodeCogs. When it comes to embedding latex code into your html both editors are comparable. When you're looking for more variety with regard to fonts or markup languages, CodeCogs is your only alternative.

If you want to use the url link to embed a large code block you probably want to use a url shortener. TinyURL is a good option here. As an alternative, both editors can generate a png image for you to embed.

Textify has a nice clean interface, but I can't embed the links it generates. My only alternative would appear to be to run my own instance of this service, which is obviously not something I need or want to do. Furthermore, although it cleverly provides a shortened url, it uses bit.ly, which unfortunately doesn't handle complex latex urls well.

In chapter two of Hastie, Tibshirani and Friedman's 'The Elements of Statistical Learning' the authors discuss the use of least-squares regression to construct a data classifier for linearly separable data.

A set of training data together with the least-squares method is used to construct a hyper-plane in the data space. The classification of a data point depends on what side of the hyper-plane you end up on.

The example in Hastie uses two data classes in a two dimensional parameter space. I didn't grok the example immediately, and I thought it would be helpful to construct my own much simpler example by staying in one dimension and using a simple normal distribution. The rest of this post describes the details.
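A minimal sketch of the one-dimensional setup I have in mind (my own construction in Python, not the book's code): draw two classes from normal distributions centered at -2 and +2, code them as y = -1 and y = +1, fit y ≈ a + b·x by least squares, and classify a point by the sign of the fit.

```python
import random

random.seed(42)

# Two one-dimensional classes: N(-2, 1) coded as y = -1, N(+2, 1) coded as y = +1.
xs = [random.gauss(-2, 1) for _ in range(200)] + [random.gauss(2, 1) for _ in range(200)]
ys = [-1.0] * 200 + [1.0] * 200

# Closed-form simple least-squares regression of y on x.
n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

def classify(x):
    """The 'hyper-plane' in one dimension is just the point where a + b*x = 0."""
    return 1 if a + b * x > 0 else -1

boundary = -a / b
print(round(boundary, 2))  # close to 0, midway between the two class means
```

With well-separated classes the decision boundary lands near the midpoint of the two means, which is the one-dimensional analogue of the separating hyper-plane in the book's example.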

In a previous post I discussed four methods to multi-thread the Sieve of Eratosthenes in Python. I concluded that multi-threading didn't increase performance, and in fact could have a significant adverse effect. The global interpreter lock (GIL) prevents threads from running concurrently and thus limits the upside of threading. The use of locks, or avoiding the use of shared data, can then decrease performance quite a bit.

In this section I'll be using Python's multiprocessing module to 'multi-thread' the Sieve.

The multiprocessing module spawns a number of processes and distributes the calculation amongst them. There is no equivalent to the GIL, so I should be able to see some gain in performance as the number of processes increases. On the other hand, spawning processes means that there is startup overhead, which may offset any performance gain due to the distribution of the execution across multiple processes. However, I should still be able to investigate how performance scales with the number of processes, and whether the multiprocessing module is able to take advantage of multiple cores. In this post I'll discuss four approaches to distributing the Sieve algorithm, basically following the approaches I discussed earlier when using the multi-threading package. The various approaches differ in the way the load is balanced and whether the state of the sieve is shared.
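To make the no-shared-state end of that spectrum concrete, here is one possible sketch (mine, not the code from the post): a segmented sieve where each worker process receives its own chunk and a copy of the base primes, so nothing is shared between processes.

```python
from math import isqrt
from multiprocessing import Pool

def sieve(limit):
    """Plain single-process Sieve of Eratosthenes up to limit (inclusive)."""
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"
    for p in range(2, isqrt(limit) + 1):
        if flags[p]:
            # Strike out every multiple of p starting at p*p.
            flags[p * p :: p] = bytearray(len(flags[p * p :: p]))
    return [i for i, f in enumerate(flags) if f]

def primes_in_chunk(args):
    """Worker: sieve one segment [lo, hi) using the base primes.
    Each process gets its own copy of the arguments, so no state is shared."""
    lo, hi, base = args
    flags = bytearray([1]) * (hi - lo)
    for p in base:
        # First multiple of p inside the segment (never p itself).
        start = max(p * p, (lo + p - 1) // p * p)
        for m in range(start, hi, p):
            flags[m - lo] = 0
    return [lo + i for i, f in enumerate(flags) if f]

def parallel_sieve(limit, workers=4):
    base = sieve(isqrt(limit))           # primes up to sqrt(limit)
    chunk = (limit + workers) // workers
    tasks = [(lo, min(lo + chunk, limit + 1), base)
             for lo in range(2, limit + 1, chunk)]
    with Pool(workers) as pool:
        out = []
        for part in pool.map(primes_in_chunk, tasks):
            out.extend(part)
    return sorted(out)

if __name__ == "__main__":
    print(parallel_sieve(50))  # primes up to 50
```

The load balancing here is the crudest possible (equal-sized ranges); the higher segments do slightly more striking work, which is exactly the kind of imbalance the different approaches in the post vary on.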

This is the first of two posts on threading and multiprocessing in Python. In this post I'll explore the thread module, and in the second post I'll look at Python's multiprocessing module. My starting point is the multi-threaded implementation of the Sieve of Eratosthenes found in this tutorial on multi-threading in Python (pdf).

Threading a compute-bound algorithm like the Sieve consists of subdividing the main task into autonomous sub-tasks which share as little state as possible. Having no shared state eliminates the overhead that inevitably comes with locking. It turns out that Python is not very good at multi-threading compute-bound processes. This is not a surprise: CPython has a global interpreter lock (GIL) which prevents threads from running concurrently.

Regardless, there are other lessons I learned when multi-threading the Sieve algorithm. One is that sharing state between threads may be unavoidable to achieve reasonable performance. In fact, if you don't share state, performance can become predictably worse as the number of threads of execution increases.

The other is that locking can have a surprising impact on performance. It's not just the cost of locking per se, but the effect locking has on the distribution of work between the various threads.
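A sketch of the shared-state, lock-free variant (my illustration, not the tutorial's code): all threads write into one shared flags array, and since every write stores the same value (zero), overlapping writes are harmless and no lock is needed.

```python
import threading
from math import isqrt

def threaded_sieve(limit, workers=4):
    """All threads share one flags array; each thread strikes out the
    multiples of a disjoint subset of the base primes. Writes never
    conflict in effect, because every write stores 0."""
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"
    # Base primes up to sqrt(limit), found by simple trial division.
    base = [p for p in range(2, isqrt(limit) + 1)
            if all(p % q for q in range(2, p))]

    def strike(primes):
        for p in primes:
            for m in range(p * p, limit + 1, p):
                flags[m] = 0

    # Round-robin the base primes over the worker threads.
    threads = [threading.Thread(target=strike, args=(base[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [i for i, f in enumerate(flags) if f]
```

This shape avoids the cost of locking, but the work distribution is uneven (the thread holding p = 2 strikes far more cells than the thread holding the largest base prime), which is the locking-and-load-balancing interplay described above.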

Am I the only one in the world who feels that using python's getopt is a bit of a struggle? It involves a lot of boilerplate. Tedious refactoring is required each time you add or change an option. This is not specific to Python, as most languages have a similar facility to parse the command line, which is similarly annoying.
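To make the complaint concrete, here is the kind of boilerplate I mean (a generic example, not taken from any particular script). Notice that each option appears three times: in the short-option string, in the long-option list, and in the if/elif chain.

```python
import getopt

def parse(argv):
    """Typical getopt boilerplate for three options."""
    opts, args = getopt.getopt(argv, "hvo:", ["help", "verbose", "output="])
    settings = {"help": False, "verbose": False, "output": None}
    for flag, value in opts:
        if flag in ("-h", "--help"):
            settings["help"] = True
        elif flag in ("-v", "--verbose"):
            settings["verbose"] = True
        elif flag in ("-o", "--output"):
            settings["output"] = value
    return settings, args

settings, rest = parse(["-v", "--output", "out.txt", "file1"])
print(settings["output"], rest)  # out.txt ['file1']
```

Adding a fourth option means touching all three places, which is exactly the refactoring tedium described above.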

I decided to create an easier way to process command line options, by transforming the command line into an immutable (read-only) object. The result is dyn_options.

dyn_options considers every string on the command line which starts with either - or -- (i.e. a single or double dash) an option flag. The value of the option flag is a concatenation of everything that follows it, until the next flag is encountered. A simple option flag is one without explicit values and is considered a boolean flag, set to True. dyn_options creates a read-only object, with attributes and values set to the command line option flags and values respectively.

So, '--opt4 hello world' will be converted to an option flag called opt4, with a value of hello world. This makes dealing with spaces on the command line a lot easier.
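The semantics described above can be sketched like this (my illustration of the behavior, not the actual dyn_options implementation):

```python
class Options:
    """A read-only options object: attributes can be read but not rebound."""
    def __init__(self, flags):
        object.__setattr__(self, "_flags", dict(flags))

    def __getattr__(self, name):
        try:
            return self._flags[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        raise AttributeError("options are read-only")

def parse_argv(argv):
    """Any token starting with - or -- opens a flag; following tokens are
    concatenated into its value; a flag with no value becomes True."""
    flags = {}
    current = None
    for token in argv:
        if token.startswith("-"):
            current = token.lstrip("-")
            flags[current] = True
        elif current is not None:
            prev = flags[current]
            flags[current] = token if prev is True else prev + " " + token
    return Options(flags)

opts = parse_argv(["--opt4", "hello", "world", "-q"])
print(opts.opt4)  # hello world
print(opts.q)     # True
```

The concatenation rule is what makes multi-word values work without quoting, and the overridden `__setattr__` is what makes the result immutable.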

Recursive algorithms are elegant. However, if the recursion is not a tail call the growth of the stack leads to a stack overflow.

Tail call recursion is a technique whereby the last call in a recursive function does not depend on the variables pushed on the stack. In other words, the function directly returns the value of its recursive call.

Functional languages like Haskell or Lisp are designed to support the use of tail recursive algorithms. The JVM (although now the target platform of a lisp like clojure and a hybrid functional language like scala) does not support tail recursion at all. In C/C++ the compiler can in fact replace tail recursive calls with a simple loop, thereby eliminating the allocation of additional stack frames altogether. In this post I'll consider various implementations of the humble factorial to illustrate some of these things.
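CPython doesn't eliminate tail calls either, but it makes the three shapes easy to compare side by side (a sketch of my own, not the post's code):

```python
def fact_rec(n):
    """Plainly recursive: the multiply happens after the recursive call
    returns, so each call genuinely needs its own stack frame."""
    return 1 if n <= 1 else n * fact_rec(n - 1)

def fact_tail(n, acc=1):
    """Tail-recursive form: the recursive call is the last thing done and
    carries the running product in an accumulator. A TCO-capable compiler
    can turn this into a loop; CPython does not, so it still grows the stack."""
    return acc if n <= 1 else fact_tail(n - 1, acc * n)

def fact_loop(n):
    """What tail-call elimination amounts to: a simple loop over the
    accumulator, with constant stack usage."""
    acc = 1
    while n > 1:
        acc, n = acc * n, n - 1
    return acc

print(fact_rec(10), fact_tail(10), fact_loop(10))  # 3628800 3628800 3628800
```

The point of the middle version is that nothing pending remains on the stack when the recursive call is made, which is precisely the property a compiler needs to rewrite it into the third version.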

I'm writing this using a self-rolled blog engine called cl-bliky. I'm indebted to an excellent tutorial put together by Vetle Roeim. His goal was obviously to put together a compelling tutorial, and he succeeded. My goal was to use lisp in a small programming project, and developing a simple and easily customizable blog engine seemed like a good start.