Thursday, November 3, 2011

Where I regress with my Clojure thanks to Andrew

Hello again,
this time for the "big" entry.

One promise I took was seriously reading the SICP book. I started reading it on vacation in parallel to other exciting activities activities like watching Cesarini and Thompson videos on Erlang, and starting learning a Haskell... for my great good (aficionados will understand).
Joining the reading of Michealson book with these other activities shook me in a positive way as all this functional programming stuff started to rearrange nicely in a flow of coherent concepts. Road is still long, but at least things are getting clearer.
Having read the first part of SICP (practicing the exercises in Scheme and Clojure) helped me in changing my approach of playing with higher order function, recursion, meanwhile re-learning nice mathematical stuff.
Of course I am far from mastering functional programming but it becomes more natural facing some problems to start writing something in Clojure or Scala (and soon in Erlang and Haskell I hope). This functional programming sympathy showed itself when I started watching Andrew Ng lectures on machine learning Stanford lecture CS 229), lectures kindly pointed out to me by Debashish Gosh (thank you Sir).

In addition to the Stanford video resources, Andrew Ng, also presents short lectures nicely supporting and detailing the concepts exposed during the classroom sessions. I did not have much time yet to set an Octave installation, but quickly felt a lot of frustration while watching some examples.
As a matter of fact I had a hunch, I could code adopting a functional language because of their mathematical expression so suitable in functional programming languages. I found the least square method applied to linear regression modelling demonstration so appealing that I did not resist in trying something.
After a first shot I was not satisfied with, I read the first part of SICP. As some concepts became clearer I took the problem again and rewrote it, faster and better. Naturally I am far from perfection, but at least, I can understand the code each time I come back to it. So there have been improvements :)

Here we go. I strongly recommend you make your self familiar with the least square method before starting reading the explanations in the open classroom.
Roughly speaking we are going to be provided a set of so-called training samples, we will use to fit a prediction model.
Typically a data set can be the measurements of heights of children versus their age, or prices of houses versus their areas etc... We will try to model a behavior using these training data sets in order to predict new values.
We will label as Y the data to be predicted and X the data provided as input in order to support our prediction attempt. More precisely, there can be multiple data inputs X for a given Y. Some given input data can be then detailed as X1, X2 etc.. depending on the number of inputs for given Y value. X is then a vector.
This aspect leads to very elegant matrices calculus we are not concerned with today.

We can come up with a starting hypothesis for our prediction model represented by the formula

hθ(X)

where in a linear regression approach, the expression takes the form:

hθ(X) = θ0X0 + θ1X1 +X2θ2 etc...

In vector representation, θ is also assumed a vector with components θk. Traditionally X0 is added as an additional virtual input and set to 1.
Model fitting can be achieved starting up with a cost function expression that will be minimized versus our model parameters. In essence the values predicted by our model must be as close as possible to the real sample values.
A way of formulating this difference between model and reality can be expressed as:

J(θ) =(1/2n) ∑in(hθ(Xi) - Yi)2

where n is the total number of training samples
The variable part in the formula being the θk parameters, we are going to look for converging values of these parameters. Grossly the algorithm can be summarized as:

Start with a given set of θk values, like 0 for example
keep changing the values until we have convergence

At each step of the algorithm will changed the hypothetical θk values applying the following formula:

θk: = θk - α∂θJ(θ)

Where α is called the learning rate. Naturally its value influences the speed of convergence of the process.
Regarding a linear regression approach , the last expression can be simplified into:

θk : = θk - (α/n)∑in(hθ(Xi) - Yi)Xki

I tried to keep things simple, sorry for the invasive formulas.

I tempted a rewriting of my first shot solution using both the wishful thinking advocated by Abelson and Sussman, and testing. This is to not-open the debate of understanding what you are doing versus TDD. Both served my purpose, no discussion about that.

Let's get driven by wishful thinking, at least once. My entry point function tries to replicate the generic algorithm :

In order to mimic Andrew Ng approach while dealing with the exercises, find-parameters functions take as input

The datas

the number of iterations

the learning rate

Taking the number of iterations as input greatly helps in writing the iterative process as a recursion call (Check the sicp part one for type of processes generated by procedures).
The local function improve, recures on itself in order to refine the input parameters until reaching the number of iterations.On each loop, we create a new hypothesis expression using the more recently processed parameters. So the definition of the hypothesis is abstracted in an external function.

The θ's parameters get created in the create-thetas method. By convention we always start with all parameters values set to 0.We use the number of input data in a sample in order to set the parameters.The creation functions for data samples and parameter generation where implemented using TDD. We expect the input data sets to be created using the data sample creation functions: create-sample. The driving tests:

The astute reader will notice we have prep-ended the imaginary sample value 1 (aka X0) to the input data as explained before, in the create-sample function.
Y and Xi data sets being taken from two different files, the create-samples function has been introduced so to zip data incoming from both files.

A hash map represents each sample, indexing the input values with the :input symbol, and the output with the :output symbol.
A hash map structure also represents the θ parameters, referencing in each parameter the value with a :value symbol (hurray !) and the parameter index with the :at symbol.

Good, we have data structures. Remains, the foundation of our model, the hypothesis, hθ(X). I also called bias the difference between the model predicted values and the real model value (hθ(Xi) - Yi) as it is also a recurring structure. Both the hypothesis and the bias are mathematical functions. Using a predefined (simple) set of test values I came with :

being higher order functions as returning functions taking respectively as input , the thetas values for the hypothesis and a data sample for the bias. From my point of view this is where functional programming nicely fit the algorithm modeling, abstracting the mathematical functions as basic bricks in our construction.

Going back to the find-parameters function, we now need to implement the new-parameters-from function, in charge of processing the next parameters values, extracted for a previous set of parameter values.
This time I must confess I fully decomposed the problem in wishful thinking splitting from top to down the problem in smaller pieces, testing only two of the methods, when there was doubt.
The obtained implementation

the new-parameters-from function simply maps onto the list of parameters in order to set up the next list of parameters,

the corrected-value function really does the job of altering an input parameter value.

The correction-for function then processes the correction for a given parameter without the scaling learning rate, and this using all the data samples (logical, thinking about that), delegating the task of processing the correction on a per data sample base to the data-correction function...and that's all.

This settles a personal story I had with least square method while I was a physicist. These times are buried, the least square problem is tackled too and that's nice.

For the people who were not bored, I will briefly present the content of the sample_two.clj file I used to challenge the exercise three, as it is necessary to apply a little scaling processing onto the input datas in order to proceed with a set of two inputs per sample. The solution works fine though.

2
comments:

This is great! I'll have to read your code closely. I'm also in the ML class, BTW. I have been using DrRacket rather than Clojure. It's fun and often more compact to look for a functional solution. The folks who built DrRacket strongly advocate writing test cases before one writes the code in a function. So most of my Racket files take the form ml.rkt and ml-test.rkt; I put my test cases in the latter and begin that file with (require rackunit), a unit testing module.