Sunday, November 6, 2011

Today we are going to work on the least square exercise number 3 from Andrew Ng Machine learning classroom.Last time, we challenged our least square method (applied to linear regression), using only single input data. The input data was composed of the ages of young pupils which height was sampled.
Today, we try to find the helpful parameter values required into the prediction of house pricing given for each house, the living area and the number of bedrooms. As exposed by Professor Ng during its lectures, the values of the areas being a 1000 times greater than the number of room some scaling of the input data must be applied.

Being formally a physicist in my youth, I would call this step a normalization.
We will scale both types of inputs by their standard deviations and set their means to zero. As a reminder, the mean of a x-data set can be expressed as:

μ =(1/n) ∑in Xi

where n is the total number of data, and Xi each data value.
The standard deviation is expressed by

σ = √(1/n) ∑in( Xi - μ)2

(Yes the little symbol behind is a square root :))
This time we must go a little bit further than using the function raw-data as in the previous exercise.
We will reuse our file-reader module from last session:

You will notice I am working in a new namespace, referencing both the file-reader and least-square namespaces from the previous blog entry. Having browsed to Andrew Ng open classroom lecture you might recognize the file names bound to the from-x-sample and from-y-sample variables.
Before going any further I know I will have to process:

mean of a data set

standard deviation of a data set

normalization of data

These tiny steps are suitable for test driven development and the Wikipedia reminding link provides us with a useful sample for tests. I propose you the following set of test functions:

This is where we can see that anonymous functions (prefixed by the '#' symbol) should not be long as the '#' and '%' symbols may quickly clutter the readability of the code. The formulas remaining simple in this situation, the anonymous functions purposes remain clear.
Our tool methods are ready.
But one minute. The raw-data function from my file-reader module helps me to produce a list of vectors for both the x-data and y-data sets:

In order to normalize the x-data, I will have collect the columns of data referring the living area values, and the column of values referring the number of rooms values. This step is easy meanwhile gathering again on the same list of vectors from the two individual sets will be a little trickier.
I am sure some kind soul will point out to me the useful method that allow to zip vectors in Clojure. After one week of reading and practicing the recursion part of my brain while reading the first part of SICP, I immediately wrote a (tail) recursive solution for the problem.
One of the beautiful aspects of Lisp languages being the easiness of developing one's adapted solution here and now for a given problem.
What we expect is:

There, we find a local scoped function calling itself recursively. The final result is accumulated in a result variable while we iterate, working on a more and more reduced tail of the vectors being processed. A typical recursive procedure generating an iterative process (check first part of SICP).
The complete file content is :

At the bottom, the load-datas function, takes in charge the reading, normalization and construction of the x input data, while the sample-two function will allow us to fire the regression using Andrew parameters: the number of iterations steps, and the learning rate.

...which is close to Andrew Ng result with a precision of 1 to 2%. This bias from the Octave/Matlab result should encourage us to work on the error amplitude estimation while working on numerical data.
In a near future I hope.

Thank you very much to the people who have kindly followed these last three articles, and thank you for the feedback.

I have to pack my things in order to come back to the muddy and smelly suburb of Paris, but now that SICP reading has been seriously started, Haskell and Erlang discovery are in progress, I won't be disturbed by that environment. As a matter of fact, I won't have to bear this situation very long as I am looking for new positions in Europe :)

Thursday, November 3, 2011

One promise I took was seriously reading the SICP book. I started reading it on vacation in parallel to other exciting activities activities like watching Cesarini and Thompson videos on Erlang, and starting learning a Haskell... for my great good (aficionados will understand).
Joining the reading of Michealson book with these other activities shook me in a positive way as all this functional programming stuff started to rearrange nicely in a flow of coherent concepts. Road is still long, but at least things are getting clearer.
Having read the first part of SICP (practicing the exercises in Scheme and Clojure) helped me in changing my approach of playing with higher order function, recursion, meanwhile re-learning nice mathematical stuff.
Of course I am far from mastering functional programming but it becomes more natural facing some problems to start writing something in Clojure or Scala (and soon in Erlang and Haskell I hope). This functional programming sympathy showed itself when I started watching Andrew Ng lectures on machine learning Stanford lecture CS 229), lectures kindly pointed out to me by Debashish Gosh (thank you Sir).

In addition to the Stanford video resources, Andrew Ng, also presents short lectures nicely supporting and detailing the concepts exposed during the classroom sessions. I did not have much time yet to set an Octave installation, but quickly felt a lot of frustration while watching some examples.
As a matter of fact I had a hunch, I could code adopting a functional language because of their mathematical expression so suitable in functional programming languages. I found the least square method applied to linear regression modelling demonstration so appealing that I did not resist in trying something.
After a first shot I was not satisfied with, I read the first part of SICP. As some concepts became clearer I took the problem again and rewrote it, faster and better. Naturally I am far from perfection, but at least, I can understand the code each time I come back to it. So there have been improvements :)

Here we go. I strongly recommend you make your self familiar with the least square method before starting reading the explanations in the open classroom.
Roughly speaking we are going to be provided a set of so-called training samples, we will use to fit a prediction model.
Typically a data set can be the measurements of heights of children versus their age, or prices of houses versus their areas etc... We will try to model a behavior using these training data sets in order to predict new values.
We will label as Y the data to be predicted and X the data provided as input in order to support our prediction attempt. More precisely, there can be multiple data inputs X for a given Y. Some given input data can be then detailed as X1, X2 etc.. depending on the number of inputs for given Y value. X is then a vector.
This aspect leads to very elegant matrices calculus we are not concerned with today.

We can come up with a starting hypothesis for our prediction model represented by the formula

hθ(X)

where in a linear regression approach, the expression takes the form:

hθ(X) = θ0X0 + θ1X1 +X2θ2 etc...

In vector representation, θ is also assumed a vector with components θk. Traditionally X0 is added as an additional virtual input and set to 1.
Model fitting can be achieved starting up with a cost function expression that will be minimized versus our model parameters. In essence the values predicted by our model must be as close as possible to the real sample values.
A way of formulating this difference between model and reality can be expressed as:

J(θ) =(1/2n) ∑in(hθ(Xi) - Yi)2

where n is the total number of training samples
The variable part in the formula being the θk parameters, we are going to look for converging values of these parameters. Grossly the algorithm can be summarized as:

Start with a given set of θk values, like 0 for example
keep changing the values until we have convergence

At each step of the algorithm will changed the hypothetical θk values applying the following formula:

θk: = θk - α∂θJ(θ)

Where α is called the learning rate. Naturally its value influences the speed of convergence of the process.
Regarding a linear regression approach , the last expression can be simplified into:

θk : = θk - (α/n)∑in(hθ(Xi) - Yi)Xki

I tried to keep things simple, sorry for the invasive formulas.

I tempted a rewriting of my first shot solution using both the wishful thinking advocated by Abelson and Sussman, and testing. This is to not-open the debate of understanding what you are doing versus TDD. Both served my purpose, no discussion about that.

Let's get driven by wishful thinking, at least once. My entry point function tries to replicate the generic algorithm :

In order to mimic Andrew Ng approach while dealing with the exercises, find-parameters functions take as input

The datas

the number of iterations

the learning rate

Taking the number of iterations as input greatly helps in writing the iterative process as a recursion call (Check the sicp part one for type of processes generated by procedures).
The local function improve, recures on itself in order to refine the input parameters until reaching the number of iterations.On each loop, we create a new hypothesis expression using the more recently processed parameters. So the definition of the hypothesis is abstracted in an external function.

The θ's parameters get created in the create-thetas method. By convention we always start with all parameters values set to 0.We use the number of input data in a sample in order to set the parameters.The creation functions for data samples and parameter generation where implemented using TDD. We expect the input data sets to be created using the data sample creation functions: create-sample. The driving tests:

The astute reader will notice we have prep-ended the imaginary sample value 1 (aka X0) to the input data as explained before, in the create-sample function.
Y and Xi data sets being taken from two different files, the create-samples function has been introduced so to zip data incoming from both files.

A hash map represents each sample, indexing the input values with the :input symbol, and the output with the :output symbol.
A hash map structure also represents the θ parameters, referencing in each parameter the value with a :value symbol (hurray !) and the parameter index with the :at symbol.

Good, we have data structures. Remains, the foundation of our model, the hypothesis, hθ(X). I also called bias the difference between the model predicted values and the real model value (hθ(Xi) - Yi) as it is also a recurring structure. Both the hypothesis and the bias are mathematical functions. Using a predefined (simple) set of test values I came with :

being higher order functions as returning functions taking respectively as input , the thetas values for the hypothesis and a data sample for the bias. From my point of view this is where functional programming nicely fit the algorithm modeling, abstracting the mathematical functions as basic bricks in our construction.

Going back to the find-parameters function, we now need to implement the new-parameters-from function, in charge of processing the next parameters values, extracted for a previous set of parameter values.
This time I must confess I fully decomposed the problem in wishful thinking splitting from top to down the problem in smaller pieces, testing only two of the methods, when there was doubt.
The obtained implementation

the new-parameters-from function simply maps onto the list of parameters in order to set up the next list of parameters,

the corrected-value function really does the job of altering an input parameter value.

The correction-for function then processes the correction for a given parameter without the scaling learning rate, and this using all the data samples (logical, thinking about that), delegating the task of processing the correction on a per data sample base to the data-correction function...and that's all.

This settles a personal story I had with least square method while I was a physicist. These times are buried, the least square problem is tackled too and that's nice.

For the people who were not bored, I will briefly present the content of the sample_two.clj file I used to challenge the exercise three, as it is necessary to apply a little scaling processing onto the input datas in order to proceed with a set of two inputs per sample. The solution works fine though.

Wednesday, November 2, 2011

Once again a very (very) short one in... Clojure this time. As a matter of fact this is a short one before the longer one up to come :).

In order to challenge the code displayed in the upcoming post, I had again this problem of reading data from a file before converting it. Thanks to an astute reader in a previous post, I dropped the use of the clojure contrib 1.2 while using the 1.3 version of Clojure.As I did not took the time to explore the new Clojure contributions I found my self a little bit upset while needing to load data from a file. The older Clojure contributions library hosted a suitable read-lines method in the clojure.contrib.io module. And I liked it.

Fortunately LISP languages always invite you in a nice way to develop your tools and build upon them. Of course I could not resist. At that point the do-not-reinvent-the-wheel conservationist's will have left the building.Thank you to the others for staying :) The following harsh test renders what I needed:

The read-lines takes as input the exact file location expression and returns a sequence (lazy if possible).
Having explored lazy sequences in a previous post , I came to the working following solution:

One will recognize the invocations of the BufferedReader, FileReader, and File java constructors recognizable by the appended "." at the end of the class name. The imports have been achieved through the use of the :import form at the top of the file.

The function body hosts a recursive inner function flush-content, in charge of

taking the decision to stop the recursion when no more line has been found

recur again using the readLine method from the BufferedReader.

each new found line becomes an element of the lazy sequence using the standard cons form.
The lazy sequence end clause is reached when no more line has been found. Works nicely for small files.