3/23/2015

Hadoop is a very nice tool that has over the last few years become the workhorse of
the new data science industry that is gathering maturity. It does a lot of things for you, mostly shuffling
your data to the computer where it is needed for computation. To do this it has a transport mechanism taking messages across nodes constituted. The messages are constituted of two parts a routing address, which states where it should end, and a payload which is what you are transporting. This is as old as the world, and it also applies to hadoop.

I will focus in this post on the payload, as the writables are effectively the entry point to make your application
more reliable and maintainable.

The basic writable interface is formed of two very simple methods. A write in which it is the responsibility of the developer to write all the fields that need to travel to a DataOutput object and a readFields in which all fields should be read back into the object state.
The nice thing about the Data input and output interface is that it provides a read and write method for each basic type and Strings, freeing you from most of the hassle of casting and converting the bytes you would read into something usable.

There are however a few idiosyncrasies with it. At first sight the fact that you expect an end of file exception EOFException when the output has been consumed sounds like a good idea. You just put your fields in a while(true) loop and you are a happy bunny.
Until you make part of your state a collection of writables. Then you are in a pickle forever. Why? Very simple the contained type will catch the end of file exception react to it as expected and not forward it upstream since in all strict justice the exception has been consumed.
So now you have to change the behaviour of a writable depending on its location in the encapsulation. Is it the encapsulated or does it encapsulate? If encapsulated it has to forward up the call chain the exception if it encapsulates it just consumes it. This is a text book example of why its a bad idea to rely on exceptions for program flow.

When storing collections in a DataOutoutStream the simplest at most straight forwards way is to pre-pend the size of any collection to the collection elements, like that you do not have to loop forever in the loading of your data.

As per usual the code is over at github, this time I decided to use a maven as a build mechanism. Once cloned a simple mvn test will execute all the unit tests associated with the code.
Hoping you learn from this valuable code which shows clearly how not to do things and you consider doing them properly.

3/07/2015

A past Friday whose date I refuse to remember I hit a wall on my Hadoop experiments. A Reducer was dying out of memory and no matter how many of his peers came to the rescue the error was consistent.

Nooo, I refuse to work on such large data. What do you think I am, a donkey?!? - My reducers were saying.

Obvious things first I went for the two courses of action as most people not understanding a problem in software do.

Here, have more reducers to help you out. -D mapreduce.job.reduces=600

Little did I realise that if your reducer is stateless that helps little.

Here, have more memory -

As a manger would say, if more developers wont cut it I still have to throw more money at the problem so that it goes away. I have always had little empathy with such kinds but really its not their fault if they do not understand the problem and just want it to go away. Their decisions, in most cases, are still rational and full of good will, but they come from a completely different angle.

By time I started throwing memory at the problem with the almighty -D mapreduce.reduce.memory.mb=14336 and -D mapreduce.reduce.java.opts=-Xmx14g I knew that if even that worked, I was only buying time. This was still the smallest of my production datasets so a quick fix would not cut it for the larger ones.

So then I took my week-end away from the computer. I did the usual shopping and other chores, but already on Friday evening the solution was brewing at the back of my mind.

1 Problem description

What the reducer was supposed to do was to merge the collections in the state of a bunch of objects into an object of the same type but will all the data. One of those collections was in the form of a string which had to be concatenated.
Further since this is Hadoop the containing object had to be writable and be capable of flying through the whole infrastructure for subsequent processing. That meant the data had to be encoded to occupy as little space possible and the representation had to be unique, that is the same string had to have, unequivocally, the same representation.

2 Ideal theoretical solution

I started looking at hashes, but I had another constraint in my problem. The string representations have to be comparable and the comparator needs to order the strings in a given way that I want. If I had a non inversable hash function, which most are, at least it had to be monotonic, good luck with that. A collision free hash would have done the trick for the first criteria, but then I needed the second.
Now if there are any mathematicians out there with time and would like to design such a function I will gladly speak with you. Such a function or class of functions would make my life much easier.

In the absence of such a nice thing, which may well exist but I do not know about. I had to apply another strategy.

3 Brute force and Ignorance

The choice for data representation was clear it had to a lose-less compression. Nice, java.util.zip is full of goodies to do just that.

Since memory was not big enough, I went for disk. The idea was to coalesce the state of the objects into files. Concatenating in the file the state and then reading the compressed versions of the files into memory. I am lucky in that the state of the object, specially the string compresses well since it has a lot of repetition.

Now, there may be a need to coalesce coalesced objects. That is my merging of state can be done in stages.
To solve that problem I needed to be careful because most compression mechanisms need a dictionary that is they compress what they have seen not something that may potentially arrive later. I could have potentially used something like a Bloom filter, but if my strings are not preserved verbatim I need to re-think some of the theory of what I do with them and that may take yet another PhD.

4 Rules of engagement

State is stored in objects uncompressed at first. When I need to merge I dump that state to a file, UUIDs make for nice file names.
When I am done with merging I read the contents of the file through a DeflaterInputStream which hands be the compressed representation of the file as bytes. To make things transport ready I BASE64 encode it and store it in my object state, making sure I remember it is compressed.
When I next have to merge the state of objects which are compressed I write back the state to file through a InflaterInutStream which writes to file the uncompressed version of the data. I can then append all the data I want to that file, which will then be slurped back into memory and compressed in the process.

Since I have several types I play with what I described above is what happens to strings. For the longs and ints that I also use I do the same only I write them to a DataOutputStream so that I do not have to handle their string representation with toString(), valueOf() and field delimiters explicitly.

Of course you then have to use the data, and preferably with it not blowing up your memory. For the string a buffered reader way of accessing it will appear in the code base shortly, it will come to play in later parts of the processing.
For the ints and longs a DataInputStream makes them available to you, so you can consume them and as long as you do not hold on to too many of them in memory you are good.

To read the compressed versions you use the read function. To read the usable, non compressed versions, you use the readData function which hands over to you a DataInputStream.

5 Here is one I made earlier

There is no need for you to program all that, you can view, review, inspect, poke use, abuse and extend an implementation of this, it is over at github. It ships with a test class, for my own sanity and so that utilisation is documented. And all is blended with an old-school Makefile, just because.

Enjoy and let me know in the comments if you run into any problems or have a better solution.

8/04/2012

Lately I have been using linear programming, for work.
Linear programs differ much from procedural programming.
They are a way of stating an optimisation problem in a very effective way.

A linear program consist of variables, a target function, and constraints.
Both the target function and the constraints are expressed in terms of variables.
The target function is the expression which you aim to maximise or minimise in line of the constraints you specified.

It is not called linear by chance, both the target function and the constraints have to be linear i.e. of the form
\(a + b * x <=> c\)
where \(a,b\) and \(c\) are constants.
Here we only have one variable \(x\) but you can have as many as your solver can handle.
The variable can be of several types. It can be an integer, a boolean (restriction on integer with only values 0 and 1),
or a real number.
Restricting the variable to a boolean enables to describe optimisation problems which can involve mappings.

Lets say we want to map the elements of two sets \(A\) and \(B\).
That both sets and all the combinations of elements are enough so we need a computer but not enough for all the
combinations to be too many to enumerate in practice.
We also assume that we can calculate the cost of a given mapping.
That is that the cost \(c_11\) of \(a_1\) being mapped to \(b_1\) is computable in practice.
We can then define a target function as the sum of the costs of the active mappings \(target =\sum_i,j m_{ij}*c_{ij}\).
The variable \(m_{ij}\) is 0 if \(a_i\) and \(b_j\) are mapped.
We can then specify as constraints any other rules that we may want.
At one stage the constraints may become constraints about the mappings themselves.
We can very simply imaging that the costs are altered if two elements are mapped.

For the sake of the argument lets say that we want to decrease the cost by a constant \(z\) if \(a_1\) is mapped to \(b_1\) and \(a_2\)
is mapped to \(b_2\).
That is modifying the target function to be \(target =\sum_{i,j} m_{ij}*c_{ij} - m_{11} * m_{22} *z\)
This last statement would all be fine if you could multiply the two variables \(m_{11}\) and \(m_{22}\).
You need for both mappings to exist i.e. be 1 in order for \(z\) to be subtracted from the target function.
The thing with linear programs is that you can not multiply two variables, the problem definition would then cease to be linear.
We would not be able to solve the mapping with the same techniques.

Here we are then truly stuck. We can not perform an and.
We can not multiply the two boolean variables, which is another way of saying the same.
But no, there is a way of doing it.

What we need is a set of constraints that hold emulating the truth table of an and.
That is given three variables \(a\), \(b\) and \(c\).
We want \(c\) to be equal 1 only when both \(a\) and \(b\) are 1 and 0 otherwise.
Bear in mind that the variables can only take one of two values 0 or 1.
And that the value that the variable takes has always to be the maximal if we maximise, or the minimal otherwise.

We will present here the method if we are maximising.

Since all the constraints have to hold, the trick is to first find a constraint that will give us the result we want.
That is when both variables are 1 the result is also worth 1 $a + b <= 1 +c $ is such a constraint.
Pretty but there is a but.
The problem is now that for all other cases of either \(a\), \(b\) or both being 0 the constraint also holds for
\(c\) being 1. Which is not what we want.
OK, not a cool thing, but we can define other constraints which will prevent those cases from occurring.

This can be done with the help of the two following constraints.
\(c < = a\) and \(c < = b\).
This means that when \(a\) or \(b\) are 0 \(c\) has to be 1.
Technically the constrains as stated hold for \(a = 0, c = 0\) but since we are maximising \(c\) is forced to take the value 1,
in such scenario.
(NOTE: To better understand this last point look at the subtle role that the free variables play in the SIMPLEX algorithm.)
If we then combine all the constraints we have:

Constraints mimicking an and operation

$a + b < = 1 +c $

\(c < = a\)

\(c < = b\)

Which hold for the following values of \(a\), \(b\), and \(c\)

\(a = 0, b = 0, c = 0\)

\(a = 1, b = 0, c = 0\)

\(a = 0, b = 1, c = 0\)

\(a = 1, b = 1, c = 1\)

But not for:

$a = 0, b = 0, c = 1$

$a = 1, b = 0, c = 1$

$a = 0, b = 1, c = 1$

$a = 1, b = 1, c = 0$

Which is exactly how the truth table of an and looks like.
So we have found a way to emulate an and, which is the multiplication of two boolean variables in a linear program.

12/19/2011

I have recently encountered the need to iterate through arrays with a dynamic number of dimensions. The usual way of having nested for loops does not work any more since the number of for loops would need to be adapted at run time, Generating code is cool, but to do it at runtime based on a dynamic data structure in a language which is not LISP can be infeasible.
The solution we propose comes in two parts, first the vectorisation of the matrix in order to facilitate access to the elements i.e. make it more consistent. When thinking about this, Fortran all of a sudden made sense.

The second part involves one of the possible implementations of the access to the data structure.
This new representation makes for a very simple way of iterating through arrays of dynamic dimensions.
Lets first provide a bit more context to illustrate all this.
In Octave you can create data structures with any number of dimensions.
When you want to iterate through them you either know the number of dimensions ahead of time or you can use a handy operator ":" that converts to a vector any matrix. This operation is called vectorisation and as such can be found on wikipedia.

This approach is fine but if you have to do something special per row, column or any slice of the array playing with indices can get scary. A way of getting the best of both words is to calculate the index of each array position as a function of an index of each one of the dimensions. When figuring out the mapping, which is really not difficult. It then becomes apparent why Fortran has its column first strategy.
The mapping is indeed very simple, and even simpler if we apply the Fortran convention
With the help of an array $D$ which describes the dimensions of the matrix.
$D$ can be used besides for validating if the element is inside bounds. To compute quickly what is the index of the element in the vectorized form of the matrix.
For the bound check we have that for a given index vector $V$ any of its elements $v_i$ for dimension $i$ has to be strictly bound by the value of $D$ for the same dimension $i$ a.k.a $d_i$ or in other words $0\leq v_i < d_i$.
The position of an element in the vectorialised form of the array is given by the following formula. $p = \sum_i=0^|D| v_i \prod_j=1^i d_j $ as you can see the first dimension is skipped.
The cool thing about this formula is that it still holds for the row first or Cs' way of storing arrays, all you need to do there is swap $v_0$ and $v_1$ as well as $d_0$ and $d_1$.

The other method although more involved programatically yields itself better to more complex manipulations. I like to call it the dynamic hand clock, because it functions just like a clock only the number of hands and the maximum value that they reach are dynamic.
The index of the current element is stored in an array $v$ where $v_i$ stores the current position for dimension $i$. Alongside the vector $v$ we also need a vector $m$ storing the maximum valid value for each dimension. The next index $v$ is then calculated by adding 1 modulo the maximum value for that index. If a rol-over occurs the vector is incremented at position $i+1$.

It becomes then very easy to react on such rol overs at each dimension, making this method very convenient for an iterator or a delegate pattern.

6/22/2011

Since I am now going a phase of re-connecting with my past
I have decided to give this old blog some new posts.
I am back now to write code but mostly in Octave, which when integrated with strong typed languages does pose some problems. Notably how to iterate over a data structure which is completely dynamic, pretty much like the arrays in octave. And to do so without reverting to linked list or other container which may hurt performance and may not be as fun.

As part of preparing to post that solution, I wanted to experiment with AsciiMath ML since I will need it fairly soon.

The only thing preventing me from posting the solution is that I want to still figure one small part of the problem and that it is 1:35, and I have an early flight tomorrow.
But enough rambling In order to describe my solution I will need to use equations of the sort:

4/09/2006

One of the oldest most visited problems in any computer science course the Fibonacci series a very good explanation including sample algorithms can be found in the almighty Wikipedia

This program presents a slight extension in that you can calculate series formed by the sum of the k prior elements. All you have to do is change the size of the array that stores the k prior numbers.

There is just a little trick to work out in which element of the array will the nth element be stored.

;working array of only two el elements;If the array is initialised to k the program will calculate series or order k;That is f(i)=f(i-1)+...+f(i-k) when k=2 its the Fibonacci series(setq arr (make-array 2))

This is one of the typical algorithms that gains alot from being de-recursified. Like this the memory footprint is low, the computation time also, as we only add a modulus operation, and everyone is happy.

4/06/2006

I decided to name this code snippet after the french term for reversing because this algorithm does just that. It iterates through the perimeter of the array, extracts the anti-diagonal "(-1,1) direction" and reverses that list. Then it simply asigns it back to the same array.

Little idea extracted from an exercise of "Programming pearls"

The program reads:

;example of array transposition based on rotations;The method consists of the following: we reverse each of the anti-diagonal arrays;first we have to extract and re assign each of the elements