Using the ALGLIB random forest with F#

25 Sep 2016

The intent of this post is primarily practical. During the Kaggle Home Depot competition, we ended up using the Random Forest implementation of ALGLIB, which worked quite well for us. Taylor Wood did all the work figuring out how to use it, and I wanted to document some of its aspects, as a reminder to myself, and to provide a starting point for others who might need a Random Forest from F#.

The other reason I wanted to do this: I have been quite interested lately in the idea of developing a DSL to specify a machine learning model, which could be fed to various algorithm implementations via simple adapters. In that context, I thought taking a look at ALGLIB and how they approached data modelling could be useful.

I won’t discuss the Random Forest algorithm itself; my goal here will be to “just use it”. To do this, I will be using the Titanic dataset from the Kaggle “Machine Learning from Disaster” competition. I like that dataset because it’s not too big, but it hits many interesting problems: missing data, features of different types, … I will be using it two ways: for classification (as is usually the case), but also for regression.

Let’s dive into the ALGLIB random forest. The library is available as a NuGet package, alglibnet2. To use it, simply reference the assembly with #r @"alglibnet2/lib/alglibnet2.dll"; you can then immediately train a random forest using the alglib.dfbuildrandomdecisionforest method - no need to open any namespace. The training method comes in 2 flavors, alglib.dfbuildrandomdecisionforest and alglib.dfbuildrandomdecisionforestx1. The first is a specialization of the second, which takes an additional argument; I’ll therefore work with the second, more general version.

Signature of dfbuildrandomdecisionforestx1

The signature of dfbuildrandomdecisionforestx1 is the following:

```fsharp
let info, forest, report =
    alglib.dfbuildrandomdecisionforestx1(
        trainingset,        // training data
        samplesize,         // how many observations
        features,           // how many features/variables
        classes,            // how many classes; 1 represents regression
        trees,              // how many trees to build; recommended: 50 to 100
        featuresincluded,   // how many features retained when splitting
        learningproportion  // how much of the sample to use for each tree. Recommended: 0.05 (high noise) to 0.66 (low noise)
        )
```

We’ll illustrate shortly how the inputs should be prepared. The function produces 3 outputs:

info: an integer return code. 1 signals success, -2 or -1 are supposed to signal issues (more on that in a second).

forest: a random forest (alglib.decisionforest), which can be used to produce predictions.

report: an alglib.dfreport that contains various quality metrics.

The Titanic dataset

The dataset (which you can download from here) we will use comes as a CSV file, “titanic.csv”, which contains the standard Kaggle columns: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.

trees represents how many trees we want in the forest - in other words, how much training we do. The documentation recommends 50 to 100. From experience, higher values are possible, but so is an OutOfMemoryException :)

featuresincluded: this is the extra argument compared to alglib.dfbuildrandomdecisionforest. It drives how many of the available features are randomly selected at each split; the simpler version handles this automatically.

learningproportion: this is a tuning parameter, the documentation recommends values between 0.05 (for very noisy datasets) and 0.66 (for clean datasets). This determines how much of the training set is used for each tree, and lower values should help prevent over-fitting.
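Pulling these together, a regression setup could look like the following sketch. I am assuming here that trainingset has already been built as a float [,], with 3 feature columns followed by the value we want to predict (the fare); the feature count and tuning values are illustrative, not the one true configuration:

```fsharp
// illustrative values; trainingset is assumed to be a float [,],
// each row holding 3 features followed by the fare we regress on
let samplesize = trainingset.GetUpperBound(0)
let features = 3
let classes = 1   // 1 signals regression
let trees = 50
let featuresincluded = 2
let learningproportion = 0.66

let info, forest, report =
    alglib.dfbuildrandomdecisionforestx1(
        trainingset,
        samplesize,
        features,
        classes,
        trees,
        featuresincluded,
        learningproportion)
```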

Running the regression

Running the regression produces… well, on my machine, with the current setup, the computation never returns. If I change the learningproportion to 0.1 instead, I get this:

```
alglib+alglibexception: Exception of type 'alglib+alglibexception' was thrown.
   at alglib.dforest.dfsplitr(Double[]& x, Double[]& y, Int32 n, Int32 flags, Int32& info, Double& threshold, Double& e, Double[]& sortrbuf, Double[]& sortrbuf2)
// more stack trace from hell
```

So much for using error codes. My experience with the library has been that if there is something wrong with the input, it will either explode or never return. Perhaps I am doing something wrong?

The 2 issues you may hit when preparing the data are:

invalid indexing: for instance, setting samplesize to a value larger than the actual number of rows in the training set will result in a System.IndexOutOfRangeException: Index was outside the bounds of the array.

missing data / nan: this is the problem we are hitting here. The training set is expected to be a float [,], but if it contains nan values, for either input or output, you’ll run into problems.

Side-note: is there a more elegant way to check if a float is a “normal number”?
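For reference, the clean-up I am describing can be sketched along these lines; I am assuming the data is available as rows of float [] before being flattened into the float [,] ALGLIB expects:

```fsharp
open System

// a float is "usable" if it is neither NaN nor infinite
let isValid (x: float) =
    not (Double.IsNaN x || Double.IsInfinity x)

// keep only the rows where every value is usable,
// then flatten them into the float [,] ALGLIB expects
let clean (rows: float [] []) =
    rows
    |> Array.filter (Array.forall isValid)
    |> array2D
```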

This eliminates every row that contains one or more invalid inputs, and alglib.dfbuildrandomdecisionforestx1 now runs like a champ. The info flag is 1, signaling success. The report results are as follows:

You get the expected metrics in the report (average error, root mean square error, …), in two flavors. The values prefixed with oob indicate out-of-bag, and I suspect the other ones are on data that has been used for training (that is, the complement of out-of-bag). I am not 100% sure about this one. In general, out-of-bag is the better indicator for what performance you should expect from your model when using it on new data points.

Generating predictions

You can now use the forest to generate predictions, by calling alglib.dfprocess. dfprocess expects a forest and a vector of input values, and computes the output value. The output is returned by reference, and is not a float, but a float [] (more on this when we discuss classification later). In our case, our model has 3 features / variables, so we should pass in a float [] of size 3.
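As a sketch, wrapping this into a predict function (the wrapper and its name are my own) could look like this, which matches how predict is used in the snippets that follow:

```fsharp
// hypothetical wrapper: for a regression forest, dfprocess fills
// the output array with a single value, the prediction
let predict (input: float []) =
    let mutable output = Array.empty<float>
    alglib.dfprocess (forest, input, &output)
    output.[0]
```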

Interestingly, while training doesn’t like missing values, dfprocess seems to deal with them quite well:

```fsharp
predict [| Double.NaN; Double.NaN; Double.NaN |]
> val it : float = 183.445
```

I tried out a couple of variants (predict [| 30.0; 0.0; Double.NaN |], predict [| Double.NaN; 0.0; 1.0 |]), and in each case got a different prediction. I assume ALGLIB is picking up the most likely value when the input is missing, but I don’t know for sure what the algorithm is doing there.

Categorical and Ordinal input

So far, we have used only input values that were numerical. However, one of the nice properties of random forests is that they are quite flexible, and can handle virtually any type of input.

Let’s try to incorporate sex, and the port of embarkation - Southampton, Cherbourg or Queenstown. ALGLIB has a very good description of how they encode variables. Categorical (or, in their parlance, Nominal) variables are encoded either as:

0 or 1 for variables with 2 states,

“1-of-N” for variables with 3 states or more.

So incorporating sex would simply entail adding a column with 0.0 or 1.0 values for either case, and encoding the port of embarkation would use a 3-state vector, [1.0;0.0;0.0] for Southampton, [0.0;1.0;0.0] for Cherbourg, and [0.0;0.0;1.0] for Queenstown.

This ignores the possibility of missing data, however. We can take 3 strategies here (as well as for numerical values):

we do not think missing data conveys useful information, and filter it out as we did,

we think missing values convey useful information, in which case we can simply add another state. For instance, port of embarkation would take 4 states, the 4th one being “unknown port of embarkation”, represented as [0.0;0.0;0.0;1.0],

we can attempt to replace missing values by “reasonable ones”. In general I tend to dislike making up data, but at the same time, in the case of a dataset where all rows contain mostly good data with some missing, we would end up discarding a lot of rows, which can be a problem.

We could for instance model our data like this, without making any attempt at elegance:
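Here is one way this could be sketched out, with sex as a 0/1 column and the port of embarkation encoded as 4 states, the last one for missing values; the encoding functions and the raw string values ("male", "S", "C", "Q") are assumptions on my part:

```fsharp
// hedged sketch: turn a passenger's sex and port of embarkation
// into the flat floats ALGLIB expects
let encodeSex (sex: string) =
    match sex with
    | "male" -> 0.0
    | _      -> 1.0

let encodePort (port: string) =
    match port with
    | "S" -> [ 1.0; 0.0; 0.0; 0.0 ] // Southampton
    | "C" -> [ 0.0; 1.0; 0.0; 0.0 ] // Cherbourg
    | "Q" -> [ 0.0; 0.0; 1.0; 0.0 ] // Queenstown
    | _   -> [ 0.0; 0.0; 0.0; 1.0 ] // unknown port
```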

The out-of-bag RMSE dropped from 57.16 to 56.29. Looks like these features are not very helpful…

One last thing worth considering is ordinal values. A good example on this dataset is Class. Class is not quite a numerical value (how far apart they are is meaningless), but the order matters: first class is (in some sense) greater than second, which itself is greater than third.

Both encodings - as a Categorical, or as a Numerical - are valid. One possible benefit of representing Class as Numerical is that it can implicitly create “groupings”. Because 1 < 2 < 3, it would make sense to lump together “1 and 2” vs. “3”, or “1” vs. “2 and 3”, which is how continuous values are handled in a tree, dividing them by segments.
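Concretely, the two encodings for Class would look something like this (a sketch, with functions of my own naming):

```fsharp
// ordinal encoding: a single column, keeping the order 1.0 < 2.0 < 3.0
let asNumerical (pclass: int) = [ float pclass ]

// categorical encoding: one column per class, order is lost
let asCategorical (pclass: int) =
    match pclass with
    | 1 -> [ 1.0; 0.0; 0.0 ]
    | 2 -> [ 0.0; 1.0; 0.0 ]
    | _ -> [ 0.0; 0.0; 1.0 ]
```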

Classification

Let’s try now to use the random forest as a classifier. The only differences here are with the last column in the training set, which will now contain the “index” of the class, and the form of the output.

Let’s begin with a classic exercise, and predict who survives on the Titanic.

We simply encode survival as 1.0 or 0.0; all we need to do then is change classes to 2 (we have 2 cases) and run the model:

```fsharp
let samplesize = trainingset.GetUpperBound(0)
let features = 8
let classes = 2 // classification
let trees = 10
let featuresincluded = 4
let learningproportion = 0.5

let info, forest, report =
    alglib.dfbuildrandomdecisionforestx1(
        trainingset,        // training data
        samplesize,         // how many observations
        features,           // how many features/variables
        classes,            // how many classes; 1 represents regression
        trees,              // how many trees to build; recommended: 50 to 100
        featuresincluded,   // how many features retained when splitting
        learningproportion  // how much of the sample to use for each tree
        )
```
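The prediction side changes accordingly: with 2 classes, alglib.dfprocess fills the output array with one probability per class. Something along these lines (the wrapper is my own):

```fsharp
// for a classifier, the output array holds one probability per class
let predictSurvival (passenger: float []) =
    let mutable output = Array.empty<float>
    alglib.dfprocess (forest, passenger, &output)
    output // e.g. [| 0.7; 0.3 |]: output.[i] is the probability of class i
```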

Instead of a bare-bones class prediction, we get a full probability distribution over the possible outcomes: a 70% chance of not making it, and 30% of surviving. This is quite nice (for us, not for that hypothetical passenger, obviously).

Similarly, we could try, say, to predict the port of embarkation. In this case, we have 3 classes (Southampton, Cherbourg or Queenstown). Without any attempt at elegance, let’s encode this, creating values 0, 1 and 2 for each case, and changing classes to 3:

That person most likely embarked in Cherbourg, with 60% chance, or in Southampton, with 40% chance.

Note: if the classes do not match the number of cases in the last column, alglib.dfbuildrandomdecisionforestx1 will return a flag of -2.

Parting thoughts

In my opinion, in spite of some quirks, the ALGLIB random forest is quite nice, and potentially very useful. What I like about it is that it is a full-fledged random forest: an extremely versatile algorithm which, in my experience, “always works”. Other algorithms will potentially give you better results, but a random forest is fast, easy to set up, produces decent predictions, and handles both regression and classification problems with minimal effort, incorporating data in all shapes and forms.

The quirky parts are around the API. I would have expected the function alglib.dfbuildrandomdecisionforestx1 to always return, indicating with return codes if something went wrong. This is obviously not the case; I might be misunderstanding some aspects, and would love to hear from you if you know something about this.

The way alglib.dfprocess uses byref to produce outputs is a bit unsettling, and some of the choices around the alglib.dfbuildrandomdecisionforestx1 function signature are a bit odd to me. Why do I need to pass the size of the training set, when it can be computed from the data we are passing in? Similarly, why do I need to specify how many variables are used? The documentation hints at the possibility of having more than one column for regression outputs, but I had no success with that.

Still - these are details. I’ll take the quirks, for a library that does what I want, and there are things I like about the modelling choices. Getting a full distribution over the possible classification outputs instead of a single prediction is nice: even though in both cases the most likely output is the same, it is quite different to know that the model thinks a particular outcome has a 99.9% chance of happening, vs. only 50.1%.

That’s it - hope you got something out of this guided tour of the ALGLIB random forest.