Disclaimer: I am new to Machine Learning, and claim no expertise on the topic. I am currently reading Machine Learning in Action, and thought it would be a good learning exercise to convert the book’s samples from Python to F#.

The idea behind Decision Trees is similar to the Game of 20 Questions: construct a set of discrete Choices to identify the Class of an item. We will use the following dataset for illustration: imagine that we have 5 cards, each with a major masterpiece of contemporary cinema, classified by genre. Now I hide one - and you can ask 2 questions about the genre of the movie to identify the Thespian luminary in the lead role, in as few questions as possible:

Action

Sci-Fi

Actor

Cliffhanger

Yes

No

Stallone

Rocky

Yes

No

Stallone

Twins

No

No

Schwarzenegger

Terminator

Yes

Yes

Schwarzenegger

Total Recall

Yes

Yes

Schwarzenegger

The questions you would likely ask are:

Is this a Sci-Fi movie? If yes, Arnold is the answer, if no,

Is this an Action movie? if yes, go for Sylvester, otherwise Arnold it is.

That’s a Decision Tree in a nutshell: we traverse a Tree, asking about features, and depending on the answer, we draw a conclusion or recurse deeper into more questions. The goal today is to let the computer build the right tree from the dataset, and use that tree to classify “subjects”.

Defining a Tree

Let’s start with the end - the Tree. A common and convenient way to model Trees in F# is to use a discriminated union like this:

typeTree=|Conclusionofstring|Choiceofstring*(string*Tree)[]

A Tree is composed of either a Conclusion, described by a string, or a Choice, which is described by a string, and an Array of multiple options, each described by a string and its own Tree, “tupled”.

For instance, we can manually create a tree for our example like this:

Our tree starts with a Choice, labeled “Sci-Fi”, with 2 options in an Array, “No” or “Yes”. “Yes” leads to a Conclusion (a Leaf node), Arnold, while “No” opens another Choice, “Action”, with 2 Conclusions.

So how can we use this to Classify a “Subject”? We need to traverse down the Tree, check what branch corresponds to the Subject for the current Choice, and continue until we reach a Decision node, at what point we can return the contents of the Conclusion. To that effect, we’ll represent a “Subject” (the thing we are trying to classify) as an collection of Tuples, each Tuple being a key/value pair, representing a Feature and its value:

classify is a recursive function: given a subject and a tree, if the Tree is a Conclusion, we are done, otherwise, we retrieve the label of the next Choice, find the value of the Subject for that Choice, and use it to pick the next level of the Tree.
At that point, using the Tree to classify our subject is as simple as:

>letactor=classifytestmanualTree;;valactor:string="Schwarzenegger"

Not bad for 14 lines of code. The most painful part is the manual construction of the Tree - let’s see if we can get the computer to build that for us.

So what…s a good question?

So how could we go about constructing a “good” tree based on the data, that is, a good sequence of questions?

The first consideration here is how informative the answer to a question is. An answer is very informative if it helps us separate the dataset into very differentiated groups, groups which are different from each other and internally homogeneous. A question which helps us break the dataset into 2 groups, the first containing only “Schwarzenegger” movies, the other only “Stallone” movies, would be perfect. A question which splits the dataset into groups with a half/half mix of movies by each actor would be totally useless.

To quantify the “internally homogeneous” notion, we will use Shannon Entropy, which is defined as

Shannon Entropy formula, courtesy of Wikipedia

I won’t go into a full discussion on Entropy today (if you want to know more, you may find this older post of interest) - the short version is, it provides a measure for how “disorganized” a set is. The lower the Entropy, the more predictable the content: a set where all results are identical would have a 0 entropy, while a set where all results are equally probably will result in an entropy value that is maximal.

The second consideration is, how much information do we gain by asking one question versus the other. If we look back at our dataset, there are 2 questions we could ask first: Sci-Fi or Action. If we asked “is it an Action movie” as an opener, there is a 20% chance that we are done and can conclude “Schwarzenegger”, but in 80% of the cases, we are left with a 50/50 chance of guessing the right actor. By contrast, if we ask “is it a Sci-Fi movie” first, there is a 40% chance that we are done, with a perfect Entropy group of Schwarzenegger movies, and only a 60% chance of a not-so-useful answer.

We can quantify this - how informative is each question - by considering the expected gain in Entropy: for a question, consider the conclusion of each possible answer and compute its Entropy, and average out the Entropy of receiving each answer, weighted by the probability of obtaining that answer.

The final consideration is whether the question is informative at all, which we can translate as “is our Entropy better after we ask the question”? For instance, if we know it’s a Sci-Fi movie, asking whether it’s an Action movie brings us nothing at all: the expected Entropy after the question isn’t better than our initial Entropy. If a Question doesn’t result in a higher expected Entropy, we shouldn’t even bother to ask the Question.

Another way to consider what we just discussed is from a Decision Theory standpoint. In that context, information is considered valuable if having the information available would result in making a different decision. If for any answer to the question, the decision remains identical to what it would have been without it, the information is worthless - and if offered the choice between multiple pieces of information, one should pick the most likely to produce a better decision.

Automatic Creation of the Tree using Entropy

Let’s put these concepts into action, and figure out how to construct a good decision tree from the dataset. The outline of the algorithm is the following:

Given a dataset, consider each Feature/Question available, split the dataset according to that Feature, and compute the Entropy that would be gained by splitting on that Feature,

Pick the Feature that results in the highest Entropy gain, and split the dataset accordingly into subsets corresponding to each possible “answer”,

For each of the subsets, repeat the procedure, until no information is gained,

When no Entropy is gained by asking further questions, return the most frequent Conclusion in the subset.

On our example, this would result in the following sequence:

Sci-Fi has a better Entropy gain than Action: break into 2 subsets, Sci-Fi and non-Sci-Fi movies,

For the Sci-Fi group, do we gain Entropy by asking about Action? No , stop,

For the non-Sci-Fi group, asking about Action still produces information; after that, stop - there is no question left.

Rather than build up step-by-step, I’ll dump the whole tree-building code here and comment afterwards:

prop is a simple utility function which converts the count of items from a total in a float proportion.

inspect is a utility function which extracts useful information from a dataset. I noticed I was doing variations of the same operations everywhere, so I ended up extracting it in a function. The dataset is modeled as a Tuple, where the first element is an array of headers, representing the names of each Feature, and the second is an Array of Arrays, representing a list of observations. The feature we are classifying on is expected to be in the last column of the dataset.

h computes the Shannon Entropy of a vector (an Array of values). entropy extracts the vector we are classifying on, and computes its Shannon Entropy.

remove is a simple utility function, which removes the ith component of a vector. It is used in split, a function which takes a dataset and splits it according to feature i. split returns a Tuple, where the first element is the updated header (with the split-feature removed), and a Sequence containing each value of the feature, with the corresponding reduced dataset. For instance, splitting the movies dataset on feature 1 (“Sci-Fi”) produces the following:

Sci-Fi is now gone from the Headers, and 2 groups have been produced, corresponding to Non-Sci-Fi and Sci-Fi movies, with the corresponding reduced dataset.

splitEntropy computes the expected Entropy that would result from splitting on feature i. It splits the dataset on the value in column i (the feature we would split on), retrieves the value of the last column (the feature we are classifying on) and computes the sum of the entropies, weighted proportionally to the size of each sub-group (the probability to get that answer). selectSplit builds on it, and computes the entropy gain achieved by splitting on each of the available features (if there is any feature left), and picks the feature with maximum gain, if it is higher than the current situation.

majority is a simple utility function, which returns the most-frequent value in a dataset.

Finally, build puts all of this together and recursively builds the Decision Tree. If there is no feature to split on (no information gain left), it produces a Conclusion, containing the majority of elements in the current dataset. Otherwise, the dataset is split on the best feature, and a Choice node is created, which is populated with a Tree corresponding to the action taken for each possible outcome of the Feature we are splitting on.

That was a lot of code to go through (actually, under 100 lines, not too bad) - let’s get our reward, and try it out:

Real data

The book demonstrates the algorithm on the lenses dataset, which focuses on what contact lenses to prescribe to patients in various conditions, and can be found at the wonderful collection of test datasets maintained at the UC Irvine Machine Learning Repository. I tried it out, and got the same tree as the author. I also tested it on the Nursery dataset (I am still not 100% clear on the details of that dataset, but I have to admit I cracked up when I read the description for attribute 1: “parents: usual, pretentious, great_pret”…), which is significantly larger (12,960 records); the algorithm went through it like a champ, and produced an ungodly tree which I am not even going to try to reproduce here.

I haven’t found a good way to plot the resulting Decision Trees yet, so I’ll leave it at that - running the algorithm on these datasets is pretty straightforward.

Conclusion

The code conversion for this chapter was interesting. The part where F# really shines is the Tree representation as a Discriminated Union, which combined with pattern-matching works wonders in manipulating Trees, and seems to me cleaner than the equivalent Python code using nested dictionaries.

I spent quite a bit of time second-guessing myself on how to organize the dataset itself - one array, labels + data, headers, data and “feature of interest”, 2-d array or array of arrays… There is a lot of unpleasant grouping and array manipulations going on in the splitting / feature selection part, and I have a nagging feeling that there has to be a representation that allows for clearer transformations - and clearer return types. However, the code remains reasonably readable, and Seq.groupBy does in one line all the unique keys identification and counting that is more involved in the Python code.

The piece I have completely left out from Chapter 3 is persisting and plotting Decision Trees. I may go back to the plotting part at some later time, but my first impression is that this will be no piece of cake, and I…d rather focus on the algorithms at that point in time. If you know of a good and free F# tree-plotting library, let me know!

Finally, I decided to put this code on GitHub. It…s my first repository there, so please be patient and let me know if there are things I can do to make that repository better!