Discretizing a continuous variable using Entropy

26 May 2013

I got interested in the following question lately: given a data set of examples with some continuous-valued features and discrete classes, what’s a good way to reduce the continuous features into a set of discrete values?

What makes this question interesting? One very specific reason is that some machine learning algorithms, like Decision Trees, require discrete features. As a result, potentially informative data has to be discarded. For example, consider the Titanic dataset: we know the age of passengers of the Titanic, or how much they paid for their ticket. To use these features, we would need to reduce them to a set of states, like “Old/Young” or “Cheap/Medium/Expensive” – but how can we determine what states are appropriate, and what values separate them?

More generally, it’s easier to reason about a handful of cases than a continuous variable – and it’s also more convenient computationally to represent information as a finite set of states.

So how could we go about identifying a reasonable way to partition a continuous variable into a handful of informative, representative states?

In the context of a classification problem, what we are interested in is whether the states provide information with respect to the Classes we are trying to recognize. As far as I can tell from my cursory review of what’s out there, the main approaches use either Chi-Square tests or Entropy to achieve that goal. I’ll leave aside Chi-Square based approaches for today, and look into the Recursive Minimal Entropy Partitioning algorithm proposed by Fayyad & Irani in 1993.

The algorithm idea

The algorithm hinges on two key ideas:

Data should be split into intervals that maximize the information, measured by Entropy,

Partitioning should not be too fine-grained, to avoid over-fitting.

The first part is classic: given a data set, split it in two halves, based on whether the continuous value is above or below the “splitting value”, and compute the gain in entropy. Out of all possible splitting values, take the one that generates the best gain – and repeat in a recursive fashion.

Let’s illustrate on an artificial example – our output can take 2 values, Yes or No, and we have one continuous-valued feature:

| Continuous Feature | Output Class |
| --- | --- |
| 1.0 | Yes |
| 1.0 | Yes |
| 2.0 | No |
| 3.0 | Yes |
| 3.0 | No |

The Continuous Feature takes 3 values: 1.0, 2.0 and 3.0, which leaves us with 2 possible splits: strictly less than 2, or strictly less than 3. Suppose we split on 2.0 – we would get 2 groups. Group 1 contains Examples where the Feature is less than 2:

| Continuous Feature | Output Class |
| --- | --- |
| 1.0 | Yes |
| 1.0 | Yes |

The Entropy of Group 1 is H(g1) = - 1.0 x Log(1.0) = 0.0
Group 2 contains the rest of the examples:

| Continuous Feature | Output Class |
| --- | --- |
| 2.0 | No |
| 3.0 | Yes |
| 3.0 | No |

The Entropy of Group 2 is H(g2) = - 1/3 x Log(1/3) - 2/3 x Log(2/3) = 0.63

Partitioning on 2.0 gives us a gain of H – 2/5 x H(g1) – 3/5 x H(g2) = 0.67 – 0.4 x 0.0 – 0.6 x 0.63 = 0.29. That split gives us additional information on the output, which seems intuitively correct, as one of the groups is now formed purely of “Yes”. In a similar fashion, we can compute the information gain of splitting around the other possible value, 3.0, which would give us a gain of 0.67 – 0.6 x 0.63 – 0.4 x 0.69 = 0.01: that split barely improves information, so we would use the first split (or, more generally, out of all splits with positive gain, we would take the split leading to the largest gain).
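To double-check the arithmetic, here is a quick sketch in Python (the post’s implementation is in F#), using the five examples above and natural logarithms, which is what produces the 0.67 / 0.63 entropy figures:

```python
from math import log

def entropy(labels):
    """Shannon entropy (natural log) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log(labels.count(c) / n)
                for c in set(labels))

# the five examples: (continuous feature, class)
data = [(1.0, "Yes"), (1.0, "Yes"), (2.0, "No"), (3.0, "Yes"), (3.0, "No")]

def gain(data, value):
    """Entropy gain of splitting strictly below / at-or-above `value`."""
    left = [c for v, c in data if v < value]
    right = [c for v, c in data if v >= value]
    h = entropy([c for _, c in data])
    return (h - (len(left) / len(data)) * entropy(left)
              - (len(right) / len(data)) * entropy(right))

print(round(gain(data, 2.0), 2))  # 0.29
print(round(gain(data, 3.0), 2))  # 0.01
```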

So why not just recursively apply that procedure, and split our dataset until we cannot achieve information gain by splitting further? The issue is that we might end up with an artificially fine-grained partition, over-fitting the data.

As an illustration, consider the following contrived example:

| Continuous Feature | Output Class |
| --- | --- |
| 1.0 | Yes |
| 2.0 | No |
| 3.0 | Yes |
| 4.0 | No |

From a “human perspective”, the Continuous Feature looks fairly uninformative. However, if we apply our recursive split, we’ll end up doing something like this (hope the notation is understandable):

{ 1.0, Yes; 2.0, No; 3.0, Yes; 4.0, No }
→ { 1.0, Yes } + { 2.0, No; 3.0, Yes; 4.0, No }
→ { 1.0, Yes } + { 2.0, No } + { 3.0, Yes; 4.0, No }
→ { 1.0, Yes } + { 2.0, No } + { 3.0, Yes } + { 4.0, No }

At every step, extracting a single Example increases our information, and the final result has a clear over-fitting problem, with each Example forming its own group.
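As a quick check of that claim, here is a naive recursion sketched in Python (the post’s code is F#), with no stopping rule other than positive gain – on the contrived dataset it does indeed isolate every example:

```python
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log(labels.count(c) / n)
                for c in set(labels))

def naive_partition(data):
    """Recursively split wherever the best positive entropy gain exists."""
    h = entropy([c for _, c in data])
    best = None
    for value in sorted({v for v, _ in data})[1:]:
        left = [(v, c) for v, c in data if v < value]
        right = [(v, c) for v, c in data if v >= value]
        gain = (h - (len(left) / len(data)) * entropy([c for _, c in left])
                  - (len(right) / len(data)) * entropy([c for _, c in right]))
        if gain > 1e-9 and (best is None or gain > best[0]):
            best = (gain, left, right)
    if best is None:
        return [data]  # nothing left to split: the block is final
    _, left, right = best
    return naive_partition(left) + naive_partition(right)

data = [(1.0, "Yes"), (2.0, "No"), (3.0, "Yes"), (4.0, "No")]
blocks = naive_partition(data)
print(len(blocks))  # 4: every example ends up in its own group
```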

To address this issue, we need a “compensating force”, to penalize the formation of blocks that are too small. For that purpose, the algorithm uses a criterion based on the Minimum Description Length principle (MDL). From what I gather, conceptually, the MDL principle “basically says you should pick the model which gives you the most compact description of the data, including the description of the model itself” [source]. In this case, our model is pretty terrible, because to represent the data, we end up using all of the data itself.

This idea appears in the full algorithm as an additional condition: a split of a group S into two sub-groups S1 and S2 will be accepted only if the entropy gain is greater than a minimum level, given by the formula

gain > log2(N - 1) / N + delta / N,
where delta = log2(3^k - 2) - [k x H(S) - k1 x H(S1) - k2 x H(S2)]

where N is the number of elements in the group to be split, k, k1 and k2 are the numbers of Classes present in S, S1 and S2, and H is the Entropy of each group.
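That minimum level can be sketched in a few lines of Python (using log2 throughout, as in the Fayyad & Irani paper; the post’s own code is F#). Applied to the contrived 4-example dataset from above, split at 2.0, the threshold comfortably exceeds the gain:

```python
from math import log2

def entropy2(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def min_gain(parent, left, right):
    """MDL-based minimum gain required to accept splitting
    `parent` into `left` and `right` (lists of class labels)."""
    n = len(parent)
    k, k1, k2 = len(set(parent)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (k * entropy2(parent)
                                - k1 * entropy2(left)
                                - k2 * entropy2(right))
    return log2(n - 1) / n + delta / n

# the contrived alternating dataset, split at 2.0:
parent = ["Yes", "No", "Yes", "No"]
left, right = ["Yes"], ["No", "Yes", "No"]
threshold = min_gain(parent, left, right)
gain = entropy2(parent) - (1/4) * entropy2(left) - (3/4) * entropy2(right)
print(gain < threshold)  # True: the split is rejected
```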

The derivation of that stopping criterion is way beyond my level in information theory (look at the Fayyad and Irani article listed below if you are curious about the details, it’s pretty interesting), so I won’t make a fool of myself and attempt to explain it. At a very high-level, though, with heavy hand-waving, the formula appears to make some sense:

(1/N) x log2(N-1) decreases to 0 as N goes to infinity; this introduces a penalty on splitting smaller datasets (to an extent).
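A quick numeric check of that behavior:

```python
from math import log2

# the log2(N - 1) / N part of the threshold, for growing block sizes
for n in (5, 50, 500, 5000):
    print(n, round(log2(n - 1) / n, 4))
# 5 0.4
# 50 0.1123
# 500 0.0179
# 5000 0.0025
```

In other words, a block of 5 elements has to clear a much higher bar (per element) than a block of 5000 before a split is accepted.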

Implementation

Here is my naïve implementation of the algorithm (available here on GitHub):

```fsharp
namespace Discretization

// Recursive minimal entropy partitioning,
// based on Fayyad & Irani 1993.
// See the following article, section 3.3,
// for a description of the algorithm:
// http://www.math.unipd.it/~dulli/corso04/disc.pdf
// Note: this can certainly be optimized.
module MDL =

    open System

    // Logarithm of n in base b
    let logb n b = log n / log b

    let entropy (data: (_ * _) seq) =
        let N = data |> Seq.length |> (float)
        data
        |> Seq.countBy snd
        |> Seq.sumBy (fun (_, count) ->
            let p = (float)count / N
            - p * log p)

    // A Block of data to be split, with its
    // relevant characteristics (size, number
    // of classes, entropy)
    type Block (data: (float * int) []) =
        let s = data |> Array.length |> (float)
        let classes =
            data |> Array.map snd |> Set.ofArray |> Set.count
        let k = classes |> (float)
        let h = entropy (data)
        member this.Data = data
        member this.Classes = classes
        member this.S = s
        member this.K = k
        member this.H = h

    // Entropy gained by splitting "original" block
    // into 2 blocks "left" and "right"
    let private entropyGain (original: Block) (left: Block) (right: Block) =
        original.H -
        ((left.S / original.S) * left.H + (right.S / original.S) * right.H)

    // Minimum entropy gain required
    // for a split of the "original" block
    // into 2 blocks "left" and "right"
    let private minGain (original: Block) (left: Block) (right: Block) =
        let delta =
            logb (pown 3. original.Classes - 2.) 2. -
            (original.K * original.H - left.K * left.H - right.K * right.H)
        ((logb (original.S - 1.) 2.) / original.S) + (delta / original.S)

    // Identify the best acceptable value
    // to split a block of data
    let split (data: Block) =
        // Candidate values to use as split
        // We remove the smallest, because
        // by definition no value is smaller
        let candidates =
            data.Data
            |> Array.map fst
            |> Seq.distinct
            |> Seq.sort
            |> Seq.toList
            |> List.tail
        let walls = seq {
            for value in candidates do
                // Split the data into 2 groups,
                // below/above the value
                let g1, g2 =
                    data.Data |> Array.partition (fun (v, c) -> v < value)
                let block1 = Block(g1)
                let block2 = Block(g2)
                let gain = entropyGain data block1 block2
                let threshold = minGain data block1 block2
                // if minimum threshold is met,
                // the value is an acceptable candidate
                if gain >= threshold
                then yield (value, gain, block1, block2) }
        if (Seq.isEmpty walls)
        then None
        else
            // Return the split value that
            // yields the best entropy gain
            walls
            |> Seq.maxBy (fun (value, gain, b1, b2) -> gain)
            |> Some

    // Top-down recursive partition of a data block,
    // accumulating the partitioning values into
    // a list of "walls" (splitting values)
    let partition (data: Block) =
        let rec recursiveSplit (walls: float list) (data: Block) =
            match (split data) with
            | None -> walls // no split found
            | Some(value, gain, b1, b2) ->
                // append new split value
                let walls = value :: walls
                // Search for new splits in first group
                let walls = recursiveSplit walls b1
                // Search for new splits in second group
                recursiveSplit walls b2
        // and go search!
        recursiveSplit [] data |> List.sort
```

The code appears to work, and is fairly readable / clean. I am not fully pleased with it, though. It’s a bit slow, and I have this nagging feeling that there is a much cleaner way to write that algorithm. I also dislike casting the counts to floats, but that’s the best way I found to avoid a proliferation of casts everywhere in the formulas, which operate mostly on floats (e.g. logs or proportions).

To avoid re-computing entropies, and counts of elements and classes, I introduced a Block class, which represents a block of data to be split – an array of (float * int), where the float is the continuous value and the int the label / index of the class. The algorithm recursively attempts to break blocks, and accumulates “walls” / split points along the way; it looks up every float value in the current block as a potential split point, generates a sequence of valid candidates, picks the one that generates the largest gain, and keeps searching in the two resulting blocks.

Results

So… does it work? This is by no means a complete validation (see the References below for some more rigorous analysis), but I thought I would at least try it on some synthetic data. The test script is on GitHub:
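The actual script is F# and lives in the repository; as a rough standalone equivalent, here is the full procedure re-sketched in Python (using log2 throughout, per the original paper), tried on one obvious synthetic case with a single clean boundary at 6.0, and on the contrived alternating example from earlier:

```python
from math import log2

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def best_split(data):
    """Best accepted split value of data [(value, label)], or None."""
    labels = [c for _, c in data]
    h, n, k = entropy(labels), len(data), len(set(labels))
    best = None
    for value in sorted({v for v, _ in data})[1:]:
        left = [c for v, c in data if v < value]
        right = [c for v, c in data if v >= value]
        gain = (h - len(left) / n * entropy(left)
                  - len(right) / n * entropy(right))
        # MDL-based minimum acceptable gain
        delta = log2(3 ** k - 2) - (k * h
                                    - len(set(left)) * entropy(left)
                                    - len(set(right)) * entropy(right))
        threshold = log2(n - 1) / n + delta / n
        if gain >= threshold and (best is None or gain > best[1]):
            best = (value, gain)
    return None if best is None else best[0]

def partition(data):
    """Recursively accumulate the accepted split values ("walls")."""
    wall = best_split(data)
    if wall is None:
        return []
    left = [(v, c) for v, c in data if v < wall]
    right = [(v, c) for v, c in data if v >= wall]
    return sorted([wall] + partition(left) + partition(right))

# an obvious case: values below 6 are all class 0, the rest class 1
data = [(float(v), 0) for v in range(1, 6)] + [(float(v), 1) for v in range(6, 11)]
print(partition(data))  # [6.0]

# the contrived alternating example: no split survives the MDL test
print(partition([(1.0, "Yes"), (2.0, "No"), (3.0, "Yes"), (4.0, "No")]))  # []
```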

… looks like the algorithm is handling these obvious cases just the way it should.

That’s it for today. I’ll come back to the topic of discretization soon, this time looking at Khiops / Chi-Square based approaches. In the meanwhile, maybe this will come in handy for some of you – and let me know if you have comments or questions!