Getting all the Entropy We Can Eat!

Visual Studio 6.0

This next routine, getEntropy, is the main routine for determining which classification is the most important for sales growth. We will turn to communication research and rip off one of their discoveries. As mentioned, entropy is a precise measurement of information.

In a nutshell, entropy is a measure of uncertainty in an observation. The entropy explains the amount of uncertainty about a particular outcome.

This is used a lot in determining information that comes over noisy data lines. We, however, have hijacked it for our marketing purposes.

We will be using logarithms, which happen to be a built-in function in VB 6.0. If you are a bit rusty on them, it simply means that logx = y where (2 to the power y) = x, in this case log has base 2. In English, this means that the log to base 2 of any number is any bits that it would take to represent that many numbers. For example, log2(16) = 4. This tells us that it takes 4 bits to uniquely represent 16 different values. So we will explain this in the code below. But hey, don't worry. We will just use the VB built-in log function to do the dirty work for us.

Try It Out - Determining the Entropy of a Classification

1. Please add another function to the frmID3 form. Name the function getEntropy and enter the code as shown. This routine will determine the entropy for the classification and return the entropy to the calling routine, determineEntropy.

Public Function getEntropy(ByVal sField As String, itotalRecords As Integer) As Single
'--------------------------------------------------------------
'-- First, determine how many unique values of this field are
'-- in the table. Store them and determine the entropy for each
'--------------------------------------------------------------
Dim sSql As String
Dim adoHoldGroup As ADODB.Recordset
Dim totalUp As Integer
Dim totalDown As Integer
Dim totalSamples As Integer
Dim fractionalEntropy As Single
Dim entropy As Single
Dim searchValue As String
fractionalEntropy = 0
Set adoHoldGroup = New ADODB.Recordset
adoHoldGroup.CursorLocation = adUseClient
'-- grab the info here--
sSql = "SELECT * FROM ID3 ORDER BY " & sField
adoHoldGroup.Open sSql, adoConnection
If (adoHoldGroup.BOF) And (adoHoldGroup.EOF) Then
MsgBox "No records in the ID3 Table"
Exit Function
End If
totalSamples = adoHoldGroup.RecordCount
adoHoldGroup.MoveFirst
searchValue = adoHoldGroup.Fields.Item(sField)
With adoHoldGroup
While Not .EOF
totalUp = 0
totalDown = 0
'---------------------------------------------------
'-- Loop through all of the records with like values
'---------------------------------------------------
Do While (searchValue = .Fields.Item(sField))
If !SalesUp = True Then
totalUp = totalUp + 1
Else
totalDown = totalDown + 1
End If
.MoveNext
If (.EOF) Then Exit Do
Loop
'---------------------------------------------------
'-- Now determine the entropy for that group --
'---------------------------------------------------
If (totalUp = 0) Then
fractionalEntropy = -1#
ElseIf (totalDown = 0) Then
fractionalEntropy = -1#
Else
fractionalEntropy = _
-(totalUp / totalSamples) * Log(totalUp / totalSamples) - _
(totalDown / totalSamples) * Log(totalDown / totalSamples)
End If
fractionalEntropy = (fractionalEntropy / itotalRecords)
entropy = entropy + fractionalEntropy
If (.EOF) Then
getEntropy = entropy + fractionalEntropy
Else
searchValue = adoHoldGroup.Fields.Item(sField)
End If
Wend
End With
End Function

How It Works

We want all of the records sorted by the particular field we will be examining. So if we are determining the entropy on the CountryLanguage classification, for example, all of the records that have English in the CountryLanguage field will be lumped together.

So what we want to do is loop through the recordset and determine the entropy on each of the records that have the same value in the CountryLanguage field. When the value of the field changes to, say, German, we will stop and calculate the value for English because we know we have looked at each of those. So by returning the records sorted by the classification field, we can be sure that all like records are together. An easy way to do this is to take the first value passed into our getEntropy function, sField. We can dynamically build our SQL statement instructing the recordset to be returned ordered by that field. So if we were determining the CountryLanguage, all entries for English would be together, etc.

We start by opening the recordset adoHoldGroup, which gives us all of the records in the ID3 table. Now that the recordset is opened, we to know which classification we will be working with. After all, it might be CountryRegion, CountryLanguage or Country. Well, we passed in the value as a parameter to the function, sField. Let's say that sField contains Country. So we use that variable and pass it to the recordset's Fields.Item() as a parameter. This will return the current value of that field (Country) in the current record. That value, say Spain, is assigned to the variable searchValue. Please notice that this is another example of a generalized routine. We don't hard code in a value, but permit a parameter passed into the routine, sField, to determine what we will be searching on. This allows us to use the same routine to examine each of the three classifications.

adoHoldGroup.MoveFirst
searchValue = adoHoldGroup.Fields.Item(sField)

As mentioned, we want to handle each category. Let's assume that we passed in CountryLanguage as the classification to be evaluated. If the CountryLanguage of the current record is English, then English is assigned to searchValue. As long as the value of the current record's sField (CountryLanguage) is equal to English, we loop through the recordset and increment the totalUp and totalDown fields. We are simply incrementing the totalUp or totalDown fields if that particular record shows that sales were either up or down. So if we had 4 records from four countries that speak English, and three of the countries experienced sales growth while the fourth did not, totalUp would be 3 and totalDown would be 1. As long as the searchValue (say English) is equal to the field being examined (CountryLanguage), the Do loop looks at each record and increments the totalUp or totalDown. In addition, since all of the records with like languages are grouped together, we know we will get them all. Now say the MoveNext goes to the next record and the Fields.Item(CountryLanguage) is now German. The code exits the Do loop and continues.

While Not .EOF
totalUp = 0
totalDown = 0
'---------------------------------------------------
'-- Loop through all of the records with like values
'---------------------------------------------------
Do While (searchValue = .Fields.Item(sField))
If !SalesUp = True Then
totalUp = totalUp + 1
Else
totalDown = totalDown + 1
End If
.MoveNext
If (.EOF) Then Exit Do
Loop

The Scary Part

Essentially, we want to see how many records for countries that speak English had sales that were up and how many were down. So we are going to calculate something called fractionalEntropy.

What we want to look at here is the fractionalEntropy when both the totalUp and totalDown are not 0. If one of these were indeed 0, we would get a divide by zero error. So right away this tells us that we want to have enough samples so that typically both the totalUp and totalDown will have a number greater than zero. We calculate the fractionalEntropy because this is currently being calculated for a single, say language. However, we still need to calculate the number for Spanish, French, German, etc. So we call this fractionalEntropy because it is a piece of the total.

What we want to do is look at the frequency of an "up" value out of the total values. Let's say that there were 10 countries with the language of English. If 5 were up, then the probability of the language classification having a positive sales growth is 5 out of 10, or 5/10. Likewise, the probability of the language classification having a negative sales growth is also 5/10. In general terms, we can look at the formula like so:

Outcome = - Probability(up) log Probability(up) - Probability(down)

log Probability(down)

In our example we would then have

Outcome = -5/10 log (5/10) - 5/10 log (5/10)

Outcome = 1 (i.e. no change. We can't discern anything here)

The outcome represents the uncertainty of sales going up, based on the information we placed in the ID3 table. This is NOT meant to be a math book, but only to add an element of interest in data mining and show a simple technique on how we can add another dimension to our knowledge of data mining.

We then divide the fractionalEntropy by the number of records in the recordset and assign this to fractionalEntropy. Now that we have calculated the fractionalEntropy for say, the English language, we add that to the variable entropy which represents the entire classification. So we would loop through to the next record in the ID3 table and determine the fractionalEntropy for all countries that speak Spanish. When finished, this fractionalEntropy factor for Spanish would be added to the entropy variable. So entropy is cumulative. Each of the fractionalEntropy factors for each language is just added to the variable entropy, which is the sum of each of the similar records.

If we have reached the end of the recordset, we assign the result to the name of the function, getEntropy, and return that value for the recordset based on the criteria we just analyzed. Otherwise, we update the value of searchValue with the current record. The code then jumps to the top of the loop, initializes the variables, and looks at the next group. Notice how we update the fractionalEntropy with the results from each grouping of say, CountryLanguage.

The final value is then returned to the calling routine, determineEntropy, and placed in the array for sorting and display. Let's now take a look at how to sort our array to ensure the lowest entropy value is stored in the first position.

Try It Out - Sorting Our Multi-Dimensional Array

1. Please add a final sub to our frmID3 form called qsort. We will call this to sort the array, Position, so the least entropy factor is in the lowest position. Essentially, this will sort our array in ascending order.

How It Works

We call this sub from the determineEntropy routine. Notice that we are passing in the array ByRef, or by reference. This is so that qsort actually has the memory location of the array. Passing in the array by reference permits the qsort routine to work on the actual array – not a copy of the array in memory. Passing in by reference is the default behavior of VB. It can make the changes for us here. A quick call to qsort from determineEntropy will sort the array. We can then print the array from determineEntropy and it will now be sorted for us.

The qsort routine works on the divide-and-conquer principle. It splits the array into two groups and then sorts both the top and the bottom. It does this by calling itself to sort each group. When a routine calls itself, that process is called recursion.

If the upper bound of the array is greater than the lower bound, we want to grab the middle. When we call qsort from determineEntropy, iHighBound = 2 and iLowBound = 0. In the next line we determine the middle of the array – which will be 1. Recall that the entropy value is in position 0 of the array and the classification is in position 1. So varMidBound is assigned the value from myArray(1, 0) which is the entropy value of the classification.

Now we loop as long as the lower bound <= the upper bound variables. If the entropy value stored in element intX (which will be 0 the first time though) is >= the entropy value stored in the middle value And the entropy value stored in the intY element (which will be 2) is <= the middle entropy value, we have some switching to do.

Do While intX <= intY
'-- if a value lower in the array is > than one --
'-- higher in the array, swap them now --
If myArray(intX, 0) >= varMidBound _
And myArray(intY, 0) <= varMidBound Then

So now we just swap what's in position intX with what is in position intY. Then we move the lower and upper boundaries towards each other and do the same test.

Now don't worry if this part looks a bit cryptic. Years are spent in some computer science courses discussing the various sorts and the math behind them. Luckily we can learn from the masters. It's unfortunate that VB does not have any built-in sorting capabilities like qsort. They are built into other languages, such as C.

To sort strings, instead of numbers, some programmers add a hidden list box to the form. The Sorted property is set to True. They just use the AddItem method to add strings in any order. When all of the strings are added, they start at position 0 of the list box and retrieve each of the items. They will of course now be in sorted order. The list box is not visible so the user never sees it, but this is a quick and dirty way to get some quick string sorting done. Unfortunately this will not work for numbers. A 2 would come before a –3, for example. Not exactly precise for our needs!

OK, finished! Press F5 to run your program. Don't forget to add a reference to the Microsoft ActiveX Data Objects 2.0 Library if you haven't done so already. Try the analyzing the various products and see how the results stack up. You are now ready to turn this beauty over to the product manager.

One word on the results. Since we only have a single country for each entry, this usually has the lowest entropy. So we can assume that the country is important. What really makes sense to look at is what is in the second position – after country. In this example for Hokkien Fried Mee, our program says that this is really on the decline in not a single country – but Europe in general! Now this is great information. The product manager might be wondering why sales are down in Germany, without realizing that sales across Europe for this product are in a nose-dive!

I think you can see the value of data mining and hopefully there is enough information provided for you to apply this technique to your real world problems. We re-aggregated the data into a single table. We then tried to find some logical classifications and applied the entropy approach to determine which one of them provided the least uncertainty about sales growth. We tried to illustrate a non-traditional use of database information.