Ex Machina: A Frolic through the Forests

Posted on 2018-12-28 by Steve Esling

In our previous entry of the Ex Machina series, we gave a broad overview of how machine learning is used in computer security, and briefly mentioned some of the techniques that InQuest uses to apply the insights gained by our artificial co-workers. Today, we’re going to take a deeper dive into two of our classifiers, Random Forests (RF) and Gradient Boosting (GB), and discuss some of their interesting findings. As Gradient Boosting, like Random Forest, builds an ensemble of decision trees rather than being an entirely separate kind of algorithm, we’ll just be giving an explanation of RF algorithms; GB forests add a few more high-level mathematical calculations to how they construct and use their trees.
You might recall from our previous Ex Machina article that random forests are a type of supervised learning; in other words, they require some pre-labeled “good” and “bad” data to get started. Once we have enough examples of each class assembled, we mix them together and dump them into the classifier so it can learn from them. The end result is a series of decision trees that work together to select the most likely group that a future data point falls under. All the steps in between revolve around the algorithm’s planting, growing, shaping, and pruning of these decision trees (forests). Let’s take a look inside this automated arboretum to get a better feel for what’s going on. In particular, let’s follow the growth of a single tree.

Down the Branching Rabbit Hole

A decision tree is often described as a long game of “20 questions,” but this doesn’t really provide a good visual metaphor for what it does. Another way of looking at it is as an incredibly complex laundry chute. At the top, we put in a load of laundry that we want to sort; perhaps we want the chute to leave all our clothes sorted into one pile of formal work outfits and another of casual wear. We already have a group of each labeled appropriately for the benefit of our machine, so we mash them up and throw the entire mass in. It lands in what will be the first of many sorting rooms, where the chute will scan its features (for example, sleeve length, material, color, and so on) and make some quick calculations. Its goal is to find the one feature that splits the pile into the two most homogeneous groups. In other words, it wants to reduce the work of future chambers by making groups where the odds of drawing one class of clothes from the pile at random are as far above the odds of drawing the other as possible. This reduction in randomness is referred to as “reducing entropy,” and every decision our chute makes is aimed at making this reduction as large as possible.
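To make the “reducing entropy” idea concrete, here is a minimal sketch of how one sorting room might score its candidate features. The function names (`entropy`, `information_gain`, `best_feature`) and the tiny laundry pile are invented for illustration; this is not InQuest's implementation, just the textbook calculation on made-up data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction achieved by splitting the pile on one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Entropy left over after the split, weighted by the size of each pile.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_feature(rows, labels):
    """The feature whose split reduces entropy the most."""
    return max(rows[0], key=lambda f: information_gain(rows, labels, f))

# Hypothetical, hand-labeled laundry (illustrative only).
clothes = [
    {"sleeves": "short", "color": "light"},
    {"sleeves": "short", "color": "dark"},
    {"sleeves": "short", "color": "light"},
    {"sleeves": "long", "color": "dark"},
    {"sleeves": "long", "color": "light"},
    {"sleeves": "long", "color": "dark"},
]
labels = ["casual", "casual", "casual", "formal", "formal", "casual"]

for feature in clothes[0]:
    print(feature, round(information_gain(clothes, labels, feature), 3))
print("split on:", best_feature(clothes, labels))
```

In this toy pile, splitting on sleeve length leaves one pile entirely casual, while splitting on color leaves both piles as mixed as the original, so the room would choose sleeve length.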

So, let’s say that it’s found that sorting the clothes by sleeve length, with one pile containing exclusively short sleeves and one long, leads to the short sleeve pile being made up of 80% casual wear and the long sleeve pile being 80% formal. Splitting on other features may have gotten decent results, such as a 70-30 split along color, but this feature was the one with the largest decrease in entropy. Now, we could stop here and have the model say that any clothes with short sleeves are casual and any with long are formal, and be correct the majority of the time. But, of course, it would be prudent to split our piles further, so each one is sent to another sorting room further down the chute. In the 80% casual pile’s chamber, it’s found that splitting by material, say cotton and wool, ends up perfectly separating the formal from the casual. In the other room, it’s found that splitting by color, darks/lights, creates one pile that’s 100% formal but still leaves some hangers-on in the casual pile. That pile will then be dumped down into a further chamber, and so on. In principle, the algorithm stops when all the piles are 100% casual or formal, though this is unlikely in real-world situations. Instead, we define an acceptable threshold for completion, say more than 95% classification one way or the other. The final pattern of chutes and sorting rooms is then “locked,” and now when we have a clothing article we would like to know the status of, we can drop it in, where it slides from room to room in the appropriate order based on its features and ends up, ideally, in the pile containing its correct classification.
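The whole chute can be sketched as a short recursive routine: pick the split that leaves the least entropy, send each pile to its own room, and stop once a pile is pure enough. The names `grow`, `classify`, and `purity`, the nested-dictionary tree layout, and the sample wardrobe are all assumptions made for this example rather than anything taken from a particular library.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def split(rows, labels, feature):
    """Group rows and labels into piles keyed by the feature's value."""
    piles = {}
    for row, label in zip(rows, labels):
        pile_rows, pile_labels = piles.setdefault(row[feature], ([], []))
        pile_rows.append(row)
        pile_labels.append(label)
    return piles

def grow(rows, labels, purity=0.95):
    """Build one tree; stop a branch once a single class exceeds `purity`."""
    majority, count = Counter(labels).most_common(1)[0]
    # Only features that still vary in this pile can split it further.
    usable = [f for f in rows[0] if len({r[f] for r in rows}) > 1]
    if count / len(labels) > purity or not usable:
        return majority  # a finished pile (leaf)

    def leftover(feature):
        piles = split(rows, labels, feature).values()
        return sum(len(l) / len(labels) * entropy(l) for _, l in piles)

    best = min(usable, key=leftover)  # least leftover entropy = largest reduction
    rooms = {v: grow(r, l, purity) for v, (r, l) in split(rows, labels, best).items()}
    return {"feature": best, "children": rooms, "default": majority}

def classify(tree, item):
    """Slide an item from room to room until it lands in a pile."""
    while isinstance(tree, dict):
        # Unseen feature values fall back to the room's majority class.
        tree = tree["children"].get(item[tree["feature"]], tree["default"])
    return tree

# Hypothetical labeled wardrobe (illustrative only).
clothes = [
    {"sleeves": "short", "material": "cotton", "color": "light"},
    {"sleeves": "short", "material": "cotton", "color": "dark"},
    {"sleeves": "short", "material": "cotton", "color": "light"},
    {"sleeves": "short", "material": "wool", "color": "dark"},
    {"sleeves": "long", "material": "wool", "color": "dark"},
    {"sleeves": "long", "material": "cotton", "color": "dark"},
    {"sleeves": "long", "material": "wool", "color": "dark"},
    {"sleeves": "long", "material": "cotton", "color": "light"},
]
labels = ["casual", "casual", "casual", "formal",
          "formal", "formal", "formal", "casual"]

tree = grow(clothes, labels)
print(classify(tree, {"sleeves": "long", "material": "wool", "color": "dark"}))
```

Lowering `purity` makes the routine stop earlier and return shallower trees.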

Expanding the Business

However, if we do this with just one tree, it leaves very little room for error. With this one, a clothing item with short sleeves made of cotton is always going to end up in the casual pile. But what if, in the future, we find an outfit that’s more appropriate for the office, yet still has both of these attributes? In this case, it might actually be better if we had just stopped at the first split (as we discussed earlier) because at least then we would expect around 20% of the clothing to be formal, and wouldn’t be too surprised if we drew this outfit from it. And what if there’s no path of chutes and rooms that results in a 100% pure pile, no matter which features we use and in what order? The answer to that gives random forest its name.

All the trees in the forest undergo the same construction process that our laundry chute does; however, they are cut off much earlier than when a pile reaches 100%. Their stopping point can be based on all sorts of criteria, such as when a pile reaches a 75-25 composition, or when the number of leaves (laundry piles) reaches a certain count. This results in each tree taking its own slightly different path to come up with a final sorting, as there’s usually more leeway in choosing the decision points (rooms) that lead to a mostly homogeneous grouping rather than a fully homogeneous one. We do this with hundreds, maybe even thousands of trees, each one simultaneously trying to find the best features to split its groups on. Once all of these trees are grown and healthy, and we put in a future data point, each tree is going to have its own answer. We then total up how many trees “voted” for each class, and the class with the most votes is selected. This can lead to insanely accurate results, with our particular RF algorithm consistently classifying malware correctly over 98% of the time.
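The voting step itself is simple enough to sketch in a few lines. The three “trees” below are hand-written stand-ins for shallow trained trees, and `forest_predict` is a hypothetical helper name; a real forest would hold hundreds of learned trees rather than three hard-coded rules.

```python
from collections import Counter

def forest_predict(trees, item):
    """Ask every tree for its answer and return the class with the most votes."""
    votes = Counter(tree(item) for tree in trees)
    # On a tie, Counter.most_common keeps the class that was voted for first.
    winner, _ = votes.most_common(1)[0]
    return winner, votes

# Stand-ins for three shallow trees, each stopped after a single split.
def tree_sleeves(item):
    return "casual" if item["sleeves"] == "short" else "formal"

def tree_material(item):
    return "formal" if item["material"] == "wool" else "casual"

def tree_color(item):
    return "formal" if item["color"] == "dark" else "casual"

forest = [tree_sleeves, tree_material, tree_color]

# A short-sleeved outfit that a sleeve-only tree would call casual.
outfit = {"sleeves": "short", "material": "wool", "color": "dark"}
label, votes = forest_predict(forest, outfit)
print(label, dict(votes))
```

Here the sleeve-based tree is outvoted two to one, which is exactly the kind of mistake by a single tree that the forest is meant to absorb.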

Harvesting the Fruits

Their usefulness doesn’t end at classification, though. By looking at where most trees made their first splits, second splits, and so on, we can actually get a ranking of how important certain features are in determining the classification of the data point in question. The following “feature importances” diagrams are extracted from a subset of our feature vector for identifying document macros as malicious or benign:

Some of the features and their significance as determined by the Random Forest classifier.

The same features and their importance as determined by Gradient Boosting.

It’s obvious from the histogram excerpts above that each classifier has its own idiosyncratic way of determining a macro’s likelihood of being malicious. Random forest, for example, has assigned a tremendous amount of weight to the number of strings over 18 characters; this is an insight our human researchers may not have overtly thought of. We'll pivot on such insights to discover interesting samples that may otherwise fly under the detection radar. The other weight assignments, while not as dramatic, are also notable for illustrating the differences between each classifier’s “thought process.” We can see that RF has given a lot of weight to empty lines and the overall length of the macro, while GB considers the maximum line and string lengths to be more important. Such differences, while sometimes contradictory between models, are also important for highlighting interesting avenues of further examination.
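One common way to turn split positions into a ranking like this is to credit each feature with the entropy it removed, weighted by how much of the data passed through that split, and then average over the trees. The sketch below assumes that approach; it is not necessarily how our classifiers compute their importances. The tree layout, the feature names, and every number in it are invented for illustration and are not taken from the diagrams above.

```python
from collections import defaultdict

def tree_importances(node, total=None, scores=None):
    """Credit each feature with the sample-weighted entropy its splits removed."""
    total = node["samples"] if total is None else total
    scores = defaultdict(float) if scores is None else scores
    children = node.get("children", [])
    if children:
        before = node["samples"] * node["entropy"]
        after = sum(c["samples"] * c["entropy"] for c in children)
        scores[node["feature"]] += (before - after) / total
        for child in children:
            tree_importances(child, total, scores)
    return scores

def forest_importances(trees):
    """Normalize each tree's scores to sum to 1, then average across trees."""
    totals = defaultdict(float)
    for tree in trees:
        scores = tree_importances(tree)
        norm = sum(scores.values()) or 1.0
        for feature, score in scores.items():
            totals[feature] += score / norm / len(trees)
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

# Two made-up trees; "samples" and "entropy" describe the pile in each room.
tree_a = {"feature": "long_strings", "samples": 100, "entropy": 1.0, "children": [
    {"samples": 50, "entropy": 0.0},
    {"feature": "empty_lines", "samples": 50, "entropy": 0.8, "children": [
        {"samples": 25, "entropy": 0.0},
        {"samples": 25, "entropy": 0.4},
    ]},
]}
tree_b = {"feature": "macro_length", "samples": 100, "entropy": 1.0, "children": [
    {"samples": 60, "entropy": 0.5},
    {"feature": "long_strings", "samples": 40, "entropy": 0.5, "children": [
        {"samples": 20, "entropy": 0.0},
        {"samples": 20, "entropy": 0.0},
    ]},
]}

ranking = forest_importances([tree_a, tree_b])
for feature, weight in ranking.items():
    print(f"{feature}: {weight:.3f}")
```

Because early splits see more of the data, features chosen near the top of many trees tend to accumulate the most weight.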

These are just two of the four classifiers we’re using for supervised learning, and we haven’t even touched on our unsupervised work yet. In a future edition of Ex Machina, we’ll take you through more tours of our computerized factories.