Here’s the thing about machine learning: train it on the right datasets and it’ll help you root out malware with great accuracy and efficiency. But models are what they eat. Feed them a diet of questionable, biased data and they’ll produce garbage.

The machine learning movement

A lot of security experts tout machine learning as the next step in anti-malware technology. Indeed, Sophos’ acquisition of Invincea earlier this year was designed to bring machine learning into the fold.

Machine learning is considered a more efficient way to stop malware in its tracks before it becomes a problem for the end user. Some of the high points:

But it would be dishonest to suggest that machine learning is the silver bullet – the security remedy that can do no wrong. As Sanders noted, no technology is perfect and its creators should always analyze weaknesses and come up with bigger and better models.

Biased data

Estimating the severity of that bias is important, and will help ensure your model isn’t garbage.

She said:

“Standard model validation results can be misleading. We want to know how our model is going to actually do in the wild, so we can make sure it doesn’t fail horribly. This is impossible. But we can still estimate. If we have access to an unbiased sample of deployment-like data, we can simulate our model’s deployment errors via time decay analysis. However, if we don’t have access to deployment-like data, then it’s impossible to accurately estimate how well our model will do on deployment, because we don’t have the right data to test it on.”
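One way to picture time decay analysis is the minimal Python sketch below. It assumes the technique means freezing a model at a training cutoff and scoring it on successively later, time-stamped samples; the function name, the toy classifier and the sample data are all hypothetical and are not taken from Sanders’ talk.

```python
from collections import defaultdict


def time_decay_accuracy(samples, predict, train_cutoff):
    """Return accuracy per time period for samples dated after the cutoff.

    samples: iterable of (period, features, label) tuples, where period is
    any sortable value such as a month number (an assumed data layout).
    predict: the already-trained model, as a function of features.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for period, features, label in samples:
        if period <= train_cutoff:
            continue  # skip data from the training window
        totals[period] += 1
        hits[period] += int(predict(features) == label)
    return {p: hits[p] / totals[p] for p in sorted(totals)}


def toy_predict(url_length):
    """Hypothetical stand-in model: flags long URLs as malicious."""
    return url_length > 50


# Hypothetical (month, url_length, is_malicious) samples that drift over time.
samples = [
    (1, 80, True), (1, 20, False),  # month 1: the training window
    (2, 70, True), (2, 30, False),
    (3, 60, True), (3, 55, False),
    (4, 40, True), (4, 58, False),
]

decay = time_decay_accuracy(samples, toy_predict, train_cutoff=1)
for month, acc in decay.items():
    print(f"month {month}: accuracy {acc:.2f}")
```

On this invented data the accuracy falls month by month. That falling curve is the signal a time decay analysis looks for, and it is only as trustworthy as the deployment-like sample behind it.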

The next best option, she said, is to test how sensitive one’s models are to new datasets they weren’t trained on, and pick training datasets and model configurations that perform consistently well on a variety of test sets, not just the test datasets that originate from the same parent as the model’s training dataset.

“That helps give us a sort of very rough ‘confidence interval’ surrounding deployment accuracy, and also improves the likelihood that our model won’t do poorly on deployment.”
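The sensitivity test described above can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than Sanders’ actual setup: the three source names, the single-feature threshold “model” and the data are invented, and the min-to-max accuracy range is just one simple way to express a rough interval.

```python
def accuracy(predict, dataset):
    """Fraction of (x, y) pairs the model labels correctly."""
    return sum(predict(x) == y for x, y in dataset) / len(dataset)


def fit_threshold(train):
    """Hypothetical toy learner: threshold halfway between class means."""
    pos = [x for x, y in train if y]
    neg = [x for x, y in train if not y]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x > threshold


def sensitivity_matrix(sources, fit):
    """Train on each source, then test on every source, including others."""
    matrix = {}
    for train_name, train_split in sources.items():
        predict = fit(train_split["train"])
        matrix[train_name] = {
            test_name: accuracy(predict, split["test"])
            for test_name, split in sources.items()
        }
    return matrix


def accuracy_range(row):
    """Worst and best accuracy across test sets: a very rough interval."""
    return min(row.values()), max(row.values())


def most_consistent(matrix):
    """Pick the training source with the best worst-case accuracy."""
    return max(matrix, key=lambda name: min(matrix[name].values()))


# Invented (feature, is_malicious) data for three hypothetical sources.
sources = {
    "source_a": {"train": [(80, True), (20, False)],
                 "test": [(70, True), (30, False), (60, True), (40, False)]},
    "source_b": {"train": [(100, True), (60, False)],
                 "test": [(90, True), (65, False), (85, True), (45, False)]},
    "source_c": {"train": [(40, True), (10, False)],
                 "test": [(45, True), (15, False), (55, True), (10, False)]},
}

matrix = sensitivity_matrix(sources, fit_threshold)
best = most_consistent(matrix)
print(best, accuracy_range(matrix[best]))
```

Each model scores perfectly on the test set drawn from its own parent source, which is exactly why that single number misleads. Comparing the worst case across the other sources is what separates the training sets here.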

Minimize the probability of failing spectacularly

Since machine learning in security is still relatively new, there’s no bullet-proof answer to how to root out the garbage. But Sanders suggested some starting points.

To select the best possible training set and model configuration, one must first map the limitations of the fitted model to get a more accurate starting point, she said.

To get a more accurate measurement, Sanders walked Black Hat attendees through sensitivity results from one deep learning model, designed to detect malicious URLs, that was trained and tested across three different sources of URL data.

“By simulating the errors, we can better develop training datasets and model configurations that are most likely to perform reliably well on deployment,” Sanders said.