Introduction to CuckooML: Machine Learning for Cuckoo Sandbox

It’s all about data.. Malware datasets tend to be relatively large and sparse. They are mostly made of categorical and string data, hence there is a strong need for good feature extraction approaches to obtain numerical vectors that can be feed into machine learning algorithms [e.g. Back to the Future: Malware Detection with Temporally Consistent Labels; Miller B., et al.]. Another common problem is concept drift, the continuous variation of malware statistical properties caused by never ending arms race between malware and antivirus developers. Unfortunately, this makes fitting the clusters even harder and requires the chosen approach to be either easy to re-train or be adaptable to the drift, with the latter option being more desirable. For such big datasets the choice of the clustering algorithm is critical as most…