For the past two weeks I’ve been experimenting with various malware feature transformations and combinations. It came to my understanding that using any Euclidean distance-based clustering for a dataset where more than 95% of boolean features might not be the best approach.

I spent the last couple of days to implement miscellaneous functions helping with clustering. Among others: saving clustering results into JSON, label statistics per cluster, comparing a new sample to in-memory clustering and basic version of anomaly detection.

This time I did clustering with DBSCAN and HDBSCAN. These are fairly simple yet effective clustering algorithms suitable for any kind of data. Additionally, I’ve implemented a method evaluating the results of clustering with variety of metrics: Adjusted Mutual Information Score, Adjusted Random Index, Completeness, Homogeneity, Silhouette Coefficient, and V-measure. Finally, I did loads of experiments to pick te best parameters for those.

I was suggested to develop a mechanism to detect binaries that behave abnormally. To this end, I used count features that count numerous operations performed by the binaries like number of files written, number of network connections, etc. As a method for the outlier detection I used classic boxplots.

This week brought functions revolving around the complete feature set construction. These features are necessary for effective malware clustering and were extracted and preprocessed to be of high quality and useful for the project.

I spent this week playing around with generation and visualisation of simple malware features obtained through signatures JSON field. This field contains some high level binary descriptions provided by Cuckoo Sandbox analysis. This field among others contains a tag and its description. All these tags come from a finite set, hence, they can be used as binary (true/false) features for the analysis.