Centroid

Abstract

Code similarity measures create a comparison metric showing to what degree two code samples have the same functionality, e.g., to statically detect the use of known libraries in binary code. They are both an indispensable part of automated malware analysis, as well as a helper for the detection of plagiarism (IP protection) and the illegal use of open-source libraries in commercial apps. The centroid similarity metric extracts control-flow features from binary code and encodes them as geometric structures before comparing them. In our paper, we propose novel improvements to the centroid approach and apply it to the ARM architecture for the first time. We implement our approach as a plug-in for the IDA Pro disassembler and evaluate it regarding efficiency, accuracy and robustness on Android. Based on a dataset of 508.745 APKs, collected from 18 third-party app markets, we achieve a detection rate of 89% for the use of native code libraries, with an FPR of 10.8%. To test the robustness of our approach against the compiler version, optimization level, and other code transformations, we obfuscate and recompile known open-source libraries to evaluate which code transformations are resisted. Based on our results, we discuss how code re-use can be hidden by obfuscation and conclude with possible improvements. To encourage the reproducibility of our experiments and further developments, we publish our code under the MIT license.