A major problem in modern systems is the prevalence of vulnerable applications. As modern operating systems become more user friendly, a large share of their users are inexperienced and not trained to protect their systems from exploitation. Many users do not update their systems regularly, leaving them exposed to publicly disclosed vulnerabilities with known exploits. Forecasts show that security-related expenditure is rising sharply, which reiterates that prevention is a key part of many corporations' business strategy, since exploitable vulnerabilities tend to cost far more. One such example is a report from IBM, which analyzed 507 companies between July 2018 and April 2019 and concluded that the average cost of a data breach is 8.19 million dollars.
In this context, the need for automatic vulnerability detection has grown, especially given the large array of available libraries and open-source programs. When an application is written, trust is placed in every library it links to and in every other application that communicates with it. This paper aims to improve the results obtained for code vulnerability detection with an implementation of a C/C++ vulnerability detection method based on normalizing the lexicon of project source code into tokens and classifying the result with deep learning approaches. The proposed pipeline is a three-stage process: scraping source files from GitHub and producing labels with a static analyzer, training the deep learning model on the labeled and normalized samples, and analyzing the results across different deep learning methods. We supplement the available labeled vulnerability datasets with an array of labeled open-source projects gathered from GitHub with a web crawler. We improve the vulnerability detection tool by training with binary cross-entropy loss and the Adam optimizer. We evaluated our implementation on real software packages and the NIST SATE IV and Draper VDISC benchmark datasets, obtaining 96% accuracy.
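To make the lexicon-normalization stage concrete, the following is a minimal sketch of the general idea: user-defined identifiers and literals in C/C++ source are mapped to generic tokens so a model learns from code structure rather than naming conventions. The keyword/API list, the token names (`ID0`, `NUM`, `STR`), and the regex-based tokenizer are illustrative assumptions, not the exact implementation used in the paper.

```python
import re

# Hypothetical, abridged set of C/C++ keywords and library calls kept verbatim;
# a real implementation would use a full keyword list and a proper lexer.
KEYWORDS = {"int", "char", "if", "else", "return", "for", "while", "void",
            "sizeof", "strcpy", "malloc", "free"}

def normalize(source: str) -> list:
    """Replace user identifiers and literals with generic tokens."""
    # Strip block comments (across newlines) and line comments.
    source = re.sub(r"/\*.*?\*/|//[^\n]*", " ", source, flags=re.S)
    # Tokenize into identifiers, numbers, string literals, and single symbols.
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\"[^\"]*\"|\S", source)
    out, id_map = [], {}
    for tok in tokens:
        if tok in KEYWORDS:
            out.append(tok)                        # keep keywords/APIs as-is
        elif re.fullmatch(r"[A-Za-z_]\w*", tok):
            id_map.setdefault(tok, "ID%d" % len(id_map))
            out.append(id_map[tok])                # user identifier -> IDn
        elif re.fullmatch(r"\d+", tok):
            out.append("NUM")                      # numeric literal
        elif tok.startswith('"'):
            out.append("STR")                      # string literal
        else:
            out.append(tok)                        # operators, punctuation
    return out

print(normalize('char buf[10]; strcpy(buf, user_input);'))
# -> ['char', 'ID0', '[', 'NUM', ']', ';', 'strcpy', '(', 'ID0', ',', 'ID1', ')', ';']
```

The resulting token sequences can then be integer-encoded and fed to the deep learning classifiers; keeping dangerous API names such as `strcpy` un-anonymized preserves a signal that is often relevant to vulnerability detection.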