Anomaly Behavior Analysis of Website Vulnerability and Security

By Pratik Satam

The World Wide Web has grown exponentially over the past decade, both in size, which now exceeds a billion websites, and in its even larger number of daily users. Web use has become pervasive, touching all aspects of our lives, the economy, and education. A wide range of heterogeneous devices (mobile or stationary) access the internet for various functions, and with the introduction of the Internet of Things (IoT), the number of connected devices is expected to grow to more than 50 billion. Most of the content on the internet is hosted on websites, which are essentially collections of Hyper Text Markup Language (HTML) webpages. Recent advances in the HTML standard and in browser stacks allow webpages to be accessed by any device that runs browser software and has internet connectivity. These rapid advances have also significantly increased the vulnerabilities of websites, which are being hacked on a daily basis. Web vulnerabilities have provided unprecedented opportunities for cybercrime and malicious activities launched by individuals, groups, or governments, such as illegal financial transactions, data breaches, identity theft, and theft of intellectual property. Web attacks range from phishing and the use of webpages to deliver malware to more complex attacks such as cross-site scripting and cookie poisoning. Without effective website security measures in place, one can expect website security to become even more critical.

The main goal of this research is to address this challenge by presenting an online anomaly behavior analysis of websites (e.g., HTML files) that detects malicious code or pages injected by web attacks. Our anomaly analysis approach uses feature selection, data mining, data analytics, and statistical techniques to accurately identify webpage content that has been compromised or can be exploited by attacks such as phishing, cross-site scripting, HTML injection, and malware insertion, to name a few. We validated our approach on more than 10,000 files and showed that it can detect malicious (abnormal) HTML files with a true positive rate of 99% and a false positive rate of 0.8%.
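As an illustration only, the kind of pipeline described above can be sketched in a few lines of Python. The sketch counts a handful of HTML tags as features and computes the true positive and false positive rates from labeled predictions. The watched tags, the class name `TagCounter`, and the helpers `extract_features` and `rates` are assumptions made for this example; the abstract does not list the actual features or classifier used in the research.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts a few tag types (an assumed, illustrative feature set)."""

    # Hypothetical choice of tags; not taken from the original work.
    WATCHED = ("script", "iframe", "form", "object")

    def __init__(self):
        super().__init__()
        self.counts = {tag: 0 for tag in self.WATCHED}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in self.counts:
            self.counts[tag] += 1


def extract_features(html_text):
    """Return a dict mapping each watched tag to its number of occurrences."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


def rates(labels, predictions):
    """Return (true positive rate, false positive rate); True means malicious."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    tn = sum(1 for y, p in pairs if not y and not p)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr
```

In such a sketch, the feature dictionaries would be fed to whatever anomaly detector is chosen, and `rates` would then be applied to its predictions on a labeled test set to obtain figures comparable to the rates reported above.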