Abstract

In today's Internet, online content and especially webpages have increased exponentially. Alongside this huge rise, the number of users has also amplified considerably in the past two decades. Most responsible institutions such as banks and governments follow specific rules and regulations regarding conducts and security. But, most websites are designed and developed using little restrictions on these issues. That is why it is important to protect users from harmful webpages. Previous research has looked at to detect harmful webpages, by running the machine learning models on a remote website. The problem with this approach is that the detection rate is slow, because of the need to handle large number of webpages. There is a gap in knowledge to research into which machine learning algorithms are capable of detecting harmful web applications in real time on a local machine.

The conventional method of detecting malicious webpages is going through the black list and checking whether the webpages are listed. Black list is a list of webpages which are classified as malicious from a user's point of view. These black lists are created by trusted organisations and volunteers. They are then used by modern web browsers such as Chrome, Firefox, Internet Explorer, etc. However, black list is ineffective because of the frequent-changing nature of webpages, growing numbers of webpages that pose scalability issues and the crawlers' inability to visit intranet webpages that require computer operators to login as authenticated users.

The thesis proposes to use various machine learning algorithms, both supervised and unsupervised to categorise webpages based on parsing their features such as content (which played the most important role in this thesis), URL information, URL links and screenshots of webpages. The features were then converted to a format understandable by machine learning algorithms which analysed these features to make one important decision: whether a given webpage is malicious or not, using commonly available software and hardware. Prototype tools were developed to compare and analyse the efficiency of these machine learning techniques. These techniques include supervised algorithms such as Support Vector Machine, Naïve Bayes, Random Forest, Linear Discriminant Analysis, Quantitative Discriminant Analysis and Decision Tree. The unsupervised techniques are Self-Organising Map, Affinity Propagation and K-Means. Self-Organising Map was used instead of Neural Networks and the research suggests that the new version of Neural Network i.e. Deep Learning would be great for this research.

The supervised algorithms performed better than the unsupervised algorithms and the best out of all these techniques is SVM that achieves 98% accuracy. The result was validated by the Chrome extension which used the classifier in real time. Unsupervised algorithms came close to supervised algorithms. This is surprising given the fact that they do not have access to the class information beforehand.