Abstract

Malicious web content is a serious problem on the Internet today. In this paper we propose a deep learning approach to detecting malicious web pages. While past work on web content detection has relied on syntactic parsing or on emulation of HTML and JavaScript to extract features, our approach operates directly on a language-agnostic stream of tokens extracted from static HTML files with a simple regular expression. This makes it fast enough to operate in high-frequency data contexts like firewalls and web proxies, and allows it to avoid the attack surface exposure of complex parsing and emulation code. Unlike well-known approaches such as bag-of-words models, which ignore spatial information, our neural network examines content at hierarchical spatial scales, allowing our model to capture locality and yielding superior accuracy compared to bag-of-words baselines. Our proposed architecture achieves a 97.5% detection rate at a 0.1% false positive rate, and classifies web pages in small batches at a rate of over 100 per second on commodity hardware. The speed and accuracy of our approach make it appropriate for deployment to endpoints, firewalls, and web proxies.
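
As a rough illustration of what tokenizing static HTML with a regular expression and examining it at several spatial scales could look like, the following minimal sketch splits the token stream into 1, 2, and 4 contiguous chunks and builds a hashed token-count vector for each chunk. The regular expression, the number of levels, the bin count, and the function names are all assumptions made for this example; the abstract does not specify them, and the sketch omits the neural network entirely.

```python
import re
import zlib

# Hypothetical token pattern: runs of word characters. The abstract only says
# "a simple regular expression", so this exact pattern is an assumption.
TOKEN_RE = re.compile(r"\w+")


def tokenize(html):
    """Extract tokens from raw HTML without parsing or emulating it."""
    return TOKEN_RE.findall(html)


def hashed_counts(tokens, n_bins):
    """Count tokens into a fixed-size vector using a stable hash (CRC32)."""
    vec = [0] * n_bins
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % n_bins] += 1
    return vec


def hierarchical_features(html, levels=(1, 2, 4), n_bins=16):
    """Return one hashed count vector per chunk at each scale.

    For each value p in `levels` (assumed scales), the token stream is cut
    into p contiguous chunks, so coarse scales summarize the whole page and
    fine scales preserve where in the page tokens occur.
    """
    tokens = tokenize(html)
    features = []
    for parts in levels:
        size = -(-len(tokens) // parts)  # ceiling division
        for i in range(parts):
            chunk = tokens[i * size:(i + 1) * size]
            features.append(hashed_counts(chunk, n_bins))
    return features


if __name__ == "__main__":
    page = "<html><script>eval(x)</script></html>"
    feats = hierarchical_features(page)
    print(len(feats), "vectors of length", len(feats[0]))
```

A single bag-of-words model corresponds to the coarsest level alone; the finer levels are what carry locality information, which in the proposed approach would be consumed by the neural network rather than used directly.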