I’ve been collecting them for years, trying to get my hands on anything that was released within the genre. It started as a necessity, transformed into a hobby, and eventually turned into a relatively huge collection… But that’s when the problems started.

While back in 2005 I could barely find freeware web application scanners, by 2008 I had SO MANY of them that I couldn’t decide which ones to use. By 2010 the collection became so big that I came to the realization that I HAVE to choose.

I started searching for benchmarks in the field, but at the time I could only locate benchmarks that focused on comparing commercial web application scanners (with the exception of one benchmark that also covered 3 open source web application scanners), leaving the freeware & open source scanners in uncharted territory.

By 2010 I had over 50 tools, so I eventually decided to test them myself using the same model used in previous benchmarks (a big BIG mistake).

I initially tested the various tools against a vulnerable ASP.net web application and came to conclusions as to which tool is the “best”… and if it weren’t for my curiosity, that probably would have been the end of it, and my conclusions might have misled many more.

I decided to test the tools against another vulnerable web application, just to make sure the results were consistent, and arbitrarily selected “Insecure Web App” (a vulnerable JEE web application) as the second target… and to my surprise, the results of the tests against it were VERY different.

Some of the tools that were efficient in the test against the vulnerable ASP.net application (which will stay anonymous for the time being) didn’t function very well and missed many exposures, while some of the tools that I previously classified as “useless” detected exposures that NONE of the other tools found.

After performing an in-depth analysis of the different vulnerabilities in the tested applications, I came to the conclusion that although the applications included similar classes of exposures (SQL Injection, RXSS, information disclosure, etc.), the properties and restrictions of the exposure instances were VERY different in each application.

That’s when it dawned on me that the different methods that tools use to discover security exposures might be efficient for detecting certain common instances of a vulnerability while simultaneously being inefficient for detecting other instances of the same vulnerability, and that tools with “lesser” algorithms or different approaches (which might appear to be less effective at first) might be able to fill the gap.

So the question remains… Which tool is the best? Is there one that surpasses the others? Can there be only one?

I decided to find out…

It started as a bunch of test cases, and ended as a project containing hundreds of scenarios (currently focusing on Reflected XSS and SQL Injection) that will hopefully help in unveiling the mystery.

Before I describe project WAVSEP and the results of the first scanner benchmark performed using it, I’d like to thank all the tool developers and vendors that have shared freeware & open source tools with the community over the years; if it weren’t for the long hours they’ve invested and their generosity in sharing their creations, my job (and that of others in my profession) would have been much harder.

I hope that the conclusions, ideas, information and payloads presented in this research (and the benchmarks and tools that will follow) will benefit all vendors, and specifically help the open source community locate code sections that tool vendors could assimilate to improve their products. To that end, I’ll try to contact each vendor in the next few weeks, in order to point out source code that could be assimilated into their product to make it even better (depending on the development technology and the license of each code section).

Phase I – The “Traditional” Benchmark

Testing the scanners against vulnerable training & real life applications.

As I mentioned earlier, in the initial phase of the benchmark I tested the various scanners against different vulnerable “training” applications (OWASP InsecureWebApp, a vulnerable .Net application and a simple vulnerable application I wrote myself), and tested many of them against real-life applications (ASP.Net applications, Java applications based on Spring, web applications written in PHP, etc.).

I decided not to publish the results just yet, and for a damn good reason which I did not predict in the first place; nevertheless, the initial process was very helpful because it helped me learn about the different aspects of the tools: features, vulnerability list, coverage, installation process, configuration methods, usage, adaptability, stability, performance and a bunch of other aspects.

I found VERY interesting results suggesting that certain old scanners can provide great benefits in many cases that many modern projects do not handle properly.

The process also enabled me to verify that the various tools actually support their proclaimed features (which I literally did for the vast majority of the tools, using proxies, sniffers and other experiments), and even to get a general measure of their accuracy and capabilities.

However, after seeing the diversity of results across different applications and technologies, and after dealing with the countless challenges that came along the way, I discovered several limitations, and even a fundamental flaw, in testing the accuracy, coverage, stability and performance of scanners in this manner (I had managed to test around 50 free and open source scanners by this point, as insane and unbelievable as that number might sound).

We may be able to estimate the general capabilities of a scanner from the number of REAL exposures that it located, the number of exposures that it missed (false negatives) and the number of FALSE exposures (false positives) it identified as security exposures; on the other hand, the output of such a process depends very much on the types of exposures that exist in the tested application, how well each scanner is adapted to the tested application’s technology, and which private cases of exposures and barriers exist in the tested application.
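To make this concrete, here is a minimal sketch of the kind of scoring such a process yields. The function name, fields and numbers are mine, for illustration only; no specific benchmark scoring is implied:

```python
def accuracy_metrics(true_positives, false_negatives, false_positives):
    """Summarize a scan run: a detection ratio derived from the real
    exposures found vs. missed, plus a raw false-positive count.
    Illustrative sketch only."""
    total_real = true_positives + false_negatives
    detection_ratio = true_positives / total_real if total_real else 0.0
    return {"detection_ratio": detection_ratio,
            "false_positives": false_positives}

# e.g. a scanner that found 30 of 40 real exposures and raised 5 false alarms
summary = accuracy_metrics(true_positives=30, false_negatives=10, false_positives=5)
```

The catch described above is that both inputs and outputs of this computation are entirely shaped by which exposure instances the tested application happens to contain.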

A scanner that is very useful for scanning PHP web sites might completely fail at the task of scanning an ASP.Net web application, and a tool perfectly suited for that task might crash when faced with certain application behaviors, or be useless in detecting a private case of a specific vulnerability that is not supported by the tool.

I guess what I’m trying to say is this:

There are many forms and variations to each security exposure, and in order to prove my point, I’ll use the example of reflected cross site scripting;

Locations vulnerable to reflected cross site scripting come in many forms: they may require the attacker to send a whole HTML tag as part of the crafted link; they may require the injection of an HTML event (in case the input-affected output is printed in the context of a tag and the usage of tag-composing characters is restricted); they may appear in locations also vulnerable to SQL injection (which restrict the use of certain characters, or even require an initial payload that “disables” the SQL injection vulnerability first); they may require browser-specific payloads, or even the direct injection of JavaScript/VBScript (in case the context is within a script tag, certain HTML events, or even certain properties); and these cases are only a fragment of the whole list!
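As a rough illustration of the variations above, here is a Python sketch of how a scanner might map reflection contexts to candidate payloads. The context names and payloads are common textbook examples, not the logic of any specific tool:

```python
# Illustrative only: common context-to-payload examples, not exhaustive.
CONTEXT_PAYLOADS = {
    # reflection lands in the HTML body: a whole tag must be injected
    "html_body": "<script>alert(1)</script>",
    # reflection lands inside a tag attribute and tag-composing
    # characters are restricted: break out and inject an HTML event
    "tag_attribute": '" onmouseover="alert(1)',
    # reflection lands inside an existing script block: terminate the
    # string and inject script code directly, no tag required
    "script_string": "';alert(1);//",
}

def candidate_payloads(context):
    """Return the payloads worth trying for a given reflection context;
    unknown contexts fall back to trying everything."""
    if context in CONTEXT_PAYLOADS:
        return [CONTEXT_PAYLOADS[context]]
    return list(CONTEXT_PAYLOADS.values())
```

A scanner that only implements one of these contexts will score perfectly on test cases of that form and miss the rest, which is exactly the blind spot a general benchmark hides.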

So, how can the tester know which of these cases is handled by each scanner from the figures and numbers presented in a general benchmark?

I believe he can’t. No matter how solid the difference appears, he really can’t.

Such information may allow him to rule out useless tools (tools that miss even the most obvious exposures), and even to identify what appears to be a significant difference in the accuracy of locating certain exposure instances; but the latter result might have looked very different had the tested applications been prone to the exposure instances that are the specialty of a different scanner, or had they included a technological barrier that requires a specific feature or behavior to bypass.

Thus, I have come to believe that the only way I could truly provide useful information to testers on the accuracy and coverage of freely available web application scanners is by writing detailed test cases for different exposures, starting with some core common exposures such as SQL Injection, cross site scripting and maybe a couple of others.

And thus, I have ended up investing countless nights in the development of a new test-case based evaluation application, designed specifically to test the support of each tool for detecting MANY different cases of certain common exposures.

The results of the original benchmark (against the vulnerable training web applications) will be published separately in a different article (since by now, many of them have been updated, and the results require modifications).

Phase II - Project WAVSEP

After documenting and testing the features of every free & open source web application scanner and scan script that I could get my hands on, I discovered that the most common features were Reflected Cross Site Scripting (RXSS) and SQL Injection (SQLi). I decided to focus my initial efforts on these two vulnerabilities, and develop a platform that could truly evaluate how good each scanner is at detecting them, which tool combinations provide the best results, and which tool can bypass the largest number of detection barriers.

A test case is defined as a unique combination of the following elements:

·A specific instance of a given vulnerability.

·Attack vectors with specific input origins (either GET or POST values, and in the future, also URL/path, cookie, various headers, file upload content and other origins).

Currently, only GET and POST attack vectors are covered, since most scanners support only GET and POST vectors (future versions of WAVSEP will include support for additional databases, additional response types, additional detection barriers, additional attack vector origins and additional vulnerabilities).
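In other words, the test-case matrix is roughly the cross product of vulnerability instances and input origins. A small sketch (the instance names are illustrative, not WAVSEP's actual case identifiers):

```python
from itertools import product

# Illustrative vulnerability instances; not WAVSEP's real case names.
VULNERABILITY_INSTANCES = [
    "rxss_in_html_body", "rxss_in_tag_attribute",
    "sqli_error_based", "sqli_blind_boolean",
]
INPUT_ORIGINS = ["GET", "POST"]  # future: URL/path, cookie, headers, upload

def build_test_cases():
    """Each test case is one (vulnerability instance, input origin) pair."""
    return [{"instance": i, "origin": o}
            for i, o in product(VULNERABILITY_INSTANCES, INPUT_ORIGINS)]
```

This is why the project ends up with hundreds of scenarios: every new detection barrier or origin multiplies the matrix.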

As mentioned before, the benchmark focused on testing free & open source tools that are able to detect (and not necessarily exploit) security vulnerabilities on a wide range of URLs, and thus, each tool tested needed to support the following features:

·Either open source or free to use, so that open source projects and vendors generous enough to contribute to the community will benefit from the benchmark first.

·The ability to scan multiple URLs at once (using either a crawler/spider feature, a URL/log file parsing feature or a built-in proxy).

·The ability to control and limit the scan to an internal or external host (domain/IP).

As a direct implication, the test did NOT include the tools listed in Appendix A – A List of Tools Not Included In The Test.

The purpose of WAVSEP’s test cases is to provide a scale for understanding which detection barriers each scanning tool can bypass, and which vulnerability variations can be detected by each tool.

The Reflected Cross Site Scripting vulnerable pages are pretty standard & straightforward, and should provide a reliable basis for assessing the detection capabilities of different scanners.

However, it is important to remember that the SQL Injection vulnerable pages used a MySQL database as the data repository, and thus the SQL Injection results only reflect the detection of SQL Injection vulnerabilities against this type of database; the results might vary when the back-end data repository is different (a theory that will be verified in the next benchmark).
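The reason is that error-based SQL injection detection usually relies on database-specific error strings, so signatures tuned for MySQL say little about other back ends. A minimal sketch (the signatures below are common examples, not a complete or authoritative list):

```python
# Common database error signatures; illustrative, not exhaustive.
DB_ERROR_SIGNATURES = {
    "mysql": ["You have an error in your SQL syntax"],
    "mssql": ["Unclosed quotation mark after the character string"],
    "oracle": ["ORA-01756"],  # "quoted string not properly terminated"
}

def detect_backend(response_body):
    """Return the back ends whose error signatures appear in a response."""
    return [db for db, sigs in DB_ERROR_SIGNATURES.items()
            if any(sig in response_body for sig in sigs)]
```

A scanner whose signature list covers only one of these families will score well against that database and poorly against the others, regardless of its injection logic.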

Description of Comparison Tables

The list of tools tested in this benchmark is organized within the following reports:

Additional information was gathered during the benchmark, including information related to the different features of various scanners. These details are organized in the following reports, and might prove useful when searching for tools for specific tasks or tests:

Aside from the Count column (which represents the total number of active vulnerability detection features supported by the tool, not including complementary features such as web server scanning and passive analysis), each column in the report represents an active vulnerability detection feature, which translates to the exposure presented in the following list:

SQL – SQL Injection

BSQL – Blind SQL Injection

RXSS – Reflected Cross Site Scripting

PXSS – Persistent / Stored Cross Site Scripting

DXSS – DOM XSS

Redirect – External Redirect / Phishing via Redirection

Bck – Backup File Detection

Auth – Authentication Bypass

CRLF – CRLF Injection / Response Splitting

LDAP – LDAP Injection

XPath – XPath Injection

MX – MX Injection

Session Test – Session Identifier Complexity Analysis

SSI – Server Side Include

RFI-LFI – Directory Traversal / Remote File Include / Local File Include (Will be separated into different categories in future benchmarks)

Stability Scale

Unstable - crashes every once in a while, or freezes on a consistent basis

Fragile – freezes or crashes on a consistent basis, or fails to perform the operation in many cases

(Unlike the accuracy values presented in the benchmark for W3AF, which are up to date, the stability values for W3AF represent the condition of 1.0-RC3 and not 1.0-RC4; the values will be updated in the next benchmark, after the new version is thoroughly tested.)

Performance Scale

Very Fast - fast implementation with a limited number of scanning tasks

Fast - fast implementation with plenty of scanning tasks

Slow - slow implementation with a limited number of scanning tasks

Very Slow - slow implementation with plenty of scanning tasks

Comparison of Connection and Authentication Features

The following report (PDF) contains a comparison of connection, authentication and scan control features of different scanners:

Benchmark Results – Reflected XSS Detection Accuracy

The results only include vulnerable pages linked from the index-xss.jsp index page (the RXSS-GET and RXSS-POST directories, in addition to the RXSS-FalsePositive directory). XSS-vulnerable locations within the SQL injection vulnerable pages were not taken into account, since they don’t necessarily represent a unique scenario (or at least not until the “layered vulnerabilities” scenario is implemented).

Benchmark Results – SQL Injection Detection Accuracy

The overall results of the SQL Injection benchmark are presented in the following report (PDF format):

After performing an initial analysis of the data, I have come to a simple conclusion as to which combination of tools will be the most effective in detecting Reflected XSS vulnerabilities in the public (unauthenticated) section of a tested web site, while producing the least amount of false positives:

Netsparker CE (42 cases), alongside Acunetix Free Edition (38 cases, including case 27, which is missed by Netsparker), alongside Skipfish (which detects case 12, missed by both tools). I’d also recommend executing N-Stalker on small applications, since it is able to detect certain cases that none of the other tested tools can (but its XSS scanning feature is limited to 100 URLs).

Using Sandcat or Proxy Strike alongside Burp Spider/Paros Spider/External Spider can help detect additional potentially vulnerable locations (cases 10, 11, 13-15 and 17-21) that could be manually verified by a human tester.

So combining four tools will give the best possible RXSS detection result in the unauthenticated section of an application, using today’s free & open source tools… WOW, it took some time to get to that conclusion. However, scanning the public section of the application is one thing, and scanning the internal (authenticated) section is another; effectively scanning the authenticated section requires various features such as authentication support, URL scanning restrictions and manual crawling (in case crawling certain URLs might cause damage), so the conclusions for the public section are not necessarily fit for the internal section.
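The reasoning behind combining tools can be sketched as a simple set union over the test cases each tool detects (the tool names and case numbers below are made up for the example):

```python
def combined_coverage(tool_results):
    """Union of the test cases detected by each tool in a combination -
    the rationale for running several scanners together."""
    covered = set()
    for cases in tool_results.values():
        covered |= cases
    return covered

# hypothetical case sets: tool_b contributes case 4, which tool_a misses
combo = combined_coverage({"tool_a": {1, 2, 3}, "tool_b": {3, 4}})
```

A tool with a low overall score is still worth adding if its detected set contains even one case outside the union of the stronger tools.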

During the next few days, I’ll try to analyze the results and come to additional conclusions (internal RXSS scanning, external & internal SQLi scanning, etc.). Simply check my blog in a few days to see which conclusions have already been published.

An updated benchmark document will be released in the WAVSEP project homepage after each addition, conclusion or change.

A comment about accuracy and inconsistent results

During the benchmark, I have executed each tool more than once, and on rare occasions, dozens of times. I have discovered that some of the tools have inconsistent results in certain fields (particularly SQL injection). The following tools produced inconsistent results in the SQLi detection field: Skipfish (my guess is the inconsistencies are related to crawling problems and connection timeouts), Oedipus, and probably a couple of others that I can’t remember.

It is important to note that the 100% Reflected XSS detection ratio that Sandcat and ProxyStrike produce comes with a huge amount of false positives, a fact that suggests the detection algorithm works more like a passive scanner (such as Watcher by Casaba), and less like an active, intelligent scanner that verifies that the reflected injection is sufficient to exploit the exposure in the given scope. This conclusion does not necessarily say anything about other features of these scanners (for example, the SQL injection detection module of ProxyStrike is pretty decent), nor does it mean that the XSS scanning features of these tools are “useless”; on the contrary, these tools can be used to obtain more leads for human verification, and can be very useful in the right context.
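The difference between the two styles can be sketched like this (a deliberately simplified Python illustration, not the actual logic of any of the tools mentioned):

```python
import html

PAYLOAD = "<script>alert(1)</script>"

def passive_style_check(body):
    """Flags the page if the payload text appears at all, even when it came
    back HTML-encoded - roughly the behavior that yields 100% detection
    together with many false positives."""
    return "alert(1)" in body

def context_aware_check(body):
    """Only flags the page if the payload is reflected verbatim, so the
    injected tag would actually execute in a browser."""
    return PAYLOAD in body

# A safe page echoes the input HTML-encoded:
safe_echo = html.escape(PAYLOAD)  # "&lt;script&gt;alert(1)&lt;/script&gt;"
```

The passive check flags the safe page; the context-aware check does not. Real scanners are of course far more nuanced, but this is the essence of the gap.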

Furthermore, the 100% SQL Injection detection ratio of Wapiti needs to be investigated further, since Andiparos produced the same ratio when the titles of the various pages contained the word SQL (which is part of the reason that in the latest version of WAVSEP, this word does not appear anywhere).
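A sketch of why an over-broad keyword signature produces exactly this false positive (the signature list is illustrative, not the actual implementation of either tool):

```python
SQL_ERROR_SIGNATURES = [
    "You have an error in your SQL syntax",  # a real MySQL error prefix
    "SQL",  # deliberately over-broad signature, for the illustration
]

def keyword_based_sqli_check(response_body):
    """Naive error-based check: any signature hit flags the page."""
    return any(sig in response_body for sig in SQL_ERROR_SIGNATURES)

# A benign page whose title merely contains the word "SQL" gets flagged:
benign_page = "<title>SQL Injection Test Case 5</title><p>no database error here</p>"
```

Removing the word from the benchmark pages, as described above, is the simplest way to keep such checks from scoring a free 100%.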

Additional conclusions will follow.

So What Now?

So now that we have plenty of statistics to analyze, and a new framework for testing scanners, it’s time to discuss the next phases.

Although the calendar tells me it took 9 months to conduct this research, in reality it took me a couple of years to collect all the tools, learn how to install and use them, gather everything that was freely available for more than 5 minutes, and test them all together.

However, since my research led me to develop a whole framework for benchmarking (aside from the WAVSEP project which was already published), I believe (or at least hope) that thanks to the platform, future benchmarks will be much easier to conduct, and in fact, I’m planning on updating the content of the web site (http://sectooladdict.blogspot.com/) with additional related content on a regular basis.

In addition to different classes of benchmarks, the following goals will be in the highest priority:

·Perform additional benchmarks with the framework, on a consistent basis. I'm currently aiming for one major benchmark per year, although I might start with twice per year, and a couple of initial releases that might come even sooner.

·Publish the results of tests against sample vulnerable web applications, so that some sort of feedback on other types of exposures will be available (until other vulnerability types are implemented in the framework), as well as on features such as authentication support, crawling, etc.

I hope that this content will help the various vendors improve their tools, help pen-testers choose the right tool for each task, and in addition, help create some method of testing the numerous tools out there.

The different vendors will receive an email message from an email address designated for communicating with them. I urge them to try and contact me through that address, and not using alternative means, so I’ll be able to set my priorities properly. I apologize in advance for any delays in my responses in the next few weeks.

Appendix A – A List of Tools Not Included In the Test

The benchmark focused on web application scanners that are free to use (freeware and/or open source), are able to detect either Reflected XSS or SQL Injection vulnerabilities, and are also able to scan multiple URLs in the same execution.

As a direct implication, the test did NOT include the following types of tools:

·Commercial scanners - the commercial versions of AppScan, WebInspect, Cenzic, NTOSpider, Acunetix, Netsparker, N-Stalker, WebCruiser, Sandcat and many other commercial tools that I failed to mention. Any tool in the benchmark that carries the same commercial name is actually a limited free version of the same product, and does not represent (or even necessarily reflect on) the full product.

·Uncontrollable scanners - scanners that can’t be controlled or restricted to scanning a single site, since they either receive the list of URLs to scan from a Google dork, or continue and scan external sites that are linked to the tested site. This list currently includes the following tools (and might include more):

oDarkjumper 5.8 (scans additional external hosts that are linked to the tested host)

·Deprecated scanners - incomplete tools that have not been maintained for a very long time. This list currently includes the following tools (and might include more):

oWpoison (development stopped in 2003 and the new official version was never released, although the 2002 development version can be obtained by manually composing the sourceforge URL, which does not appear on the web site - http://sourceforge.net/projects/wpoison/files/ )

oetc.

·De facto fuzzers - tools that scan applications in a similar way to a scanner, but where a scanner attempts to conclude whether or not the application is vulnerable (according to some sort of “intelligent” set of rules), the fuzzer simply collects abnormal responses to various inputs and behaviors, leaving the task of drawing conclusions to the human user.

oLilith 0.4c/0.6a (both versions 0.4c and 0.6a were tested, and although the tool seems to be a scanner at first glimpse, it doesn’t perform any intelligent analysis of the results).

oSpike Proxy 1.48 (although the tool has XSS and SQLi scan features, it acts more like a fuzzer than a scanner - it sends partial XSS and SQLi payloads, and does not verify that the context of the returned output is sufficient for execution, or that the error presented by the server is related to a database syntax injection, leaving the verification task to the user).

·Fuzzers - scanning tools that lack the independent ability to conclude, using some sort of verification method, whether a given response represents a vulnerable location (this category includes tools such as JBroFuzz, Firefuzzer, Proxmon, st4lk3r, etc.). Fuzzers that verified at least one type of exposure were included in the benchmark (Powerfuzzer).

·Single-URL vulnerability scanners - scanners that can only scan one URL at a time, or can only scan information from a Google dork (uncontrollable).

oHavij (by itsecteam.com)

oHexjector (by hkhexon)

oSimple XSS Fuzzer [SiXFu] (by www.EvilFingers.com)

oMysqloit (by muhaimindz)

oPHP Fuzzer (by RoMeO from DarkMindZ)

oSQLi-Scanner (by Valentin Hoebel)

oEtc.

·The following scanners:

oSandcatCS 4.0.3.0 - since Sandcat 4.0 Free Edition, a more advanced tool from the same vendor, is already tested in the benchmark.

oGNUCitizen JavaScript XSS Scanner - since WebSecurify, a more advanced tool from the same vendor, is already tested in the benchmark.

oVulnerability Scanner 1.0 (by cmiN, RST) - since the source code contained traces of RFI lists being downloaded remotely from locations that no longer exist. I might attempt to test it anyway in the next benchmark.

oXSSRays 0.5.5 - I might attempt to test it in the next benchmark.

oXSSFuzz 1.1 - I might attempt to test it in the next benchmark.

oXSS Assistant - I might attempt to test it in the next benchmark.

·Vulnerability detection helpers - tools that aid in discovering a vulnerability, but do not detect the vulnerability themselves; for example:

oExploit-Me Suite (XSS-Me, SQL Inject-Me, Access-Me)

oFiddler X5s plugin

·Exploiters - tools that can exploit vulnerabilities, but have no independent ability to automatically detect vulnerabilities on a large scale. Examples:

oMultiInjector

oXSS-Proxy-Scanner

oPangolin

oFGInjector

oAbsinth

oSafe3 SQL Injector (an exploitation tool with scanning features (pentest mode) that are not available in the free version).

oetc.

·Exceptional cases

oSecurityQA Toolbar (iSec) - various lists and rumors include this tool in the collection of free/open-source vulnerability scanners, but I wasn’t able to obtain it from the vendor’s web site, or from any other legitimate source, so I’m not really sure it fits the “free to use” category.

Appendix B – WAVSEP Scanning Logs

The execution logs, installation steps and configuration used while scanning with the various tools are all described in the following report (PDF format):

During the assessment, parts of the source code of the open source scanners and the HTTP communication of some of the scanners were analyzed; some tools behaved in an abnormal manner that should be reported:

·Priamos IP address lookup - the tool Priamos attempts to access “whatismyip.com” (or some similar site) whenever a scan is initiated (verified by channeling the communication through Burp proxy). This behavior might derive from a trojan horse that infected the content on the project web site, so I’m not jumping to any conclusions just yet.

·VulnerabilityScanner remote RFI list retrieval (listed among the scanners that were not tested, Appendix A; developed by a group called RST, http://pastebin.com/f3c267935) - in the source code of the tool VulnerabilityScanner (a python script), I found traces of remote access to external web sites for obtaining RFI lists (which might be used to refer the user to external URLs listed in the list). I could not verify the purpose of this feature, since I didn’t manage to activate the tool (yet); in theory, this could be a legitimate list update feature, but since all the lists the tool uses are hardcoded, I don’t understand the purpose of the feature. Again, I’m not jumping to any conclusions; this feature might be related to the tool’s initial design, which was not fully implemented due to various considerations. I’ll try to drill deeper in the next benchmark (and hopefully manage to test the tool’s accuracy as well).

Although I did not verify that any of these features is malicious in nature, these features and behaviors might be abused to compromise the security of the tester’s workstation (or to incriminate him in malicious actions), and thus require additional investigation to rule out this possibility.
