The results of this research are only valid for estimating the detection accuracy of SQLi & RXSS exposures, and for counting and comparing the various features of the tested tools.

The author did not evaluate every possible feature of each product, only the categories tested within the research, and thus, does not claim to be able to estimate the ROI from each individual product.

Furthermore, several vendors invested resources in improving their tools according to the recommendations of the WAVSEP platform, which has been publicly available since December 2010. Some of them did so without any relation to the benchmark (and before they were aware of it), and some in preparation for it. Since the special structure of the WAVSEP testing platform requires the vendor to cover more vulnerable test scenarios, that effort actually improves the detection ratio of the tool in any application (for the exposures covered by WAVSEP).

It is, however, important to mention that a few vendors were not notified of this benchmark and were not aware of the existence of the WAVSEP platform, and thus could not have enhanced their tools in preparation for it (HP WebInspect, Tenable Nessus, and Janus Security WebCruiser), while other vendors that were tested in the initial research phases released updated versions that were not tested (PortSwigger Burp Suite and Cenzic Hailstorm).

That being said, the benchmark does represent the accuracy level of each tool on the date it was tested (the results of the vast majority of the tools are valid for the date this research was released), but future benchmarks will use a different research model in order to ensure that the competition is fair for all vendors.

I've always been curious about it… from the first moment I executed a commercial scanner, almost seven years ago, to the day I started performing this research. Although manual penetration testing has always been the main focus of the craft, most of us use automated tools to easily detect "low hanging fruit" exposures, increase coverage when testing large-scale applications in limited timeframes, and even double-check locations that were manually tested. The questions always pop up, in every penetration test in which these tools are used…

"Is it any good?", "Is it better than…" and "Can I rely on it to…" are questions that every pen-tester asks himself whenever he hits the scan button.

Well, curiosity is a strange beast… it can drive you to wander and search, consuming all your time in the pursuit of obscure answers.

So recently, driven by that curiosity, I decided to find out for myself, and to invest whatever resources were necessary to solve this mystery once and for all.

Although I can hardly state that all my questions were answered, I can definitely sate your curiosity for the moment, by sharing insights, interesting facts, useful information and even some surprises, all derived from my latest research which is focused on the subject of commercial & open source web application scanners.

This research covers the latest versions of 12 commercial web application scanners and 48 free & open source web application scanners, while comparing the following aspects of these tools:

· Number & Type of Vulnerability Detection Features

· SQL Injection Detection Accuracy

· Reflected Cross Site Scripting Detection Accuracy

· General & Special Scanning Features

Although my previous research included similar information, I regretted one thing after it was published: I did not present the information in a format that was useful to the common reader. In fact, as I found out later, many readers skipped the actual content, and focused on sections of the article that were actually a side effect of the main research.

As a result, the following article will focus on presenting the information in a simple, comprehensible graphical format, while still providing the detailed research information to those interested… and there's a lot of new information to be shared – knowledge that can aid pen-testers in choosing the right tools, managers in budget-related decisions, and visionaries in properly reading the map.

But before you read the statistics and insights presented in this report, and reach a conclusion as to which tool is the "best", it is crucial that you read Appendix A - Section 29, which explains the complexity of assessing the overall quality of web application scanners… As you're about to find out, this question cannot be answered so easily… at least not yet.

…

So without any further delay, let's focus on the information you seek, and discuss the insights and conclusions later.

The benchmark focused on testing commercial & open source tools that are able to detect (and not necessarily exploit) security vulnerabilities on a wide range of URLs, and thus, each tool tested was required to support the following features:

· The ability to scan multiple URLs at once (using either a crawler/spider feature, a URL/log file parsing feature, or a built-in proxy).

· The ability to control and limit the scan to an internal or external host (domain/IP).

The testing procedure of all the tools included the following phases:

· The scanners were all tested against the latest version of WAVSEP (v1.0.3), a benchmarking platform designed to assess the detection accuracy of web application scanners. The purpose of WAVSEP’s test cases is to provide a scale for understanding which detection barriers each scanning tool can bypass, and which vulnerability variations can be detected by each tool. The various scanners were tested against the following test cases (GET and POST attack vectors):

o 66 test cases that were vulnerable to Reflected Cross Site Scripting attacks.

o 10 test cases that were vulnerable to Time Based SQL Injection attacks.

o 7 different categories of false positive RXSS vulnerabilities.

o 10 different categories of false positive SQLi vulnerabilities.
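As a rough illustration of what these test cases exercise (a hypothetical sketch in Python — the actual WAVSEP cases are JSP pages covering many reflection contexts and detection barriers), a vulnerable RXSS test case echoes user input verbatim, while a false-positive category echoes it in a way that neutralizes the payload:

```python
import html

def vulnerable_page(user_input: str) -> str:
    # Vulnerable test case: the parameter is echoed back verbatim,
    # so an injected <script> tag survives into the response.
    return "<html><body>Hello, " + user_input + "</body></html>"

def false_positive_page(user_input: str) -> str:
    # False-positive category: the parameter is echoed, but HTML-encoded,
    # so a scanner that only checks for payload reflection will misfire.
    return "<html><body>Hello, " + html.escape(user_input) + "</body></html>"

payload = "<script>alert(1)</script>"
print(payload in vulnerable_page(payload))       # True: payload reflected intact
print(payload in false_positive_page(payload))   # False: payload was encoded
```

A scanner earns a point on the blue bar by flagging the first page, and a mark on the red bar by mistakenly flagging the second.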

· To ensure result consistency, the directory of each exposure sub-category was individually scanned multiple times using various configurations.

· The features of each scanner were documented and compared, according to documentation, configuration, plugins and information received from the vendor.

· To ensure that the detection features of each scanner were truly effective, most of the scanners were tested against an additional benchmarking application that was prone to the same vulnerable test cases as the WAVSEP platform, but had a different design, slightly different behavior and a different entry point format (currently nicknamed "bullshit").

The results of the main test categories are presented within three graphs (commercial graph, free & open source graph, unified graph), and the detailed information of each test is presented in a dedicated report.

So, now that you've learned about the testing process, it's time for the results…

The first assessment criterion was the number of audit features each tool supports.

Reasoning: an automated tool can't detect an exposure that it can't recognize (at least not directly, and not without manual analysis), and therefore, the number of audit features will affect the number of exposures that the tool will be able to detect (assuming the audit features are implemented properly, that vulnerable entry points will be detected, and that the tool will manage to scan the vulnerable input vectors).

For the purpose of the benchmark, an audit feature was defined as a common generic application-level scanning feature, supporting the detection of exposures which could be used to attack the tested web application, gain access to sensitive assets or attack legitimate clients.

The definition of the assessment criterion rules out product specific exposures and infrastructure related vulnerabilities, while unique and extremely rare features were documented and presented in a different section of this research, and were not taken into account when calculating the results. Exposures that were specific to Flash/Applet/Silverlight and Web Services Assessment were treated in the same manner.

The Number of Audit Features in Web Application Scanners – Commercial Tools

The Number of Audit Features in Web Application Scanners - Free & Open Source Tools

The Number of Audit Features in Web Application Scanners – Unified List

The second assessment criterion was the detection accuracy of SQL Injection, one of the most famous exposures and the most commonly implemented attack vector in web application scanners.

Reasoning: a scanner that is not accurate enough will miss many exposures, and classify non-vulnerable entry points as vulnerable. This test aims to assess how good each tool is at detecting SQL Injection exposures in a supported input vector, located in a known entry point, without any restrictions that could prevent the tool from operating properly.

The evaluation was performed on an application that uses MySQL 5.5.x as its data repository, and thus, will reflect the detection accuracy of the tool when scanning similar data repositories.
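For illustration only (this is not how any specific tested tool works), the most basic active SQLi detection technique boils down to sending a syntax-breaking probe such as `id=1'` and matching known database error signatures in the response; the signature list below is a hypothetical, non-exhaustive sample for MySQL:

```python
import re

# Hypothetical, non-exhaustive sample of MySQL error fingerprints.
MYSQL_ERROR_SIGNATURES = [
    r"You have an error in your SQL syntax",
    r"Warning: mysql_",
    r"Unknown column '[^']+' in 'where clause'",
]

def looks_sql_injectable(response_body: str) -> bool:
    """Return True if the response matches a known MySQL error signature."""
    return any(re.search(sig, response_body) for sig in MYSQL_ERROR_SIGNATURES)

# Simulated responses to a probe such as: id=1'
vulnerable_response = "You have an error in your SQL syntax; check the manual"
clean_response = "<html><body>No results found</body></html>"
print(looks_sql_injectable(vulnerable_response))  # True
print(looks_sql_injectable(clean_response))       # False
```

Signature matching fails silently when the application suppresses errors, which is precisely why the benchmark also measures blind, time-based detection.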

Result Chart Glossary

Note that the BLUE bar represents the vulnerable test case detection accuracy, while the RED bar represents false positive categories detected by the tool (which may amount to more instances than the bar actually presents, when compared to the detection accuracy bar).

The third assessment criterion was the detection accuracy of Reflected Cross Site Scripting, a common exposure which is the 2nd most commonly implemented feature in web application scanners.

Result Chart Glossary

Note that the BLUE bar represents the vulnerable test case detection accuracy, while the RED bar represents false positive categories detected by the tool (which may amount to more instances than the bar actually presents, when compared to the detection accuracy bar).

Additional information was gathered during the benchmark, including information related to the different features of the various scanners. These details are organized in the following reports, and might prove useful when searching for tools for specific tasks or tests:

Since the latest benchmark, many open source & commercial tools added new features and improved their detection accuracy.

The following list presents a summary of changes in the detection accuracy of free & open source tools that were tested in the previous benchmark:

· arachni – a dramatic improvement in the detection accuracy of Reflected XSS exposures, and a dramatic improvement in the detection accuracy of SQL Injection exposures (verified on MySQL).

· sqlmap – a dramatic improvement in the detection accuracy of SQL Injection exposures (verified on MySQL).

· Acunetix Free Edition – a major improvement in the detection accuracy of RXSS exposures.

· Watobo – a major improvement in the detection accuracy of SQL Injection exposures (verified on MySQL).

· N-Stalker 2009 FE vs. 2012 FE – although N-Stalker 2012 FE is very similar to N-Stalker 2009 FE, I was surprised to discover that its detection accuracy is very different – it detects only a quarter of what N-Stalker 2009 used to detect. Assuming this result is not related to a bug in the product or in my testing procedure, it means that the newer free version is significantly less effective than the previous free version, at least at detecting Reflected XSS. A legitimate business decision, true, but surprising nevertheless.

· aidSQL – a major improvement in the detection accuracy of SQL Injection exposures (verified on MySQL).

· XSSer – a major improvement in the detection accuracy of Reflected XSS exposures, even though the results were not consistent.

· Skipfish – a slight improvement in the detection accuracy of RXSS exposures (it is currently unknown whether the RXSS detection improvement is related to changes in code or to the enhanced testing method), and a slight decrease in the detection accuracy of SQLi exposures (which might be related to the different testing environment and the different method used to count the results).

· WebSecurify – a slight improvement in the detection accuracy of RXSS exposures (it is currently unknown whether the RXSS detection improvement is related to changes in code or to the enhanced testing method).

The following section presents my own personal opinions on the results of the benchmark, and since opinions are beliefs, which are affected by emotions and circumstances, you are entitled to your own.

After testing over 48 open source scanners multiple times, and after comparing the results and experiences to the ones I had after testing 12 commercial ones (and those are just the ones that I reported), I have reached the following conclusions:

· As far as accuracy & features go, the distance between open source tools and commercial tools is not as big as it used to be – tools such as sqlmap, arachni, wapiti, w3af and others are slowly closing the gap. That being said, there is still a significant difference in stability & false positives: most open source tools tend to produce more false positives and to be relatively unstable when compared to most commercial tools.

· Some open source tools, even the most accurate ones, are relatively difficult to install & use, and still require fine-tuning in various areas. In my opinion, a non-technical QA engineer will have difficulties using these tools, and as a general rule, I'd recommend using them only if your background is relatively technical (consultant, developer, etc.). For everyone else, especially non-technical enterprise employees who prefer a decent usage experience – stick with commercial products, with their free versions, or with the simpler variations of open source tools.

· If you are using a commercial product, it's best to combine tools that offer a wide variety of features with tools that have high detection accuracy. It's possible to use tools that score relatively well in both of these aspects, or to use a tool with a wide variety of features alongside another tool with enhanced accuracy. Yes, this statement can be read as a recommendation to use combinations of commercial and open source tools, or even two different commercial tools, so that one tool complements the other. Budget? Take a look at the cost diversity of the tools before you make any harsh decisions; I promise you'll be surprised.

While testing the various commercial tools, I have dealt with certain moral issues that I want to share. Many vendors that were aware of this research enhanced their tools in preparation for it, an action I respect, and consider a positive step. Since the testing platform that included most of the tests was available online, preparing for the benchmark was a relatively easy task for any vendor that invested the resources.

So, is the benchmark fair for vendors that couldn’t improve their tools due to various circumstances?

The testing process of a commercial tool is usually much more complicated and restrictive than testing a free or open source tool: it is necessary to contact the vendor to obtain an evaluation license and the latest version of the tool (a process that can take several weeks), and the evaluation licenses are usually restricted to a short evaluation timeframe (usually two weeks), so updating and testing the tools at a future date can become a hassle (since some of the process will have to be performed all over again)… but why am I telling you all this?

Simply because I believe the tests I performed for vendors that provided me with an extended evaluation period and access to new builds were more representative; for example, a few days before the latest benchmark, immediately after testing the latest versions of two major vendors, I decided to rescan the platform using the latest versions of all the commercial tools I had, to ensure that the benchmark would be published with the most updated results.

I verified that JSky, WebCruiser, and ParosPro hadn't released new versions, and tested the latest versions of AppScan, WebInspect, Acunetix, Netsparker, Sandcat and Nessus.

It made sense that builds tested a short while ago (like NTO Spider) were also something I could rely on to represent the current state of the tool (I hope).

I did, however, have a problem with Cenzic and Burp, two of the first tools that I tested in this research: my evaluation licenses were no longer valid, so I couldn't update the tools to their latest versions and scan again. Since I had 2-3 days until the end of my planned schedule, with a million tasks pending, I simply couldn't afford to go through the evaluation request phase again, despite all my good intentions and my willingness to sacrifice my spare time to ensure these tools would be properly represented.

Even though the results of some updated products (WebInspect and Nessus being the best examples) didn't change at all after I updated them to the latest version, who's to say the results would be the same for other vendors?

So, were the terms unfair to Burp and Cenzic?

Finally, several vendors sent me multiple versions and builds – they all wanted to succeed, a legitimate desire for any human being, and even more so for a firm. Apart from the time each test took (a price I was willing to pay at the time), new builds were sent even on the last day of the benchmark, and afterwards.

But if the new version is better, and more accurate, then by limiting the number of tests I perform for a given vendor, aren't I working against what I'm trying to achieve in all my benchmarks – releasing the benchmark with the most updated results for all the tools?

(For example, Syhunt, a vendor that did very well in the last benchmark, sent me its final build (2.4.2.5) a day after the deadline, and included a time based SQL injection detection feature in that build; but since I couldn't afford the time anymore, I couldn't test the build – so am I really reflecting the tool's current state in the most accurate manner? And had I tested that build, shouldn't I have provided the rest of the vendors the same opportunity?)

One of the questions I believe I can answer – the accuracy question.

A benchmark is, in a very real sense, a competition, and since I take the scientific approach, I believe that the results are absolute, at least for the subject that is being tested. Since I'm not claiming that one tool is "better" than the other in every category, only at the tested criterion, I believe that priorities do not matter – as long as the test really reflects the current situation, the result is reliable.

I leave the interpretation of the results to the reader, at least until I cover enough aspects of the tools.

As for the rest of the open issues, I don't have good answers for all of those questions, and although I did my very best in this benchmark, and even exceeded what I thought I was capable of, I will probably have to think of some solutions that will make the next benchmark's terms equal, even for scanners tested at the beginning of the benchmark, and less time consuming than it has been.

The results of the benchmark can be verified by replicating the scan methods described in the scan log of each scanner, and by testing the scanner against WAVSEP v1.0.3.

The latest version of WAVSEP can be downloaded from the web site of project WAVSEP (binary/source code distributions, installation instructions and the test case description are provided in the web site download section):

The results of the benchmark clearly show how accurate each tool is in detecting the tested vulnerabilities (SQL Injection (MySQL) & Reflected Cross Site Scripting), as long as it is able to locate and scan the vulnerable entry points. The results might even help to estimate how accurate each tool is in detecting related vulnerabilities (for example, SQL Injection vulnerabilities based on other databases), and to determine which exposure instances cannot be detected by certain tools.

However, currently, the results DO NOT evaluate the overall quality of the tool, since they don't include detailed information on subjects such as crawling quality, technology support, scoping, profiling, stability in extreme cases, tolerance, detection accuracy of other exposures and so on... at least NOT YET.

I highly recommend reading the detailed results, and the appendix that deals with web application scanner evaluation, before reaching any conclusions.

Additional Notifications

During the benchmark, I reported bugs that had a major effect on detection accuracy to several commercial and open source vendors:

· A performance improvement feature in NTOSpider caused it not to scan many POST XSS test cases, and thus the detection accuracy for RXSS POST test cases was significantly lower than the RXSS GET detection accuracy. The vendor was notified of this issue, and provided me with a special build that overrides this feature (at least until a GUI option to disable this mechanism is available).

· A similar performance improvement feature in Netsparker caused the same issue; however, the feature could be disabled in Netsparker, and thus, with the support of the relevant personnel at Netsparker, I was able to work around the problem.

· A few bugs in arachni prevented the blind SQL injection diff plugins from working properly. I notified the author, Tasos, of the issue, and he quickly fixed it and released a new version.

· Acunetix's RXSS detection result was updated to match the results of the latest free version (one version above the tested commercial version) – since the tested commercial version of Acunetix was older than the tested free version (20110608 vs 20110711), and since the results of the upgraded free version were actually better than those of the older commercial version I had tested, I changed the results of the commercial tool to match those of the new free version (from 22 to 24 in both the GET & POST RXSS detection scores).

Aside from the Count column (which represents the total amount of audit features supported by the tool, not including complementary features such as web server scanning and passive analysis), each column in the report represents an audit feature. The description of each column is presented in the following glossary table:

Title – Description

SQL – Error-Dependent SQL Injection

BSQL – Blind & Intentional Time Delay SQL Injection

RXSS – Reflected Cross Site Scripting

PXSS – Persistent / Stored Cross Site Scripting

DXSS – DOM XSS

Redirect – External Redirect / Phishing via Redirection

Bck – Backup File Detection

Auth – Authentication Bypass

CRLF – CRLF Injection / Response Splitting

LDAP – LDAP Injection

XPath – X-Path Injection

MX – MX / SMTP / IMAP Injection

Session Test – Session Identifier Complexity Analysis

SSI – Server Side Include

RFI-LFI – Directory Traversal / Remote File Include / Local File Include (will be separated into different categories in future benchmarks)

The results that were taken into account only include vulnerable pages linked from the index-xss.jsp index page (the RXSS-GET and/or RXSS-POST directories, in addition to the RXSS-FalsePositive directory). XSS-vulnerable entry points in the SQL injection vulnerable pages were not taken into account, since they don't necessarily represent a unique scenario (or at least, not until the "layered vulnerabilities" scenario is implemented).

While testing the various tools in this benchmark, I dealt with numerous difficulties, witnessed many inconsistent results and noticed that some tools had difficulties optimizing their scanning features on the tested platform. I also dealt with the other end of the spectrum, and used tools that easily overcame most of the difficulties related to detecting the tested vulnerabilities.

I'd like to share my conclusions with the authors and vendors that are interested in improving their tools, and aren't offended by someone giving advice.

As far as detecting SQL injection exposures goes, I noticed that tools that implemented the following features detected more exposures, had fewer false positives, and provided consistent results:

· Time based SQL Injection detection vectors are very effective. They are, however, very tricky to use, since they might be affected by other attacks that are executed simultaneously, or affect the detection of other tests in the same manner. As a result, I recommend that all authors & vendors implement the following behavior in their products: execute time based attacks at the end of the scanning process, after all the other tests are done, while using a reduced number of concurrent connections. Executing other tests in parallel might have a negative effect on the detection accuracy.

· Since the upper/lower timeout values used to determine whether or not a time based exploit was successful may change due to various circumstances, I recommend calculating and re-calculating this value during the scan, and revalidating each time based result independently, after verifying that the timeout values are "normal".

· Implement various payloads of time based attacks – the sleep method is not enough to cover all databases, or even all versions of MySQL.
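The recommendations above can be sketched roughly as follows. This is a simulated, hypothetical example: `send_request` stands in for the scanner's HTTP layer and fakes the timings with `time.sleep`, and the thresholds are illustrative only; real implementations must also handle network jitter and concurrent-scan interference.

```python
import time

DELAY = 0.5  # seconds the injected sleep payload is expected to add

def send_request(url: str, payload: str = "") -> None:
    """Simulated HTTP layer: a vulnerable target honors the sleep payload."""
    vulnerable = "vuln" in url
    time.sleep(0.02 + (DELAY if vulnerable and "SLEEP" in payload else 0.0))

def baseline(url: str, samples: int = 3) -> float:
    """Re-measure the 'normal' response time right before the timing test."""
    timings = []
    for _ in range(samples):
        start = time.monotonic()
        send_request(url)
        timings.append(time.monotonic() - start)
    return max(timings)

def time_based_sqli(url: str, revalidations: int = 2) -> bool:
    """Flag the URL only if the delay reproduces on every independent re-check."""
    for _ in range(revalidations):
        normal = baseline(url)  # re-baseline each round: timeouts drift mid-scan
        start = time.monotonic()
        send_request(url, payload="' AND SLEEP(0.5)-- -")
        elapsed = time.monotonic() - start
        if elapsed < normal + DELAY * 0.8:  # delay did not reproduce
            return False
    return True

print(time_based_sqli("http://example/vuln.jsp?id=1"))   # True
print(time_based_sqli("http://example/clean.jsp?id=1"))  # False
```

The key points from the list above are all present: the timing check runs with no other tests in flight, the baseline is recalculated immediately before each attempt, and a positive is reported only after independent revalidation.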

So now that we have all those statistics, it's time to analyze them properly, and see which conclusions we can draw. Since this process will take time, I have to set some priorities.

In the near future, I will try to achieve the following goals:

· Find a better way to present the vast amount of information on web application scanner features & accuracy. I have been struggling with this issue for almost 2 years, but I think that I finally found a solution that will make the information more useful for the common reader… stay tuned for updates.

· Provide recommendations for the best current method of executing free & open source web application scanners: the most useful combinations, and the tiny tweaks required to achieve the best results.

· Release the new test case categories of WAVSEP that I have been working on. Yep, help needed.

In addition to the short term goals, the following long term goals will still have a high priority:

· Perform additional benchmarks on the framework, on a consistent basis. I previously aimed for one major benchmark per year, but that formula might completely change if I manage to work out a few issues around a new initiative I have in this field.

· Publish the results of tests against sample vulnerable web applications, so that some sort of feedback on other types of exposures will be available (until other types of vulnerabilities are implemented in the framework), as well as on features such as authentication support, crawling, etc.

I hope that this content will help the various vendors improve their tools, help pen-testers choose the right tool for each task, and in addition, help create some method of testing the numerous tools out there.

Since I have been in this situation before, I know what's coming… so I apologize in advance for any delays in my responses in the next few weeks.

During the research described in this article, I have received help from quite a few individuals and resources, and I’d like to take the opportunity to thank them all.

To all the open source tool authors that assisted me in testing the various tools at unreasonably late night hours, to the kind souls that helped me obtain evaluation licenses for commercial products, to the QA, support and development teams of commercial vendors, who saved me tons of time and helped me overcome obstacles, and to the various individuals that helped me contact these vendors – thank you.

I hope that the conclusions, ideas, information and payloads presented in this research (and the benchmarks and tools that will follow) will be for the benefit of all vendors, open source community projects and commercial vendors alike.

Q: 60 web application scanners is an awful lot, how many scanners exist?

A: Assuming you are using the same definition of a scanner that I do, I'm currently aware of 95 web application scanners that can claim to support the detection of generic application-level exposures, in a safe and controllable manner, and across multiple URLs (48 free & open source scanners that were tested, 12 commercial scanners that were tested, 25 open source scanners that I haven't tested yet, and 10 commercial scanners that slipped my grip). And yes, I'm planning on testing them all.

Q: Why RXSS and SQLi again? Will the benchmarks ever include additional exposures?

A: Yes, they will. In fact, I'm already working on test case categories of two different exposures, and will use them both for my next research. Besides, the last benchmark focused on free & open source products, and I couldn't help myself, I had to test them against each other.

Q: I can't wait for the next research, what can I do to speed things up?

A: I'm currently looking for methods to speed up the processes related to these researches, so if you're willing to help, contact me.

Q: What’s with the titles that contain cheesy movie quotes?

A: That's just it - I happen to like cheese. Let's see you coming up with better titles at 4AM.

Although this benchmark contains tons of information, and is very useful as a decision assisting tool, the content within it cannot be used to calculate the accurate ROI (return on investment) of each web application scanner. Furthermore, it can't predict on its own exactly how good the results of each scanner will be in every situation (but it can predict what won't be detected), since there are additional factors that need to be taken into account.

The results in this benchmark can serve as an accurate evaluation formula only if the scanner is used to scan a technology that it supports, pages that it can detect (manual crawling features can be used to overcome many obstacles in this case), and locations without technological barriers that it cannot handle (for example, web application firewalls or anti-CSRF tokens).

In order for us to truly assess the full capability of web application vulnerability scanners, the following features must be tested:

· The entry point coverage of the web application scanner must be as high as possible; meaning, the tool must be able to locate and properly activate (or be manually "taught") all the application entry points (e.g. static & dynamic pages, in-page events, services, filters, etc.). Vulnerabilities in an entry point that wasn't located will not be detected. The WIVET project can provide additional information on coverage and support.

· The attack vector coverage of the web application scanner – does it support input vectors such as GET / POST / Cookie parameters? HTTP headers? Parameter names? Ajax parameters? Serialized objects? Each input vector that is not supported means exposures that won't be detected, regardless of the tool's accuracy level (assuming the unsupported attack/input vector is vulnerable).

· The scanner must be able to handle the technological barriers implemented in the application, ranging from authentication mechanisms to automated access prevention mechanisms such as CAPTCHAs and anti-CSRF tokens.

· The scanner must be able to handle any application specific problems it encounters, including malformed HTML (tolerance), stability issues and other limitations. If the best scanner in the world consistently causes the application to crash within a couple of seconds, then it's not useful for assessing the security of that application (in matters that don't relate to DoS attacks).

· The number of features (active & passive) implemented in the web application vulnerability scanner.

· The accuracy level of each and every plugin supported by the web application vulnerability scanner.
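The attack-vector coverage point can be illustrated with a simplified sketch (a hypothetical request model, not any tested tool's internals): a scanner generates one mutated request per input vector it supports, and any vector missing from that loop is simply invisible to it, no matter how accurate its payloads are.

```python
def mutate_request(request: dict, payload: str):
    """Yield (vector, parameter, mutated request) per supported input vector.

    `request` is a simplified model: {'query': {...}, 'body': {...},
    'cookies': {...}, 'headers': {...}}. Vectors absent from the loop
    below (e.g. parameter names, serialized objects) can never be flagged.
    """
    for vector in ("query", "body", "cookies", "headers"):
        for name in request.get(vector, {}):
            mutated = {k: dict(v) for k, v in request.items()}  # deep-ish copy
            mutated[vector][name] = payload
            yield vector, name, mutated

base = {
    "query": {"id": "1"},
    "body": {"comment": "hi"},
    "cookies": {"session": "abc"},
    "headers": {"User-Agent": "scanner"},
}
for vector, name, _ in mutate_request(base, "<script>alert(1)</script>"):
    print(vector, name)  # one injection attempt per parameter per vector
```

Adding a vector (say, parameter-name injection) is one more loop iteration here; in a real scanner it is a whole feature, which is why vector coverage varies so much between tools.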

That being said, it's crucial to remember that even in the most ideal scenario, in the absence of human intelligence, scanners can't detect all the instances of exposures that are truly logical – meaning, exposures related to specific business logic, which are therefore not perceived as an issue by an entity that can't understand that business logic.

But the sheer complexity of the issue does not mean that we shouldn't start somewhere, and that's exactly what I'm trying to do in my benchmarks – create a scientific, accurate foundation for achieving that goal, with enough investment, over time.

Note that my explanations describe only a portion of the actual tests that should be performed, and I'm sharing them only to emphasize the true complexity of the core issue; I haven't touched stability, bugs, and a lot of other subjects, which may affect the overall result you get.

The following commercial web application vulnerability scanners were not included in the benchmark, since I didn't manage to get an evaluation version before the article's publication deadline, or in the case of one scanner (McAfee), had problems with the evaluation version that I didn't manage to work out before the benchmark's deadline:

The benchmark focused on web application scanners that are able to detect either Reflected XSS or SQL Injection vulnerabilities, can be locally installed, and are also able to scan multiple URLs in the same execution.

·Uncontrollable Scanners - scanners that can’t be controlled or restricted to scan a single site, since they either obtain the list of URLs to scan from Google dorks, or go on to scan external sites that are linked to the tested site. This list currently includes the following tools (and might include more):

oDarkjumper 5.8 (scans additional external hosts that are linked to the given tested host)

·Deprecated Scanners - incomplete tools that were not maintained for a very long time. This list currently includes the following tools (and might include more):

oWpoison (development stopped in 2003, and the new official version was never released, although the 2002 development version can be obtained by manually composing the sourceforge URL, which does not appear on the web site - http://sourceforge.net/projects/wpoison/files/ )

oetc

·De facto Fuzzers – tools that scan applications in a similar way to a scanner, but where the scanner attempts to conclude whether or not the application is vulnerable (according to some sort of “intelligent” set of rules), the fuzzer simply collects abnormal responses to various inputs and behaviors, leaving the task of concluding to the human user.

oLilith 0.4c/0.6a (both versions 0.4c and 0.6a were tested, and although the tool seems to be a scanner at first glimpse, it doesn’t perform any intelligent analysis on the results).

oSpike Proxy 1.48 (although the tool has XSS and SQLi scan features, it acts more like a fuzzer than a scanner – it sends partial XSS and SQLi payloads, and does not verify that the context of the returned output is sufficient for execution, or that the error presented by the server is related to a database syntax injection, leaving the verification task to the user).

·Fuzzers – scanning tools that lack the independent ability to conclude, using some sort of verification method, whether a given response represents a vulnerable location (this category includes tools such as JBroFuzz, Firefuzzer, Proxmon, st4lk3r, etc). Fuzzers that verified at least one type of exposure were included in the benchmark (Powerfuzzer).

·Exploiters - tools that can exploit vulnerabilities but have no independent ability to automatically detect vulnerabilities on a large scale. Examples:

oMultiInjector

oXSS-Proxy-Scanner

oPangolin

oFGInjector

oAbsinth

oSafe3 SQL Injector (an exploitation tool with scanning features (pentest mode) that are not available in the free version).

oetc

·Exceptional Cases

oSecurityQA Toolbar (iSec) – various lists and rumors include this tool in the collection of free/open-source vulnerability scanners, but I wasn’t able to obtain it from the vendor’s web site, or from any other legitimate source, so I’m not really sure it fits the “free to use” category.
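The distinction drawn above between fuzzers and true scanners comes down to one step: a scanner applies a verification rule to the response before claiming a vulnerability, while a fuzzer merely flags anomalies and leaves the conclusion to the human. A minimal sketch of that difference, using a couple of illustrative database error signatures (real scanners use far larger and more nuanced rule sets):

```python
import re

# Illustrative error signatures -- a tiny subset of what real tools match on.
SQL_ERROR_SIGNATURES = [
    r"You have an error in your SQL syntax",   # MySQL
    r"Unclosed quotation mark",                # MSSQL
    r"ORA-\d{5}",                              # Oracle
]

def fuzzer_verdict(baseline, response):
    """A fuzzer only reports that the response *differs*; the human decides."""
    return "anomalous" if response != baseline else "normal"

def scanner_verdict(payload, response):
    """A scanner applies a verification rule before reporting a finding."""
    # SQLi check: the response must contain a recognizable database error.
    if any(re.search(sig, response) for sig in SQL_ERROR_SIGNATURES):
        return "sql-injection"
    # RXSS check: the payload must come back unencoded, in an executable form.
    if "<script>" in payload and payload in response:
        return "reflected-xss"
    return "not-verified"
```

Tools like Spike Proxy fall into the "de facto fuzzer" category above precisely because they stop at the `fuzzer_verdict` stage: they report any abnormal-looking response without the second verification step, which is why nearly every page comes back "vulnerable".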

The following appendix was published in my previous benchmark, but I decided to include it in the current benchmark as well, mainly because I didn't manage to invest the time to get to the bottom of these mysteries, and haven't seen any indication that someone else did.

During the current & previous assessments, parts of the source code of the open source scanners and the HTTP communication of some of the scanners were analyzed; some tools behaved in an abnormal manner that should be reported:

·Priamos IP Address Lookup – The tool Priamos attempts to access “whatismyip.com” (or some similar site) whenever a scan is initiated (verified by channeling the communication through Burp proxy). This behavior might derive from a trojan horse that infected the content on the project web site, so I’m not jumping to any conclusions just yet.

·VulnerabilityScanner Remote RFI List Retrieval (listed in the scanners that were not tested, appendix A, developed by a group called RST, http://pastebin.com/f3c267935) – In the source code of the tool VulnerabilityScanner (a python script), I found traces of remote access to external web sites for obtaining RFI lists (which might be used to refer the user to external URLs listed in the list). I could not verify the purpose of this feature since I didn’t manage to activate the tool (yet); in theory, this could be a legitimate list update feature, but since all the lists the tool uses are hardcoded, I didn’t understand the purpose of the feature. Again, I’m not jumping to any conclusions; this feature might be related to the tool’s initial design, which was not fully implemented due to various considerations.

Although I did not verify that any of these features is malicious in nature, these features and behaviors might be abused to compromise the security of the tester’s workstation (or to incriminate him in malicious actions), and thus require additional investigation to rule out this possibility.

50 comments:

Not taking false positives into the general score is at least frivolous. Just for consideration: I'll give you a program with 1 line of code (print "target is vulnerable") and it will have a 100% success rate and 100% false positives, and you'll put it at the top of the list. LOL :)

Yep. Tricky issue, and apart from the ranking formula, I also have other issues: in fact, in some of the charts, in case two tools have an even score, the one with the highest false positive ratio will be presented first (!), regardless of what I write in the query or define in the form/report. I have been struggling with this issue in MS Access with zero success so far... my punishment for using MS Access to store the information :)

(reply for stiguru) As far as I know, Qualys only provides SaaS scanning services, and doesn't supply its product for local tests, one of the prerequisites for all the scanners in this benchmark (which is required in order to make sure that there's no manual intervention in the test). I didn't manage to get an evaluation license for the Rapid7 and eEye vulnerability scanners before the deadline, and wasn't aware of SAINT's web application scanning capabilities (until now, thanks to you), but I'll do my best to test these products in my next research.

I haven't said anything about the test(s) itself - I can sense some hard work was invested into this. I just wanted to say that with 'print "target is vulnerable"' you could get to the top in the given charts.

There's a good subject discussed by @Tasos and @Miroslav that I'd like to discuss further. It's related to the position of scanners in the benchmark, and the fact that currently, false positives do not "lower" the position of a scanner in the chart (even though the chart still presents the amount of false positives separately).

A potential flaw in this scoring mechanism was raised by @Miroslav: under this method, any scanner that reports 100% of the pages it scans as vulnerable will be ranked 1st, even if its false positive ratio is also 100%.

Well, in my opinion, there's a difference between a scanner with a high ratio of false positives (something I let the users decide whether or not to tolerate), and a scanner that reports everything as vulnerable, and two good examples of this scenario are the tools "Lilith" and "Spike Proxy".

Whenever I tested a tool, I also analyzed its communication (using Burp/Wireshark) and the payloads it sent in order to conclude that something was vulnerable (assuming its license permitted it). In many cases, the tools "Spike Proxy" and "Lilith" sent payloads that could not have confirmed the vulnerabilities that were reported, and in the case of Spike Proxy (a tool from 2003), nearly every page was found vulnerable by every plugin the tool had.

As a result, I decided not to include these tools in the benchmark (due to the absence of a logical detection algorithm), and instead placed them under the category of "de-facto fuzzers" (since they provide the user with leads, without performing verifications themselves), found in Appendix B. I believe that if any of the other tools had behaved in a similar manner, I would have found out, and placed it under the same category (which isn't a punishment, just a classification).

Is my test perfect? No. Is my method foolproof? Nonsense. But I did my best, and I did try to detect these issues, as well as a variety of others.
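The degenerate case Miroslav describes can be made concrete with a small numeric sketch. One way (not the method used in this benchmark, which reports false positives separately) to fold false positives into a single ranking number is a precision/recall style score such as F1; under it, the "everything is vulnerable" one-liner no longer tops the chart. The test-bed sizes below are made up for illustration.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Combine detection rate (recall) and false positives (precision)."""
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Suppose a test bed with 100 vulnerable cases and 50 non-vulnerable ones.
# A degenerate scanner that flags everything: 100 TP, 50 FP, 0 FN.
degenerate = f1_score(100, 50, 0)   # recall 1.0, precision ~0.67 -> F1 = 0.8
# A realistic scanner finding 90 of 100 with 5 false positives: 90 TP, 5 FP, 10 FN.
realistic = f1_score(90, 5, 10)     # recall 0.9, precision ~0.95 -> F1 ~0.92
```

Under a raw detection-rate ranking the degenerate scanner wins (100% vs 90%); under F1 the realistic scanner ranks higher, which is the intuition behind penalizing false positives in the score rather than only listing them alongside it.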

True, the copy of Cenzic was relatively old, and so was the copy of Burp Suite (even though some of the data, particularly the list of features for Cenzic, was updated from other sources as well). I have touched on the reasons for that in a whole section called "10. Morale Issues in Commercial Product Benchmarks", in which I admit that the test was less fair precisely for these two vendors, and that as a result, I will change the format of my research in the future to a dynamic score – one that could present the immediate result of any vendor (to be explained in a different post).

Thanks for this! Of course it is impossible to create a perfect webapp test and please everyone with the results, but I feel you have done better than anyone else has. You can tell that you have put a lot of work into this. I'm curious if you had a chance to test the latest version of the BurpSuite (1.4) scanner and how it would rank? Thanks again!

Hi Dru, I didn't manage to test Burp 1.4; the version tested was 1.3.09, and 1.4 was released less than two months ago (http://www.pentestit.com/2011/06/06/update-burp-suite-v14/), about two months after I finished testing 1.3.09 (again, one of two vendors that had a newer version that wasn't tested – the other being Cenzic).

I know from rumors that in 1.4 they also implemented a relatively rare feature that is only implemented by 4 other vendors (privilege escalation checks – a feature that, as far as I know, is only implemented by AppScan, WebInspect, Cenzic Hailstorm and NTOSpider). Although 4 months is more than enough time to implement significant changes to the tested vulnerability detection mechanisms, I don't want to guess or estimate things I didn't manually test myself.

I intend to contact them officially for my next research in order to test that, but I need to modify the way results are presented to ensure that next time, there won't be any significant time differences for any vendor… which means I have to do some infrastructure work upfront.

Great overview, thanks for that. Small remark: in a next test, can you include tests for spider functionality (e.g. by using WIVET). This might give a better insight whether or not some of the scanners can be used in a point-and-shoot scenario.

Hi Shay, any chance that you could put together and include cost per package next time? (In some kind of base/universal configuration like 1yr license/support/maint./updates, unlimited targets, etc)

Obviously Open Source stuff is going to be $0. But, if for example the cost of one commercial product is significantly more than a combination of other commercial products then that tells a good story and present interesting options as well.

Just as a totally fictitious example: if Burp Pro + NetSparker Commercial + WebInspect together cost less than IBM AppScan Standard, then that also becomes a noteworthy factor for companies, contractors, and analysts.

Hi guys. Using WIVET for future tests is on my list (and I hope I'll have enough time to include it in my next research - although I haven't estimated the effort yet), and although I intentionally avoided publishing the prices of the various products, I will do it eventually, once the research reaches a mature phase (I have all the numbers, but didn't want to associate the research with ROI, at least not yet).

Incredible data and extremely useful. Despite the flaws inherent in trying to compare vastly different products, you have worked through it all logically. Your process has helped me understand how to evaluate the data and to help me choose the right tools for the job at hand.

Perhaps readers should think about not picking the apps at the top of the lists just because they are at the top, but use your work to choose the best combination of benefits for maximum impact.

For instance, I'm curious about using w3af's spider to feed arachni's scans. I know that arachni has a new spider in v0.3, but the ability to import might solve problems, too.

And it's research work like this site that makes questions like that possible.

The GamaSec Scan SaaS was offline during your verification. I invite you to try it and to share your results, as we believe we have one of the best-performing online SaaS scanning tools on the market, especially against application vulnerabilities. www.gamasec.com

The latest versions of some of the tools use Selenium for Ajax crawling - I know ZAP 2.0+ uses it indirectly through the OWASP Ajax Crawling Tool, and some additional OWASP projects do the same (Fuzzops-ng, etc)...

That's pretty much what I know of the subject... didn't hear about any scanners that generate selenium/silk scripts, but then again, that wasn't part of the assessment.
