I've been meaning to put together a post like this for some time. Another discussion thread in this forum finally motivated me to do it.

The Wretched State of Anti-Spyware Testing

Anti-spyware testing is in a reprehensible state these days, with almost no good anti-spyware testing being performed. This is both surprising and depressing. The anti-virus industry has long enjoyed quality testing (not perfect testing) from a variety of sources, and there is an established body of literature on the subject, a collection of recognized entities performing testing, and a loosely defined expert community that works on defining the standards for good AV testing. The result has been testing that not only pushes AV vendors to improve the quality of their products, but test results that users and consumers can place some amount of trust in.

Sadly, this is not the case for the anti-spyware industry, which even 6 years after the emergence of adware on the internet enjoys no regular, trustworthy source of anti-spyware testing and lacks even the most basic forms of public discussion among industry experts to define what would constitute quality anti-spyware testing. The net effect of this collective failure has been chaos and confusion -- with consumers having no reliable guide to the comparative effectiveness of anti-spyware applications on the market (which is rife with low quality "rogue" apps that prey upon the gullible, the confused, and the frightened) and developers having no steady, rational source of guidance for improving the quality of their applications.

To be sure, we do see all kinds of purported anti-spyware testing on the Net -- from bogus "ratings" sites that are affiliates for the very products they pretend to evaluate, to random independent tests published in forums yet conducted in a haphazard fashion, to tech magazine tests that neglect to divulge critical details of the test bed and test methodology used for the tests.

While readers and consumers might be able to glean the occasional nugget of useful information from this hodgepodge landscape of anti-spyware testing on the Net, the fact remains that the landscape is rather barren of quality testing, lacking even those few oases of useful and meaningful testing where users and admins might actually slake their thirst for good data on the apps being sold to them.

In what follows, I can't hope to right this situation -- certainly not in one forum post. But you can think of it as a starting point for defining what good anti-spyware testing could and ought to look like. It can also serve as a guide to identifying problems with the current crop of anti-spyware testing found on the Net.

Defining Quality Anti-Spyware Testing

Quality anti-spyware testing should produce results that are:

* valid
* reproducible (repeatable)
* meaningful

In order to perform anti-spyware testing that is valid, reproducible, and meaningful, one should...

6. Identify the threats in the test bed independently of the applications being tested --

7. Test against threats harvested by you from actual, operative threat environments --

I don't disagree, but as I see it there are some challenges in accomplishing those things.

When going into an operative threat environment, one does not necessarily know what one is going to encounter. Some malware sites change their payloads rapidly, even by the hour in some cases.

I'm not sure how one can be sure of getting complete threats as opposed to partial threats, especially with malware changing rapidly. What would a "complete threat" be for DollarRevenue, for example? And how can the tester know whether or not the threat is complete?

If the tester were going to Zango's site, for example, and downloaded some Zango apps, then they could be reasonably sure they got a complete threat. But going into the wild, I don't know how one could know if they got a complete threat.

In my mind, an actual, operative threat environment would not be Zango's site. To me it sounds more like some sites we know of -- warez sites and CWS porn sites. Maybe you can clarify what you mean and give an example.

Identify the threats in the test bed independently of the applications being tested -- I understand what you mean by that, but can you give an example of how to accomplish that? An InCtrl5 log will show the registry and file changes, as will some other system monitoring apps, but one has to be quite experienced in malware to identify the threats. Or one could use an app not being tested, for example using Kaspersky's online scanner, to identify threats, or using an anti-spyware app that is not being tested.

In the case of using another vendor's program, whether AV or anti-spyware, there's also the difference in naming conventions. Kaspersky uses AV names where an anti-spyware app might use a totally different name. We encounter this on a daily basis, as there is no standard in naming conventions. Then there are new threats that may not yet be identified by any vendor at the time of testing.

At any rate, those might be useful points for further discussion.

_________________
Former Microsoft MVP 2005-2009, Consumer Security
Please do not PM or Email me for personal support. Post in the Forums instead and we will all learn.

Well thought out, Eric.
I agree with Suzi; both the AV and AS industries need to standardize threat names. For example, if one company finds and names a new threat, the others should use the same name. It would make things easier, I think, for those of us cleaning up infected computers.

I could have qualified that by referring to "functionally and empirically complete" threats. That is, we want tests against threats that are present on the test box in a minimally functional state. Thus, don't claim to test against a particular trojan-proxy or rootkit if one of the key .DLLs necessary to make the threat operational is missing for some reason.

To take your example, Dollar Revenue: what we call "Dollar Revenue" is really a collection of threats that we happen to hang the same name on -- trojan downloaders and backdoors for the most part.

As to the inherent randomness of today's threats, yes that's true, but ultimately I don't see an obstacle to the requirement that one identify and then test against complete threats. If one encounters a threat that is functionally incomplete, one should exclude it from the test bed.

6. Identify the threats in the test bed independently of the applications being tested --

This effectively means, don't rely on the apps being tested to tell you what's on the test box or introduced in the test environment.

Start with an independent source, such as an installation log or change log, coupled with manual inspection of the test box aided by various tools. The goal here is to build an inventory of malware components (files, folders, Registry data) that one then identifies (groups and names) through a variety of means.
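That inventory-building step can be sketched in Python (the paths and helper names below are hypothetical, not any particular tool's API): snapshot the file system before and after introducing a sample, then diff the two snapshots to get an independent list of dropped components. A real inventory would also cover Registry data and running processes.

```python
import os

def snapshot(root):
    """Walk a directory tree and record every file path with its size."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                state[path] = os.path.getsize(path)
            except OSError:
                pass  # file vanished between the walk and the stat
    return state

def diff_snapshots(before, after):
    """Return files added, removed, or changed in size between snapshots."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = sorted(p for p in set(before) & set(after)
                     if before[p] != after[p])
    return {"added": added, "removed": removed, "changed": changed}

# Hypothetical usage on a test box:
# before = snapshot(r"C:\\")      # clean baseline image
# ... introduce the threat sample ...
# after = snapshot(r"C:\\")
# inventory = diff_snapshots(before, after)
```

In practice one would run the same kind of diff against Registry hive exports as well, and cross-check the result against an InCtrl5-style install log.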

Naming certainly can prove a hassle, but even in AV testing this is no final obstacle to testing because one can select names that are well recognized enough to allow for translation when inspecting scan logs. Moreover, the primary emphasis should be on threat components, whatever one names the ultimate threat.
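The translation idea can be sketched like this (the alias table is purely illustrative; these are not any vendor's actual detection names): normalize each vendor's label to a canonical name before comparing scan logs.

```python
# Hypothetical alias table: vendor-specific labels -> canonical threat name.
ALIASES = {
    "adware.dollarrevenue": "DollarRevenue",
    "drsmartload": "DollarRevenue",
    "trojan-downloader.win32.agent": "DollarRevenue",
    "adware.zango": "Zango",
    "zango.searchassistant": "Zango",
}

def canonical(vendor_label):
    """Translate a vendor's threat label to a recognized canonical name.

    Unknown labels pass through unchanged so they can be flagged
    for manual identification.
    """
    return ALIASES.get(vendor_label.strip().lower(), vendor_label)

# Two vendors, two different names, one threat:
# canonical("Adware.DollarRevenue") and canonical("DrSmartLoad")
# both map to "DollarRevenue".
```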

All too often I see folks in forums claim to do an anti-spyware test by simply running a few scanners over an infested box and then trying to draw conclusions by comparing the scan logs. The fundamental problem is that they're essentially relying on the tested applications to tell them what threats are present without taking even the most minimal steps to inventory the threats themselves so as to establish a baseline against which to evaluate the performance of the applications.

The result is an analytical hall of mirrors, as the tester attempts to make sense of the mess by comparing scan logs and reported numbers/counts from the tested applications, each of which gives an incomplete and decidedly peculiar picture of the threats on the test box. That's a losing strategy guaranteed to produce incomplete, flawed test results.
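The alternative being argued for here can be sketched as follows (the file and Registry paths are invented for illustration): score each scanner's log against the independently established baseline inventory, not against the other scanners' logs.

```python
def detection_rate(baseline, scan_log):
    """Fraction of independently inventoried components a scanner reported."""
    found = baseline & scan_log
    return len(found) / len(baseline)

# Independently established baseline of malware components on the test box.
baseline = {"c:/windows/drsmartload1.exe", "c:/windows/keyboard1.dat",
            "hklm/run/newname", "c:/windows/system32/proxy.dll"}

# What each scanner actually reported (hypothetical scan logs).
scanner_a = {"c:/windows/drsmartload1.exe", "hklm/run/newname"}
scanner_b = {"c:/windows/drsmartload1.exe", "c:/windows/keyboard1.dat",
             "c:/windows/system32/proxy.dll"}

print(detection_rate(baseline, scanner_a))  # 0.5
print(detection_rate(baseline, scanner_b))  # 0.75
```

The point is that both scores are anchored to the same externally verified inventory, so they can be compared meaningfully; comparing the two scan logs to each other tells you nothing about what either scanner missed.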

Why is it that no two entities doing anti-malware testing come up with the same results?

As someone who has done a great deal of benchmarking, development testing, and definition development, I already have quite a few ideas as to why this occurs. But I'd like to know what you guys think are the reasons behind this anomaly.

wouldn't that have to do with a lack of standards, mikey? or are you asking in regard to a consistency of (some) standard being adhered to, which nevertheless gives inconsistent results? if the latter, wouldn't other variables that weren't accounted for come into play?

Sorry H7 for the delay. I think you are probably on the right track in that the variables are endless but my 'question' was actually meant in a rhetorical manner. There are thousands of reasons why no one has ever been able to develop a methodology that wasn't extremely flawed.

I personally have no interest in brands... I don't even use these types of tools anymore. However, I am concerned about users being pitched a brand under false pretenses. And regardless of motive, every so-called comparison test I've seen to date is just that.

The part that is troublesome is how severely F-Secure reacts to the presence of a previous antivirus product, with symptoms such as Windows failing to boot or going into an endless loop of spontaneous reboots.

Quote:

ZoneAlarm seems to be a common interaction problem with F-Secure. Remember, several of ZA's different package levels include an antivirus module originally supplied by Computer Associates. If you're running one of those packages, you would have to remove it as well before installing F-Secure.

Quote:

It's not just F-Secure either. Reader Darryl Phillips wrote to me recently that he had trouble with his purchased copy of ZoneAlarm conflicting with his Nvidia graphics driver, and the result was that ZoneAlarm's antivirus component stopped performing email scans.

There is also the situation with McAfee VirusScan 2006 v10.0 conflicting with Ad-Aware. Apparently, there are also conflicts between McAfee and ZoneAlarm Security Suite.

Helping people solve malware issues frequently involves requesting installation of trial copies of A/S or A/V software because the particular vendor detects and removes the particular object we're dealing with at the time. This is good publicity for the software vendor and oftentimes results in a sale of their product. However, when the software causes conflicts with existing software, not only does that affect our ability to help, it loses a potential sale and hurts the company's reputation.

Bottom line, it doesn't matter which A/V, A/S, or Firewall is best if it conflicts with existing software on the computer.

Unfortunately, compatibility issues among anti-malware apps are likely to get worse before they get better. As more and more anti-malware apps move into the kernel space to get the edge on rootkits and other nasties, there are going to be increasingly difficult compatibility issues to resolve.

It used to be that one could recommend running one anti-malware resident and using several others as on-demand scanners. That setup would avoid most compatibility problems.

With more and more apps resorting to kernel level filter drivers, the above advice is no longer likely to work, as many anti-malware apps will insist on loading their drivers and services even when they're not configured to provide real-time/resident protection.

Moreover, several programmers have told me that there are resource issues/constraints as well when you start loading more and more of these kernel level drivers.

In so many of these forums I see folks bragging openly about the number of anti-malware and firewall apps they're running. Some of these folks are running an insane number of resident protection apps -- one wonders how the system even manages to stay afloat under the load.

Such users are going to be in for a rude shock when they run into more and more conflicts -- and blue screens, unfortunately, for running at kernel level is a bit like doing a high wire act without a net.

Those who are providing advice to users in forums will have to become familiar with what apps are running kernel level drivers, what apps have potential or real conflicts, and advise users to choose their "suite" of protection apps carefully.

The days of piling one app on top of another willy-nilly are coming to an end.

As busy as you guys are I hate to ask, but I was wondering if it would be too much on your plate for you or other SW staff to put together an informal 'sticky' on this topic warning users about these pitfalls.

In addition to the system conflicts, perhaps the hazards of the 'willy-nilly' use of multiple anti-malware products, and of removing the items found without any real care or study, could also be included.

"BrandX is better because it found x number of items that my other 10 or 15 tools missed." Those in the know would of course recognize this for what it is. However this user and those reading his words continue to foster more of the folly. Rarely do any staffers at any pri/sec site even mention the hazards when they see this behavior. In fact, many of them even seem to be of the same mind.

Well, whether you can manage this or not, I thank you again for a great post, well said.

Originally Posted by SUPERAntiSpy:
Now things make a little more sense - they are not running these samples in their native environment, meaning infecting a machine/system with the infection and then cleaning it. Many of them appear to be copied simply to "c:\virus", which is not where these infections "live" in the real world. Our heuristic guarding and false-positive prevention will often kick these out and not detect them, so as not to produce false positives when the infections are completely outside their observed and researched environments.

We could simply update our rules to get those - but these are not real world tests - I have contacted them several times, but can't seem to get a reply.

As for the comparison with fcukdat's testing, I'm afraid you're mistaken. The sample size for this particular test round is about the same as fcukdat's recent test, if not a little larger. The big difference, though, lies in the methodology, reporting, and other factors. Start going through my list of points in the first post of this thread and you'll start to see why.

I was referring to this: 'Many of them appear to be copied simply to "c:\virus"' when I made the comments about fcukdat's testing, as I believe he lets things set themselves up in a more natural/real way. Maybe fcukdat could confirm this by posting here?

Yes, the infections were gained the way a victim would gain them, as you would put it, and not launched from archived installers as I used to do once upon a time.

Eric has made some excellent points in this topic. My homemade amateur tests actually fulfill more of his testing criteria than most think.

The points where my tests are flawed are that they do not cover all malware infections (my model is now limited to only 3 infection sources, which have the capability to import more malware but in no way cover even 1% of the potential malware threats out there). So they're bite-sized, which is flaw 1.

Data presentation is not good, flaw 2.

My understanding of malware is nowhere near expert, and thus the data I present is not detailed, flaw 3.

The biggest flaw being that what is tested and fails one day might update defs overnight and be successful the next, flaw 4 (this applies to all tests by anyone, so results are not current for long).

Eric is correct in perceiving testing as a learning curve. It sure has taught me some stuff, for example disconnecting the modem to stop the 15 or so files trying to reach the outside world whilst testing. Too many malware infections will bring a PC to a halt and prevent meaningful testing.
Proofread your posts before publishing, despite eagerness to publish.
I made a mistake once upon a time that was picked up by a fellow member with regard to published results being in error on a malware count report>>>
http://spywarewarrior.com/viewtopic.php?p=115146#115146
So for the record, my learning curve meant that I should always proofread results to be published under review so the same mistake wouldn't happen again.

Flaw 5: not enough details. But my aim was to keep things as simple as possible so the data published could not be misconstrued and testing integrity was maintained.

Maybe in the future when I publish home-gathered testing results I will add a disclaimer about how all testing is inherently flawed.

Hello Fcukdat.
I've been thinking about this:
'The biggest flaw being what is tested one day and fails might update defs overnight and be successful the next, flaw 4 (applies to all tests by anyone, so results are not current for long).'
And you're right. Just think: if the test takes, say, five hours, then the last software you test will have the advantage, because its vendor might have just updated the sigs in the last hour, giving it a major edge.
But there is a way round this: say you have five scanners that you want to test. All you need is five computers identical in every respect, infections etc., with a different scanner installed on each; then at a set time you simultaneously update them all.

Quote:

just think if the test takes say five hours then the last software you test will have the advantage because they might have just updated the sigs in the last hour giving it a major advantage.
But there is a way round this: say you have five scanners that you want to test, well all you need is five identical computers in every respect, infections etc. with a different scanner installed on each, then at a set time you simultaneously update them all.

Hi Bigos,
This is not an issue for my tests, since the test model is based on all software to be tested being installed (in on-demand capacity) and updated to current defs, the malware infection(s) imported, and then I take a drive image of the infected PC to roll back to. All softwares are updated within minutes of each other, so no real advantage is gained.

I can then test at my leisure, the flaw being that if it takes me, say, 5 days to test all softwares and publish results, the vendor might have released defs during that time that would mean the software could detect and clean a test sample. This is where results gained have a short life span and are inherently flawed, IMO.

What doesn't nail Look2Me yesterday might well do it tomorrow. Yet my published results will only show how the software fared at the time of installing and updating.

E.g. if you take Eric's tests of ASW from a few years back and use the same software/malware and test model now, all the then-tested softwares worth their salt should get 100% on detect and clean tests, since they have had around 2 years for devs to catch up with it. So although the original model used by Eric was good, the data published then is inherently past its use-by date.

As a reverse engineer I would personally prefer if the anti-malware companies (AMW companies) would create something like a challenge for each category (e.g. anti-spyware, anti-virus, firewall, ...) and officially allow reverse engineers to hold a bug-hunt or a "whitebox-review", if you will. There could be something like a gentlemen's agreement between the (independent!!!) reversers and the respective AMW company. This gentlemen's agreement could optionally set a reward per undisputable bug found per reverser and company (upon the discretion of the company). The fact of being undisputable or not will be judged by a 1st jury (also independents). Imagine cases where one bug could be seen as multiple bugs or vice versa, because of interdependencies of the code parts.

The findings would not be released before a certain period (I suggest 3 months) which allows the AMW company to fix the found bugs within this time. Afterwards the findings are published. I would suggest at least two reverse engineers per product, to allow for some kind of peer review - between those reversers working on the same product there has to be some kind of gentlemen's agreement as well so that the one having found the bug first will be able to get the reward. The peer reversers would have to document their findings in a paper. Before editing the paper there should be something like "reporting bug by module/address" on some website independently set up for this purpose. This is used to decide who found a bug first.

The papers for the findings will follow certain scientific rules and will be plain text, LaTeX or HTML to allow the different revisions to be checked in to a secure CVS/SVN/[your favorite version control system here]. Checking in new bugs should be done in a timely manner, so that a peer is informed of a found bug and will not waste time on the same bug. This might be combined with the above idea of the website for immediate bug reports. Certain criteria should be established for the proper reporting of a bug for certain code categories (e.g. kernel mode vs. user mode).

Used tools would include freely available tools (e.g. OllyDbg, WinDbg, IDA Free) as well as professional tools (IDA Pro, WDASM). In the case of professional tools the reverser has to prove legal ownership.

Before expiry of the period given to the vendors to fix the bugs, the reversers will be "mixed" again to allow every paper to be reviewed by a competent person. This reviewer, however, will not be subject to any kind of NDA - only to the aforementioned gentlemen's agreement.

After the reasonable period given to the vendors, the papers are published, no matter whether the vendor has fixed the found bugs in the meantime - three months or more are definitely reasonable for any software, including Windows XP with several million lines of source code. If necessary a contract should be worked out about this, so no complaints can occur afterwards.

This allows both the vendors and the users to evaluate the performance of the products and the performance of the competition.

However, the reviews will not be about the number of targets found, about the number of targets included in the signature file, about the update-cycle of signatures by the vendor and other things. This is just not important for this kind of review and will be left to other reviewers who want to perform a blackbox-review.

More rules:

Reversers could be vetted before being allowed to take part in the bug-hunt. A reverser is not allowed to reverse the software of a former employer for the purpose of this "challenge".

If modules of the product are packed or encrypted, the vendor should supply the reverser with the unpacked/decrypted version of the product. An NDA or other contracts may be used to make sure that the material does not leak out.

The products have to be described and evaluated against a set of requirements set by a 2nd jury, which consists of multiple developers/researchers from the involved companies. This means that not just bugs have to be reported and described, but the program has to be reviewed as well.

Details that might touch the company's business secrets have to be discussed between the reverser, a representative of the affected company, and an independent member of the 1st jury. The decision has to be published in the paper, with a rough abstract of the omitted detail and the reasons for its omission.

It is up to the vendor to ask the reverser to review also a product which has not been publicly released, if the vendor finds this useful and allows the review with or without an NDA or a similar agreement.

Furthermore it is up to the companies to have a closer work-relationship between reverser and developer during the reverse engineering process already.

Peers and the company whose product is reviewed have the same rights to see the current version of the report papers being prepared. This allows for a timely bug-fixing process.

Recommendations:

Reversers should be paid per bug. Alternatives are possible, such as employing a reverser for the period of the challenge as a contractor/consultant/whatever. Reversing is a time-consuming process and it is only fair to compensate the reversers. Furthermore it is a win/win situation, since the reverser gains reputation and possibly knowledge while the company gets the possibility of an in-depth review of their software for a few bucks compared to a review by some company contracted to do it.

The rules and recommendations are just there to be modified or amended with new rules. They're not set in stone; this should be seen as a proposal at an early stage.

---- snip ----

How about this kind of review? Are the AMW companies afraid of it? It would be beneficial to users, vendors, reversers ... why has no one else ever come up with such an idea?

The competitive bug hunt competition that you've described is intriguing and could yield benefits to developers and users.

What you're describing, though, doesn't even remotely resemble comparative anti-spyware testing, which was the explicit topic of this thread.

And it is not meant to be. How can you compare technologies in an objective way? Most likely you cannot!

However, you can judge whether a certain technology matches the threat it is supposed to mitigate. Many of the technologies used might not match another vendor's technology when compared, but as long as they help the application do its job ...

Let's give an example:

If a driver component of the product uses hooking techniques you can look at the implementation and judge very well whether it is properly or badly implemented, without even judging about the usage of hooks (which regularly causes flame wars on the kernel mode developer lists).

If the product has no driver but claims to have real-time protection, one can argue whether this is safe. An in-depth review could reveal the weaknesses behind the used approach and recommend new ways to achieve the same functionality, but better implemented.

I can imagine that many vendors will not even cooperate in such a challenge, in fear of things being revealed that are questionable. In such cases reversers from jurisdictions allowing reverse engineering for security review (most I am aware of do) could jump in ...
On the other hand, the above method of leaving out certain details could mitigate this problem anyway.

Even years ago, when I was not yet a reverser and developer, I was interested in how things work. In the case of AMW products it would be a valuable addition to the reviews/tests covering detection rate and other "highly visible" features of a product.

Although some AMW companies might disagree for one reason or another, it is not always most important to detect "known" threats and remove them (most likely signature-based); it is highly important not to let the malware enter the system in the first place. Currently it seems to me that many AMW companies are more focused on removing the results of an infection than on trying to prevent the actual infection.

On the compatibility issues mentioned above: indeed it is complicated to have a certain number of competing products installed on the same machine. Due to the file system filters used by most AVs, multiple of them would have to cooperate. In the driver writer scene Symantec is known for its bad compatibility at this software level. However, it is questionable to install multiple AV programs, as each of them will degrade system performance with - most likely (in terms of probability) - no added security.

But in the driver writer community the common understanding has been that, if you are layering your driver below or above another one, you are responsible for any problems caused by it. I.e. your driver is the culprit if the other driver (e.g. Symantec's) does not work anymore. You can conclude directly from this that a vendor having persistent trouble with some badly written other driver will probably insist on uninstalling the other product during its installation procedure, instead of dealing with the incompatibility.

But if it's done just for competitive reasons, this appears highly unfair in my opinion, regardless of the products involved. This is not technical anymore then ...

Since there seem to be so many threads here now about all the so-called 'testing' going on all over the web, I'll repeat some of my concerns in this thread too:

Quote:

Why is it that no two entities doing anti-malware tool testing come up with the same results?

As someone who has done a great deal of benchmarking, anti-malware product development testing, malware source research, and malware definition development, I already have quite a few ideas as to why this occurs. In fact, I have many concerns about the benchmarking/reviewing/testing of these types of tools. Here are a few of them:

How does one who is testing take into account the fact that all of the variables involved as well as the tools themselves are in a perpetual state of flux? Any real testing takes a bit of time. Even before any test is published, the results will already be invalid. While BrandX is still working on a particular definition, BrandY has already done so. Yet the BrandX definition will be published the next day and possibly better developed than that of BrandY.

Any real benchmark of these tools must include the study of its removal routines for each type of nasty currently in the wild. That alone is a daunting study. But without it, there is no value to the testing.

Most F/Ps (false positives) don't occur on a freshly installed system. Removing items falsely can and very often does cripple innocent components. How do you measure the probability of F/Ps?

A true benchmark of the detections must include a validating sampling of targets. Limiting the sampling does not represent a true test at all. I can make any tool look good by simply limiting the sampling for the test. In all of the testing that has been published by online magazines and other so-called professionals, this limiting has caused every single one to have different results. This has also been used as a marketing strategy.
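That sampling point can be made concrete with a toy calculation (all names and numbers are invented): the same tool's measured "detection rate" swings from 50% to 100% depending purely on how the sample set is chosen.

```python
def rate(detected, sample):
    """Detection rate of a tool over a chosen sample of threats."""
    return len(detected & sample) / len(sample)

# Threats this hypothetical tool actually detects.
detected = {"t1", "t2", "t3"}

# A representative sample vs. one limited to what the tool already knows.
full_sample = {"t1", "t2", "t3", "t4", "t5", "t6"}
cherry_picked = {"t1", "t2", "t3"}

print(rate(detected, full_sample))    # 0.5
print(rate(detected, cherry_picked))  # 1.0
```

Nothing about the tool changed between the two numbers; only the sample did, which is exactly why an undisclosed or limited sample set makes a published percentage meaningless.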

In addition to studying the detections and removals, what other features are offered by the tools? What proactive features exist and do they work as pitched? Almost every scanner advertises that it protects the system. Do they really?...to what degree?...how much of it is just bloat?

Should any testing done by those affiliated with a particular tool be considered viable? How does anyone reading a test result know if a test was done by someone who is affiliated or has interest in a particular tool?
==========

You raise valid questions and issues. Your error is in taking these questions to unwarranted, unproductive conclusions. In effect, you turn healthy skepticism into unsupportable dogma by turning the process of inquiry into a search for cynical reasons not to recognize value in anything.

To wit:

mikey wrote:

How does one who is testing take into account the fact that all of the variables involved as well as the tools themselves are in a perpetual state of flux? Any real testing takes a bit of time. Even before any test is published, the results will already be invalid. While BrandX is still working on a particular definition, BrandY has already done so. Yet the BrandX definition will be published the next day and possibly better developed than that of BrandY.

Everything in our world is in a state of flux -- not just malware and anti-malware tools -- yet we choose to test and measure anyway, with the proviso that we strive to recognize the limits and flaws of testing. That we cannot have perfect knowledge of our world is no reason not to strive for better, if flawed, knowledge of it.

The goal of a well-designed test environment and methodology is to control variables as much as possible. Where variables aren't controlled, that is cause for questioning the true meaning and validity of the testing, depending on the seriousness of the anomaly.

But in no way should recognition that the world is in flux be turned into an excuse to throw out all tests (or all tests of on-demand anti-spyware applications). The proper role of skepticism is to inform and refine the process of knowledge, not turn it on its head.

Put another way, the fact that one never steps into the same river twice does not relieve you of the task of making the most intelligent observations and judgments possible regarding the thing swirling around your legs.

mikey wrote:

Any real benchmark of these tools must include the study of it's removal routines for each type of nasty currently in the wild. That alone is a daunting study. But without it, there is no value to the testing.

Why? No reasoning or justification whatsoever is offered to support this contention. And, no, the fact that such a study of removal routines might be interesting and revealing in no way stands in for a justification as to why such a study is necessary to establish the validity of a set of empirical black-box tests.

mikey wrote:

Most F/Ps (false positives) don't occur on a freshly installed system. Removing items falsely can and very often does cripple innocent components. How do you measure the probability of F/Ps?

This is a minor issue -- or should be for an outfit willing to take the time to do testing correctly. Just as one can set up valid testing methods and test beds to test for the detection and removal of "known bad" applications, one can do the inverse: set up test beds of "known good" apps and test for the non-removal of those apps.
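To make the "known good" test-bed idea concrete, here is a minimal sketch of how an FP rate might be tallied. Everything here is illustrative -- the file paths, the `false_positive_rate` helper, and the flagged list are all invented for the example, not any real scanner's API.

```python
# Hypothetical "known good" test bed: legitimate files that a scanner
# should NOT flag. Paths are made up for illustration.
KNOWN_GOOD = {
    "C:/Program Files/GoodApp/goodapp.exe",
    "C:/Program Files/OtherApp/other.dll",
    "C:/Windows/System32/legit.sys",
}

def false_positive_rate(flagged_paths, known_good=KNOWN_GOOD):
    """Fraction of the known-good test bed wrongly flagged by a scanner."""
    false_positives = known_good & set(flagged_paths)
    return len(false_positives) / len(known_good)

# Suppose a scan flagged one legitimate file plus one genuine threat
# (the threat is outside the known-good bed, so it doesn't count here).
rate = false_positive_rate([
    "C:/Program Files/GoodApp/goodapp.exe",  # wrongly flagged
    "C:/Temp/actual-malware.exe",            # true positive
])
print(rate)  # one of three known-good files flagged: 1/3
```

The same inversion works for the harder case mikey raises (FPs on aged, cluttered systems): the known-good bed simply has to be built from images of realistically configured machines rather than fresh installs.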

mikey wrote:

A true benchmark of the detections must include a validating sampling of targets. Limiting the sampling does not represent a true test at all. I can make any tool look good by simply limiting the sampling for the test. In all of the testing that has been published by online magazines and other so called professionals, this limiting has caused every single one to have different results. This has also been used as a marketing strategy.

I'm sorry, but this passage is severely muddled. Any sampling is by definition limited, so to demand non-limited sampling (as the author effectively does here) is essentially to demand the impossible. The proper question or issue is not whether the sampling is limited -- it ain't sampling unless it's limited -- but how well the sampling produces test beds that yield results that can be reliably extrapolated to supersets of the sample. Or, put simply: how representative is the sampling?
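The kernel of truth in mikey's complaint -- that a cherry-picked sample can make any tool look good -- is easy to demonstrate. The toy simulation below uses entirely invented numbers (a fictional population of 1,000 threats and a made-up 60% detection rate) just to show the difference between a representative sample and a biased one:

```python
import random

random.seed(0)

# Fictional population of 1,000 threats; assume the tool under test
# actually detects 60% of them (True = detected).
population = [True] * 600 + [False] * 400
random.shuffle(population)

def detection_rate(sample):
    """Fraction of sampled threats the tool detected."""
    return sum(sample) / len(sample)

# A random sample of 200 threats approximates the true 60% rate.
honest = random.sample(population, 200)

# A cherry-picked sample: test only threats the tool is known to detect.
biased = [t for t in population if t][:200]

print(round(detection_rate(honest), 2))  # close to 0.60
print(detection_rate(biased))            # 1.0 -- a "perfect" score
```

Both tests used a limited sample of 200; only one of them tells you anything about the population. That is the representativeness question in a nutshell.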

mikey wrote:

In addition to studying the detections and removals, what other features are offered by the tools? What proactive features exist and do they work as pitched? Almost every scanner advertises that it protects the system. Do they really?...to what degree?...how much of it is just bloat?

Again, you're asserting that activities or inquiries outside of a test itself are somehow necessary to establish the validity of the test without offering a single reason for thinking so. I would certainly agree that the kinds of questions you ask have value, but they're not strictly necessary to establish the validity of a test. That would make the perfect the enemy of the good.

Once again, you over-generalize to the point that skepticism turns knowledge on its head. The conclusion here is pure hyperbole -- i.e., "all tests are advertising tools." One might as well say that all meals are sales opportunities.

Since we're on the subject of anti-spyware testing, I was wondering if you'd care to share your thoughts on the state of HIPS testing. How would or could such testing be done in a way that would produce repeatable, valid, and meaningful test results that would actually allow users and consumers to compare HIPS apps? Any thoughts on creating a useful test bed, test environment, and methodology?

At present there seems to be no useful comparative/competitive HIPS testing of any kind being done -- at least none that I know of. Indeed, I don't see that anyone has even proposed a basic set of standards for making the most basic evaluations or comparisons among these apps.

Given as much, how is it that we know that these applications, which have become such a topic for discussion among forum members here and in other forums, are not pure, unadulterated snake-oil?

At least with on-demand scan apps I have some amount of data, however questionable, to work with. With these new HIPS apps, users have little more than the bald and bold assertions of vendors themselves and the various fans of these applications in newsgroups. Is that not a little worrisome?

Contrary to my assumptions above, if some useful HIPS testing has been conducted, would you please point us to that testing and explain what makes the testing commendable as well as problematic?

Edit/Update:

A discussion thread asking a similar question about HIPS testing is underway at Wilders:

Perhaps a bit off-topic, but I was wondering if anyone had an opinion about Consumers' Union testing of software in the security/anti-malware category? They tout themselves as the standard by which all consumer product testing is to be measured, at least in the US, so I was wondering if anyone had ever bothered to examine their efforts, and if so, how they measure up?

Read the "Open Letter" from 2000. These are serious objections to the CU/CR's AV test. It's a fairly well-settled issue in the AV industry that there is no justification whatsoever for using lab-created viruses for testing. It makes no methodological sense, as there are better alternatives for testing the same aspect of AV engines and defs. It is also an ethical no-no.

Whether one thinks McAfee is whining or not, they're on solid ground in taking CU/CR to task for conducting this kind of testing, and they'll have the backing of respected figures in the AV industry.

Edit/Update:

As I thought, others from the AV industry have now joined McAfee in criticizing CU/CR:

Just a quick prelim skim of the links looks disappointing; the thing I was looking toward in reference to CU (et al) was the possibility of eliminating charges of bias, such that commercial organisations are vulnerable to.

My bigger question, then, is this: why are unbiased testing labs such as CU, Underwriters Labs, CSA, et al. not involved in this area of consumer products -- or, if they are, why is their testing flawed in comparison to their testing of, say, toasters or razors?

At the very least, I would expect them to be at the forefront of establishing the standards by which software products are to be tested. Instead, we find more of the same kind of abandonment the computer industry has practiced toward its patrons. I won't go so far as to say "complicit," at least not intentionally, but "abandonment" certainly holds without reservation.

If this cannot be done for simple scanners, what hope is there for more advanced products (as you point out) like HIPS? So not only is there no warranty on software products -- they are "licensed" rather than sold outright -- but the consumer has NO BASIS on which to make an informed comparison for the purpose of choosing which one to buy. It sounds more like religion: believe me.

(I don't mean to sound so naive; it's not a new revelation, I'm just spelling it out. Again, the analogy to human nutrition is the closest I can come.)

Hey Eric, I guess we can just let the users who see daily the corruption in this industry, including all the bogus testing and other advertising that has been published, decide for themselves which one of us is making more sense. Just as with my reasoning about process and content filtering...the users need only try for themselves to find out. Just for ref; "While all process firewalls are HIPS, all HIPS are not process firewalls."

BTW I have no prob with testing of HIPS programs. I've been testing programs and reviewing them ever since I started computing. HIPS don't depend on routine updates in order to perform. But I am sure that eventually so-called testers with special interests will try to fudge them too once the big $$$ is involved...just like we see all the time within the anti-malware industry. Even folk that I know personally have been known to publish bogus testing that claimed that their own special interest group was the favorite. Many articles have even been published warning users about bogus testing. Here's an example; http://www.pcpitstop.com/spycheck/antispywareblues.asp Surely you wouldn't try to convince folks otherwise.

Hey Eric, I guess we can just let the users who see daily the corruption in this industry, including all the bogus testing and other advertising that has been published, decide for themselves which one of us is making more sense.

This is rather disappointing, I must say. I didn't expect you to bow out of the discussion so quickly, esp. given that some serious questions about the (non)testing of HIPS apps were on the table.

mikey wrote:

Just as with my reasoning about process and content filtering...the users need only try for themselves to find out. Just for ref; "While all process firewalls are HIPS, all HIPS are not process firewalls."

So, you would be satisfied with users who concluded, "These HIPS apps are too much of a bother; they slow down my system, inundate me with prompts about stuff I don't know about; my system's less stable and more sluggish; and one of my favorite apps doesn't run now"? These are real possibilities with today's crop of HIPS apps, and we haven't even watched these HIPS vendors attempt to cross the great Vista divide yet.

I'm certainly not the only one who's contemplated these reactions -- see the discussion over at Wilders:

But throwing users back to their own devices and judgment really isn't an adequate answer, is it? The number of HIPS apps out there is growing, and the feature sets can be difficult to understand even for technically savvy users. How is a lay user supposed to sift through these apps and make an intelligent, informed decision? How is the user supposed to distinguish quality HIPS protection from snake-oil?

mikey wrote:

BTW I have no prob with testing of HIPS programs.

I never suggested you did. I was more interested in what you consider to be the essential ingredients of a good testing program for HIPS apps, how we would distinguish meaningful and valid tests from worthless ones, and what users are supposed to do in the absence of any significant amount of recognized quality testing. Try every HIPS app on the market themselves? And evaluate those apps on the basis of what?

Yeah, there sure have been quite a few articles written -- both about bad testing and what constitutes good testing. I even pointed readers to a number of these myself in earlier posts, funny enough. Indeed this very discussion thread arose out of the need to establish guidelines for quality testing -- guidelines that could also be used to identify bad or bogus testing.

Folks, let me lay my cards on the table here. Like many others here and at other anti-malware/security forums, I am most interested in the emerging crop of HIPS apps, process filters, process firewalls, or whatever name you want to hang on them. As interested as I am, I am also a bit disturbed at the lack of substantive discussion surrounding these apps -- and by the utter lack of any recognized, quality testing of these apps.

All too many HIPS fans spend their time repeating, knowingly or not, the sales pitches of the vendors themselves (and I've gotten to listen to more than a few sales pitches from HIPS vendors -- they all sound amazingly alike), rehashing over and over the theoretical advantages of HIPS apps over traditional anti-malware scanners and other forms of anti-malware protection.

What I don't see, unfortunately, is much of the following:

* honest discussions of the shortcomings and problems (real and potential) with HIPS apps;

* detailed discussions of the feature sets and technologies behind these applications;

* meaningful considerations of the user experience with these applications;

* tips and advice to help non-technically savvy folks distinguish the high-quality HIPS offerings from the not-so-good offerings and even the outright snake-oil masquerading as good HIPS;

* recognized, quality testing -- or even amateur comparative testing, for that matter.

There's quite a bit of room for knowledgeable folks willing to do some hard work to make a great impact on our understanding of HIPS apps -- the field is wide-open, so far as I can tell. But we're going to have to see more than just vague discussions of the theoretical advantages of HIPS apps over anti-malware scanners and ritual denunciations of the vendors of "bot scanners."

If anyone's willing to dive in, you could do worse than by starting with the basics -- how about a simple, standardized guide to comparing and evaluating the respective capabilities or feature sets of HIPS apps? Even something as simple as a usable bulleted list or check list of basics would be helpful at this point.

Or, put another way: most of us have a fairly good sense of what to look for in an anti-malware scanner (even if we often come to different conclusions about particular apps). But what are we to look for in a HIPS app? How do we tell if one HIPS app might be better or more suitable for us than another HIPS app?