Matousec Proactive Security Challenge Analyzed

 

The following sections discuss misleading elements of a popular firewall tester in an attempt to help readers understand the meaning and limitations of its test results. Please don't take it as an attack against the independent service the website provides. I enjoy the website and praise its level of professionalism.

 

1. Overview

The scoring of the Matousec tests (as presented on its comparison table) and some of the site's claims are misleading. To simplify my article, let's consider a simple bowling analogy.

You go bowling, bowl 1 game, and get a 270. But you sit out the next 2 games. Your annoying little brother, a whiz at mathematics and awful at sports, bowls all three games at a laughable 100-something every time. He never scores remotely close to your 270.

But your little brother snickers that he beat you all three times, including the first game. While trying to prevent yourself from using him as a bowling ball, you politely ask him to explain his fuzzy math.

Your little brother explains that he took your 270 from game 1 and divided it across all three games, counting zeros for the games you didn't bowl. By that scoring method you averaged a 90, and you got beat by your snot-nosed little brother.

The way Matousec scores its Proactive Security Challenge would not only agree with your whiz kid little brother but would also suggest that he should bowl another 10 games for the next 9 days, and give you zeroes for all those days too.
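To make the arithmetic behind the analogy explicit, here is a minimal sketch in Python (the numbers are hypothetical and chosen only to match the story above, not taken from any real test data):

def average_with_zeros(scores, games_scheduled):
    # Average the actual scores over every scheduled game,
    # counting 0 for each game that was never played.
    padded = scores + [0] * (games_scheduled - len(scores))
    return sum(padded) / games_scheduled

print(average_with_zeros([270], 3))            # 90.0 -- your one great game
print(average_with_zeros([105, 102, 108], 3))  # 105.0 -- your brother's three games

The brother "wins" only because the games you never bowled are counted as zeros, which is exactly the property of the Matousec scoring rule discussed in section 3.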

When you look at the comparison table at Matousec, pretend it was prepared by your wonderful little brother and ignore any score that wasn't based on all ten levels of testing. Those scores are more misleading than anything your little brother would prepare in a single bowling session.

I have other minor notes and complaints, but the story above illustrates the main problems inherent in the Matousec scoring method.

 

2. Background and Purpose

In this article I analyze the popular and influential Matousec firewall testing service. The project, started in 2006 mostly by university students, originally focused on testing traditional firewalls or "packet filters," but after two years it broadened its testing and became what is now the Proactive Security Challenge.

The challenge compares software products that implement an "application-based security model," including products normally called an "Internet security suite, a personal firewall, a HIPS, [or] a behavior blocker" (quotes from the FAQ).

So, as an example, the proactive software that it tests must prevent "data and identity theft" and other attacks (Interpretation of results). Tested products must be able to block malware from running on a PC, getting to a user's private data, sending private data to outsiders, or attacking trusted parts of a user's system (Interpretation of results).

Tested products face a potential 10-level set of tests. The testing package (with most of the challenge's tests) is available free as a download from the Matousec website, but it's limited to personal use. Matousec describes its testing procedures and guidelines in its FAQ section.

 

3. The Scoring of Test Results is Misleading

The results of the Proactive Security Challenge are listed in a table by product and final test score (see Results). If a product fails a level of testing, then it is not subjected to further testing, according to the rules posted on the site. Hence, the table ranks products by scores based on the number of possible tests rather than the number of tests actually administered.

The table compares products that received the full 184 tests to products that received only 12 tests. Those are two significantly different kinds of scores. It's not objective to compare them, and it's outright deceptive to say the products given 12 tests were tested with 184 tests.

When you arrive at the results table, always look at the "level reached" column first, alongside the "product score" column. You can trust the product score if and only if the level reached reads 10.

The scores for products below level 10 do not mean the same thing as the top scores, at least for users who imagine actual tests being performed on actual software (rather than possible tests not being performed on anything). If you are interested in any of those lower products on the list, ignore their scores as reported on the table and download their PDF file to interpret the results yourself (you will be surprised at the lack of actual testing, and you won't know how to compare the scant results with other products).

Since the table doesn't distinguish between products that actually received all tests and products that didn't, the scores it reports are misleading (except where a product received all possible tests, or where a product was tested before a new test was added and gets an N/A for that test).

According to scoring rules posted on the site:

"All tests are equal to the intent that their scores are not weighted by their level or something else. The total score of the tested product is counted as follows. For all tests in all levels that the product did not reach, the product's score is 0%. For all other tests the score is determined by the testing. The total score of the product is a sum of the scores of all tests divided by the number of all tests and rounded to a whole number. It may happen that a new test is added to Proactive Security Challenge when some products already has their results. In such case, the result for already tested product is set to N/A for this new test, which means that it is not counted for this product and does not affect its score or level passing. Neither the number of the tests, nor the number of levels is final. We intend to create new tests in the future. We are also open to your ideas of new testing techniques or even complete tests." (Methodology and Rules).

It is implicit in this quote that products are not always fully tested. If they don't advance through all the levels, they get an automatic 0% on the levels for which they were not tested, so the overall score will be very low if a product only made it to the first level.
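A minimal sketch of what that rule does to the numbers (the figures below are hypothetical and only illustrate the arithmetic, not any real product's results):

def matousec_style_score(test_results, total_tests):
    # Per the quoted rule: every test the product never reached counts as 0%,
    # and the total is the sum of all per-test scores divided by the number
    # of all tests, rounded to a whole number.
    return round(sum(test_results) / total_tests)

# Hypothetical product dropped after the 12 level-1 tests (11 passed, 1 failed):
print(matousec_style_score([100] * 11 + [0], 184))        # 6
# Hypothetical product run through all 184 tests, failing a handful:
print(matousec_style_score([100] * 169 + [0] * 15, 184))  # 92

The first product might perform just as well per test as the second; the table simply never finds out, yet it prints 6% next to 92% as if the two numbers meant the same thing.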

This also matters to the reliability of the final score: the fewer the tests, the less reliable the scoring. It would be like polling 11 people about their favorite firewall product rather than 84 people. The more tests they use, the more reliable the data and the easier it is to interpret the results.

It also matters to the clarity of the final results on the website. Placing products given only 12 tests at level 1 on a table labeled "Products tested against the suite with 184 tests" is just plain wrong (off by 172 tests) for many products on the chart.

I recommend categorizing products based on the number of tests they actually received.

I also recommend that users disregard any score that did not reach level 10. The scores below level 10 are weighed down, as if they belonged to students who had fewer chances to turn in assignments. Would you trust your teacher's grading method if he gave you zeroes for work you never had a chance to complete? Getting zeroes for late or missing work is bad enough.

 

4. Invalid Claims

Matousec explains to readers that the testing results seek to hold firewall products to their security claims:

"So, what does it mean if the product fails even the most basic tests of our challenge? It means that it is unable to do what its vendor claims it can. Such a product can hardly protect you against the mentioned threats" (Interpretation of Results).

What if a vendor never claimed to provide some of the kinds of security tested in the challenge? If a product was not designed to protect against certain threats and does not claim to protect against them, then it is incorrect to state that the goal of the testing is to hold products accountable for the level of protection they claim to provide.

To be consistent with the quote above, Matousec ought to include or exclude products (or relevant tests) based on whether vendors claim to protect against tested threats.

Several vendor comments assert that their firewall products were meant to be part of a security package and were not intended as stand-alone products. See the Bitdefender and AVG comments, for example, in the vendor responses.

Another vendor states that they do not provide anti-keylogger protection and leave such protection for other security products. It would be counter to the goal of testing (captured in the quote above) to give such products keylogging tests and lower their score for not providing security they do not claim to provide.

I personally like to see all products get the same tests. It makes for good comparison. But it's not valid to suggest that they "hold products to their claims" when some products explicitly deny that they offer some of the protection in the tests (and state such information in charts on their websites).

To analyze a product's security claims, the test scores would have to avoid contradicting explicit facts a product vendor makes readily available to users.

But Matousec incorrectly describes the goal of the test and fails to describe the type of conclusions the test actually supports. The test scores compare products based on a series of default tests (sometimes modified without warning to prevent cheating); they do not hold products to their protection claims.

 

5. Does Experience Count Too Much?

Testers in the challenge are experienced users, whereas real-world firewall users often are not. Popup alert information used by proactive firewalls (at their maximum settings) is often ambiguous. The alerts depend on a user's knowledge of their computer software, so the level of protection for average users may not reach as high as it does for the testers.

Firewall security is a two-part relation between the user and the product. If the user answers "block" too late in the chain of alerts, they get firewall crashes and maliciously launched browsers instead of high-quality firewall security. And average users will no doubt let through many applications that experienced testers won't. Therefore, a software product can only reliably provide the level of security found in the tests for experienced users. If a product is too confusing, it may rarely reach the Matousec level of security.

The proactive security tests would be more informative about the real-world effectiveness of a firewall if they tested random users. With enough volunteers for such hypothetical tests, the results could be interpreted through a statistical analysis. This methodology would generalize better to the public and to the real-world effectiveness of firewall products, but, of course, as in all objective tests there is a trade-off: as you increase the generalizability of a test to the public, you decrease its internal validity (and vice versa).

The use of experienced users helps to filter out false positives (and the testers sometimes even modify the tests if they think a firewall has an obvious weak point), but such filtering and interpretation of results does not have to be part of testing. One may interpret the data after testing, and one could modify the tests after an average user completes them to ready the firewall for retesting. However, such proposed tests would have to be more user-friendly and would probably decrease the validity of the tests themselves. It is difficult to achieve internal validity while also generalizing the results to a broader population.

The Matousec results suggest a maximum level of security for a product. However, even this claim is difficult to make, because the challenge does not fully test every product. So for products that did not get far up the levels of testing, the Matousec scores do not suggest a maximum level of security; those products may provide a higher level of security than their scores suggest.

If a low-scoring product is user-friendly, then it may also provide more security for inexperienced users than a complicated firewall. Likewise, if a user is more knowledgeable about a simpler product, then that user may be more secure than with a feature-rich product. But perhaps the more aggressive product would still outperform a user-friendly product. We wouldn't know without conducting tests on random users.

If the results of the Matousec tests do not generalize to the public, then users should consider other factors, such as user-friendliness. It's possible that for average users, user-friendliness increases the level of effectiveness of a product.

 

6. Final Thoughts

Of course, any experienced user can run most of the same tests used in Matousec testing, since they are available on the website as a free download (http://www.matousec.com/downloads/). Therefore, money can't plausibly influence the validity of the actual tests, since the tests are available to everyone.

Each product's test results are linked in a PDF file, and anyone can see the types of tests a product fails or passes. Since the raw data is posted to the site, you can ignore the overall score and just look at the tests passed or failed. However, the PDF has little value when it doesn't cover enough testing levels to allow readers to make sound interpretations of the results. They might as well not even list the level 1 products.

I'm suspicious of many scoring practices in the Proactive Security Challenge. For example, I find it problematic that they give products 0% for levels not tested, and confusing that they compare products based on the total number of possible tests when many of those tests were never actually administered. And the claim that their results validate (or invalidate) the security claims of vendors is false.

However, no other similar testing service (for proactive security or outbound protection testing) exists as far as I know, so Matousec has little competition. And, as stated at the beginning, I appreciate the thoroughness and technical details of their service. It should be noted that their website is informative and detailed about their testing methods.

 

 

 


Comments

by Corsair on 31. July 2011 - 3:46  (76578)

I think it is actually important to make a decision on what free (since this is a freeware site) security software to use after consulting a number of resources. From what I can tell - Matousec really only tests out the firewall capabilities of a security suite - not other areas.

Therefore it is important to look at a number of resources:

1. AV Comparatives: http://www.av-comparatives.org/ - have a good set of reports based on a variety of testing. Most people would be interested in the following tests: http://www.av-comparatives.org/en/comparativesreviews/detection-test

2. AV-TEST: http://www.av-test.org/ - yet another set of good testing on security suites (where av-comparatives seems to only look at the anti-virus side of things)

3. Matousec - yes. This site is worth looking at as it analyses the firewall side of things. This should be used in conjunction with the above two to determine what security suite would be best for you.

4. Other people - forums are a great place because it allows you to ask others what their setups are. There is nothing better than first hand experience. It can help you make a decision.

5. You - it is worthwhile just spending time testing things out for yourself. Just make sure you make an image of your machine with something like Macrium Reflect Free or Paragon before you do so you can always restore your machine afterwards.

I'll actually be embarking on a test run myself of freeware as my current paid subscription for security software is coming to an end.

Personally - as I will be installing this freeware setup on my fiancee's, parents', and sisters' machines, I am really after a setup that is quite easy to use, so will see.

by J_L on 1. September 2012 - 4:27  (98606)

About #3, that isn't technically true. Matousec mainly (or probably only) evaluates the HIPS portion of Firewalls.

If you want to test your Firewall (and router), try Shields Up.

by MidnightCowboy on 31. July 2011 - 6:44  (76586)

I find this is a good resource too:

http://www.virusbtn.com/vb100/rap-index.xml

by Anon20110607 (not verified) on 7. June 2011 - 13:10  (73435)

Criticizing unfinished testing is justified. However, the 10 bowling games analogy would be better if each game is a *different* activity. (bowling, unicycle demolition derby, teletubby moonwalking, runes speedreading, klingon spelling bee, f@rtbomb defusing, rocket surgery, watermelon feather-plucking, underground basketweaving, concrete-shoe fitting, etc ;-) )

by Anon20110607 (not verified) on 7. June 2011 - 13:03  (73434)

Diversity of types of protection and varying (or lack of) combinations of types in each product make any "comparative" test of security products more difficult.
However, the user wants protection against all types of (realistic...) threats.
IOW, 'clear cut answers' don't exist.

Perhaps a test results chart could group test columns within (typically recognized) categories. (we typical readers know least about that)
Then readers would apply filters and sorts to the table (like a spreadsheet).
After that, readers use their fuzzy intuitive judgment to reduce choices.
graphics

by Rizar on 26. April 2011 - 1:59  (70915)

I was wondering why Online Armor isn't listed; they have some sort of unspecified spat with Matousec:

http://www.matousec.com/info/?news=146-Online_Armor_temporarily_disquali...

Some interesting comments here:
http://www.wilderssecurity.com/showthread.php?t=281529

by syntax_error on 25. April 2011 - 22:57  (70909)

Well done Rizar.

From their web site "In combination with their packet filtering capabilities, the tested products attempt to block attacks from other machines on the network as well as attacks performed by malicious codes that might run inside the protected machine."

as well as attacks performed by malicious codes that might run inside the protected machine - What's this got to do with a FW?

As they say it's a Proactive Security Challenge, but they should not include stand-alone FW's in their testing. It is a Security Suite test. Which loses its relevance because, as those who follow TSA well know, the best antivirus maker makes a lousy FW and so on.

FW's should be tested as a stand-alone product without the aid of HIPS or anything else.
As should HIPS, antivirus, etc. be tested separately.

by Rizar on 25. April 2011 - 23:55  (70911)

Thanks! It would be nice to have independent tests for HIPS products and firewalls, but I have a feeling the firewall test results would be quite boring and uniform (this is why the PCFlank comparison tests went extinct). Heck, I pass inbound testing with no firewall enabled at all, with all my ports either closed or stealthed (my ISP closes and stealths my ports for me, mostly closes them).

One good thing about Matousec is that sometimes vendors really do make bold claims about their outbound protection, or in some cases reviewers do. The bad thing is that they don't test products fully enough to allow us to interpret their scores very well.

I've come across a few reviews that rank ZoneAlarm's outbound protection highly (for free version), so it would be interesting to see how Matousec would rank it. Right now it's been tested and dumped in the heap of low test-level products (I wouldn't know where to start in interpreting it; the only thing to do is test it ourselves, which isn't as good without some random testing modifications to avoid static tests).

I'm not sure what the actual score would be like since I'm still letting it train up, but I suspect that ZA's outbound protection is marginal at best. A lot would depend on whether its DefenseNet could detect malicious programs before they get out to the Internet; it wouldn't stop them on the host computer before they attempt to connect (which some would say is the proper role of a firewall, but I'm not sure the line is quite that clear).

by Anonymous on 11. December 2009 - 14:06  (38288)

Matousec is the only one who tests firewalls, that is correct. Personally I think those tests are correct; I tested some of the products with their tools. I can't say that it gave me the same results, but I trust the results from the website.

by Anonymous on 2. October 2009 - 19:14  (33793)

The test result shows Gdata 2008, but Gdata 2010 is running now. Very slow to update results. Very bad.

by chris.p on 2. May 2009 - 6:09  (20922)

Matousec tests are not perfect, to be sure, but currently they are the only available guide as far as I know. In the old days it was Steve Gibson's stuff but that's long gone. The trouble is that Matousec effectively have a monopoly. Monopolies don't benefit the consumer.

Until someone else comes along it's hard to see what else you can use to judge a firewall. I don't think Matousec is an end-users' resource -- end-users look to Gizmo's to interpret the results the geeks come out with, and the site does that well enough.

Trouble is there's no money in rubbishing other people's products and that's the game Matousec are in. Shame, because we need more reference sources like this. While this is exactly the type of testing and data resource we need, it's also very easy to criticise because this area is so complex. But it's possible their presentation & logic could be improved.

On a personal level I'd really like to see 2 sets of tests, one for ordinary / normal firewalls, and one for firewalls with HIPS. You'd think they would prioritise for standard firewalls and not the HIPS type. Wonder why that is.

Excellent article Rizar, good analysis.

chris.p

by Anonymous on 2. October 2009 - 21:05  (33800)

So why doesn't GIZMO adopt a constructive position? In the past I've seen tests being done, and that was a single-man job; in the past TSA was useful. Not now, not anymore.

by MidnightCowboy on 2. October 2009 - 22:08  (33803)

Well, now up to a million visitors a month and still climbing suggests that TSA remains useful for someone. The constructive approach though is to tell us exactly what you want to see. Be specific and whatever is possible will be considered for the agenda. We'll be leading the PC security field in another area quite soon so I guess we're not quite ready to roll over and die just yet.

by Anonymous on 27. April 2009 - 2:06  (20604)

YES SHOW US SOME BETTER FIREWALL TESTS!! That is very much needed. Gizmo can you help?

by Anonymous on 7. April 2009 - 6:52  (19443)

Mamutu is misplaced in this test because of a different definition of protection.

Matousec thinks that security software must pass SIMULATED leak tests, which are, of course, not real malware, but simulated.

Emsisoft states that Mamutu protects against real malware samples because it is a behavior blocker and not a HIPS or a firewall. As long as a testing tool does not act like real malware in full it will not be detected by Mamutu - because it's not built to do so.

Example: Mamutu triggers on keyloggers, but will only alert on key-logging software if several other parameters say that the program is real malware and not some kind of software that captures your keystrokes to pass them to a braille reader for blind people.

- A typical firewall is made to alert on the maximum number of things that COULD be dangerous.
- Mamutu is made to show as few alerts as possible and filters out programs that contain potentially dangerous actions but are not intended to do dangerous things.

That's the major point that you missed in this article.

It's not about hypothetical leaks and testing of them, it's about real malware. And here Mamutu does a great job. It's able to detect nearly all malware samples on a protection level similar to Norton Antibot and Threatfire (which are both not included in the Matousec tests btw.).

So I'd ask: why intentionally cast a bad light on Mamutu if it performs world-class when it comes to real malware?

You're right, it's on the reader to interpret the test results right. But who is really able to do this correctly? 99% of all visitors will simply think: Hey, Mamutu is rated worst, I'll never ever download and install this bad one..

That's fair?

by JonathanT on 7. April 2009 - 2:17  (19432)

Very informative article!
A minor point: you say Matousec uses a functional definition rather than traditional categories of products. But the function of Mamutu is fundamentally different from a firewall's. And "There are several very well known products on the list that perform very poorly". One can still make a case that these are firewalls (though I personally still think Matousec is using a HIPS+Firewall test on firewalls), but Mamutu simply doesn't filter network traffic, so I can see why Emsisoft is offended.

by Rizar on 8. April 2009 - 6:01  (19514)

Thanks! And I just added another section (#4).

Oddly enough both could be true. A product could be both fundamentally different from a firewall and also qualify as a "personal firewall."

I think you are right; in fact, the functional definition excludes "firewalls" or "packet filters." A product must have a HIPS or a behavior blocker to qualify, at least according to the material I quote from the website at the beginning of the article.

by Anonymous on 6. April 2009 - 16:50  (19416)

The Matousec test is very, very suspicious. I remember when they tested an up-to-date product against a beta product which also had an up-to-date version. After that I don't visit matousec.com anymore.

by Valentin N (not verified) on 22. February 2011 - 14:51  (66923)

Could you tell what's suspicious about that??? Nothing. If a company wants Matousec to test their beta product then beta product will be tested; I don't see anything wrong with that.

Regards,
Valentin N

by Anonymous on 6. April 2009 - 15:30  (19410)

Show us some BETTER tests.

by chris.p on 7. May 2009 - 13:16  (21196)

There are no better tests. That's the problem. There aren't even any similar tests.

Matousec seem to be the only viable testing outfit but not everyone agrees with their methods.

chris.p
