Matousec Proactive Security Challenge Analyzed



The following sections discuss misleading elements of a popular firewall tester in an attempt to help readers understand the meaning and limitations of its test results. Please don't take it as an attack against the independent service the website provides. I enjoy the website and praise its level of professionalism.


1. Overview

The scoring of the Matousec tests (as presented on its comparison table) and some of the site's claims are misleading. To simplify my article, let's consider a simple bowling analogy.

You go bowling, bowl 1 game, and get a 270. But you sit out the next 2 games. Your annoying little brother, a whiz at mathematics and awful at sports, bowls all three games at a laughable 100 something every time. He never scores remotely close to your 270.

But your little brother snickers that he beat you all three times, including the first game. While trying to prevent yourself from using him as a bowling ball, you politely ask him to explain his fuzzy math.

Your little brother explains that he divided your 270 score in game 1 by all three games, and gave you zeros for the games you didn't bowl. You bowled a 90 in game 1 by this scoring method. You got beat by your snot nosed little brother.

The way Matousec scores its Proactive Security Challenge would not only agree with your whiz kid little brother but would also suggest that he should bowl another 10 games for the next 9 days, and give you zeroes for all those days too.

When you look at the comparison table at Matousec, pretend it was prepared by your wonderful little brother and ignore any score that wasn't based on all ten levels of testing. Those scores are more misleading than anything your little brother would prepare in a single bowling session.

I have other minor notes and complaints, but the story above illustrates the main problems inherent in the Matousec scoring method.


2. Background and Purpose

In this article I analyze the popular and influential Matousec firewall testing service. The project, started in 2006 mostly by university students, originally focused on testing traditional firewalls or "packet filters," but after two years it broadened its testing and became the Proactive Security Challenge.

The challenge compares software products that implement an "application-based security model," including products normally called an "Internet security suite, a personal firewall, a HIPS, [or] a behavior blocker" (quotes from FAQ).

So, as an example, the proactive software that it tests must prevent "data and identity theft" and other attacks (Interpretation of results). Tested products must be able to block malware from running on a PC, getting to a user's private data, sending private data to outsiders, or attacking trusted parts of a user's system (Interpretation of results).

Tested products face a potential 10-level set of tests. The testing package (with most of the challenge's tests) is available free as a download from the Matousec website, but it's limited to personal use. Matousec describes its testing procedures and guidelines in its FAQ section.


3. The Scoring of Test Results is Misleading

The results of the Proactive Security Challenge are listed on a table by product and final test score (see Results). If a product fails a level of testing, then it is not subjected to further testing, according to rules posted on the site. Hence, it organizes products by the number of possible tests rather than the total number of actual tests.

They compare products that receive the full 184 tests to products that receive 12 tests. Those are two significantly different kinds of scores. It's not objective to compare them, and outright deception to say the products with 12 tests were tested with 184 tests.

When you arrive at the results table, always look at the "level reached" column first, across from the "product score" column. You can trust the product score if and only if the level reached reads 10.

The scores for products below 10 do not mean the same as the top scores, at least for users who imagine actual tests being performed on actual software (rather than possible tests not being performed on anything). If you are interested in any of those lower products on the list, then ignore their scores as reported on the table and download their PDF file to interpret their results yourself (you will be surprised at the lack of actual testing and you won't know how to interpret the scant results compared to other products).

Since they don't distinguish between products that actually received all tests and products that didn't receive all tests, the number of tests a product received is misleading (except in cases where a product received all possible tests, or where a product was tested before the addition of a new test and gets NA for a test).

According to scoring rules posted on the site:

"All tests are equal to the intent that their scores are not weighted by their level or something else. The total score of the tested product is counted as follows. For all tests in all levels that the product did not reach, the product's score is 0%. For all other tests the score is determined by the testing. The total score of the product is a sum of the scores of all tests divided by the number of all tests and rounded to a whole number. It may happen that a new test is added to Proactive Security Challenge when some products already has their results. In such case, the result for already tested product is set to N/A for this new test, which means that it is not counted for this product and does not affect its score or level passing. Neither the number of the tests, nor the number of levels is final. We intend to create new tests in the future. We are also open to your ideas of new testing techniques or even complete tests." (Methodology and Rules).

So it is implicit in this quote that products are not always fully tested. If they don't advance through all the levels, they get an automatic 0% on the levels for which they were not tested. So the overall score will be very low if a product only made it to the first level.
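The arithmetic behind that rule can be sketched in a few lines of Python. This is only an illustration of the quoted rule, not Matousec's own code: the function names are made up for this article, and the example product (6 passes out of 12 level 1 tests) is hypothetical. It assumes each test result is a percentage and that N/A tests have already been left out of both the results and the total.

```python
def published_score(administered_results, total_tests):
    """Score as the quoted rules describe it: unreached tests count as 0%."""
    # Tests in levels the product never reached add nothing to the sum,
    # but they still count in the divisor.
    return round(sum(administered_results) / total_tests)


def administered_only_score(administered_results):
    """Alternative for comparison: average over tests that were actually run."""
    return round(sum(administered_results) / len(administered_results))


# Hypothetical product: stopped after level 1, passed 6 of its 12 tests.
level_one_results = [100.0] * 6 + [0.0] * 6

print(published_score(level_one_results, total_tests=184))  # 600 / 184 -> 3
print(administered_only_score(level_one_results))           # 600 / 12  -> 50
```

The same product reads as 3% on one scale and 50% on the other. Neither number says anything about the 172 tests it never took, which is the point: the published figure folds "failed" and "never tested" into one value. The bowling story works the same way, since published_score([270], 3) gives the little brother's 90.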

This also matters in the reliability of the final score: the fewer the tests, the less reliable the scoring. It would be sort of like polling 11 people about their favorite firewall product rather than 84 people. The more tests they use, the more reliable the data and the easier it is to interpret the results.
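To put rough numbers on the polling comparison, here is a small sketch using the textbook standard error of a sample proportion. It assumes a simple random sample with independent answers and a true share of 50%. Firewall tests are not random samples, so treat this as an illustration of the analogy only, not as a measurement of the challenge's reliability.

```python
import math


def standard_error(share, sample_size):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(share * (1.0 - share) / sample_size)


# Compare the two poll sizes from the analogy, assuming a 50% true share.
for people in (11, 84):
    print(people, round(standard_error(0.5, people), 3))
# 11 0.151
# 84 0.055
```

Under those assumptions, the poll of 11 people is uncertain by about 15 percentage points, while the poll of 84 people is uncertain by about 5.5.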

It also matters to the clarity of the final results on the website. To place products given only 12 tests at level 1 on a table labeled "Products tested against the suite with 184 tests" is just plain 172 tests wrong for many products on the chart.

I recommend the categorization of products based on the number of actual tests they received.

I also recommend that users disregard any score that did not reach level 10. The scores below level 10 are weighed down, as if they were students with fewer chances to turn in assignments. Would you trust your teacher's grading method if he gave you zeroes for work you didn't have a chance to complete? Getting zeroes for late or missing work is bad enough.


4. Invalid Claims

Matousec explains to readers that the testing results seek to hold firewall products to their security claims:

"So, what does it mean if the product fails even the most basic tests of our challenge? It means that it is unable to do what its vendor claims it can. Such a product can hardly protect you against the mentioned threats" (Interpretation of Results).

What if such product vendors didn't claim to provide some kinds of security tested in the challenge? If a product was not designed to protect against certain threats and does not claim to protect against them, then it is incorrect to state that the goal of testing is to hold products accountable for the level of protection they claim to provide.

To be consistent with the quote above, Matousec ought to include or exclude products (or relevant tests) based on whether vendors claim to protect against tested threats.

Several vendor comments assert that their firewall products were meant to be part of a security package and were not intended as a stand-alone product. See the Bitdefender and AVG comments, for example, in the vendor responses.

Another vendor states that they do not provide anti-keylogger protection and leave such protection for other security products. It would be counter to the goal of testing (captured in the quote above) to give such products keylogging tests and lower their score for not providing security they do not claim to provide.

I personally like to see all products get the same tests. It makes for good comparison. But it's not valid to suggest that they "hold products to their claims" when some products explicitly deny that they offer some of the protection in the tests (and state such information in charts on their websites).

To analyze a product's security claims, the test scores would have to avoid contradicting explicit facts a product vendor makes readily available to users.

But Matousec incorrectly describes the goal of the test. It fails to describe the type of conclusions the test achieves. The test scores compare products based on a series of default tests (sometimes modified without warning to prevent cheating). It doesn't hold products to their protection claims.


5. Does Experience Count Too Much?

Testers in the challenge are experienced users whilst users of firewalls in the real world are often not experienced users. Popup alert information used by proactive firewalls (at their max settings) is often ambiguous. The alerts depend on a user's knowledge of their computer software, so the level of protection for average users may not reach as high as it does for the testers.

Firewall security is a two part relation between the user and the product. If the user answers "block" too late in the chain of alerts, then they get firewall crashes and maliciously launched browsers instead of high quality firewall security. And average users will no doubt let through many applications that experienced testers won't. Therefore, a software product can only reliably provide the level of security found in the tests for experienced users. If a product is too confusing, then it may rarely reach the Matousec level of security.

The proactive security tests would be more informative about the real world effectiveness of a firewall by testing random users. If they had enough users to volunteer in such hypothetical tests, then they could interpret the results through a statistical analysis. This methodology would generalize better to the public and to the real world effectiveness of firewall products, but, of course, as in all objective tests there is a trade off -- as you increase the generalizability of a test to the public, you decrease its internal validity (and vice versa).

The use of experienced users helps to filter out false positives (and sometimes they even modify the tests if they think a firewall has an obvious weak point), but such filtering and interpretation of results does not have to be part of testing. One may interpret the data after testing, and one could modify tests after an average user completes testing to ready the firewall for retesting. However, such proposed tests would have to be more user-friendly and would probably decrease the validity of the tests themselves. It is difficult to achieve internal validity while also generalizing the results to a broader population.

The Matousec results suggest a maximum level of security for a product. However, it is difficult to make this claim because the challenge does not fully test products with all actual tests. So for products that did not get far up the levels of testing, the Matousec scores do not suggest a maximum level of security. Products may provide a higher level of security than the level of security suggested by the Matousec score.

If a low scoring product is user-friendly, then it may also provide more security for inexperienced users than a complicated firewall. Likewise if a user is more knowledgeable of a simpler product, then such a user may be more secure than with a feature rich product. But perhaps the more aggressive product would still outperform a user-friendly product. We wouldn't know without conducting tests on random users.

If the results of the Matousec tests do not generalize to the public, then users should consider other factors, such as user-friendliness. It's possible that for average users, user-friendliness increases the level of effectiveness of a product.


6. Final Thoughts

Of course, any experienced user can use most of the same tests used in Matousec testing, since they are located on the website for a free download. Therefore, money can't plausibly influence the validity of the actual tests, since the tests are available to everyone.

The test results are linked by a PDF file and anyone can see the types of tests a product fails or passes. Since the raw data is posted to the site, you can ignore the overall score and just look at the tests passed or failed. However, the PDF has little value when it doesn't list enough testing levels to allow readers to make sound interpretations of the results. They might as well not even list level 1 products.

I'm suspicious of many scoring practices in the Proactive Security Challenge. For example, I find it problematic that they give products 0% for levels not tested and that they score products by the number of possible tests (when many of the tests were not actually administered). I found it confusing that they compare products based on the total number of possible tests. And the claim that their results validate (or invalidate) the security claims of vendors is false.

However, no other similar testing service (for proactive security or outbound protection testing) exists as far as I know, so Matousec has little competition. And, as stated at the beginning, I appreciate the thoroughness and technical details of their service. It should be noted that their website is informative and detailed about their testing methods.





