Matousec Personal Firewall Tests Analyzed
Introduction
In this article I discuss the popular and influential Matousec firewall tests. The project started in 2006 and was run mostly by university students, but after two years it broadened its testing and became the Proactive Security Challenge. It now defines software products in a broader sense to include HIPS products and behavior-based security products. It tests software firewalls and software suites that meet the functional definition of "personal firewalls," which must "implement process-based security." This definition includes the requirement to block malware from freely running on a PC. So, for example, it only tests software that fulfills the condition that "personal firewalls should prevent spying and data and identity theft" (quotes from History and Introduction).
Therefore it doesn't test traditional firewalls; it considers them "packet filters" and excludes them from the test. Once a software product makes it into the challenge, it faces a 10-level set of tests. The testing package is available free as a download from the Matousec website, but it's limited to personal use.
Matousec is clear about its testing procedures up to a certain point, but there is always room for improvement! My argument is a mixed one: I will not argue that the tests themselves are invalid. I will mainly argue that the scoring of the tests and some of the site's claims are unclear and invalid in a few ways.
The following 4 sections discuss the top four misleading elements of the firewall tester, but please don't take it as an attack against the independent service the website provides. I do not claim that any of the unclarity or deception in the Matousec personal firewall tests is intentional and, in fact, I personally enjoy the website and praise its level of professionalism.
1. The Number of Tests is Misleading
The #1 most misleading element of the personal firewall tests is the data organization, which mainly consists of a list of test results and firewall products (see Results). The "separate tables view" organizes products by the number of possible tests they could face in the 10 levels of testing. If a product performs badly on a "level" of testing, however, then it is not subjected to further testing, according to rules posted on the site. Hence, the site organizes products by the overall number of possible tests rather than the total number of tests actually used on products.
Since they don't distinguish between products that actually received all tests and products that didn't receive all tests, the number of tests a product received is misleading (except in cases where a product received all possible tests, or where a product was tested before the addition of a new test and gets NA for a test).
According to scoring rules posted on the site:
"The total score of the tested product is counted as follows. For all tests in all levels that the product did not reach, the product's score is 0%. For all other tests the score is determined by the testing. The total score of the product is a sum of the scores of all tests divided by the number of all tests and rounded to a whole number" (Methodology and Rules).
So it is implicit in this quote that products are not always fully tested. If they don't advance through all the levels, they get an automatic 0% on the levels for which they were not tested. So the overall score will be very low if a product only made it to the first level.
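The arithmetic behind the quoted rule can be sketched in a few lines of Python. This is only an illustration of the rule as quoted, not Matousec's own code: the function name, the sample product, and the use of Python's built-in rounding are all assumptions, while the 11-test first level and the 84-test suite are the figures discussed in this section.

```python
def total_score(reached_scores, suite_size):
    """Apply the quoted rule: every test the product never reached counts as 0%.

    reached_scores: percentage (0-100) for each test actually administered.
    suite_size: number of tests in the whole suite, reached or not.
    """
    if suite_size <= 0 or len(reached_scores) > suite_size:
        raise ValueError("suite_size must cover every administered test")
    # Unreached tests add nothing to the sum but still count in the divisor.
    # Python's round() is an assumption about how the site rounds.
    return round(sum(reached_scores) / suite_size)


# Hypothetical product: stopped after an 11-test first level,
# passing 6 tests outright and failing the other 5.
level_one = [100] * 6 + [0] * 5
score = total_score(level_one, suite_size=84)
print(score)
```

In this made-up case the product passed more than half of the tests it actually faced, yet its total score comes out in the single digits because the 73 tests it never reached are counted as failures.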
This matters most in the reliability of the final score: the fewer the tests, the less reliable the scoring. It would be sort of like polling 11 people about their favorite firewall product rather than 84 people. The more tests they use, the more reliable the data.
It also matters to the clarity of the final results on the website. To place products given only 11 tests at level 1 on a table labeled "Products tested against the suite with 84 tests" is to be off by 73 tests. Products with fewer actual tests get compared to products that actually received all possible tests.
They could correct this unclarity by listing the products by the actual number of tests they received, rather than comparing test results from a product that received 20-70 tests and performed badly to a product that received 11 tests and performed badly. Those are two significantly different cases of badly performing products, and it is not very objective to compare them; it is outright deceptive to say both were given 84 tests. In the case of a score based on 11 tests, the test score may not be very reliable for a product, at least not as reliable as a score based on 84 tests.
I recommend the categorization of products based on the actual number of tests they received. Or perhaps two columns, one for a score based on the total number of possible tests and one for a score based on the total number of actual tests used for a particular product.
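A rough sketch of what such a two-column listing could look like follows. Everything here is hypothetical: the product labels, the per-test results, and the field names are invented for illustration, and only the suite size of 84 and the 11-test first level come from the site's tables as described above.

```python
from dataclasses import dataclass

SUITE_SIZE = 84  # size of the suite named in the results table


@dataclass
class ProductResult:
    name: str     # hypothetical product label
    scores: list  # percentage (0-100) for each test actually administered


def two_column_row(result, suite_size=SUITE_SIZE):
    """Report one product with both denominators side by side."""
    earned = sum(result.scores)
    administered = len(result.scores)
    return {
        "product": result.name,
        "tests_administered": administered,
        # Column 1: score over all possible tests (unreached tests count as 0%).
        "score_possible": round(earned / suite_size),
        # Column 2: score over the tests the product actually received.
        "score_actual": round(earned / administered) if administered else 0,
    }


products = [
    ProductResult("Product A", [100] * 60 + [0] * 24),  # received all 84 tests
    ProductResult("Product B", [100] * 6 + [0] * 5),    # stopped after 11 tests
]
rows = [two_column_row(p) for p in products]
for row in rows:
    print(row)
```

Showing the number of administered tests next to both scores lets a reader see at a glance whether a low number reflects many failed tests or simply an early exit from the challenge.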
2. Its Stated Goals Are Not Quite Perfect; in Other Words, It Fails to Use Coherent Logic to Analyze Product Claims
It proposes a goal to hold firewall products to their claims:
"So, what does it mean if the product fails even the most basic tests of our challenge? It means that it is unable to do what its vendor claims it can. Such a product can hardly protect you against the mentioned threats" (Interpretation of Results).
What if they didn't claim to provide it? If a product was not designed to protect against certain threats and does not claim to protect against them, then it is incorrect to claim that the goal of testing is to hold products accountable for the level of protection they claim to provide. If the goal in the quote above is to make use of proper logical argument form, then Matousec testing ought to include or exclude products based on whether they claim to protect against threats represented by the tests, or the goal should be a basis for not giving a product certain specific tests.
Several vendor comments assert that their firewall products were meant to be part of a security package and were not intended as a stand-alone product. See the Bitdefender and AVG comments, for example, in the vendor responses. Of course, some users would rather put together their own security solution than rely on a single vendor's antivirus, antispyware, firewall, etc., and the comparative test would help these users decide which firewalls to choose.
Another vendor states that they do not provide anti-keylogger protection and leave such protection for other security products. It would be counter to the goal of testing (captured in the quote above) to give such products (see PC Tools & Online Armor) keylogging tests and lower their score for not providing security they do not claim to provide.
I personally like to see all products get the same tests. It makes for good comparison. But it's not completely correct to suggest "we hold products to their claims" when there is not evidence that products made claims to protect against all the tests.
This is the problem with analyzing the internal logic of a product's claims; it must be based on claims a product actually makes. Or else it's not objective since it is not based on logic (and logic is objective if you use formal arguments). But if Matousec does that (and uses formal logic) and proceeds to test personal firewalls based on product assertions, then the tests would not compare all the products based on the results of the same tests.
But Matousec incorrectly describes the goal of the test and fails to describe the type of conclusions the test achieves: it compares products based on a series of default tests (sometimes modified without warning to prevent cheating). It doesn't hold products to their protection claims.
But, if we remember #1 above, the tests do not completely succeed at this comparison either. The challenge does not subject all products to the same number of tests.
3. No Statistical Analysis: No Margins of Error, No Inter-rater Reliability Scores
The website explicitly notes that the tests are not perfect:
"It should be noted that the testing programs are not perfect and in many cases they use methods, that are not reliable on 100%, to recognize whether the tested system passes or failed the test. This means that it might happen that the testing program reports that the tested system passed the test even if it failed, this is called a false positive result. The official result of the test is always set by an experienced human tester in order to filter false results. The opposite situations of false negative results should be rare but are also eliminated by the tester" (Methodology and Rules).
But all objective tests have a margin of error, so this is not a problem.
A third problem with the test results, though, is that they lack a margin of error percentage to determine whether the results are statistically significant. If there is too much error then we should ignore the results, but without knowing the margin of error we can't make a determination.
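To give a feel for what a posted margin of error might look like, here is a small Python sketch that uses the Wilson score interval for a pass rate. This is my own choice of method, not anything Matousec publishes, and it rests on a strong simplifying assumption: that each test is an independent pass/fail trial. The pass counts are invented; only the test counts of 11 and 84 come from the discussion in section 1.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate.

    Assumes each test is an independent pass/fail trial, which is a
    simplification of how a leak-test suite really behaves.
    """
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


# Two hypothetical products with a similar pass rate (about 55%),
# one judged on 11 tests and one on 84.
low_11, high_11 = wilson_interval(6, 11)
low_84, high_84 = wilson_interval(46, 84)
width_11 = high_11 - low_11
width_84 = high_84 - low_84
print(f"11 tests: {low_11:.2f} to {high_11:.2f}")
print(f"84 tests: {low_84:.2f} to {high_84:.2f}")
```

Under these assumptions the interval for the 11-test product is more than twice as wide as the one for the 84-test product, which is the kind of information a reader needs in order to decide how much weight a score deserves.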
Also, because there is need to have interpretation by testers (see quote above), it is important to emphasize reliability between testers. Do they run tests to determine whether different testers make similar judgments? They ought to post inter-rater reliability statistics.
One type of solution is to have multiple testers for the same product and take an average to determine inter-rater reliability percentages. They wouldn't need multiple raters to test every product; they would only need multiple testers to get an initial inter-rater reliability score so that they could post that score for the public and improve inter-rater reliability before they start a new round of testing.
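As an illustration of the kind of statistic they could post, the Python sketch below compares the verdicts of two testers on the same product. It reports simple percent agreement and also Cohen's kappa, a standard measure that corrects for agreement by chance; kappa is my suggestion and is not something the site mentions. The verdict lists are invented.

```python
def percent_agreement(a, b):
    """Share of tests on which two testers gave the same verdict."""
    if len(a) != len(b) or not a:
        raise ValueError("verdict lists must be non-empty and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement between two testers, corrected for chance agreement."""
    n = len(a)
    observed = percent_agreement(a, b)
    labels = set(a) | set(b)
    # Chance agreement: probability both testers pick the same label at random,
    # given how often each tester used each label.
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if expected == 1:
        return 1.0  # both testers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two testers on the same ten tests.
tester_a = ["pass", "pass", "fail", "pass", "fail",
            "fail", "pass", "pass", "fail", "pass"]
tester_b = ["pass", "pass", "fail", "fail", "fail",
            "fail", "pass", "pass", "pass", "pass"]

agreement = percent_agreement(tester_a, tester_b)
kappa = cohens_kappa(tester_a, tester_b)
print(f"agreement: {agreement:.0%}, kappa: {kappa:.2f}")
```

In this example the testers agree on 8 of 10 tests, but kappa comes out noticeably lower than 80% because some of that agreement would be expected by chance alone.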
These are standards of an objective test; this objective test wouldn't make it into an academic journal for these reasons and would be considered unreliable work without them. But I think these problems also apply to the popular antivirus tests I see. It's just common for independent testers to ignore good practices when they don't have to worry about publishing in a peer reviewed, academic journal. PC Mag won't care!
But they certainly should automatically retest if a vendor gives good reason to suggest an error: there is a comment on the website right now in which a software vendor says they got different results on the test.
Vendors do have to pay in order to get a complete test result (if their product didn't make it through the tests); this is why there is no raw data given for certain levels in the test results if a product fails to advance through the levels. Money often gets in the way of knowledge! But it shouldn't affect the test results (the raw data, I mean, in the PDF files) because the tests are available to the public for a free download.
4. Does Experience Count Too Much?
Related to the problem of rater reliability is a problem of an experience and knowledge difference between testers and average users. Ian "Gizmo" Richards often notes this kind of problem. Testers in the challenge are experienced users whilst users of firewalls in the real world are often not experienced users.
When a firewall is tested, it gives the tester many popup alert messages and asks for their input. Popup alert information and advice used by personal firewalls is often ambiguous. The alerts often depend on a user's knowledge of their computer software, for example. So the level of protection for average users may not reach as high as it does for the testers.
Firewall security is a two part relation between the user and the product. If the user answers "block" too late in the chain of alerts, then they get firewall crashes and maliciously launched browsers instead of high quality firewall security. And average users will no doubt let through many applications that experienced testers won't. Therefore, a software product can only reliably provide the level of security found in the tests for experienced users. If a product is too confusing, then it may rarely reach the Matousec level of security.
The firewall tests could be more informative about the real world effectiveness of a firewall by testing it on random users. If they had enough users to volunteer in such hypothetical tests, then they could interpret the results through a statistical analysis. The use of experienced users helps to filter out false positives (and sometimes they even modify the tests if they think a firewall has an obvious weak point), but such filtering and interpretation of results does not have to be part of testing. One may interpret the data after testing and one could modify tests after an average user completes testing to ready the firewall for retesting.
This methodology would generalize better to the public and to the real world effectiveness of firewall products, but, of course, as in all objective tests there is a trade-off -- as you increase the generalizability of a test to the public, you decrease its internal validity (and vice versa). For example, such proposed tests would have to be more user-friendly and would probably decrease the validity of the tests themselves. It is just methodologically difficult to get internal validity in tests while also generalizing the results to a broader population.
The Matousec results might suggest a maximum level of security for a product. Though, it is difficult even to make this kind of claim because the challenge does not fully test products with all actual tests. So for products that did not get far up the levels of testing, the Matousec scores do not suggest a maximum level of security. Products may even provide a higher level of security for both experienced and inexperienced users than the level of security suggested by the Matousec score.
If a low scoring product is user-friendly, then it may actually provide more security for inexperienced users than a complicated firewall. Likewise if a user is more knowledgeable of a low-scoring product, then such a user may be more secure with the lower scoring product than with a higher scoring product. If a product completely confuses a user, then it might provide lower security in the real world than in the tests.
But perhaps the higher scoring product would be so high that it would still outperform a low-scoring, user-friendly product. We wouldn't know without conducting the tests on random users.
If the results of the Matousec tests really do not generalize well to the public, then users should consider other factors such as user-friendliness. It's an intuitive possibility that for average users, user-friendliness actually increases the level of effectiveness of a product (but not reliably past the level reached by an experienced user).
5. It's Not All Bad
Of course, any experienced user can use the same tests used in Matousec testing since they are located on the website for a free download (http://www.matousec.com/downloads/). Therefore, money can't plausibly influence the validity of the actual tests since the tests are available to everyone (though, #1-3 point out problems with the validity of the scoring of the tests and the tests may be modified during testing); money can only influence the way the site runs -- the politics!
The test results are linked in a PDF file and anyone can see the types of tests a product fails or passes. If a product only fails tests that do not apply to it, then the lower score does not matter to users.
Since the raw data is posted to the site, you can completely ignore the overall score and just look at the tests passed or failed. But some raw data results were the product of some individual tester, and they may have made an error or interpreted the tests differently than someone else.
Plus, they correct any errors in their tests, such as one in the case of Comodo in which they forgot to use the correct settings while testing. So they retested and the score changed from 84% to 90%. But it would be nice if they tested more often to update the results for new versions of products, though then maybe the site couldn't exist at all. There aren't many firewall testing sites out there! And the few others I could find are wildly out of date, back to 2006 or so.
There are plenty of things to be suspicious of in this test. For example, I find it a problematic practice that they report test results so harshly by adding 0% for levels not tested and scoring results by the number of possible tests even if not actually administered. I found it confusing that they arrange products based on the total number of possible tests. And I found the claim that they merely test the security claims of vendors to be invalid.
6. Mamutu
This gets a special section all its own. JonathanT noted that
"Emsisoft believes Mamutu is misplaced in that test, see here and here."
Of course, if it fails tests only a firewall can pass, then it is obvious the test isn't fair for that particular product. But a product can still get a high score without certain types of security; for example, Online Armor and PC Tools do not have much anti-keylogger security -- Online Armor strips it from the free version and PC Tools leaves it to other security products -- yet both perform excellently in the tests.
Matousec says Mamutu fits enough of their tests to make the definition and get tested, so they go based on the functions a product does and the protection it claims (in a loose sort of way, obviously!) rather than its category. They use a functional definition rather than traditional categories of products, so there is bound to be discrepancy between their functional definition and other definitions. The main point is whether the products provide important defense, such as defending against malware turning it off.
There are several very well known products on the list that perform very poorly, so Mamutu is not alone. We are talking about it, and sometimes "bad" press is good! But the valid arguments for Mamutu to use are the ones in #1-3 above. Mamutu's test results are based on level 1 tests only and automatic 0% scores for all other levels. Mamutu clearly doesn't claim to protect against some things in the tests.
But Mamutu, instead, argues that they are misplaced in the tests because of category distinctions, instead of arguing that the test itself has flaws in its goals and flaws in its scoring/statistical methods.
I do have sympathy for Mamutu, especially when most readers don't read the rules, PDF files, and vendor comments. Though this is partly a problem of the readers themselves, it is also a problem with the way the test results are reported (adding 0% for levels not tested), the way they are posted (by number of possible tests even if not actually administered), and the partly incorrect claim of testing the security claims of vendors.
Rizar
The test results show Gdata 2008, but Gdata 2010 is out now. The results are updated very slowly. Very bad.
Matousec tests are not perfect, to be sure, but currently they are the only available guide as far as I know. In the old days it was Steve Gibson's stuff but that's long gone. The trouble is that Matousec effectively have a monopoly. Monopolies don't benefit the consumer.
Until someone else comes along it's hard to see what else you can use to judge a firewall. I don't think Matousec is an end-users' resource -- end-users look to Gizmo's to interpret the results the geeks come out with, and the site does that well enough.
Trouble is there's no money in rubbishing other peoples' products and that's the game Matousec are in. Shame, because we need more reference sources like this. While this is exactly the type of testing and data resource we need, it's also very easy to criticise because this area is so complex. But it's possible their presentation & logic could be improved.
On a personal level I'd really like to see 2 sets of tests, one for ordinary / normal firewalls, and one for firewalls with HIPS. You'd think they would prioritise for standard firewalls and not the HIPS type. Wonder why that is.
Excellent article Rizar, good analysis.
chris.p
so why doesn't GIZMO adopt a constructive position? in the past i've seen tests being done and that was a single man job, in the past tsa was useful. not now, not anymore.
Well, now up to a million visitors a month and still climbing suggests that TSA remains useful for someone. The constructive approach though is to tell us exactly what you want to see. Be specific and whatever is possible will be considered for the agenda. We'll be leading the PC security field in another area quite soon so I guess we're not quite ready to roll over and die just yet.
YES SHOW US SOME BETTER FIREWALL TESTS!! That is very much needed. Gizmo can you help?
Mamutu is misplaced in this test because of a different definition of protection.
Matousec thinks that a security software must pass SIMULATED leak tests which are, of course, no real malware, but simulated.
Emsisoft states that Mamutu protects against real malware samples because it is a behavior blocker and not a HIPS or a firewall. As long as a testing tool does not act like real malware in full it will not be detected by Mamutu - because it's not built to do so.
Example: Mamutu detects keylogging behavior, but it will only raise an alert about key logging software if several other parameters indicate that the program is real malware and not, say, software that captures your keystrokes to pass them to a braille reader for blind people.
- A typical firewall is made to alert a maximum number of things that COULD be dangerous.
- Mamutu is made to show as few alerts as possible and filters out all programs that contain potentially dangerous actions but are not intended to do dangerous things.
That's the major point that you missed in this article.
It's not about hypothetical leaks and testing of them, it's about real malware. And here Mamutu does a great job. It's able to detect nearly all malware samples on a protection level similar to Norton Antibot and Threatfire (which are both not included in the Matousec tests btw.).
So I'd ask: Why intentionally draw a bad light on Mamutu if it performs world class when it comes to real malware?
You're right, it's on the reader to interpret the test results right. But who is really able to do this correctly? 99% of all visitors will simply think: Hey, Mamutu is rated worst, I'll never ever download and install this bad one..
That's fair?
Very informative article!
A minor point, you say Matousec uses a functional definition rather than traditional categories of products. But the function of Mamutu is fundamentally different to a firewall. And "There are several very well known products on the list that perform very poorly". One can still make a case that these are firewalls (though I personally still think Matousec is using a HIPS+Firewall test on firewalls), but Mamutu simply doesn't filter network traffic, so I can see why Emsisoft is offended.
Thanks! And I just added another section (#4).
Oddly enough both could be true. A product could be both fundamentally different from a firewall and also qualify as a "personal firewall."
I think you are right; in fact, the functional definition excludes "firewalls" or "packet filters." A product must have a HIPS or a behavior blocker to qualify, at least according to the material I quote from the website at the beginning of the article.
Matousec test is very very suspicious. I remember when they tested up-to-date product against beta product which also had up-to-date version. After that I don't visit matousec.com anymore.
Show us some BETTER tests.
There are no better tests. That's the problem. There aren't even any similar tests.
Matousec seem to be the only viable testing outfit but not everyone agrees with their methods.
chris.p