Saturday 22 October 2005

Statistical Spam Filters are Too Hard to Use

Statistical spam filters use Bayesian analysis and other statistical methods to classify each message as spam or ham (legitimate mail). Examples of such filters are SpamBayes, POPFile, DSPAM, and CRM114.

State-of-the-art statistical filters can be as accurate as, or more accurate than, a user manually filtering spam with the Delete button. However, such filters require several months of training before they reach that accuracy, and filters that rely on end users to train them aren't suitable for the majority of users.

This training can be done by feeding the filter a "corpus" of spam and legitimate messages (i.e. an archive of several months of spam and ham). However, both the initial and the ongoing training requirements are onerous and error-prone. When users complain that a good statistical spam filter isn't accurate, it's usually because they haven't trained it properly. But blaming them is hardly fair: users just want their filter to work.
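To make the training burden concrete, here is a minimal sketch of a naive Bayes classifier in Python. It is an illustration only, not the algorithm used by SpamBayes, POPFile, DSPAM, or CRM114; the class name TinyBayesFilter, the whitespace tokenizer, and the add-one smoothing are all assumptions made for the example. The point it shows is that the filter knows nothing until someone feeds it labelled mail.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Hypothetical toy naive Bayes filter; not any real product's code."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Assumption: tokens are lowercase words split on whitespace.
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Class prior, with add-one smoothing so an untrained
            # filter does not take the log of zero.
            score = math.log(
                (self.message_counts[label] + 1) / (total_messages + 2)
            )
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(vocab) + 1
            for word in text.lower().split():
                # Smoothed per-word likelihood for this class.
                score += math.log(
                    (self.word_counts[label][word] + 1) / denominator
                )
            log_scores[label] = score
        # Turn the two log scores into a probability of spam.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


spam_filter = TinyBayesFilter()
untrained_guess = spam_filter.spam_probability("buy cheap pills")

# The "corpus": every message here had to be labelled by a person.
spam_filter.train("cheap pills buy now", "spam")
spam_filter.train("buy cheap watches now", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch on monday?", "ham")

trained_guess = spam_filter.spam_probability("buy cheap pills")
```

Before any training the sketch cannot tell the classes apart, so untrained_guess is a coin flip; after four labelled messages, trained_guess leans towards spam. A real filter needs a far larger corpus, and every mislabelled message pushes the counts the wrong way, which is where the error-prone part comes in.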
