1.2 Identifying Spam

SpamSieve uses a variety of methods to identify spam messages, but by far the most important is a statistical technique known as Bayesian analysis. For a more in-depth treatment of this technique applied to spam, see this article by Paul Graham and the papers it references. Bayesian spam filtering is highly accurate and adapts to new types of spam messages “in the field.”

First, you train SpamSieve with examples of your good mail and your spam. When you receive a new message, SpamSieve looks at how often its words occur in spam messages vs. good messages. Lots of spammy words mean that the message is probably spam. However, the presence of words that are common in your normal e-mail but rare in spam messages can tip the scale the other way. This “fuzzy” approach allows SpamSieve to catch nearly every spam message yet produce very few false positives. (A false positive is a good message mistakenly identified as spam; a false negative is a spam message that slips through, so the user has to see it. Most users consider false positives to be much worse than false negatives.)
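
The word-weighing idea described above can be illustrated with a minimal sketch of a naive Bayes classifier. This is not SpamSieve's actual implementation, which is not shown here; the names TinyBayesFilter, train, and spam_probability are hypothetical, and the add-one smoothing and the simple tokenizer are assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyBayesFilter:
    """Illustrative naive Bayes filter; hypothetical, not SpamSieve's code."""

    def __init__(self):
        # Number of training messages in each class that contain a given word.
        self.word_counts = {"spam": Counter(), "good": Counter()}
        # Number of training messages seen in each class.
        self.message_counts = {"spam": 0, "good": 0}

    def train(self, text, label):
        """Record one example message; label must be 'spam' or 'good'."""
        self.message_counts[label] += 1
        # Count each distinct word once per message.
        self.word_counts[label].update(set(tokenize(text)))

    def spam_probability(self, text):
        """Return an estimate between 0 and 1 that the message is spam."""
        n_spam = self.message_counts["spam"]
        n_good = self.message_counts["good"]
        # Start from the smoothed prior odds of spam vs. good mail.
        log_odds = math.log((n_spam + 1) / (n_good + 1))
        for word in set(tokenize(text)):
            # Add-one smoothing keeps unseen words from zeroing out the result.
            p_word_spam = (self.word_counts["spam"][word] + 1) / (n_spam + 2)
            p_word_good = (self.word_counts["good"][word] + 1) / (n_good + 2)
            # Spammy words push the odds up; words typical of good mail push them down.
            log_odds += math.log(p_word_spam / p_word_good)
        # Convert log-odds to a probability without overflowing math.exp.
        if log_odds >= 0:
            return 1.0 / (1.0 + math.exp(-log_odds))
        e = math.exp(log_odds)
        return e / (1.0 + e)


spam_filter = TinyBayesFilter()
spam_filter.train("cheap pills buy now", "spam")
spam_filter.train("win free money now", "spam")
spam_filter.train("meeting agenda for monday", "good")
spam_filter.train("lunch with the project team", "good")

print(spam_filter.spam_probability("free money now"))          # above 0.5
print(spam_filter.spam_probability("project meeting monday"))  # below 0.5
```

With only four training messages the probabilities are crude, but the sketch shows the mechanism: each word contributes evidence in one direction or the other, and the combined evidence decides the outcome.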

Because you train SpamSieve with your own mail, you have full control. If SpamSieve makes a mistake, you can train it with the message in question so that in the future it will do better. Further, since spammers don’t have access to the messages you trained SpamSieve with, they have no way of knowing how to change their messages to get through. Whereas other spam filters become less effective as spammers figure out their rules, SpamSieve becomes more effective over time because it has a larger corpus of your messages to work from.
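
To illustrate how correcting a mistake improves later results, here is a small self-contained sketch. It assumes a simplified word-count model with add-one smoothing; the functions train and spam_score and the sample messages are hypothetical and do not represent SpamSieve's internals.

```python
import math
from collections import Counter

# Hypothetical per-user corpus: per-class word counts and message totals.
counts = {"spam": Counter(), "good": Counter()}
totals = {"spam": 0, "good": 0}


def train(text, label):
    """Add one message to the corpus under the given label."""
    totals[label] += 1
    counts[label].update(set(text.lower().split()))


def spam_score(text):
    """Return log-odds that text is spam; a positive value means 'probably spam'."""
    score = math.log((totals["spam"] + 1) / (totals["good"] + 1))
    for word in set(text.lower().split()):
        p_spam = (counts["spam"][word] + 1) / (totals["spam"] + 2)
        p_good = (counts["good"][word] + 1) / (totals["good"] + 2)
        score += math.log(p_spam / p_good)
    return score


train("cheap pills buy now", "spam")
train("notes from the team meeting", "good")
train("team lunch on friday", "good")

# A spam message that borrows a word from the user's normal mail.
missed = "team discount on designer watches"
before = spam_score(missed)   # negative: the message slips through as good
train(missed, "spam")         # the user corrects the mistake
after = spam_score(missed)    # positive: the same message now scores as spam

print(before, after)
```

Because the corpus in this sketch is built only from one user's messages, a sender who cannot see that corpus has no direct way to know which words would lower the score.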