Tell SpamSieve the Truth

November 11th, 2006 (SpamSieve)

Tips like this have been circulating recently. They suggest that you should not mark phishing messages, the recent image and stock spams, or messages with spoofed sender addresses as spam. I do not recommend that SpamSieve users follow this advice.

When you first install SpamSieve, you’ll do an initial training using up to 1,000 spam and good messages, about 65% of them spam. After that, SpamSieve will auto-train itself and you’ll only need to train it when it makes a mistake. In both cases, my recommendation is simple: tell SpamSieve the truth. If a message is spam, train it as such; don’t omit a message because you think it will confuse SpamSieve. This is for two reasons. First, there’s probably some spammy content that SpamSieve could learn from, even if it doesn’t appear so. Second, if you don’t train the message as spam, SpamSieve will assume that the message was good and that you want to see more such messages.
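The second reason can be illustrated with a toy word-count filter. This is only a sketch under simplifying assumptions: the `ToyFilter` class, its smoothing, and the sample messages are hypothetical and do not represent how SpamSieve is actually implemented. It compares a filter that was told the truth about a phishing message with one where the same message was left to be learned as good.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical word-count filter; not SpamSieve's real algorithm."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.good_tokens = Counter()
        self.spam_count = 0
        self.good_count = 0

    def train(self, text, is_spam):
        # Count each distinct word once per message.
        tokens = set(text.lower().split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_count += 1
        else:
            self.good_tokens.update(tokens)
            self.good_count += 1

    def token_spamminess(self, token):
        # Smoothed share of spam messages vs. good messages containing the token.
        p_spam = (self.spam_tokens[token] + 1) / (self.spam_count + 2)
        p_good = (self.good_tokens[token] + 1) / (self.good_count + 2)
        return p_spam / (p_spam + p_good)


phish = "verify your account now"
personal = "lunch tomorrow at noon"

# Filter that was told the truth: the phishing message is trained as spam.
truthful = ToyFilter()
truthful.train(phish, is_spam=True)
truthful.train(personal, is_spam=False)

# Filter where the phishing message was skipped and so ends up learned as good.
misled = ToyFilter()
misled.train(phish, is_spam=False)
misled.train(personal, is_spam=False)

print(truthful.token_spamminess("verify"))
print(misled.token_spamminess("verify"))
```

In this toy model the word "verify" scores as more spammy in the truthful filter than in the misled one, which is the effect described above: an untrained spam message is not neutral, it counts as evidence for the good side.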

Some also suggest creating a rule in your mail program to block messages with multipart/related parts, since this content type is common among image spams. But good messages that use multipart/related are also common, so such a rule risks false positives. Even if the rule is applied only when the sender’s address isn’t in your address book, I think this kind of blanket rule is dangerous. Instead, I suggest letting SpamSieve filter these messages. Version 2.5 can analyze the structure and content of image spams, and it should be able to catch most of them without producing false positives. This will happen automatically as you train it on image spams that get through. If you have a very old or very large corpus, though, it may learn faster if you reset the corpus after updating to 2.5 and then re-train with some more recent messages.
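To see why a blanket multipart/related rule is risky, here is a minimal sketch using Python's standard `email` package. It builds an ordinary HTML message with an inline picture, the kind a friend's mail program might send, and then applies a structure-only check. The function names, addresses, and placeholder image bytes are assumptions made up for illustration; this is not how any particular mail program implements its rules.

```python
from email.message import EmailMessage
from email.utils import make_msgid


def has_multipart_related(msg):
    """Structure-only check, standing in for the blanket rule discussed above."""
    return any(part.get_content_type() == "multipart/related" for part in msg.walk())


def build_friendly_message():
    """Build a legitimate HTML message with an inline image (hypothetical content)."""
    msg = EmailMessage()
    msg["Subject"] = "Photos from the trip"
    msg["From"] = "friend@example.com"
    msg["To"] = "you@example.com"
    msg.set_content("Here is a photo from the trip.")

    cid = make_msgid(domain="example.com")
    html = f'<html><body><p>Here is a photo:</p><img src="cid:{cid[1:-1]}"></body></html>'
    msg.add_alternative(html, subtype="html")

    # Placeholder bytes stand in for a real image file.
    fake_image = b"\x89PNG\r\n\x1a\n"
    # Attaching the inline image turns the HTML part into multipart/related.
    msg.get_payload()[1].add_related(fake_image, "image", "png", cid=cid)
    return msg


good_message = build_friendly_message()
blanket_rule_hit = has_multipart_related(good_message)
print(blanket_rule_hit)
```

The check matches even though nothing about the message is spammy, because the rule looks only at the container type and not at what the message says.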