SpamSieve Not Catching New Spam for Me Either

ajparker · March 8, 2008, 10:44am

Over the past few weeks SpamSieve has been letting more spam through than it used to. Before, I might get one a day in my inbox, but now I’m seeing six or seven, sometimes more.

I dug through my log and I noticed something strange. It looks like SpamSieve is sometimes not acknowledging when I train a false positive. I’ve reproduced an example from my log below.

Notice the entries for 2008-03-06, the subject wonyun. The log shows SpamSieve marking it as good, but there’s no entry that shows where I marked it as a false positive. I mark all spam that gets through as false positives, and I specifically remember marking this one, as I thought the subject was particularly weird. But SS has no record of my marking it, so it kept the address in the whitelist and let other spams with that address through. I looked at my whitelist and saw a number of very suspicious addresses on it, which makes me think SS has missed more of the false negatives I’ve trained manually.

This seems to have started when I upgraded to 2.6.6. I’m reverting back to 2.6.5 to see if the problem persists.

=====================================================================
Predicted: Spam (95)
Subject: Designer Footwear from Gucci Prada Chanel & More, buy direct, forget department store prices
From: b6-liner123@acclamation.com
Identifier: mVTrvXQGPjhd/miUyLUAow==
Reason: P(spam)=1.000[0.999], bias=0.000, S:store(0.999), ^bg-fffff3(0.999), S:Chanel(0.999), S:Prada(0.999), S:Gucci(0.999), S:Designer(0.998), F:bernard(0.002), CT:FOOTWEAR6-parta4endd(0.995), ^fb-FOOTWEAR6-parta4endd(0.995), S:direct(0.995), S:Footwear(0.995), ^iw-410(0.995), ^file-name-FOOTWEAR6-parta4endd-gif(0.995), ^ih-327(0.995), ^i5-202d3a69(0.995)
Date: 2008-02-21 05:46:22 -0500

=====================================================================
Predicted: Good (27)
Subject: Re: wonyun
From: -liner123@acclamation.com
Identifier: DqHCJGmC8OOCUSG8bhzJFg==
Reason: P(spam)=0.487[0.492], bias=0.000, F:alston(0.005), R:^201^213(0.995), medication(0.875), R:^3(0.268), to:list-manager@(0.302), MI:^bad-host(0.675), F:sally(0.328), darkness(0.348), helps(0.369), R:^201(0.617), yourself(0.600)
Date: 2008-03-06 08:50:26 -0500

=====================================================================
Trained: Good (Auto)
Subject: Re: wonyun
Identifier: DqHCJGmC8OOCUSG8bhzJFg==
Actions: added rule <From (address) Is Equal to "-liner123@acclamation.com"> to SpamSieve whitelist, added rule <From (name) Is Equal to “alston sally”> to SpamSieve whitelist, added to Good corpus (1941)
Date: 2008-03-06 08:50:26 -0500

=====================================================================
Predicted: Good (1)
Subject: 80% discount. Code #EJ72
From: -liner123@acclamation.com
Identifier: Svz5dI0JZJZQ44pT8mRvFg==
Reason: (
"-liner123@acclamation.com"
) matched rule <From (address) Is Equal to "-liner123@acclamation.com"> in SpamSieve whitelist
Date: 2008-03-08 13:02:51 -0500

=====================================================================
Trained: Good (Auto)
Subject: 80% discount. Code #EJ72
Identifier: Svz5dI0JZJZQ44pT8mRvFg==
Actions: added to Good corpus (1976)
Date: 2008-03-08 13:02:51 -0500

=====================================================================
Trained: Spam (Manual)
Subject: 80% discount. Code #EJ72
Identifier: Svz5dI0JZJZQ44pT8mRvFg==
Actions: disabled rule <From (address) Is Equal to "-liner123@acclamation.com"> in SpamSieve whitelist, added rule <From (address) Is Equal to "-liner123@acclamation.com"> to SpamSieve blocklist, added to Spam corpus (3224), removed from Good corpus (1975)
Date: 2008-03-08 13:08:44 -0500

=====================================================================
Mistake: False Negative
Subject: 80% discount. Code #EJ72
Identifier: Svz5dI0JZJZQ44pT8mRvFg==
Classifier: Whitelist
Score: 1
Date: 2008-03-08 13:08:49 -0500

Michael_Tsai · March 8, 2008, 11:04am

Which mail program are you using, and which version of Mac OS X?

A spam message that SpamSieve thought was good is a false negative.

To be precise, there is no such thing as marking as a false positive or as a false negative. Instead, it’s a two-step process: first you train the message as good or as spam, and then SpamSieve determines whether it was a false negative, false positive, or neither.
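The two-step process described above can be sketched in a few lines of Python. This is only an illustration of the logic, not SpamSieve's actual code; the function name `classify_mistake` and the lowercase labels `"good"` and `"spam"` are assumptions made for the example.

```python
from typing import Optional


def classify_mistake(predicted: str, trained: str) -> Optional[str]:
    """Compare the original prediction with the later training.

    Both arguments are "good" or "spam" (hypothetical labels for this sketch).
    Returns "false negative", "false positive", or None if there was no mistake.
    """
    labels = {"good", "spam"}
    if predicted not in labels or trained not in labels:
        raise ValueError("labels must be 'good' or 'spam'")
    if predicted == trained:
        return None  # the prediction was right, so neither kind of mistake
    if predicted == "good":
        return "false negative"  # spam that was let through as good
    return "false positive"  # good mail that was caught as spam
```

For the “80% discount” message in the log above, the prediction was good and the manual training was spam, so `classify_mistake("good", "spam")` gives `"false negative"`.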

What you should expect to see is a “Trained: Spam (Manual)” entry followed by a “Mistake: False Negative” entry. I see this for the “80% discount. Code #EJ72” message, but not for the “Re: wonyun” message. Please note that the “Trained” log entry would appear when the training occurred, so it would not necessarily be near the “Predicted: Good” in the log.
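One way to check for that pair of entries without scrolling through the log is a small script. The sketch below assumes the log has been saved as plain text in the layout shown in this thread (entries separated by blank lines or rows of `=`, each field written as `Key: value`); the function names `parse_entries` and `training_recorded` are made up for the example and are not part of SpamSieve.

```python
def parse_entries(log_text):
    """Split log text into entries, each a dict of its 'Key: value' fields."""
    entries, current = [], None
    for raw in log_text.splitlines():
        line = raw.strip()
        # Blank lines and rows of '=' both end the current entry.
        if not line or set(line) == {"="}:
            if current:
                entries.append(current)
            current = None
            continue
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # continuation of a multi-line field such as Reason
        if current is None:
            current = {"kind": key}  # the first field names the entry type
        current.setdefault(key, value)
    if current:
        entries.append(current)
    return entries


def training_recorded(entries, identifier):
    """True if a manual spam training is later followed by a false negative."""
    trained = False
    for entry in entries:
        if entry.get("Identifier") != identifier:
            continue
        if entry.get("Trained") == "Spam (Manual)":
            trained = True
        elif trained and entry.get("Mistake") == "False Negative":
            return True
    return False


# Abridged from the entries posted earlier in this thread.
SAMPLE = """\
=====================================================================
Predicted: Good (27)
Subject: Re: wonyun
Identifier: DqHCJGmC8OOCUSG8bhzJFg==
Date: 2008-03-06 08:50:26 -0500

=====================================================================
Trained: Spam (Manual)
Subject: 80% discount. Code #EJ72
Identifier: Svz5dI0JZJZQ44pT8mRvFg==
Date: 2008-03-08 13:08:44 -0500

=====================================================================
Mistake: False Negative
Subject: 80% discount. Code #EJ72
Identifier: Svz5dI0JZJZQ44pT8mRvFg==
Classifier: Whitelist
Score: 1
Date: 2008-03-08 13:08:49 -0500
"""

entries = parse_entries(SAMPLE)
for ident in ("Svz5dI0JZJZQ44pT8mRvFg==", "DqHCJGmC8OOCUSG8bhzJFg=="):
    print(ident, training_recorded(entries, ident))
```

Matching on the Identifier rather than on position in the log reflects the point that the “Trained” entry appears when the training occurred, which may be far from the original prediction.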

Where is the message now? It should be easy to know if the training was successful because when you train a message as spam it’s only moved to the Spam mailbox after (a) it told SpamSieve that the message was spam, and (b) SpamSieve did not report an error.

What happens if you train the message as spam now? Does that add a “Trained: Spam (Manual)” entry to the log?

Are those addresses on the whitelist enabled (checked)?

There were no changes in 2.6.6 that would affect that, so going back to 2.6.5 is a waste of time and will only confuse matters.

In any case, you are correct that if the training is not taking place, SpamSieve will be working from incorrect information, and this would greatly reduce the accuracy. First, we need to make sure the training is working, and then you probably need to reset the corpus, re-train SpamSieve, and clean out the whitelist, so that it’s once again working from correct information.
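As an aid to cleaning out the whitelist, the rules that auto-training added can be pulled out of a saved copy of the log for review. This is a rough sketch that assumes the `Trained: Good (Auto)` and `Actions:` wording shown in the log excerpts in this thread; the function name `auto_whitelisted_rules` is hypothetical, and the output is only a list of candidates to inspect by hand, since most auto-added rules will belong to legitimate senders.

```python
import re

# Matches the rule text inside "added rule <...> to SpamSieve whitelist".
RULE = re.compile(r"added rule <(.*?)> to SpamSieve whitelist")


def auto_whitelisted_rules(log_text):
    """Collect whitelist rules that were added by automatic good-training."""
    rules = []
    auto_good = False
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("Trained: "):
            # Only Actions lines under an auto good-training are of interest.
            auto_good = line == "Trained: Good (Auto)"
        elif line.startswith(("Predicted: ", "Mistake: ")):
            auto_good = False  # a different kind of entry has started
        elif auto_good and line.startswith("Actions: "):
            for rule in RULE.findall(line):
                if rule not in rules:
                    rules.append(rule)  # keep first-seen order, no repeats
    return rules
```

Run over the log posted above, this would report the `From (address)` and `From (name)` rules that were added when “Re: wonyun” was auto-trained as good.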

ajparker · March 8, 2008, 11:38am

Apple Mail 3.2 and OS X 10.5.2.

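The “Trained: Spam (Manual)” entry for “Re: wonyun” is not anywhere in the log. I did a search both on the from address and the subject.

The message is in the spam folder.

Yes, training it as spam now adds an entry to the log: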

=====================================================================
Trained: Spam (Manual)
Subject: Re: wonyun
Identifier: DqHCJGmC8OOCUSG8bhzJFg==
Actions: disabled rule <From (name) Is Equal to “alston sally”> in SpamSieve whitelist, added rule <From (name) Is Equal to “alston sally”> to SpamSieve blocklist, added to Spam corpus (3227), removed from Good corpus (1975)
Date: 2008-03-08 14:35:15 -0500

Mistake: False Negative
Subject: Re: wonyun
Identifier: DqHCJGmC8OOCUSG8bhzJFg==
Classifier: Bayesian
Score: 27
Date: 2008-03-08 14:35:20 -0500

Yes, those addresses on the whitelist are enabled.

I’d hate to reset the corpus since I have about two years of data in it, but if that’s what it takes to get it working again, I’ll do it. Thanks!

Michael_Tsai · March 8, 2008, 12:23pm

Well, the alternatives are:

Find the particular messages for which the training didn’t take and train them as spam.
Restore from a backup that was made before the corpus and rules got messed up.

ajparker · March 8, 2008, 12:46pm

Thanks to Time Machine, I can do the latter. I’ll give it a shot. Thanks!

ajparker · March 10, 2008, 8:10am

FYI, I had 300 recent spams saved, so I decided just to retrain SS. It’s now back to its nearly 100% accuracy goodness. Thanks!