SpamSieve letting spam through -- subject with * and /! stuff?

ptroxler · February 4, 2011, 2:05am

For the past couple of days I have been getting massive amounts of spam that SpamSieve does not catch (while it is working just great on other messages).

Example log entries – worrying:

Predicted: Good (27)
Subject: “CLICK >>”
From: BILLYHickey@voila.fr
Identifier: xz0HyQSvlbPQIsRzhCou4Q==
Reason: P(spam)=0.000[0.499], bias=0.000, ^float-span(0.000), A:User(0.001), F:@voila.fr(0.998), ^a-style-floatleft(0.002), RP:@voila.fr(0.998), A:her(0.002), U:latest(0.002), A:daughter(0.005), LS:If(0.005), A:Agreement(0.005), U:medicanki(0.005), U:medicanki(0.005), foreclosure(0.995), foreclosure(0.995), slams(0.995)
Date: 2011-02-04 11:01:21 +0100

Trained: Good (Auto)
Subject: “CLICK >>”
Identifier: xz0HyQSvlbPQIsRzhCou4Q==
Actions: added rule <From (address) Is Equal to "BILLYHickey@voila.fr"> to SpamSieve whitelist, added rule <From (name) Is Equal to “BILLY Hickey”> to SpamSieve whitelist, added to Good corpus (1837)
Date: 2011-02-04 11:01:21 +0100

What all these messages have in common is some combination of stars or slashes with exclamation marks in the subject line:

Predicted: Good (13)
Subject: -!\DISCOUNT VIAGRA!-
From: WaylonEngland@aol.com
Identifier: Vj4FDbzWGH/0yO+oKnkSDw==
Reason: P(spam)=0.000[0.183], bias=0.000, ^a-style-clearboth(0.000), A:love(0.000), ^float-span(0.000), U:analyst(0.000), A:employer(0.001), loud(0.001), loud(0.001), A:User(0.001), nfls(0.999), nfls(0.999), H:X-AOL-SENDER(0.002), R:^sysops^aol^com(0.002), x-mb-message-source:WebUI(0.002), x-mb-message-type:User(0.002), H:X-MB-Message-Source(0.002)
Date: 2011-02-04 10:56:22 +0100

Predicted: Good (5)
Subject: /!.ELITE PHARMACY HERE >>.!/
From: TILDAHaas@aol.com
Identifier: 5wLJkJG3QiNlnyjRjl5zTA==
Reason: P(spam)=0.000[0.002], bias=0.000, ^a-style-clearboth(0.000), ^float-span(0.000), A:really(0.001), A:User(0.001), A:close(0.001), U:Log(0.999), H:X-AOL-SENDER(0.002), R:^sysops^aol^com(0.002), x-mb-message-source:WebUI(0.002), x-mb-message-type:User(0.002), A:crisis(0.002), H:X-MB-Message-Source(0.002), H:X-MB-Message-Type(0.002), x-aol-sender:@aol.com(0.002), ^a-style-floatleft(0.002)
Date: 2011-02-04 10:56:22 +0100

Predicted: Good (27)
Subject: -!BE A CARNAL MANIAC!!-
From: SHERRILLANDIS@excite.com
Identifier: 0mjCq9SMMOSVA/DC2OSj5w==
Reason: P(spam)=0.000[0.500], bias=0.000, rss(0.000), rss(0.000), ^float-span(0.000), U:aid(0.001), U:loud(0.001), nba(0.999), A:User(0.002), A:least(0.002), tornadoes(0.998), tornadoes(0.998), A:metabolism(0.998), metabolism(0.998), A:NBA(0.005), ^a-style-floatleft(0.005), man!(0.005)
Date: 2011-02-04 10:56:22 +0100

Michael_Tsai · February 4, 2011, 7:40am

I don’t think the punctuation is the cause of the problem. Please follow the instructions to submit a report via e-mail.

ptroxler · February 4, 2011, 11:53am

Whatever the cause, please fix it.

Michael_Tsai · February 4, 2011, 11:57am

This is probably something that needs to be fixed on your Mac. I’m happy to help you with that, but in order to do that I need you to send in the report with your log file.

ptroxler · February 4, 2011, 12:07pm

sent.

Michael_Tsai · February 4, 2011, 12:28pm

Thanks. The log shows that despite the problems, SpamSieve was still over 98.5% accurate for the month of February and also the last two days. That’s lower accuracy than I’d like, but we’re talking about 17 spams here.

I see from the log that there are a few things that SpamSieve could be doing better, and I’ll make some adjustments for the next update.

However, I think the main cause of the lower accuracy is the accumulated data from the messages that you’ve trained it with. Elements that were previously appearing in your good messages are now appearing in your spam. To speed up the learning process, you could reset SpamSieve’s corpus, and then re-train it with a smaller number of recent messages.

ptroxler · February 5, 2011, 1:58am

I have certainly not had to do much training over, say, the last year – and now suddenly there is this weird behaviour, requiring me to retrain the thing.

But worse: on top of all this hassle you insinuate that I had trained it wrongly – while it is apparent that SpamSieve autonomously started to “autotrain” spam as good.

Michael_Tsai · February 5, 2011, 8:25am

You misunderstand. I’m not saying that you did anything wrong; I’m saying that the content (both visible and invisible) of the spams has changed over time. The data that SpamSieve is working from is still correct (the messages that were spam before are still spam), but it is now somewhat out of date (the contents of those messages are no longer as representative of the current spams). Over time, as you train it with more of the newer spams, the accuracy would probably improve, but with more than 4,200 messages previously trained and just 17 mistakes this month, you can see that the new stuff is less than half a percent of the corpus. That’s why I said it would be more efficient to reset the corpus and re-train it with recent messages.
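
To put rough numbers on that, here is a small Python sketch of the arithmetic. The two figures are the ones quoted above; the variable names are only illustrative, and 4,200 is a lower bound, so the real share would be slightly smaller still.

```python
# Figures quoted in this post; the names are illustrative, not SpamSieve settings.
trained_messages = 4200  # "more than 4,200 messages previously trained"
recent_mistakes = 17     # spams missed this month

# Share of the corpus the new spams would make up if all 17 were trained as spam.
share = recent_mistakes / (trained_messages + recent_mistakes)
print(f"{share:.2%}")  # 0.40%
```

With so small a share, the older messages dominate the statistics, which is why a smaller, more recent corpus adapts faster.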

It’s true that SpamSieve had auto-trained some of these spams as good, but that’s not the reason for the problems. The way auto-training works is that SpamSieve looks at all the incoming messages to see what it can learn from them. Since most of the time it’s correct about whether a message is good or spam, this helps it stay up-to-date without your having to do anything. You should be correcting any mistakes that SpamSieve makes, so if SpamSieve incorrectly auto-trained a message as good, the spam message would be in your inbox, and then it would undo that training when you train the message as spam.
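
As an illustration of the undo behaviour described above, here is a minimal Python sketch. It is not SpamSieve's actual implementation: the `ToyCorpus` class, its plain token counts, and the message ID are all hypothetical, made up only to show how training a message as spam can reverse an earlier auto-training as good.

```python
from collections import Counter


class ToyCorpus:
    """Hypothetical stand-in for a trained corpus (not SpamSieve's real data model)."""

    def __init__(self):
        # Token counts per category.
        self.counts = {"good": Counter(), "spam": Counter()}
        # Remember how each message was last trained so it can be undone.
        self.trained = {}

    def train(self, msg_id, tokens, label):
        # If this message was trained before (e.g. by auto-training),
        # remove its tokens from the old category first.
        if msg_id in self.trained:
            old_label, old_tokens = self.trained[msg_id]
            # Counter subtraction drops entries that fall to zero or below.
            self.counts[old_label] = self.counts[old_label] - Counter(old_tokens)
        self.counts[label].update(tokens)
        self.trained[msg_id] = (label, list(tokens))


corpus = ToyCorpus()
tokens = ["elite", "pharmacy", "here"]

corpus.train("msg-1", tokens, "good")  # mistaken auto-training as good
corpus.train("msg-1", tokens, "spam")  # user corrects it by training as spam

print(corpus.counts["good"]["pharmacy"], corpus.counts["spam"]["pharmacy"])  # 0 1
```

After the correction, the message's tokens count only toward spam, so the mistaken auto-training leaves no trace in the good counts.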

ptroxler · February 5, 2011, 8:31am

Thanks for the clarification. I’ve now reset and retrained with the 35/65 ratio you suggest … we’ll see what happens.