SpamSieve missing some obvious stuff...

GloverCom · July 29, 2008, 10:32am

I’ve been a SpamSieve user for almost 4 years now, but over the last several months it’s just gotten extremely bad at recognizing obvious spam.

I’m talking about subject lines with obvious spam words such as Viagra, Penis, Cialis, $$$, etc… (And yes, I checked that they are the actual characters, and not look-alikes.)

I’m using Apple Mail. It works for a considerable amount of spam, but there are a lot of stragglers. I then “train” it with the new spam, and the next day re-spams (often the same messages) show up again.

I’m a techy, so I know it’s not a problem with my mail server or duplicates or anything like that.

What should I do? Start over? I’ve already tried to rebuild the corpus… And if I have to build my own filters for some of the obvious stuff, then it sort of defeats the purpose of having it…

It’s almost as if the corpus just “gave up”.

So any clues? I don’t wanna give up and just say, “Spam has gotten so good. It wins…”

Michael_Tsai · July 29, 2008, 10:56am

There hasn’t been an overall drop in SpamSieve’s accuracy, so the problem is likely on your Mac, either with the setup or with the training. Please check SpamSieve’s log and see what it says about the spams that you’re seeing.

GloverCom · July 29, 2008, 11:27am

Thanks for the reply Michael…

I didn’t know about the log…

Here’s an example from my log:

=============
Trained: Good (Auto)
Subject: After penis pills, you’ll have a trunk for her to swing off.
Identifier: r5g9SNM2bxZt02pKWEmpbw==
Actions: added rule <From (address) Is Equal to "lindabeosmet@dabeos.de"> to SpamSieve whitelist, added rule <From (name) Is Equal to “Jacob Richter”> to SpamSieve whitelist, added to Good corpus (5728)
Date: 2008-07-29 10:16:46 -0700

Seems kinda odd that the email address (which I’ve never seen before) would be on my SpamSieve whitelist… I certainly didn’t put it there…

So I trained it…

===================
Trained: Spam (Manual)
Subject: After penis pills, you’ll have a trunk for her to swing off.
Identifier: r5g9SNM2bxZt02pKWEmpbw==
Actions: disabled rule <From (address) Is Equal to "lindabeosmet@dabeos.de"> in SpamSieve whitelist, disabled rule <From (name) Is Equal to “Jacob Richter”> in SpamSieve whitelist, added rule <From (address) Is Equal to "lindabeosmet@dabeos.de"> to SpamSieve blocklist, added rule <From (name) Is Equal to “Jacob Richter”> to SpamSieve blocklist, added to Spam corpus (7048), removed from Good corpus (5727)
Date: 2008-07-29 10:16:54 -0700

Do you think it’s worth just clearing my whitelist completely? (Currently at 17,846 Rules…)

Michael_Tsai · July 29, 2008, 11:55am

The relevant part of the log is the “Predicted” entries, not the “Trained” entries.

SpamSieve’s auto-training feature put that address on the whitelist, because it thought the message was good. When you trained the message as spam, it realized the whitelist rule was incorrect and disabled it.
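The rule changes described here, and visible in the two log entries above, can be pictured with a small toy model. This is only an illustrative sketch, not SpamSieve's actual code: the `RuleStore` class and its method names are hypothetical, and real rules also match on the sender's name and other fields.

```python
from dataclasses import dataclass, field


@dataclass
class RuleStore:
    """Hypothetical toy model of whitelist/blocklist rules (not SpamSieve's API)."""

    whitelist: dict = field(default_factory=dict)  # sender -> rule enabled?
    blocklist: dict = field(default_factory=dict)  # sender -> rule enabled?

    def auto_train_good(self, sender: str) -> None:
        # Auto-training trusts the "good" prediction and whitelists the sender.
        self.whitelist[sender] = True

    def train_spam(self, sender: str) -> None:
        # A manual correction disables the mistaken whitelist rule (it is kept,
        # but switched off) and adds a blocklist rule for the same sender.
        if sender in self.whitelist:
            self.whitelist[sender] = False
        self.blocklist[sender] = True

    def is_whitelisted(self, sender: str) -> bool:
        return self.whitelist.get(sender, False)


store = RuleStore()
sender = "lindabeosmet@dabeos.de"
store.auto_train_good(sender)   # corresponds to "Trained: Good (Auto)"
store.train_spam(sender)        # corresponds to "Trained: Spam (Manual)"
print(store.is_whitelisted(sender), store.blocklist[sender])
```

In this model, the window between the two calls is the period during which any further mail from that sender would be passed as good.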

If you’ve already reset the corpus, why are there almost 13,000 messages in it? After re-training, there should be fewer than 1,000.

That should be harmless.

GloverCom · July 30, 2008, 9:38am

Thanks again for your help Michael… I guess I didn’t reset the corpus. It seems to me there was a rebuild option somewhere, but this was a while ago, so I’m not sure.

Anyway, here’s a really good example that came in this morning (Copied below).

Even with the extra characters on the words, it would seem this should still have been flagged as spam. (Full of percent signs, mg, Viagra?!?!)

Anything else I can do, other than start over completely?!?!

BTW, my settings are set to the most aggressive possible.

================================
Trained: Good (Auto)
Subject: Want your love back?? Check it out
Identifier: Uh6SOhsKnzfHSXIrlRe9dg==
Actions: added rule <From (address) Is Equal to "4confirmation@nationalenquirer.com"> to SpamSieve whitelist, added rule <From (name) Is Equal to “Brittany Whaley”> to SpamSieve whitelist, added to Good corpus (5736)
Date: 2008-07-30 09:31:05 -0700

[Email Contents:]

And here’s what the contents of the email were:

Visit our new online pharmacyy store and save up to 80%

We offer:- All popular drugs are aavailable (Viagra, Cialis,|_evitra and much much more )- World Wiide Shipþing-
No Doctõr Visits- No Prescriptionss- 100% Customer Satisfactionn- Cheapest Price

Today’s special offers on :
#1 Viagra, 90 x 100mg
#2 Cialiss,90 x 20mg
#3 Lévitra, 90 x 20mg

CLICK TO FIND OUT ABOUT MORÊ SPECIAL OFFERS

AND VISIT OUR NEW ONLINE PHARMACY STØRE

http://www.othersecond.com

Michael_Tsai · July 30, 2008, 9:58am

As I said, the relevant part of the log is the “Predicted” entries. You haven’t shown me any, so I have no information to work from.

Please send in your log first.

GloverCom · July 30, 2008, 10:12am

Ah… Damn… Sorry about that. I’ve totally become the idiot that I’m normally trying to do tech support for. I feel like a dork.

Attached is the log, if that helps…

Michael_Tsai · July 30, 2008, 11:04am

Thanks. First, I do think that it would help if you reset SpamSieve’s corpus and re-trained it with a smaller number of recent messages. Second, it’s best if you can correct any mistakes as soon as possible. It looks like you left SpamSieve running all night, and one mistake early on led to many more before you started making corrections in the morning. If you cannot correct mistakes promptly, you may want to turn off auto-training.

quillson · August 1, 2008, 3:40pm

Same Experience for Me
I too have noticed a significant uptick in obvious Spam being let through recently. I’ve been a SpamSieve user for a year and a half, and it used to be a lot more reliable.

I have attached my log for your review. Can you give me any tips to improve performance?

Thanks.

Michael_Tsai · August 1, 2008, 4:11pm

Looking at the last few days, there were a bunch of spam messages that got through but that you waited a long time to correct (please see above).

More seriously, it looks like you’ve received some spam messages that SpamSieve thought were good but that you didn’t train as spam. So now SpamSieve will think they are good, and it will classify similar spam messages as good.
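To see why an uncorrected mistake leads to more mistakes, here is a deliberately tiny word-count filter with auto-training. It is a sketch under stated assumptions, not SpamSieve's algorithm: `ToyFilter`, `run`, the smoothing scheme, and the sample messages are all made up for illustration.

```python
import math
from collections import Counter


class ToyFilter:
    """Hypothetical word-count Bayesian filter; not SpamSieve's algorithm."""

    def __init__(self):
        self.counts = {"good": Counter(), "spam": Counter()}

    def train(self, text, label):
        self.counts[label].update(text.lower().split())

    def correct(self, text, wrong, right):
        # Undo the mistaken auto-training, then train with the right label.
        self.counts[wrong].subtract(text.lower().split())
        self.train(text, right)

    def spam_score(self, text):
        # Log-odds of spam vs. good with add-one smoothing; > 0 means spam.
        spam_total = sum(self.counts["spam"].values())
        good_total = sum(self.counts["good"].values())
        vocab = len(set(self.counts["spam"]) | set(self.counts["good"])) or 1
        score = 0.0
        for word in text.lower().split():
            p_spam = (self.counts["spam"][word] + 1) / (spam_total + vocab)
            p_good = (self.counts["good"][word] + 1) / (good_total + vocab)
            score += math.log(p_spam / p_good)
        return score

    def predict(self, text):
        label = "spam" if self.spam_score(text) > 0 else "good"
        self.train(text, label)  # auto-training: learn from our own prediction
        return label


def run(correct_promptly):
    f = ToyFilter()
    f.train("lunch monday", "good")
    f.train("cheap pills online", "spam")
    first = "special offers pharmacy store"
    second = "pharmacy store special discount"
    label = f.predict(first)  # unfamiliar words: slips through and is auto-trained
    if correct_promptly:
        f.correct(first, wrong=label, right="spam")
    return label, f.predict(second)


print(run(correct_promptly=False))  # second, similar spam is also called good
print(run(correct_promptly=True))   # after the correction it is caught
```

In this toy setup the first message is misclassified either way; what changes is whether the similar message that follows is judged against corrected or uncorrected counts.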

In order to fix this, you need to reset the corpus and then re-train SpamSieve. After doing so, please be very careful to always train spam messages that aren’t put into the spam mailbox as spam. SpamSieve cannot do an accurate job of filtering if it’s working from incorrect information.

By the way, since log files may contain some private information, I recommend that people e-mail them to spamsieve-fn@c-command.com rather than posting them on the Web. (Thus, I’ve removed the two log files posted above.)

quillson · August 1, 2008, 4:28pm

My Training Behavior Has Not Changed…
Michael,

Thank you for your lightning-fast response!

I am puzzled. First of all, I never knew that there was a time sensitivity to training spam. For the last 18 months, I have always left my Mail client on and checking for messages every 5 minutes. Then, the next time I sit down at my computer, I go through the Inbox. It could be a few hours or a few days in between sessions.

It does not make much sense to me that SpamSieve’s effectiveness has suddenly eroded after all of this time, even though my e-mail checking/training behavior has not.

Michael_Tsai · August 1, 2008, 4:40pm

This sensitivity has been in place for several years. (The benefit is that it allows SpamSieve to learn, and keep up to date with changes in the mail that you receive, without your having to explicitly train it except with mistakes.) It matters less when SpamSieve is very accurate. If you start to encounter problems, delayed corrections (unless you turn off auto-training) will magnify them.

I noticed some (apparent) spam messages that got through but that you didn’t train as spam. Are you saying that you’ve been (not) doing that all along? Depending on the messages that you receive and train SpamSieve with, it could take a while for this to start causing visible problems. Also, at this point your corpus is probably a little old and a little too large, so that will cause a general reduction in accuracy.

quillson · August 1, 2008, 8:09pm

I have not changed my training behavior the entire time. There have probably always been some messages that slip by untrained. Your most recent explanation does make it a little clearer to me why the performance could degrade over time.

I will follow your suggestions to reset the corpus and retrain SpamSieve. Hopefully, that will regain the effectiveness. I’ll post again if it remains a problem.

Thanks.