View Full Version : masses of false negatives



imajes
08-03-2007, 01:34 AM
Hey,

getting many many false negatives (often with attachments).

Raw Message One -



Return-Path: <wej@home.nl>
Delivered-To: ><
Received: (qmail 13373 invoked by uid 0); 3 Aug 2007 06:02:54 -0000
Received: by simscan 1.2.0 ppid: 13367, pid: 13368, t: 1.0158s
scanners: clamav: 0.88.2/m:39/d:1524 spam: 3.0.6
X-Spam-Checker-Version: SpamAssassin 3.1.8 (2007-02-13) on yt.23i.net
X-Spam-Level: *
X-Spam-Status: No, score=1.8 required=5.0 tests=BAYES_40,HELO_DYNAMIC_OOL
autolearn=no version=3.1.8
Received: from unknown (HELO ool-4578d749.dyn.optonline.net) (69.120.215.73)
by mail.23i.net with SMTP; 3 Aug 2007 06:02:53 -0000
Received: (qmail 22675 invoked from network); Fri, 3 Aug 2007 01:20:52 -0400
Received: from unknown (HELO pra) (130.216.167.104)
by ool-4578d749.dyn.optonline.net with SMTP; Fri, 3 Aug 2007 01:20:52 -0400
Message-ID: <46B2BB34.1020600@home.nl>
Date: Fri, 3 Aug 2007 01:20:52 -0400
From: Spears <wej@home.nl>
User-Agent: Thunderbird 1.5.0.12 (Windows/20070509)
MIME-Version: 1.0
To:><
Subject: journal
Content-Type: multipart/mixed;
boundary="------------070506020405000501010708"


--------------070506020405000501010708
Content-Transfer-Encoding: 7bit
Content-Type: text/plain;
charset=iso-8859-2;
format=flowed



--------------070506020405000501010708
Content-Transfer-Encoding: base64
Content-Type: application/octet-stream;
name=journal.zip
Content-Disposition: attachment;
filename=journal.zip

<file content>


Raw Message Two -




Return-Path: <koko_mcbroom@boysdesignerwatch.com>
Delivered-To: ><
Received: (qmail 13384 invoked by uid 0); 3 Aug 2007 06:03:02 -0000
Received: by simscan 1.2.0 ppid: 13378, pid: 13379, t: 1.3899s
scanners: clamav: 0.88.2/m:39/d:1524 spam: 3.0.6
X-Spam-Checker-Version: SpamAssassin 3.1.8 (2007-02-13) on yt.23i.net
X-Spam-Level:
X-Spam-Status: No, score=-0.6 required=5.0 tests=BAYES_20,SUBJ_HAS_UNIQ_ID
autolearn=ham version=3.1.8
Received: from unknown (HELO ?87.245.184.109?) (87.245.184.109)
by mail.23i.net with SMTP; 3 Aug 2007 06:03:00 -0000
Received: from amm100 ([143.162.27.161])
by boysdesignerwatch.com.local (8.13.2/8.13.2) with SMTP id UKhCZHlJsQ4648;
Fri, 3 Aug 2007 09:20:59 +0400
Message-ID: <78135129.9BC2D54@boysdesignerwatch.com>
Date: Fri, 3 Aug 2007 09:20:58 +0400
From: koko <koko_mcbroom@boysdesignerwatch.com>
User-Agent: Thunderbird 1.5.0.12 (Windows/20070509)
MIME-Version: 1.0
To:><
Subject: investor-news-571419098
Content-Type: multipart/mixed;
boundary="------------020205030304020507050600"

--------------020205030304020507050600
Content-Type: text/plain; charset=iso-8859-2; format=flowed
Content-Transfer-Encoding: 7bit



--------------020205030304020507050600
Content-Type: application/pdf;
name="investor-news-571419098.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: inline;
filename="investor-news-571419098.pdf"

<file content>



The Log -


Predicted: Good (43)
Subject: journal
From: wej@home.nl
Identifier: b7LreuiZxWD/Vdb2Ez+3Qg==
Reason: P(spam)=0.825[0.860], bias=0.000, ^fb-journal(0.995), S:journal(0.106), X:BAYES-40(0.817), CT:iso-8859-2(0.781),
^file-size-13(0.754), R:^69(0.252), user-agent:Thunderbird1.5.0.12Windows20070509(0.728), H:Content-Disposition(0.279),
^file(0.689), ^mm(0.689), content-disposition:zip(0.631), ^fe-zip(0.631), CT:zip(0.631), CT:journal(0.373),
content-disposition:journal(0.373)
Date: 2007-08-03 06:25:01 +0100
=====================================================================
Trained: Good (Auto)
Subject: journal
Identifier: b7LreuiZxWD/Vdb2Ez+3Qg==
Actions: added rule <From (address) Is Equal to "wej@home.nl"> to SpamSieve whitelist, added to Good corpus (670)
Date: 2007-08-03 06:25:01 +0100
=====================================================================
Predicted: Good (21)
Subject: investor-news-571419098
From: koko_mcbroom@boysdesignerwatch.com
Identifier: PXAPiqdyx/S8wTZCr4dmMg==
Reason: P(spam)=0.175[0.358], bias=0.000, RP:koko(0.005), F:koko(0.005), F:koko(0.005), X:SUBJ-HAS-UNIQ-ID(0.889), ^iw-680(0.867),
^ih-800(0.867), CT:iso-8859-2(0.764), ^file-size-13(0.740), R:^com^local(0.735), X:BAYES-20(0.724),
user-agent:Thunderbird1.5.0.12Windows20070509(0.722), H:Content-Disposition(0.278), ^file(0.686), ^mm(0.686), to:james@(0.387)
Date: 2007-08-03 06:25:02 +0100
=====================================================================
Trained: Good (Auto)
Subject: investor-news-571419098
Identifier: PXAPiqdyx/S8wTZCr4dmMg==
Actions: added rule <From (address) Is Equal to "koko_mcbroom@boysdesignerwatch.com"> to SpamSieve whitelist, added to Good corpus (671)
Date: 2007-08-03 06:25:02 +0100



i appreciate that these are really hard to figure out; however, i've been training tens (if not hundreds) of these as spam:


Trained: Spam (Manual)
Subject: journal
Identifier: b7LreuiZxWD/Vdb2Ez+3Qg==
Actions: disabled rule <From (address) Is Equal to "wej@home.nl"> in SpamSieve whitelist, added rule <From (address) Is Equal to "wej@home.nl">
to SpamSieve blocklist, added to Spam corpus (1127), removed from Good corpus (670)
Date: 2007-08-03 06:31:30 +0100
=====================================================================
Mistake: False Negative
Subject: journal
Identifier: b7LreuiZxWD/Vdb2Ez+3Qg==
Classifier: Bayesian
Score: 43
Date: 2007-08-03 06:31:35 +0100
=====================================================================
Trained: Spam (Manual)
Subject: investor-news-571419098
Identifier: PXAPiqdyx/S8wTZCr4dmMg==
Actions: disabled rule <From (address) Is Equal to "koko_mcbroom@boysdesignerwatch.com"> in SpamSieve whitelist, added rule <From (address)
Is Equal to "koko_mcbroom@boysdesignerwatch.com"> to SpamSieve blocklist, added to Spam corpus (1128), removed from Good corpus (669)
Date: 2007-08-03 06:32:35 +0100
=====================================================================
Mistake: False Negative
Subject: investor-news-571419098
Identifier: PXAPiqdyx/S8wTZCr4dmMg==
Classifier: Bayesian
Score: 21
Date: 2007-08-03 06:32:40 +0100


stats:


Filtered Mail
56,246 Good Messages
24,063 Spam Messages (30%)
120 Spam Messages Per Day

SpamSieve Accuracy
105 False Positives
249 False Negatives (70%)
99.6% Correct

Corpus
669 Good Messages
1,128 Spam Messages (63%)
96,660 Total Words

Rules
22,742 Blocklist Rules
3,543 Whitelist Rules

Showing Statistics Since
16/01/2007 03:48


any tips?

best... james
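
For anyone reading the statistics above, here is a small sketch that recomputes the percentages from the raw counts. It assumes that "Correct" means one minus the share of mistakes among all filtered messages, and that the percentage next to "False Negatives" is their share of all mistakes. Those definitions are a guess that happens to reproduce the numbers shown, not something confirmed by SpamSieve's documentation.

```python
# Counts copied from the statistics posted above.
good_filtered = 56_246
spam_filtered = 24_063
false_positives = 105
false_negatives = 249
corpus_good = 669
corpus_spam = 1_128


def pct(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


total_filtered = good_filtered + spam_filtered
mistakes = false_positives + false_negatives

# Assumed definitions (see the note above the code).
spam_share = pct(spam_filtered, total_filtered)
fn_share_of_mistakes = pct(false_negatives, mistakes)
accuracy = 100.0 - pct(mistakes, total_filtered)
corpus_spam_share = pct(corpus_spam, corpus_good + corpus_spam)

print(f"spam share of filtered mail: {spam_share:.0f}%")
print(f"false negatives as share of mistakes: {fn_share_of_mistakes:.0f}%")
print(f"correct: {accuracy:.1f}%")
print(f"spam share of corpus: {corpus_spam_share:.0f}%")
```

Under these assumptions the overall accuracy is still high; the complaint in this thread is about a recent burst of attachment spam, which an all-time average since January would tend to hide.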

Michael Tsai
08-03-2007, 12:18 PM
I recommend resetting SpamSieve’s corpus (http://c-command.com/spamsieve/manual-ah/reset-corpus), updating to SpamSieve 2.6.3 (http://c-command.com/blog/2007/08/03/spamsieve-263/), and re-training SpamSieve (http://c-command.com/spamsieve/manual-ah/using-spamsieve-with-yo) with a smaller number of recent messages.

imajes
08-03-2007, 12:22 PM
thanks-- will do!

brianwestchest
08-07-2007, 01:22 PM
I recently started experiencing the same thing. SpamSieve had been working great for me. Then, suddenly, masses of false negatives, particularly from one spammer in Nevada. I kept marking his messages as spam, hoping SpamSieve would eventually pick up on the fact that K & C Enterprises in Henderson, NV was a spammer. But they kept getting through. Not being too technical, I couldn't figure out how. I was also getting a lot of other particular ones.

Finally, I completely reset my corpus and decided to look through my WhiteList rules, (which I had never done). Much to my surprise, there were many rules there that had "From" email addresses listed as WhiteList and they were NOT people I would have chosen to WhiteList. So, I wiped out all the WhiteList rules too.

Now, just a few hours later (I get a lot of spam), I checked the WhiteList rules again and there were people WhiteListed that I know I marked as spam.

So... my question is how do these WhiteList rules get generated? I suspect they may be causing me to get false negatives.

Thanks,
Brian

Michael Tsai
08-07-2007, 01:43 PM
Then, suddenly, masses of false negatives- particularly from one spammer in Nevada.


There is a FAQ page (http://c-command.com/spamsieve/manual-ah/why-is-spamsieve-not-ca) that describes what to do in situations such as this. The first step is to check the log to see whether the messages are actually false negatives (spam messages that SpamSieve thought were good) or whether they are messages that SpamSieve was not asked to filter.



Finally, I completely reset my corpus


This should only be done as a last resort, after you’ve verified (a) that the messages are false negatives, and (b) that SpamSieve thinks they’re good due to the words in the messages rather than, e.g., because the sender is in your address book and you’ve told SpamSieve never to consider such messages as spam.



So... my question is how do these WhiteList rules get generated?


As a safety feature, SpamSieve automatically adds whitelist rules for messages that it thinks are good. This doesn’t ordinarily cause any problems because if the message turns out not to be good you’ll train it as spam and then SpamSieve will disable the whitelist rule. You should not delete these rules, since doing so would make SpamSieve forget that the addresses are not good.
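
To illustrate why deleting a rule differs from leaving it disabled, here is a toy model of the behaviour described above. It is not SpamSieve's code: the class `RuleLists` and its method names are hypothetical, and the model assumes that auto-training a message as good never re-enables a rule that already exists in a disabled state.

```python
class RuleLists:
    """Toy model of whitelist/blocklist rules (hypothetical, not SpamSieve's API)."""

    def __init__(self):
        self.whitelist = {}    # sender address -> True if the rule is enabled
        self.blocklist = set()

    def auto_train_good(self, sender):
        # Assumption: a rule is added only when none exists yet,
        # so an existing disabled rule stays disabled.
        self.whitelist.setdefault(sender, True)

    def train_spam(self, sender):
        # Matches the log entries in this thread: the whitelist rule is
        # disabled (not removed) and a blocklist rule is added.
        if sender in self.whitelist:
            self.whitelist[sender] = False
        self.blocklist.add(sender)

    def delete_whitelist_rule(self, sender):
        # Deleting the rule also discards the "disabled" marker.
        self.whitelist.pop(sender, None)

    def is_whitelisted(self, sender):
        return self.whitelist.get(sender, False)


rules = RuleLists()
sender = "wej@home.nl"

rules.auto_train_good(sender)      # predicted good, rule added
rules.train_spam(sender)           # corrected by hand, rule disabled
rules.auto_train_good(sender)      # another message slips through
kept_disabled = rules.is_whitelisted(sender)

rules.delete_whitelist_rule(sender)
rules.auto_train_good(sender)      # rule is recreated, enabled again
after_delete = rules.is_whitelisted(sender)

print(kept_disabled, after_delete)
```

In this model the disabled rule acts as a memory of the correction, which is the point of the advice not to delete it.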



I suspect they may be causing me to get false negatives.


I suspect not, but you can check the log (http://c-command.com/spamsieve/manual-ah/open-log) to see for sure.
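
For readers who would rather not check the log by eye, the following is a rough sketch of the first step recommended in this thread: separating true false negatives (messages the log shows as "Predicted: Good") from messages that were trained as spam but never filtered at all. The entry layout is taken only from the excerpt posted earlier in the thread, the function names are made up for illustration, and the second identifier in the sample is invented. A real log may contain entry types this does not handle.

```python
import re

# First three entries follow the log excerpt posted in this thread;
# the last entry uses an invented identifier for illustration.
SAMPLE_LOG = """\
Predicted: Good (43)
Subject: journal
Identifier: b7LreuiZxWD/Vdb2Ez+3Qg==
=====================================================================
Trained: Spam (Manual)
Subject: journal
Identifier: b7LreuiZxWD/Vdb2Ez+3Qg==
=====================================================================
Mistake: False Negative
Subject: journal
Identifier: b7LreuiZxWD/Vdb2Ez+3Qg==
Classifier: Bayesian
Score: 43
=====================================================================
Trained: Spam (Manual)
Subject: example
Identifier: HYPOTHETICAL-ID==
"""


def parse_entries(text):
    """Split the log on separator lines and read 'Key: value' fields."""
    entries = []
    # Separator lines consist of '=' characters (possibly with stray spaces).
    for block in re.split(r"^=[= ]*$", text, flags=re.MULTILINE):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(": ")
            if sep:
                # Keep the first occurrence of each key within an entry.
                fields.setdefault(key.strip(), value.strip())
        if fields:
            entries.append(fields)
    return entries


def classify_spam_trainings(entries):
    """Label each message trained as spam by whether it was ever predicted good."""
    predicted_good = {
        e["Identifier"]
        for e in entries
        if e.get("Predicted", "").startswith("Good") and "Identifier" in e
    }
    result = {}
    for e in entries:
        if e.get("Trained", "").startswith("Spam") and "Identifier" in e:
            ident = e["Identifier"]
            result[ident] = (
                "false negative" if ident in predicted_good else "never filtered"
            )
    return result


entries = parse_entries(SAMPLE_LOG)
result = classify_spam_trainings(entries)
for ident, label in result.items():
    print(ident, "->", label)
```

A message labelled "never filtered" here points to a mail client setup problem rather than a training problem, so resetting the corpus would not help with it.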