Corpus too large



andrewrodney@mac.com
08-23-2006, 03:42 PM
I've been using this fine product since Jan 04. For grins I looked at a Training Tip, and it says the corpus is too large (though I'm told that's OK if accuracy is OK). Since I started using the product, the stats have been pretty good (99.1%), but in the last few months a lot of spam email from .Mac has been coming in that I have to select and teach to SpamSieve. I'm wondering whether I should take the advice and pare it down. Stats are below.

Filtered Mail
64956 Good Messages
140997 Spam Messages (68%)
150 Spam Messages Per Day

SpamSieve Accuracy
458 False Positives
1382 False Negatives (75%)
99.1% Correct

Corpus
19788 Good Messages
85270 Spam Messages (81%)
897400 Total Words

Rules
72105 Blocklist Rules
5585 Whitelist Rules

Showing Statistics Since
1/29/04 11:37 AM
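
For anyone checking the numbers, here is a small Python sketch of how the percentages above appear to follow from the raw counts. The formulas are an assumption (they happen to reproduce the figures shown), not something taken from SpamSieve's documentation, and the variable names are made up for illustration.

```python
# Figures copied from the statistics listed above.
good_filtered = 64956
spam_filtered = 140997
false_positives = 458
false_negatives = 1382
corpus_good = 19788
corpus_spam = 85270


def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


total_filtered = good_filtered + spam_filtered
errors = false_positives + false_negatives

# Assumed formulas; each one matches a percentage shown in the stats.
spam_share = percent(spam_filtered, total_filtered)
accuracy = percent(total_filtered - errors, total_filtered)
false_negative_share = percent(false_negatives, errors)
corpus_spam_share = percent(corpus_spam, corpus_good + corpus_spam)

print(f"Spam share of filtered mail: {spam_share:.0f}%")
print(f"Correct: {accuracy:.1f}%")
print(f"False negatives as share of errors: {false_negative_share:.0f}%")
print(f"Spam share of corpus: {corpus_spam_share:.0f}%")
```

Read this way, the 75% next to False Negatives is their share of all mistakes, so roughly three out of four errors are spam that slipped through rather than good mail that was caught.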

Michael Tsai
08-23-2006, 03:52 PM
Yes, if you've been using it for a long time and the recent accuracy is not as good as in the past, then it's time to reset the corpus. Spam has changed since 2004, so all that old data is probably holding it back and making it slower to adapt. After resetting the corpus, re-train it (http://c-command.com/spamsieve/manual-ah/using-spamsieve-with-yo) with a smaller number of recent messages.

Mike Guilbault
08-27-2006, 10:17 AM
What exactly is the 'corpus'?

Michael Tsai
08-27-2006, 03:03 PM
The corpus is a collection of messages, both spam and good, with which you have trained SpamSieve. SpamSieve uses the corpus to evaluate the contents of incoming messages to determine whether they're spam. Please see: this page about the Show Corpus command (http://c-command.com/spamsieve/manual-ah/show-corpus) and this page about training SpamSieve (http://c-command.com/spamsieve/manual-ah/using-spamsieve-with-yo).
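
To make the idea concrete, here is a toy Python sketch of a corpus used for word-based Bayesian scoring. It is only an illustration of the general technique (naive Bayes with add-one smoothing); it is not SpamSieve's actual algorithm or data format, and `ToyCorpus`, `train`, and `spam_probability` are hypothetical names invented for this example.

```python
import math
from collections import Counter


class ToyCorpus:
    """Hypothetical word-count corpus; not SpamSieve's real data structure."""

    def __init__(self):
        self.words = {"spam": Counter(), "good": Counter()}
        self.messages = {"spam": 0, "good": 0}

    def train(self, text, label):
        """Add one message to the corpus under 'spam' or 'good'."""
        self.messages[label] += 1
        self.words[label].update(text.lower().split())

    def spam_probability(self, text):
        """Naive Bayes estimate of P(spam | text) with add-one smoothing."""
        vocab = set(self.words["spam"]) | set(self.words["good"])
        total_msgs = sum(self.messages.values())
        log_score = {}
        for label in ("spam", "good"):
            # Prior: share of trained messages carrying this label.
            score = math.log((self.messages[label] + 1) / (total_msgs + 2))
            total_words = sum(self.words[label].values())
            for word in text.lower().split():
                count = self.words[label][word]  # 0 if never seen
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_score[label] = score
        # Turn the two log scores into a probability of spam.
        diff = log_score["good"] - log_score["spam"]
        return 1.0 / (1.0 + math.exp(diff))


corpus = ToyCorpus()
corpus.train("cheap pills buy now", "spam")
corpus.train("win money now", "spam")
corpus.train("meeting notes attached", "good")
corpus.train("lunch tomorrow at noon", "good")

p_spammy = corpus.spam_probability("buy cheap pills")
p_normal = corpus.spam_probability("notes from the meeting")
print(p_spammy > 0.5, p_normal < 0.5)
```

In this sketch, every trained message shifts the word counts a little, which is why the mix and age of the messages in the corpus affect how new mail is scored.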

nivag
08-31-2006, 08:54 AM
Just thought I'd point out that you really shouldn't use an email address as your username. It's bound to get harvested and then added to a spamming list.

khani
11-08-2006, 07:19 AM
Hi,

I'm receiving quite a lot of e-mails and my corpus gets too big very fast.
Is there no "automatic" option in SpamSieve to reduce it when needed?
It is a pity to lose everything each time you have to erase the corpus, restart training, and redo all the painful tasks.

Michael Tsai
11-08-2006, 09:52 AM
> I'm receiving quite a lot of e-mails and my corpus gets too big very fast.

How many e-mails do you receive per day, and how many do you have in the corpus right now? SpamSieve's auto-training should prevent the corpus from growing too large very fast. So the corpus should grow mainly when you train SpamSieve with messages that it put in the wrong mailbox, which is probably not enough to make it grow too fast.

> Is there no "automatic" option in SpamSieve to reduce it when needed?

No, because there's no simple way to reduce it in a way that provides accuracy benefits. It's better to start out (http://c-command.com/spamsieve/manual-ah/using-spamsieve-with-yo) with a reasonably sized corpus and then control the growth.