Thread: Corpus too large

  1. #1

    Corpus too large

    Been using this fine product since Jan 04. For grins I looked at a Training Tip and I'm told the Corpus is too large (I'm told that's OK if accuracy is OK). Since I started using the product, the stats have been pretty good (99.1%), but in the last few months a lot of spam email from .Mac has been coming in and needing to be selected and taught to SpamSieve. I'm wondering if I should take the advice and pare it down or not. Stats are below.

    Filtered Mail
    64956 Good Messages
    140997 Spam Messages (68%)
    150 Spam Messages Per Day

    SpamSieve Accuracy
    458 False Positives
    1382 False Negatives (75%)
    99.1% Correct

    Corpus
    19788 Good Messages
    85270 Spam Messages (81%)
    897400 Total Words

    Rules
    72105 Blocklist Rules
    5585 Whitelist Rules

    Showing Statistics Since
    1/29/04 11:37 AM
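The percentages in the statistics above can be checked by hand. The sketch below assumes each percentage is a plain ratio of the counts shown (for example, "Correct" as messages filtered without error over all filtered messages); the variable names are made up for illustration and this is not SpamSieve's own code.

```python
def percent(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Counts copied from the statistics posted above.
good_filtered = 64956
spam_filtered = 140997
false_positives = 458
false_negatives = 1382
corpus_good = 19788
corpus_spam = 85270

total_filtered = good_filtered + spam_filtered
total_errors = false_positives + false_negatives

# Assumption: each percentage is a simple ratio of the counts shown.
stats = {
    "spam_share": percent(spam_filtered, total_filtered),
    "false_negative_share": percent(false_negatives, total_errors),
    "correct": percent(total_filtered - total_errors, total_filtered),
    "corpus_spam_share": percent(corpus_spam, corpus_good + corpus_spam),
}

for name, value in stats.items():
    print(f"{name}: {value}%")
```

Under that assumption the results agree with the posted figures after rounding (68%, 75%, 99.1%, 81%). Note that 99.1% is an average over the whole period since January 2004, so it can hide a recent drop in accuracy.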

  2. #2

    Yes, if you've been using it for a long time and the recent accuracy is not as good as in the past, then it's time to reset the corpus. Spam has changed since 2004, so all that old data is probably holding it back and making it slower to adapt. After resetting the corpus, re-train it with a smaller number of recent messages.

  3. #3

    What exactly is 'corpus'?

  4. #4

    The corpus is a collection of messages, both spam and good, with which you have trained SpamSieve. SpamSieve uses the corpus to evaluate the contents of incoming messages to determine whether they're spam. Please see: this page about the Show Corpus command and this page about training SpamSieve.
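To make the idea of a corpus concrete, here is a toy word-count (naive Bayes) classifier. It is only an illustration of how a trained collection of good and spam messages can be used to score a new message; it is not SpamSieve's actual algorithm, and the function names and sample messages are invented for the example.

```python
import math
from collections import Counter

def train(messages):
    """Count word occurrences across a list of message strings."""
    counts = Counter()
    for text in messages:
        counts.update(text.lower().split())
    return counts

def spam_probability(text, good_counts, spam_counts, n_good, n_spam):
    """Estimate P(spam | text) with naive Bayes and add-one smoothing."""
    vocab_size = len(set(good_counts) | set(spam_counts))
    good_total = sum(good_counts.values())
    spam_total = sum(spam_counts.values())
    # Start from the share of good vs. spam messages in the corpus.
    log_good = math.log(n_good / (n_good + n_spam))
    log_spam = math.log(n_spam / (n_good + n_spam))
    for word in text.lower().split():
        # Add-one smoothing keeps unseen words from zeroing the product.
        log_good += math.log((good_counts[word] + 1) / (good_total + vocab_size))
        log_spam += math.log((spam_counts[word] + 1) / (spam_total + vocab_size))
    # Turn the two log scores into a probability of spam.
    return 1.0 / (1.0 + math.exp(log_good - log_spam))

# Hypothetical miniature corpus.
good_msgs = ["meeting tomorrow at noon", "lunch with the team"]
spam_msgs = ["cheap pills online", "cheap watches online now"]
good_counts = train(good_msgs)
spam_counts = train(spam_msgs)

def score(text):
    return spam_probability(text, good_counts, spam_counts,
                            len(good_msgs), len(spam_msgs))

print(score("cheap pills") > 0.5)   # words seen mostly in spam
print(score("team meeting") < 0.5)  # words seen mostly in good mail
```

In a scheme like this, every trained message contributes to the word counts, so a corpus dominated by years-old spam keeps weighting old vocabulary. That is consistent with the advice earlier in the thread to reset and re-train with a smaller number of recent messages.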

  5. #5
    nivag
    Guest

    Just thought I'd point out that you really shouldn't use an email address as your username; it's bound to get farmed and then added to a spamming list.

  6. #6

    Corpus too big

    Hi,

    I'm receiving quite a lot of e-mails and my corpus gets too big very fast.
    Is there no "automatic" option in SpamSieve to reduce it when needed?
    It is a pity to lose everything each time you have to erase the corpus and restart the training and all the painful tasks.
    Last edited by khani; 11-08-2006 at 07:20 AM. Reason: mis-spelling

  7. #7

    Originally posted by khani:
    "I'm receiving quite a lot of e-mails and my corpus gets too big very fast."

    How many e-mails do you receive per day, and how many do you have in the corpus right now? SpamSieve's auto-training should prevent the corpus from growing too large very fast. So the corpus should grow mainly when you train SpamSieve with messages that it put in the wrong mailbox, which is probably not enough to make it grow too fast.

    Originally posted by khani:
    "Is there no "automatic" option in SpamSieve to reduce it when needed?"

    No, because there's no simple way to reduce it in a way that provides accuracy benefits. It's better to start out with a reasonably sized corpus and then control the growth.
