Thread: Can I simply delete old corpus entries?

  1. #1

    Hi,

    Long-time, satisfied SpamSieve user. About every 12-18 months, I suddenly find that my corpus has (surprise!) grown large and my accuracy has dropped off sharply.

    Today's stats:

    Filtered Mail
    8,759 Good Messages
    19,105 Spam Messages (69%)
    37 Spam Messages Per Day

    SpamSieve Accuracy
    82 False Positives
    467 False Negatives (85%)
    98.0% Correct

    Corpus
    3,150 Good Messages
    5,101 Spam Messages (62%)
    278,600 Total Words

    Rules
    12,949 Blocklist Rules
    10,072 Whitelist Rules

    Showing Statistics Since
    1/1/08 9:40 PM


    The specter of resetting the corpus and retraining from scratch loomed before me. I shuddered.

    So today I tried something different. I went into the 2 rules lists and deleted all entries with a zero in the hits field. I then went into the corpus and deleted everything from 2008. This trimmed all those entries considerably.

    Will this work or do I really have to suck it up, reset the corpus and start fresh?

  2. #2

    Quote (originally posted by Fork):
        I went into the 2 rules lists and deleted all entries with a zero in the hits field. I then went into the corpus and deleted everything from 2008. This trimmed all those entries considerably.

        Will this work or do I really have to suck it up, reset the corpus and start fresh?

    That will speed it up, but it won’t affect the accuracy. Resetting the corpus and re-training shouldn’t be a big deal because you only need to use a few hundred messages these days. Of course, if the accuracy has dropped off suddenly, you should first check that the problem is actually with the corpus, rather than in the settings for your mail program or SpamSieve.

  3. #3

    Okay, I'll start saving up my spam until I reach 400 and then retrain with those and the 240 good messages I have already saved and filed.

    But what numbers do I want to see in the Stats window after retraining?

    Right now it shows 98% correct and 85% false positives. Shouldn't the correct number be higher, like around 99+? And what about the false positives? I've seen higher than 85% but that stat is usually much lower in the stats posted by others on this forum.

  4. #4

    Quote (originally posted by Fork):
        But what numbers do I want to see in the Stats window after retraining?

        Right now it shows 98% correct and 85% false positives. Shouldn't the correct number be higher, like around 99+?

    Yes, it should be above 99%. Right now, you have it set to show the average since January 2008. You would need to change the date in order to track the more recent statistics.

    Quote (originally posted by Fork):
        And what about the false positives? I've seen higher than 85% but that stat is usually much lower in the stats posted by others on this forum.

    You want most of the mistakes to be false negatives rather than false positives, but in most cases there’s probably not much you can do to affect this number, as it depends on the kind of mail you receive. If the overall accuracy is good, this number should be good, too.
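
For anyone trying to read the same Stats window, here is a rough Python sketch of how the percentages quoted in the first post appear to be derived. It assumes that "% Correct" is the share of all filtered messages that were classified correctly, and that the percentage next to "False Negatives" is the share of all mistakes that were false negatives (so the 85% refers to false negatives, not false positives). The function name and dictionary keys are made up for illustration; this is not SpamSieve's own code.

```python
def accuracy_summary(good, spam, false_pos, false_neg):
    """Recompute Stats-window style percentages from raw counts.

    Assumption: every percentage is a simple ratio of the counts shown
    in the window. The names here are hypothetical, not SpamSieve's.
    """
    total = good + spam                # all filtered messages
    errors = false_pos + false_neg     # all misclassified messages
    if total == 0:
        raise ValueError("no filtered messages to summarize")
    return {
        "percent_correct": 100.0 * (total - errors) / total,
        "percent_spam": 100.0 * spam / total,
        # Share of the mistakes that were missed spam (false negatives).
        "fn_share_of_errors": 100.0 * false_neg / errors if errors else 0.0,
    }


# Counts taken from the stats posted at the top of this thread.
stats = accuracy_summary(good=8759, spam=19105, false_pos=82, false_neg=467)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

With the posted counts this comes out to roughly 98.0% correct, 69% spam, and 85% of errors being false negatives, which matches the window. Because the counts accumulate from the "Showing Statistics Since" date, a long stretch of good results can hide a recent drop, which is why moving that date forward gives a more useful figure.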
