3.11 Do an Initial Training

SpamSieve does not have distinct “training” and “working” modes. As soon as you install it, it is always learning from the messages it sees and always filtering out the spam that it finds.

Although you can start using SpamSieve immediately and just correct any mistakes that it makes, it will do a better job of filtering if you use some of your old mail to do an initial training. This simply means that you give it some examples of messages that you consider to be spam and some that you do not. You do this by selecting some messages in your mail program and choosing a training command from the menu. For Apple Mail and Outlook, choose Train as Good or Train as Spam from the SpamSieve menu bar icon. For other clients, see the “Setting Up” section for your mail program. There’s more information about this in the Train as Good/Spam section of the manual.

[Figure: Apple Mail menus]

SpamSieve collects information from the messages it’s trained with into its corpus, which it uses to predict whether subsequent messages are spam. Don’t worry; it learns quickly!

How many messages you should train SpamSieve with depends on how many old messages you have and on how much time you want to put into the process. 195 spam messages and 105 representative good ones are enough for most people to get very good accuracy, but it’s OK if you don’t have that many. The important points are:

Do not use more than 1,000 messages: Using up to 1,000 recent messages in the initial training lets SpamSieve start out with a high level of accuracy. In general, the more messages you train SpamSieve with, the better its accuracy will be. However, using more than 1,000 messages initially would “fill up” SpamSieve’s corpus with older messages, making it slower and less effective at adapting to new kinds of spam that you’ll receive in the future.
The messages should be approximately 65% spam: For example, use 650 spams and 350 good messages, or 65 spams and 35 good messages. It is better to use fewer messages in the initial training (i.e., not use all your saved mail) than to deviate from the recommended percentage. For example, if you have 500 good messages but only 195 saved spam messages, don’t train SpamSieve with all 695 messages. Instead, train it with the 195 spams and about 105 representative good messages.
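
The arithmetic behind these two points can be sketched in a few lines of Python. This is only an illustration for working out how many messages to select; the function name plan_initial_training and its parameters are hypothetical and are not part of SpamSieve. It assumes the 1,000-message cap and the 65% spam target described above, and uses integer arithmetic so the counts never exceed what you actually have saved.

```python
def plan_initial_training(spam_available, good_available,
                          max_total=1000, spam_percent=65):
    """Return (spam_count, good_count) for an initial training.

    Hypothetical helper, not a SpamSieve API. It picks the largest
    total that stays within max_total, keeps roughly spam_percent
    of the selection as spam, and never asks for more messages of
    either kind than are available.
    """
    good_percent = 100 - spam_percent
    # Largest total that each constraint allows on its own.
    limit_by_spam = spam_available * 100 // spam_percent
    limit_by_good = good_available * 100 // good_percent
    total = min(max_total, limit_by_spam, limit_by_good)
    # Round the spam share down and give the remainder to good messages.
    spam_count = total * spam_percent // 100
    good_count = total - spam_count
    return spam_count, good_count


# The example from the text: 195 saved spams and 500 good messages.
print(plan_initial_training(195, 500))    # (195, 105)
# Plenty of both kinds: the 1,000-message cap applies.
print(plan_initial_training(5000, 5000))  # (650, 350)
```

Whichever kind of message is in shorter supply determines the size of the training set; the surplus of the other kind is simply left out.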

In order to monitor your progress, you can go to SpamSieve’s Window menu and choose Statistics. The Corpus section in the middle of the Statistics window shows how many good and spam messages SpamSieve has been trained with, and what percentage of them are spam. After the initial training, SpamSieve will automatically train itself, and you’ll only need to train it to correct mistakes.

After the initial training, you don’t have to worry about the number or percentage of messages in the corpus. SpamSieve will automatically learn from new messages as they arrive and keep its corpus properly balanced.

Accuracy will improve with time, but if you’ve used at least 100 or so messages in the initial training, SpamSieve should immediately start moving some of the incoming spam messages to your junk mailbox. If you don’t see results right away, check the setup in your mail program. After a few hundred messages of each type are in the corpus, SpamSieve should be catching most of your spam.

Now you’re done setting up SpamSieve. The Correct All Mistakes section explains how you can keep SpamSieve’s accuracy high by telling it if it puts any messages in the wrong mailbox.