5.4.1 Corpus

The corpus is a collection of messages, both spam and good, with which you have trained SpamSieve. SpamSieve’s Bayesian classifier analyzes the contents of these messages and uses this information to predict whether future messages are spam or good. SpamSieve manages the contents of the corpus itself. Once you’ve trained SpamSieve with a message, the information from that message is stored in the corpus, so deleting the message from your e-mail program will not affect SpamSieve.
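As a rough illustration of why deleting a trained message from the mail program does not matter, here is a minimal sketch of a corpus that keeps its own per-word message counts. It is not SpamSieve’s actual storage format or tokenizer: the `ToyCorpus` class, its `train` method, and the simple word-splitting rule are all hypothetical names and assumptions made for this example.

```python
import re
from collections import Counter


class ToyCorpus:
    """Hypothetical stand-in for a trained corpus; not SpamSieve's real format."""

    def __init__(self):
        # Number of trained messages per class.
        self.message_counts = {"spam": 0, "good": 0}
        # For each class, the number of messages in which each word occurred.
        self.word_counts = {"spam": Counter(), "good": Counter()}

    def train(self, text, label):
        # Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
        # Using a set counts each word at most once per message.
        words = set(re.findall(r"[a-z0-9']+", text.lower()))
        self.message_counts[label] += 1
        self.word_counts[label].update(words)


mailbox = ["Cheap pills, buy now"]
corpus = ToyCorpus()
corpus.train(mailbox[0], "spam")

# Deleting the message from the mailbox leaves the corpus untouched,
# because the corpus stored its own copy of the extracted information.
mailbox.clear()
```

After `mailbox.clear()`, `corpus.word_counts["spam"]["cheap"]` is still 1, which mirrors the behavior described above.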

Good Messages and Spam Messages

This section shows the lists of good and spam messages that you’ve trained, as well as ones that SpamSieve has auto-trained. If you find that SpamSieve classified a message incorrectly, it’s best to correct the mistake by training it from your mail client. However, you can also use the Train as Good/Spam commands in the corpus window to correct messages that you don’t see in the mail client. Looking at the lists of messages in the corpus is a good way to make sure that no messages were mis-trained and that no mistakes went uncorrected.

The following columns are shown:

⚑: Whether the message has been marked as flagged.
#: The number of attachments.
Subject: The message’s subject.
Received: When the message was received by your mail server.
Trained: When you (or SpamSieve’s auto-training feature) added the message to SpamSieve’s corpus.
Size: The size of the message’s Raw Source, if it’s stored by SpamSieve. You can sort by this column to find messages that are using a lot of disk space.

The Info tab shows summary information about the message itself, as well as how it was trained.

The Message tab shows a preview of the message’s contents. SpamSieve does not load remote images here, so you are protected from Web bugs. You can also use the Open in External Viewer command to open the message in your mail client.

The Raw Source tab shows the message data that SpamSieve received from your mail client. You can export a message’s raw source by dragging the message from the list to the Finder.

The Structure tab shows information about how SpamSieve interpreted the raw source.

Searching

You can search the Corpus window by entering text to match a message’s metadata or a word. A multi-word query is treated as a phrase search. Searches support wildcards such as * (which matches any number of additional characters) and ? (which matches a single character). To search for a literal wildcard character, you can escape it, e.g. \? to search for a question mark.
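The wildcard rules above can be sketched as a translation into a regular expression. This is only an illustration of the described behavior, not SpamSieve’s search implementation: the function name `wildcard_to_regex` is hypothetical, and case-insensitive substring matching is an assumption that the text above does not state.

```python
import re


def wildcard_to_regex(query):
    """Translate a wildcard query into a compiled regular expression.

    *  matches any number of characters
    ?  matches exactly one character
    \\x matches the literal character x (so \\? matches a question mark)
    """
    parts = []
    i = 0
    while i < len(query):
        ch = query[i]
        if ch == "\\" and i + 1 < len(query):
            # Escaped character: take the next character literally.
            parts.append(re.escape(query[i + 1]))
            i += 2
            continue
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
        i += 1
    # Assumption: matching ignores case.
    return re.compile("".join(parts), re.IGNORECASE)


pattern = wildcard_to_regex("via*a")
found = pattern.search("Cheap Viagra now") is not None
```

Because spaces are kept as literal characters, a multi-word query such as `buy now` behaves like a phrase search in this sketch.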

Words

This section shows the words that SpamSieve has extracted from the trained messages.

The following columns are shown:

Word: A word in the corpus. Some are words that you would see in the message, while others are tokens that describe how the message or its attachments were formatted or structured.
Spam: The number of spam messages in which the word has occurred.
Good: The number of good messages in which the word has occurred.
Total: The total number of messages in which the word has occurred.
Probability: The probability that a message is spam, given that it contains the word (and in the absence of other evidence).
Last Used: The date that the word was added or updated in the corpus, or the date that it last appeared in a received message (whichever is later).
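To make the Probability column more concrete, here is a minimal sketch of one common way to estimate a per-word spam probability from the Spam and Good counts. SpamSieve’s exact formula is not given here, so this should be read as an assumption: the function `word_spam_probability`, the equal-prior form of Bayes’ rule, and the neutral value of 0.5 for words with no evidence are all illustrative choices.

```python
def word_spam_probability(spam_count, good_count, n_spam, n_good):
    """Estimate P(spam | word) from per-word message counts.

    spam_count, good_count: messages of each kind that contain the word
    n_spam, n_good: total trained messages of each kind
    Assumes spam and good messages are equally likely a priori.
    """
    if n_spam == 0 or n_good == 0 or spam_count + good_count == 0:
        return 0.5  # no usable evidence either way
    spam_rate = spam_count / n_spam  # fraction of spam messages with the word
    good_rate = good_count / n_good  # fraction of good messages with the word
    return spam_rate / (spam_rate + good_rate)


# A word seen in 8 of 100 spam messages and 2 of 100 good messages.
p = word_spam_probability(8, 2, 100, 100)
```

With these counts the estimate is about 0.8, so the word on its own would point toward spam. A real classifier combines many such per-word estimates before deciding.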