Synchronizing SpamSieve corpus and statistics between multiple machines

Hi there,

I’ve set up SpamSieve with Mail.app on both my home and work machines, both of which are checking the same pair of IMAP accounts. I’ve set things up such that SpamSieve properly dumps spam into the Junk folder on the server. But sometimes both machines are active, so one will inevitably get to the spam in the Inbox first and, theoretically, shuttle it off to the Junk folder before the other client sees it.

The consequence is that each copy of SpamSieve sees about half the spam, and has a totally separate corpus and set of statistics. Is there any good way to synchronize them?

Thanks,
-Chris

There is no good way to synchronize (i.e. merge) the corpus and rules from different copies of SpamSieve. If you want to use the same data on both machines, you could alias the ~/Library/Application Support/SpamSieve folder to your iPod and bring it with you. But, in general, I don’t think that’s necessary; the accuracy should be about the same with two separate copies of SpamSieve.

I would, however, caution you not to run two copies of SpamSieve on the same IMAP account at the same time. The problem is that if a message is put into the wrong folder, you won’t know which one to correct. If you train the wrong copy of SpamSieve, the other one will think it was correct, learn from that, and its accuracy will decline. To get around this problem, you could turn off auto-training, though that would also reduce the accuracy somewhat. That’s why I recommend only running one SpamSieve on an account at a time.

.Mac Synchronisation…
It would be great to have .Mac synchronisation (or even WebDAV or FTP synchronisation for non-.Mac users) of the corpus, to be able to keep two machines’ corpora in sync (e.g. a desktop and a laptop).

Maybe for version 3?

/ Hami

Is there a particular reason that you want the two corpora synchronized? I mean, I understand that in theory it would be nice if everything auto-synchronized. But at the same time, two separate copies of SpamSieve can learn independently and stay at a high level of accuracy. I’m reluctant to allocate development time to fill out a feature checklist, so I’d like to make sure that synchronization would solve a real-world problem. For example, are you adding a lot of manually-created rules that you need to have synchronized?

One of the main reasons I am evaluating SpamSieve is that I have a problem with Mail.app and its Junk filtering over two computers. I have my desktop, which I use 75% of the time, and my laptop, which is used 25%. Needless to say, the maturity of my desktop’s Mail.app Junk filter is leagues above that of my laptop’s.

When I need to use my laptop, generally when I’m away from home, I would like not to have to clear out my Inbox and train my Junk filter all over again.

This is my primary reason for looking at SpamSieve as I can copy the Corpus from my desktop to my laptop and have a fairly mature Junk filter ready to go.

I would dearly love to be able to sync the corpora between both machines and not have to dig around in the Library. The added bonus would be that my corpus would be backed up to my sync service in case of machine failure.

Sure - if it’s just me that would want this I can understand that it’s not worth the effort of development, but I can’t imagine that I am.

After using SpamSieve for only a few days, though, I am most impressed with its results - so much cleaner and more accurate than the default Mail.app Junk filter.

thanks,

/ Hami

My experience is that if you get one machine to a good accuracy level and manually sync once, then the accuracy level will remain high on the laptop, even if it’s only used 25% of the time. Please let me know how it goes for you.

I’d rather people didn’t have to dig around in the library, and it’s not that I don’t think other people would want syncing. It’s on my to-do list. The question is about how to prioritize. That’s why I’m interested in hearing about people’s specific situations.

Hi Michael,

I need to do this for Entourage with Exchange. It’s not a problem for more than one Entourage client to be accessing the Exchange server simultaneously. I’ve been using SpamSieve for a long time now, so I am confident in its accuracy. My reason for wanting to share the SpamSieve lists between my desktop and laptop Macs is for the times when I mark a message as being good, mark a domain as being safe, or reload addresses into SpamSieve. Using Exchange now eliminates the need for me to copy the Entourage database back and forth, so I would like SpamSieve to be automatic as well.

Instead of linking ~/Library/Application Support/SpamSieve to an iPod, how about linking it to an iDisk? I’m doing this for an application called GarageSale and it’s working fine, although I never run GarageSale on both computers simultaneously. Entourage/SpamSieve would be different, as I might leave the house with Entourage/SpamSieve still running on my desktop. (But what if I didn’t? Would it be ok then?)

Does SpamSieve perform any sort of file locking that would prevent one Mac from clobbering the files of the other? I can envision that, so long as I wasn’t running SpamSieve on both Macs simultaneously and enough time had elapsed for the iDisk updates to be applied, it would probably work fine. However, if both Macs updated the files at approximately the same time, one of the Macs might report an iDisk file conflict. One way around this might be a three-part setup: a shared folder which all SpamSieve instances read from, a dedicated folder which each SpamSieve writes its changes to, and a process running on each Mac that periodically rolls its changes into the shared folder, using some type of logic that spaces the updates far enough apart to eliminate iDisk conflicts.

Hmm… maybe a better scenario would not have a shared folder at all; each SpamSieve instance would just look at the other systems’ changes and roll them into its own.
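To make the "no shared folder" idea concrete, here is a rough Python sketch. It is only an illustration and not how SpamSieve stores anything: the journal file layout, the change records, and the names `record_change` and `pull_changes` are all made up. Each machine appends only to its own journal file, so two machines never write the same file, and each one reads the other journals to pick up changes it has not applied yet.

```python
import json
from pathlib import Path


def record_change(sync_dir: Path, machine: str, change: dict) -> None:
    """Append one change to this machine's own journal file (one JSON object per line)."""
    sync_dir.mkdir(parents=True, exist_ok=True)
    with open(sync_dir / f"{machine}.jsonl", "a", encoding="utf-8") as fh:
        fh.write(json.dumps(change) + "\n")


def pull_changes(sync_dir: Path, machine: str, seen: dict) -> list:
    """Return changes written by other machines that have not been consumed yet.

    `seen` maps a journal file name to the number of lines already consumed;
    it is updated in place so the next call returns only newer changes.
    """
    new_changes = []
    for journal in sorted(sync_dir.glob("*.jsonl")):
        if journal.stem == machine:
            continue  # skip our own journal; those changes are already applied locally
        lines = journal.read_text(encoding="utf-8").splitlines()
        start = seen.get(journal.name, 0)
        new_changes.extend(json.loads(line) for line in lines[start:])
        seen[journal.name] = len(lines)
    return new_changes
```

The laptop would call `pull_changes(sync_dir, "laptop", seen)` on launch and apply whatever comes back. This only avoids write conflicts; it does not decide what to do when two machines trained the same message differently.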

Or perhaps you could somehow utilize Apple SyncServices?

SpamSieve is one of the best applications I have ever used, and the ability to transparently use it across multiple macs would be a great addition for me.

Thanks very much!

I see, so you’re more interested in syncing the whitelist than the corpus and statistics?

That should work, but I don’t really recommend it because it would probably be slow, and you would have to remember to mount your iDisk before opening your mail program.

No. You need to make sure that only one copy of SpamSieve is running with that data store at a time.
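For anyone who wants to enforce the "one copy at a time" rule themselves, an advisory lock file in the shared folder is one option. The Python sketch below is an assumption-laden illustration, not a SpamSieve feature: the file name `in-use.lock` and the helper names are invented, and it only helps if every machine checks the lock before launching the mail program. Exclusive file creation may also not be atomic on a network volume such as an iDisk, so treat it as a reminder rather than a guarantee.

```python
import os
from pathlib import Path


class DataStoreBusy(Exception):
    """Raised when the lock file already exists."""


def acquire_lock(data_dir: Path, owner: str) -> Path:
    """Create the lock file exclusively, or raise DataStoreBusy if it is already there."""
    lock_path = data_dir / "in-use.lock"  # hypothetical file name
    try:
        # O_EXCL makes the create fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        holder = lock_path.read_text(encoding="utf-8").strip()
        raise DataStoreBusy(
            f"data store is in use by {holder or 'another machine'}"
        ) from None
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(owner)  # record who holds the lock, for the error message
    return lock_path


def release_lock(lock_path: Path) -> None:
    """Remove the lock file when the mail program quits."""
    lock_path.unlink(missing_ok=True)
```

A stale lock left behind by a crash would have to be removed by hand.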

FWIW, I came here looking for the same thing. I’m currently in evaluation mode.

I use my desktop system most of the time, but my laptop when I am out and about. I try to remember to exit Mail.app on my desktop before I leave the house, but sometimes it doesn’t happen. It would be a bummer if this messed up SpamSieve’s filtering.

I would love to have .Mac syncing, since my desktop is always going to be much more up to date on the latest spam than my laptop, which can go a couple of weeks between uses. Maybe I’m not typical but I see new types of spam all the time; I think my email address ended up on some kind of “known good” list. ;(

I’m not so sure about putting the data on an iDisk; in my experience they are too slow and not reliable enough to be accessed in real time. One thing to note about .Mac sync, though, is that this application’s needs are a bit different than most. Most of us will always want to replace the data on one system (laptop) with the data from another. Merge is probably not a useful option. I don’t know if .Mac allows this; all the applications I sync are merging the data.

This probably wouldn’t be a significant problem, but I wanted to mention it for completeness.

Just out of curiosity, do you sync (manually or automatically) other types of data to your laptop before you start using it?

I don’t really understand this comment. Why wouldn’t you want it to merge the messages that it was trained with on different machines into one corpus?

Yes, but only via .Mac sync.

I suppose you could, but it seems like everyone who has posted here thinks of their desktop instance as the authoritative one and just wants to move it to their laptop.

Yes, but then what happens when you receive mail on the laptop? Presumably you’d want the training to sync back to the desktop.

I guess what I’m saying is that if you only work with one machine at a time, then syncing is the same as replacing. And if you work with both independently and then connect them, what (I think) you’d want is a merge. So I think it might as well always merge.
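To illustrate why merging covers the replace case, here is a small Python sketch. It assumes, purely for illustration, that a corpus can be reduced to per-word counts and that each machine remembers the state from the last sync (`base`); SpamSieve’s real data format is not described in this thread, and `merge_counts` is a made-up name.

```python
from collections import Counter


def merge_counts(base: Counter, desktop: Counter, laptop: Counter) -> Counter:
    """Three-way merge: last synced state plus each machine's changes since then."""
    merged = Counter()
    for token in set(base) | set(desktop) | set(laptop):
        delta_desktop = desktop[token] - base[token]  # what the desktop learned
        delta_laptop = laptop[token] - base[token]    # what the laptop learned
        count = base[token] + delta_desktop + delta_laptop
        if count > 0:
            merged[token] = count
    return merged


base = Counter({"viagra": 10, "meeting": 4})
desktop = Counter({"viagra": 13, "meeting": 4, "invoice": 2})

# Laptop untouched since the last sync: the merge is just the desktop's data.
assert merge_counts(base, desktop, Counter(base)) == desktop

# Both machines trained independently: both sets of changes survive.
laptop = Counter({"viagra": 11, "meeting": 6})
assert merge_counts(base, desktop, laptop) == Counter(
    {"viagra": 14, "meeting": 6, "invoice": 2}
)
```

Subtracting the shared base matters: simply adding the two corpora together would count everything learned before the last sync twice.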

Syncing corpus would be good, but maybe extra headers?
Michael,

Syncing the corpus is definitely something that would be development time well spent. I may be wrong, but it seems like enough people are using SpamSieve on multiple computers to justify the time.

However syncing the corpus may not be proactive enough to address the problems of running two copies of SpamSieve against IMAP accounts. There may be a way to provide a solution to the primary problem without developing a full-blown sync (though a sync would still be incredibly helpful).

My background: I use Mail.app and 12 or so IMAP accounts. I have about 5 commonly used home directories with Mail.app set up the same. (I understand I’m an aberration, but most of my time is spent in my laptop home directory.) I have to keep my mail in junk folders on the server because it is entirely possible that I may never know which home directory my false positive was moved to. On top of this, I also check 5 of the accounts on my phone. I especially like to have a machine on and filtering all the time so I don’t have to wade through junk on my phone.

However, running SpamSieve on two machines makes this very difficult. Let’s say I leave my home desktop logged in and SpamSieve’ing my accounts. Let’s also assume it’s got an up-to-date corpus as do all other home directories. My home machine flags a false positive and drops it into the server’s Junk folder. I see the false positive on my laptop and run ‘Train as Good’. It drops it back into the INBOX and my home computer promptly marks it as Spam and moves it to my Junk folder. (I understand you already know this. Just want to make sure anyone following the thread gets it).

This makes running two copies of SpamSieve just plain impossible for me. So my first idea for a solution was to sync the corpus. This would have to be incredibly active for it to work. I doubt it could be fast enough to work. (But like I said, it’d still be nice). So my final idea is to take a cue from MailTags and drop something into the header when you train a piece of spam as good.

That way, when I train an email as good on my laptop and my home computer sees it again, it picks up the tag in the header, knows it’s a good email, and updates its corpus accordingly.

Sorry for the super long post, but this is, so far, the only thing that’s keeping me from truly enjoying the benefits of your product all the time.

thanks.peet

I agree that syncing the corpus would not be fast enough to solve this problem. However, if all the copies of SpamSieve are properly trained, there should be very few false positives, so this should be a rare occurrence. Adding a header the way MailTags does is problematic because modifying the message content is totally unsupported by Mail and because there would then be the problem of spammers forging the header.
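On the forging point only: a header can be made hard to fake if its value is a keyed hash (HMAC) computed with a secret that only the user’s own machines know. The Python sketch below is hypothetical throughout. The header name `X-Trained-Good`, the shared secret, and the function names are invented, and it does nothing about the other objection, that Mail does not support modifying message content.

```python
import hashlib
import hmac
from email.message import EmailMessage

HEADER = "X-Trained-Good"  # hypothetical header name


def signature(secret: bytes, message_id: str) -> str:
    """Keyed hash of the Message-ID; cannot be computed without the secret."""
    return hmac.new(secret, message_id.encode("utf-8"), hashlib.sha256).hexdigest()


def tag_as_good(msg: EmailMessage, secret: bytes) -> None:
    """Mark a message as trained-good by attaching a signed header."""
    message_id = msg["Message-ID"]
    if message_id is None:
        raise ValueError("message has no Message-ID to sign")
    del msg[HEADER]  # drop any existing (possibly forged) value first
    msg[HEADER] = signature(secret, str(message_id))


def is_trusted_good(msg: EmailMessage, secret: bytes) -> bool:
    """True only if the header is present and matches our own signature."""
    claimed = msg.get(HEADER)
    message_id = msg["Message-ID"]
    if claimed is None or message_id is None:
        return False
    expected = signature(secret, str(message_id))
    # Compare as bytes in constant time; a forged value simply fails to match.
    return hmac.compare_digest(
        str(claimed).encode("utf-8", "replace"), expected.encode("utf-8")
    )
```

A spammer can add the header but cannot produce a valid value without the secret, so the other machine would ignore a forged tag.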