[olug] Bogofilter

William E. Kempf wekempf at cox.net
Thu Mar 27 17:15:08 UTC 2003


Eric Penne said:
> I wanted to test out bayesian filtering on my email.  I downloaded,
> compiled and installed bogofilter from Eric Raymond at
> http://bogofilter.sf.net on the olug server.

I use Ifile for the same thing.

> The program is standalone the way I'm using it but I bet there would be
> economies of scale if it was installed system wide.  I don't really have
> any information on cpu/disk usage and such.
>
> No matter what, though, it works wonderfully.

Bayes filtering is nice... though spammers are starting to figure out ways
around it.

> It bases its filtering on keywords and phrases that it stores in files
> in my .bogofilter/ directory.  Words it deems good go into a
> goodlist.db file and spam words go into a spamlist.db file.  Bogofilter
> gives each mail a spamicity between 0 and 1.  In the bogofilter file I
> have set the spamicity threshold at 0.95: a score >= 0.95 is spam and
> anything below 0.95 is not spam.  When bogofilter considers something
> to be spam it adds those words and phrases to the spamlist.db file;
> words from mail it considers good go into the goodlist.db file the same
> way.  Occasionally I get false positives and false negatives, which I
> then file into directories named FalsePositive and FalseNegative.
> Every day I log in to olug.org and run a script that reassociates the
> mail from bad to good or good to bad.
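
As a rough illustration of the scheme described above (two word lists, a
spamicity between 0 and 1, a 0.95 cutoff, and reassociation of misfiled
mail), here is a minimal Python sketch.  It is not bogofilter's code or its
actual scoring formula: the ToyFilter class, its method names, and the
add-one-smoothed naive Bayes combination are all assumptions made for
illustration.  Only the 0.95 threshold and the two-list layout come from
the description.

```python
import math
import re
from collections import Counter

SPAM_THRESHOLD = 0.95  # cutoff quoted in the message above


def tokenize(text):
    """Return the set of distinct lowercase words in a message."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class ToyFilter:
    """Hypothetical word-list filter; an illustration, not bogofilter."""

    def __init__(self):
        # words[True] stands in for spamlist.db, words[False] for goodlist.db.
        self.words = {True: Counter(), False: Counter()}
        self.messages = {True: 0, False: 0}

    def register(self, text, is_spam):
        """Add a message's words to the spam or good list."""
        self.words[is_spam].update(tokenize(text))
        self.messages[is_spam] += 1

    def unregister(self, text, is_spam):
        """Remove a message's words from a list, never going below zero."""
        for word in tokenize(text):
            self.words[is_spam][word] = max(0, self.words[is_spam][word] - 1)
        self.messages[is_spam] = max(0, self.messages[is_spam] - 1)

    def spamicity(self, text):
        """Combine per-word evidence into a score between 0 and 1."""
        log_odds = 0.0
        for word in tokenize(text):
            # Add-one smoothing keeps unseen words from forcing 0 or 1.
            p_spam = (self.words[True][word] + 1) / (self.messages[True] + 2)
            p_good = (self.words[False][word] + 1) / (self.messages[False] + 2)
            log_odds += math.log(p_spam / p_good)
        log_odds = max(-50.0, min(50.0, log_odds))  # avoid exp() overflow
        return 1.0 / (1.0 + math.exp(-log_odds))

    def classify(self, text):
        """Score a message, then register it under the verdict it was given."""
        is_spam = self.spamicity(text) >= SPAM_THRESHOLD
        self.register(text, is_spam)
        return is_spam

    def reassociate(self, text, is_spam):
        """Move a misfiled message's words to the list it belongs in."""
        self.unregister(text, not is_spam)
        self.register(text, is_spam)
```

With empty lists every message scores 0.5 and is registered as good, which
is why an untrained filter needs a round of reassociate() calls (or seed
mail) before it starts catching anything.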

Ifile allows for unlimited categorization.  This means that not only do I
catch spam, but ifile also sorts my mail into various other folders.  For
instance, mail from this group automatically goes to an "olug" folder.

> False Positives and False Negatives happen because the email has a lot
> of info that looks like spam.  In the beginning before my spamlist.db
> was built up it put all spam in my goodlist.db because it didn't have
> anything in my spamlist.db.  I pull a bunch of email down from Yahoo! so
> after reassociating 30 or 40 spam it quickly caught on to what I
> considered spam.  Now my spamlist.db file is approx 20MB and my
> goodlist.db file is approx 1MB and it has about a 0.5% error rate.
> The error is usually when I get Yahoo! reminders that get put in spam
> but I want to get them in my Inbox.

I've got spam corpuses you can use to "seed" the engine, if you're
interested.

For refiling with ifile, I use a different approach.  My procmail scripts
inject a new header, Ifile-hint, into the mail.  If it's misfiled, I just
move it to the correct folder.  A cron script then finds mail whose
Ifile-hint header doesn't match the folder it is in and "relearns" it.
This means I don't need to have extra folders cluttering my system that
aren't really used.
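
The detection half of that cron job could look roughly like the Python
sketch below.  It assumes the folders are mbox files in a single directory,
each named after its category, and that the Ifile-hint header holds the
folder name the filter originally chose; both are guesses about the layout,
and find_misfiled() is a hypothetical helper.  The relearn step itself is
left out, since it depends on how ifile is invoked on a given system.

```python
import mailbox
from pathlib import Path

HINT_HEADER = "Ifile-hint"  # header name taken from the description above


def find_misfiled(mail_dir):
    """List (folder, key, hinted_folder) for mail whose hint differs.

    Assumes every regular file in mail_dir is an mbox named after its
    folder; this layout is hypothetical.
    """
    misfiled = []
    for path in sorted(Path(mail_dir).iterdir()):
        if not path.is_file():
            continue
        box = mailbox.mbox(str(path), create=False)
        try:
            for key, message in box.iteritems():
                hint = message.get(HINT_HEADER)
                # Mail without the header was never seen by the filter.
                if hint is not None and hint.strip() != path.name:
                    misfiled.append((path.name, key, hint.strip()))
        finally:
            box.close()
    return misfiled
```

Each tuple says which folder the mail now sits in and which folder the
filter had guessed, which is what a relearn step needs in order to move the
message's words from the wrong category to the right one.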

-- 
William E. Kempf



