[olug] Bogofilter

Eric Penne epenne at olug.org
Thu Mar 27 17:06:48 UTC 2003


I wanted to test out Bayesian filtering on my email.  I downloaded,
compiled, and installed bogofilter (by Eric Raymond, from
http://bogofilter.sf.net) on the olug server.

The program is standalone the way I'm using it, but I bet there would be
economies of scale if it were installed system-wide.  I don't have any
numbers on CPU or disk usage.

Either way, it works wonderfully.

It bases its filtering on keywords and phrases that it stores in files in
my .bogofilter/ directory.  Words it deems good go into a goodlist.db
file and spam words go into a spamlist.db file.  Bogofilter gives each
mail a spamicity between 0 and 1.  In the bogofilter file I have set the
spamicity threshold to 0.95:  >= 0.95 is spam and < 0.95 is not spam.
When bogofilter considers a message to be spam, it adds that message's
words and phrases to the spamlist.db file.  Likewise, the words from mail
it considers good go into the goodlist.db file.  Occasionally I get False
Positives and False Negatives, which I file into directories named
FalsePositive and FalseNegative.  Every day I log in to olug.org and run a
script that reassociates that mail from bad to good or good to bad.
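
To make the word-list idea concrete, here is a toy sketch in Python.  It
is NOT bogofilter's actual algorithm or file format.  The class ToyFilter,
its method names, and the smoothing constants are all made up for
illustration.  Only the 0.95 threshold and the good/spam word lists come
from the setup described above.

```python
from collections import Counter

THRESHOLD = 0.95  # same cutoff as in the post: >= 0.95 is spam


class ToyFilter:
    """Toy word-count filter (hypothetical; not bogofilter's real algorithm)."""

    def __init__(self):
        self.good = Counter()  # stands in for goodlist.db
        self.spam = Counter()  # stands in for spamlist.db

    def register(self, words, is_spam):
        """Add a message's words to the spam list or the good list."""
        (self.spam if is_spam else self.good).update(words)

    def reassociate(self, words, to_spam):
        """Move a misfiled message's words from one list to the other."""
        src, dst = (self.good, self.spam) if to_spam else (self.spam, self.good)
        for w in words:
            if src[w] > 0:
                src[w] -= 1
            dst[w] += 1

    def spamicity(self, words):
        """Combine per-word spam probabilities into a score between 0 and 1."""
        p_spam = p_good = 1.0
        for w in set(words):
            s, g = self.spam[w], self.good[w]
            if s + g == 0:
                continue  # word never seen: no evidence either way
            p = (s + 0.5) / (s + g + 1.0)  # smoothed so p is never 0 or 1
            p_spam *= p
            p_good *= 1.0 - p
        return p_spam / (p_spam + p_good)

    def is_spam(self, words):
        return self.spamicity(words) >= THRESHOLD
```

A message made only of unknown words scores 0.5 here, so with empty lists
nothing reaches the 0.95 cutoff.  A real filter would also need to guard
against floating-point underflow on long messages, which this sketch
ignores.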

False Positives and False Negatives happen because the email has a lot of
info that looks like spam.  In the beginning, before my spamlist.db was
built up, it put all spam in my goodlist.db because it didn't have
anything in my spamlist.db.  I pull a bunch of email down from Yahoo!, so
after reassociating 30 or 40 spam messages it quickly caught on to what I
considered spam.  Now my spamlist.db file is approximately 20MB, my
goodlist.db file is approximately 1MB, and it has about a 0.5% error
rate.  The errors are usually Yahoo! reminders that get put in spam when
I want them in my Inbox.

The following is what bogofilter adds to all incoming email:

X-Bogosity: Yes, tests=bogofilter, spamicity=1.000000, version=0.11.1.3

The X-Bogosity: header is used by procmail to put the email in certain
folders.  In this case, Yes means it is spam based on the spamicity value
of 1.  The beginning of my .procmailrc file follows:

:0fw
| bogofilter -u -e -p

# if bogofilter failed, return the mail to the queue, the MTA will
# retry to deliver it later
# 75 is the value for EX_TEMPFAIL in /usr/include/sysexits.h

:0e
{ EXITCODE=75 HOST }

# file the mail in the Spam folder if it's spam.

:0:
* ^X-Bogosity: Yes, tests=bogofilter
Spam

:0:
* ^Subject:.*\[olug\]
Lists/OLUG
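
For anyone who wants to check the same routing logic outside procmail,
here is a rough Python sketch of what the two delivery recipes do.  The
function names parse_bogosity and route are made up for this example, and
the "Inbox" fallback is an assumption standing in for procmail's default
delivery.  Matching is case-insensitive because procmail conditions are
case-insensitive by default.

```python
import re
from email import message_from_string

# Shape of the header value shown above, e.g.
# "Yes, tests=bogofilter, spamicity=1.000000, version=0.11.1.3"
BOGOSITY_RE = re.compile(
    r"(Yes|No), tests=bogofilter, spamicity=([0-9.]+)", re.IGNORECASE
)


def parse_bogosity(value):
    """Return (is_spam, spamicity), or None if the value does not match."""
    m = BOGOSITY_RE.match(value)
    if m is None:
        return None
    return m.group(1).lower() == "yes", float(m.group(2))


def route(raw_message):
    """Pick a folder in the same order as the procmail recipes."""
    msg = message_from_string(raw_message)
    parsed = parse_bogosity(msg.get("X-Bogosity", ""))
    if parsed is not None and parsed[0]:
        return "Spam"
    if re.search(r"\[olug\]", msg.get("Subject", ""), re.IGNORECASE):
        return "Lists/OLUG"
    return "Inbox"  # assumption: stands in for procmail's default delivery
```

As in the .procmailrc, the spam check comes first, so a list message that
bogofilter marks as spam goes to Spam rather than Lists/OLUG.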

*************************************************************************


Hopefully this is OK with Brian.  If not, I will simply remove it.  There
are probably many other Bayesian filters out there that can do a better
job, but this one was pretty simple.  I'm hoping Brian will install a
Bayesian filter on the OLUG computer that we can all choose to use or not.



