Re: Bayesian Spam Filtering and Bogofilter

* Joseph Reagle <[email protected]> [2002-12-12 10:15-0500]
> Bayesian Spam Filtering [0] is very cool, so yesterday I set up Bogofilter
> [1] as a spam filtering technique. Presently, it's running in parallel with
> SpamAssassin. This morning, Bogofilter had 0 false positives (email that I
> care about), and 0 false negatives (emails which aren't identified as spam
> but are), whereas SA had ~5 false negatives, such as:
[..]

How much training did it take you to get there?

I had a look at Bogofilter a few months ago, and saw that I had to
(constantly) train it, which made me think that:
- there might be a fairly expensive bootstrapping process.
- since I am anal, instead of just letting it live its life and not
 caring about my spam like I do with SpamAssassin, I was going to spend
 my time training it to make it better, and better, and better!

--
Hugo Haas - http://larve.net/people/hugo/

Re: Bayesian Spam Filtering and Bogofilter

On Friday 13 December 2002 01:52 am, Hugo Haas wrote:
> How much training did it take you to get there?

It took about 5 seconds to do the initial training on good mail and spam:

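 # Register the spam folders as spam (-s), dropping stale filter headers.
 for f in A-Spam junk; do
     sed -e "/X-Spam/d" -e "/X-Bogosity/d" -e "/  * /d" "$f" | bogofilter -s
 done
 # Register the good-mail folders as non-spam (-n).
 for f in trash inbox Friends W3-Legal W3-PR W3-xkms-WG; do
     sed -e "/X-Spam/d" -e "/X-Bogosity/d" -e "/  * /d" "$f" | bogofilter -n
 done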

With that, I already had performance exceeding SA. Now, like I said, when I
nuke my junk boxes the training is refreshed automatically, which takes
about 1 sec.
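
For anyone wanting to script the refresh step, here is a rough sketch of how
the loops above could be wrapped into reusable functions. The function names
and the BOGOFILTER variable are made up for illustration. Only the -s and -n
flags come from the commands above. Unlike the original sed call, this version
anchors the patterns at the start of the line and does not try to handle
folded header continuation lines, so treat it as a starting point only.

```shell
#!/bin/sh
# Sketch only: function names and the BOGOFILTER variable are hypothetical.

# Command used for training; override it to point at another binary.
BOGOFILTER=${BOGOFILTER:-bogofilter}

# Drop header lines left by earlier SpamAssassin/bogofilter runs so that
# the filters' own verdicts are not learned as tokens.
strip_filter_headers() {
    sed -e '/^X-Spam/d' -e '/^X-Bogosity/d'
}

# train_mailboxes FLAG FILE...
# FLAG is -s (register as spam) or -n (register as non-spam).
train_mailboxes() {
    flag=$1
    shift
    for f in "$@"; do
        # Skip missing or unreadable mailboxes instead of aborting the run.
        [ -r "$f" ] || { echo "skipping $f: not readable" >&2; continue; }
        strip_filter_headers < "$f" | "$BOGOFILTER" "$flag"
    done
}
```

A refresh before emptying the junk folders would then be something like
"train_mailboxes -s A-Spam junk", followed by "train_mailboxes -n inbox" for
good mail. Depending on the bogofilter release, mailbox files may need extra
options, so check the man page for your version.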

Re: Bayesian Spam Filtering and Bogofilter

On Fri, 13 Dec 2002, Hugo Haas wrote:

> * Joseph Reagle <[email protected]> [2002-12-12 10:15-0500]
> > Bayesian Spam Filtering [0] is very cool, so yesterday I set up Bogofilter
> > [1] as a spam filtering technique. Presently, it's running in parallel with
> > SpamAssassin. This morning, Bogofilter had 0 false positives (email that I
> > care about), and 0 false negatives (emails which aren't identified as spam
> > but are), whereas SA had ~5 false negatives, such as:
> [..]
>
> How much training did it take you to get there?
>
> I had a look at Bogofilter a few months ago, and saw that I had to
> (constantly) train it, which made me think that:
> - there might be a fairly expensive bootstrapping process.

Joseph has already answered no, and I agree with him.
I just ran bogofilter over my spambox, then over my
inbox (and a few other spam-free mail files) and it was
done. Then I set up mutt keystrokes to train bogofilter
(although I've never used them - it's easier to do as
Joseph recommends, a monthly/weekly training).

The way I use bogofilter is slightly different from Joseph's, and
is probably obvious enough that it isn't worth polluting your
email with, but here goes! I run bogofilter on the server with
spamassassin. I send anything with two "Yes" votes into spambox,
and anything with one "Yes" and one "No" to a maybe box.

# bogofilter
:0fw
* ! ^X-Bogosity
| bogofilter -e -p

:0:
* ^X-Spam-Status: No
* ^X-Bogosity: Yes
$MAYBESPAMBOX

:0:
* ^X-Spam-Status: Yes
* ^X-Bogosity: No
$MAYBESPAMBOX

:0:
* ^X-Spam-Status: Yes
* ^X-Bogosity: Yes
$SPAMBOX

I don't get much in the maybe box, and everything that has made it there
has been spam, so the experiment isn't a huge success.
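
The two-vote scheme in the recipes above boils down to a small truth table.
Here is a sketch of it as a shell function, in case that is easier to read at
a glance. The function name and the printed labels are made up, and the
"No/No" case (normal delivery) is implied by the recipes rather than spelled
out in them. It is not meant to replace the procmail rules.

```shell
# route_by_votes SA_VOTE BOGO_VOTE
# Each vote is "Yes" or "No", as found in X-Spam-Status and X-Bogosity.
# Prints a hypothetical destination label: spambox, maybespambox or inbox.
route_by_votes() {
    case "$1/$2" in
        Yes/Yes)        echo spambox ;;        # both filters agree: spam
        Yes/No|No/Yes)  echo maybespambox ;;   # filters disagree
        No/No)          echo inbox ;;          # both agree: not spam
        *)              echo "unexpected votes: $1 $2" >&2; return 1 ;;
    esac
}
```

For example, "route_by_votes No Yes" prints "maybespambox", which matches the
first of the two $MAYBESPAMBOX recipes.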

One annoying thing about bogofilter is that the command-line options seem
to change between platforms/releases.

A total of about 15 minutes to install and set up.

> - since I am anal, instead of just letting it live its life and not
>   caring about my spam like I do with SpamAssassin, I was going to spend
>   my time training it to make it better, and better, and better!

Drugs may help with this problem.

dean

Re: Bayesian Spam Filtering and Bogofilter

* Dean Jackson <[email protected]> [2003-01-06 15:33+1100]
> Joseph has already answered no, and I agree with him.

So, having just done it, the answer is indeed no if you have a decent
system to install bogofilter on.

On tux, I had to compile libdb4 beforehand, which took me a while.
Anyway, I have succeeded.

> I just ran bogofilter over my spambox, then over my
> inbox (and a few other spam-free mail files) and it was
> done. Then I set up mutt keystrokes to train bogofilter
> (although I've never used them - it's easier to do as
> Joseph recommends, a monthly/weekly training).

Here is my Mutt setup (with my muttrc cpp-processing[1]):

 # Bogofilter
 # Show spam headers
 unignore X-Bogosity
 #define BOGOFILTER_NONSPAM "|bogofilter -n\n"
 #define BOGOFILTER_SPAM "|bogofilter -s\n"
 #define SYNCHRONIZE_BOGOFILTER "!unison bogofilter\n"
 macro index \eS BOGOFILTER_SPAM "Declare to bogofilter as spam"
 macro pager \eS BOGOFILTER_SPAM "Declare to bogofilter as spam"
 macro index \eN BOGOFILTER_NONSPAM "Declare to bogofilter as non-spam"
 macro pager \eN BOGOFILTER_NONSPAM "Declare to bogofilter as non-spam"
 macro index \eU SYNCHRONIZE_BOGOFILTER "Synchronize bogofilter databases"

More on Unison synchronization below.
 
> The way I use bogofilter is slightly different from Joseph's, and
> is probably obvious enough that it isn't worth polluting your
> email with, but here goes! I run bogofilter on the server with
> spamassassin. I send anything with two "Yes" votes into spambox,
> and anything with one "Yes" and one "No" to a maybe box.
[..]
> > - since I am anal, instead of just letting it live its life and not
> >   caring about my spam like I do with SpamAssassin, I was going to spend
> >   my time training it to make it better, and better, and better!
>
> Drugs may help with this problem.

Well, I decided to keep away from drugs, for now at least, and used
your "maybe folder" technique to train bogofilter.

I will train bogofilter with what is in this folder, after having fed
it with the content of my private folder for good vocabulary. I expect
to have to do a fair bit of work at the beginning, but I am sure it
will decrease fairly rapidly, and when I am satisfied with it, I will
probably stop using SpamAssassin.

Regarding the Unison[2] synchronization, I have several copies of my
mail thanks to isync[3], which means that I want to do my bogofilter
filtering locally, and then propagate the changes to my mail server
where procmail does its magic. I use Unison to do so.

One gotcha: I *think* that libdb3 and libdb4 don't use the same
format, or so it seemed when I ran a few tests. I have compiled mine
with libdb4 since Sarge's bogofilter uses libdb4. However, Woody only
has libdb3, so you will need to compile libdb4 too.

 1. http://larve.net/people/hugo/2002/04/mutt-cpp
 2. http://www.cis.upenn.edu/~bcpierce/unison/
 3. http://www.cs.hmc.edu/~me/isync/
--
Hugo Haas - http://larve.net/people/hugo/

HURL: fogo mailing list archives, maintained by Gerald Oskoboiny