Bayesian Spam Filtering and Bogofilter

from Joseph Reagle <reagle@mit.edu>, Thu, 12 Dec 2002 10:15:19 -0500

Replies:

Parents:

None.

Bayesian Spam Filtering [0] is very cool, so yesterday I set up Bogofilter
[1] as a spam filtering technique. Presently, it's running in parallel with
SpamAssassin. This morning, Bogofilter had 0 false positives (email that I
care about flagged as spam), and 0 false negatives (spam that isn't
identified as such), whereas SA had ~5 false negatives, such as:
X-Spam-Status: No, hits=0.0 required=2.2 tests= version=2.20
X-Spam-Level:
X-Bogosity: Yes, tests=bogofilter, spamicity=0.633510, version=0.9.1.2
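
When comparing the two filters over many messages, it can help to pull the verdict and score out of the X-Bogosity header. The following is a minimal sketch that assumes the header layout shown in the example above; the function name `bogosity_of` is made up for illustration.

```shell
#!/bin/sh
# Print "<verdict> <spamicity>" for the first X-Bogosity header on stdin.
# Assumes the layout "X-Bogosity: Yes, tests=bogofilter, spamicity=0.63, ...".
bogosity_of() {
  sed -n 's/^X-Bogosity: \([A-Za-z]*\),.*spamicity=\([0-9.]*\).*/\1 \2/p' | head -n 1
}

printf '%s\n' \
  'X-Spam-Status: No, hits=0.0 required=2.2 tests= version=2.20' \
  'X-Bogosity: Yes, tests=bogofilter, spamicity=0.633510, version=0.9.1.2' |
  bogosity_of
# prints: Yes 0.633510
```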

The other neat thing is that I used to do my filtering during the fetch to
my local machine, but SA was so slow I had to have the mail hosts do that
for me. Bogofilter (written in C) is very *fast* and has no noticeable
effect on my fetchmailing. Highly recommended!

[0] http://www.paulgraham.com/spam.html
[1] http://www.tuxedo.org/~esr/bogofilter/

Email not directed to me or to a list ends up in A-Spam (non-egregious
spam that I occasionally review), A-Bulk (probably not spam but I'm on a
bcc), or A-Admin (bounces and such). I occasionally delete ('nukem') these
emails. My trash is set to 30 day expiration and an 8M limit, so no need to
clog up my real trash with this junk. Now when I nuke these mailboxes, I
also use them for bogofilter training (the commented bit is what I used
for the initial non-spam training). There are other strategies and mutt
bindings available.

#!/bin/bash

pushd ~/Mail

sed -e "/X-Spam/d" -e "/X-Bogosity/d" -e "/ * /d" A-Spam | bogofilter -s
sed -e "/X-Spam/d" -e "/X-Bogosity/d" -e "/ * /d" A-Bulk | bogofilter -n
sed -e "/X-Spam/d" -e "/X-Bogosity/d" -e "/ * /d" A-Trash | bogofilter -n

rm A-Bulk A-Spam A-Admin A-Trash; touch A-Bulk A-Spam A-Admin A-Trash

popd

#for f in {trash,inbox,Friends,W3-Legal,W3-PR,W3-xkms-WG}; do
# sed -e "/X-Spam/d" -e "/X-Bogosity/d" -e "/ * /d" $f | bogofilter -n;
#done
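
The point of the sed step is to keep the filters' own verdict headers out of the training data. As a hedged alternative sketch (not the script above), the same idea can be restricted to the header section of each message, so that body lines are never touched and folded continuation lines of a removed header go with it. The function name `strip_filter_headers` is an assumption made for illustration, and the input is assumed to be an mbox file whose messages start with a "From " line.

```shell
#!/bin/sh
# Remove X-Spam* and X-Bogosity headers (plus their folded continuation
# lines) from an mbox on stdin; message bodies pass through unchanged.
strip_filter_headers() {
  awk '
    /^From / { in_body = 0; skip = 0; print; next }  # mbox separator: new message
    in_body  { print; next }                         # never touch body lines
    /^$/     { in_body = 1; skip = 0; print; next }  # blank line ends the headers
    /^(X-Spam|X-Bogosity)/ { skip = 1; next }        # drop the filter header
    skip && /^[ \t]/       { next }                  # and its continuation lines
    { skip = 0; print }
  '
}

# Usage, with the training flags used in the script above:
#   strip_filter_headers < A-Spam | bogofilter -s
#   strip_filter_headers < A-Bulk | bogofilter -n
```

Unlike the sed version, this leaves a body line that happens to contain "X-Spam" in place.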

Re: Bayesian Spam Filtering and Bogofilter

from Hugo Haas <hugo@larve.net>, Fri, 13 Dec 2002 07:52:24 +0100

Replies:

Parents:

reagle

* Joseph Reagle <reagle@mit.edu> [2002-12-12 10:15-0500]
> Bayesian Spam Filtering [0] is very cool, so yesterday I set up Bogofilter
> [1] as a spam filtering technique. Presently, it's running in parallel with
> SpamAssassin. This morning, Bogofilter had 0 false positives (email that I
> care about), and 0 false negatives (emails which aren't identified as spam
> but are), whereas SA had ~5 false negatives, such as:
[..]

How much training did it take you to get there?

I had a look at Bogofilter a few months ago, and saw that I had to
(constantly) train it which made me think that:
- there was some maybe fairly expensive bootstrapping process.
- since I am anal, instead of just letting it live its life and not
caring about my spam like I do with SpamAssassin, I was going to spend
my time training it to make it better, and better, and better!

--
Hugo Haas - http://larve.net/people/hugo/

Re: Bayesian Spam Filtering and Bogofilter

from Joseph Reagle <reagle@mit.edu>, Fri, 13 Dec 2002 08:13:45 -0500

Replies:

None.

Parents:

reagle
hugo

On Friday 13 December 2002 01:52 am, Hugo Haas wrote:
> How much training did it take you to get there?

It took about 5 seconds to do the initial training on spam and good mail:

for f in A-Spam junk; do
  sed -e "/X-Spam/d" -e "/X-Bogosity/d" -e "/ * /d" "$f" | bogofilter -s
done
for f in trash inbox Friends W3-Legal W3-PR W3-xkms-WG; do
  sed -e "/X-Spam/d" -e "/X-Bogosity/d" -e "/ * /d" "$f" | bogofilter -n
done

Then I already had performance exceeding SA. Now, like I said, when I nuke
my junk boxes it takes about 1 sec to refresh the training.

Re: Bayesian Spam Filtering and Bogofilter

from Dean Jackson <dean@w3.org>, Mon, 6 Jan 2003 15:33:57 +1100

Replies:

hugo

Parents:

reagle
hugo

On Fri, 13 Dec 2002, Hugo Haas wrote:

> * Joseph Reagle <reagle@mit.edu> [2002-12-12 10:15-0500]
> > Bayesian Spam Filtering [0] is very cool, so yesterday I set up Bogofilter
> > [1] as a spam filtering technique. Presently, it's running in parallel with
> > SpamAssassin. This morning, Bogofilter had 0 false positives (email that I
> > care about), and 0 false negatives (emails which aren't identified as spam
> > but are), whereas SA had ~5 false negatives, such as:
> [..]
>
> How much training did it take you to get there?
>
> I had a look at Bogofilter a few months ago, and saw that I had to
> (constantly) train it which made me think that:
> - there was some maybe fairly expensive bootstrapping process.

Joseph has already answered no, and I agree with him.
I just ran bogofilter over my spambox, then over my
inbox (and a few other spam-free mail files) and it was
done. Then I set up mutt keystrokes to train bogofilter
(although I've never used them - it's easier to do as
Joseph recommends, a monthly/weekly training).

The way I use bogofilter is slightly different from Joseph's, and
is probably obvious enough that it isn't worth polluting your
email with, but here goes! I run bogofilter on the server with
spamassassin. I send anything with two "Yes" votes into spambox,
and anything with one "Yes" and one "No" to a maybe box.

# bogofilter
:0fw
* ! ^X-Bogosity
| bogofilter -e -p

:0:
* ^X-Spam-Status: No
* ^X-Bogosity: Yes
$MAYBESPAMBOX

:0:
* ^X-Spam-Status: Yes
* ^X-Bogosity: No
$MAYBESPAMBOX

:0:
* ^X-Spam-Status: Yes
* ^X-Bogosity: Yes
$SPAMBOX

I don't get much in the maybe box, and everything that has made it has
been spam, so the experiment isn't a huge success.
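
For testing the two-vote rule outside procmail, the same decision can be sketched as a small shell function. This is an illustration only: the function names `verdict_of` and `vote` are made up, and the output labels stand in for $SPAMBOX, $MAYBESPAMBOX and normal delivery. As in the recipes, a message missing either header falls through to the default.

```shell
#!/bin/sh
# Print the first word after "<Header>: " found in the given header text.
verdict_of() {
  printf '%s\n' "$2" | sed -n "s/^$1: \([A-Za-z]*\).*/\1/p" | head -n 1
}

# Classify a message on stdin: "spam" if both filters say Yes,
# "maybe" if exactly one says Yes and the other says No, else "inbox".
vote() {
  headers=$(sed '/^$/q')        # read only up to the first blank line
  sa=$(verdict_of X-Spam-Status "$headers")
  bogo=$(verdict_of X-Bogosity "$headers")
  case "$sa,$bogo" in
    Yes,Yes)        echo spam ;;
    Yes,No|No,Yes)  echo maybe ;;
    *)              echo inbox ;;
  esac
}
```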

One annoying thing about bogofilter is that the command line options seem to
change between platforms/releases.

A total of about 15 minutes to install and set up.

> - since I am anal, instead of just letting it live its life and don't
> care about my spam like I do with SpamAssassin, I was going to spend
> my time training it to make it better, and better, and better!

Drugs may help with this problem.

dean

Re: Bayesian Spam Filtering and Bogofilter

from Hugo Haas <hugo@larve.net>, Sat, 11 Jan 2003 13:36:56 +0100

Replies:

None.

Parents:

* Dean Jackson <dean@w3.org> [2003-01-06 15:33+1100]
> Joseph has already answered no, and I agree with him.

So, having just done it, the answer is indeed no if you have a decent
system to install bogofilter on.

On tux, I had to compile libdb4 beforehand, which took me a while.
Anyway, I have succeeded.

> I just ran bogofilter over my spambox, then over my
> inbox (and a few other spam-free mail files) and it was
> done. Then I set up mutt keystrokes to train bogofilter
> (although I've never used them - it's easier to do as
> Joseph recommends, a monthly/weekly training).

Here is my Mutt setup (with my muttrc cpp-processing[1]):

# Bogofilter
# Show spam headers
unignore X-Bogosity
#define BOGOFILTER_NONSPAM "|bogofilter -n\n"
#define BOGOFILTER_SPAM "|bogofilter -s\n"
#define SYNCHRONIZE_BOGOFILTER "!unison bogofilter\n"
macro index \eS BOGOFILTER_SPAM "Declare to bogofilter as spam"
macro pager \eS BOGOFILTER_SPAM "Declare to bogofilter as spam"
macro index \eN BOGOFILTER_NONSPAM "Declare to bogofilter as non-spam"
macro pager \eN BOGOFILTER_NONSPAM "Declare to bogofilter as non-spam"
macro index \eU SYNCHRONIZE_BOGOFILTER "Synchronize bogofilter databases"

More on Unison synchronization below.

> The way I use bogofilter is slightly different from Joseph, and
> is probably obvious enough that it isn't worth polluting your
> email with, but here goes! I run bogofilter on the server with
> spamassassin. I send anything with two "Yes" votes into spambox,
> and anything with one "Yes" and one "No" to a maybe box.
[..]
> > - since I am anal, instead of just letting it live its life and don't
> > care about my spam like I do with SpamAssassin, I was going to spend
> > my time training it to make it better, and better, and better!
>
> Drugs may help with this problem.

Well, I decided to keep away from drugs, for now at least, and used
your "maybe folder" technique to train bogofilter.

I will train bogofilter with what is in this folder, after having fed
it with the content of my private folder for good vocabulary. I expect
to have to do a fair bit of work at the beginning, but I am sure it
will decrease fairly rapidly, and when I am satisfied with it, I will
probably stop using SpamAssassin.

Regarding the Unison[2] synchronization, I have several copies of my
mail thanks to isync[3], which means that I want to do my bogofilter
filtering locally, and then propagate the changes to my mail server
where procmail does its magic. I use Unison to do so.

One gotcha: I *think* that libdb3 and libdb4 don't use the same
format, or so it seemed when I ran a few tests. I have compiled mine
with libdb4 since Sarge's bogofilter uses libdb4. However, Woody only
has libdb3, so you will need to compile libdb4 too.

1. http://larve.net/people/hugo/2002/04/mutt-cpp
2. http://www.cis.upenn.edu/~bcpierce/unison/
3. http://www.cs.hmc.edu/~me/isync/
--
Hugo Haas - http://larve.net/people/hugo/

Re: Bayesian Spam Filtering and Bogofilter

from Gerald Oskoboiny <gerald@impressive.net>, Tue, 11 Feb 2003 01:23:41 -0500

Replies:

Parents:

reagle

* Joseph Reagle <reagle@mit.edu> [2002-12-12 10:15-0500]
>
> Bayesian Spam Filtering [0] is very cool, so yesterday I set up
> Bogofilter [1] as a spam filtering technique. Presently, it's
> running in parallel with SpamAssassin.

I switched to Bogofilter (from Spamassassin) last Thursday, and
I'm really happy with it so far. My mail still gets labelled by
spamassassin on the way in, but I don't use it to decide where
my mail goes any more.

My .procmailrc currently has a bunch of list-specific filters,
then my whitelist stuff, then uses "bogofilter -u -e -p" on
anything from senders not in my whitelist. (I wasn't using -u
until just now; I didn't like the thought of it auto-training
itself, but I think it'll be ok. I might even set up a couple
honeypots to collect spam and feed my bogofilter ratings.)

I decided to keep the spamassassin-labelling stuff around because
bogofilter can learn probability ratings for spamassassin tokens,
e.g. "user_agent_mutt" currently has pgood=0.007957, pbad=0.000354.
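
To see what such a pair of ratings implies, here is a rough sketch of the per-token probability in the style of Paul Graham's essay [0]: the token's frequency in spam divided by the sum of its frequencies in spam and non-spam, clamped to [0.01, 0.99]. It is an assumption that pgood and pbad are those two frequencies, Graham's doubling of the good counts is left out, and bogofilter's actual calculation may differ. The function name `token_spamicity` is made up for illustration.

```shell
#!/bin/sh
# Graham-style spam probability of one token, given its frequency in
# non-spam (pgood) and in spam (pbad). Simplified: no doubling of good counts.
token_spamicity() {
  awk -v pgood="$1" -v pbad="$2" 'BEGIN {
    if (pgood + pbad == 0) { print "unseen"; exit }  # token never trained on
    p = pbad / (pgood + pbad)
    if (p < 0.01) p = 0.01                           # clamp as in the essay
    if (p > 0.99) p = 0.99
    printf "%.4f\n", p
  }'
}

token_spamicity 0.007957 0.000354
# prints: 0.0426
```

Under these assumptions the token counts strongly toward non-spam, which fits a header that marks mail sent from Mutt.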

I initially trained bogofilter on a bunch of spam and non-spam I
had around, and I wasn't sure if I should do it with the whole
mailbox as input or with individual messages, but after doing
some testing just now it seems it doesn't matter. (bogofilter
must have some heuristics to recognize individual messages)

I wrote a little script [2] to use as a wrapper around bogofilter
when I am training it from within Mutt, because it seems to take
about 2-3 seconds when invoked in training mode. (this script
returns immediately and calls bogofilter in the background.)
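
A wrapper of that kind might look roughly like the sketch below. This is not the script at [2]; it is a hypothetical illustration of the idea (spool the message, then train in the background), and the function name `train_in_background` and the BOGOFILTER override variable are assumptions. Only the -s and -n flags come from this thread.

```shell
#!/bin/sh
# Hypothetical wrapper: copy the message from stdin to a temp file, then
# train in the background so the caller (e.g. a Mutt macro) returns at once.
# BOGOFILTER may be set to another command for testing.
train_in_background() {
  flag=$1                       # -s (spam) or -n (non-spam)
  tmp=$(mktemp) || return 1
  cat > "$tmp"                  # stdin must be fully read before returning
  (
    "${BOGOFILTER:-bogofilter}" "$flag" < "$tmp"
    rm -f "$tmp"                # clean up once training has finished
  ) > /dev/null 2>&1 &
}

# Usage inside a script that Mutt pipes a message to:
#   train_in_background "$1"
```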

And here's my procmailrc [3] and muttrc [4].

> [0] http://www.paulgraham.com/spam.html
> [1] http://www.tuxedo.org/~esr/bogofilter/

[2] http://impressive.net/people/gerald/2003/02/train-bogofilter
[3] http://impressive.net/people/gerald/misc/dotfiles/procmailrc
[4] http://impressive.net/people/gerald/misc/dotfiles/muttrc

--
Gerald Oskoboiny <gerald@impressive.net>
http://impressive.net/people/gerald/

Re: Bayesian Spam Filtering and Bogofilter

from Joseph Reagle <reagle@mit.edu>, Tue, 11 Feb 2003 08:59:19 -0500

Replies:

Parents:

On Tuesday 11 February 2003 01:23, Gerald Oskoboiny wrote:
> I switched to Bogofilter (from Spamassassin) last Thursday, and
> I'm really happy with it so far. My mail still gets labelled by
> spamassassin on the way in, but I don't use it to decide where
> my mail goes any more.

I too no longer use SpamAssassin as a filtering criterion, though I still run
it to get rid of most of the horrid email on the pop server, and compare
the results. I'm fairly happy with bogofilter, though it does let some
stupid spam through occasionally, and isn't catching the latest spams, which
are just a non-spammy natural language sentence or two and a link. (Also, it
still misses some html mail, and I'm willing to consider that as very
probable spam from the start.)

> I decided to keep the spamassassin-labelling stuff around because
> bogofilter can learn probability ratings for spamassassin tokens,
> e.g. "user_agent_mutt" currently has pgood=0.007957, pbad=0.000354.

What do you mean, the SpamAssassin headers are still part of the email when
you run bogofilter? That's just cheating then, right? I originally trained
bogofilter with the headers included and was stunned by its performance
when the headers were present, but not when they weren't, so that's why I run:
sed -e "/X-KMail/d" -e "/X-Spam/d" -e "/X-Bogosity/d" -e "/ * /d" $f

But the combination of SA's features/heuristics and Bayesian filtering will
rock, and I'm looking forward to playing with that feature in the new
version of SA.

Re: Bayesian Spam Filtering and Bogofilter

from Joseph Reagle <reagle@mit.edu>, Tue, 11 Feb 2003 09:21:50 -0500

Replies:

None.

Parents:

On Tuesday 11 February 2003 09:05, Dan Brickley wrote:
> Not cheating at all! This isn't a competition between SA and bogofilter;

Well, it is when I'm testing them. <smile/>

> if the bogofilter algorithms can be used to take into account features
> that SA detects, so much the better.

Ok, so the folks that are using bogofilter in this mode, do you find that
bogofilter is then able to correct SA's false negatives and positives, or
is it just parroting what you would've learned from SA in the first place?
(I expect the SA+bayesian to add value, but didn't with bogofilter...)

Re: Bayesian Spam Filtering and Bogofilter

from Dan Brickley <danbri@w3.org>, Tue, 11 Feb 2003 09:05:04 -0500

Replies:

reagle

Parents:

* Joseph Reagle <reagle@mit.edu> [2003-02-11 08:59-0500]
> On Tuesday 11 February 2003 01:23, Gerald Oskoboiny wrote:
> > I switched to Bogofilter (from Spamassassin) last Thursday, and
> > I'm really happy with it so far. My mail still gets labelled by
> > spamassassin on the way in, but I don't use it to decide where
> > my mail goes any more.
>
> I too no longer use spamassasin as a filtering criteria though I still run
> it to get rid of most of the horrid email on the pop server, and compare
> the results. I'm fairly happy with bogofilter though it does let some
> stupid spam through occasionally, and isn't catching the latest spams which
> is just a non-spammy natural language sentence or two and a link. (Also, it
> still misses some html mail, and I'm willing to consider that as very
> probable spam from the start.)
>
> > I decided to keep the spamassassin-labelling stuff around because
> > bogofilter can learn probability ratings for spamassassin tokens,
> > e.g. "user_agent_mutt" currently has pgood=0.007957, pbad=0.000354.
>
> What do you mean, the spam-assassin headers are still part of the email when
> you run bogofilter? That's just cheating then, right? (I originally trained
> bogofilter with the headers included and was stunned by it's performance
> when the headers were present, but not when they weren't, so that's why I :
> sed -e "/X-KMail/d" -e "/X-Spam/d" -e "/X-Bogosity/d" -e "/ * /d" $f

Not cheating at all! This isn't a competition between SA and bogofilter; if the
bogofilter algorithms can be used to take into account features that SA detects,
so much the better.

Very loose analogy: it's like a multi-layer feedforward neural network, where the earlier
layers reorganise the data and emphasise salient features in such a way as to make
it easier for later layers to do even more useful processing...

Dan

Re: Bayesian Spam Filtering and Bogofilter

from Ted Guild <ted@guilds.net>, 11 Feb 2003 10:39:38 -0500

Replies:

None.

Parents:

Joseph Reagle <reagle@mit.edu> writes:

> I'm fairly happy with bogofilter though it does let some stupid spam
> through occasionally, and isn't catching the latest spams which is
> just a non-spammy natural language sentence or two and a
> link.

Those are the ones that bug me the most, and I don't know what to do
about them.

I'm still using SA and am going to add bogofilter on my mail server. My plan
is to do all filtering on the mail server, using its cycles and
having more of my mail processing take place before my mail comes to
me. For training purposes I'll resend a message to an alias on the
mail server which will procmail into bogofilter. I'll probably have
the procmail recipe (at least for training on non-spam) look at other
headers (Received, MUA, etc.) to avoid outside influences from
tainting.

Actually, I think I will just send the known sold (eg ted+company@foo.org)
and harvested addresses to the spam alias bucket. I might even make
some honey pots for autotraining in this manner.

> (Also, it still misses some html mail, and I'm willing to consider
> that as very probable spam from the start.)

Ditto.

--
Ted Guild <ted@guilds.net>
http://www.guilds.net