Re: Bayesian Spam Filtering and Bogofilter

from Gerald Oskoboiny <[email protected]>, Tue, 11 Feb 2003 01:23:41 -0500

* Joseph Reagle <[email protected]> [2002-12-12 10:15-0500]
>
> Bayesian Spam Filtering [0] is very cool, so yesterday I set up
> Bogofilter [1] as a spam filtering technique. Presently, it's
> running in parallel with SpamAssassin.

I switched to Bogofilter (from Spamassassin) last Thursday, and
I'm really happy with it so far. My mail still gets labelled by
spamassassin on the way in, but I don't use it to decide where
my mail goes any more.

My .procmailrc currently has a bunch of list-specific filters,
then my whitelist stuff, then uses "bogofilter -u -e -p" on
anything from senders not in my whitelist. (I wasn't using -u
until just now; I didn't like the thought of it auto-training
itself, but I think it'll be ok. I might even set up a couple
honeypots to collect spam and feed my bogofilter ratings.)

I decided to keep the spamassassin-labelling stuff around because
bogofilter can learn probability ratings for spamassassin tokens,
e.g. "user_agent_mutt" currently has pgood=0.007957, pbad=0.000354.
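
As a rough illustration of what those two numbers imply, the sketch below
turns a pgood/pbad pair into a single per-token spam score using the simple
ratio pbad / (pgood + pbad), in the spirit of Paul Graham's essay [0]. The
function name token_spamicity is made up for this example, and bogofilter's
real calculation may weight, clamp, and combine tokens differently, so treat
this as an assumption rather than a description of bogofilter's internals.

```shell
#!/bin/sh
# token_spamicity PGOOD PBAD
# Hypothetical helper: prints pbad / (pgood + pbad) to four decimal places.
# This is an assumed, simplified formula, not necessarily what bogofilter does.
token_spamicity() {
    LC_ALL=C awk -v g="$1" -v b="$2" 'BEGIN {
        # A token never seen in either corpus gets a neutral score;
        # 0.5 is an arbitrary choice made for this sketch.
        if (g + b == 0) { print "0.5000"; exit }
        printf "%.4f\n", b / (g + b)
    }'
}

# The "user_agent_mutt" figures quoted above.
token_spamicity 0.007957 0.000354
```

With the figures above the ratio comes out at roughly 0.04, i.e. the token
leans strongly towards non-spam.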

I initially trained bogofilter on a bunch of spam and non-spam I
had around, and I wasn't sure whether I should feed it a whole
mailbox or individual messages, but after doing some testing
just now it seems it doesn't matter. (Bogofilter must have some
heuristics to recognize individual messages.)

I wrote a little script [2] to use as a wrapper around bogofilter
when I am training it from within Mutt, because it seems to take
about 2-3 seconds when invoked in training mode. (This script
returns immediately and calls bogofilter in the background.)

And here's my procmailrc [3] and muttrc [4].

> [0] http://www.paulgraham.com/spam.html
> [1] http://www.tuxedo.org/~esr/bogofilter/

[2] http://impressive.net/people/gerald/2003/02/train-bogofilter
[3] http://impressive.net/people/gerald/misc/dotfiles/procmailrc
[4] http://impressive.net/people/gerald/misc/dotfiles/muttrc

--
Gerald Oskoboiny <[email protected]>
http://impressive.net/people/gerald/

Re: Bayesian Spam Filtering and Bogofilter

from Joseph Reagle <[email protected]>, Tue, 11 Feb 2003 08:59:19 -0500

On Tuesday 11 February 2003 01:23, Gerald Oskoboiny wrote:
> I switched to Bogofilter (from Spamassassin) last Thursday, and
> I'm really happy with it so far. My mail still gets labelled by
> spamassassin on the way in, but I don't use it to decide where
> my mail goes any more.

I too no longer use SpamAssassin as a filtering criterion, though I still run
it to get rid of most of the horrid email on the POP server, and to compare
the results. I'm fairly happy with bogofilter, though it does let some
stupid spam through occasionally, and isn't catching the latest spams, which
are just a non-spammy natural language sentence or two and a link. (Also, it
still misses some HTML mail, and I'm willing to consider that as very
probable spam from the start.)

> I decided to keep the spamassassin-labelling stuff around because
> bogofilter can learn probability ratings for spamassassin tokens,
> e.g. "user_agent_mutt" currently has pgood=0.007957, pbad=0.000354.

What do you mean, the SpamAssassin headers are still part of the email when
you run bogofilter? That's just cheating then, right? (I originally trained
bogofilter with the headers included and was stunned by its performance
when the headers were present, but not when they weren't.) That's why I run:
sed -e "/X-KMail/d" -e "/X-Spam/d" -e "/X-Bogosity/d" -e "/ * /d" $f
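
One thing to keep in mind with a command like this: the first three
expressions are tested against every line of the message, so body lines that
merely mention X-Spam are deleted as well. A hedged sketch that limits the
deletion to the header block, and drops the folded continuation lines of any
header it removes, is shown below. It uses awk rather than sed, and the name
strip_filter_headers is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: remove X-KMail*, X-Spam* and X-Bogosity* headers
# (including their folded continuation lines) from a message on stdin,
# leaving the body untouched. A sketch, not the command used above.
strip_filter_headers() {
    awk '
        body     { print; next }                # past the headers: copy verbatim
        /^$/     { body = 1; print; next }      # first blank line ends the headers
        /^[ \t]/ { if (!drop) print; next }     # continuation follows its header
        {
            # A new header starts here: decide whether to drop it.
            drop = ($0 ~ /^(X-KMail|X-Spam|X-Bogosity)/)
            if (!drop) print
        }
    '
}
```

It reads one message on standard input, e.g. strip_filter_headers < "$f".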

But the combination of SA's features/heuristics and Bayesian filtering will
rock, and I'm looking forward to playing with that feature in the new
version of SA.

Re: Bayesian Spam Filtering and Bogofilter

from Joseph Reagle <[email protected]>, Tue, 11 Feb 2003 09:21:50 -0500

On Tuesday 11 February 2003 09:05, Dan Brickley wrote:
> Not cheating at all! This isn't a competition between SA and bogofilter;

Well, it is when I'm testing them. <smile/>

> if the bogofilter algorithms can be used to take into account features
> that SA detects, so much the better.

Ok, so the folks that are using bogofilter in this mode, do you find that
bogofilter is then able to correct SA's false negatives and positives, or
is it just parroting what you would've learned from SA in the first place?
(I expect the SA+bayesian to add value, but didn't with bogofilter...)

Re: Bayesian Spam Filtering and Bogofilter

from Dan Brickley <[email protected]>, Tue, 11 Feb 2003 09:05:04 -0500

* Joseph Reagle <[email protected]> [2003-02-11 08:59-0500]
> On Tuesday 11 February 2003 01:23, Gerald Oskoboiny wrote:
> > I switched to Bogofilter (from Spamassassin) last Thursday, and
> > I'm really happy with it so far. My mail still gets labelled by
> > spamassassin on the way in, but I don't use it to decide where
> > my mail goes any more.
>
> I too no longer use SpamAssassin as a filtering criterion, though I still run
> it to get rid of most of the horrid email on the POP server, and to compare
> the results. I'm fairly happy with bogofilter, though it does let some
> stupid spam through occasionally, and isn't catching the latest spams, which
> are just a non-spammy natural language sentence or two and a link. (Also, it
> still misses some HTML mail, and I'm willing to consider that as very
> probable spam from the start.)
>
> > I decided to keep the spamassassin-labelling stuff around because
> > bogofilter can learn probability ratings for spamassassin tokens,
> > e.g. "user_agent_mutt" currently has pgood=0.007957, pbad=0.000354.
>
> What do you mean, the SpamAssassin headers are still part of the email when
> you run bogofilter? That's just cheating then, right? (I originally trained
> bogofilter with the headers included and was stunned by its performance
> when the headers were present, but not when they weren't.) That's why I run:
> sed -e "/X-KMail/d" -e "/X-Spam/d" -e "/X-Bogosity/d" -e "/ * /d" $f

Not cheating at all! This isn't a competition between SA and bogofilter; if the
bogofilter algorithms can be used to take into account features that SA detects,
so much the better.

Very loose analogy: it's like a multi-layer feedforward neural network, where the earlier
layers reorganise the data and emphasise salient features in such a way as to make
it easier for later layers to do even more useful processing...

Dan

Re: Bayesian Spam Filtering and Bogofilter

from Ted Guild <[email protected]>, 11 Feb 2003 10:39:38 -0500

Joseph Reagle <[email protected]> writes:

> I'm fairly happy with bogofilter, though it does let some stupid spam
> through occasionally, and isn't catching the latest spams, which are
> just a non-spammy natural language sentence or two and a
> link.

Those are the ones that bug me the most, and I don't know what to do
about them.

I'm still using SA and am going to add bogofilter on my mail server. My
plan is to do all filtering on the mail server, using its cycles and
having more of my mail processing take place before my mail comes to
me. For training purposes I'll resend a message to an alias on the
mail server, which will procmail it into bogofilter. I'll probably have
the procmail recipe (at least for training on non-spam) look at other
headers (Received, MUA, etc.) to keep outside influences from tainting
the training.

Actually, I think I will just send mail for the known sold (e.g.
[email protected]) and harvested addresses to the spam alias bucket. I
might even make some honeypots for autotraining in this manner.

> (Also, it still misses some html mail, and I'm willing to consider
> that as very probable spam from the start.)

Ditto.

--
Ted Guild <[email protected]>
http://www.guilds.net