bogusfilter?

from Dean Jackson <dean@w3.org>, Fri, 28 Mar 2003 14:37:43 +1100

Replies:

Parents:

None.

Recently I've been disappointed with the performance of
bogofilter and spamassassin. The combination of the two
still means I don't get much spam, but they seem to
be making more mistakes than normal. So, I'm seeking advice.

- Do most people run bogofilter in auto-train mode?

- Do you use the default settings on bogofilter? I notice
a lot of spam arriving with a bogorating of 0.85 or above, so
not being marked as spam.

- How big are your bogo databases?

- Is anyone doing any preprocessing on the email before handing
it off? For example, SpamAssassin notices a message is from
yahoogroups and thus compensates for the advertising that yahoo
puts on the bottom. At the moment bogofilter is learning that
those typically spam-like phrases are not spam.

Any tips or tricks appreciated.

I'll also note that the Bayesian analysis within SpamAssassin
version 2.50 is doing a pretty good job of marking spam, without
any formal training (it autotrains). In fact, with this enabled,
SpamAssassin is doing a better job of finding spam than Bogofilter
for me (especially with whitelists)

I got enthused at one point and cleared my bogo databases for
a complete retrain. The results are more _accurate_, but I don't
see a big change in the amount of spam being caught.

As an aside, how many people have a procmail rule like this?

# run everything through spamassassin in client/server mode
:0fw
* ! ^X-Spam-Status:
| spamc

# if spamc didn't label the message yet, run spamassassin again
:0fw
* ! ^X-Spam-Status:
| spamassassin -P

I assume I got this from the man page or someone's existing .procmailrc.
I'm tempted to take the test for existing spam headers out.
I can't believe spammers aren't adding fake headers like this to
dodge spamassassin. Or am I confused?
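
The concern above follows from the recipe's logic: because of the `* ! ^X-Spam-Status:` condition, a message that arrives already carrying such a header skips both filters. One possible approach is to strip any pre-existing markup and then filter unconditionally. A minimal Python sketch of the stripping step follows; the header list and the name `strip_spam_headers` are assumptions made for illustration, not something taken from the recipes above.

```python
from email import message_from_string

# Assumed list of headers to distrust on arrival; adjust to taste.
SPAM_HEADERS = ("X-Spam-Status", "X-Spam-Flag", "X-Spam-Level")


def strip_spam_headers(raw_message: str) -> str:
    """Return the message text with any pre-existing X-Spam-* headers removed."""
    msg = message_from_string(raw_message)
    for name in SPAM_HEADERS:
        # Deleting a header removes every occurrence and is a no-op if absent.
        del msg[name]
    return msg.as_string()
```

The cleaned message could then be piped to spamc without the header test, so a forged header no longer lets a message bypass the filter.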

(Note to self: add this feature to my bulk email program if true)

Dean

Re: bogusfilter?

from Henrik Edlund <henrik@edlund.org>, Thu, 17 Apr 2003 21:45:54 +0200 (CEST)

Replies:

ted
henrik

Parents:

dean

On Fri, 28 Mar 2003, Dean Jackson wrote:

DJ> I'll also note that the Bayesian analysis within SpamAssassin version
DJ> 2.50 is doing a pretty good job of marking spam, without any formal
DJ> training (it autotrains). In fact, with this enabled, SpamAssassin is
DJ> doing a better job of finding spam than Bogofilter for me (especially
DJ> with whitelists)

I am running only SpamAssassin 2.53, but with Bayesian (auto-learning),
Razor2, DCC, auto-whitelist and all built-in RBL lookups. I also paid
for spamcop.net, so I have enabled that RBL for my SpamAssassin as well.

I filter out anything with score 1.0 or above and so far I am getting 100%
good results, no false positives or false negatives.

The nice thing I have noticed with 2.50 and later is that non-spam no
longer lands around zero but instead gets strongly negative scores, while
spam gets very high scores. I guess this is why a threshold of 1.0 works
for me.

I no longer use the Gerald-inspired whitelist functionality with procmail
that I used before SpamAssassin. No need anymore.

I get about 500 mails per day, and of these 50-100 are spam. SpamAssassin
2.53 works splendidly.

Henrik

--
"You're young, you're drunk, you're in bed, you have knives; shit happens."
-- Angelina Jolie

Re: bogusfilter?

from Ted Guild <ted@guilds.net>, Thu, 17 Apr 2003 17:35:47 -0400

Replies:

henrik

Parents:

dean
henrik

Henrik Edlund <henrik@edlund.org> writes:

> On Fri, 28 Mar 2003, Dean Jackson wrote:
>
> DJ> I'll also note that the Bayesian analysis within SpamAssassin version
> DJ> 2.50 is doing a pretty good job of marking spam, without any formal
> DJ> training (it autotrains). In fact, with this enabled, SpamAssassin is
> DJ> doing a better job of finding spam than Bogofilter for me (especially
> DJ> with whitelists)
>
> I am running only SpamAssassin 2.53, but with Bayesian (auto-learning),
> Razor2, DCC, auto-whitelist and all built-in RBL lookups. I also paid
> for spamcop.net, so I have enabled that RBL for my SpamAssassin as well.
>
> I filter out anything with score 1.0 or above and so far I am getting 100%
> good results, no false positives or false negatives.
>
> The nice thing I have noticed with 2.50 and later is that non-spam no
> longer lands around zero but instead gets strongly negative scores, while
> spam gets very high scores. I guess this is why a threshold of 1.0 works
> for me.

What about those really tricky ones that, in my experience, get past
SpamAssassin and Bayesian? The short one- or two-liners of innocuous
phrases with a URI? Maybe I'll get those if I add Razor to the mix.
Gerald and I have talked about these and thought about keeping a
blacklist from whois data, as Gerald noticed these often belong to a
handful of people with multiple domains. I was wondering about taking
the content from the URI and running it through bogofilter for a
score.

> I no longer use the Gerald-inspired whitelist functionality with procmail
> that I used before SpamAssassin. No need anymore.

I didn't adopt whitelisting until after an older version of SpamAssassin
had some false positives. SpamAssassin's whitelist isn't as nice; I'd
like to have it file-based like the other. I run sed from cron to turn
my .mailrc into my whitelist.
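
As an illustration of that .mailrc-to-whitelist step, here is a rough Python equivalent. It is only a sketch: it assumes the .mailrc holds lines of the form `alias <name> <address> ...`, and the function name `addresses_from_mailrc` is made up for the example. The actual sed expression is not shown in this thread.

```python
def addresses_from_mailrc(text):
    """Collect unique e-mail addresses from `alias` lines of a .mailrc file."""
    seen = set()
    result = []
    for line in text.splitlines():
        fields = line.split()
        # Expected shape (an assumption): alias <name> <address> [<address> ...]
        if len(fields) < 3 or fields[0] != "alias":
            continue
        for field in fields[2:]:
            addr = field.strip("<>,\"'").lower()
            # Entries without "@" are references to other aliases; skip them.
            if "@" in addr and addr not in seen:
                seen.add(addr)
                result.append(addr)
    return result
```

Writing the returned list one address per line gives a plain file that a procmail whitelist recipe can match against.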

> I get about 500 mails per day, and of these 50-100 are spam. SpamAssassin
> 2.53 works splendidly.

I should upgrade; I got lazy and went with the Debian-packaged
version. sarge is at 2.43, so I might have to go with sid on this one,
as it's at 2.53, or build it myself.

--
Ted Guild <ted@guilds.net>
http://www.guilds.net

Re: bogusfilter?

from Henrik Edlund <henrik@edlund.org>, Thu, 17 Apr 2003 23:51:47 +0200 (CEST)

Replies:

None.

Parents:

On Thu, 17 Apr 2003, Ted Guild wrote:

TG> What about those really tricky ones that, in my experience, get past
TG> SpamAssassin and Bayesian? The short one- or two-liners of innocuous
TG> phrases with a URI? Maybe I'll get those if I add Razor to the mix.
TG> Gerald and I have talked about these and thought about keeping a
TG> blacklist from whois data, as Gerald noticed these often belong to a
TG> handful of people with multiple domains. I was wondering about taking
TG> the content from the URI and running it through bogofilter for a
TG> score.

I get those with one of the RBLs or with Razor2 or with DCC. Also, even
when all of those miss them, they never get negative scores or scores
below 1. These are the RBLs that my SpamAssassin employs:

relays.osirusoft.com
relays.ordb.org
relays.visi.com
sbl.spamhaus.org
orbs.dorkslayers.com
opm.blitzed.org
list.dsbl.org
ipwhois.rfc-ignorant.org
hil.habeas.com
bl.spamcop.net
dnsbl.njabl.org
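
For readers unfamiliar with how such lists are queried: by the usual DNSBL convention, the IPv4 octets of the connecting relay are reversed, the list's zone is appended, and an A-record lookup is made; any answer means the address is listed. The Python sketch below illustrates this. The function names are made up for the example, and `is_listed` needs network access and cannot tell a timeout from "not listed".

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the zone answers for this address (requires network)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name (or a lookup failure): treat as not listed.
        return False
```

For example, checking 192.0.2.99 against bl.spamcop.net means resolving 99.2.0.192.bl.spamcop.net.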

Henrik

Re: bogusfilter?

from Hugo Haas <hugo@larve.net>, Thu, 17 Apr 2003 22:38:37 +0200

Replies:

None.

Parents:

dean

Hey Dean.

* Dean Jackson <dean@w3.org> [2003-03-28 14:37+1100]
> Recently I've been disappointed with the performance of
> bogofilter and spamassassin. The combination of the two
> still means I don't get much spam, but they seem to
> be making more mistakes than normal. So, I'm seeking advice.
>
> - Do most people run bogofilter in auto-train mode?

I do, but only in the following cases:
- spam and not on my white list.
- not spam and on my white list.

> - Do you use the default settings on bogofilter? I notice
> a lot of spam arriving with a bogorating of 0.85 or above, so
> not being marked as spam.

When I upgraded bogofilter to version 0.11.*, I started having more
spam get through it. So I tweaked the settings.

> - How big are your bogo databases?

hugo@homer ~> ls -l .bogofilter
total 11379
-rw------- 1 hugo www 2969600 Apr 17 21:06 goodlist.db
-rw------- 1 hugo www 8634368 Apr 17 21:06 spamlist.db

> - Is anyone doing any preprocessing on the email before handing
> it off? For example, SpamAssassin notices a message is from
> yahoogroups and thus compensates for the advertising that yahoo
> puts on the bottom. At the moment bogofilter is learning that
> those typically spam-like phrases are not spam.

I do some white list tagging as said above.

> Any tips or tricks appreciated.

I completely dropped SpamAssassin, and am invoking bogofilter as
(assuming that BOGOFILTER_REGISTER is set to yes at some point):

# Bogofilter
:0fw
| bogofilter -pe -o 0.5

# Register mail as spam?
:0c
* BOGOFILTER_REGISTER ?? yes
* ^X-Bogosity: Yes
| bogofilter -s

# Register mail as non-spam?
:0c
* BOGOFILTER_REGISTER ?? yes
* ^X-Bogosity: No
* ^X-HH-Whitelist: YES
| bogofilter -n

:0:
* ^X-Bogosity: Yes
spam

With version 0.11.2, I haven't had any problem with this setting.
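
The auto-training policy described earlier in this message (register as spam only when the mail is classified as spam and not whitelisted; register as non-spam only when it is classified as non-spam and whitelisted) can be summarised as a small decision function. This Python sketch follows the prose description rather than the recipes, which as shown do not test the whitelist header before `bogofilter -s`. The name `training_action` is hypothetical.

```python
from typing import Optional


def training_action(is_spam: bool, whitelisted: bool) -> Optional[str]:
    """Return the bogofilter registration flag to use, or None to skip training."""
    if is_spam and not whitelisted:
        return "-s"  # register as spam
    if not is_spam and whitelisted:
        return "-n"  # register as non-spam
    # Classifier and whitelist disagree, or there is no whitelist evidence:
    # do not train on this message.
    return None
```

Training only when the classifier and the whitelist agree keeps a misclassified message from reinforcing its own mistake.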

Note that you can use bogoutil to remove tokens with a low count or
old tokens, and you may increase the quality of your filtering, but I
haven't done that.

--
Hugo Haas - http://larve.net/people/hugo/