Re: Whitelist, Spam Assassin not enough

from Gerald Oskoboiny <[email protected]>, Tue, 28 Oct 2003 00:35:22 -0500

Replies:

Parents:

karl

* Karl Dubost <[email protected]> [2003-10-27 11:42-0500]
> Hi,
>
> I have read
> http://impressive.net/people/gerald/2000/12/spam-filtering.html
>
> I have had SA 2.60 installed on my machine for a while, with a threshold
> around 4.2, and I still receive a lot of spam. I'm refusing any kind of
> ".exe". I'm on a Macintosh. But too many spams are reaching my INBOX.

I use a whitelist and SA 2.60 for my W3C mail, and whitelist and
bogofilter for personal mail, and I'm very happy with both of
them. I would guess that less than 5-10 spams/day get through,
and 300-400 are trapped, and false positives are very rare.
(well, they seem to be, I don't really check any more.)

> For example, on October 26, 56 spam messages (out of hundreds) still
> hit my INBOX. The rest of the spam goes straight to /dev/null; I'm not
> checking it anymore. It's a waste of time to save 1 mail in 1000.

This sounds like something may be wrong. You should check your
mail error log (maybe something like /var/log/mail*); possibly
Razor2 or something else is not installed correctly.

If you check recent spam that has been trapped, much of it
should be marked with tags like BAYES_99 and RAZOR2_CF_RANGE_51_100.
I think the bayes and razor2 tests are fairly important for
SA's accuracy; if they never appear, something's probably wrong.

Also, you can use sa-learn to improve SA's accuracy by telling it
when it guesses incorrectly:

http://www.spamassassin.org/doc/sa-learn.html

> So I thought about a white-list system.
>
> 1. First mail: Automatic reply with an address to a Web Form
> 2. The person has to identify her/himself
> 3. I check the list and add them if necessary in my whitelist.
>
> I see at least one issue: online forms for booking or buying
> products, which reply from unpredictable addresses.

If you do this you should do it very carefully; some people find
those systems really annoying.

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=207300

I am sometimes tempted to use something like that for mail that
is trapped by spamassassin, because I don't like the thought of
false positives just disappearing into a mailbox I never check.

But I think I would rather install Exim4 and start rejecting spam
at SMTP time than start sending challenges to hundreds of
(probably forged) messages per day.

If I ever do the challenge-response thing I'll probably include
something in my challenge that says "if this message was a
forgery, you should install SMTP forgery prevention software;
see http://spf.pobox.com/ " to help spread word about SPF.

--
Gerald Oskoboiny <[email protected]>
http://impressive.net/people/gerald/

Re: Whitelist, Spam Assassin not enough

from Ted Guild <[email protected]>, Tue, 28 Oct 2003 10:11:02 -0500

Replies:

None.

Parents:

karl
gerald

Gerald Oskoboiny <[email protected]> writes:

> and false positives are very rare.
> (well, they seem to be, I don't really check any more.)

+1 Lately when the spam file gets too large I've just been doing a
cat /dev/null >

I guess I should skip that step and send it there to begin with.

>I am sometimes tempted to use something like that for mail that
>is trapped by spamassassin, because I don't like the thought of
>false positives just disappearing into a mailbox I never check.

That bothers me too, but not enough that I am inclined to sift through
volumes of spam manually ever again. If something is really important
someone will eventually get my attention in a way that doesn't get
trapped in a filter. Reliability of mail delivery to me has suffered
as a result.

Black-holing it is antisocial. I don't like it when it happens to
me, and one seldom knows whether it has happened to them.

I posted to a mailing list recently, including a useful patch to the
software the list is about. I got an automated reply saying my mail
was queued up for moderator action since I wasn't subscribed to the
list. I didn't feel I needed to subscribe to make a suggestion and
contribution, and if I subscribed and resent I'd run the risk of
double posting should my original message get moderated in. After a
month I got a 'rejected by moderator' notice: 'No reason given.' I corresponded
with the list owner and he offered his explanation along with an
apology. The list had too much spam awaiting moderator action and he
simply chose to reject them all. The list software was mailman, which
I use as well for a couple lists I maintain, and I was glad he used
the reject instead of discard option. I can certainly sympathize with
his reluctance to sift for false positives.

>But I think I would rather install Exim4 and start rejecting spam
>at SMTP time than start sending challenges to hundreds of
>(probably forged) messages per day.

I've messed around some with Exim4 and exiscan for hooks into
Spamassassin and clamav (anti-virus). BTW Debian's package splits
exim's conf into a bunch of different files, much like they did with
the ipchains package. I'm not a fan of this (I'm left wondering which
gets loaded in what order), but I guess the upside is that if you change
one aspect of the config you're not holding back apt updates of the others.

The default is an SA score of 10 to do a reject; I would probably lower
that to my personal threshold, which is currently 3.4, as I
was getting too many just over that threshold.

It appears to be an immensely flexible MTA but I have refrained from
making the switch.

I need to experiment more with it somewhere so as not to mess with
real mail until I am comfortable with it. The printed books are
highly recommended.

My thinking has been to reject spam, but to discard viruses so as not
to compound their impact on mail servers.

What to say though in the rejection? It is amazing how even a
carefully worded error message in a rejection notice baffles some.
"There's a problem with my mail it didn't go through." Well if you
read the explanation given to you it might make sense. Anyone know of
an active Clueless User Network Test System list? Every one I ever
find has always been shut down, probably abused by sheer volume of
clueless user subscriptions.

Hmm, maybe the reject could contain a unique key in a header the mail
client won't trash, like Subject, or in the body of the message (forget
References and Reply-To because of borken MUAs). A user replying to the
bounce citing the full bounce message would be enough of an action to
get them past being rejected a second time. I guess whatever datafile
is used to compare these keys coming back in would get sizeable over
time and should rotate out after a month or so. Replies to rejects
could be piped to sa-learn to improve its reliability. Now if the
rejection message gets trapped by their spam filtering due to the
sender's wording then I guess it'd likely end up in the bit bucket or
itself get bounced back.
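
A rough sketch of that key idea, purely illustrative: the class name,
the 30-day lifetime, and the key format are all assumptions, and a real
setup would need persistent storage rather than an in-memory dict.

```python
import secrets
import time

MONTH_SECONDS = 30 * 24 * 3600  # "a month or so"; the exact value is arbitrary


class BounceKeyStore:
    """Remember keys placed in rejection notices, and expire old ones."""

    def __init__(self, ttl=MONTH_SECONDS):
        self.ttl = ttl
        self._issued = {}  # key -> time the key was issued

    def issue(self, now=None):
        """Create a fresh key to embed in a rejection's Subject or body."""
        now = time.time() if now is None else now
        key = secrets.token_hex(8)
        self._issued[key] = now
        return key

    def rotate(self, now=None):
        """Drop keys older than the time-to-live."""
        now = time.time() if now is None else now
        cutoff = now - self.ttl
        self._issued = {k: t for k, t in self._issued.items() if t >= cutoff}

    def reply_is_valid(self, text, now=None):
        """True if the reply quotes any key that has not yet expired."""
        self.rotate(now)
        return any(key in text for key in self._issued)


store = BounceKeyStore()
key = store.issue(now=0)
reply = f"Subject: Re: rejected [key {key}]\n\n> your message was rejected"
print(store.reply_is_valid(reply, now=3600))                  # within a month
print(store.reply_is_valid(reply, now=MONTH_SECONDS + 3600))  # after rotation
```

Searching the whole reply text, rather than one header, matches the
worry about MUAs that drop References or Reply-To.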

>If I ever do the challenge-response thing I'll probably include
>something in my challenge that says "if this message was a
>forgery, you should install SMTP forgery prevention software;
>see http://spf.pobox.com/ " to help spread word about SPF.

Including SPF into the exim mix would complete the picture and in time
I wouldn't even bother with the rejects to those.

--
Ted Guild <[email protected]>
http://www.guilds.net

Re: Whitelist, Spam Assassin not enough

from Gerald Oskoboiny <[email protected]>, Tue, 28 Oct 2003 14:01:15 -0500

Replies:

None.

Parents:

karl
gerald

* Gerald Oskoboiny <[email protected]> [2003-10-28 00:35-0500]
> I use a whitelist and SA 2.60 for my W3C mail, and whitelist and
> bogofilter for personal mail, and I'm very happy with both of
> them. I would guess that less than 5-10 spams/day get through,
> and 300-400 are trapped, and false positives are very rare.
> (well, they seem to be, I don't really check any more.)

Actually, in the last week SA trapped 567 spams/day for my w3c mail,
and bogofilter trapped about 436 spams/day for my personal mail.
That's just over 1000 spams/day, from 1762 total messages/day.
(I may have underestimated how much spam gets through; may be as
much as 20-30 messages/day.)
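
The arithmetic behind those figures, as a quick check. Treating the
20-30 missed spams/day as covering both accounts combined is an
assumption; the message does not say.

```python
sa_trapped_per_day = 567          # W3C mail, trapped by SpamAssassin
bogofilter_trapped_per_day = 436  # personal mail, trapped by bogofilter
total_messages_per_day = 1762

trapped = sa_trapped_per_day + bogofilter_trapped_per_day
spam_share = trapped / total_messages_per_day

# Share of all spam that slips through, for the low and high estimates.
miss_low = 20 / (trapped + 20)
miss_high = 30 / (trapped + 30)

print(f"trapped: {trapped}/day, {spam_share:.1%} of all mail")
print(f"missed:  {miss_low:.1%} to {miss_high:.1%} of spam")
```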

My personal spam intake was a bit higher than it should have been
lately due to a configuration error: my backup MX was configured
to relay all mail sent to *@impressive.net to gerald@primary-mx,
with the result that mail to any address at my site was accepted,
instead of bogus addrs bouncing back.

I fixed that, but I am tempted to turn it back on and start feeding
mail sent to the bogus addrs directly into "sa-learn --spam", and
maybe even set up a bunch of extra addresses to use as spam honeypots.

(and maybe introduce a 5 min delay before processing mail sent to
valid addrs, to give SA a chance to learn from spam that already
hit the honeypots during the same spam attack)

--
Gerald Oskoboiny <[email protected]>
http://impressive.net/people/gerald/

Re: Whitelist, Spam Assassin not enough

from Gerald Oskoboiny <[email protected]>, Wed, 14 Jan 2004 00:38:08 -0500

Replies:

Parents:

karl
gerald

* Gerald Oskoboiny <[email protected]> [2003-10-28 00:35-0500]

> I am sometimes tempted to use [a challenge-response system] for
> mail that is trapped by spamassassin, because I don't like the
> thought of false positives just disappearing into a mailbox I
> never check.
>
> But I think I would rather install Exim4 and start rejecting spam
> at SMTP time than start sending challenges to hundreds of
> (probably forged) messages per day.

I did this: impressive.net now rejects any mail that spamassassin
scores higher than 10. Woohoo!

The mail is rejected during the initial delivery attempt, with
this error message:

550-Sorry, this smells like spam; rejected. For more info, please see
550 http://impressive.net/people/gerald/2004/01/spam.html

This should reject about 2/3rds of all mail to my site. I feel
better already :)
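
For anyone wondering about the "550-" versus "550 " prefixes: in a
multi-line SMTP reply every line but the last has a hyphen after the
code, and the last has a space. A small sketch of that formatting rule
(the function name is made up; this is not Exim's code):

```python
def smtp_reply(code, lines):
    """Format a (possibly multi-line) SMTP reply.

    Every line but the last is "<code>-<text>"; the last is "<code> <text>".
    """
    if not lines:
        raise ValueError("an SMTP reply needs at least one line of text")
    out = [f"{code}-{text}" for text in lines[:-1]]
    out.append(f"{code} {lines[-1]}")
    return "\r\n".join(out) + "\r\n"


reply = smtp_reply(550, [
    "Sorry, this smells like spam; rejected. For more info, please see",
    "http://impressive.net/people/gerald/2004/01/spam.html",
])
print(reply, end="")
```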

This was *really* easy to set up thanks to the exim4-daemon-heavy
package in debian sarge which includes a patch called exiscan-acl.
Details:

http://impressive.net/weblogs/fogo/2004/01/13/2004-01-13.html

--
Gerald Oskoboiny <[email protected]>
http://impressive.net/people/gerald/

notes on SMTP-time spamassassin rejections

from Gerald Oskoboiny <[email protected]>, Tue, 1 Jun 2004 13:01:36 -0400

Replies:

Parents:

gerald

* Gerald Oskoboiny <[email protected]> [2004-01-14 00:38-0500]

> impressive.net now rejects any mail that spamassassin scores
> higher than 10. Woohoo!
>
> The mail is rejected during the initial delivery attempt, with
> this error message:
>
> 550-Sorry, this smells like spam; rejected. For more info, please see
> 550 http://impressive.net/people/gerald/2004/01/spam.html

The following is only marginally interesting to others; I just
wanted to archive some thoughts and recent stats on my spam
filtering setup.

For the last few months any email to impressive.net with a SA
score > 10 has been rejected, and any email I receive with score
5-10 goes into a mailbox called 'probable-spam' which in theory I
could review periodically for false positives, but in practice
just gets ignored. (it has 38643 messages since Jan 14, 292/day)
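
The routing described above, as a sketch. The function and constant
names are invented, "5-10" is read as inclusive at both ends, and the
other procmail filters that also feed probable-spam are ignored:

```python
REJECT_ABOVE = 10.0        # rejected at SMTP time
PROBABLE_SPAM_FROM = 5.0   # filed in the 'probable-spam' mailbox


def route(score):
    """Return what happens to a message with the given SA score."""
    if score > REJECT_ABOVE:
        return "reject"
    if score >= PROBABLE_SPAM_FROM:
        return "probable-spam"
    return "deliver"


for score in (12.3, 10.0, 5.0, 4.9):
    print(score, route(score))
```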

I hate silently ignoring email, so I wonder if I should decrease
the rejection threshold to 5 or something. Or maybe set up a
challenge/response system for that mail: since mr-burns rejects
forgeries using SPF, I could tell anyone who complains about
bogus challenges to publish SPF records and leave me alone.

The spamassassin that runs at SMTP time is a generic one that
doesn't learn over time because it doesn't have a bayes DB that
it can write to (because it runs as user nobody), so it is much
less effective than it could be.

I had planned to figure out how to set up a bogus user with a
bayes DB that I could train over time, but it seems tricky to do
that with exiscan-acl so maybe I should just configure SA on
mr-burns to use my personal bayes DB.

recent stats:

263 msgs/day rejected by generic SA > 10 at SMTP time (no bayes DB)
in the last week.

292 msgs/day trapped with SA > 5 by autolearning SA (not manually trained)
(since Jan 14)

of those 292,
185 msgs/day scored >10 when rescored using my bayes DB (autolearned)
107 msgs/day scored >5 but <10 when rescored using my bayes DB

so if I switched the global spamassassin config to use my personal
bayes DB on mr-burns, I could start rejecting another 185 messages/
day immediately.

If I lowered the rejection threshold to 5, I could start rejecting
another 107 messages/day.
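
Adding those up, as a rough projection only: the 263/day figure covers
the last week while the others are averages since Jan 14, so the totals
mix two time windows.

```python
rejected_now = 263       # msgs/day rejected by generic SA > 10 (last week)
probable_spam = 292      # msgs/day trapped with SA > 5 (since Jan 14)
rescored_over_10 = 185   # of those, > 10 when rescored with the bayes DB
rescored_5_to_10 = 107   # of those, > 5 but < 10 when rescored

# The two rescored groups account for the whole probable-spam mailbox.
accounted_for = rescored_over_10 + rescored_5_to_10

with_bayes = rejected_now + rescored_over_10
with_bayes_and_threshold_5 = with_bayes + rescored_5_to_10

print("probable-spam accounted for:", accounted_for == probable_spam)
print("rejections/day with personal bayes DB:", with_bayes)
print("rejections/day with bayes DB and threshold 5:", with_bayes_and_threshold_5)
```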

If I started training SA manually, I could do even better.
(I still get about 50-100 spams/day in my low-priority mailbox:
stuff that scored < 5 and was not from someone on my whitelist)

Oh... I haven't received a single complaint about real mail being
rejected, though ~35k messages have been rejected by now. (But I
wouldn't necessarily hear about list subscriptions being cancelled.)

distributions of stuff that scored >5 using autolearned bayes DB:

$ cat probable-spam | formail -s formail -XX-Spam-Status: | fmt -1 | grep -E '^hits=' | cut -d. -f1 | sort -n -t= -k2 | uniq -c
1 hits=-4
1 hits=-0
4 hits=1
12 hits=2
7 hits=3
22 hits=4
2485 hits=5
2641 hits=6
3077 hits=7
2977 hits=8
2948 hits=9
3583 hits=10
3352 hits=11
2829 hits=12
2550 hits=13
2284 hits=14
1811 hits=15
1489 hits=16
1038 hits=17
755 hits=18
535 hits=19
394 hits=20
350 hits=21
285 hits=22
269 hits=23
230 hits=24
202 hits=25
284 hits=26
268 hits=27
337 hits=28
247 hits=29
270 hits=30
214 hits=31
167 hits=32
140 hits=33
133 hits=34
120 hits=35
76 hits=36
53 hits=37
45 hits=38
36 hits=39
43 hits=40
31 hits=41
21 hits=42
11 hits=43
4 hits=44
5 hits=45
3 hits=46
3 hits=47

(a few messages that scored < 5 went to probable-spam due to
various other filters in my procmailrc)
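
A sketch of the same bucketing in Python, for readers without formail
handy. It assumes the X-Spam-Status values have already been extracted
into strings; the sample values are invented. Like "cut -d. -f1" it
keeps the text before the decimal point, which is why a score such as
-0.4 lands in a "-0" bucket.

```python
import re
from collections import Counter


def integer_part(score_text):
    """Mimic `cut -d. -f1`: keep whatever precedes the first dot."""
    return score_text.split(".", 1)[0]


def score_histogram(statuses):
    """Count messages per integer part of the hits= score."""
    counts = Counter()
    for status in statuses:
        match = re.search(r"hits=(-?\d+(?:\.\d+)?)", status)
        if match:
            counts[integer_part(match.group(1))] += 1
    return counts


# Hypothetical header values, for illustration only.
SAMPLE = [
    "Yes, hits=5.3 required=5.0",
    "Yes, hits=5.9 required=5.0",
    "No, hits=-0.4 required=5.0",
    "Yes, hits=12.0 required=5.0",
]

hist = score_histogram(SAMPLE)
for bucket in sorted(hist, key=lambda b: (int(b), b)):
    print(f"{hist[bucket]:7d} hits={bucket}")
```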

--
Gerald Oskoboiny <[email protected]>
http://impressive.net/people/gerald/

Re: notes on SMTP-time spamassassin rejections

from Yves Lafon <[email protected]>, Tue, 1 Jun 2004 20:30:16 +0200 (MEST)

Replies:

None.

Parents:

gerald

On Tue, 1 Jun 2004, Gerald Oskoboiny wrote:

> 263 msgs/day rejected by generic SA > 10 at SMTP time (no bayes DB)
> in the last week.
>
> 292 msgs/day trapped with SA > 5 by autolearning SA (not manually trained)
> (since Jan 14)
>
> of those 292,
> 185 msgs/day scored >10 when rescored using my bayes DB (autolearned)
> 107 msgs/day scored >5 but <10 when rescored using my bayes DB

How about some stats on the number of emails received from sites with
a valid SPF record? (even if your stats may be optimistic)

--
Yves Lafon - W3C
"Baroula que barouleras, au tiéu toujou t'entourneras."

Re: notes on SMTP-time spamassassin rejections

from Gerald Oskoboiny <[email protected]>, Tue, 8 Jun 2004 14:14:32 -0400

Replies:

jones

Parents:

* Gerald Oskoboiny <[email protected]> [2004-06-01 13:01-0400]

> The spamassassin that runs at SMTP time is a generic one that
> doesn't learn over time because it doesn't have a bayes DB that
> it can write to (because it runs as user nobody), so it is much
> less effective than it could be.
>
> I had planned to figure out how to set up a bogus user with a
> bayes DB that I could train over time, but it seems tricky to do
> that with exiscan-acl so maybe I should just configure SA on
> mr-burns to use my personal bayes DB.

I did this, and started training spamassassin on any spam it
misses (maybe 5-10/day), set up a few honeypots (notes below),
and wow, what a huge improvement.

My spam intake has dropped to 1998 levels. It's actually eerily
quiet. I'm worried that I must be rejecting too much stuff, but
can't find any evidence of legit mail being blocked.

To set up honeypot addresses, I checked for the most common
unrouteable addresses in exim's rejectlog (somehow a bunch of
bogus addrs got onto spammers' lists, usually truncated versions
of real addresses, e.g. [email protected]) and turned those
into aliases for a new user I created:

# spam honeypots (most common unrouteable addrs in rejectlog)
rald: spam-honeypot
ald: spam-honeypot
...

(I could have also just created a bunch of fake addrs and put
those on my web site to be crawled by email harvesting bots, but
might as well use addresses that were already known to spammers.)
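
Picking the aliases can be sketched like this, assuming the local parts
of the unrouteable addresses have already been pulled out of the
rejectlog (that parsing is not shown, and the counts are invented):

```python
from collections import Counter


def honeypot_aliases(rejected_local_parts, target="spam-honeypot", top=2):
    """Return alias lines for the most frequently rejected local parts."""
    common = Counter(rejected_local_parts).most_common(top)
    return [f"{local}: {target}" for local, _count in common]


# Hypothetical list of local parts seen in rejections.
rejected = ["rald", "ald", "rald", "xyz", "rald", "ald"]

for line in honeypot_aliases(rejected):
    print(line)
```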

The 'spam-honeypot' user has the same uid as gerald so it can
write to my bayes DB, and it feeds all its non-daemon mail into
sa-learn using a procmailrc like this:

# procmailrc for spam-honeypot user: feed all mail into sa-learn --spam

PATH=$HOME/bin:/usr/bin:/bin:/usr/local/bin

:0:
* ^FROM_DAEMON
from-daemon

:0c
| sa-learn --spam

:0:
sa-learned-spam

and ~spam-honeypot/.spamassassin is symlinked to ~gerald/.spamassassin

(spam-honeypot etc above are actually called something else; I
didn't put the real names here because I don't want spammers
finding out the names of my honeypots and poisoning them with
legitimate mail.)

--
Gerald Oskoboiny <[email protected]>
http://impressive.net/people/gerald/

RE: notes on SMTP-time spamassassin rejections

from "David A. Jones" <[email protected]>, Tue, 8 Jun 2004 15:21:07 -0700

Replies:

None.

Parents:

gerald

impressive! :-) he he he

OK, but if a message comes in addressed to both a bogus address and a
legit address, this should also lead you to believe that it is a bogus
email. For example:

to: [email protected], [email protected],
[email protected]
from: [email protected]
subject: penis vagina penis vagina

crap crap crap crap

Now say that spamcity.bastards.org is a legit domain with MX records and
such. The To address is legit also, but because it's accompanied by invalid
addresses you own, shouldn't this also be rejected? Is that what the
honeypots do? Do they look at the entire SMTP and BSMTP headers?
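
What is being suggested here could look roughly like the sketch below.
It is only an illustration of the question: the addresses are
placeholders at example.org, and the honeypots as described in the
thread feed mail into sa-learn rather than doing this check.

```python
def normalize(address):
    """Lower-case and trim an address for comparison."""
    return address.strip().lower()


def hits_honeypot(recipients, honeypots):
    """True if any recipient of the message is a known honeypot address."""
    known = {normalize(a) for a in honeypots}
    return any(normalize(r) in known for r in recipients)


# Placeholder addresses; the real honeypot names are not public.
HONEYPOTS = ["trap1@example.org", "trap2@example.org"]

print(hits_honeypot(["someone@example.org", "Trap1@example.org"], HONEYPOTS))
print(hits_honeypot(["someone@example.org"], HONEYPOTS))
```

At SMTP time the check would have to use the envelope recipients, since
the To: header need not list every address the message was sent to.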

Cheers,

David.

Re: notes on SMTP-time spamassassin rejections

from Gerald Oskoboiny <[email protected]>, Wed, 21 Jul 2004 15:53:21 -0400

Replies:

None.

Parents:

* Gerald Oskoboiny <[email protected]> [2004-06-01 13:01-0400]

> For the last few months any email to impressive.net with a SA
> score > 10 has been rejected, and any email I receive with score
> 5-10 goes into a mailbox called 'probable-spam' which in theory I
> could review periodically for false positives, but in practice
> just gets ignored. (it has 38643 messages since Jan 14, 292/day)
>
> I hate silently ignoring email, so I wonder if I should decrease
> the rejection threshold to 5 or something.

I have been really happy with my spam blocking setup, and still have
not received a single complaint about valid mail being blocked.

Since June 3, 4899 messages were filtered to my probable-spam
mailbox but not rejected (messages that scored 5-10); I scanned
about 2k of those manually and found two false positives, an
inquiry about using a photo, and someone asking about a hotel
in Italy; both were tagged BAYES_99 which contributed 5.4 to the
score (but both scored < 6 total.)

So I just lowered my threshold for rejection to 6, and I am doing
away with this probable-spam mailbox, so I don't have to worry
about mail being silently ignored any more.
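
The numbers behind that decision, worked through. "About 2k" is taken
as exactly 2000 here, which is an approximation.

```python
scanned = 2000          # "about 2k" messages reviewed by hand
false_positives = 2     # the photo inquiry and the hotel question
bayes_99_points = 5.4   # contribution of BAYES_99 to each score
new_threshold = 6.0

fp_rate = false_positives / scanned

# Both false positives scored under 6 in total, so all their other
# rules together added less than this many points:
other_rules_upper_bound = new_threshold - bayes_99_points

print(f"false positive rate in the sample: {fp_rate:.2%}")
print(f"points from other rules: under {other_rules_upper_bound:.1f}")
```

Since both false positives scored below 6, neither would have been
rejected at the new threshold.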

Oh... 6090 messages have entered my spam honeypot in that time,
fed directly into sa-learn --spam. (messages that scored > 10 were
rejected as usual, and not fed into sa-learn... I wonder if I
should try to exclude my honeypots from smtp-time spam blocking?)

Distribution of SA scores in probable-spam since June 3:

gerald@ogobogo:/home/gerald; cat mail/probable-spam | formail -s formail -c -XX-Spam-Status | cut -d= -f2 | cut -d. -f1 | sort -n | uniq -c
269 5
377 6
695 7
494 8
379 9
523 10
503 11
358 12
415 13
292 14
223 15
177 16
94 17
50 18
27 19
18 20
3 21
1 23
1 24

(hmm, why so many > 10?? Those should have been rejected.
Maybe some network tests are not being done at smtp time?)
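
Summing the table above as a sanity check. The buckets are integer
parts of the score, so the "10" bucket (10.0 to 10.9) is left out of
the over-10 count to be conservative.

```python
# Integer part of SA score -> message count, copied from the table above.
DISTRIBUTION = {
    5: 269, 6: 377, 7: 695, 8: 494, 9: 379, 10: 523, 11: 503,
    12: 358, 13: 415, 14: 292, 15: 223, 16: 177, 17: 94, 18: 50,
    19: 27, 20: 18, 21: 3, 23: 1, 24: 1,
}

total = sum(DISTRIBUTION.values())
scored_11_or_more = sum(n for score, n in DISTRIBUTION.items() if score >= 11)
scored_6_or_more = sum(n for score, n in DISTRIBUTION.items() if score >= 6)

print("total messages:", total)
print(f"scored 11 or more: {scored_11_or_more} ({scored_11_or_more / total:.1%})")
print("scored 6 or more:", scored_6_or_more)
```

The total agrees with the 4899 messages mentioned earlier, and well
over a third of them carry a score that should have meant rejection at
the old threshold of 10.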

--
Gerald Oskoboiny <[email protected]>
http://impressive.net/people/gerald/