Re: Whitelist, Spam Assassin not enough

* Gerald Oskoboiny <[email protected]> [2003-10-28 00:35-0500]

> I am sometimes tempted to use [a challenge-response system] for
> mail that is trapped by spamassassin, because I don't like the
> thought of false positives just disappearing into a mailbox I
> never check.
>
> But I think I would rather install Exim4 and start rejecting spam
> at SMTP time than start sending challenges to hundreds of
> (probably forged) messages per day.

I did this: impressive.net now rejects any mail that spamassassin
scores higher than 10. Woohoo!

The mail is rejected during the initial delivery attempt, with
this error message:

   550-Sorry, this smells like spam; rejected. For more info, please see
   550 http://impressive.net/people/gerald/2004/01/spam.html

This should reject about two thirds of all mail to my site. I feel
better already :)

This was *really* easy to set up thanks to the exim4-daemon-heavy
package in debian sarge which includes a patch called exiscan-acl.
Details:

   http://impressive.net/weblogs/fogo/2004/01/13/2004-01-13.html

--
Gerald Oskoboiny <[email protected]>
http://impressive.net/people/gerald/

notes on SMTP-time spamassassin rejections

* Gerald Oskoboiny <[email protected]> [2004-01-14 00:38-0500]

> impressive.net now rejects any mail that spamassassin scores
> higher than 10. Woohoo!
>
> The mail is rejected during the initial delivery attempt, with
> this error message:
>
>     550-Sorry, this smells like spam; rejected. For more info, please see
>     550 http://impressive.net/people/gerald/2004/01/spam.html

The following is only marginally interesting to others; I just
wanted to archive some thoughts and recent stats on my spam
filtering setup.

For the last few months any email to impressive.net with an SA
score > 10 has been rejected, and any email I receive with a score
of 5-10 goes into a mailbox called 'probable-spam', which in theory I
could review periodically for false positives, but which in practice
just gets ignored. (It has 38643 messages since Jan 14, 292/day.)
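
As a rough illustration of the routing just described, here is a small
Python sketch. It is not the actual Exim or procmail configuration: the
function name `route` and the action labels are invented for this
example, and treating a score of exactly 10 as not rejected is an
assumption based on the "higher than 10" wording.

```python
# Hypothetical sketch of the score-based routing; not the real config.
REJECT_ABOVE = 10.0  # rejected at SMTP time when the score is higher than this
TAG_FROM = 5.0       # scores from here up to REJECT_ABOVE go to 'probable-spam'


def route(score: float) -> str:
    """Return an (invented) action label for a SpamAssassin score."""
    if score > REJECT_ABOVE:
        return "reject"          # 550 during the initial delivery attempt
    if score >= TAG_FROM:
        return "probable-spam"   # filed in a mailbox that is rarely read
    return "deliver"


if __name__ == "__main__":
    for s in (2.1, 5.0, 9.9, 10.0, 10.1):
        print(s, route(s))
```

The middle band is the one that causes trouble: mail in it is neither
bounced back to the sender nor actually read.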

I hate silently ignoring email, so I wonder if I should decrease
the rejection threshold to 5 or something. Or maybe set up a
challenge/response system for that mail: since mr-burns rejects
forgeries using SPF, I could tell anyone who complains about
bogus challenges to publish SPF records and leave me alone.

The spamassassin that runs at SMTP time is a generic one that
doesn't learn over time because it doesn't have a bayes DB that
it can write to (because it runs as user nobody), so it is much
less effective than it could be.

I had planned to figure out how to set up a bogus user with a
bayes DB that I could train over time, but it seems tricky to do
that with exiscan-acl so maybe I should just configure SA on
mr-burns to use my personal bayes DB.

recent stats:

263 msgs/day rejected by generic SA > 10 at SMTP time (no bayes DB)
in the last week.

292 msgs/day trapped with SA > 5 by autolearning SA (not manually trained)
(since Jan 14)

of those 292,
185 msgs/day scored >10 when rescored using my bayes DB (autolearned)
107 msgs/day scored >5 but <10 when rescored using my bayes DB

so if I switched the global spamassassin config to use my personal
bayes DB on mr-burns, I could start rejecting another 185 messages/
day immediately.

If I lowered the rejection threshold to 5, I could start rejecting
another 107 messages/day.

If I started training SA manually, I could do even better.
(I still get about 50-100 spams/day in my low-priority mailbox:
stuff that scored < 5 and was not from someone on my whitelist)
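
The arithmetic behind the two estimates above (using the personal bayes
DB, then lowering the threshold to 5) can be checked with a few lines of
Python. The variable names are made up for this sketch, and the totals
are only approximate because the 263/day figure covers the last week
while the 292/day figure covers the period since Jan 14.

```python
# Figures quoted in the stats above (messages per day).
rejected_now = 263      # generic SA, score > 10, rejected at SMTP time
trapped = 292           # landed in probable-spam (score > 5)
rescored_over_10 = 185  # of those, > 10 when rescored with the bayes DB
rescored_5_to_10 = 107  # of those, between 5 and 10 when rescored

# The two rescored groups account for everything in probable-spam.
assert rescored_over_10 + rescored_5_to_10 == trapped

# Rejections per day if the SMTP-time SA used the personal bayes DB.
with_bayes = rejected_now + rescored_over_10

# Rejections per day if the threshold were also lowered to 5.
with_bayes_and_threshold_5 = with_bayes + rescored_5_to_10

if __name__ == "__main__":
    print(with_bayes, with_bayes_and_threshold_5)
```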

Oh... I haven't received a single complaint about real mail being
rejected, though ~35k messages have been rejected by now. (But I
wouldn't necessarily hear about list subscriptions being cancelled.)

distributions of stuff that scored >5 using autolearned bayes DB:

   $ cat probable-spam | formail -s formail -XX-Spam-Status: | fmt -1 | egrep ^hits= | cut -d. -f1 | sort -n -t= -k2 | uniq -c
     1 hits=-4
     1 hits=-0
     4 hits=1
    12 hits=2
     7 hits=3
    22 hits=4
  2485 hits=5
  2641 hits=6
  3077 hits=7
  2977 hits=8
  2948 hits=9
  3583 hits=10
  3352 hits=11
  2829 hits=12
  2550 hits=13
  2284 hits=14
  1811 hits=15
  1489 hits=16
  1038 hits=17
   755 hits=18
   535 hits=19
   394 hits=20
   350 hits=21
   285 hits=22
   269 hits=23
   230 hits=24
   202 hits=25
   284 hits=26
   268 hits=27
   337 hits=28
   247 hits=29
   270 hits=30
   214 hits=31
   167 hits=32
   140 hits=33
   133 hits=34
   120 hits=35
    76 hits=36
    53 hits=37
    45 hits=38
    36 hits=39
    43 hits=40
    31 hits=41
    21 hits=42
    11 hits=43
     4 hits=44
     5 hits=45
     3 hits=46
     3 hits=47

(a few messages that scored < 5 went to probable-spam due to
various other filters in my procmailrc)
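
For anyone without formail handy, the same kind of tally can be sketched
in Python. This is an approximation of the pipeline above, not a
replacement for it: the function name `score_histogram` is invented, and
the sample header values in the demo are made up for illustration. Like
`cut -d. -f1`, it keeps only the part of the score before the decimal
point, which is why a bucket such as `-0` can appear.

```python
import re
from collections import Counter
from typing import Iterable

# Capture the integer part of "hits=<score>" from an X-Spam-Status value.
HITS_RE = re.compile(r"\bhits=(-?\d+)")


def score_histogram(status_headers: Iterable[str]) -> Counter:
    """Count X-Spam-Status values by the integer part of their score."""
    counts: Counter = Counter()
    for value in status_headers:
        match = HITS_RE.search(value)
        if match:  # values without a hits= field are skipped
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Invented sample values, only to show the shape of the input.
    sample = [
        "Yes, hits=12.3 required=5.0",
        "Yes, hits=12.9 required=5.0",
        "No, hits=-0.4 required=5.0",
    ]
    hist = score_histogram(sample)
    for bucket, n in sorted(hist.items(), key=lambda kv: int(kv[0])):
        print(f"{n:7d} hits={bucket}")
```

To run it against a real mailbox, the header values could be pulled with
the standard `mailbox` module, for example
`(msg.get("X-Spam-Status", "") for msg in mailbox.mbox(path))`.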

--
Gerald Oskoboiny <[email protected]>
http://impressive.net/people/gerald/

Re: notes on SMTP-time spamassassin rejections

On Tue, 1 Jun 2004, Gerald Oskoboiny wrote:

> 263 msgs/day rejected by generic SA > 10 at SMTP time (no bayes DB)
> in the last week.
>
> 292 msgs/day trapped with SA > 5 by autolearning SA (not manually trained)
> (since Jan 14)
>
> of those 292,
> 185 msgs/day scored >10 when rescored using my bayes DB (autolearned)
> 107 msgs/day scored >5 but <10 when rescored using my bayes DB

How about some stats on the number of emails received from sites with
a valid SPF record? (Even if your stats may be optimistic.)

--
Yves Lafon - W3C
"Baroula que barouleras, au tiéu toujou t'entourneras."

Re: notes on SMTP-time spamassassin rejections

* Gerald Oskoboiny <[email protected]> [2004-06-01 13:01-0400]

> The spamassassin that runs at SMTP time is a generic one that
> doesn't learn over time because it doesn't have a bayes DB that
> it can write to (because it runs as user nobody), so it is much
> less effective than it could be.
>
> I had planned to figure out how to set up a bogus user with a
> bayes DB that I could train over time, but it seems tricky to do
> that with exiscan-acl so maybe I should just configure SA on
> mr-burns to use my personal bayes DB.

I did this, and started training spamassassin on any spam it
misses (maybe 5-10/day), set up a few honeypots (notes below),
and wow, what a huge improvement.

My spam intake has dropped to 1998 levels. It's actually eerily
quiet. I'm worried that I must be rejecting too much stuff, but
can't find any evidence of legit mail being blocked.

To set up honeypot addresses, I checked for the most common
unrouteable addresses in exim's rejectlog (somehow a bunch of
bogus addrs got onto spammers' lists, usually truncated versions
of real addresses, e.g. [email protected]) and turned those
into aliases for a new user I created:

   # spam honeypots (most common unrouteable addrs in rejectlog)
   rald:   spam-honeypot
   ald:    spam-honeypot
   ...

(I could have also just created a bunch of fake addrs and put
those on my web site to be crawled by email harvesting bots, but
might as well use addresses that were already known to spammers.)
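
Picking the honeypot candidates is just a frequency count. A minimal
Python sketch follows, assuming the rejected recipient addresses have
already been extracted from the rejectlog into a list (the log parsing
itself is not shown, since it depends on the log format). The function
name `honeypot_aliases` and the example.org addresses are invented for
illustration.

```python
from collections import Counter
from typing import Iterable, List


def honeypot_aliases(rejected_rcpts: Iterable[str], target: str, top_n: int) -> List[str]:
    """Build alias lines for the most frequently rejected local parts."""
    # Count by local part (the text before the '@'), case-insensitively.
    local_parts = Counter(addr.split("@", 1)[0].lower() for addr in rejected_rcpts)
    return [f"{local}:\t{target}" for local, _ in local_parts.most_common(top_n)]


if __name__ == "__main__":
    # Placeholder addresses; real input would come from the rejectlog.
    rejected = (
        ["rald@example.org"] * 3
        + ["ald@example.org"] * 2
        + ["xyz@example.org"]
    )
    for line in honeypot_aliases(rejected, "spam-honeypot", top_n=2):
        print(line)
```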

The 'spam-honeypot' user has the same uid as gerald so it can
write to my bayes DB, and it feeds all its non-daemon mail into
sa-learn using a procmailrc like this:

   # procmailrc for spam-honeypot user: feed all mail into sa-learn --spam

   PATH=$HOME/bin:/usr/bin:/bin:/usr/local/bin

   :0:
   * ^FROM_DAEMON
   from-daemon

   :0c
   | sa-learn --spam

   :0:
   sa-learned-spam

and ~spam-honeypot/.spamassassin is symlinked to ~gerald/.spamassassin

(spam-honeypot etc above are actually called something else; I
didn't put the real names here because I don't want spammers
finding out the names of my honeypots and poisoning them with
legitimate mail.)

--
Gerald Oskoboiny <[email protected]>
http://impressive.net/people/gerald/

RE: notes on SMTP-time spamassassin rejections

impressive! :-) he he he

OK, but if a message comes in addressed to both a bogus address and a
legit address, that should also lead you to believe it is a bogus email.
For example:

to: [email protected], [email protected],
[email protected]
from: [email protected]
subject: penis vagina penis vagina

crap crap crap crap


Now say spamcity.bastards.org is a legit domain with MX records and
such. The To address is legit too, but because it is accompanied by invalid
addresses that you own, shouldn't this also be rejected? Is that what the
honeypots do: look at the entire SMTP and BSMTP headers?

Cheers,

David.



Re: notes on SMTP-time spamassassin rejections

* Gerald Oskoboiny <[email protected]> [2004-06-01 13:01-0400]

> For the last few months any email to impressive.net with a SA
> score > 10 has been rejected, and any email I receive with score
> 5-10 goes into a mailbox called 'probable-spam' which in theory I
> could review periodically for false positives, but in practice
> just gets ignored. (it has 38643 messages since Jan 14, 292/day)
>
> I hate silently ignoring email, so I wonder if I should decrease
> the rejection threshold to 5 or something.

I have been really happy with my spam blocking setup, and still have
not received a single complaint about valid mail being blocked.

Since June 3, 4899 messages were filtered to my probable-spam
mailbox but not rejected (messages that scored 5-10); I scanned
about 2k of those manually and found two false positives, an
inquiry about using a photo, and someone asking about a hotel
in Italy; both were tagged BAYES_99 which contributed 5.4 to the
score (but both scored < 6 total.)

So I just lowered my threshold for rejection to 6, and I am doing
away with this probable-spam mailbox, so I don't have to worry
about mail being silently ignored any more.
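
A back-of-the-envelope check of the numbers above, as a Python sketch.
The variable names are invented, "about 2k" is taken as exactly 2000,
and the extrapolation to the whole mailbox assumes the messages I
scanned are representative, which may not hold.

```python
filtered_not_rejected = 4899  # messages in probable-spam since June 3
scanned = 2000                # "about 2k" reviewed by hand (approximation)
false_positives = 2           # photo inquiry, hotel question

# Observed false positive rate among the scanned messages.
fp_rate = false_positives / scanned

# Rough estimate for the whole mailbox, assuming a representative sample.
expected_fp_total = fp_rate * filtered_not_rejected

if __name__ == "__main__":
    print(f"false positive rate: {fp_rate:.2%}")
    print(f"estimated false positives in mailbox: {expected_fp_total:.1f}")
```

Both false positives scored below 6, so neither would have been rejected
at the new threshold.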

Oh... 6090 messages have entered my spam honeypot in that time,
fed directly into sa-learn --spam. (messages that scored > 10 were
rejected as usual, and not fed into sa-learn... I wonder if I
should try to exclude my honeypots from smtp-time spam blocking?)

Distribution of SA scores in probable-spam since June 3:

gerald@ogobogo:/home/gerald; cat mail/probable-spam | formail -s formail -c -XX-Spam-Status | cut -d= -f2 | cut -d. -f1 | sort -n | uniq -c
   269 5
   377 6
   695 7
   494 8
   379 9
   523 10
   503 11
   358 12
   415 13
   292 14
   223 15
   177 16
    94 17
    50 18
    27 19
    18 20
     3 21
     1 23
     1 24

(hmm, why so many > 10?? Those should have been rejected.
Maybe some network tests are not being done at smtp time?)
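
To put a number on that, the histogram above can be totalled in Python.
This sketch just re-enters the counts by hand; the variable names are
invented. Since the scores were truncated to their integer part, the
bucket for 10 is left out of the "over 10" count, so the count is if
anything an underestimate.

```python
# Counts copied from the distribution above: {truncated score: messages}.
probable_spam_hist = {
    5: 269, 6: 377, 7: 695, 8: 494, 9: 379, 10: 523,
    11: 503, 12: 358, 13: 415, 14: 292, 15: 223, 16: 177,
    17: 94, 18: 50, 19: 27, 20: 18, 21: 3, 23: 1, 24: 1,
}

total = sum(probable_spam_hist.values())

# Messages whose truncated score is 11 or more, i.e. certainly above 10.
over_10 = sum(n for score, n in probable_spam_hist.items() if score >= 11)

if __name__ == "__main__":
    print(total, over_10, f"{over_10 / total:.0%}")
```

The total matches the 4899 messages mentioned earlier, and 2162 of them
(about 44%) scored 11 or more when they reached the mailbox.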

--
Gerald Oskoboiny <[email protected]>
http://impressive.net/people/gerald/

HURL: fogo mailing list archives, maintained by Gerald Oskoboiny