wiredfool

Spam stats

About a month ago, I added centralized spam scoring to my mailserver using the latest (2.55?) spamassassin. Like most people, I’m worried about missing real mail if I set the drop threshold too low. Currently I’m killing 5-10% of the most blatant spam (and 100% of the email worms, the original reason for adding server-side filtering). Now that I have a month of logs, I can see what the incoming email looks like.

This graph is (sort of) a histogram of the spamassassin scores for all incoming mail to my server that didn’t get killed in the initial virus/worm scan. The red line is all mail; the green line is traffic to addresses that get nothing but spam. It covers about 11 thousand messages in total, of which about 2 thousand went to the spam-only addresses. Unfortunately for the analysis and my sanity, I don’t have an easy way to find the spam scores of known good email. Maybe I can correlate message ids in my mailbox with records in the logs.

The known spam certainly has what looks like a similar distribution to the tail of the main curve, leading me to suspect that these addresses attract a reasonably representative sample of the more blatant spams. About 10% of these emails fall under the traditional ‘5’ threshold for spamassassin.

There are a few peaks in the probably-good region; I’m betting that at least one of these corresponds to a high-volume mailing list that I’m on.

So what does it all mean? Spamassassin isn’t going to cut it without the bayesian type filters that are all the rage now. But for the bayesian thing to work, you need training, feedback and individual preferences, and that’s just not going to work at this layer of the stack. It’s time for lateral thinking.

* It’s a histogram graphed as lines because boxes just looked like too much chartjunk. It really shouldn’t be using solid lines, because it’s not a continuous function. But the area under the lines is actually pretty close to what I’m looking for. With some normalization, it could look like a probability distribution function (or a CDF, which is just as useful for finding cutoffs).
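
For the curious, here is a rough sketch of the normalization described in that footnote, in Python. It is not the script that produced the graph. The function names, the bin width, and the sample scores are all made up for illustration, and it assumes the scores have already been pulled out of the logs into a plain list of numbers.

```python
import math
from bisect import bisect_left


def histogram(scores, bin_width=1.0):
    """Count scores per bin, keyed by the lower edge of each bin."""
    counts = {}
    for score in scores:
        edge = math.floor(score / bin_width) * bin_width
        counts[edge] = counts.get(edge, 0) + 1
    return dict(sorted(counts.items()))


def normalize(counts):
    """Turn raw bin counts into fractions that sum to 1."""
    total = sum(counts.values())
    return {edge: n / total for edge, n in counts.items()}


def fraction_below(scores, threshold):
    """Empirical CDF: the fraction of messages scoring strictly below threshold."""
    ordered = sorted(scores)
    return bisect_left(ordered, threshold) / len(ordered)


def cutoff_dropping(scores, fraction):
    """Score at or above which roughly the top `fraction` of messages fall.

    Returns None when the fraction is too small to drop even one message.
    Ties at the cutoff would drop slightly more than requested.
    """
    ordered = sorted(scores)
    n_drop = int(len(ordered) * fraction)
    if n_drop == 0:
        return None
    return ordered[-n_drop]


# Hypothetical scores, not taken from the real logs.
sample_scores = [0.5, 1.2, 1.9, 3.4, 4.8, 6.1, 7.7, 12.3, 15.0, 22.4]

if __name__ == "__main__":
    print(normalize(histogram(sample_scores, bin_width=5.0)))
    print(fraction_below(sample_scores, 5.0))
    print(cutoff_dropping(sample_scores, 0.2))
```

Running the same functions over the all-mail list and the spam-only list separately would give two curves on the same scale, which is what makes the cutoff comparison meaningful.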
