Is there a way to get spamassassin to score the top lines of a message body more heavily?

Question

A lot of spam is getting through the filter on the mail server I run, using a relatively simple trick: the message starts with a few lines of (incredibly obvious) weight-loss or other scam text, followed by a larger body of text from programming documentation or, most evil of all, text scraped from Stack Exchange. At best, SpamAssassin scores this as BAYES_50, and the rest of each message is constructed carefully enough that it doesn't hit other triggers. (For example, the headers are minimal and correct.) Often, the included excerpts align closely enough with my legitimate interests that the message overall is scored as BAYES_00, because the very spammy tokens are simply overwhelmed by juicy nuggets of sysadmin problem-solving.

The top part is so obviously spammy (and in fact tends to be very similar to messages I have previously received and trained as spam) that I'm kind of amazed it's getting through, but clearly it is. It seems like a separate pass that scored the top 25 (or so) lines of the message and weighted that score heavily would solve the problem. Is there a way to do this?

Several people have suggested writing custom regular expressions. I do not want to get into this, as this is a constant losing battle. It's what people did before Bayesian spam sorting came into widespread use, and it was generally terrible. No human can keep up. It's not much more effective than just hitting the delete key for each spam message, and a lot more work on my part.

Bayesian spam filtering works. It even works on this spam, if I split out the "above the fold" portion and just analyze that part, with the decoy / chaff removed. The question is: how can I get Spamassassin to do that?

@kondybas Yes. And this is part of the problem, as the padding text outweighs the spammy part by sheer quantity. — mattdm, Sep 16 '14 at 11:09
How much Bayesian training have you done on these spams? I'd expect the Bayesian algorithm to work it out before long. — mc0e, Sep 19 '14 at 09:59
@mc0e It can't. It's just not that magically smart. A more sophisticated machine-learning system could probably do it, but I think the, um, "one simple trick" that I'm asking for here would as well. — mattdm, Sep 19 '14 at 14:36
A typical Bayesian spam algorithm takes no notice of where each token appears in the message, but if this class of spam is as obvious as you say, then the weighting of the spammy tokens involved should become strong enough, and the non-spam ones become weakened, after sufficient training. This is not all good - an attacker can teach your Bayesian learner to falsely regard non-spam tokens as spammy, and potentially block good emails in a targeted way. http://bnrg.cs.berkeley.edu/~adj/publications/paper-files/SecML-chapter.pdf — mc0e, Sep 20 '14 at 05:26
Right -- that "this is not all good" is why I was hoping for a simple, _slightly_ smarter approach. — mattdm, Sep 20 '14 at 15:21
Maybe you could write your own plugin. Documentation: http://spamassassin.apache.org/full/3.4.x/doc/Mail_SpamAssassin_Plugin.html — cybernard, Jul 24 '15 at 17:20

score 1 · Answer 1 · answered Sep 10 '15 at 17:12

I am a (somewhat) avid anti-spam fighter myself. Because of many problems like the one you describe, I ended up doing the dirty work myself, years ago.

Now, this is not an answer to your particular question, but to your particular problem. So please don't downvote because of this.

I solved this by modifying the sa_filter-post.pl script used by the XMail server, which calls spamc on the email file and does some minor housekeeping. My version processes not the entire file but specific parts of it, chosen by rules that I hardcoded. Yes, those are regexes, but so far they work for me. (I do have a bunch of other scripts running before and after this one, so that may play a role.)

For example, I have a regex that fishes out phone numbers. When the spammer has left a number in full, that rule fires and only the middle 400 characters of the file are processed. (I got to 400 by trial and error, starting from 200.) Note that it's pretty hard to pick out the middle of what you see in a mail client, compared to what is actually in the file.
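As a rough illustration of that kind of rule, here is a minimal Python sketch (not the Perl code from my setup). The phone-number pattern, the function names, and the fallback of returning the whole text when nothing matches are all assumptions made for the example; only the 400-character window comes from the description above.

```python
import re

# Hypothetical pattern: a loose match for phone-like digit runs.
# A real rule would need tuning against the spam actually received.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def middle_slice(text: str, width: int = 400) -> str:
    """Return the `width` characters centred in `text` (or all of it if shorter)."""
    start = max(0, (len(text) - width) // 2)
    return text[start:start + width]


def select_portion(text: str, width: int = 400) -> str:
    """If a phone number is present, keep only the middle of the message."""
    if PHONE_RE.search(text):
        return middle_slice(text, width)
    return text
```

The selected portion, not the whole file, is what would then be handed to spamc.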

Another rule targets messages that all share the same structure: an HTML table of "products", a dummy header, and an unusable footer. I strip out the header and footer, strip the comments column out of the "products" table, and then pass what is left to spamc.

And so on, you get the picture.

But not all rules are perfect, so I do a little magic here by assigning a private score to each rule. I hardcode that score and tune it up or down when needed, based on how the rule behaves (and sometimes I end up deleting rules altogether). I then adjust the SA score by the private score. I did this because, for some reason, SA only gave scores like 4.something to messages that were clearly spam and that matched rules I strongly wanted to catch. So I gave them just a little boost to push them over 5.0. Coupled with some post-processing scripts that take other variables into consideration (source of the email, target of the email, structure of the header, etc.), this more or less kills the spam.
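The private-score adjustment can be sketched in Python as follows. This is an illustration under assumptions, not my actual script: it assumes the score comes from `spamc -c`, which prints `score/threshold`, and the rule names and boost values are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def adjusted_score(
    spamc_output: str,
    matched_rules: Iterable[str],
    private_scores: Dict[str, float],
) -> Tuple[float, bool]:
    """Add per-rule private scores to the score reported by `spamc -c`.

    `spamc_output` is expected to look like "4.3/5.0".
    Returns the adjusted score and whether it reaches the threshold.
    """
    score_text, threshold_text = spamc_output.strip().split("/")
    boost = sum(private_scores.get(rule, 0.0) for rule in matched_rules)
    score = float(score_text) + boost
    return score, score >= float(threshold_text)


# Hypothetical rule table, tuned by hand over time.
PRIVATE_SCORES = {"phone_number": 1.0, "product_table": 0.8}
```

For example, a message that SA scored as `4.3/5.0` and that matched the hypothetical `phone_number` rule would be pushed over the 5.0 threshold.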

Now, I realize this isn't what you were hoping for, but in my case it gives me a whole lot of power over what gets scanned. The cost is that I need to set things up manually and, every now and then, do little touch-ups on the values and regexes.

But in your case things are a lot easier. All you have to do is write a simple bash script that your MX calls instead of spamc. That script uses the head command to keep only the first however-many bytes you want, writes them to a temporary file, and passes that file to spamc.

The contents of the script will depend a bit on your mail server, but that shouldn't be hard to figure out.
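Here is a minimal sketch of the truncation step, written in Python rather than bash with head so that the headers can be kept whole and only the body is cut. The function names and the 25-line default (taken from the question) are assumptions. It ignores MIME structure, so a multipart or base64-encoded body would need decoding first. The only spamc feature it relies on is the `-c` option, which prints `score/threshold`.

```python
import re
import subprocess


def truncate_message(raw: bytes, max_body_lines: int = 25) -> bytes:
    """Keep all headers and only the first `max_body_lines` lines of the body."""
    # Headers end at the first blank line (LF or CRLF line endings).
    match = re.search(rb"\r?\n\r?\n", raw)
    if match is None:
        return raw  # no body found; nothing to truncate
    headers = raw[:match.end()]
    body_lines = raw[match.end():].splitlines(keepends=True)
    return headers + b"".join(body_lines[:max_body_lines])


def check_with_spamc(raw: bytes) -> str:
    """Pipe a message to `spamc -c` and return its "score/threshold" output."""
    result = subprocess.run(
        ["spamc", "-c"], input=raw, stdout=subprocess.PIPE, check=False
    )
    return result.stdout.decode().strip()
```

A wrapper called by the MX would read the message, call `check_with_spamc(truncate_message(message))`, and act on the result. How the message reaches the script and how the verdict is reported back depend on the mail server.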

(Note that I only described my setup in this much detail so that you can see the possibilities of this approach.)

PS: I personally have never received this kind of spam email (with programming-related goodies in it), so I wonder whether you have pissed someone off and are now being targeted. That would explain the specially crafted emails. The reason I think of this possibility is that years ago, when I was very active on various IT forums and groups, I did piss some people off, and every now and then I used to get various types of attacks on my server, including email spamming. But back then the idiots weren't this smart :)
