I'm using SpamAssassin on Debian (the default configuration with Pyzor, AWL and Bayes disabled, and sa-compile enabled). Each spamd child process consumes around 100 to 150MB of memory (around 50MB of real memory) on the 32-bit servers, and about double this (logically enough) on the 64-bit servers. There are generally two child processes, but at busy times there can be five (the maximum) running.

ISTM that 200 to 600MB is a lot of memory for this task. I'd like to continue using SA as part of my filtering structure, but it's becoming difficult to justify so much memory.

Are there any ways to reduce the amount of memory that each child process uses? (Or alternatively, to make a single child process so fast that I can set the maximum number of children to something like 2?) I'm willing to consider any options, including ones that will or may result in reduced accuracy.

I've already read the "Out of Memory Problems" page on the SA wiki; nothing there is of any use. Messages larger than 5 MB are not scanned with SA.

Tony Meyer
  • Note that forked children may use much less physical RAM than the sum of the numbers ps or top show. This is due to the copy-on-write strategy when forking. – David Schmitt May 04 '09 at 10:40

5 Answers

I think you're misunderstanding the way Linux reports memory usage. When a process forks, the result is a second process that shares a lot of resources with the original, including memory. Linux uses a technique known as Copy On Write (COW) for this: each forked child sees the same data in memory as the original process, and only when that data is changed (by the child or the parent) is the affected page copied, so that the writer gets its own private copy at a new location.

Until one of the processes makes changes to that data, they are sharing the same copy. As a result, I could have a process that uses 100MB of RAM, and fork it 10 times. Each of those forked processes would show 100MB of RAM being used, but if you looked at the overall memory usage on the box, it might only show that 130MB of RAM is being used (100MB shared between the processes, plus a few MB of overhead, plus another dozen MB or two for the rest of the system).
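The arithmetic behind that example can be put into a few lines of Python. This is only a rough model: the function name is made up for illustration, and the 2MB of private (modified) memory per child is an assumed figure, not a measurement.

```python
def estimated_total_mb(reported_mb, private_mb, n_processes):
    """Rough physical RAM used by n forked processes under copy-on-write.

    reported_mb: per-process figure shown by ps/top (shared + private pages)
    private_mb:  pages each process has modified and therefore owns alone
    """
    shared_mb = reported_mb - private_mb   # counted once, however many forks
    return shared_mb + private_mb * n_processes


# Ten forks of a 100MB process that have each dirtied about 2MB (assumed):
print(estimated_total_mb(100, 2, 10))   # 118, not the 1000 that summing ps output suggests
```

The more memory each child modifies after the fork, the closer the total moves towards the naive sum.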

As a final example, I have a box right now with 30 apache processes running. Each process is showing a usage of 22MB of RAM. However, when I run free -m to show my overall RAM usage, I get:

topher@crucible:/tmp$ free -m
             total       used       free     shared    buffers     cached
Mem:           349        310         39          0         24         73
-/+ buffers/cache:        212        136
Swap:          511         51        460

As you can see, this box doesn't have nearly enough RAM to run 30 processes that were each using 22MB of "real" RAM: that would be 660MB, and the box has 349MB in total. Unless you're literally running out of RAM or your apps are swapping heavily, I wouldn't worry about things.

UPDATE: Also, check out this tool called smem, mentioned by jldugger in the answer to another question on Linux memory usage here.
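If smem isn't installed, a similar per-process figure can be read from /proc/<pid>/smaps on kernels that expose a Pss (proportional set size) field, where each shared page is divided among the processes sharing it. Below is a minimal sketch; the function names are my own, and reading another user's smaps file usually requires root.

```python
import re

# Matches lines such as "Rss:                1234 kB" in /proc/<pid>/smaps.
# The colon right after the name keeps fields like "Pss_Anon:" from matching.
_FIELD = re.compile(r"^(Rss|Pss):\s+(\d+) kB$", re.MULTILINE)


def rss_and_pss_kb(smaps_text):
    """Return (rss_kb, pss_kb) summed over every mapping in one smaps dump."""
    totals = {"Rss": 0, "Pss": 0}
    for field, kb in _FIELD.findall(smaps_text):
        totals[field] += int(kb)
    return totals["Rss"], totals["Pss"]


def read_smaps(pid):
    """Read /proc/<pid>/smaps (Linux only)."""
    with open(f"/proc/{pid}/smaps") as f:
        return f.read()
```

Calling rss_and_pss_kb(read_smaps(pid)) for each spamd child and adding up the Pss values should give a total closer to what the children really cost than adding up their RSS.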

Christopher Cashell
  • I am literally running out of RAM, so I do need to worry about it. However, it could be that it's other processes that are consuming the RAM, and SA isn't using so much. – Tony Meyer May 04 '09 at 21:56
  • From my observation and using the tool *smem*, it looks like spamassassin uses around 50 MB of RAM, and that if you fork it into multiple processes, almost all their memory is shared memory, so it'll still use around 50 MB of RAM total amongst all processes, even though *ps* reports each one having a RSS of 50 MB. YMMV. – thomasrutter Dec 07 '12 at 00:44

Using sa-compile you might be able to improve the matching speed of many rules.

David Schmitt

Here's what I have done.

I have a set-up where a lot of messages tend to be delivered roughly at the same time; for a series of experiments I run SA on messages which are copied to a temporary spool and then delivered by a cron job every five minutes.

spamd kept printing "maybe you should increase the max-children parameter", and at one point I had raised it to 40, but then the server consumed all its swap space and crashed.

Now I have implemented a different regime where delivery is governed by a Procmail lock file. Because it was simple to do, I just use the last digit of the process ID, and run with 10 children. I'm not at all sure this is optimal, but it has already helped avoid the insane load peaks I would experience from time to time.

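LINEBUF=10240

# Grab last digit of PID for lockfile
PID=$$
# \/ starts the captured part of the regex; procmail stores it in $MATCH
:0
* PID ?? ()\/[0-9]$
{ D=$MATCH }
# Messages over 512000 bytes are not passed to spamc
:0
* > 512000
{ SA="(too large)" }
# Otherwise (E), pipe the message to spamc while holding one of ten lock files
:0Ew:/tmp/20spamc.$D
SA=| spamc -p 38783 -l -y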
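
For anyone calling spamc from a script instead of Procmail, the same "one lock per last PID digit" idea could be sketched in Python roughly like this. It is not what I run: it uses flock() where Procmail uses its own lock files, and the function names are made up for illustration.

```python
import fcntl
import os
import subprocess

SLOTS = 10  # one lock file per last PID digit, as in the recipe above


def lock_path(pid, prefix="/tmp/20spamc."):
    """Map a process ID to one of SLOTS lock files using its last decimal digit."""
    return f"{prefix}{pid % SLOTS}"


def run_with_slot(command, message_bytes, pid=None, prefix="/tmp/20spamc."):
    """Run command on message_bytes while holding this process's slot lock."""
    pid = os.getpid() if pid is None else pid
    with open(lock_path(pid, prefix), "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)      # blocks while another holder has this slot
        try:
            result = subprocess.run(command, input=message_bytes,
                                    stdout=subprocess.PIPE, check=False)
            return result.stdout
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

With the command set to ["spamc", "-p", "38783", "-l", "-y"], at most ten spamc calls can run at once, since two deliveries whose PIDs end in the same digit have to wait for each other.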

In addition, I start up spamd with a number of ulimit restrictions. The numbers were taken from http://svn.apache.org/repos/asf/spamassassin/trunk/contrib/run-masses except that I removed the ulimit -u restriction. (Not sure what's going on. 32 is way too small in any event. With something like 500 I could keep spamd running for a while, but it would eventually run into the limit.)

ulimit -v 204800
ulimit -m 204800
ulimit -n 256
#ulimit -u 32

perl -T -I lib -w spamd --min-children 2 --max-children 10 --max-spare 5 etc etc
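
If spamd is started from a Python wrapper and not from a shell script, the same limits can probably be applied with the standard resource module. This is a sketch under that assumption, with function names of my own; note that ulimit takes -v and -m in kilobytes while setrlimit takes bytes, and that as far as I know current Linux kernels do not enforce the RSS limit (-m) at all.

```python
import resource
import subprocess

KB = 1024

# Same values as the ulimit lines above, converted to bytes where needed.
LIMITS = {
    resource.RLIMIT_AS: 204800 * KB,    # ulimit -v 204800
    resource.RLIMIT_RSS: 204800 * KB,   # ulimit -m 204800
    resource.RLIMIT_NOFILE: 256,        # ulimit -n 256
}


def apply_limits():
    """Set soft and hard limits; meant to run in the child just before exec."""
    for which, value in LIMITS.items():
        resource.setrlimit(which, (value, value))


def start_limited(command):
    """Start command with the limits applied to it and everything it forks."""
    return subprocess.Popen(command, preexec_fn=apply_limits)
```

The limits are inherited by every spamd child, so a child that grows past about 200MB of address space fails its allocation rather than pushing the machine into swap.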

I guess I will end up with delivery failures if the load is too high for an extended time, but so far, it seems I have managed to reduce the load to manageable levels with this; and a bunch of failed deliveries is still much better than the machine running out of swap.

tripleee

High load averages are (sometimes) an indirect symptom that your machine is running out of RAM (and spending lots of time swapping pages between RAM and disk), so you could try configuring your mail server not to pass mail through SpamAssassin when the load average is too high.
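
If your delivery path goes through a script, the check itself is small. Here is a minimal Python sketch; the function name and the threshold are made up for illustration, and the right threshold depends on the machine.

```python
import os


def should_scan(max_load, loadavg=None):
    """Return True when the 1-minute load average is at or below max_load.

    loadavg can be passed in as a (1min, 5min, 15min) tuple for testing;
    by default it is read from the system with os.getloadavg().
    """
    one_minute = (os.getloadavg() if loadavg is None else loadavg)[0]
    return one_minute <= max_load
```

A delivery script would call should_scan(4.0) (an arbitrary example value) before invoking spamc, and deliver the message unscanned when it returns False.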

You don't mention which MTA you're running, but if you're calling SA from an access control list in exim4, then the suggestion at the bottom of this message is effective.

Also, you can relieve the load on SA, and thus reduce its memory usage, by putting some other, less resource-intensive spam-filtering methods in front of it (i.e. so they process and reject some spam before it gets to SA) - for instance, greylisting and sender verify callouts use relatively little RAM.

David North
  • On a related note, I am seriously considering ditching SA in favour of dspam on a couple of servers I run, as dspam is allegedly less RAM-hungry. – David North Jul 29 '09 at 20:22
  • As a middle ground, you could run a Bayesian filter as a first step, and fall back to SpamAssassin only for the messages for which the first filter did not come up with a clear verdict. Spammers tend to repeat themselves a lot so you could probably handle the vast majority of cases without SpamAssassin, but still have it available for new outbreaks etc. – tripleee Nov 19 '12 at 08:51

We were in a similar situation several months ago. SpamAssassin and ClamAV were using lots of memory on a hosted server. We had the option of adding more memory to the server, but it turned out to be more cost- and time-effective to switch over to Postini. YMMV.

Gerald Combs