
On one of our slave DNS servers, BIND (version bind910-9.10.0P2_3) constantly gets killed with the following message in /var/log/messages:

Jul 30 01:00:10 cinnabar kernel: pid 602 (named), uid 53, was killed: out of swap space

This service runs on a FreeBSD 10.0 VM on XenServer 6.2 with 512MB of system memory.

At the moment, pstat -m -s returns this:

Device          1M-blocks     Used    Avail Capacity
/dev/ada0p3           512        9      502     2%

I don't think it's a swap problem; it looks more like a memory leak, but I'm not sure.
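I suppose checking whether named itself keeps growing would be something along these lines (my guess at the relevant ps columns):

ps -o pid,rss,vsz,command -p $(pgrep named)
top -o res

but I'm not sure what values would count as normal for a box this size.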

EDIT: Access information.

This is one of two slave DNS servers. They only store the zones from the authoritative server and act as recursive resolvers for internal users reaching the outside world. The number of clients is somewhere between 700 and 1500 simultaneous users. Since we have a /21 internal space and a /23 public IPv4 space and there are no queries coming from the outside world, port 53 is even blocked on the firewall for those machines.

Vinícius Ferrão
  • Is this a public DNS server? Does it answer recursive queries (i.e. is it answering all requests for any name, even those it is not authoritative for)? If it is an "open" DNS server for which both of those are true, you're probably seeing abusive behavior from bots, which can cause BIND to grow to somewhat large sizes ("large" for a system as small as yours, anyway). We also need more information, both from the log and about your configuration, in order to provide anything other than guesses. – swelljoe Jul 30 '14 at 23:22
  • Added info to the question. – Vinícius Ferrão Jul 31 '14 at 03:54
  • You may want to look at the [`max-cache-size`](http://ftp.isc.org/isc/bind9/cur/9.10/doc/arm/Bv9ARM.ch06.html#server_resource_limits) setting. But to know if there's actually any real chance of that helping, you would have to confirm whether it's actually `named` that is exhausting system memory or if it's an innocent victim when the system is memory-starved for some other reason. – Håkan Lindqvist Jul 31 '14 at 06:05
  • Håkan, I'm using the default settings for max-cache-size, so I don't know what the value is; is there a way to find out? And I really think there's some leakage with named; this VM only runs BIND9 and the default FreeBSD utils. – Vinícius Ferrão Jul 31 '14 at 14:44
  • The linked documentation does explain the default behaviour (expiration based only on TTL). – Håkan Lindqvist Jul 31 '14 at 16:41
  • The likelihood of it being leakage in named is pretty low; I would consider that far down the list of things you should be looking for first. There are many reasons for named to become quite large...and it could be other processes. – swelljoe Jul 31 '14 at 17:16
  • @swelljoe any ideas? As I said, only named is running on this machine. – Vinícius Ferrão Jul 31 '14 at 19:33
  • Did you implement the suggestion of `max-cache-size`? If so, did it help or is the problem something else (it is guesswork, after all)? In general, I can only suggest to observe actual behavior (monitoring memory usage seems like a reasonable starting point, then continuing from there) rather than drawing conclusions from what you expect. – Håkan Lindqvist Aug 01 '14 at 08:39

1 Answer


If you have any kind of monitoring on this server, it would be good to check whether there are any peaks in memory usage right around the time the process gets killed. Then you could try to find a correlation with the number of requests, etc.
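If nothing is in place yet, even a crude sampler gives you that data; a minimal sketch, assuming a plain sh loop with an arbitrary log path and interval:

#!/bin/sh
# log named's resident/virtual size once a minute (path and interval are arbitrary)
while true; do
    printf '%s ' "$(date '+%F %T')" >> /var/log/named-mem.log
    ps -o rss,vsz -p "$(pgrep named)" | tail -1 >> /var/log/named-mem.log
    sleep 60
done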

That said, it could mean there is indeed no memory left on the system, but more likely BIND is requesting a contiguous area of memory, fragmentation is getting in the way, and FreeBSD is trying to swap out some processes to make room for it. It probably can't swap out enough pages, fails the allocation and triggers the out-of-memory killer.

If you have disk space, the easiest solution is to add more swap through a swap file (no need for a partition). Ideally, you should also limit the cache size (BIND defaults to unlimited), as suggested by Håkan, though that could have a performance impact. Without more statistics it's really hard to tell. Even home routers have 512MB of RAM nowadays, so you should consider increasing that (and limiting the cache) for a production server serving 700-1500 simultaneous users (which could translate into many more requests per second; again, without more information it's hard to tell).
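For the swap file, the FreeBSD 10 handbook approach is roughly the following (file location and size are just examples):

dd if=/dev/zero of=/usr/swap0 bs=1m count=1024
chmod 0600 /usr/swap0
echo "md99 none swap sw,file=/usr/swap0,late 0 0" >> /etc/fstab
swapon -aL

And capping the cache is a one-line change in named.conf; for instance, a conservative limit for a 512MB box (the exact value is a guess, tune it to your traffic):

options {
        max-cache-size 128M;    // 9.10 defaults to unlimited
};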

You could also try tweaking the malloc implementation via the MALLOC_PRODUCTION knob, but I think that is too extreme given the easier solutions available.

Giovanni Tirloni