
While backing up a fairly large folder (450G) to a 2TB drive that sits in the server solely as a backup destination, rdiff-backup (version 1.2.8, the last release marked stable) caused a kernel panic.

System:

Linux giorgio 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64 GNU/Linux

Disks: two 1TB disks in a software RAID mirror, one 2TB disk solely for backups.

I have a suspicion: memory on the server is 2G RAM + 2G swap = 4G. There are files up to 16G in size. Is it possible that rdiff-backup at some point loads the entire file into memory?

In any case, a kernel panic should not have happened: the rdiff-backup process was killed, so shouldn't its memory have been made available again? So my question has two parts: one about my suspicion, and one about the kernel panic.

By the way, the panics started only recently. Quite a number of backups, both full and incremental, had already succeeded, and those multi-gigabyte files were already there. So I guess it's the new Debian kernel's fault rather than rdiff-backup's?

Logfile section from the time of the panic: http://pastebin.com/e9a5fQdh

Last thing on the screen:

EDIT/Update: I just tried creating a 20GB swap file (with dd from /dev/zero) and the server went down again; it no longer responds to ping.

From looking at the logs, it seems the kernel killed some processes, including the one I suspect of having caused it all (rdiff-backup), but it says "running out of killable processes". Did killing those processes not free the memory?

Mörre

1 Answer


The OOM killer didn't kill rdiff-backup. It should have, but the process's oom_score_adj is -1000, which exempts it from being killed.

This is caused by a bug in sshd. The bug is fixed, but the fix won't be available until the next release, which is OpenSSH 6.5.

If you reload sshd, it fails to set the oom_score_adj of the new shells it creates back to 0. As a result, every process you spawn via SSH (your bash shell and any child processes it creates) ends up with an oom_score_adj of -1000 and can hog all the memory without the OOM killer touching it.
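To see which processes are currently exempt from the OOM killer in the way described above, one could scan /proc for an oom_score_adj of -1000. This is only a minimal sketch, assuming a Linux /proc layout with per-process `oom_score_adj` and `comm` files; the function name `list_oom_exempt` and its optional directory argument are made up for illustration and are not part of any tool.

```shell
#!/bin/sh
# list_oom_exempt [PROC_ROOT]
# Prints "PID NAME" for every process whose oom_score_adj is -1000.
# PROC_ROOT defaults to /proc; the argument is a hypothetical convenience
# so the function can be pointed at a test directory.
list_oom_exempt() {
    proc_root="${1:-/proc}"
    for dir in "$proc_root"/[0-9]*; do
        # Processes may exit while we iterate, so skip anything unreadable.
        [ -r "$dir/oom_score_adj" ] || continue
        adj=$(cat "$dir/oom_score_adj" 2>/dev/null) || continue
        if [ "$adj" = "-1000" ]; then
            name=$(cat "$dir/comm" 2>/dev/null) || name=""
            printf '%s %s\n' "${dir##*/}" "${name:-?}"
        fi
    done
}
```

Running `list_oom_exempt` in an SSH session on an affected host would be expected to show the login shell and its children alongside sshd itself. On a healthy system only a few daemons that deliberately protect themselves should appear.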

The quickest way to fix this (assuming 7567 is the PID of sshd, as in your case):

  • Run echo 0 >/proc/7567/oom_score_adj
  • Restart sshd.

Until the fix is in place (OpenSSH 6.5 should have it), do not reload sshd; restart it instead.
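The first step can be wrapped so that it shows the value before and after the write. This is a hedged sketch rather than an official procedure: `reset_oom_score_adj` is a hypothetical helper name, the optional second argument exists only so the function can be tried against a scratch directory, and the restart command in the comment assumes Debian's service name `ssh`.

```shell
#!/bin/sh
# reset_oom_score_adj PID [PROC_ROOT]
# Writes 0 to the process's oom_score_adj (the kernel file is named
# oom_score_adj) and prints the value before and after.
# PROC_ROOT defaults to /proc and is a hypothetical testing convenience.
reset_oom_score_adj() {
    pid="$1"
    proc_root="${2:-/proc}"
    file="$proc_root/$pid/oom_score_adj"
    if [ ! -w "$file" ]; then
        echo "cannot write $file (wrong PID, or not root?)" >&2
        return 1
    fi
    echo "before: $(cat "$file")"
    echo 0 > "$file" || return 1
    echo "after:  $(cat "$file")"
}

# Afterwards restart (do not reload) sshd, for example on Debian,
# assuming the service is called "ssh":
#   service ssh restart
```

Run as root, `reset_oom_score_adj 7567` would apply it to the sshd PID from the question. Shells that were already started with -1000 keep that value, so log in again after the restart before launching the backup.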

The bug is reported and fixed here. https://bugzilla.mindrot.org/show_bug.cgi?id=2156

Matthew Ife
  • Wonderful news and so fast - will test and declare as answer once verified :-) – Mörre Dec 27 '13 at 19:10
  • Do you also happen to know if I really have to have as much total RAM as the size of the largest file handled by rdiff-backup? Because that seems to be the problem. It could be that this (only) happens when calculating incremental backups, I only looked at the "leftovers" briefly thus far. – Mörre Dec 27 '13 at 19:24
  • It's very active page cache. You have 931268 kB of dirty pages and 852964 kB of read data, which together make up the lion's share of the memory. This region of memory has to be really *really* active for the kernel to resort to the OOM killer at the default swappiness. In your case it's probably best to `echo 1 >/proc/sys/vm/oom_kill_allocating_task` or switch the overcommit mode to 2. – Matthew Ife Dec 27 '13 at 20:04
  • Can you tell me what your swappiness is set to and, more importantly, what `cat /proc/sys/vm/dirty_expire_centisecs`, `cat /proc/sys/vm/dirty_ratio` and `cat /proc/sys/vm/dirty_background_ratio` report? I find this behaviour very odd: it is very curious that the number of dirty pages on your system could reach half the total amount of RAM you have. – Matthew Ife Dec 27 '13 at 20:17
  • Debian defaults. This is a very low-load (samba file) server, the backup task is by far the most taxing one it has all day, especially the incremental ones. `swappiness=60`, `dirty_expire_centisecs=3000`, `dirty_ratio=40`, `dirty_background_ratio=40` Looks like I should resize some partitions to make room for 32G of swap - this is really silly. Maybe I can get `rdiff-backup` to skip incremental backups for huge files, I don't think the simple file copies try to allocate RAM for the entire file. – Mörre Dec 28 '13 at 12:42
  • `dirty_ratio=40` and `dirty_background_ratio=40` are too high on a low-RAM server; set them to 15. rdiff-backup will run slowly, but it shouldn't OOM the host. – Matthew Ife Dec 28 '13 at 12:47
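The tunables discussed in these comments can be read in one go before and after changing them. The following is a minimal sketch, assuming the usual files under /proc/sys/vm; `show_vm_tunables` and its optional directory argument are hypothetical names for illustration. The commented `sysctl -w` lines reflect one reading of the advice above, namely setting both ratios to 15.

```shell
#!/bin/sh
# show_vm_tunables [VM_DIR]
# Prints name=value for the vm settings mentioned in the comments above.
# VM_DIR defaults to /proc/sys/vm; the argument is a hypothetical
# convenience so the function can be pointed at a test directory.
show_vm_tunables() {
    vm_dir="${1:-/proc/sys/vm}"
    for name in swappiness dirty_expire_centisecs dirty_ratio \
                dirty_background_ratio oom_kill_allocating_task \
                overcommit_memory; do
        if [ -r "$vm_dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$vm_dir/$name")"
        else
            printf '%s=unavailable\n' "$name"
        fi
    done
}

# To try the suggested values at runtime (as root; not kept across reboots):
#   sysctl -w vm.dirty_ratio=15
#   sysctl -w vm.dirty_background_ratio=15
```

Calling `show_vm_tunables` before and after the change gives a quick record of what was actually in effect during a backup run.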