
I have multiple machines sharing a home directory over NFS, used by 6–10 users. All of the machines, including the one running the NFS server, are used for computational experiments. It is rare but possible that an experiment causes an out-of-memory (OOM) condition. The offending user process may eventually be killed, but I would like to know how an OOM situation can affect the NFS server and, in turn, the other machines. I searched but could not find a specific answer. Also, are there any measures I can take to keep an OOM event from affecting the NFS share?

NFS server configuration: Intel Core i7-9700, 32 GB RAM, 32 GB swap, TITAN RTX GPU. The other machines have similar configurations.

rmah
  • If there is a risk, split the service onto another host! In a cluster environment, there is often a master node dedicated to NFS shares, the SSH access gateway, the scheduler, host provisioning... – Dom Jul 12 '20 at 07:00
  • @Dom Thanks for the advice. However, all the machines (5 in total) have good specs and can be used to run computational experiments. Dedicating a machine just to host the NFS server would waste hardware. I am wondering if anything can be done to make sure the NFS service always has enough resources to keep serving the home directories. – rmah Jul 12 '20 at 07:12

2 Answers


I would limit the process memory with ulimit or with cgroups. You need to limit both RSS and shared memory. Another approach would be to run the experiments in a container or a VM.
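For example, a rough sketch of both approaches, assuming systemd with cgroup v2 is in use; the limit values and the script name are placeholders you would tune for your own workload:

    # Run one experiment under a transient cgroup with a hard memory cap
    # (24G is illustrative, chosen to leave headroom for the NFS server).
    systemd-run --scope -p MemoryMax=24G -p MemorySwapMax=4G ./run_experiment.sh

    # Alternatively, cap the virtual address space of the current shell and its
    # children with ulimit (value in KiB, again illustrative).
    ulimit -v 25165824   # ~24 GiB
    ./run_experiment.sh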

Probably the easiest approach is to use a container: Docker, Podman, LXC...
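If you go the container route, a minimal sketch with Docker follows; the image name is hypothetical and the memory limits are again illustrative:

    # Hard memory cap for the container; setting --memory-swap equal to --memory
    # prevents the container from using additional swap.
    docker run --rm --memory=24g --memory-swap=24g my-experiment-image ./run_experiment.sh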

Mircea Vutcovici

By default, when Linux runs out of memory, it uses a heuristic to decide which processes to kill in order to recover enough memory to continue. This is often not what you want, though. In many cases (probably including this one) it would be better to kill the process that caused the out-of-memory condition.

You can set the vm.oom_kill_allocating_task sysctl to cause the OOM killer to kill the process which ran the system out of memory.
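A minimal sketch of setting it, both on the running system and persistently across reboots (the file name under /etc/sysctl.d/ is just a conventional choice):

    # Apply immediately on the running kernel.
    sysctl -w vm.oom_kill_allocating_task=1

    # Persist across reboots, then reload all sysctl configuration files.
    echo 'vm.oom_kill_allocating_task = 1' > /etc/sysctl.d/90-oom.conf
    sysctl --system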

Michael Hampton
    "Causing the condition" is more random than "biggest process", i.e. that is exactly the opposite of what OP needs, because if the experiment got the last free page, any allocation done by the NFS server will kill the NFS server then. – Simon Richter Jul 13 '20 at 09:18
  • @SimonRichter The NFS server, though, isn't really going to be asking for much memory during its operation. And since it's run as kernel threads, it gets priority over user processes anyway. – Michael Hampton Jul 13 '20 at 14:01