
I have a problem with server slowdowns in a very specific scenario. The facts are:

  • 1) I use the computational application WRF (Weather Research and Forecasting)
  • 2) I use a dual Xeon E5-2620 v3 machine with 128 GB RAM (NUMA architecture, probably related to the problem!)
  • 3) I run WRF with mpirun -n 22 wrf.exe (I have 24 logical cores available)
  • 4) I use CentOS 7 with the 3.10.0-514.26.2.el7.x86_64 kernel
  • 5) Everything works OK in terms of computational performance until one of these things happens:
  • 5a) the Linux file cache gets some data, or
  • 5b) I use tmpfs and fill it with some data

In the 5a or 5b scenario, WRF suddenly starts to slow down, sometimes becoming as much as ~5x slower than normal.

  • 6) RAM does not get swapped; it is not even close to happening, I have around 80% of RAM free in the worst-case scenario!
  • 7) vm.zone_reclaim_mode = 1 in /etc/sysctl.conf seems to help a bit by delaying the issue in the 5a scenario
  • 8) echo 1 > /proc/sys/vm/drop_caches resolves the problem completely in the 5a scenario and restores WRF performance to maximum speed, but only temporarily, until the file cache gets data again, so I run this command from cron (see the snippet after this list; don't worry, it IS ok, I use the computer only for WRF and it does not need the file cache to work at full performance)
  • 9) but the above command still does nothing in the 5b scenario (when I use tmpfs for temporary files)
  • 10) performance is restored in the 5b scenario only if I manually empty the tmpfs
  • 11) It is not a WRF or MPI problem
  • 12) This happens only on this one computer type, and I administer a lot of them for the same/similar purpose (WRF). Only this one has a full NUMA architecture, so I suspect that has something to do with it
  • 13) I also suspect the RHEL kernel has something to do with this, but I'm not sure; I haven't tried reinstalling with a different distro yet
  • 14) numad, and invoking mpirun via numactl (e.g. "numactl -l"), did not make any difference
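For completeness, this is roughly what the workaround from points 7) and 8) looks like on my machine (the cron interval and the /etc/cron.d path are just what I happen to use, adjust to taste):

# /etc/sysctl.conf - prefer reclaiming memory on the local NUMA node before going to a remote one
vm.zone_reclaim_mode = 1

# /etc/cron.d/drop_caches - drop the page cache every 10 minutes (interval chosen arbitrarily)
*/10 * * * * root echo 1 > /proc/sys/vm/drop_caches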

Let me know if you have any ideas for how to avoid those slowdowns.

One idea came to me after following some of the "Related" links on this question: can Transparent Huge Pages be a source of this problem? Some articles strongly suggest that THP does not play well on NUMA systems.
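If it helps, this is how I intend to check and temporarily disable THP to test that theory (runtime-only changes, lost on reboot; on CentOS 7 the default is usually [always]):

cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag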


1 Answer


I'd suggest enabling the numad service:

yum install numad
systemctl enable numad
systemctl start numad

numad should be able to take care of memory locality automatically. Situations where a process runs on a CPU of the first NUMA node while its data sits in RAM local to the second NUMA node should no longer happen (unless the amount of memory needed is bigger than the capacity of the RAM local to a single NUMA node).
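To see whether remote-node allocations are actually happening, you can watch the per-node counters while WRF runs (numastat and numactl are part of the numactl tooling; adjust the process name if your binary is not called wrf.exe):

numastat                  # growing numa_miss / numa_foreign means allocations land on the wrong node
numastat -p wrf.exe       # per-node memory breakdown of the WRF processes
numactl --hardware        # NUMA topology and free memory per node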

I'd also suggest configuring the tuned service with the profile that best matches your usage scenario. You'd have to measure the differences and pick the best one (or create a customized profile).
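A minimal sketch of trying the stock profiles (the profile name below is one of the standard RHEL/CentOS 7 ones; measure before settling on anything):

yum install tuned
systemctl enable tuned
systemctl start tuned
tuned-adm list
tuned-adm profile throughput-performance
tuned-adm active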


Maybe I've found the reason for the strange behaviour on your node. I searched for mpirun and found its man page:

https://www.open-mpi.org/doc/current/man1/mpirun.1.php

It says:

Quick Summary

If you are simply looking for how to run an MPI application, you probably want to use a command line of the following form:

% mpirun [ -np X ] [ --hostfile <filename> ] <program>

This will run X copies of <program> in your current run-time environment (if running under a supported resource manager, Open MPI's mpirun will usually automatically use the corresponding resource manager process starter, as opposed to, for example, rsh or ssh, which require the use of a hostfile, or will default to running all X copies on the localhost), scheduling (by default) in a round-robin fashion by CPU slot. See the rest of this page for more details.

Please note that mpirun automatically binds processes as of the start of the v1.8 series. Three binding patterns are used in the absence of any further directives:

Bind to core: when the number of processes is <= 2

Bind to socket: when the number of processes is > 2

Bind to none: when oversubscribed

If your application uses threads, then you probably want to ensure that you are either not bound at all (by specifying --bind-to none), or bound to multiple cores using an appropriate binding level or specific number of processing elements per application process.

In your case, with -n 22, no binding is applied and the threads can be relocated. You may try external CPU binding (e.g. with taskset). You'll have to experiment.
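For example (the flag names come from the Open MPI man page linked above, so double-check them against the mpirun version you actually have installed):

# bind each of the 22 ranks to its own core and print the resulting bindings
mpirun -n 22 --bind-to core --map-by core --report-bindings wrf.exe

# or, as an external alternative, pin the whole job to a fixed CPU set
taskset -c 0-21 mpirun -n 22 wrf.exe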

  • Hello, thank you for the suggestion. I forgot to mention in my original question that this was actually the first thing I tried. I also used numactl to start the process. Unfortunately, neither method made any difference in those slowdown situations. – Ivan Toman Nov 02 '17 at 15:35
  • @IvanToman I've updated the post with the mpirun binding behaviour found in the man page. Maybe there is a solution to your problem there. – Jaroslav Kucera Nov 02 '17 at 20:29
  • Thank you Jaroslav, I will investigate that. I need to do more research on the binding topic with mpirun. In the meantime I disabled THP and am now trying to see whether the server still slows down when I do not empty the cache with that configuration. It might be worth a shot, I guess. – Ivan Toman Nov 02 '17 at 20:37