I am collecting numbers for monitoring HPC servers and am debating the policy for handing out memory (overcommit or not). I wanted to show users a number for how much virtual memory the processes on a whole machine have requested versus how much is actually in use.
I thought I'd get the interesting values from /proc/meminfo, using the fields MemTotal, MemAvailable, and Committed_AS. The latter is supposed to show how much memory the kernel has committed to handing out, a worst-case number for how much memory would really be needed to satisfy the running tasks.
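As a sketch of the number I have in mind (the choice of ratio is mine, and 'used' here is simply MemTotal minus MemAvailable):
awk '/^(MemTotal|MemAvailable|Committed_AS):/ { v[$1] = $2 }
     END {
         used = v["MemTotal:"] - v["MemAvailable:"]   # in use, ignoring reclaimable caches
         printf "used %d kB, committed %d kB, ratio %.2f\n", used, v["Committed_AS:"], v["Committed_AS:"] / used
     }' /proc/meminfo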
But Committed_AS is obviously too small: it is smaller than the memory currently in use! Consider two example systems. First, an admin server:
# cat /proc/meminfo
MemTotal: 16322624 kB
MemFree: 536520 kB
MemAvailable: 13853216 kB
Buffers: 156 kB
Cached: 9824132 kB
SwapCached: 0 kB
Active: 4854772 kB
Inactive: 5386896 kB
Active(anon): 33468 kB
Inactive(anon): 412616 kB
Active(file): 4821304 kB
Inactive(file): 4974280 kB
Unevictable: 10948 kB
Mlocked: 10948 kB
SwapTotal: 16777212 kB
SwapFree: 16777212 kB
Dirty: 884 kB
Writeback: 0 kB
AnonPages: 428460 kB
Mapped: 53236 kB
Shmem: 26336 kB
Slab: 4144888 kB
SReclaimable: 3863416 kB
SUnreclaim: 281472 kB
KernelStack: 12208 kB
PageTables: 38068 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 24938524 kB
Committed_AS: 1488188 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 317176 kB
VmallocChunk: 34358947836 kB
HardwareCorrupted: 0 kB
AnonHugePages: 90112 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 144924 kB
DirectMap2M: 4988928 kB
DirectMap1G: 13631488 kB
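To put the dump above into two numbers (assuming 'in use without caches' means MemTotal minus MemAvailable):
echo $(( (16322624 - 13853216) / 1024 ))   # ~2411 MiB actually in use
echo $(( 1488188 / 1024 ))                 # ~1453 MiB Committed_AS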
This is roughly 1.5G committed vs. 2.5G being in use without caches. A compute node:
ssh node390 cat /proc/meminfo
MemTotal: 264044768 kB
MemFree: 208603740 kB
MemAvailable: 215043512 kB
Buffers: 15500 kB
Cached: 756664 kB
SwapCached: 0 kB
Active: 44890644 kB
Inactive: 734820 kB
Active(anon): 44853608 kB
Inactive(anon): 645100 kB
Active(file): 37036 kB
Inactive(file): 89720 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 134216700 kB
SwapFree: 134216700 kB
Dirty: 0 kB
Writeback: 140 kB
AnonPages: 44918876 kB
Mapped: 52664 kB
Shmem: 645408 kB
Slab: 7837028 kB
SReclaimable: 7147872 kB
SUnreclaim: 689156 kB
KernelStack: 8192 kB
PageTables: 91528 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 345452512 kB
Committed_AS: 46393904 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 797140 kB
VmallocChunk: 34224733184 kB
HardwareCorrupted: 0 kB
AnonHugePages: 41498624 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 312640 kB
DirectMap2M: 7966720 kB
DirectMap1G: 262144000 kB
This is around 47G used vs. 44G committed. The system in question is a CentOS 7 cluster:
uname -a
Linux adm1 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
On my Linux desktop running a vanilla kernel I see more 'reasonable' numbers, with 32G committed compared to 15.5G in use. On a Debian server I see 0.4G in use vs. 1.5G committed.
Can someone explain this to me? How do I get a correct number for the committed memory? Is this a bug in the CentOS/RHEL kernel that should be reported?
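For the overcommit-policy side of this, the relevant sysctl knobs are the standard ones (nothing specific to my setup), and with the default vm.overcommit_ratio of 50 the CommitLimit shown above works out exactly:
sysctl vm.overcommit_memory vm.overcommit_ratio
# CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100 (with no huge pages reserved)
echo $(( 16777212 + 16322624 * 50 / 100 ))   # -> 24938524 kB, matching the admin server's CommitLimit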
Update with more data and a comparison between systems
A listing of used/committed memory for various systems I could access, with a note about the kind of load:
- SLES 11.4 (kernel 3.0.101-108.71-default)
  - 17.6G/17.4G, interactive multiuser HPC (e.g. MATLAB, GIS)
- CentOS 7.4/7.5 (kernel 3.10.0-862.11.6.el7 or 3.10.0-862.14.4.el7)
  - 1.7G/1.3G, admin server, cluster mgmt, DHCP, TFTP, rsyslog, …
  - 8.6G/1.7G, SLURM batch system, 7.2G RSS for slurmdbd alone
  - 5.1G/0.6G, NFS server (400 clients)
  - 26.8G/32.6G, 16-core HPC node loaded with 328 (need to talk to the user) GNU R processes
  - 6.5G/8.1G, 16-core HPC node with 16 MPI processes
- Ubuntu 16.04 (kernel 4.15.0-33-generic)
  - 1.3G/2.2G, 6-core HPC node, 6-threaded scientific application (1.1G RSS)
  - 19.9G/20.3G, 6-core HPC node, 6-threaded scientific application (19G RSS)
  - 1.0G/4.4G, 6-core login node with BeeGFS metadata/mgmt server
- Ubuntu 14.04 (kernel 3.13.0-161-generic)
  - 0.7G/0.3G, HTTP server VM
- Custom build (vanilla kernel 4.4.163)
  - 0.7G/0.04G, mostly idle Subversion server
- Custom build (vanilla kernel 4.14.30)
  - 14.2G/31.4G, long-running desktop
- Alpine (kernel 4.4.68-0-grsec)
  - 36.8M/16.4M, some (web) server
- Ubuntu 12.04 (kernel 3.2.0-89-generic)
  - 1.0G/7.1G, some server
- Ubuntu 16.04 (kernel 4.4.0-112-generic)
  - 0.9G/1.9G, some server
- Debian 4.0 (kernel 2.6.18-6-686, 32 bit x86, obviously)
  - 1.0G/0.8G, some reliable server
- Debian 9.5 (kernel 4.9.0-6)
  - 0.4G/1.5G, various web services, light load, obviously
- Debian 9.6 (kernel 4.9.0-8-amd64)
  - 10.9G/17.7G, a desktop
- Ubuntu 13.10 (kernel 3.11.0-26-generic)
  - 3.2G/5.4G, an old desktop
- Ubuntu 18.04 (kernel 4.15.0-38-generic)
  - 6.4G/18.3G, a desktop
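For reference, a sketch of how such a per-host listing can be gathered (hostnames are placeholders, and this is not necessarily the exact script behind the table above; 'used' is again MemTotal minus MemAvailable, with a crude fallback for kernels older than 3.14 that lack MemAvailable):
for h in node01 node02 adm1; do   # placeholder host list
  echo -n "$h: "
  ssh "$h" cat /proc/meminfo | awk '
    { v[$1] = $2 }
    END {
      avail = ("MemAvailable:" in v) ? v["MemAvailable:"] : v["MemFree:"] + v["Buffers:"] + v["Cached:"]
      printf "used %.1fG  committed %.1fG  SUnreclaim %.1fM\n",
             (v["MemTotal:"] - avail) / 1048576, v["Committed_AS:"] / 1048576, v["SUnreclaim:"] / 1024
    }'
done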
SUnreclaim is rather large for SLES and CentOS: 0.5G to 1G is not uncommon, and more if caches are not flushed from time to time. But that is not enough to explain the missing memory in Committed_AS. The Ubuntu machines typically have below 100M SUnreclaim, except the 14.04 one, which has a small Committed_AS and 0.4G SUnreclaim. Putting the kernels in order is tricky, as the 3.10 kernel from CentOS has many features of 4.x kernels backported. But there seems to be a line between 4.4 and 4.9 that affects the strangely low values of Committed_AS. The servers added by some of my peers suggest that Committed_AS also delivers strange numbers for older kernels. Was this broken and fixed multiple times?
Can people confirm this? Is this just buggy/very inaccurate kernel behaviour in determining the values in /proc/meminfo, or is there a bug(fix) history?
Some of the entries in the list are really strange. Having a single slurmdbd process with an RSS four times the size of Committed_AS cannot be right. I am tempted to test a vanilla kernel on these systems with the same workload, but I cannot take the most interesting machines out of production for such games.
I guess the answer to my question is a pointer to the fix in the kernel commit history that enabled good estimates in Committed_AS again. Otherwise, please enlighten me ;-)
Update about two processes having more RSS than Committed_AS
The batch server, which runs an instance of the Slurm database daemon slurmdbd along with slurmctld, is an illuminating example. It has been up for a very long time and shows a stable picture, with those two processes dominating resource use.
# free -k; for p in $(pgrep slurmctld) $(pgrep slurmdbd) ; do cat /proc/$p/smaps|grep Rss| awk '{ print $2}'; done | (sum=0; while read n; do sum=$((sum+n)); done; echo $sum ); cat /proc/meminfo
total used free shared buff/cache available
Mem: 16321148 5873792 380624 304180 10066732 9958140
Swap: 16777212 1024 16776188
4703676
MemTotal: 16321148 kB
MemFree: 379708 kB
MemAvailable: 9957224 kB
Buffers: 0 kB
Cached: 8865800 kB
SwapCached: 184 kB
Active: 7725080 kB
Inactive: 6475796 kB
Active(anon): 4634460 kB
Inactive(anon): 1007132 kB
Active(file): 3090620 kB
Inactive(file): 5468664 kB
Unevictable: 10952 kB
Mlocked: 10952 kB
SwapTotal: 16777212 kB
SwapFree: 16776188 kB
Dirty: 4 kB
Writeback: 0 kB
AnonPages: 5345868 kB
Mapped: 79092 kB
Shmem: 304180 kB
Slab: 1287396 kB
SReclaimable: 1200932 kB
SUnreclaim: 86464 kB
KernelStack: 5252 kB
PageTables: 19852 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 24937784 kB
Committed_AS: 1964548 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 1814044 kB
DirectMap2M: 14854144 kB
DirectMap1G: 2097152 kB
Here you see the Rss of the two processes amounting to 4.5G (slurmdbd alone is 3.2G). The Rss roughly matches the active anonymous pages, but Committed_AS is less than 2G. Counting the Rss of all processes via /proc comes quite close to AnonPages+Shmem (note: the summed Pss is only about 150M smaller); a sketch of that whole-system summation is at the end of this question. I don't get how Committed_AS can be smaller than the Rss (or summed Pss) of the active processes. Or, just in the context of meminfo:
How can Committed_AS (1964548 kB) be smaller than AnonPages (5345868 kB)? This is a fairly stable workload: these two processes are extremely long-lived and are about the only thing happening on this machine, with rather constant churn (batch jobs on other nodes being managed).
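The whole-system Rss/Pss summation mentioned above is something like this sketch (run as root so every process's smaps can be read):
for f in /proc/[0-9]*/smaps; do cat "$f" 2>/dev/null; done | awk '
  /^Rss:/ { rss += $2 }
  /^Pss:/ { pss += $2 }
  END { printf "Rss sum: %d kB   Pss sum: %d kB\n", rss, pss }'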