
I am collecting numbers for monitoring HPC servers and am debating the policy for handing out memory (overcommit or not). I wanted to show users a number for how much virtual memory the processes on a machine (summed over the whole machine) requested vs. how much was actually used.

I thought I'd get the interesting values from /proc/meminfo using the fields MemTotal, MemAvailable, and Committed_AS. The latter is supposed to show how much memory the kernel has committed to providing, i.e. a worst-case number for how much memory would really be needed to satisfy the running tasks.
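
Roughly, this is the kind of extraction I have in mind for the monitoring (just a sketch; the real collector does more, and defining "used" as MemTotal minus MemAvailable is my own choice):

# Report "used" (MemTotal - MemAvailable) vs. "committed" (Committed_AS).
# /proc/meminfo reports kB; convert to GiB for readability.
awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} /^Committed_AS:/ {c=$2}
     END {printf "used: %.1f GiB  committed: %.1f GiB\n", (t-a)/1048576, c/1048576}' /proc/meminfo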

But Committed_AS is obviously too small. It is smaller than the currently used memory! Observe two example systems. One admin server:

# cat /proc/meminfo 
MemTotal:       16322624 kB
MemFree:          536520 kB
MemAvailable:   13853216 kB
Buffers:             156 kB
Cached:          9824132 kB
SwapCached:            0 kB
Active:          4854772 kB
Inactive:        5386896 kB
Active(anon):      33468 kB
Inactive(anon):   412616 kB
Active(file):    4821304 kB
Inactive(file):  4974280 kB
Unevictable:       10948 kB
Mlocked:           10948 kB
SwapTotal:      16777212 kB
SwapFree:       16777212 kB
Dirty:               884 kB
Writeback:             0 kB
AnonPages:        428460 kB
Mapped:            53236 kB
Shmem:             26336 kB
Slab:            4144888 kB
SReclaimable:    3863416 kB
SUnreclaim:       281472 kB
KernelStack:       12208 kB
PageTables:        38068 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    24938524 kB
Committed_AS:    1488188 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      317176 kB
VmallocChunk:   34358947836 kB
HardwareCorrupted:     0 kB
AnonHugePages:     90112 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      144924 kB
DirectMap2M:     4988928 kB
DirectMap1G:    13631488 kB

This is roughly 1.5G committed vs. 2.5G being in use without caches. A compute node:

ssh node390 cat /proc/meminfo
MemTotal:       264044768 kB
MemFree:        208603740 kB
MemAvailable:   215043512 kB
Buffers:           15500 kB
Cached:           756664 kB
SwapCached:            0 kB
Active:         44890644 kB
Inactive:         734820 kB
Active(anon):   44853608 kB
Inactive(anon):   645100 kB
Active(file):      37036 kB
Inactive(file):    89720 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      134216700 kB
SwapFree:       134216700 kB
Dirty:                 0 kB
Writeback:           140 kB
AnonPages:      44918876 kB
Mapped:            52664 kB
Shmem:            645408 kB
Slab:            7837028 kB
SReclaimable:    7147872 kB
SUnreclaim:       689156 kB
KernelStack:        8192 kB
PageTables:        91528 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    345452512 kB
Committed_AS:   46393904 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      797140 kB
VmallocChunk:   34224733184 kB
HardwareCorrupted:     0 kB
AnonHugePages:  41498624 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      312640 kB
DirectMap2M:     7966720 kB
DirectMap1G:    262144000 kB

This is around 47G used vs. 44G committed. The system in question is a CentOS 7 cluster:

uname -a
Linux adm1 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

On my Linux desktop using a vanilla kernel, I see more 'reasonable' numbers with 32G being committed compared to 15.5G being in use. On a Debian server I see 0.4G in use vs. 1.5G committed.

Can someone explain this to me? How do I get a correct number for the committed memory? Is this a bug in the CentOS/RHEL kernel that should be reported?

Update with more data and a comparison between systems

A listing of used/committed memory for various systems I could access, with a note about the kind of load:

  • SLES 11.4 (kernel 3.0.101-108.71-default)
    • 17.6G/17.4G, interactive multiuser HPC (e.g. MATLAB, GIS)
  • CentOS 7.4/7.5 (kernel 3.10.0-862.11.6.el7 or 3.10.0-862.14.4.el7)
    • 1.7G/1.3G, admin server, cluster mgmt, DHCP, TFTP, rsyslog, …
    • 8.6G/1.7G, SLURM batch system, 7.2G RSS for slurmdbd alone
    • 5.1G/0.6G, NFS server (400 clients)
    • 26.8G/32.6G, 16-core HPC node loaded with 328 (need to talk to the user) GNU R processes
    • 6.5G/8.1G, 16-core HPC node with 16 MPI processes
  • Ubuntu 16.04 (kernel 4.15.0-33-generic)
    • 1.3G/2.2G, 6-core HPC node, 6-threaded scientific application (1.1G RSS)
    • 19.9G/20.3G, 6-core HPC node, 6-threaded scientific application (19G RSS)
    • 1.0G/4.4G, 6-core login node with BeeGFS metadata/mgmt server
  • Ubuntu 14.04 (kernel 3.13.0-161-generic)
    • 0.7G/0.3G, HTTP server VM
  • Custom build (vanilla kernel 4.4.163)
    • 0.7G/0.04G, mostly idle Subversion server
  • Custom build (vanilla kernel 4.14.30)
    • 14.2G/31.4G, long-running desktop
  • Alpine (kernel 4.4.68-0-grsec)
    • 36.8M/16.4M, some (web) server
  • Ubuntu 12.04 (kernel 3.2.0-89-generic)
    • 1.0G/7.1G, some server
  • Ubuntu 16.04 (kernel 4.4.0-112-generic)
    • 0.9G/1.9G, some server
  • Debian 4.0 (kernel 2.6.18-6-686, 32 bit x86, obviously)
    • 1.0G/0.8G, some reliable server
  • Debian 9.5 (kernel 4.9.0-6)
    • 0.4G/1.5G, various web services, light load, obviously
  • Debian 9.6 (kernel 4.9.0-8-amd64)
    • 10.9G/17.7G, a desktop
  • Ubuntu 13.10 (kernel 3.11.0-26-generic)
    • 3.2G/5.4G, an old desktop
  • Ubuntu 18.04 (kernel 4.15.0-38-generic)
    • 6.4G/18.3G, a desktop

SUnreclaim is rather large for SLES and CentOS: 0.5G to 1G is not uncommon, more if caches are not flushed from time to time. But that is not enough to explain the missing memory in Committed_AS. The Ubuntu machines typically have below 100M SUnreclaim, except the 14.04 one, which has a small Committed_AS and 0.4G SUnreclaim. Ordering the kernels is tricky, as the 3.10 kernel from CentOS has many features of 4.x kernels backported. But there seems to be a line between 4.4 and 4.9 that affects the strangely low values of Committed_AS. The servers added by some of my peers suggest that Committed_AS also delivers strange numbers for older kernels. Was this broken and fixed multiple times?

Can people confirm this? Is this just buggy/very inaccurate kernel behaviour in determining the values in /proc/meminfo, or is there a bug(fix) history?

Some of the entries in the list are really strange. Having one slurmdbd process with an RSS of four times Committed_AS cannot be right. I am tempted to test a vanilla kernel on these systems with the same workload, but I cannot take the most interesting machines out of production for such games.

I guess the answer to my question is a pointer to the fix in the kernel commit history that enabled good estimates in Committed_AS again. Otherwise, please enlighten me;-)

Update about two processes having more RSS than Committed_AS

The batch server, which runs an instance of the Slurm database daemon slurmdbd along with slurmctld, is an illuminating example. It has been up for a long time and shows a stable picture, with those two processes dominating resource use.

# free -k; for p in $(pgrep slurmctld) $(pgrep slurmdbd) ; do cat /proc/$p/smaps|grep Rss| awk '{ print $2}'; done | (sum=0; while read n; do sum=$((sum+n)); done; echo $sum ); cat /proc/meminfo
              total        used        free      shared  buff/cache   available
Mem:       16321148     5873792      380624      304180    10066732     9958140
Swap:      16777212        1024    16776188
4703676
MemTotal:       16321148 kB
MemFree:          379708 kB
MemAvailable:    9957224 kB
Buffers:               0 kB
Cached:          8865800 kB
SwapCached:          184 kB
Active:          7725080 kB
Inactive:        6475796 kB
Active(anon):    4634460 kB
Inactive(anon):  1007132 kB
Active(file):    3090620 kB
Inactive(file):  5468664 kB
Unevictable:       10952 kB
Mlocked:           10952 kB
SwapTotal:      16777212 kB
SwapFree:       16776188 kB
Dirty:                 4 kB
Writeback:             0 kB
AnonPages:       5345868 kB
Mapped:            79092 kB
Shmem:            304180 kB
Slab:            1287396 kB
SReclaimable:    1200932 kB
SUnreclaim:        86464 kB
KernelStack:        5252 kB
PageTables:        19852 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    24937784 kB
Committed_AS:    1964548 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:     1814044 kB
DirectMap2M:    14854144 kB
DirectMap1G:     2097152 kB

Here you see the Rss of the two processes amounting to 4.5G (just slurmdbd is 3.2G). The Rss kind of matches the active anon pages, but Committed_AS is less than 2G. Counting the Rss of all processes via /proc comes quite close to AnonPages+Shmem (note: Pss is only about 150M smaller). I don't get how Committed_AS can be smaller than the Rss (summed Pss) of the active processes. Or, just in the context of meminfo:

How can Committed_AS (1964548 kB) be smaller than AnonPages (5345868 kB)? This is a fairly stable workload. These two processes are extremely long-lived and are about the only thing happening on this machine, with rather constant churn (batch jobs on other nodes being managed).
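
For reference, this is roughly how I sum the per-process numbers over all of /proc for the comparison above (a sketch; kernel threads have no smaps, hence the error suppression; replace Pss with Rss for the plain resident sum):

# Sum the proportional set size (Pss) of all processes, in kB.
cat /proc/[0-9]*/smaps 2>/dev/null |
    awk '/^Pss:/ {sum += $2} END {printf "total Pss: %d kB\n", sum}'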

drhpc
  • I recently edited the Question with some more data (in two steps). The picture leaves me somewhat confirmed in my belief that there is something wrong with the Committed_AS estimate in, well, some ranges of kernel versions. – drhpc Nov 30 '18 at 13:41

3 Answers


Those boxes are not under significant memory pressure, and no paging is happening (SwapFree is untouched). The second box is ~47 GB committed of ~250 GB total. 200 GB is a lot to play with.

In practice, keep increasing the size of the workload until one of these happens:

  • User (application) response time degrades
  • Page out rate is higher than you are comfortable with
  • OOM killer murders some processes

The relationships between the memory counters are unintuitive, vary greatly between workloads, and are probably only really understood by kernel developers. Don't worry about it too much; focus on measuring obvious memory pressure.
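
For example, something as simple as the following is usually enough to spot real pressure (a sketch, not a complete monitoring setup): sustained swap-in/swap-out activity, or MemAvailable heading towards zero.

# Watch paging activity (si/so columns), 5-second intervals, 3 samples.
vmstat -w 5 3
# The availability estimate the kernel itself computes.
grep MemAvailable /proc/meminfo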


Other descriptions of Committed_AS, from the linux-mm list a while ago, emphasize that it is an estimate:

Committed_AS: An estimate of how much RAM you would need to make a
              99.99% guarantee that there never is OOM (out of memory)
              for this workload. Normally the kernel will overcommit
              memory. That means, say you do a 1GB malloc, nothing
              happens, really. Only when you start USING that malloc
              memory you will get real memory on demand, and just as
              much as you use. So you sort of take a mortgage and hope
              the bank doesn't go bust. Other cases might include when
              you mmap a file that's shared only when you write to it
              and you get a private copy of that data. While it normally
              is shared between processes. The Committed_AS is a
              guesstimate of how much RAM/swap you would need
              worst-case.
John Mahowald
  • This is not about the machines being in trouble. This is for monitoring resource usage of scientific computing jobs occupying the whole node (hence not a single process to work with). The usage pattern differs a bit from a typical server. Since the memory is actually meant to be used by the user jobs, it can be normal that a job requests 300G or more and then starts filling that with data. I want to be able to tell the users that they allocated too much before OOM killing starts, or tell them that they might hit such a limit when scaling problem size. – drhpc Nov 29 '18 at 18:39
  • About Committed_AS being an estimate: I do not expect it to be exact. It can be off by quite some margin. But that it can be about half of the used non-reclaimable memory, so obviously wrong, led me to assume that there is something that I am missing. Any idea apart from reading the kernel source code and trying to get an answer on LKML? – drhpc Nov 29 '18 at 18:41
  • Workload dependent, please add specifics to your question about whether the dataset is a file, shared memory (many DBMS systems use shared memory buffers), private pages, if huge pages are in use... You can read or memory map a TB sized file on such a box, and being file backed isn't likely to use much memory. But if you slurp 1 TB into private pages it is going to OOM. – John Mahowald Nov 29 '18 at 18:57
  • That's my point: I want to know if e.g. mappings of big files are counted in Committed_AS (I think they should). See my update of the question regarding more examples of servers and some workload indication. – drhpc Nov 30 '18 at 12:00
  • Why would file mappings count against committed? Flush to disk and free the memory, similar to file system cache. You could parse /proc/*/*maps for the details, but a fair accounting of shared memory is not trivial. In practice, migrate users to smaller nodes until the "free" metric is near zero. – John Mahowald Dec 02 '18 at 23:32
  • I understand that MAP_PRIVATE mappings need a memory reservation and hence should count. I am contemplating counting /proc/*/maps, but as you said: the accounting is not trivial. Is it so outlandish to assume that the kernel should be able to give a total amount of committed memory in the system? And about migrating users: I did not make my use case clear enough. This is about HPC bare-metal nodes that have a fixed configuration and I simply want to be able to tell how much virtual and how much physical memory is used, for rough reporting/profiling. – drhpc Dec 03 '18 at 16:44
  • You already are looking at the right data, on Linux /proc/meminfo, but you are looking for patterns where there is a lot of noise and variables, and no real problem. Collect over time and plot, like any decent monitoring tool does (netdata, for example), watch for very low MemAvailable. Maybe give the user with the 1 TB dataset a quarter-TB box. If that's not enough, oh well upgrade to the half-TB node and update your capacity planning models. – John Mahowald Dec 03 '18 at 21:29
  • I fear we're talking past each other, obviously from differing backgrounds. Disregarding any operational considerations for server farms or HPC clusters, I would like to focus on my main point: Committed_AS is supposed to be an estimated upper bound of memory use given current allocations, however unrealistic it is that it will be reached. I observe values that are actually lower than current memory use. This simply looks wrong and I would like to know if this is to be expected, if there is a certain kernel config option that influences this, etc. – drhpc Dec 04 '18 at 12:53

Here's another answer purely about Committed_AS being lower than "expected":

The interesting lines from your /proc/meminfo are as follows:

Active:          4854772 kB
Inactive:        5386896 kB
Active(anon):      33468 kB
Inactive(anon):   412616 kB
Active(file):    4821304 kB
Inactive(file):  4974280 kB
Mlocked:           10948 kB
AnonPages:        428460 kB
Shmem:             26336 kB
Committed_AS:    1488188 kB

(Active and Inactive are just the sums of the (anon) and (file) details below them, and AnonPages roughly matches the sum of the (anon) lines; I only included those lines to make this easier to follow.)

As Active(file) is file backed, it doesn't raise Committed_AS, so practically the only things that actually raise your Committed_AS value are AnonPages + Shmem + Mlocked plus spikes in memory usage. Committed_AS is the amount of memory (RAM+swap combined) that the system must be able to provide to the currently running processes even if all caches and buffers are flushed to disk.
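
As a quick cross-check of that claim (a sketch; it ignores the transient allocation spikes mentioned above):

# Compare the sum of the components named above against Committed_AS (kB).
awk '/^AnonPages:/ {a=$2} /^Shmem:/ {s=$2} /^Mlocked:/ {m=$2} /^Committed_AS:/ {c=$2}
     END {printf "AnonPages+Shmem+Mlocked = %d kB, Committed_AS = %d kB\n", a+s+m, c}' /proc/meminfo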

If a process does malloc() (which is usually implemented as sbrk() or brk() behind the scenes), the kernel will increase Committed_AS, but it will not show in the other numbers because the kernel doesn't actually reserve any real RAM until the memory is actually used by the process. (Technically the kernel has set aside a virtual address space range for the process, but the virtual memory mapping for the CPU points to a zero-filled page with a flag saying that if the process tries to write anything, actual memory must be allocated on the fly. This allows the process to read zeros from the virtual address space without faulting, but writing data to the virtually allocated memory area is the action that actually allocates the memory for real.) It's very common for programs to allocate more (virtual) memory than they actually use, so this is a good feature to have, but it obviously makes memory statistics harder to understand. It seems that your system is mostly running processes that do not acquire a lot of memory they don't actually use, because your Committed_AS is pretty low compared to the other values.
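
You can see the same effect per process: VmSize (the address space a process has asked for) is usually much larger than VmRSS (what it has actually touched). A sketch that lists the largest address-space consumers:

# Compare requested address space (VmSize) with resident memory (VmRSS)
# for every process; kernel threads have no Vm* lines and are skipped.
for s in /proc/[0-9]*/status; do
    awk '/^Name:/ {n=$2} /^VmSize:/ {v=$2} /^VmRSS:/ {r=$2}
         END {if (v) printf "%-20s VmSize %9d kB  VmRSS %9d kB\n", n, v, r}' "$s" 2>/dev/null
done | sort -k3,3nr | head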

For example, my system is currently running like this:

MemTotal:       32570748 kB
Active:         12571828 kB
AnonPages:       7689584 kB
Mlocked:           19788 kB
Shmem:           4481940 kB
Committed_AS:   44949856 kB

Note the huge Committed_AS (~45 GB) on my system even though anonymous pages, locked memory and Shmem together total only about 12 GB. As I'm running a desktop environment on this system, I would assume that I have lots of processes that have executed fork() after acquiring/using lots of RAM. In that case the forked process can in theory modify all that memory without doing any explicit memory allocations, and all this forked memory is counted towards the Committed_AS value. As a result, Committed_AS may not reflect your real system memory usage at all.

TL;DR: Committed_AS is an estimate of the allocated virtual memory that is not backed by any filesystem, i.e. the maximum amount of memory that would in theory have to be backed by real storage (RAM+swap) to keep the currently running processes running if nothing in the whole system allocates more memory.

However, if the system is communicating with the outside world, even incoming IP packets can cause more memory to be used, so you cannot make any guarantees about future system behavior based on this number. Also note that stack memory is always allocated on the fly, so even if none of your processes fork() or make explicit memory allocations, your memory usage (Committed_AS) may still increase when processes use more stack space.

In my experience Committed_AS is only really meaningful to compare to previous runs with similar workloads. However, if Committed_AS is less than your MemTotal you can be pretty sure that the system has very light memory pressure compared to your available hardware.

Mikko Rantalainen
  • I'll have a closer look at the file-backed memory use. This may be it. About your explanations about malloc() vs. actual use: I am very much aware of that. My point, even back when we had overcommit disabled, was to be able to tell what amount of memory is actually available for an HPC job that will be the only non-system thing running on the machine (again: _not_ about usual server workloads). I want to diagnose and tell the user: You actually used 28% of memory explicitly, but your program(s) allocated 80%. Cgroup statistics may be the solution, but not simple system-wide meminfo. – drhpc Sep 24 '21 at 10:48
  • So, in my first example, you are saying that the actual memory use is Mlocked + AnonPages + Shmem = 0.5G, Committed_AS is 1.5G, right? So that is a normal picture that I wouldn't question. 1.5G allocated, but 33% of that actually used right now. So is the answer actually that Committed_AS is good, but MemTotal-MemAvailable is just wildly wrong as an estimate of what's being used? MemAvailable seems to err on the side of caution, not to promise more than what is possibly there (said spikes perhaps)? – drhpc Sep 24 '21 at 11:11
  • I already noted in my monitoring scripts that MemAvailable does not count things like SReclaimable as being available. I'm revisiting this topic now and maybe will update the question with things I learned in between. I do not really get the point of your answer, though. Is it that you say that counting the file-backed pages causes me to think Committed_AS is too low? Most of your answer seems to explain why Committed_AS should be _higher_ than expected from actual use, not lower. – drhpc Sep 24 '21 at 11:28
  • I added another detailed example to the question. Can you explain what is going on with the used memory clearly being bigger than Committed_AS there? The situation is very stable and I really would like to understand this. – drhpc Sep 24 '21 at 20:51
  • You should avoid morphing the existing question into another. If you feel that you now understand the issue better so that you can formulate a better question, just create a new question and link to it in a comment or at the bottom of this question. That said, try summing the `Pss` field instead of `Rss` to get actual memory usage over multiple processes. – Mikko Rantalainen Sep 25 '21 at 12:53
  • I admit that it's looking convoluted, also with lots of discussion beside the point. But my initial question is still unchanged: Is Committed_AS the proper value to use for the sum of memory allocations when it can clearly be lower than measures of really utilized memory? I still don't see how that would be logically possible if it is the measure that I (we?) assume it to be. – drhpc Sep 25 '21 at 15:20
  • Do you think it would be beneficial to pull out the latest update about the batch server and put it in a new question by itself? I can do that. An answer to that one would still be an answer to the initial question, though. – drhpc Sep 25 '21 at 15:23
  • *Why* do you think that `Committed_AS` is lower than "really utilized memory"? I guess that's the whole point of this whole discussion. – Mikko Rantalainen Sep 25 '21 at 16:23
  • Yes! It is the whole point. Maybe we can finally settle it;-) You say AnonPages + Shmem + Mlocked go all into Committed_AS, the allocated-but-not-used memory should go on top and further increase Committed_AS to something potentially (and often) much larger. Now my batch server has Committed_AS of 1964548 kB and AnonPages of 5345868 kB alone. How can a part be more than twice the size of the sum? – drhpc Sep 26 '21 at 17:36
  • As I wrote earlier, you cannot sum `Rss` fields in `smaps` of multiple processes because `Rss` pages can be shared via the copy-on-write (COW) mechanism. The `Pss` field should be used instead if you're computing a sum over multiple processes. – Mikko Rantalainen Sep 27 '21 at 09:28
  • According to https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/s2-proc-meminfo "Active(anon) — The amount of anonymous and tmpfs/shmem memory, in kibibytes" so at least RedHat includes all of `tmpfs` into `Active(anon)` even if that's actually disk backed and doesn't raise `Committed_AS`. In my experience `tmpfs` and `Shmem` usage are hard to track in general. – Mikko Rantalainen Sep 27 '21 at 09:37
  • That said, I personally think that `tmpfs` should be included in `Committed_AS` because all `tmpfs` usage needs to be backed by either RAM or swap. On a quick test it appears that `tmpfs` usage doesn't directly show in `Committed_AS`. – Mikko Rantalainen Sep 27 '21 at 09:56
  • In my example, Pss is only slightly smaller. But in general, you are right and thanks for pointing it out. Why do you think tmpfs may be disk-backed? It's counted in Cached, but as an artifact, imho. It always occupies memory (RAM or swap). For my example case of the batch server, Shmem (tmpfs) is only 300M. It just doesn't add up that Committed_AS is so much smaller than Active(anon), does it? – drhpc Sep 28 '21 at 10:24
  • (Btw.: With Linux 4.14.246, I just wrote 1G of zeros to /dev/shm and that got counted in Committed_AS right away.) – drhpc Sep 28 '21 at 10:29
  • If your kernel appears to include both `tmpfs` and `Shmem` usage in `Active(anon)`, then it doesn't make sense to have `Committed_AS` lower than `Active(anon)`. Kernel bugs are totally possible; I had a problem about a month ago where `Shmem` usage was around 10 GB even though I couldn't see any reason for it. Rebooting the system fixed the issue and the problem hasn't appeared again, so I guess it was some kind of kernel bug, too. – Mikko Rantalainen Sep 29 '21 at 09:09

In my experience, Committed_AS has been more accurate than MemAvailable. Especially with a highly spiky workload, MemAvailable seems to be more like some kind of average instead of a true value over short time periods.

That said, I don't remember using data from Committed_AS with kernels older than version 4.15 so I don't know if historical behavior was different.

Both Committed_AS and MemAvailable are officially kernel level heuristics so neither should be trusted as true fact.

For the workloads I usually run, I typically start to experience performance problems when Committed_AS exceeds about 150% of the real amount of RAM. However, that obviously depends highly on your workload. If you have lots of leaky processes and enough swap, your Committed_AS may keep climbing without performance issues as processes leak RAM and the kernel moves the leaked areas to swap. Note that in such cases Committed_AS could end up much higher than total RAM + swap without any problems.
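
The ratio itself is trivial to compute (a sketch; the 150% figure above is of course specific to my workloads):

# Committed_AS as a percentage of physical RAM.
awk '/^MemTotal:/ {t=$2} /^Committed_AS:/ {c=$2}
     END {printf "Committed_AS = %.0f%% of MemTotal\n", 100*c/t}' /proc/meminfo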

I wouldn't disable memory overcommit unless you're running a hard realtime system. And such a system probably shouldn't use any swap either. I'm personally always running with /proc/sys/vm/overcommit_memory set to 1.

If you can provide enough swap, it usually makes sense to increase /proc/sys/vm/watermark_scale_factor and /proc/sys/vm/watermark_boost_factor to avoid latency caused by swapping. However, it's important to understand that Committed_AS is the currently committed memory (memory requested by user mode processes but usually not fully used), and having RAM+swap cover that only handles the case where no process allocates any new memory. Unless you're running some very exotic system, multiple processes are constantly allocating new memory, so you shouldn't make overly strict estimates about the future behavior of the system. And if your workload is highly spiky, the current numbers tell very little about the future behavior of the system.
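
For reference, these are the knobs I mean; the values below are purely illustrative, and vm.watermark_boost_factor only exists on newer kernels (roughly 5.0 and later):

# Always allow overcommit (0 = heuristic, 1 = always, 2 = strict accounting).
sysctl vm.overcommit_memory=1
# Start background reclaim earlier (default 10, i.e. 0.1% of RAM).
sysctl vm.watermark_scale_factor=200
# Temporarily boost reclaim after external fragmentation events (default 15000).
sysctl vm.watermark_boost_factor=30000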

With modern systems, I'd focus on statistics that can capture the highest short-term peaks in memory usage. I'd guess that a well made statistics program would monitor kernel events via /sys/fs/cgroup/memory/cgroup.event_control and collect statistics at the moment of the highest memory pressure. I don't know of any statistics application that actually supports that, though. Any statistics app that only collects data at wall-clock-defined sample intervals is going to miss the majority of the short-term spikes in RAM usage. For mathematically correct sample averages the wall-clock sample period is a requirement, but understanding the spikes is more important than having accurate averages, because it's those spikes that kill your processes/performance, not the averages.
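
I don't know of a ready-made tool either; as a sketch of the direction, the kernel does record a per-cgroup peak that wall-clock sampling would miss ($JOB_CGROUP below is a placeholder for the job's cgroup name):

# cgroup v1: highest memory usage ever recorded for this cgroup, in bytes.
# $JOB_CGROUP is a placeholder for the job's cgroup name.
cat "/sys/fs/cgroup/memory/$JOB_CGROUP/memory.max_usage_in_bytes"
# cgroup v2 (recent kernels): the equivalent counter is memory.peak.
cat "/sys/fs/cgroup/$JOB_CGROUP/memory.peak"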

Mikko Rantalainen
  • Thanks for your time … and, well, time has passed and things with newer kernels could be different. We run the jobs inside cgroups via the Slurm scheduler now, and will sometime beat that one into giving us proper access to the cgroups for reporting memory usage (it's deleting the cgroups before our custom reporting can access them, probably will have to either patch Slurm or add a plugin that stores the cgroup stats). But apart from that, the original question still stands about how bad Committed_AS should be as an estimate. I mean, a lot _smaller_ than the actual current usage? – drhpc Sep 13 '21 at 14:10
  • I think that Committed_AS can be used as a heuristic, but it seems to update with some delay, and short memory usage spikes may be missed even if you constantly poll the value. When memory load is high for long enough, it has appeared pretty accurate in my experience. – Mikko Rantalainen Sep 14 '21 at 19:06
  • Sadly, not in my experience … see the examples I posted. Nothing spiky about the memory usage. I consistently saw clearly too low Committed_AS on differing systems also with fairly constant load. I don't see much interest (on the respective kernel list, even) in clearing up that picture. Hoping for proper stats from cgroups. – drhpc Sep 22 '21 at 15:29
  • How do you know you don't have memory usage spikes? Even if you sample `/proc/meminfo` every second, the spikes may last less than a second, and any such spike could even cause OOM. – Mikko Rantalainen Sep 23 '21 at 09:26
  • Of course I don't really know that I don't have usage spikes. I just know that I checked various systems that either are mostly idle or have a rather constant load in HPC jobs. My main point is, though: How can the spikes in actual usage be higher than those in allocated/promised memory? The occupied memory should be the slower variable, as it really takes time touching 10G of RAM. Even if it's heuristics, it is odd to produce them in an obviously inconsistent manner. I got a bucket of 10 litres and fill it with 16 litres of water. Rough estimate … – drhpc Sep 24 '21 at 10:39
  • The max size of the memory usage spikes totally depends on your workload. A misbehaving process can easily allocate 10 GB/s, and if that process then crashes or gets killed by the kernel within a second, you'll see spikes exceeding 10 GB above the normal memory usage of your system – and if your statistics app samples the system e.g. once per 5 seconds, it will miss such spikes. Such a misbehaving process can be e.g. a PostgreSQL 11.x child with a *misestimation* in the query planner causing very high memory usage. See https://dba.stackexchange.com/a/285423/29183 for an example. – Mikko Rantalainen Sep 25 '21 at 11:59
  • A process that allocates a lot of memory and then dies should cause a spike in Committed_AS, but not in actually used memory, as filling up that memory takes time. So anything about such spikes would only explain very _high_ values for Committed_AS, in case the sampling catches such a spike. But I have much too _low_ values. That was the point of the question from the beginning. – drhpc Sep 25 '21 at 15:19
  • If the process is already dead by the time you poll `meminfo` you'll not see the memory spike in `Committed_AS` either. You have to use memory cgroup to collect accurate peak statistics. – Mikko Rantalainen Sep 25 '21 at 16:22
  • True. Per-process or process-group statistics you can get elsewhere (per process, btw., also accurately via taskstats at process exit). When a process dies, its resident memory is cleaned up at the same instant as its vmsize, isn't it? Anyhow, spikes in usage can be ruled out in my most recent example, with two major users whose lifetimes are measured in months. – drhpc Sep 26 '21 at 17:31