What is the difference between h_rss and h_vmem in Sun Grid Engine (SGE)?

Question

So far as I understood,

mem_free can be specified so that a job is only submitted to a host that has at least mem_free of memory free, whereas
h_vmem is the hard limit on the memory a job can consume, and if the job reaches h_vmem, it crashes? I think we can set the h_vmem of a host close to the total physical memory, so that jobs won't start using swap and slow the server down.

Then what is h_rss? It seems to have the same definition as h_vmem.

Or am I misinterpreting h_vmem? Is h_vmem used to reserve the extra memory a job might need beyond the minimum it requires (mem_free), without crashing the job if it goes over, so that the job can exceed h_vmem?

If my second interpretation of h_vmem is correct, then I guess, for a job to be submitted in a host, the job has to satisfy both mem_free and h_vmem (given h_vmem is not INFINITY).

And if my first interpretation of h_vmem is correct, then I guess, for a job to be submitted in a host, the job can satisfy mem_free alone and no need to satisfy h_vmem, as it only reserves the space available and if there is no space available, it doesn't matter?

score 4 · Answer 1 · answered Oct 02 '16 at 19:00

OK, I found the answer to this by checking /proc/<pid>/limits for the running job process on the execution server.

When I submit a job with h_rss=10G, the value of Max Resident Set in the limits is set to 10737418240 bytes (i.e., 10G). (The OS default is unlimited.) So the process cannot take memory beyond this. Also, h_rss is not consumable.
Whereas when I submit a job with h_vmem=50G, the value of Max Resident Set in the limits is unlimited. So it can continue beyond 50G. However, h_vmem is consumable, and hence the h_vmem of the host is reduced by 50G.

This can be found out by running the following commands:
- qhost -h <hostname> -F h_vmem, where h_vmem shows the current h_vmem value and
- qconf -se <hostname>, where h_vmem in complex_values shows the allocated h_vmem value.
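
For anyone who wants to repeat the /proc/<pid>/limits check described above, here is a minimal Python sketch. It assumes a Linux execution host; the function names `parse_limits` and `memory_limits` are made up for illustration and are not part of SGE.

```python
import re


def parse_limits(text):
    """Return {limit name: (soft, hard)} from the text of /proc/<pid>/limits."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def memory_limits(pid="self"):
    """Read the two memory-related limits for a process (Linux only)."""
    with open(f"/proc/{pid}/limits") as fh:
        limits = parse_limits(fh.read())
    wanted = ("Max resident set", "Max address space")
    return {name: limits.get(name) for name in wanted}
```

Calling `memory_limits(<pid of the job>)` on the execution host returns the soft and hard values as strings, either a byte count or `unlimited`.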

score 4 · Accepted Answer · answered Dec 11 '18 at 20:52

Whether a resource is consumable or not, and how much can be reserved on a system, is configurable. You can use one of the existing values or you can create a new one, up to you.

While there's no harm in setting it anyway, mem_free is not consumable by default. That means that while there must be that amount of memory available on the system when your job starts, 10 jobs each requiring 10GB of free memory can all start at the same time on a server with 11GB of free memory. If all of them actually use 10GB, you'll be in trouble.
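
To make the 10-jobs-on-11GB scenario concrete, here is a toy model in Python. It is not how the SGE scheduler is implemented; it only assumes that all jobs are dispatched before any of them has actually allocated memory, and `admit_jobs` is a hypothetical name.

```python
def admit_jobs(free_gb, requests_gb, consumable):
    """Return how many jobs a (very simplified) scheduler would start on one host."""
    started = 0
    available = free_gb
    for request in requests_gb:
        if request <= available:
            started += 1
            if consumable:
                # A consumable resource is debited for every job that starts.
                available -= request
            # A non-consumable value is only a snapshot, so nothing is debited here.
    return started


print(admit_jobs(11, [10] * 10, consumable=False))  # 10
print(admit_jobs(11, [10] * 10, consumable=True))   # 1
```

With the non-consumable check every job passes, which is the over-subscription described above. Once the resource is consumable, only one job fits.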

The differences between the others come down to enforcement. rss (physical memory usage) isn't enforced. vmem (virtual memory usage) is. Unfortunately Linux doesn't offer good ways to enforce physical memory usage (cgroups are OK, but the rss ulimit doesn't actually do anything in modern kernels).

On the other hand, it's very important to recognize that there is NO correct way to treat vmem as a consumable resource. If you compile "hello world" in C with the -fsanitize=address debugging option (available in clang or gcc5+), it'll use 20TB of virtual memory, but less than 5MB of physical memory. Garbage collected runtimes like Java and Go will also allocate significant quantities of vmem that never get reflected as physical memory, in order to reduce memory fragmentation. Every Chrome tab on my 8GB laptop uses 2TB of virtual memory as part of its security sandboxing. These are all totally reasonable things for programs to do, and setting a lower limit prevents perfectly well-behaved programs from working. Just as obviously, setting a consumable limit of 20TB of vmem on a system is pointless.
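
The gap between virtual and physical memory is easy to see for yourself. The sketch below is Linux-only and assumes /proc is mounted; `vm_sizes_kb` and `demo` are illustrative names. It maps 1 GiB of anonymous memory without touching it, then reads VmSize and VmRSS from /proc/self/status.

```python
import mmap


def vm_sizes_kb(status_text):
    """Extract VmSize and VmRSS (in kB) from the text of /proc/<pid>/status."""
    sizes = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            sizes[key] = int(value.split()[0])
    return sizes


def demo(size_bytes=1 << 30):
    """Map size_bytes of address space, touch none of it, and report usage."""
    region = mmap.mmap(-1, size_bytes)  # anonymous mapping, no pages written
    try:
        with open("/proc/self/status") as fh:
            return vm_sizes_kb(fh.read())
    finally:
        region.close()
```

On a typical Linux box `demo()` should report a VmSize at least 1 GiB larger than VmRSS, since untouched pages count toward address space but not toward resident memory. A vmem limit would charge the process for that mapping even though it costs almost no RAM.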

If you must use h_vmem for whatever reason, the difference between the h_ and s_ variants is which signal is used to kill processes which exceed the limit - h_ kills processes with SIGKILL (e.g. kill -9), whereas s_ uses a signal which a process can handle (allowing a well-behaved job to shut down cleanly, or a poorly-behaved one to ignore the signal). Best advice there is to first cry because vmem restrictions are inherently broken, and then set h_vmem to slightly higher than s_vmem so jobs have the opportunity to die with a useful error message.

My advice would be to have the cluster admin configure h_rss to be consumable, set both h_rss and mem_free in your job template, avoid h_vmem altogether, and hope that people don't abuse the system by under-reserving memory. If an enforcement mechanism is required, it's complicated but one can set up the job manager to put jobs in memory cgroups and set either memory.limit_in_bytes or memory.soft_limit_in_bytes. The latter allows a cgroup to exceed its reservation so long as the system isn't running out of memory. This improves the kernel's ability to cache files on behalf of those processes, improving performance for everyone, but there is a risk in that when the system does run out of memory there are circumstances under which the OOM killer doesn't have time to look around for a process to kill from an over-limit cgroup, and instead the attempted allocation will fail.
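
As a rough idea of what the cgroup route involves, the sketch below writes the two control files named above. It assumes a cgroup v1 memory hierarchy and root privileges; the directory layout (something like /sys/fs/cgroup/memory/<job group>) and the function names are hypothetical, and on cgroup v2 the control files have different names.

```python
from pathlib import Path

GIB = 1024 ** 3


def set_memory_limits(cgroup_dir, hard_bytes=None, soft_bytes=None):
    """Write cgroup v1 memory limits into an existing cgroup directory."""
    cgroup = Path(cgroup_dir)
    if hard_bytes is not None:
        # Hard cap: the cgroup cannot grow beyond this.
        (cgroup / "memory.limit_in_bytes").write_text(str(hard_bytes))
    if soft_bytes is not None:
        # Soft cap: may be exceeded while the system still has memory to spare.
        (cgroup / "memory.soft_limit_in_bytes").write_text(str(soft_bytes))


def add_process(cgroup_dir, pid):
    """Move a process into the cgroup by writing its PID to cgroup.procs."""
    (Path(cgroup_dir) / "cgroup.procs").write_text(str(pid))
```

A job prolog could create one directory per job, call `set_memory_limits(job_dir, soft_bytes=10 * GIB)`, and then `add_process(job_dir, pid)` for the job's top-level process. How that hooks into your job manager is site-specific.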

Thanks Adam. We have moved from SGE already. But we did implement something similar to what you have suggested. We disabled h_vmem and set a custom resource for memory along with custom scripts to monitor the memory usage. — GP92, Dec 20 '18 at 01:47
