I am trying to use cgroups to limit memory usage of user processes on servers with a large amount of RAM (128 GB or more). What we want to achieve is to reserve about 6 GB of RAM for the OS and root processes and leave the rest to the users. We want to make sure we have free memory at all times and that the servers do not swap aggressively.
This works fine as long as the limit is set low enough (< 16 GB). User processes are correctly assigned to the right cgroup by cgred, and once the limit is reached the OOM killer terminates the memory-hungry processes.
The issue arises when we set the limit higher. The server then starts swapping as soon as a process uses more than 16 GB of RAM, even though memory usage is still well below the limit and plenty of RAM is available.
Is there a setting, or some sort of maximum, that limits the amount of memory we can grant to a cgroup?
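For reference, cgred assigns processes through /etc/cgrules.conf; a minimal sketch of such a rule looks like this (the "@users" group name is illustrative, not necessarily our exact rule):

# /etc/cgrules.conf (group name is illustrative)
# <user/group>   <controllers>         <destination cgroup>
@users           cpu,cpuset,memory     computenodes/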
Here is more info:
I use the following code to simulate user processes eating memory. The code keeps track of the allocated memory in a linked list, so the memory is actually used and remains accessible from within the program, as opposed to just being reserved with malloc (and overwriting the pointer each time).
/* Content of grabram.c */
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

struct testlink {
    void *ram;
    struct testlink *next;
};

int main (int argc, char *argv[]) {
    int block=8192;
    char buf[block];
    void *ram=NULL;
    FILE *frandom;
    int nbproc,i;
    pid_t pID;
    struct testlink *pstart, *pcurr, *pnew;

    if (argc < 2) {
        // nbproc = 1 by default
        nbproc=1;
    } else {
        if (sscanf(argv[1], "%d", &nbproc) != 1) {
            /* it is an error */
            printf("Failed to set number of child processes\n");
            return -1;
        }
    }

    // open /dev/urandom for reading
    frandom = fopen("/dev/urandom", "r");
    if ( frandom == NULL ) {
        printf("I can't open /dev/urandom, giving up\n");
        return -1;
    }
    // fill buf with one block of random data
    fread(&buf, block, 1, frandom);
    if ( ferror(frandom) ) {
        // the read failed, give up
        printf("Error reading from urandom\n");
        return -1;
    }
    fclose (frandom);

    // pID == 0 => child, pID < 0 => error, pID > 0 => parent
    for (i=1; i<nbproc; i++){
        pID = fork();
        // break out of the loop if a child
        if (pID == 0)
            break;
        // exit if fork fails
        if (pID < 0) {
            printf("fork() failed, dying\n");
            return -1;
        }
    }

    pstart = (struct testlink*)malloc(sizeof(struct testlink));
    pstart->ram=NULL;
    pstart->next=NULL;
    pcurr = pstart;
    while ( 1==1 ) {
        ram = (void *)malloc(block);
        if (ram == NULL) {
            printf("can't allocate memory\n");
            return -1;
        }
        memcpy(ram, &buf, block);
        // store allocated blocks of ram in a linked list
        // so no one can claim we are not using them
        pcurr->ram = ram;
        pnew = (struct testlink*)malloc(sizeof(struct testlink));
        pnew->ram=NULL;
        pnew->next=NULL;
        pcurr->next=pnew;
        pcurr=pnew;
    }
    return 0;
}
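The program is built and started in the obvious way, for example (the process count here is just an example):

gcc -o grabram grabram.c
./grabram 4    # 4 processes, each allocating and touching 8 KB blocks forever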
So far I have tried setting the following tunables:
vm.overcommit_memory
vm.overcommit_ratio
vm.swappiness
vm.dirty_ratio
vm.dirty_background_ratio
vm.vfs_cache_pressure
None of these sysctl settings seemed to have any effect. The server starts swapping once the code above goes past the 16 GB barrier, even with swappiness set to 0, overcommit disabled, and so on. I even tried turning swap off entirely, to no avail: even with no swap, kswapd is still triggered and performance decreases.
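For what it is worth, the settings were applied with sysctl, roughly like this (the exact values varied between attempts):

sysctl -w vm.swappiness=0
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=100
sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.vfs_cache_pressure=100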
Finally, here is the relevant content of the cgconfig.conf file:
mount {
    cpuset = /cgroup/computenodes;
    cpu    = /cgroup/computenodes;
    memory = /cgroup/computenodes;
}

# limit = 120G
group computenodes {
    # set memory.memsw the same so users can't use swap
    memory {
        memory.limit_in_bytes = 120G;
        memory.memsw.limit_in_bytes = 120G;
        memory.swappiness = 0;
        # memory.use_hierarchy = 1;
    }
    # No alternate memory nodes if the system is not NUMA
    # On computenodes use all available cores
    cpuset {
        cpuset.mems = "0";
        cpuset.cpus = "0-47";
    }
}
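For completeness, once cgconfig has loaded this, the limits can be double-checked with something like:

cgget -r memory.limit_in_bytes -r memory.memsw.limit_in_bytes computenodes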
Finally, we use CentOS 6 with kernel 2.6.32.
Thanks
Yes, with one cgroup and one instance of the process, I start swapping at 16 GB despite more than 100 GB free. It takes less than a minute to get there. – Marc-andré Labonté Feb 24 '14 at 20:51