
I know it is possible to set the NUMA mode to "interleave" (see NB below) for a specific process using numactl --interleave, but I'd like to know if it is possible to make this the system-wide default (i.e., change the "system policy"). For example, is there a kernel boot flag to achieve this?

NB: here I'm talking about the kernel behavior which interleaves allocated pages across NUMA nodes - not the memory-controller setting at the BIOS level, which interleaves cache lines across nodes.
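
For concreteness, the per-process form I'm referring to looks like this (`./myapp` is just a placeholder binary):

    # Per-process: interleave this process's page allocations across all nodes
    numactl --interleave=all ./myapp

    # Show the NUMA policy in effect for the current shell
    numactl --show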

asked by BeeOnRope
  • Which specific OS and version are you using? – ewwhite Sep 10 '14 at 12:36
  • I've heard of premature optimization, but this sounds like premature un-optimization! I'm very curious as to what the use case is for this. – Michael Hampton Sep 11 '14 at 10:33
  • @MichaelHampton Some databases and large-memory applications recommend this ([here](http://docs.mongodb.org/manual/administration/production-notes/#mongodb-and-numa-hardware), [here](https://issues.apache.org/jira/browse/CASSANDRA-2594) and [here](http://www.percona.com/doc/percona-server/5.5/performance/innodb_numa_support.html)). – ewwhite Sep 11 '14 at 10:40
  • At this point, curiosity and the need to test different configurations. One of the favorite responses to any question on the Stack Exchange sites seems to be "why would you want to even do that?!". Another common response is "you need to test that (configuration, idea, optimization, etc.)". So to test things, you need to be able to configure them in different ways... – BeeOnRope Sep 11 '14 at 10:41
  • @ewwhite We are largely using RHEL, although I'm especially interested in options available on all modern Linux distros. – BeeOnRope Sep 11 '14 at 10:42
  • Not that it matters, but the "interleave" feature makes sure you get access to all the memory. If you don't interleave, then when you malloc, you get a block close to the core that thread is running on. In some situations, you might deplete from one NUMA block when the other is free - and I believe malloc() won't try the other NUMA block by default. Thus, some database developers think Interleave is better. Whether they are right or not --- the answer being "test, test" as stated here. – Brian Bulkowski Feb 10 '17 at 17:31
  • @BrianBulkowski - I think that's mostly not the case. Based on an inspection of the source `malloc` isn't even NUMA-aware, so the underlying `malloc` behavior comes largely from the system OS allocation policy (i.e., where `sbrk` and `mmap` pages get allocated). The details are [available](https://www.kernel.org/doc/Documentation/vm/numa_memory_policy.txt) but even there none of the NUMA policies are "allocate on local node or else fail", but rather always fall back. Of course the admin could bind the process using numa policy or cpusets, or the programmer could use numa-specific calls. – BeeOnRope Feb 10 '17 at 17:47

1 Answer


If you're using RHEL/CentOS/Fedora, I'd suggest the numad daemon (Red Hat paywall link).

While I don't have much use for the numactl --interleave directive, it seems you've determined that your workload requires it. Can you explain why this is the case in order to provide some better context?

Edit:

It seems that most applications that recommend an explicit numactl policy either make a libnuma library call or incorporate numactl in a wrapper script.
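
As a rough sketch, such a wrapper might look like this (the application path and its arguments are placeholders):

    #!/bin/sh
    # Hypothetical wrapper: run a single application with its pages
    # interleaved across all NUMA nodes, without touching system-wide policy.
    exec numactl --interleave=all /usr/bin/myapp "$@"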

For the numad side, there's a configuration option that can be specified on the command line or in /etc/numad.conf...

-K <0|1>
   This option controls whether numad keeps interleaved memory spread across NUMA
   nodes, or attempts to merge interleaved memory to local NUMA nodes. The default
   is to merge interleaved memory. This is the appropriate setting to localize
   processes in a subset of the system's NUMA nodes. If you are running a large,
   single-instance application that allocates interleaved memory because the
   workload will have continuous unpredictable memory access patterns (e.g. a
   large in-memory database), you might get better results by specifying -K 1 to
   instruct numad to keep interleaved memory distributed.

Some report that running something like numad -K 1 -u X, where X is 100 × the core count, helps in this situation. Try it.
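
For example, on a hypothetical 16-core host, that rule of thumb works out to:

    # X = 100 x 16 cores = 1600
    numad -K 1 -u 1600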

Also see HP's ProLiant Whitepaper on Linux and NUMA.

answered by ewwhite
  • Basically I have a situation where it may be difficult to use numactl explicitly to launch the process, so I was curious if there was a way to set the default. The various policies, such as interleave, do seem to exist (the kernel uses interleave at startup, for example), so it seemed that there would be some way to set the default. – BeeOnRope Sep 11 '14 at 09:43
  • I understand. Use `numad` instead. – ewwhite Sep 11 '14 at 09:59
  • Based on my limited understanding of numad, it doesn't seem like it can do what I want. It mostly moves memory around after the fact, trying to consolidate working sets that have become spread across nodes - but it doesn't seem to affect the initial allocation node. So it can only help me "de-interleave", never increase the interleaving. – BeeOnRope Sep 11 '14 at 10:11
  • Please give your hardware and OS specifics. – ewwhite Sep 11 '14 at 10:41
  • Let's say RHEL and x86 commodity servers (e.g., Dell PowerEdge boxes). – BeeOnRope Sep 11 '14 at 10:44
  • @BeeOnRope 2-socket machines? Not 4-socket? See my edit above. – ewwhite Sep 11 '14 at 10:50
  • Let's say mostly 2-socket, but does it matter? – BeeOnRope Sep 11 '14 at 11:02
  • @BeeOnRope Yeah, it matters a little. I've worked with a lot of 4-socket machines and have had to modify policy with some knowledge of the changes to the underlying architecture. Granted, this was about achieving the best locality, but I just wanted to check if you were dealing with an edge case. – ewwhite Sep 11 '14 at 11:08