
Is there a way to detect memory fragmentation on Linux? I ask because on some long-running servers I have noticed performance degradation, and only after I restart the process do I see better performance. I noticed it more when using Linux huge page support -- are huge pages in Linux more prone to fragmentation?

I have looked at /proc/buddyinfo in particular. I want to know whether there are any better ways (not just CLI commands per se; any program or theoretical background would do) to look at it.
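
For context, roughly how I am reading /proc/buddyinfo today -- a minimal Python sketch (it assumes 4 KiB base pages; column N of buddyinfo is the number of free blocks of 2^N contiguous pages):

    # Sketch: summarize /proc/buddyinfo per zone and per order.
    PAGE_KB = 4  # assumes 4 KiB base pages

    with open("/proc/buddyinfo") as f:
        for line in f:
            fields = line.split()
            node, zone = fields[1].rstrip(","), fields[3]
            counts = [int(c) for c in fields[4:]]
            print(f"Node {node}, zone {zone}")
            for order, count in enumerate(counts):
                print(f"  order {order:2d} ({PAGE_KB * 2 ** order:7d} KiB): {count} free blocks")

A zone whose high-order columns are all (or nearly all) zero cannot hand out large physically contiguous blocks, which is the symptom I am trying to quantify.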

Raghu
  • I am not looking for just quick command-line solutions; any simple program/theory will also do. Hence, I did not ask at serverfault. – Raghu Apr 16 '10 at 09:37
  • I don't understand one point here. As far as I understand, memory fragmentation must lead to a lack of memory and, as a result, to memory allocation errors. However, you are asking about performance degradation. Is it because you have lots of memory swapped to disk? And if so, what does `vmstat` show in the `so` field? –  Apr 16 '10 at 10:49
  • @skwllsp - Edited my answer to be more specific. – Tim Post Apr 16 '10 at 12:48
  • @Raghu - I would not expect most system administrators to modify kernel code to make memory management behave differently, however, _skilled_ Linux admins should know at least an overview of _how_ Linux manages memory. This question is really on the line. I voted to migrate it simply because I can't suggest (in my answer) code that answers your question. Reading from /proc or using `vmstat` is a common user experience. If you were writing a _program_ to do the same, it would be different. If you intend to use bash to harvest this info, edit your question, it won't be closed :) – Tim Post Apr 16 '10 at 13:47
  • @Tim - As I suggested, it is not just the bash/cli commands I wanted to know; I needed the information to help me in my benchmarking procedure (to analyze the results, not to run them). – Raghu Apr 16 '10 at 14:11
  • @Raghu - your question looks like you want a program to analyze it, not an understanding of what's going on. Since "as I suggested" isn't exactly clear, I suggest moving your last comment into your question while being more explicit. – Tim Post Apr 16 '10 at 14:27

4 Answers


I am answering the linux tag; my answer is specific only to Linux.

Yes, huge pages are more prone to fragmentation. There are two views of memory, the one your process gets (virtual) and the one the kernel manages (real). The larger the page, the more difficult it is going to be to group it with (and keep it next to) its neighbors, especially when your service is running on a system that also has to support others which, by default, allocate and write to far more memory than they actually end up using.

The kernel's mapping of (real) granted addresses is private. There's a very good reason why userspace sees them as the kernel presents them, because the kernel needs to be able to overcommit without confusing userspace. Your process gets a nice, contiguous "Disneyfied" address space in which to work, oblivious of what the kernel is actually doing with that memory behind the scenes.

The reason you see degraded performance on long running servers is most likely because allocated blocks that have not been explicitly locked (e.g. mlock()/mlockall() or posix_madvise()) and not modified in a while have been paged out, which means your service skids to disk when it has to read them. Modifying this behavior makes your process a bad neighbor, which is why many people put their RDBMS on a completely different server than web/php/python/ruby/whatever. The only way to fix that, sanely, is to reduce the competition for contiguous blocks.
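
For illustration only (this is exactly the "bad neighbor" behaviour described above, and it needs CAP_IPC_LOCK or a raised RLIMIT_MEMLOCK), a minimal Python sketch of explicit locking via mlockall(2) through ctypes:

    # Sketch: pin all current and future pages of this process in RAM so the
    # kernel can no longer page them out.
    import ctypes
    import os

    MCL_CURRENT = 1  # Linux flag values for mlockall()
    MCL_FUTURE = 2

    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), "mlockall")
    print("pages locked; check VmLck in /proc/self/status")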

Fragmentation is only really noticeable (in most cases) when page A is in memory and page B has moved to swap. Naturally, re-starting your service would seem to 'cure' this, but only because the kernel has not yet had an opportunity to page out the process' (now) newly allocated blocks within the confines of its overcommit ratio.

In fact, re-starting (let's say) 'apache' under a high load is likely going to send blocks owned by other services straight to disk. So yes, 'apache' would improve for a short time, but 'mysql' might suffer ... at least until the kernel makes them suffer equally when there is simply a lack of ample physical memory.

Add more memory, or split up demanding malloc() consumers :) It's not just fragmentation that you need to be looking at.

Try vmstat to get an overview of what's actually being stored where.
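
If you would rather sample the same thing from a program, a rough Python sketch that reads /proc/vmstat (pswpin/pswpout are cumulative counts of pages swapped in and out, so a growing delta while the service is slow points at paging rather than fragmentation):

    # Sketch: measure swap-in/swap-out activity over a short interval.
    import time

    def swap_counters():
        counters = {}
        with open("/proc/vmstat") as f:
            for line in f:
                key, value = line.split()
                counters[key] = int(value)
        return counters["pswpin"], counters["pswpout"]

    in0, out0 = swap_counters()
    time.sleep(5)
    in1, out1 = swap_counters()
    print(f"pages swapped in : {in1 - in0} over 5s")
    print(f"pages swapped out: {out1 - out0} over 5s")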

Tim Post
  • Thank you for the answer. I was using huge pages (size = 2048 KB each) for mysql - innodb buffer pool - to see how well it fares (using sysbench). Initially, when the process uptime (and even system uptime) was low, it was giving very good results. However, its performance started to degrade over several runs. Regarding the page-out you mentioned, I surely noticed high VM activity, but I presumed it may have been because of the benchmark and innodb log flushing (VM activity was higher with huge pages than without). I also set vm.swappiness to 1. I could not notice any drastic change. – Raghu Apr 16 '10 at 13:49
  • According to [the fine manual](https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt), "Huge pages cannot be swapped out under memory pressure." I think this is a good answer in w/r/t standard memory but not for hugepages. – Dan Pritts Mar 28 '16 at 03:50

Kernel

To get the current fragmentation index, use:

sudo cat /sys/kernel/debug/extfrag/extfrag_index

To defragment kernel memory try executing:

sysctl vm.compact_memory=1  

Also, you can try turning off Transparent Huge Pages (aka THP) and/or disabling swap (or decreasing swappiness).
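
If you want to script the two steps above, a rough Python sketch (run as root; per Documentation/sysctl/vm.txt, an index of -1 means an allocation of that order would succeed, values towards the low end of the scale point at lack of memory, and values towards the high end point at fragmentation):

    # Sketch: dump the per-zone fragmentation index, then trigger compaction.
    EXTFRAG = "/sys/kernel/debug/extfrag/extfrag_index"
    COMPACT = "/proc/sys/vm/compact_memory"

    with open(EXTFRAG) as f:
        print(f.read(), end="")

    with open(COMPACT, "w") as f:  # same effect as `sysctl vm.compact_memory=1`
        f.write("1")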

Userspace

To reduce userspace fragmentation you may want to try a different allocator, e.g. jemalloc (it has great introspection capabilities, which will give you insight into the allocator's internal fragmentation).

You can switch to a custom malloc by recompiling your program with it or just by running your program with LD_PRELOAD: LD_PRELOAD=${JEMALLOC_PATH}/lib/libjemalloc.so.1 app (beware of interactions between THP and memory allocators).

Although slightly unrelated to memory fragmentation (but connected to memory compaction/migration), you probably want to run multiple instances of your service, one per NUMA node, and bind them using numactl.

SaveTheRbtz
  • Why would you think that disabling swap could help? To me it seems more likely that disabling swap will hurt even more. – kasperd Dec 30 '15 at 23:31
  • Because there is not enough information in the original post; maybe the process is just leaking and starts swapping. Also, I do not see any legitimate reasons for using swap on pretty much any production system (maybe only for shared workstations for students). – SaveTheRbtz Dec 31 '15 at 00:44
  • Having enough swap space will improve performance. The performance problems you will get if you have insufficient swap space are reason enough to enable swap. – kasperd Dec 31 '15 at 00:56
  • @SaveTheRbtz A good reason to use swap on a production system is that it gives the system more options that it will use only if it thinks they are beneficial. Also, it permits modified pages that haven't been accessed in hours (and may never be accessed) to be ejected from precious physical memory. Lastly, it allows the system to sanely handle cases where much more memory is reserved than is used. – David Schwartz Dec 31 '15 at 01:06
  • "only if it thinks they are beneficial" - that adds an additional heuristic and makes the system less predictable. Also, page replacement algorithms (used for swap and anonymous `mmap`) are implemented differently on different kernels (e.g. Linux vs FreeBSD), or even different versions of the same OS (2.6.32 vs 3.2 vs 3.10). "it permits modified pages [...] to be ejected from [...] physical memory" - that will hide memory leaks. "handle cases where much more memory is reserved than is used" - a slow system is way worse than a down system, so "sane" is questionable. – SaveTheRbtz Dec 31 '15 at 04:02
  • How do I interpret the output of `sudo cat /sys/kernel/debug/extfrag/extfrag_index`? – sudo Jul 17 '17 at 18:01
  • -1: no problem with allocations of that order; values towards 0: lack of memory; values towards 1 (or 1000?): memory available but fragmented. https://github.com/torvalds/linux/blob/6faf05c2b2b4fe70d9068067437649401531de0a/Documentation/sysctl/vm.txt#L246 – fche Nov 30 '18 at 04:01

Using huge pages should not cause extra memory fragmentation on Linux; Linux support for huge pages is only for shared memory (via shmget or mmap), and any huge pages used must be specifically requested and preallocated by a system admin. Once in memory, they are pinned there, and are not swapped out. The challenge of swapping in huge pages in the face of memory fragmentation is exactly why they remain pinned in memory (when allocating a 2MB huge page, the kernel must find 512 contiguous free 4KB pages, which may not even exist).

Linux documentation on huge pages: http://lwn.net/Articles/375098/

There is one circumstance where memory fragmentation could cause huge page allocation to be slow (but not where huge pages cause memory fragmentation), and that's if your system is configured to grow the pool of huge pages if requested by an application. If /proc/sys/vm/nr_overcommit_hugepages is greater than /proc/sys/vm/nr_hugepages, this might happen.
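
A quick way to check whether you are in that situation is to compare those two sysctls against the HugePages_* counters in /proc/meminfo; a minimal Python sketch (default huge page size only):

    # Sketch: show the static huge page pool vs. the overcommit limit.
    def read_int(path):
        with open(path) as f:
            return int(f.read().strip())

    nr_hugepages = read_int("/proc/sys/vm/nr_hugepages")
    nr_overcommit = read_int("/proc/sys/vm/nr_overcommit_hugepages")

    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, _, rest = line.partition(":")
            meminfo[key] = rest.split()[0]

    print(f"nr_hugepages (static pool):        {nr_hugepages}")
    print(f"nr_overcommit_hugepages (surplus): {nr_overcommit}")
    print(f"HugePages_Total: {meminfo['HugePages_Total']}")
    print(f"HugePages_Free:  {meminfo['HugePages_Free']}")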

jstultz
  • Indeed - and it should generally *help* performance because it will prevent TLB misses (see the linked article for explanation). – Dan Pritts Dec 16 '13 at 15:21

There is /proc/buddyinfo, which is very useful. It's even more useful with a nice output format, such as this Python script can provide:

https://gist.github.com/labeneator/9574294

For huge pages you want some free fragments of 2097152 bytes (2 MiB) or bigger. For transparent huge pages, the kernel will compact memory automatically when asked for some, but if you want to see how many you could get, then as root run:

echo 1 | sudo tee /proc/sys/vm/compact_memory
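
To put a number on "how many you can get", a small Python sketch (it assumes 4 KiB base pages, so order 9 = 2 MiB) that counts the huge-page-sized free blocks each zone reports in /proc/buddyinfo; run it before and after the compaction trigger above:

    # Sketch: count free blocks of order >= 9 (2 MiB with 4 KiB pages) per zone.
    HUGE_ORDER = 9

    with open("/proc/buddyinfo") as f:
        for line in f:
            fields = line.split()
            label = f"Node {fields[1].rstrip(',')}, zone {fields[3]}"
            counts = [int(c) for c in fields[4:]]
            blocks = sum(count * 2 ** (order - HUGE_ORDER)
                         for order, count in enumerate(counts)
                         if order >= HUGE_ORDER)
            print(f"{label}: {blocks} free 2 MiB chunks")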

Also yes, huge pages cause big problems for fragmentation. Either you cannot get any huge pages, or their presence causes the kernel to spend a lot of extra time trying to get some.

I have a solution that works for me. I use it on a couple of servers and my laptop. It works great for virtual machines.

Add the kernelcore=4G option to your Linux kernel command line. On my server I use 8G. Be careful with the number, because it will prevent your kernel from allocating anything outside of that memory. Servers that need a lot of socket buffers or that stream disk writes to hundreds of drives will not like being limited like this. Any memory allocation that has to be "pinned" for slab or DMA is in this category.

All of your other memory then becomes "movable" which means it can be compacted into nice chunks for huge page allocation. Now transparent huge pages can really take off and work as they are supposed to. Whenever the kernel needs more 2M pages it can simply remap 4K pages to somewhere else.
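
To confirm that the option took effect, look for the Movable zone in /proc/zoneinfo; a minimal Python sketch (reasonably recent kernels; the MiB math assumes 4 KiB pages) that lists each zone with its managed page count:

    # Sketch: list zones and their "managed" page counts. With kernelcore= in
    # effect you should see a Movable zone holding most of your RAM.
    node, zone = None, None
    with open("/proc/zoneinfo") as f:
        for line in f:
            if line.startswith("Node"):
                node = line.split(",")[0]
                zone = line.split("zone")[1].strip()
            elif zone and line.strip().startswith("managed"):
                pages = int(line.split()[1])
                print(f"{node}, zone {zone}: {pages} pages (~{pages * 4 // 1024} MiB)")
                zone = None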

And, I'm not totally sure how this interacts with zero-copy direct IO. Memory in the "movable zone" is not supposed to be pinned, but a direct IO request would do exactly that for DMA. It might copy it. It might pin it in the movable zone anyway. In either case it probably isn't exactly what you wanted.

Zan Lynx