
Our service runs on AWS on m5.12xlarge nodes (48 cores, 192 GB RAM) on Ubuntu 16.04, with Java 8. We allocate about 150 GB as the max heap size. There is no swap on the node. The nature of our service is that it allocates a lot of large short-lived objects. In addition, through a third-party library that we depend on, we create a lot of short-lived processes that communicate with our process via pipes and are reaped after serving a handful of requests.

We noticed that some time after the process starts, once its RES (in top) reaches about 70 GB, CPU interrupts increase significantly and the JVM's GC logs show sys time shooting up to tens of seconds (sometimes 70 seconds). Load averages, which start out at < 1, end up at almost 10 on these 48-core nodes in this state.

sar output indicates that when a node is in this state, minor page faults increase significantly. Broadly, a high number of CPU interrupts correlates with this state.
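For reference, paging activity of this sort can be watched with something like the following; the exact invocation and the <jvm-pid> placeholder are illustrative, not the commands we have scripted:

    # paging statistics every 5 seconds; fault/s counts minor + major faults,
    # majflt/s counts major faults only, so minor faults = fault/s - majflt/s
    sar -B 5

    # per-process fault counters for the JVM (replace <jvm-pid>)
    ps -o min_flt,maj_flt -p <jvm-pid>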

Restarting our service provides only a temporary respite. Load averages slowly but surely climb and GC sys times go through the roof again.

We run our service on a cluster of about 10 nodes with load distributed (almost) equally. Some nodes get into this state more often and more quickly than others, which keep working normally.

We have tried various GC options, as well as options such as large pages/THP and so on, with no luck.

Here's a snapshot of loadavg and meminfo

    /proc/meminfo on a node with high load avg:

    MemTotal:       193834132 kB
    MemFree:        21391860 kB
    MemAvailable:   52217676 kB
    Buffers:          221760 kB
    Cached:          9983452 kB
    SwapCached:            0 kB
    Active:         144240208 kB
    Inactive:        4235732 kB
    Active(anon):   138274336 kB
    Inactive(anon):    24772 kB
    Active(file):    5965872 kB
    Inactive(file):  4210960 kB
    Unevictable:        3652 kB
    Mlocked:            3652 kB
    SwapTotal:             0 kB
    SwapFree:              0 kB
    Dirty:             89140 kB
    Writeback:             4 kB
    AnonPages:      138292556 kB
    Mapped:           185656 kB
    Shmem:             25480 kB
    Slab:           22590684 kB
    SReclaimable:   21680388 kB
    SUnreclaim:       910296 kB
    KernelStack:       56832 kB
    PageTables:       611304 kB
    NFS_Unstable:          0 kB
    Bounce:                0 kB
    WritebackTmp:          0 kB
    CommitLimit:    96917064 kB
    Committed_AS:   436086620 kB
    VmallocTotal:   34359738367 kB
    VmallocUsed:           0 kB
    VmallocChunk:          0 kB
    HardwareCorrupted:     0 kB
    AnonHugePages:  85121024 kB
    CmaTotal:              0 kB
    CmaFree:               0 kB
    HugePages_Total:       0
    HugePages_Free:        0
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:       2048 kB
    DirectMap4k:      212960 kB
    DirectMap2M:    33210368 kB
    DirectMap1G:    163577856 kB



    /proc/meminfo on a node that is behaving OK:
    MemTotal:       193834132 kB
    MemFree:        22509496 kB
    MemAvailable:   45958676 kB
    Buffers:          179576 kB
    Cached:          6958204 kB
    SwapCached:            0 kB
    Active:         150349632 kB
    Inactive:        2268852 kB
    Active(anon):   145485744 kB
    Inactive(anon):     8384 kB
    Active(file):    4863888 kB
    Inactive(file):  2260468 kB
    Unevictable:        3652 kB
    Mlocked:            3652 kB
    SwapTotal:             0 kB
    SwapFree:              0 kB
    Dirty:           1519448 kB
    Writeback:             0 kB
    AnonPages:      145564840 kB
    Mapped:           172080 kB
    Shmem:              9056 kB
    Slab:           17642908 kB
    SReclaimable:   17356228 kB
    SUnreclaim:       286680 kB
    KernelStack:       52944 kB
    PageTables:       302344 kB
    NFS_Unstable:          0 kB
    Bounce:                0 kB
    WritebackTmp:          0 kB
    CommitLimit:    96917064 kB
    Committed_AS:   148479160 kB
    VmallocTotal:   34359738367 kB
    VmallocUsed:           0 kB
    VmallocChunk:          0 kB
    HardwareCorrupted:     0 kB
    AnonHugePages:  142260224 kB
    CmaTotal:              0 kB
    CmaFree:               0 kB
    HugePages_Total:       0
    HugePages_Free:        0
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:       2048 kB
    DirectMap4k:      149472 kB
    DirectMap2M:    20690944 kB
    DirectMap1G:    176160768 kB

The most significant chunk of the flamegraph is:

https://i.stack.imgur.com/yXmOM.png

By chance we ended up rebooting a node and noticed that it ran in a very stable manner for about 2 weeks with no change elsewhere. Since then we've resorted to rebooting nodes that hit this state to get some breathing room. We later read elsewhere that these symptoms could be related to the page table getting wedged, which can only be mitigated by a reboot. It is not clear whether this is correct or whether it is the reason for our situation.

Is there a way to resolve this issue permanently?

devurandom
    Please edit your question to add the contents of `/proc/meminfo` in the problem state. Also try visualizing what is on CPU to see where the problem is. Consider collecting both Java and kernel profiling data then making mixed-mode flame graphs. https://medium.com/netflix-techblog/java-in-flames-e763b3d32166 – John Mahowald Jun 25 '19 at 00:25
  • @JohnMahowald Thanks for your response. I updated my post with the information as suggested by you. I am still not able to spot anything that could help solve the problem. – devurandom Jun 25 '19 at 21:43
  • It kind of smells like a NUMA issue. Can you run the workload on a larger quantity of smaller instances? I'm not sure what else you can really do with a NUMA problem on EC2. – Michael Hampton Jun 25 '19 at 22:16
  • (1) We were running on smaller machines initially and then we moved to these bigger ones. We had similar load average issues on the smaller ones, though we didn't analyze the issue deeply. We were treating it as a Java heap-size/GC issue at that time. (2) There's no /proc//numa_maps on the box, so I suppose NUMA is not in play. (3) The fact that a box behaves okay for a few days after a reboot (but not on restarting our application) seems to indicate a case of state build-up in the system that can be undone only with a reboot. – devurandom Jun 25 '19 at 23:20

1 Answer


Transparent huge pages are getting fragmented or churning. On Linux, this is the size of memory at which to consider abandoning transparent huge pages and explicitly setting up page sizes.
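As a sketch of how to see whether that is happening and to rein THP in (sysfs paths as on a stock Ubuntu 16.04 kernel; restricting to madvise is just one option):

    # current THP allocation and defrag policy
    cat /sys/kernel/mm/transparent_hugepage/enabled
    cat /sys/kernel/mm/transparent_hugepage/defrag

    # counters that grow when THP allocations stall on compaction or fall back
    grep -E 'compact_stall|thp_fault_fallback|thp_collapse_alloc' /proc/vmstat

    # as root: restrict THP to madvise()d regions, or "never" to disable it
    echo madvise > /sys/kernel/mm/transparent_hugepage/enabled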

Differences greater than 10 GB, bad minus good, in GB:

    Committed_AS:   274.3
    AnonHugePages:  -54.5
    DirectMap2M:    11.9
    DirectMap1G:    -12.0

The shift from DirectMap1G down to DirectMap2M shows how internally the x86 TLB and Linux had less contiguous space to work with. The largest difference is the roughly 55 GB lost from AnonHugePages. Somehow that blew up Committed_AS to 225% of your MemTotal, which is a bad symptom; this system is going to page like mad.

Given the page faults in the flame graph stack, you are getting large overheads from the Linux virtual memory system shuffling pages around.


Improving performance includes explicitly configuring huge pages. A 150 GB heap is well beyond the ~30 GB transition point where compressed pointers are no longer feasible (lots has been written about staying under that threshold). Triple-digit GB heaps are also the size at which I consider Linux huge pages worth seriously evaluating.

On OpenJDK or Oracle JDK: properly allocate huge pages first, then use the option -XX:+UseLargePages. See Java Support for Large Memory Pages and the Debian wiki on HugePages.
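A minimal sketch of that order of operations, assuming the default 2 MB huge page size and the 150 GB heap from the question; the page count, the group/ulimit note, and the java invocation are illustrative:

    # 150 GB / 2 MB = 76800 huge pages; reserve a little headroom (as root).
    # Reservation works best soon after boot, before memory fragments;
    # boot-time parameters are the most reliable option.
    sysctl -w vm.nr_hugepages=78000
    grep HugePages /proc/meminfo          # confirm HugePages_Total/Free

    # The JVM's group may also need vm.hugetlb_shm_group and a raised
    # memlock ulimit, per the linked Oracle and Debian documentation.

    # Start the JVM against the explicit pool
    java -Xms150g -Xmx150g -XX:+UseLargePages -jar service.jar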

If you also wish to experiment with garbage collectors, have a look at OpenJDK's ZGC wiki page. Limited pause times, handling large heaps, and NUMA awareness are explicit goals. In short, also tack on the experimental options -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -XX:+UseLargePages. The ZGC wiki page also discusses working with Linux huge page pools and hugetlbfs; it is always helpful to have examples of those things.
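For completeness, a sketch of what that command line might look like. Note that ZGC requires JDK 11 or later, whereas the service in the question is on Java 8; the heap size and jar name are placeholders:

    # experimental ZGC (JDK 11+), combined with explicit large pages
    java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -XX:+UseLargePages \
         -Xms150g -Xmx150g -jar service.jar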


Regarding NUMA, think for a minute about the CPU this would run on: probably two 24-core sockets or so. AWS isn't specific, but say it is a Xeon Platinum 8175. Because you will be executing on different sockets, some of the memory will not be local to the socket. This is true even if the hypervisor doesn't expose that topology to the VM guest.

Two sockets on a modern Xeon can have manageable NUMA effects, however. Page size is the bigger problem.
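A quick way to check what topology the hypervisor actually exposes to the guest (numactl may need to be installed; the commands below are a sketch):

    # NUMA nodes and CPU layout as seen by the VM
    lscpu | grep -i numa
    numactl --hardware

    # per-node memory statistics, if more than one node is exposed
    numastat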

John Mahowald