Our service runs on AWS, on m5.12xlarge instances (48 cores, 192 GB RAM) with Ubuntu 16.04 and Java 8. We set the max heap size to about 150 GB, and there is no swap on the nodes. The nature of the service is that it allocates a lot of large, short-lived objects. In addition, through a third-party library we depend on, we create a lot of short-lived processes that communicate with our process via pipes and are reaped after serving a handful of requests.
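For context, the launch looks roughly like the sketch below; only Java 8 and the ~150 GB max heap are known for certain, while the GC-logging flags, log path, and jar name are illustrative assumptions (logging along these lines is what produces the sys-time figures referred to next).

# Representative launch (sketch), not our exact command line.
# Only Java 8 and the ~150 GB max heap are certain; the GC-logging flags,
# log path, and jar name are assumptions for illustration.
java -Xmx150g \
     -Xloggc:/var/log/myservice/gc.log \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -jar myservice.jar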
We noticed that some time after the process starts, once its RES (in top) reaches about 70 GB, CPU interrupts increase significantly and the JVM's GC logs show sys time shooting up to tens of seconds (sometimes 70 seconds). Load averages, which start out below 1, end up at almost 10 on these 48-core nodes in this state.
sar output indicates that when a node is in this state, minor page faults increase significantly. Broadly, a high number of CPU interrupts correlates with this state.
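For reference, this is roughly how we watch the fault rate (standard sysstat; the 5-second interval is arbitrary):

# Paging statistics every 5 seconds. fault/s counts all page faults
# (major + minor) and majflt/s counts major faults only, so a rising
# fault/s with a flat majflt/s means minor faults are climbing.
sar -B 5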
Restarting our service provides only temporary respite. Load averages slowly but surely climb back up, and GC sys times go through the roof again.
We run the service on a cluster of about 10 nodes with the load distributed (almost) equally. Some nodes get into this state more often and more quickly than others, which keep working normally.
We have tried various GC options, as well as settings such as large pages/THP, with no luck.
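For concreteness, the knobs we experimented with are of this kind (a sketch of the standard settings, not an exact record of what we tried):

# Kernel-side transparent huge pages: inspect, then disable THP and its defrag.
cat /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

# JVM-side (Java 8) large-page options:
#   -XX:+UseLargePages              explicit hugetlbfs pages (needs vm.nr_hugepages)
#   -XX:+/-UseTransparentHugePages  opts the heap in/out of THP via madvise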
Here is a snapshot of meminfo from a node with a high load average and from one that is behaving normally.

/proc/meminfo on the node with the high load average:
MemTotal: 193834132 kB
MemFree: 21391860 kB
MemAvailable: 52217676 kB
Buffers: 221760 kB
Cached: 9983452 kB
SwapCached: 0 kB
Active: 144240208 kB
Inactive: 4235732 kB
Active(anon): 138274336 kB
Inactive(anon): 24772 kB
Active(file): 5965872 kB
Inactive(file): 4210960 kB
Unevictable: 3652 kB
Mlocked: 3652 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 89140 kB
Writeback: 4 kB
AnonPages: 138292556 kB
Mapped: 185656 kB
Shmem: 25480 kB
Slab: 22590684 kB
SReclaimable: 21680388 kB
SUnreclaim: 910296 kB
KernelStack: 56832 kB
PageTables: 611304 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 96917064 kB
Committed_AS: 436086620 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 85121024 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 212960 kB
DirectMap2M: 33210368 kB
DirectMap1G: 163577856 kB
/proc/meminfo on the node that is behaving normally:
MemTotal: 193834132 kB
MemFree: 22509496 kB
MemAvailable: 45958676 kB
Buffers: 179576 kB
Cached: 6958204 kB
SwapCached: 0 kB
Active: 150349632 kB
Inactive: 2268852 kB
Active(anon): 145485744 kB
Inactive(anon): 8384 kB
Active(file): 4863888 kB
Inactive(file): 2260468 kB
Unevictable: 3652 kB
Mlocked: 3652 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 1519448 kB
Writeback: 0 kB
AnonPages: 145564840 kB
Mapped: 172080 kB
Shmem: 9056 kB
Slab: 17642908 kB
SReclaimable: 17356228 kB
SUnreclaim: 286680 kB
KernelStack: 52944 kB
PageTables: 302344 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 96917064 kB
Committed_AS: 148479160 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 142260224 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 149472 kB
DirectMap2M: 20690944 kB
DirectMap1G: 176160768 kB
The most significant chunk of the flamegraph is:
https://i.stack.imgur.com/yXmOM.png
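(A CPU flame graph like this can be captured with perf plus the FlameGraph scripts; the commands below are a sketch of that approach, not necessarily how this particular graph was produced.)

# Sample all CPUs at 99 Hz with call stacks for 60 seconds, then fold the
# stacks and render an SVG using https://github.com/brendangregg/FlameGraph.
perf record -F 99 -a -g -- sleep 60
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flamegraph.svg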
By chance we rebooted one node and noticed that it ran very stably for about two weeks with no other changes. Since then we have resorted to rebooting nodes that hit this state to get some breathing room. We later read elsewhere that these symptoms could be related to the page tables getting wedged in a way that can only be mitigated by a reboot. It is not clear whether that is correct, or whether it is the cause of our situation.
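To sanity-check the "wedged page tables" theory, one thing we can compare between a node in this state and a freshly rebooted one is the kernel's compaction and fragmentation counters (a sketch; we are not certain these are the decisive counters):

# Compaction and THP activity: compact_stall/compact_fail climbing on the bad
# node would point at the kernel spending time compacting memory.
grep -E 'compact|thp' /proc/vmstat

# Free-memory fragmentation per allocation order (few free high-order pages on
# the bad node would fit the same picture).
cat /proc/buddyinfo

# Page-table, slab, and THP footprint over time (same fields as the snapshots above).
grep -E 'PageTables|AnonHugePages|Slab|SReclaimable|SUnreclaim' /proc/meminfo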
Is there a way to resolve this issue permanently?