
We have been running our Java system for over 2 years without ever having a system hang. We have 2 physical servers running similar Java software (2 JVMs on each server) to form a cluster. As far as I can tell, the crashing only started happening when we introduced core pinning and mappedbus.io for shared memory access between the 2 JVMs on one of the servers. The system hang has only happened 4 times in 2 weeks, and it only ever happens on the machine where we configured the core pinning and memory-mapped file access between the JVMs. We have since disabled that config, so we don't pin the cores that spin on reading the memory-mapped files and we don't pin our primary app thread. Note that when I say pin, we also busy spin the thread running on that pinned core.

That is totally anecdotal, though. As the system does not hang every day, I cannot say for sure that it has anything to do with core pinning or shared memory access. However, with pinning (and busy spin) disabled, and the shared memory accessed in a loop with LockSupport.parkNanos(5000) between reads, we don't seem to have any system hangs.
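For illustration, the "non-busy" workaround loop looks roughly like the sketch below. The file name, buffer layout and sequence/payload offsets are invented for the example; the real code goes through mappedbus.io, which also takes care of memory ordering between the two JVMs.

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.locks.LockSupport;

public class ParkNanosPoller {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Path.of("/dev/shm/example-channel"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            long lastSeq = buf.getLong(0);            // the writer bumps this counter per message
            while (true) {
                long seq = buf.getLong(0);
                if (seq != lastSeq) {                 // new message available
                    lastSeq = seq;
                    long payload = buf.getLong(8);    // illustrative fixed-offset payload
                    System.out.println("received " + payload);
                } else {
                    LockSupport.parkNanos(5000);      // back off ~5µs instead of busy spinning
                }
            }
        }
    }
}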

Latency is critical for us, so this "non-busy" setup is a temporary workaround only.

Also, please note that I have moved the application across to an identical server and was able to reproduce this full system hang there, so I can't see this being a hardware failure.

So, from digging around the logs before and after a crash, this is what seems relevant to me. There are several of these stack traces; I am just posting the first one here (i.e. I don't believe this is anything to do with postgres itself):

kernel: [25738.874778] INFO: task postgres:2155 blocked for more than 120 seconds.
kernel: [25738.874833]       Not tainted 5.4.0-050400-generic #201911242031
kernel: [25738.874878] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: [25738.874928] postgres        D    0  2155   2056 0x00004000
kernel: [25738.874931] Call Trace:
kernel: [25738.874942]  __schedule+0x2e3/0x740
kernel: [25738.874948]  ? __wake_up_common_lock+0x8a/0xc0
kernel: [25738.874951]  schedule+0x42/0xb0
kernel: [25738.874957]  jbd2_log_wait_commit+0xaf/0x120
kernel: [25738.874961]  ? wait_woken+0x80/0x80
kernel: [25738.874965]  jbd2_complete_transaction+0x5c/0x90
kernel: [25738.874969]  ext4_sync_file+0x38c/0x3e0
kernel: [25738.874974]  vfs_fsync_range+0x49/0x80
kernel: [25738.874977]  do_fsync+0x3d/0x70
kernel: [25738.874980]  __x64_sys_fsync+0x14/0x20
kernel: [25738.874985]  do_syscall_64+0x57/0x190
kernel: [25738.874991]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
kernel: [25738.874993] RIP: 0033:0x7f96dc24b214
kernel: [25738.875002] Code: Bad RIP value.
kernel: [25738.875003] RSP: 002b:00007fffb2abd868 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
kernel: [25738.875006] RAX: ffffffffffffffda RBX: 00007fffb2abd874 RCX: 00007f96dc24b214
kernel: [25738.875007] RDX: 00005635889ba238 RSI: 00005635889a1490 RDI: 0000000000000003
kernel: [25738.875009] RBP: 00007fffb2abd930 R08: 00005635889a1480 R09: 00007f96cc1e1200
kernel: [25738.875010] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
kernel: [25738.875011] R13: 0000000000000000 R14: 000056358899c5a0 R15: 0000000000000001

P.S. this happened on 16.04 and kernel 4.15 also. The upgrade to 18.04 and kernel 5.0 was an attempt to resolve the system hangs but has not made any difference.

The other thing I have considered is that perhaps this trace is just a symptom, not the problem; that is, my application has bound up the server and caused other processes to block on I/O and trigger these warnings. But as the server freezes completely, I have no way to know the state of my application at that time.

Additional Information in response to comments

First, just to reiterate, I don't have solid evidence that the core pinning + shared memory is the straw that breaks the proverbial camel's back, but this is my best guess based on the change history and outages.

The CPU model is Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz with turbo boost. There are 2 of these in the server. I am pinning CPUs 2, 4 and 6, which I believe are on the same physical CPU. Hyperthreading is ON.

The setup is like this: JVM-A has a pinned busy-spin thread writing to memory-mapped file X and reading from memory-mapped file Y. JVM-B has a pinned busy-spin thread reading from memory-mapped file X and writing back to memory-mapped file Y. In JVM-B, the pinned reading thread then publishes the message onto a disruptor ring buffer with a pinned busy-spin worker. The message is an order instruction which is finally sent to market by this worker. This is a low-latency trading platform.
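To make the JVM-B side a bit more concrete, here is a rough sketch of the hand-off from the pinned reader onto the disruptor ring buffer, using the LMAX Disruptor API with a busy-spin wait strategy. This is not our production code: OrderEvent, the ring size, sendToMarket and readNextInstruction are placeholders, and the actual read from memory-mapped file X goes through mappedbus.io.

import com.lmax.disruptor.BusySpinWaitStrategy;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class OrderPipeline {
    // Placeholder event carried on the ring buffer.
    static final class OrderEvent {
        long instruction;
    }

    public static void main(String[] args) {
        // Single producer (the pinned reader thread) and a busy-spinning worker,
        // matching the "pinned busy-spin worker" described above.
        Disruptor<OrderEvent> disruptor = new Disruptor<>(
                OrderEvent::new, 1024, DaemonThreadFactory.INSTANCE,
                ProducerType.SINGLE, new BusySpinWaitStrategy());
        disruptor.handleEventsWith((event, sequence, endOfBatch) -> sendToMarket(event));
        RingBuffer<OrderEvent> ring = disruptor.start();

        // Pinned busy-spin reader: in the real system this loop reads from
        // memory-mapped file X; readNextInstruction() is a stand-in here.
        while (true) {
            long instruction = readNextInstruction();
            long seq = ring.next();                 // claim the next slot
            try {
                ring.get(seq).instruction = instruction;
            } finally {
                ring.publish(seq);                  // make it visible to the worker
            }
        }
    }

    private static void sendToMarket(OrderEvent event) { /* placeholder: market gateway */ }

    private static long readNextInstruction() { return 0L; /* placeholder: spin on file X */ }
}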

This post gives a better exploration of LockSupport.parkNanos than I can here: https://hazelcast.com/blog/locksupport-parknanos-under-the-hood-and-the-curious-case-of-parking/

I have two 10,000 rpm HDDs in RAID 1 with an embedded RAID controller.

Regarding the target latency, yes, we could in theory merge the two JVMs into one and get rid of this memory mapped file channel completely. However, there are other considerations before doing that, so I would like to focus on understanding this technical problem first.

Finally, postgres on this server is running in recovery mode only; it is not the primary. Also, our system does not do much database IO at all. It's really only used to bootstrap at the start of day and to persist the day's trading activity overnight. One of the crashes occurred at a time when there would have been almost zero database IO.

Gecko

2 Answers


"Blocked" in the case of hung_task_timeout_secs means a task was in D uninterruptible state for that long. 120 seconds is rather an extraordinary amount of time to be doing I/O.

Get monitoring going that can get metrics off this host. netdata is good for this: it collects lots of stuff every second in memory, so not a lot of disk I/O, and it has nice graphs.

Review disk latency, such as with iostat -xz 1. Waits above single-digit ms are not good. Share what that storage is: spindles, solid state, SAN LUNs.

Regarding spinning and pinning, I have a suspicion you are forcing the scheduler to starve. Share the specific CPU model in question, and what cores you are pinning to do what. How is LockSupport.parkNanos() implemented?

Review vmstat 1. Constantly having many tasks in the running (r) or uninterruptible (b) states is not good.

Consider installing BPF and using its scripts to collect task diagnostics. runqslower will show tasks that waited longer than a certain threshold to run. Very fast is ideal; note the threshold units are microseconds.


Stepping back a minute, consider the design of this thing.

What exactly is the latency target, doing what and how fast?

Is there a reason postgres is running on the same host? If it were remote and accessed via TCP, its I/O would not be a problem for the JVM application.

John Mahowald
  • Hi John. Thanks for your detailed response. I will implement the monitoring you have suggested and see what I can get from this. Regarding your questions, I will update the post. – Gecko Jan 16 '20 at 23:45

Ok, so in the end the problem was quite simple. My isolated testing was never able to crash the machine because I was missing this one element from my test code. The issue is nothing to do with shared memory or core pinning in and of itself. It's just that isolating cores slightly reduced the available shared resources to the point where the scheduler could be starved, because...

Both JVMs were set to real-time priority using:

sudo renice -n -20 $!       # niceness -20 (highest non-RT priority) for the JVM just launched in the background
sudo chrt -r -a -p 99 $!    # SCHED_RR real-time policy at priority 99 for all (-a) threads of that JVM

The entire JVM was bumped, so in total almost 300 threads had maximum priority. Context switching was over 150,000/s even with relatively low CPU utilization.

We have kept the niceness change and removed the real-time change. This seems to have fixed it. The original goal of the legacy RT setting can instead be achieved by changing how we busy spin, pin cores, and manage C-states/P-states, etc.
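To give an idea of what I mean by changing how we busy spin: one option is a bounded spin that falls back to a short park, roughly like the sketch below. The spin limit and the 5µs park are illustrative numbers, not tuned values from our system.

import java.util.concurrent.locks.LockSupport;
import java.util.function.BooleanSupplier;

// Bounded spin-then-park wait: spin briefly so the common fast case stays on-core,
// then fall back to parkNanos so the scheduler is never starved of the CPU.
public final class SpinThenParkWaiter {
    private static final int SPIN_LIMIT = 10_000;   // illustrative tuning knob

    public static void awaitWork(BooleanSupplier workAvailable) {
        int spins = 0;
        while (!workAvailable.getAsBoolean()) {
            if (spins++ < SPIN_LIMIT) {
                Thread.onSpinWait();                // CPU pause hint (Java 9+)
            } else {
                LockSupport.parkNanos(5_000);       // briefly yield the core
            }
        }
    }
}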

Gecko