Many ESTABLISHED entries in lsof for a single network connection


We're running our own enterprise software application, which uses TCP connections to send data between different parts of the same application. When we run lsof we see hundreds of thousands of entries similar to the ones below:

COMMAND PID   TID       USER   FD      TYPE             DEVICE  SIZE/OFF       NODE NAME
my_cmd 14990         my_user  661u     IPv6            4290258       0t0        TCP localhost:12120->localhost:46368 (ESTABLISHED)
my_cmd 14990  4830   my_user  661u     IPv6            4290258       0t0        TCP localhost:12120->localhost:46368 (ESTABLISHED)
my_cmd 14990  4831   my_user  661u     IPv6            4290258       0t0        TCP localhost:12120->localhost:46368 (ESTABLISHED)
my_cmd 14990 14991   my_user  661u     IPv6            4290258       0t0        TCP localhost:12120->localhost:46368 (ESTABLISHED)
my_cmd 14990 14992   my_user  661u     IPv6            4290258       0t0        TCP localhost:12120->localhost:46368 (ESTABLISHED)
my_cmd 14990 14993   my_user  661u     IPv6            4290258       0t0        TCP localhost:12120->localhost:46368 (ESTABLISHED)
my_cmd 14990 14995   my_user  661u     IPv6            4290258       0t0        TCP localhost:12120->localhost:46368 (ESTABLISHED)
...
  • my_cmd is the command that starts our application and 14990 is the parent process ID.
  • 4830, 4831, 14991, 14992,...,15987 are all threads spawned by our application. There are hundreds of such entries for a single server port.
  • 12120 is the server port of part of our application which is receiving data from another part of our application.

Here's the netstat output for the above port:

>netstat -anp | grep 12120
tcp        0      0 127.0.0.1:12120         :::*                    LISTEN      14990/my_cmd
tcp        0      0 127.0.0.1:12120         127.0.0.1:46368         ESTABLISHED 14990/my_cmd
tcp        0      0 127.0.0.1:46368         127.0.0.1:12120         ESTABLISHED 14990/my_cmd

Our interpretation of the above is that each thread has a handle on the network connection (port 46368 to port 12120). However, our application is organised into thread pools of around 8 threads each, so there appear to be far too many threads holding a handle to the same connection.
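The per-thread rows would be consistent with all of the threads simply sharing the parent process's file descriptor table rather than each holding a separate handle. A quick cross-check we could run (the PID/TID/FD values are taken from the lsof output above; the assumption is that lsof's DEVICE column holds the socket inode for network files on Linux):

# every thread of a process shares the same fd table, so these symlinks
# should all resolve to the same socket:[inode] (4290258 in the lsof output)
ls -l /proc/14990/fd/661
ls -l /proc/14990/task/4830/fd/661
ls -l /proc/14990/task/14991/fd/661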

We have about 40 different parts of our application, so there are 40 such connections showing this behaviour, i.e. each thread has a handle on each connection. We're slowly creating more and more threads, each of which shows the same behaviour. When we reach roughly 500k such entries our application is unable to create more threads and requires a restart.

  • Our application is a Java application on a SLES server.
  • We're using TCP keep-alives for the above connections.
  • We see OutOfMemoryErrors at the point of failure with "Unable to create native thread" (the limit checks we've been running are sketched after this list).
  • We've set the -Xmx option for the Java application to 3GB and we're never using more than 2GB, so it's not an application heap issue. Besides, thread stacks are allocated outside the heap.
  • We appear to be using less than 80% of the server's total system memory (observed via the AWS memory utilisation metric).
  • We are not seeing "Too many open files" i.e. it doesn't appear we've hit the ulimit.
  • This occurs in both a production environment (at volume) and a test environment (far less volume) with comparable frequency i.e. it doesn't seem related to system load.
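As mentioned in the list above, here's a rough sketch of the limit checks we've been running around the thread-creation failures (same PID as above; which limit actually matters on our SLES box may differ):

# number of native threads the JVM process currently has
cat /proc/14990/status | grep Threads
ps -o nlwp= -p 14990

# per-user process/thread limit for the application user, plus the
# system-wide thread and PID limits
ulimit -u
cat /proc/sys/kernel/threads-max
cat /proc/sys/kernel/pid_max

Our (possibly wrong) understanding is that this particular OutOfMemoryError usually points at one of these limits, or at native memory for thread stacks (-Xss), rather than at the Java heap.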

There's clearly a leak somewhere, but we can't track it down. It's almost as though one thread of our application creates a network connection and then shares it with all the other threads. When the application later creates a new thread (it can choose to create new threads to handle more tasks), the existing network connections are shared with it, so it gets its own handle for each connection. We haven't changed the networking aspect of our application in a few years (it's been stable for a long while), so we're a little stumped as to what may have changed.

EDIT 1 start

Connecting to our Java process using VisualVM and looking at the java.lang:type=OperatingSystem MBean shows:

  • MaxFileDescriptorCount is set to 4096.
  • The current OpenFileDescriptorCount is hovering at around 690 to 700 (i.e. not the hundreds of thousands reported by lsof).

There seems to be a disconnect between what's reported via VisualVM and what's reported via lsof, which we can't explain.
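To rule out a genuine descriptor leak we also want to compare the MXBean figure against the process's fd directory directly; /proc counts each descriptor once, whereas lsof appears to print one row per task per descriptor (same PID as above):

# actual number of open descriptors for the process
ls /proc/14990/fd | wc -l

# number of rows lsof prints for the same process (much larger if it lists tasks)
lsof -p 14990 | wc -l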

EDIT 1 end

EDIT 2 start

Some more Googling indicates that the difference is just because lsof looks at many more types of files than are reported through VisualVM.

EDIT 2 end

I have two concrete questions:

  1. Are my observations above correct, i.e. is lsof showing each thread of a process having a handle to the same network connection?

Edit 3 start

Looks like this is an artifact of lsof, i.e. it's normal for it to show duplicated handles across multiple threads of a parent process.
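If that's right, then on lsof versions that support the -K option the per-task rows can be suppressed, which should bring the counts much closer to the VisualVM figures (the 'i' argument to -K is an assumption about newer lsof builds and may not be available on older ones):

# ignore tasks (threads) and list only process-level rows, then compare counts
lsof -K i -p 14990 | wc -l
lsof -p 14990 | wc -l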

Edit 3 end

  2. Can you suggest any other tools to help shed some light on what's going on? E.g. strace shows dup2(x, y) calls every so often, with y being one of the file descriptors in the lsof output (661 in the example above), but it's a little tough to tie that to a specific part of our code (a sketch of what we've tried is below).
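For reference, this is roughly the kind of tracing we've attempted, and how we've tried to map kernel thread IDs back to Java thread names (treat the exact flags as a sketch; -k needs an strace built with stack-unwinding support):

# follow all threads and log dup2 calls; with -f each line is prefixed with the
# TID that made the call, -tt adds timestamps, -k prints a stack trace per call
strace -f -tt -k -e trace=dup2 -p 14990

# a HotSpot thread dump includes nid=0x..., the native TID in hex, so a TID seen
# in strace or lsof (e.g. 14995 == 0x3a93) can be matched to a Java thread name
jstack 14990 | grep nid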
