We're running our own enterprise software application, which uses TCP connections to send data between different parts of the same application. When we run lsof
we see hundreds of thousands of entries similar to the below:
COMMAND PID TID USER FD TYPE DEVICE SIZE/OFF NODE NAME
my_cmd 14990 my_user 661u IPv6 4290258 0t0 TCP localhost:12120->localhost:46368 (ESTABLISHED)
my_cmd 14990 4830 my_user 661u IPv6 4290258 0t0 TCP localhost:12120->localhost:46368 (ESTABLISHED)
my_cmd 14990 4831 my_user 661u IPv6 4290258 0t0 TCP localhost:12120->localhost:46368 (ESTABLISHED)
my_cmd 14990 14991 my_user 661u IPv6 4290258 0t0 TCP localhost:12120->localhost:46368 (ESTABLISHED)
my_cmd 14990 14992 my_user 661u IPv6 4290258 0t0 TCP localhost:12120->localhost:46368 (ESTABLISHED)
my_cmd 14990 14993 my_user 661u IPv6 4290258 0t0 TCP localhost:12120->localhost:46368 (ESTABLISHED)
my_cmd 14990 14995 my_user 661u IPv6 4290258 0t0 TCP localhost:12120->localhost:46368 (ESTABLISHED)
...
my_cmd
is the command that starts our application and 14990
is the parent process. 4830, 4831, 14991, 14992, ..., 15987
are all threads spawned by our application. There are hundreds of such entries for a single server port. 12120
is the server port of the part of our application which is receiving data from another part of our application.
Here's the netstat output for the above port:
>netstat -anp | grep 12120
tcp 0 0 127.0.0.1:12120 :::* LISTEN 14990/my_cmd
tcp 0 0 127.0.0.1:12120 127.0.0.1:46368 ESTABLISHED 14990/my_cmd
tcp 0 0 127.0.0.1:46368 127.0.0.1:12120 ESTABLISHED 14990/my_cmd
What we interpret from the above is that each thread has a handle on the network connection (port 46368 to port 12120). But our application is segmented into thread pools (of around 8 threads each), so there appear to be far too many threads with a handle to the same connection.
We have about 40 different parts of our application, so there are 40 such connections showing this behaviour, i.e. each thread has a handle on each connection. We're slowly creating more and more threads, each of which shows the same behaviour. When we reach about 500k such entries our application is unable to create more threads and requires a restart.
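One way to sanity-check whether those lsof entries represent real descriptors is to compare against the kernel's own accounting in /proc, which has no per-thread duplication. A minimal sketch (using the shell's own PID so it runs anywhere; substitute the parent PID 14990 from the lsof output):

```shell
pid=$$   # substitute the JVM parent PID from lsof (14990 in the example above)

# lsof prints one line per (thread, descriptor) pair, so its line count
# is roughly threads x descriptors. /proc counts each only once:
fds=$(ls "/proc/$pid/fd" | wc -l)        # real open descriptors
threads=$(ls "/proc/$pid/task" | wc -l)  # live threads (tasks)
echo "fds=$fds threads=$threads approx_lsof_lines=$((fds * threads))"
```

If the product of those two numbers is close to the lsof line count, the "100 000's of entries" are an artifact of per-thread listing rather than a descriptor leak.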
- Our application is a Java application running on a SLES server.
- We're using TCP keep-alives for the above connections.
- We see OutOfMemoryErrors at the point of failure with "Unable to create native thread".
- We've set the -Xmx option for the Java application to 3GB and we never use more than 2GB, so it's not an application heap issue. Besides, thread stacks are allocated outside the Java heap.
- We appear to only be allocating < 80% of the server's total system memory (observed via the AWS memory utilisation metric).
- We are not seeing "Too many open files", i.e. it doesn't appear we've hit the ulimit.
- This occurs in both a production environment (at volume) and a test environment (far less volume) with comparable frequency, i.e. it doesn't seem related to system load.
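For reference, "Unable to create native thread" usually points at a thread-count or native-memory limit rather than the Java heap. These are the usual kernel-side caps to check (the per-user process limit, ulimit -u, also applies, since each Linux thread counts against it):

```shell
# Kernel-side limits that bound how many threads a JVM can create:
cat /proc/sys/kernel/threads-max   # system-wide thread cap
cat /proc/sys/kernel/pid_max       # every thread consumes a PID
cat /proc/sys/vm/max_map_count     # each thread stack needs mmap regions

# Live thread count of a process ($$ shown so this runs anywhere;
# use the JVM PID, 14990 in the example above):
ls "/proc/$$/task" | wc -l
```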
There's clearly a leak somewhere but we can't track it down. It's almost as though one thread of our application creates a network connection and then shares it with all other threads. Our application then creates a new thread at some point (it can choose to create new threads to handle more tasks) and the existing network connections are shared with it, so it gets its own handle for each connection. We haven't changed the networking aspect of our application in a few years (it's been stable for a long while), so we're a little stumped as to what may have changed.
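To separate a thread leak from a descriptor leak, one option is to sample both counts over time from /proc; if only the thread count grows while the descriptor count stays flat, the lsof growth is just thread x descriptor multiplication. A minimal sampling sketch (bounded to three iterations here; in practice you'd raise the count and interval):

```shell
pid=$$   # substitute the JVM PID (14990 in the example above)

# Sample real descriptor and thread counts; /proc/<pid>/fd holds one
# entry per actual descriptor, with no per-thread duplicates:
for i in 1 2 3; do
    printf '%s fds=%s threads=%s\n' "$(date +%T)" \
        "$(ls "/proc/$pid/fd" | wc -l)" \
        "$(ls "/proc/$pid/task" | wc -l)"
    sleep 1
done
```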
EDIT 1 start
Connecting to our Java process using VisualVM and looking at the java.lang OperatingSystem MBean shows:
- MaxFileDescriptorCount is set to 4096.
- The current OpenFileDescriptorCount is hovering at around 690 to 700 (i.e. not the hundreds of thousands reported by lsof).
There seems to be a disconnect between what's reported via VisualVM and what's reported via lsof
which we can't explain.
EDIT 1 end
EDIT 2 start
Some more Googling indicates that the difference is just because lsof
looks at many more types of files than are reported through VisualVM.
EDIT 2 end
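That matches what the two tools actually count: on Linux the JVM's OpenFileDescriptorCount reflects the entries under /proc/&lt;pid&gt;/fd, while lsof additionally lists cwd/txt/mem entries and repeats everything once per thread. A quick comparison sketch (using the shell's PID so it runs anywhere):

```shell
pid=$$   # substitute the JVM PID (14990 in the example above)

# What VisualVM's OpenFileDescriptorCount reflects -- one entry per
# real file descriptor:
ls "/proc/$pid/fd" | wc -l

# Newer lsof builds can suppress the per-thread duplication with
# "-K i" (ignore tasks); whether the flag is available depends on the
# installed lsof version:
#   lsof -K i -p "$pid" | wc -l
```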
I have two concrete questions:
- Are my observations above correct, i.e. is lsof
showing each thread of a process as having a handle to the same network connection?
Edit 3 start
Looks like this is an artifact of lsof,
i.e. it's normal for it to show duplicated handles across multiple threads of a parent process.
Edit 3 end
- Can you suggest any other tools to help shed some light on what's going on? E.g. strace
shows dup2(x, y)
calls every so often with y
being one of the file descriptors in the lsof
output (661 in the example above), but it's a little tough to tie that to a specific part of our code.
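One way to tie those dup2() calls back to code is to trace with per-thread IDs and timestamps, then correlate against a jstack thread dump taken at the same moment: in a HotSpot thread dump, the nid= field of each Java thread is its native Linux TID in hex, which matches strace's TID column. A sketch, assuming strace is installed and ptrace attachment to the target is permitted (timeout bounds the capture; it will fail gracefully otherwise):

```shell
pid=14990   # the parent PID from the lsof output above

# Log descriptor-related syscalls with thread IDs (-f) and
# timestamps (-tt) for a bounded window:
timeout 10 strace -f -tt -e trace=dup2,dup3,socket,connect,close \
    -p "$pid" -o strace.log 2>/dev/null \
    || echo "strace unavailable, attach failed, or capture window ended"

# Take a thread dump at the same time; nid= values map strace TIDs
# to named Java threads:
#   jstack "$pid" > threads.txt
printf '%x\n' 4830   # e.g. TID 4830 from lsof -> look for nid=0x12de
```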