21

I have an application which is running out of file descriptors, apparently by opening sockets, but I can't find out exactly what these sockets do. These appear in the lsof output as

java    9689 appuser 1010u  sock       0,5          263746675 can't identify protocol
java    9689 appuser 1011u  sock       0,5          263746676 can't identify protocol
java    9689 appuser 1012u  sock       0,5          263746677 can't identify protocol
java    9689 appuser 1014u  sock       0,5          263746678 can't identify protocol
java    9689 appuser 1015u  sock       0,5          263746679 can't identify protocol
java    9689 appuser 1016u  sock       0,5          263746681 can't identify protocol

and in /proc/$PID/fd as

lrwx------ 1 appuser appuser 64 Jun 23 11:49 990 -> socket:[263732085]
lrwx------ 1 appuser appuser 64 Jun 23 11:49 991 -> socket:[263732086]
lrwx------ 1 appuser appuser 64 Jun 23 11:49 992 -> socket:[263735307]
lrwx------ 1 appuser appuser 64 Jun 23 11:49 993 -> socket:[263732088]
lrwx------ 1 appuser appuser 64 Jun 23 11:49 995 -> socket:[263735308]
lrwx------ 1 appuser appuser 64 Jun 23 11:49 996 -> socket:[263735309]
lrwx------ 1 appuser appuser 64 Jun 23 11:49 997 -> socket:[263745434]
lrwx------ 1 appuser appuser 64 Jun 23 11:49 998 -> socket:[263745435]
lrwx------ 1 appuser appuser 64 Jun 23 11:49 999 -> socket:[263745436]

but there is no similar output in netstat -a.

What are these sockets and how can I find out what they do?

Edit: I have tried running grep $SOCKET /proc/net, as recommended in the lsof FAQ, where $SOCKET is for instance 263746679, but that gave no results either.


As background, the application is a container for multiple tasks which, among other things, perform network calls. I need to single out the one that has gone berserk, but until I find out who those sockets communicate with, I'm stuck.

Robert Munteanu
  • We are also facing this problem recently with one of our .NET Core web apps (Ubuntu server with Kestrel), but the device recorded is "0,9" with name "protocol: TCP". Trying to find out exactly what devices 0 and 9 are has proven to be difficult. But the symptoms all look like the same case of opening sockets without binding and using them. – icelava Oct 04 '19 at 10:34

3 Answers

18

This can occur if you create a socket, but never connect() or bind() with it. Your best bet may be to strace (-fF) the application, and then cross-reference with the output of lsof to determine which sockets are causing the issue. As a bonus method of debugging: if you wrap your socket calls with debugging information and write them out to /dev/null, it'll appear in strace without giving you hilariously-large log files.
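As an illustration of that /dev/null trick, here is a minimal sketch, assuming you can route the application's socket creation through a single helper (TracedSocketFactory and newSocket are made-up names, not something your code already has):

import java.io.FileWriter;
import java.io.IOException;
import java.net.Socket;

public class TracedSocketFactory {
    // Held open for the life of the process; the kernel discards every write.
    private static FileWriter devNull;
    static {
        try {
            devNull = new FileWriter("/dev/null");
        } catch (IOException ignored) {
            devNull = null;
        }
    }

    public static Socket newSocket(String caller) throws IOException {
        Socket s = new Socket(); // unconnected, exactly the kind lsof can't identify
        if (devNull != null) {
            // This write() shows up in strace right after the socket() syscall,
            // tagging the new fd with the caller's name at no logging cost.
            devNull.write("socket created by " + caller + "\n");
            devNull.flush();
        }
        return s;
    }
}

Cross-referencing the fd numbers from strace -fF with the lsof output should then tell you which task owns each leaked socket.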

BMDan
  • Thanks, this sounds interesting. I will try to find out if that is indeed the case with our application. – Robert Munteanu Jun 28 '10 at 20:38
  • Somewhat along the same lines: because this is Java it could be very hard to use strace; a better method might be to create your own socket subclass that logs information before passing it to the parent (real) JDK socket (see the sketch after these comments). strace can only see the underlying Java calls to the OS and can't see inside your threads to tell what's actually making those socket calls; to strace it all just looks like one big ball of Java. –  Jun 29 '10 at 05:59
  • @troyengel: I've (re)discovered Byteman (http://www.jboss.org/byteman), a very neat tool which allows me to inject the bytecode needed to trace these calls. – Robert Munteanu Jun 29 '10 at 12:41
  • Most useful answer, so this gets the bounty. Thanks! – Robert Munteanu Jul 04 '10 at 20:26
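For reference, here is a sketch of the subclass idea from the comments (LoggingSocket is a hypothetical name, not an existing JDK class): it records who created each socket and logs connect/close before delegating to the real implementation.

import java.io.IOException;
import java.net.Socket;
import java.net.SocketAddress;

public class LoggingSocket extends Socket {

    public LoggingSocket() {
        // The stack trace captured here points at the task that created the socket.
        new Exception("socket created").printStackTrace();
    }

    @Override
    public void connect(SocketAddress endpoint, int timeout) throws IOException {
        System.err.println("connect to " + endpoint);
        super.connect(endpoint, timeout);
    }

    @Override
    public void close() throws IOException {
        System.err.println("close " + this);
        super.close();
    }
}

Tasks that create their sockets through a class like this (or a tool such as Byteman that injects equivalent tracing) stand out quickly when a descriptor is opened but never connected or closed.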
3

Using Python, I have encountered the same problem on SSL sockets:

  • When I use socket.close(), the socket stays in CLOSE_WAIT state for an indefinite time
  • When I use socket.shutdown(), lsof says "can't identify protocol"

The solution was to unwrap the SSL layer before closing:

origsock = socket.unwrap()
origsock.close()

This closes the sockets properly in my app.

user48134
2

The first thing I would do is increase your file descriptor limit:

~# vi /etc/sysctl.conf
fs.file-max = 331287

Next I would make sure your system is up to date; this includes all libraries and servers. It's possible that your Java application server is out of date (if you are using one). It's also possible that your application server is misconfigured: look at your configuration file and lower your connectionTimeout and/or your maxKeepAliveRequests (I'm not sure what application server you're using, or whether you are using one at all...).

I'm not sure what this application does, but if you don't think it requires tens of thousands of sockets, then this is almost certainly a "file descriptor leak" in your Java application. You may have to send a bug report to the vendor; in that report you should include information on how to recreate the issue.

Here are some ways to debug the issue.

Wireshark (or tshark for the CLI) is the best tool to see how these sockets are being used. Wireshark will give you a breakdown of the type of traffic being thrown over the wire. It's likely that the first few connections will succeed and then hit the file descriptor limit; once the limit is hit, Wireshark isn't going to pick up on anything (and neither is netstat, for that matter), but this will help narrow down the issue. There may also be a case where a lot of outgoing SYNs are being sent but no SYN/ACKs are received, leaving many TCP connections stuck in the SYN_SENT state.

If you have access to the source code and you know the type of sockets being created (for example from strace or just by searching the code), then you can open the project in Eclipse (or another IDE) and set a breakpoint at the function that is creating these sockets. When the breakpoint gets hit, look at the stack trace. This file descriptor leak may be a simple infinite loop, or perhaps the socket timeout value is too large. Another possibility is that the Java app isn't calling socket.close() to clean up its connections. The close is commonly done in the finally block of a try/catch (in Java, socket I/O throws checked IOExceptions, so the code won't even compile unless they are handled or declared). At the end of the day, it's likely that the Java app isn't handling its IOExceptions properly.
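As a concrete sketch of that cleanup pattern (the class name and the host/port parameters are illustrative), the close() belongs in a finally block so the descriptor is released even when connect() or the later I/O throws:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class NetworkCall {
    static void call(String host, int port) throws IOException {
        Socket s = new Socket();
        try {
            s.connect(new InetSocketAddress(host, port), 5000); // 5-second timeout
            // ... exchange data with the remote service ...
        } finally {
            // Runs on success and on failure, so the fd is never leaked.
            s.close();
        }
    }
}

On Java 7 and later, try-with-resources does the same thing with less ceremony: try (Socket s = new Socket(host, port)) { ... }.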

Rook
  • Thanks for the answer. I'm actually developing this application - the container part - rather than just managing it, and I was unable to find any problems related to sockets not being closed. But the Wireshark/tshark hint is good, I'll use that. – Robert Munteanu Jul 04 '10 at 20:28
  • @Robert Munteanu If you are building this app then this is a question for Stack Overflow. Nevertheless, you are opening too many sockets. – Rook Jul 04 '10 at 20:35
  • Rook: I gave up on finding this out code-wise and tried to track it down as a sysadmin; that's why I posted on SF. And yes, I know somehow too many sockets are open, but there are zero clues as to where... – Robert Munteanu Jul 05 '10 at 11:29
  • @Robert Munteanu You have to set breakpoints upon socket creation and look at the stack trace and memory at that point. I suspect you are falling into an infinite loop. Being able to look at any variable and step through your code is going to be the best approach for complex problems like this. – Rook Jul 05 '10 at 22:59
  • Rook: unfortunately this happens seemingly at random on one of 20 servers - not always the same one - only in production environments, and perhaps twice per week; otherwise it would've been rather simple to figure out. I'm currently using Byteman (http://jboss.org/byteman) to track socket creation/bind/connect/close calls. Hopefully something will come out of it. – Robert Munteanu Jul 06 '10 at 08:35