
I have a (probably quite old) CentOS 4.5 server with a custom java application running inside.

The application was crashing after running for a while, and I found that it was handling 1024 connections and died when it tried to open one more socket.

Indeed, ulimit -n confirms the limit is 1024, so the application is being killed because it has run out of free file descriptors.

What bothers me is that there are hundreds of apparently inactive connections in an "ESTABLISHED off" state, all from a relatively small number of IPs (about 200), and that they pile up over time as clients connect. This is what I see when running netstat -nato:

tcp        0      0 ::ffff:10.39.151.20:10000   ::ffff:78.152.97.98:12059   ESTABLISHED off (0.00/0/0)
tcp        0      0 ::ffff:10.39.151.20:10000   ::ffff:78.152.97.98:49179   ESTABLISHED off (0.00/0/0)
tcp        0      0 ::ffff:10.39.151.20:10000   ::ffff:78.152.97.42:45907   ESTABLISHED off (0.00/0/0)

I know it is not a DoS attack: the connections are legitimate, but they do not seem to close after the clients connect and do a short data exchange with the server. Furthermore, the pace is slow, since they are generated by only about 200 clients (counting distinct IPs).

Should I look for some weird application bug (maybe in JRE 1.6), or dig into the CentOS network configuration? I have no clue what else to look at.

Thanks in advance, any hint is appreciated!

Luke
  • Sounds like bad programming to me. – Edwin May 01 '13 at 17:48
  • I don't have access to the application code, so I'm trying to argue what I can from what I see on the live system.. But if I find evidence that it's an application fault, I will try to get in touch with the developer.. – Luke May 01 '13 at 20:50
  • Maybe this can be relevant to the discussion: the system is behind a corporate firewall and inside a NAT.. – Luke May 01 '13 at 22:55
  • As suggested by @Zabuzzman, I've observed with tcpdump a couple of connections: the former exchanges data about every 2 minutes, so it is effectively active; the latter exchanged data once and then was silent for at least half-an-hour.. until I stopped monitoring it. Both were still listed as "ESTABLISHED off" within netstat.. – Luke May 02 '13 at 02:34

1 Answer


Hypothesis 1: your application is behind a firewall that drops idle TCP connections after a given amount of time.

When the client tries to use such a connection again, it finds it unresponsive, drops it and starts a new one.

The server, however, receives nothing when the firewall silently forgets the connection. Since its TCP sockets don't have a keepalive timer, it has no way of knowing that the connection is dead, and it will keep it open indefinitely.

To prove: run a long tcpdump on one connection to show that it goes unused after a given amount of time.

Solution:

  • Change the code to enable keepalive on the TCP sockets and (optionally, for best results) set the keepalive timer lower than the firewall's TCP idle timer.
  • Raise the firewall's TCP idle timer beyond the maximum functional idle time of the client. Most likely this is a global setting on the firewall, so your security administrator might be slightly reluctant to do so.
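
A minimal sketch of the first option, assuming the server accepts its connections through a plain java.net.ServerSocket. The application's real code isn't available, so the class name KeepAliveDemo and the helper acceptWithKeepAlive are hypothetical; the only essential call is Socket.setKeepAlive(true). Note that this only switches keepalive on. On Linux the first probe is sent after net.ipv4.tcp_keepalive_time seconds of idleness, which defaults to 7200, so that kernel setting would probably also have to be lowered below the firewall's idle timeout.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class KeepAliveDemo {

    // Hypothetical helper: accepts one connection and enables SO_KEEPALIVE on it,
    // so the kernel eventually probes an idle peer and reports a dead one as an error.
    static Socket acceptWithKeepAlive(ServerSocket server) throws IOException {
        Socket client = server.accept();
        client.setKeepAlive(true);
        return client;
    }

    public static void main(String[] args) throws IOException {
        // Loopback-only demo on an ephemeral port; a real server would use its own port.
        ServerSocket server = new ServerSocket(0, 50, InetAddress.getByName("127.0.0.1"));
        Socket outbound = null;
        Socket accepted = null;
        try {
            // The connect completes against the listen backlog, so no second thread is needed.
            outbound = new Socket("127.0.0.1", server.getLocalPort());
            accepted = acceptWithKeepAlive(server);
            System.out.println("SO_KEEPALIVE enabled: " + accepted.getKeepAlive());
        } finally {
            if (accepted != null) accepted.close();
            if (outbound != null) outbound.close();
            server.close();
        }
    }
}
```

With keepalive enabled, a connection whose peer has silently vanished eventually fails the probes, and the next read or write on that socket throws an exception. The application still has to close the socket at that point for the file descriptor to be released.
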
Zabuzzman
  • So are you telling me that when a TCP connection is in the "established" state and isn't using keepalive, it won't be released by the server if the client goes silently away, for instance by means of a firewall which "cuts the line"? Shouldn't it be closed when the firewall closes it? Besides that I thought there was some sort of timeout in the server's TCP stack which tells how long can a connection stay there without activity... By the way there is a corporate firewall between this server and their clients.. – Luke May 01 '13 at 21:42
  • Unless the firewall sends a TCP-Reset packet to the server, the server won't know a thing. Idle TCP sessions can linger forever if the server-app doesn't close them. – Zabuzzman May 01 '13 at 21:49
  • Ok, this is a good starting point. So if connections get cut they won't be released, unless a) the server's service is restarted or b) the connections are created with keepalive, right? Is there a way to force this app to use keepalive connections instead of "plain" ones? – Luke May 01 '13 at 22:59
  • Assuming your hypothesis is right, and given I'm still not much into firewall's internal mechanics, is it correct that a firewall drops connection without closing them as the TCP protocol expects? I can agree on closing too-long-inactive connections, but I thought that before purging a connection from the firewall tables it should be given some kind of "closing" signal.. – Luke May 02 '13 at 02:32
  • Two questions for your firewall admin then: 1. Do you see dropped idle connections in your logs? 2. Can you send my server a TCP-Reset when it gets its connection kicked? – Zabuzzman May 02 '13 at 07:49
  • I think we're getting closer: let's see what my firewall admin will say.. I'll give you more feedback as soon as I get some reply. Thanks in the meanwhile! – Luke May 04 '13 at 14:43
  • My FW admin said that, besides authorization rules (which of course are met here, since the connections get established), the only rule is a 1-hour timeout on TCP connections... so the problem must be on the application side, which is not closing connections when done. From what I can tell they stay open until they get dropped by the firewall, and after that nothing can bring them down other than closing the application's socket server-side (i.e. killing it). I still haven't solved it, but your assistance was valuable in pointing me in the right direction! I'll let you know how it ends :) – Luke May 06 '13 at 20:57