Closing a socket that keeps a child process waiting after the parent process has been killed


The situation is the following one:

  • A parent process (a service) listens on a "public port", 11000. When a new request arrives on port 11000, the server passes that request to a child process over a "private" port (socket). You know, the typical way of implementing servers.

  • The parent process is killed but the socket is not closed (I don't know the reason yet).

  • The orphan process is waiting for the socket to be closed, and pkill doesn't work (it is in uninterruptible sleep).

  • I cannot run the server again, because the server says the address (0.0.0.0:11000) is already in use.

So I have two options: close the "internal socket" so that the orphan process finishes, or somehow "free" the address/port 0.0.0.0:11000 so that I can run the server again and leave the orphan process in its waiting state. The goal is to avoid restarting the server each time it fails while I investigate the issue.

Useful info about the situation (the pid of the child process is 1993):

$ sudo lsof -np 1993

[...]
proc 1993 root 16u  IPv4  14997  0t0  TCP 127.0.0.1:42982->127.0.0.1:37528 (CLOSE_WAIT)

So, the port I want to close is 37528. The file descriptor of the corresponding socket is 16u (or that is what I think).

$ sudo strace -p 1993

Process 1993 attached
futex(0x2fff414, FUTEX_WAIT_PRIVATE, 1, NULL

$ netstat -np
[...]
tcp      0   0 127.0.0.1:42982     127.0.0.1:37528    CLOSE_WAIT  -  

If I try to attach to the orphan process with gdb:

$ gdb -p 1993
Attaching to process 1993
{process_path} (deleted): No such file or directory.

I think this happens because the parent process was killed. The upshot is that I cannot attach to the orphan process to call close(16u).

How can I "solve" the situation?

NOTES: I've already tried restarting the networking service, but it doesn't help. This is Ubuntu Server 14.04 (in VirtualBox), and I connect to the machine using ssh. There's no network manager.

I've tried running ifdown and ifup on every interface (eth0, eth1, lo and virbr), but that doesn't close the socket.

Peregring-lk

Posted 2016-10-05T16:18:59.943

Reputation: 215

Answers

1

There is no easy way. First, this has nothing to do with networking: CLOSE_WAIT is the state the child process's socket enters after the peer's FIN packet has been answered with an ACK, and before the process closes the socket, at which point a FIN packet is sent to the peer. During the CLOSE_WAIT state the process is completing some operation, at the end of which it will call close(), which prompts the kernel to send out the FIN packet.

In other words, during the CLOSE_WAIT state the process is trying to complete some operation, not waiting for something from a peer; hence shutting down the network, restarting interfaces and so on will accomplish nothing.
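To make the sequence concrete, here is a minimal Python sketch over loopback; it is only an illustration, not the asker's server, and the function name demo_close_wait is made up for the example. The client closes first, the other end sees end-of-file, and its socket then sits in CLOSE_WAIT until the program itself calls close().

```python
import socket

def demo_close_wait():
    """Return what the accepting side observes when the peer closes first."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # port 0: let the kernel pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port))
    server_side, _ = listener.accept()

    client.close()                       # peer sends FIN; our kernel answers with ACK
    data = server_side.recv(1024)        # b"" means the peer has finished sending

    # From here on server_side is in CLOSE_WAIT: nothing more will come from
    # the peer, and the state only ends when this program calls close().
    still_open = server_side.fileno() != -1

    server_side.close()                  # now our FIN goes out and CLOSE_WAIT ends
    listener.close()
    return data, still_open

if __name__ == "__main__":
    print(demo_close_wait())
```

If the program never reaches the close() call, for example because it is blocked elsewhere, the socket stays in CLOSE_WAIT no matter what is done to the network interfaces.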

By and large, this should not in itself be a big problem: there is nothing wrong with having some processes hanging in a CLOSE_WAIT state. What is hard to understand is why it bothers you: you state that the parent process listens on port 11000 and contacts the child on port 37528, yet you also state that, after the parent process has died, you cannot start a new instance of the server because port 11000 has not been released. But you just stated that it is not the child process that is using it! So who is?

At any rate, there are just a few things you can try:

  1. Have you tried to kill the process with the -9 option (SIGKILL)? It's the strongest signal you can send.

  2. You may use strace from the beginning to trace system calls in the child processes as well, by means of

    strace -f YourParentProcess
    

    This will follow the fork()ed processes as well.

  3. My guess is that you may very well forget about the child, and try to determine why port 11000 appears occupied, and by whom. You should try the handier command

    ss -lntp | grep 11000
    

    to investigate the matter.
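As a complement to the ss check, the following Python sketch shows one way a port can stay occupied after the code that opened it has gone away: a second descriptor still refers to the same listening socket. Here dup() stands in for a descriptor inherited by a forked child. Whether this is what happens on the asker's machine is purely an assumption, and the function names try_bind and demo_inherited_listener are invented for the illustration.

```python
import errno
import socket

def try_bind(port, host="127.0.0.1"):
    """Return None if the port can be bound, otherwise the errno of the failure."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        return None
    except OSError as exc:
        return exc.errno
    finally:
        s.close()

def demo_inherited_listener():
    original = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    original.bind(("127.0.0.1", 0))      # port 0: let the kernel pick a free port
    original.listen(1)
    port = original.getsockname()[1]

    copy = original.dup()                # stand-in for a descriptor kept by a child
    original.close()                     # the "parent" goes away

    while_copy_open = try_bind(port)     # the listening socket still exists
    copy.close()                         # last reference dropped
    after_copy_closed = try_bind(port)   # the port can be bound again
    return while_copy_open, after_copy_closed

if __name__ == "__main__":
    busy, free = demo_inherited_listener()
    print(busy == errno.EADDRINUSE, free is None)
```

If ss shows a surviving process as the owner of the listening socket on port 11000, the port will only be released once that process closes its descriptor or exits.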

MariusMatutiae

Posted 2016-10-05T16:18:59.943

Reputation: 41 321