24

I am running Bacula on a RedHat box. From time to time, the storage daemon bacula-sd stops working and becomes <defunct>.

[root@backup ~]# ps -ef | grep defunct | more
root      4801 29261  0 09:25 pts/5    00:00:00 grep defunct
root      5825     1  0 Oct18 ?        00:00:00 [bacula-sd] <defunct>

My question is, how can I kill this process? Its parent is 1, which is init, as far as I know, and I wouldn't want to kill the init process, would I?

'Normally' killing this process does not work:

[root@backup ~]# kill -0 5825
[root@backup ~]# kill -9 5825

Help is greatly appreciated!

Edit: running

[root@backup ~]# lsof -p 5825

produces the following output:

COMMAND    PID USER   FD   TYPE  DEVICE     SIZE    NODE NAME
bacula-sd 5825 root  cwd    DIR   253,0     4096 3801089 /root
bacula-sd 5825 root  rtd    DIR   253,0     4096       2 /
bacula-sd 5825 root  txt    REG   253,0  2110599  368004 /usr/local/sbin/bacula-sd
bacula-sd 5825 root  mem    REG   253,0    75284  389867 /usr/lib/libz.so.1.2.3
bacula-sd 5825 root  mem    REG   253,0    46680 3604521 /lib/libnss_files-2.5.so
bacula-sd 5825 root  mem    REG   253,0   936908  369115 /usr/lib/libstdc++.so.6.0.8
bacula-sd 5825 root  mem    REG   253,0   125736 3606807 /lib/ld-2.5.so
bacula-sd 5825 root  mem    REG   253,0  1602128 3606885 /lib/libc-2.5.so
bacula-sd 5825 root  mem    REG   253,0   208352 3606892 /lib/libm-2.5.so
bacula-sd 5825 root  mem    REG   253,0   125744 3606887 /lib/libpthread-2.5.so
bacula-sd 5825 root  mem    REG   253,0    25940 3604573 /lib/libacl.so.1.1.0
bacula-sd 5825 root  mem    REG   253,0    15972 3604535 /lib/libattr.so.1.1.0
bacula-sd 5825 root  mem    REG   253,0    46548 3606908 /lib/libgcc_s-4.1.2-20080102.so.1
bacula-sd 5825 root  mem    REG   253,0 56422480  366368 /usr/lib/locale/locale-archive
bacula-sd 5825 root    0r   CHR     1,3             1545 /dev/null
bacula-sd 5825 root    1r   CHR     1,3             1545 /dev/null
bacula-sd 5825 root    2r   CHR     1,3             1545 /dev/null
bacula-sd 5825 root    3u   CHR   9,128             6469 /dev/nst0
bacula-sd 5825 root    4u  IPv4 1023380              TCP backup:bacula-sd (LISTEN)
bacula-sd 5825 root    5u  IPv4 2693268              TCP backup:bacula-sd->backup:53957 (CLOSE_WAIT)
bacula-sd 5825 root    7u  IPv4 3248683              TCP backup:bacula-sd->backup:57629 (CLOSE_WAIT)
bacula-sd 5825 root    8u  IPv4 3250966              TCP backup:bacula-sd->backup:37650 (CLOSE_WAIT)
bacula-sd 5825 root    9u  IPv4 3253908              TCP backup:bacula-sd->backup:37671 (CLOSE_WAIT)
Dennis Williamson
andreas-h
  • See also https://unix.stackexchange.com/questions/5642/what-if-kill-9-does-not-work/5648#5648 and https://unix.stackexchange.com/questions/11172/how-can-i-kill-a-defunct-process-whose-parent-is-init – rogerdpack Apr 30 '19 at 15:24

7 Answers

20

The only way to remove a zombie (defunct) process is to kill its parent. Since the parent here is init (PID 1), that would also take down your system.

This pretty much leaves you with two options.

  • Manually modify the process table, e.g. create a dummy process, link the defunct process as a child of the dummy, then kill them both off. This is quite dangerous, and you may have to manually clean up other process resources such as semaphores and file handles.
  • Reboot the system.

I'd go with the second.
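
Before choosing between those options, it can help to see exactly which processes are defunct and who their parent is. The following is a minimal sketch, assuming a Linux system with /proc mounted; the function name list_zombies is made up for illustration and is not part of any tool mentioned in this thread.

```shell
# List every process currently in state Z (zombie) as "PID PPID NAME".
# Assumption: Linux procfs layout (/proc/<pid>/status).
list_zombies() {
    for status in /proc/[0-9]*/status; do
        # A process may exit while we iterate; skip entries that vanished.
        [ -r "$status" ] || continue
        awk '
            /^Name:/  { name = $2 }
            /^State:/ { state = $2 }
            /^Pid:/   { pid = $2 }
            /^PPid:/  { ppid = $2 }
            END { if (state == "Z") print pid, ppid, name }
        ' "$status" 2>/dev/null || continue
    done
}
```

A PPID of 1 in the second column means the zombie has already been reparented to init, as in the question. Any other PPID points at a parent that could be restarted or killed instead of rebooting.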

Roy
  • 2
    +1. However, there is no rush to do either, as long as more zombie processes aren't appearing, or your zombie process hasn't locked up 4G of your RAM. :) – Kyle Smith Oct 20 '09 at 14:40
  • 1
    "Since the parent is init (pid 1), that would also take down your system" - You cannot kill `init` as it does not have a signal handler for SIGKILL. See `man 2 kill`. –  Jul 10 '13 at 10:26
  • 1
    How do you do the first? – Jelle De Loecker Feb 12 '14 at 21:01
  • @AndrewH I'm not sure SIGKILL depends on a signal handler in the target process, but it's true that the typical kernel will ignore a SIGKILL to init. However, should you run out of cooler ways to trigger a kernel panic, I think you will find that on most Linux systems a SIGSEGV will do quite nicely. – Roy Mar 04 '14 at 12:26
  • 1
    It should be noted that one of `init`'s jobs is to reap zombie processes, so if you wait long enough `init` should clean up the zombie processes. Although, most `init`s should set the handler of `SIGCHLD` to be `SIG_IGN` which fixes this. – cyphar Oct 12 '15 at 00:57
5

Check the kernel log for a panic or oops:

# dmesg | tail

Then check whether the process is in state "D" (uninterruptible sleep), meaning it is stuck in kernel mode in a syscall that has not returned yet (because of a kernel oops or for some other reason): http://www.nabble.com/What-causes-an-unkillable-process--td20645581.html
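
One way to do that state check is sketched below, assuming Linux procfs. The helper name proc_state is hypothetical, and /proc/<pid>/wchan may be unreadable or report 0 depending on kernel and privileges, in which case the sketch prints "unknown".

```shell
# Print the scheduler state of a PID and, if available, the kernel
# function it is sleeping in.
# State letters: R running, S sleeping, D uninterruptible sleep,
# Z zombie, T stopped.
proc_state() {
    pid=$1
    if [ ! -r "/proc/$pid/status" ]; then
        echo "no such process: $pid" >&2
        return 1
    fi
    state=$(awk '/^State:/ { print $2 }' "/proc/$pid/status")
    # wchan is best-effort; fall back to "unknown" if it cannot be read.
    wchan=$(cat "/proc/$pid/wchan" 2>/dev/null) || wchan=unknown
    echo "state: $state"
    echo "wchan: ${wchan:-unknown}"
}
```

For the process in the question this would be called as `proc_state 5825`. A state of D together with a wchan inside a driver would suggest a stuck syscall rather than an ordinary zombie.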

Dennis Williamson
4

You could try restarting init:

 # telinit u

Otherwise, I wouldn't worry too much. It's not running and it's not taking any resources; it's just there so the kernel can keep its exit status until the parent collects it.

David Pashley
  • 1
    well, i kind of have to worry. it's a production machine running backup (bacula) and voip (asterisk) services. as long as the defunct bacula-sd process is there, bacula cannot seem to access the tape drive ... – andreas-h Oct 20 '09 at 08:37
  • It shouldn't have any files open. Run lsof -p 5825 and check. – David Pashley Oct 20 '09 at 08:39
  • Well, there seems to be many things open ... see above. Any ideas what I can do? I've never used lsof ... – andreas-h Oct 20 '09 at 08:53
  • 1
    Yeah, your zombie has /dev/nst0 open. A system reboot is probably the best bet at this point. – Kyle Smith Oct 20 '09 at 14:41
  • 5
    Yep, reboot seems to be the prevailing answer. I always feel like I've failed when I have to reboot a server. :( – David Pashley Oct 20 '09 at 17:22
  • It seems that zombies can prevent systemd from working as expected. Even after systemctl daemon-reexec and systemctl daemon-reload, every request timed out. telinit u cleared the zombies and let us use systemd again without rebooting the machine, which saved us from killing more than 1000 active services on the server – Thiago Conrado Dec 02 '20 at 18:53
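
As the comments above note, the practical problem is that something still holds /dev/nst0 open. A rough way to find out which PIDs hold a given file or device is sketched here, assuming Linux procfs and enough privileges to read other processes' fd directories (normally root). The function name holders_of is invented for this example; it scans per-thread fd tables as well, since a process whose main thread has exited may only show its descriptors there.

```shell
# Print the PIDs that have the given path open, one per line.
# Assumption: Linux procfs; run as root to see other users' processes.
holders_of() {
    target=$(readlink -f "$1") || return 1
    for fd in /proc/[0-9]*/task/[0-9]*/fd/*; do
        # Unreadable or vanished descriptors simply fail the comparison.
        [ "$(readlink "$fd" 2>/dev/null)" = "$target" ] || continue
        pid=${fd#/proc/}
        echo "${pid%%/*}"
    done | sort -un
}
```

On the machine in the question, `holders_of /dev/nst0` would be expected to print 5825 if that process is indeed the one blocking the tape drive. The standard tools `lsof /dev/nst0` and `fuser -v /dev/nst0` give similar information.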
4

If a zombie has init as its parent, then init has stopped working properly. One of the roles of init is to clean up zombies. If it doesn't do it, no one else will. So the only solution is to reboot. If init is broken, then a reboot may fail, so I'd shut down important services, sync the filesystem, then hit the power button instead.

MarkR
2

Let's keep the panic down, shall we? A "defunct" or "zombie" process is not a process. It is simply an entry in the process table, with a saved exit code. Thus, a zombie holds no resources, takes no CPU cycles, and uses no memory, since it is not a process. Don't get all weird and itchy trying to "kill" zombie processes. Just like their namesakes, they can't be killed, since they're already dead. But unlike the brain-eating kind, they harm absolutely no one, and won't bite other processes.

Don't let zombie processes eat your brain. Just ignore them.

Teddy
  • 11
    Yes, that is the theory. Unfortunately it's not always true. A defunct process will sometimes hang on to system resources, like andreash has clearly documented. – Roy Oct 20 '09 at 13:51
  • 5
    In his case, per the lsof output, the zombie process is eating the brains of /dev/nst0. He needs those brains to continue backup operations. – Kyle Smith Oct 20 '09 at 14:42
  • 2
    A system administrator who spends his/her career ignoring Zombie processes will eventually wake up in the middle of the night with their life being sucked out of them. A Zombie is, in my experience, indicative of something wrong. I write these even as a zombie child has some strange interaction with its parent, and the parent is spinning my CPU. I don't know whose fault it is, but the point is that Zombies are ugly and ignoring them will one day come to haunt you. ...One day... when you're sleeping peacefully... in the middle of the night... after a cold Autumn day... – Mike S May 18 '17 at 13:44
  • @MikeS I got a good laugh out of your comment! – Paul Calabro Jul 27 '17 at 08:32
  • @MikeS is right. I have a defunct ssh-agent, and neither ssh nor git can run properly. Only a restart helps (the same fix Windows has ... haha). – John Tribe Oct 17 '18 at 03:56
0

I just had this issue while running Kindle under Wine: the Kindle window wouldn't close even after I killed all Wine processes. ps showed a [Kindle.exe] <defunct> process whose parent is 1 (ps.tree is a self-made script that shows the process tree):

$ ps.tree 21323
 1 0 02:44:47 /lib/systemd/systemd --system --deserialize 119
   21323 1 01:50:44 [Kindle.exe] <defunct>

I finally got rid of the [Kindle.exe] process and the ghost window by killing all threads of this process with these commands:

cd /proc/21323/task
kill *
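
To inspect the threads before killing them, a sketch along the following lines may help. It assumes Linux procfs, and the name thread_states is hypothetical. The idea that a process shows as defunct because its main thread has exited while other threads are still alive is one possible explanation for this case, not something the thread confirms.

```shell
# Print "TID STATE" for every thread of the given PID.
# Assumption: Linux procfs (/proc/<pid>/task/<tid>/status).
thread_states() {
    pid=$1
    for t in /proc/"$pid"/task/[0-9]*; do
        # Threads can disappear while we iterate; skip those.
        [ -r "$t/status" ] || continue
        state=$(awk '/^State:/ { print $2 }' "$t/status" 2>/dev/null) || continue
        echo "${t##*/} $state"
    done
}
```

If `thread_states 21323` lists the main thread as Z but other threads as S or D, those remaining threads are the ones that `kill *` in the task directory actually targets.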
0

Seems like you've got an orphaned process. As far as I know the only way to kill these would be to reboot the box. I've had this happen on my ESX servers (which are linux under the hood) from time to time and a host reboot is the fix (from VMware support).

I'm a Windows guy so take that for what it's worth.

Dennis Williamson
mrdenny
  • unfortunately, rebooting is not a real option. it's a production machine also running voip services, so i cannot reboot it during office hours ... – andreas-h Oct 20 '09 at 08:38
  • 1
    so, you could reboot it after office hours, right? – warren Oct 20 '09 at 09:54