4

I have a server running three Oracle instances, with the file system on NFS backed by a NetApp. After shutting down the databases, one process per database doesn't exit for a long time. Even kill -9 doesn't work. I tried running truss and pfiles against it, and both commands threw errors.

iostat shows a lot of I/O going to the NetApp server. Someone said the process was busy writing data to the remote NetApp server and wouldn't exit until the writes completed, so all that could be done was to wait until the I/O finished.

After waiting longer (about 1.5 hours), the processes exited.

So my question is: how can a process ignore the kill signal? As far as I know, kill -9 stops a process immediately. Have you encountered a situation where kill -9 doesn't kill the process right away?

TEST7-stdby-phxdbnfs11$> ps -ef|grep dbw0
  oracle  1469 25053   0 22:36:53 pts/1       0:00 grep dbw0
  oracle 26795     1   0 21:55:23 ?           0:00 ora_dbw0_TEST7
  oracle  1051     1   0   Apr 08 ?        3958:51 ora_dbw0_TEST2
  oracle   471     1   0   Apr 08 ?        6391:43 ora_dbw0_TEST1
TEST7-stdby-phxdbnfs11$> kill -9 1051 
TEST7-stdby-phxdbnfs11$> ps -ef|grep dbw0
  oracle  1493 25053   0 22:37:07 pts/1       0:00 grep dbw0
  oracle 26795     1   0 21:55:23 ?           0:00 ora_dbw0_TEST7
  oracle  1051     1   0   Apr 08 ?        3958:51 ora_dbw0_TEST2
  oracle   471     1   0   Apr 08 ?        6391:43 ora_dbw0_TEST1
TEST7-stdby-phxdbnfs11$> kill -9 471
TEST7-stdby-phxdbnfs11$> ps -ef|grep dbw0
  oracle 26795     1   0 21:55:23 ?           0:00 ora_dbw0_TEST7
  oracle  1051     1   0   Apr 08 ?        3958:51 ora_dbw0_TEST2
  oracle   471     1   0   Apr 08 ?        6391:43 ora_dbw0_TEST1
  oracle  1495 25053   0 22:37:22 pts/1       0:00 grep dbw0
TEST7-stdby-phxdbnfs11$> ps -ef|grep smon
  oracle  1524 25053   0 22:38:02 pts/1       0:00 grep smon
TEST7-stdby-phxdbnfs11$> ps -ef|grep dbw0
  oracle  1526 25053   0 22:38:06 pts/1       0:00 grep dbw0
  oracle 26795     1   0 21:55:23 ?           0:00 ora_dbw0_TEST7
  oracle  1051     1   0   Apr 08 ?        3958:51 ora_dbw0_TEST2
  oracle   471     1   0   Apr 08 ?        6391:43 ora_dbw0_TEST1
TEST7-stdby-phxdbnfs11$> kill -9 1051 471 26795
TEST7-stdby-phxdbnfs11$>  ps -ef|grep dbw0
  oracle  1528 25053   0 22:38:19 pts/1       0:00 grep dbw0
  oracle 26795     1   0 21:55:23 ?           0:00 ora_dbw0_TEST7
  oracle  1051     1   0   Apr 08 ?        3958:51 ora_dbw0_TEST2
  oracle   471     1   0   Apr 08 ?        6391:43 ora_dbw0_TEST1

TEST7-stdby-phxdbnfs11$> truss -p 26795
truss: unanticipated system error: 26795

TEST7-stdby-phxdbnfs11$> pfiles 26795
pfiles: unanticipated system error: 26795
user41014
  • It would be helpful if you describe the operating system the NFS client is running; this is actually more relevant than knowing that the server is a NetApp appliance. – Charles Duffy Apr 14 '10 at 16:52
  • While it's obviously possible, you will get better performance and probably not have the issue you're asking about if you don't run your RDBMS data store on NFS. Can you put the data on direct disk to the DB server? You can still use the NetApp via iSCSI or Fibre Channel if you have that hardware. – mpez0 Apr 14 '10 at 17:40

2 Answers

3

A process receives the KILL signal (all signals behave the same way in this respect) only when it is in userspace. If it is in kernel space (for example, waiting for an NFS share to deliver data read from a file), the signal is not delivered. It is not lost; it stays pending until the process returns to userspace.
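One way to check whether a process is stuck in the kernel is to look at its scheduler state. The following is a minimal sketch that assumes a Linux client with /proc mounted (the asker's platform may differ, and other systems expose this differently). The variable names are illustrative, and `$$` stands in for the PID of the stuck process.

```shell
#!/bin/sh
# Assumption: Linux with /proc mounted. Other platforms need a different method.
pid=$$   # placeholder: substitute the PID of the process that will not die

# The "State:" line of /proc/<pid>/status holds a one-letter state code.
state=$(awk '/^State:/ {print $2}' "/proc/$pid/status")

case "$state" in
    D) echo "PID $pid is in uninterruptible sleep; signals stay pending" ;;
    *) echo "PID $pid is in state $state" ;;
esac
```

A state of D (uninterruptible sleep) is consistent with a process blocked inside a syscall, such as one waiting on an NFS server.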

Most NFS clients have mount options for this: a read can return with a failure status if it times out. This can cause data loss (as can the other option), because not all programs check the result of every read().

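Processes cannot ignore or catch the KILL signal. It is not a notification that gives the process a chance to save any necessary data (that is what TERM is for); the process is terminated as soon as the signal can be delivered.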

0

The question didn't specify the client platform, so I'm assuming Linux.


See the relevant FAQ on the Linux NFS site.

Processes can't ignore SIGKILL, but syscalls can block signals from being handled.
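The first half of that statement can be checked on a healthy system. This sketch starts a child that ignores SIGTERM, then shows that SIGKILL still terminates it. It assumes a POSIX shell such as bash or dash, and the variable names are illustrative. It does not reproduce the blocked-syscall case, which needs an unresponsive NFS server.

```shell
#!/bin/sh
# The child ignores SIGTERM; an ignored disposition survives exec,
# so the sleep process inherits it.
sh -c 'trap "" TERM; exec sleep 30' &
pid=$!
sleep 1                       # give the child time to install the trap

kill -TERM "$pid"             # ignored by the child
sleep 1
if kill -0 "$pid" 2>/dev/null; then
    term_survived=yes         # the process is still there after SIGTERM
else
    term_survived=no
fi

kill -KILL "$pid"             # cannot be caught, blocked, or ignored
kill_status=0
wait "$pid" || kill_status=$? # bash and dash report 128 + 9 for SIGKILL

echo "survived SIGTERM: $term_survived"
echo "exit status after SIGKILL: $kill_status"
```

Because the child is in userspace (or in an interruptible sleep), SIGKILL takes effect almost immediately. The processes in the question differ only in that they are blocked inside a syscall.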

There are two mount flags for Linux clients which can be used to work around this: soft and intr.

soft causes NFS to eventually give up when a request is failing. If your application is not well-written to be robust in the face of syscall failures (and many, many applications are not so written), it may cause data corruption.

intr attempts to make NFS syscalls interruptible in a safer way than soft. However, it is still possible to get an intr,hard mount into an unkillable state.
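As a rough illustration of the two flags, the commands below show how such mounts might be requested on a Linux client. The server name, export path, mount point, and timeout values are all hypothetical. The two commands are alternatives, not a sequence.

```shell
# Hypothetical server (filer), export, and mount point; run as root on a Linux client.

# hard with intr: requests are retried indefinitely, but signals may interrupt them.
mount -t nfs -o hard,intr filer:/vol/oradata /mnt/oradata

# soft: after retrans retries, each waiting timeo tenths of a second,
# the client gives up and the syscall returns an error to the application.
mount -t nfs -o soft,timeo=600,retrans=2 filer:/vol/oradata /mnt/oradata
```

Whether intr has any effect depends on the kernel version; the nfs(5) man page on the client describes the behavior for that system.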

Charles Duffy