1

We have a hardware problem with the disks that caused all mount points to become read-only. Output of dmesg:

end_request: I/O error, dev sda, sector 15574609
sd 0:0:0:0: SCSI error: return code = 0x00040000

We want to analyze a program that is currently running, because it should have died when it could not write to the file system. So we would like to use strace to trace its system calls.

But the output of strace is:

Bus error

It seems some resources are unavailable to the machine, or there is some low-level error. I am stuck on how to analyze the program before the sysadmins repair the disk.

ompemi

2 Answers

1

Your disk is (probably, in fact almost certainly) dying. It sounds like your sysadmins have already reached this conclusion.
Prepare for the funeral by dressing your backups in black and performing a restore test.


Re: the bus error - this should have been immediately lethal to the program in question. It's the signal equivalent of "WTF? That's unpossible!" (See this SO question - they're talking about memory, but the same thing can happen with disks, or any addressable component). I don't recall if you can catch SIGBUS, but if your program is doing so it shouldn't.
Further questions on how to trace/debug your software should really be asked over on StackOverflow or Programmers.
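On the question of whether SIGBUS can be caught: on POSIX systems only SIGKILL and SIGSTOP are uncatchable, so a handler can be installed for SIGBUS. A minimal sketch in Python that contrasts the default disposition with a handler follows. It assumes a Unix-like system, and the signal is sent with `os.kill` rather than produced by a real I/O fault, so it only illustrates the signal mechanics, not the failing-disk case. When SIGBUS comes from an actual fault (for example a page of a mapped file that cannot be read), returning from the handler normally retries the faulting access, so catching it is rarely a recovery strategy.

```python
import signal
import subprocess
import sys

# Child 1: default disposition. The process sends itself SIGBUS and should die.
default_child = "import os, signal; os.kill(os.getpid(), signal.SIGBUS)"

# Child 2: installs a handler first, so the same signal is survived.
handled_child = (
    "import os, signal\n"
    "signal.signal(signal.SIGBUS, lambda signum, frame: print('caught', signum))\n"
    "os.kill(os.getpid(), signal.SIGBUS)\n"
    "print('still alive')\n"
)


def run(code):
    """Run a snippet in a fresh interpreter and capture its output."""
    return subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True
    )


killed = run(default_child)
survived = run(handled_child)

# A negative return code means "terminated by that signal number".
print("default disposition return code:", killed.returncode)
print("with handler return code:", survived.returncode)
print(survived.stdout, end="")
```

If the agent in question installs a SIGBUS handler like the second child, that would be one explanation for why it kept running.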

voretaq7
  • We know the procedure to restore everything; what actually matters is whether we can examine the current state so that we can detect it in the future. I will have a look at SIGBUS (I need to look carefully at the agent code), but I would like to find out why we didn't catch it in the current scenario. – ompemi Oct 05 '11 at 18:07
1

Sounds like your system can't even load the utilities/libraries needed to do the tracing.

The correct thing to do here is:

  • repair the disk (e.g. restore from backup)
  • get the system back up in an optimal state
  • properly test your program in a controlled manner (by making the filesystem readonly at the right time)
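For the controlled test in the last step, it helps if the program reports write failures explicitly so the test has something to observe. A rough sketch in Python of a write path that treats a read-only filesystem (`EROFS`) or a low-level I/O error (`EIO`) as fatal is below. The names `FATAL_ERRNOS`, `is_fatal_storage_error` and `checked_write` are hypothetical, and the choice of which errno values count as fatal is an assumption, not something taken from the program in the question.

```python
import errno
import os
import sys

# Errno values this sketch treats as "storage is unusable, stop now".
# The selection is an assumption; adjust it to the program's needs.
FATAL_ERRNOS = {errno.EROFS, errno.EIO}


def is_fatal_storage_error(exc):
    """Return True if exc is an OSError whose errno is in FATAL_ERRNOS."""
    return isinstance(exc, OSError) and exc.errno in FATAL_ERRNOS


def checked_write(path, data):
    """Append data and fsync it; return False on a fatal storage error."""
    try:
        with open(path, "ab") as handle:
            handle.write(data)
            handle.flush()
            # fsync pushes the data to the device so errors surface here
            # instead of being lost in the page cache.
            os.fsync(handle.fileno())
        return True
    except OSError as exc:
        if is_fatal_storage_error(exc):
            print(f"fatal storage error on {path}: {exc}", file=sys.stderr)
            return False
        raise  # Anything else is not handled by this sketch.
```

A caller would exit with a non-zero status when `checked_write` returns `False`. The test then consists of remounting the data filesystem read-only while the program runs and checking that it exits.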
MikeyB
  • +1 for controlled testing. Ideally NOT in the production environment. – voretaq7 Oct 05 '11 at 17:43
  • The problem is that we have already tried to reproduce this situation, unsuccessfully. The state is really inconsistent: the filesystem is not only read-only, other resources are affected too. @voretaq7 It is not a production environment, yet. – ompemi Oct 05 '11 at 18:01
  • Those I/O errors are read failures from the disk. You cannot expect your program to behave properly when the disk is failing. You need to ensure that your program files and (output) data files are on separate partitions, then corrupt the data partition to test your failing program. – MikeyB Oct 05 '11 at 18:20