Set-Up
I've been a programmer for quite some time now but I'm still a bit fuzzy on deep, internal stuff.
Now, I am well aware that it's not a good idea to either:
- kill -9 a process (bad)
- spontaneously pull the power plug on a running computer or server (worse)
However, sometimes you just plain have to. Sometimes a process just won't respond no matter what you do, and sometimes a computer just won't respond, no matter what you do.
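For concreteness, the gentler route that comes before kill -9 can be sketched like this: send SIGTERM, wait a little, and only escalate to SIGKILL if the process is still there. This is a minimal Python 3 sketch; the function name stop_process and the grace period are my own assumptions for illustration, not anything standard.

```python
import signal
import subprocess
import sys

def stop_process(proc, grace_seconds=5.0):
    """Ask a child process to exit, then force it. Returns the signal that ended the wait."""
    # SIGTERM can be caught, so the process gets a chance to clean up.
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return signal.SIGTERM
    except subprocess.TimeoutExpired:
        # SIGKILL (what kill -9 sends) cannot be caught or ignored.
        proc.kill()
        proc.wait()
        return signal.SIGKILL

if __name__ == "__main__":
    # Hypothetical stand-in for a stuck process: a child that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_process(child, grace_seconds=2.0))
```

As far as I understand, the difference matters because a process that receives SIGKILL never gets to run any cleanup code, so whatever it was still holding in its own buffers is simply gone.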
Let's assume a system running Apache 2, MySQL 5, PHP 5, and Python 2.6.5 through mod_wsgi.
Note: I'm most interested in Mac OS X here, but an answer that pertains to any UNIX system would help me out.
My Concern
Each time I have to do either one of these, especially the second, I'm very worried for a period of time that something has been broken. Some file somewhere could be corrupt -- who knows which file? There are over 1,000,000 files on the computer.
I'm often using OS X, so I'll run a "Verify Disk" operation through the Disk Utility. It will report no problems, but I'm still concerned about this.
What if some configuration file somewhere got screwed up? Or even worse, what if a binary file somewhere is corrupt? Or a script file somewhere is corrupt now? What if some hardware is damaged?
What if I don't find out about it until next month, in a critical scenario, when the corruption or damage causes a catastrophe?
Or, what if valuable data is already lost?
My Hope
My hope is that these concerns and worries are unfounded. After all, after doing this many times before, nothing truly bad has happened yet. The worst is I've had to repair some MySQL tables, but I don't seem to have lost any data.
But if my worries are not unfounded, and real damage could happen in either situation 1 or 2, then my hope is that there is a way to detect it and protect against it.
My Question(s)
Could my luck so far be because modern operating systems are designed to ensure that nothing is lost in these scenarios? Could it be because modern software is designed to ensure that nothing is lost? What about modern hardware design? What measures are in place when you pull the power plug?
My question is, for both of these scenarios, what exactly can go wrong, and what steps should be taken to fix it?
I'm under the impression that one thing that can go wrong is that some programs might not have flushed their data to the disk, so any very recent data that was supposed to be written to the disk (say, a few seconds before the power pull) might be lost. But what about beyond that? And can this very issue of 5-second data loss screw up a system?
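To make the flushing worry concrete, here is a minimal Python 3 sketch of how I understand a careful program avoids leaving a half-written file: write to a temporary file, flush and fsync it, then rename it over the original. The function name durable_write and the example file name are made up for illustration; this is a sketch under my own assumptions, not a claim about how Apache or MySQL actually do it.

```python
import os
import tempfile

def durable_write(path, data):
    """Replace the file at path with data (bytes) via a temp file and rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # Python's buffer -> operating system
            os.fsync(tmp.fileno())   # operating system cache -> disk
        os.replace(tmp_path, path)   # atomic rename on POSIX filesystems
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself is recorded on disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

if __name__ == "__main__":
    # Hypothetical file name, written to the current directory.
    durable_write("example-settings.conf", b"max_clients = 150\n")
```

If I have this right, a power pull during such a write should leave either the old file or the new one rather than a torn mix, although the most recent write can still be lost. One caveat I'm aware of: on Mac OS X, fsync does not force the drive's own write cache to flush; that takes fcntl with F_FULLFSYNC.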
What about corruption of random files hiding somewhere in the huge forest of files on my hard drives?
What about hardware damage?
What Would Help Me Most
Detailed descriptions of what goes on internally when you either kill -9 a process or pull the power on the whole system. (It seems instant, but can someone slow it down for me?)
Explanations of all the things that could go wrong in these scenarios, along with (rough, of course) probabilities (e.g., this is very unlikely, but this is likely)...
Descriptions of measures in place in modern hardware, operating systems, and software, to prevent damage or corruption when these scenarios occur. (to comfort me)
Instructions for what to do after a kill -9 or a power pull, beyond "verifying the disk", in order to truly make sure nothing is corrupt or damaged somewhere on the drive.
Measures that can be taken to fortify a computer setup so that if something has to be killed or the power has to be pulled, any potential damage is mitigated.
Some information about binary files -- isn't it true that the apache binary file or some library could have a random byte or two corrupted in the middle, that wouldn't come out and cause a problem until later? How can I assure myself that this didn't happen as a result of the power pull or the kill?
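On the binary file worry, the only check I can think of myself is to record checksums while the system is known to be good and compare them after an incident. Here is a minimal Python 3 sketch of that idea; the function names are my own, the directory you point it at is up to you, and it only tells you that a file changed, not why.

```python
import hashlib
import os

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each regular file under root (by relative path) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest

def changed_files(before, after):
    """List paths that were added, removed, or whose contents differ."""
    paths = set(before) | set(after)
    return sorted(p for p in paths if before.get(p) != after.get(p))
```

The idea would be to save the result of build_manifest for a directory of binaries and libraries ahead of time, rebuild it after a kill or power pull, and look at what changed_files reports. A legitimate software update would show up the same way as corruption, so this is a detection aid at best.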
Thanks so much!