
I want to test disaster recovery of an RDBMS after power loss under high load.

My idea is to mount the data directory under a new mount point, run umount -f during the load, and then investigate the outcome and the state of the files.

I expect the data to end up inconsistent with a non-durable configuration and consistent with a durable one.

Is this a good idea? Any related hints are welcome, e.g. which filesystem is better to use, or, if my expectation is irrelevant, why.

noonex

1 Answer


Presumably you are actually removing the power. umount -f is not nearly impolite enough to simulate many failures.

On Linux, umount(2) explains that force is supported only on a short list of mostly networked file systems, which does not include local disk file systems.

   MNT_FORCE (since Linux 2.1.116)
          Ask the filesystem to abort pending requests before attempting
          the unmount.  This may allow the unmount to complete without
          waiting for an inaccessible server, but could cause data loss.
          If, after aborting requests, some processes still have active
          references to the filesystem, the unmount will still fail.  As
          at Linux 4.12, MNT_FORCE is supported only on the following
          filesystems: 9p (since Linux 2.6.16), ceph (since Linux
          2.6.34), cifs (since Linux 2.6.12), fuse (since Linux 2.6.16),
          lustre (since Linux 3.11), and NFS (since Linux 2.1.116).
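
As a rough sketch of what that list means in practice, the shell function below reports whether a file system type is one of those the man page names. The function name supports_force_umount is hypothetical, and matching fuse.* subtypes and nfs4 is an assumption about how such mounts are labelled on a given system.

```shell
# supports_force_umount FSTYPE
# Returns 0 if FSTYPE is on the umount(2) MNT_FORCE list quoted above, 1 otherwise.
supports_force_umount() {
    case "$1" in
        # Exact names from the man page excerpt.
        9p|ceph|cifs|fuse|lustre|nfs) return 0 ;;
        # Assumption: subtype labels such as fuse.sshfs or nfs4 count as well.
        fuse.*|nfs4) return 0 ;;
        *) return 1 ;;
    esac
}
```

For a real mount, the type could be looked up with something like findmnt -n -o FSTYPE --target followed by the data directory, assuming util-linux is installed. A local type such as ext4 or xfs makes the function return 1, so a forced unmount is not a meaningful test there.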

Here are some more ideas regarding how to do very nasty things to a database system:

  • Physically unplug all power supplies to the host. Any processes and shared memory will go away very ungracefully.

  • Overcommit the storage with thin provisioning and run it to 100%. Even if the storage did something sane in this scenario, the DBMS might be unhappy if its volumes went read only in the middle of a write.

  • Unplug all paths to the SAN, to simulate that "non disruptive" storage maintenance that isn't.

  • Find a process that does writes and send it SIGKILL or an equivalent.

  • Crash the OS. For example, on Linux: echo 'c' > /proc/sysrq-trigger
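
The OS crash in the last item can be wrapped for scripted runs. This is only a sketch: the function name crash_host and the DRY_RUN and SYSRQ_TRIGGER variables are hypothetical, and it defaults to a dry run so that sourcing the file never takes the host down by accident. The real write needs root.

```shell
# crash_host
# Writes 'c' to the sysrq trigger, which crashes the kernel at once,
# with no sync and no unmount. Defaults to a dry run.
crash_host() {
    # Hypothetical override, useful for testing against a scratch file.
    trigger="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "dry run: would write 'c' to $trigger"
        return 0
    fi
    # Point of no return on a real host: nothing after this line runs.
    echo c > "$trigger"
}
```

Setting DRY_RUN=0 on a disposable test machine triggers the actual crash. Whether the host then reboots on its own or stays down depends on how it is configured to handle a kernel panic, so a scripted run should expect the session to drop and plan for bringing the host back.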

The state of the data remaining after the test depends on the storage and the DBMS. Either may have a journal it can replay, or it may not. You probably want to run fsck or an equivalent on the file system. If the database can recover to a consistent point in time, from logs or otherwise, you may want to do that. If you have an integrity checker for the DBMS, use it as a sanity check.
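
One possible extra sanity check, beyond fsck and the DBMS's own checker, is to compare what the database acknowledged with what survived. The sketch below assumes the load generator wrote one ID per line to a file on another host for every commit the DBMS acknowledged, and that the recovered IDs have been exported to a second file after restart. Both files and the function name lost_commits are hypothetical.

```shell
# lost_commits ACKED_FILE RECOVERED_FILE
# Prints every ID that was acknowledged before the crash
# but is missing from the recovered database.
lost_commits() {
    acked_sorted=$(mktemp) || return 2
    recovered_sorted=$(mktemp) || return 2
    # comm needs both inputs sorted the same way.
    sort -u "$1" > "$acked_sorted"
    sort -u "$2" > "$recovered_sorted"
    # -23 keeps only the lines unique to the first file.
    comm -23 "$acked_sorted" "$recovered_sorted"
    rm -f "$acked_sorted" "$recovered_sorted"
}
```

With a durable configuration the output would be expected to be empty. Any ID printed is a commit that was acknowledged and then lost.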

Hopefully you have already done a restore test of your backup, just in case. Do not assume that because something claims crash recovery, it works in all situations.

John Mahowald
  • My goal is only to test ACID from the I/O system's point of view. Could you provide a source confirming that a forced unmount is not enough? As I understand it, it will just abort all I/O operations, so the state of the caches doesn't really matter? sysrq-trigger is the best way to go, but it is harder to automate – noonex Oct 24 '17 at 07:46
  • My edit cites the man page: force is only supported for NFS and a few other network-style file systems. More fundamentally, a power loss is different from a loss of storage: RAM is gone, including DBMS caches. Automation is difficult because ideally you *really* mess the host up. Maybe don't automate, and run this scenario manually only upon major database engine or operating system changes. – John Mahowald Oct 25 '17 at 23:11
  • Thank you, indeed I should have read that man page before asking the question. But I still disagree that caches are important here, as long as I can get an exact physical snapshot of the filesystem, identical to the one after a power failure – noonex Oct 26 '17 at 04:57
  • Whoa, hold up. Bit level identical to before failure is not necessarily the same as consistent. Those journals may be in the file system, and it might roll back unhardened writes. Also, the umount of a local file system will fail if there are open processes. You have to be more insistent about it and kill processes. – John Mahowald Oct 28 '17 at 16:48
  • Killing processes is still a bad option, because the OS will have some time to soften the problem by handling its ongoing I/O requests accordingly. Anyway, since real filesystems don't support the 'force' flag, I am using sysrq-trigger now. Thank you for your assistance. – noonex Oct 29 '17 at 16:42