
I set up ZFS-HA following the excellent description on GitHub (see here). After extensive testing, I rolled the setup out to production using 5x12 disks in RAIDZ3, connected to two nodes via HBA controllers. This ran quite smoothly until last night, when one of the two storage pools suddenly faulted with "The pool metadata is corrupted." during a scrub run. At this point I can only speculate about what caused this; both pools were set up with SCSI fencing in pacemaker, and the disk reservations worked flawlessly in all failure scenarios I tested before going into production. The only major incidents that occurred recently were two complete power outages without UPS support (read: the power was just gone from one moment to the next). However, the true reason for the corruption might also be something completely different.

The situation now is that I cannot import the pool anymore (kindly see the output of zpool import at the end of this question). So far, all my attempts to rescue the pool have failed:

# zpool import -f tank
cannot import 'tank': one or more devices is currently unavailable

# zpool import -F tank
cannot import 'tank': one or more devices is currently unavailable

This puzzles me a bit, since it does not actually say that the only option left is to destroy the pool (which would be the expected response for a fatally corrupted pool).

# zpool clear -F tank
cannot open 'tank': no such pool

I also manually removed all SCSI reservations, e.g.:

# DEVICE=35000c5008472696f
# sg_persist --in --no-inquiry --read-reservation --device=/dev/mapper/$DEVICE   # show the current reservation
# sg_persist --in --no-inquiry --read-key --device=/dev/mapper/$DEVICE           # show the registered keys
# sg_persist --out --no-inquiry --register --param-sark=0x80d0001 --device=/dev/mapper/$DEVICE   # register a key so we are allowed to clear
# sg_persist --out --no-inquiry --clear --param-rk=0x80d0001 --device=/dev/mapper/$DEVICE        # clear all reservations and registrations
# sg_persist --in --no-inquiry --read-reservation --device=/dev/mapper/$DEVICE   # verify the reservation is gone
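
Since this has to be done for all 60 disks, I effectively repeated the same sequence in a loop, roughly like the following sketch (it assumes all pool members appear under /dev/mapper with their 35000c5... WWIDs and that no partition mappings match the glob; 0x80d0001 is the same key as above):

for DEV in /dev/mapper/35000c5*; do
    # register a temporary key, then clear all reservations and registrations
    sg_persist --out --no-inquiry --register --param-sark=0x80d0001 --device="$DEV"
    sg_persist --out --no-inquiry --clear --param-rk=0x80d0001 --device="$DEV"
    # verify that no reservation is left on the disk
    sg_persist --in --no-inquiry --read-reservation --device="$DEV"
done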

I also tried removing AC power from the disk shelves to clear any temporary state that might remain in the disks.

I am quite frankly running short on options. The only thing left on my list is the -X option to zpool import, which I will only try after all other measures have failed.
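
Concretely, the last-resort attempt I have in mind looks roughly like this (a sketch; -n together with -F only reports whether discarding the most recent transactions would allow the import, while -X extends the rewind search much further back):

# zpool import -F -n tank    # dry run of the recovery rewind, no changes are made
# zpool import -F -X tank    # extreme rewind; documented as hazardous and potentially very slow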

So my question is: did you run into anything like this before, and - more importantly - did you find a way to resolve it? I would be very grateful for any suggestions you might have.

=========

Pool layout/configuration:

   pool: tank
     id: 1858269358818362832
  state: FAULTED
 status: The pool metadata is corrupted.
 action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://zfsonlinux.org/msg/ZFS-8000-72
 config:

        tank                   FAULTED  corrupted data
          raidz3-0             FAULTED  corrupted data
            35000c5008472696f  ONLINE
            35000c5008472765f  ONLINE
            35000c500986607bf  ONLINE
            35000c5008472687f  ONLINE
            35000c500847272ef  ONLINE
            35000c50084727ce7  ONLINE
            35000c50084729723  ONLINE
            35000c500847298cf  ONLINE
            35000c50084728f6b  ONLINE
            35000c50084726753  ONLINE
            35000c50085dd15bb  ONLINE
            35000c50084726e87  ONLINE
          raidz3-1             FAULTED  corrupted data
            35000c50084a8a163  ONLINE
            35000c50084e80807  ONLINE
            35000c5008472940f  ONLINE
            35000c50084a8f373  ONLINE
            35000c500847266a3  ONLINE
            35000c50084726307  ONLINE
            35000c50084726897  ONLINE
            35000c5008472908f  ONLINE
            35000c50084727083  ONLINE
            35000c50084727c8b  ONLINE
            35000c500847284e3  ONLINE
            35000c5008472670b  ONLINE
          raidz3-2             FAULTED  corrupted data
            35000c50084a884eb  ONLINE
            35000c500847262bb  ONLINE
            35000c50084eb9f43  ONLINE
            35000c50085030a4b  ONLINE
            35000c50084eb238f  ONLINE
            35000c50084eb6873  ONLINE
            35000c50084728baf  ONLINE
            35000c50084eb4c83  ONLINE
            35000c50084727443  ONLINE
            35000c50084a8405b  ONLINE
            35000c5008472868f  ONLINE
            35000c50084727c6f  ONLINE
          raidz3-3             FAULTED  corrupted data
            35000c50084eaa467  ONLINE
            35000c50084e7d99b  ONLINE
            35000c50084eb55e3  ONLINE
            35000c500847271d7  ONLINE
            35000c50084726cef  ONLINE
            35000c50084726763  ONLINE
            35000c50084727713  ONLINE
            35000c50084728127  ONLINE
            35000c50084ed0457  ONLINE
            35000c50084e5eefb  ONLINE
            35000c50084ecae2f  ONLINE
            35000c50085522177  ONLINE
          raidz3-4             FAULTED  corrupted data
            35000c500855223c7  ONLINE
            35000c50085521a07  ONLINE
            35000c50085595dff  ONLINE
            35000c500855948a3  ONLINE
            35000c50084f98757  ONLINE
            35000c50084f981eb  ONLINE
            35000c50084f8b0d7  ONLINE
            35000c50084f8d7f7  ONLINE
            35000c5008539d9a7  ONLINE
            35000c5008552148b  ONLINE
            35000c50085521457  ONLINE
            35000c500855212b3  ONLINE

Edit:

Servers are 2x Dell PowerEdge R630, controllers are Dell OEM versions of the Broadcom SAS HBA (should be similar to the SAS 9300-8e), and all 60 disks in this pool are Seagate ST6000NM0034. The enclosure is a Quanta MESOS M4600H.

Edit 2:

OS is CentOS 7

ZFS is zfs-0.7.3-1.el7_4.x86_64

Michael
  • It seems the pool was imported at the same time on both hosts. Do you have any logs to share and/or rule out this possibility? – shodanshok Nov 16 '17 at 13:45
  • Right, did the hosts come up at the same time? Boot log info? – ewwhite Nov 16 '17 at 13:53
  • This should have been prevented by the scsi fence using scsi reservations. I did actually try this during testing and found that the scsi fence successfully blocks I/O from one node when the disks are accessed from two different nodes. I am happy to share log files, but they are too big to include here. I will look for the relevant parts and update the question. – Michael Nov 16 '17 at 13:56
  • The hosts booted at the same time after the power outage; however, the pool did not fault immediately after that but only during a `scrub` some hours later. During boot I noticed some 'reservation conflicts', but those were gone after clearing them manually (with a single host and no pacemaker). – Michael Nov 16 '17 at 14:12
  • A double-imported zpool will not fail immediately; rather, it (very quickly) accumulates metadata corruption which can later be discovered by a `scrub`. Did you enable the [multi import protection features](https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.7.0) (see the sketch after these comments)? Anything in the logs that shows a double import (you can use [pastebin](https://pastebin.com/) to upload your logs)? – shodanshok Nov 16 '17 at 17:04
  • @shodanshok I've never had a double-import issue in ZFS. But do we know how long it was from the power outage to the scrub? – ewwhite Nov 16 '17 at 18:25
  • Also see: https://www.reddit.com/r/zfs/comments/7dcid0/zfsha_pool_faulted_with_metadata_corruption/dpwzwrn/ – ewwhite Nov 16 '17 at 18:46
  • @ewwhite a double import is one possibility, but clearly not the only one. Anyway, when I tested it on my workbench, it gave exactly this problem (an unimportable pool). I agree that discarding the last transactions by issuing `zpool import -X -F` can be a good idea at this point. However, the OP should *really* investigate *why* the pool became corrupted... – shodanshok Nov 16 '17 at 21:56
  • As I said, during all my tests the scsi fencing was never a problem for me either, which would speak strongly against a double import. I would love to find the reason for this corruption, but I am afraid I am running short on time as well, as this shelf needs to go back into production soon and I do not have a spare that can hold that amount of research data. – Michael Nov 17 '17 at 12:00
  • @ewwhite: The time between the power outage and the start of the scrub was in the range of hours (approx. 6). – Michael Nov 17 '17 at 12:02
  • @Michael This also makes me suspect potential backplane or controller issues. I wonder how the expanders in that chassis work. – ewwhite Nov 17 '17 at 14:16
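
For reference, the multi-import protection referred to in the comments above is the multihost (MMP) feature introduced with zfs-0.7.0. A minimal sketch of enabling it, assuming each cluster node already has (or is given, e.g. with genhostid) a unique non-zero hostid:

# genhostid                   # only if /etc/hostid does not exist yet; the hostid must differ between the two nodes
# zpool set multihost=on tank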

3 Answers


In the end I resorted to using the -X option for the import. This exercised all disks by reading them at about 2 GB/s for roughly 36 hours. After that, no error message was given, the file system was mounted and is now fully accessible again. So far, no data inconsistencies have been detected (a zfs scrub is still running). Thanks for all your replies.

However, for future readers I want to pass on the warning about the -X option from the man page: this option can be extremely hazardous to the health of your pool and should only be used as a last resort.
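
For completeness, the verification I am running on the recovered pool is nothing exotic:

# zpool scrub tank        # re-reads all data and verifies it against the checksums (this is the scrub mentioned above)
# zpool status -v tank    # shows scrub progress and lists any files with permanent errors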

Michael

It seems upstream doesn't have many options here (this is from the Oracle Solaris ZFS Troubleshooting and Pool Recovery document, which states that zpool import -F is essentially the only option you have, short of hiring a ZFS guru who will actually look into how the metadata got corrupted):

If the pool cannot be recovered by the pool recovery method described above, you must restore the pool and all its data from a backup copy.

And I don't think OpenZFS has brought much here that would change the situation, which is indeed sad news.

P.S. This has nothing to do with the reason the pool got into its state, but don't you think that creating 10-disk-wide arrays is a problem by itself? Even with 2+ spare disks. Cold data and so on, you know.

drookie
  • The RAID is actually 9+3, so 3 parities. This was a decision based on space efficiency. Granted, a rebuild might have been problematic, but the one resilvering I had to do finished in a reasonable amount of time (2 days or so, I don't remember exactly). Can you provide me the contact details of a ZFS guru who knows how to fix metadata... :S – Michael Nov 16 '17 at 14:53
  • Well, I meant that raidz3 is just a sort of raid5 with two more parities, and raid5 ends somewhere near 6-disk arrays - more is dangerous if you don't scrub it at least once a month. – drookie Nov 16 '17 at 16:01
  • As for ZFS guru - I would try to ask in the OpenZFS community, probably some mailing list. – drookie Nov 16 '17 at 16:02
  • There is nothing wrong with a 12-disk-wide RAIDZ3, except that IOPS will be very low compared to the number of spindles. After all, 12+ disk RAID6 arrays (with only 2 parities) are in widespread use. – shodanshok Nov 16 '17 at 17:07
  • Yeah, this was exactly my thought when I inherited a 16-disk-wide raid5 from a previous engineer. Good luck! I wish you to avoid my experience. – drookie Nov 17 '17 at 04:35

What are the hardware details? Makes and models of servers, disks, enclosures and controllers.

I would disable all HA features and focus on working on one system.

  • Put one node in standby: pcs cluster standby or just disable pacemaker.

  • Manually import the pool on the node you'll be working on:

    zpool import tank -d /dev/mapper/ and observe the result.

Also, what do you see in dmesg after you do this?
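
Roughly, the sequence I have in mind looks like this (a sketch; substitute your own node name, and the -N flag just keeps ZFS from mounting any datasets during the test import):

# pcs cluster standby <other-node>        # take the node you are not working on out of the picture
# zpool import -d /dev/mapper -N tank     # attempt the import via the multipath device names, without mounting
# dmesg | tail -n 50                      # check for SCSI or ZFS errors triggered by the import attempt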

ewwhite
  • I edited the question to include the hardware details. I had pacemaker disabled already and was working on a single node (I actually tried both individually to see if any pending reservations might have caused the issue). Manual import on either node (the other one being powered off) yields the above error (`one or more devices is currently unavailable`). However, no entries are added to dmesg or /var/log/messages. – Michael Nov 16 '17 at 13:47
  • I've had situations years ago with NexentaStor that required the use of `zpool import -n -F` and `zpool import -X -F`. If you haven't tried those yet, you may have to. – ewwhite Nov 16 '17 at 18:46
  • I tried to import the pool under FreeBSD, hoping for a more 'advanced' implementation of the recovery, but also without much luck. So currently I am running the `zpool import -FX` and hoping for the best. Interestingly, however, FreeBSD detected a corrupted GPT on 10 (out of my 60) disks, which might just be another symptom that I will look into after the current attempt with `-X` terminates. – Michael Nov 17 '17 at 11:58
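
For reference, the kind of label and partition-table check I have in mind for those disks once the -X attempt terminates (a sketch, using one member disk as an example; the -part1 suffix assumes ZFS's whole-disk partitioning):

# zdb -l /dev/mapper/35000c50084eb9f43-part1    # dump the four ZFS labels of this member disk
# sgdisk -v /dev/mapper/35000c50084eb9f43       # verify the GPT and report a damaged backup header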