We have a P4000 SAN with 2 nodes; we don't have a Failover Manager.

Right-clicking a node in the management console does not offer a failover option; the closest one seems to be "power off or reboot". This should probably work, but will it?

Is there, however, a better way to do a manual failover other than shutting down the active node?

Basil
Afanasii
  • 2
    Your in-house documentation should cover how to do a failover, which should have been figured out when you did failover testing upon initial installation. – Chris S Mar 27 '13 at 17:17
  • Yes, Chris, of course, it "should" and "should have", I agree with you wholeheartedly... – Afanasii Mar 28 '13 at 14:35

1 Answer

I googled around, and found something that indicates powering down a node might be a bad idea:

In a two node P4000 cluster, you will need a FOM for automatic failover. If no FOM is present then if one node loses connectivity, the other node also goes offline as well and you will have to manually failover the LUN’s.

Basil
  • This is correct. I can confirm from experience that a two-node LeftHand cluster will not survive the failure or shutdown of a single node unless a failover manager is running. – longneck Mar 27 '13 at 18:38
  • 1
    This is a feature: when a single node loses connectivity with the other cluster node, it doesn't know whether the other node is really down, or if it's just a problem with inter-node communication and maybe the other node is running fine. To avoid "split-brain", where both nodes are active (which is usually much worse than having no nodes running), the first node refuses to take over unless something like the FOM tells it that the unreachable node really is down. – Johnny Mar 27 '13 at 18:56
  • 1
    Haha, a "feature". That's cute. – Basil Mar 27 '13 at 19:05
  • 1
    When a cluster shutdown prevents the massive disk corruption that would result from 2 nodes making uncoordinated writes to the same disks, yeah, I'd call it a feature. – Johnny Mar 27 '13 at 19:14
  • I meant in the sense of not baking "lack of data corruption, no matter what" into the storage. Uncoordinated writes to the same disks is a problem that every other storage controller on the market has solved without requiring the use of software running on a server somewhere. – Basil Mar 27 '13 at 19:34
  • I'm not that familiar with the P4000, but I thought it was a multi-site system, so it can't just use a SAN Quorum disk like a simple 2-head SAN controller can. Are there other multi-site SAN products that don't require some sort of independent failover manager? – Johnny Mar 27 '13 at 19:43
  • A simple two-head SAN controller doesn't use a quorum disk; it uses well-written firmware and cache mirroring to ensure write consistency. When they go offsite, they have to use replication rather than failover, but the P4000 should be able to handle a basic one-site storage environment without all the extra hassle of FOM. – Basil Mar 28 '13 at 13:02
  • Thank you all for the comments. I realise now that we do need to add a FOM. As it is, it works as network RAID1 of two RAID5s, but it looks like we will lose it completely if one node fails. So it is pretty much as reliable as its RAID5s, plus it stores two copies of the data. – Afanasii Mar 28 '13 at 14:29
  • I don't see how cache mirroring helps prevent split-brain syndrome. When your cluster interconnect goes down and node 1 can't reach node 2, how does each node decide who is really supposed to take over the disks without a third node (or disk) to cast a tie-breaking vote? If you want a single-site storage cluster, maybe the P4000 is not the right product for you - why blame the product for a configuration requirement that should have been clear to whoever made the purchase decision? – Johnny Mar 28 '13 at 17:19
  • 1
    Cache mirroring doesn't prevent split-brain syndrome. What Basil doesn't realize about standard dual-controller SAN systems is that the interconnect or backplane or chassis or disk or SCSI/SAS controller provides the quorum. You don't know that because it's completely transparent and requires no configuration. – longneck Mar 30 '13 at 01:42
  • I'd characterize that as "well written firmware". And mirrored cache prevents data corruption upon the failure of one controller, but if you're being picky about terminology, that isn't strictly a quorum issue. – Basil Apr 01 '13 at 17:18
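
To make the tie-breaking argument from the comments concrete, here is a minimal sketch of a strict-majority quorum check. It is a generic model, not the P4000's actual implementation: the function names `has_quorum` and `surviving_side_stays_online` are hypothetical, and it assumes each storage node and each tie-breaker (such as a FOM) carries exactly one vote.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """Return True only if this side holds a strict majority of all votes."""
    return 2 * reachable_votes > total_votes


def surviving_side_stays_online(storage_nodes: int,
                                tie_breakers: int,
                                reachable_votes: int) -> bool:
    """Decide whether one side of a partition may keep serving volumes.

    Assumption: every storage node and every tie-breaker has one vote.
    reachable_votes counts the side's own vote plus every voter it can
    still reach.
    """
    total_votes = storage_nodes + tie_breakers
    return has_quorum(reachable_votes, total_votes)


if __name__ == "__main__":
    # Two nodes, no tie-breaker: the survivor sees only itself (1 of 2 votes).
    print(surviving_side_stays_online(2, 0, 1))  # False
    # Two nodes plus one tie-breaker: survivor + tie-breaker is 2 of 3 votes.
    print(surviving_side_stays_online(2, 1, 2))  # True
    # Same cluster, but the node is cut off from the tie-breaker too.
    print(surviving_side_stays_online(2, 1, 1))  # False
```

With only two voters, one vote is never a strict majority, so neither side can tell "the other node is dead" apart from "the link is dead", and both stop. Adding a third voter means at most one side of any partition can hold two of three votes, which is why a single tie-breaker is enough to allow failover without risking split-brain.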