9

My Xen servers run openSUSE 11.1 and use open-iscsi to connect to our iSCSI SAN cluster. The SAN modules are in an IP failover group behind a virtual IP that the initiators connect to.

In the event that the primary SAN server goes down, the secondary picks up the role of serving as the target. This is all handled by the LeftHand SAN/iQ software and works well in most situations.

The problem I have is that occasionally some of my Xen DomUs will have their root filesystem go read-only after an IP failover. It's not consistent, and happens to a different subset each time a failover occurs. They're all running the same openSUSE 11.1 software image.

The root filesystem for each DomU is an iSCSI LUN connected by open-iscsi in the Dom0, which Xen then exposes to the DomU through the standard block device driver.
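For reference, the Dom0 side of that wiring is just the usual phy: backend pointing at the block device that open-iscsi creates. A minimal, hypothetical fragment of one of the domain configs would look something like the following (the by-path name and target IQN are made up, not my actual values):

# Hypothetical fragment of a DomU config in the Dom0.
# The iSCSI LUN shows up in the Dom0 as an ordinary block device
# (here referenced via /dev/disk/by-path) and is handed to the DomU
# with the standard 'phy:' block backend as its root disk.
disk = [ 'phy:/dev/disk/by-path/ip-192.168.1.10:3260-iscsi-iqn.2003-10.com.example:domu1-lun-0,xvda,w' ]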

The exact symptom is that running touch /test as root returns the error "read-only filesystem". However, the output of mount shows the filesystem as mounted read-write. Of course, all other I/O on the DomU is also failing at this time, so the machine comes down hard. Simply restarting it with xm from the Dom0, without even reconnecting the iSCSI session, makes everything work again.

On the Dom0 side the syslog messages during the fail-over are something like the following:

kernel: connection1:0: iscsi: detected conn error (1011)
iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
iscsid: connection1:0 is operational after recovery (1 attempts) 

I'm having a hard time figuring out at which layer to debug this problem: is it something in the DomU kernel, or at the Dom0 or Xen level? I suspect there's a parameter somewhere that needs tweaking to increase some kind of timeout, but I'm not sure where to look.

I don't really think it is an issue with open-iscsi itself, simply because the connected block device remains readable and writable from the Dom0.

Kamil Kisiel

4 Answers

6

I eventually solved this by using the following advice and settings from the open-iscsi documentation:

8.2 iSCSI settings for iSCSI root
---------------------------------

When accessing the root partition directly through an iSCSI disk, the
iSCSI timers should be set so that the iSCSI layer has several chances to
re-establish a session and so that commands are not quickly requeued to
the SCSI layer. Basically you want the opposite of when using dm-multipath.

For this setup, you can turn off iSCSI pings by setting:

node.conn[0].timeo.noop_out_interval = 0
node.conn[0].timeo.noop_out_timeout = 0

And you can turn the replacement timeout up to a very long value:

node.session.timeo.replacement_timeout = 86400

After setting up the connection to each LUN as described above, the failover works like a charm, even if it takes several minutes to happen.
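For anyone applying the same fix: the values can go into iscsid.conf (wherever your distribution keeps it) so they become the defaults for newly discovered nodes, or they can be pushed into existing node records with iscsiadm, roughly as below. The target IQN and portal address here are placeholders for your own:

# Placeholders: substitute your own target IQN and portal address.
iscsiadm -m node -T iqn.2003-10.com.example:lun0 -p 192.168.1.10:3260 --op update -n node.conn[0].timeo.noop_out_interval -v 0
iscsiadm -m node -T iqn.2003-10.com.example:lun0 -p 192.168.1.10:3260 --op update -n node.conn[0].timeo.noop_out_timeout -v 0
iscsiadm -m node -T iqn.2003-10.com.example:lun0 -p 192.168.1.10:3260 --op update -n node.session.timeo.replacement_timeout -v 86400

Node settings are normally read at login, so existing sessions may need to be logged out and back in (or the Dom0 rebooted) before the new timeouts take effect.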

Kamil Kisiel
  • I had the same problem with mysql prod db sitting on iscsi volume, with same errors in /var/log/messages and file system being in read-only mode. This tip solved the problem. – RainDoctor Apr 10 '10 at 04:52
2

This sounds like a problem with the iSCSI initiator running on the dom0. The initiator should not be sending SCSI failures up the stack that quickly. You'll probably want to set ConnFailTimeout in iscsi.conf; that's the setting that determines how long the initiator waits before it treats a connection failure as an error and sends that error up the SCSI stack.
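As a rough illustration only, since the exact knob depends on which initiator stack is actually in use: the older linux-iscsi style iscsi.conf uses ConnFailTimeout, while open-iscsi expresses the same idea as the session replacement timeout. Both values below are illustrative, in seconds:

# Older linux-iscsi style /etc/iscsi.conf (illustrative value).
ConnFailTimeout=300

# Rough open-iscsi equivalent, in iscsid.conf or the per-node record.
node.session.timeo.replacement_timeout = 300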

I'd also look into how long the failover is actually taking; it may be longer than you expect. If so, the VIP failover may be taking too long due to ARP-related issues.

0

Are there any messages in the dom0 indicating any sort of read/write or SCSI errors at the time of the failover? If so, it looks like the write error is being passed up to the domU. The domU doesn't "know" that it's an iSCSI device, so it behaves as though the underlying disk had gone away and remounts the filesystem read-only (see the mount manpage: errors=continue / errors=remount-ro / errors=panic).
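For illustration, that behaviour is selected by the errors= mount option, either in the domU's fstab or as the filesystem's default in the superblock. A sketch with hypothetical device names:

# Inside the domU's /etc/fstab -- hypothetical device and options.
/dev/xvda1  /  ext3  defaults,errors=remount-ro  1 1

# Or set the default policy in the superblock (e.g. from the dom0 while the domU is down).
tune2fs -e remount-ro /dev/disk/by-path/your-iscsi-lun-here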

From the dom0's perspective, it won't get changed to read-only - this read-only behaviour is a filesystem semantic, not a block device semantic.

You mention that "all other I/O is failing" at this time - do you mean the domU or dom0?

Usually when setting up an HA iSCSI solution I use multipathing rather than virtual IP takeover - it gives the host greater visibility, and you don't have an iSCSI session suddenly disappear and then need to be restarted - it's always there, there are just two of them. Is this an option in this environment?
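A very rough sketch of that approach with open-iscsi plus dm-multipath, assuming both SAN heads can present the same LUN on their own addresses (the addresses and IQN below are placeholders):

# Discover and log in to the same target via both controllers.
iscsiadm -m discovery -t sendtargets -p 192.168.1.11:3260
iscsiadm -m discovery -t sendtargets -p 192.168.1.12:3260
iscsiadm -m node -T iqn.2003-10.com.example:lun0 -l

# dm-multipath should then show one LUN with two paths.
multipath -ll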

MikeyB
  • Updated the original description with answers to your questions. I suppose I could look into multipathing instead, but the system is more geared for virtual IP failover in its current form. I'm not sure how the block-level replication would come into play with multipathing, especially since one of the SAN units needs to be designated a master. Thanks for pointing me to the part about the filesystem; I think that pretty much explains it. I suppose I could try switching it to the 'continue' mode, or maybe look at changing the filesystem to something more resilient like XFS. – Kamil Kisiel Jun 24 '09 at 07:11
  • There isn't anything inherently bad about ext3 - you'll have similar problems with XFS. And I wouldn't recommend using errors=continue - the system will believe that block is unreadable and you'll lose data. Multipathing is not mirroring - you don't need to worry about any replication on the host. You would just connect via iSCSI to both the master and secondary targets, and the host would know that if the master failed, not to pass an error up the stack but to try the same command against the secondary target. – MikeyB Jun 24 '09 at 18:02
  • My comment on replication was regarding the fact that the two SAN servers need to synchronize their data. Internally I think the system works similarly to drbd, with one of the units (the one that currently has the VIP) being the master. It might work with multipathing, but I'd really like to solve this problem without switching away from the current architecture. There should be a way to make this work; after all, my systems that directly mount iSCSI volumes never have the problem of the volume becoming read-only. – Kamil Kisiel Jul 02 '09 at 23:19
-1

Um... part of the problem is also that you aren't running / as RO. Security best practices state that you should have "/" mounted ro, and that any filesystems that need rw should be mounted separately (e.g. /var and /tmp). If there are directories under /etc that need writing to, they should be moved to /var/etc/path and symlinked from /etc.

"/" should only be mounted RW in single user mode.

Setting things up in this fashion, combined with the other suggestions, could help prevent the failure described above.

Miguel