My Xen servers run openSUSE 11.1 and use open-iscsi to connect to our iSCSI SAN cluster. The SAN modules are in an IP failover group behind a virtual IP that the initiators connect to.
In the event that the primary SAN server goes down, the secondary picks up the role of serving as the target. This is all handled by the LeftHand SAN/iQ software and works well in most situations.
The problem I have is that occasionally some of my Xen DomUs will have their root filesystem go read-only after an IP failover. It's not consistent, and happens to a different subset each time a failover occurs. They're all running the same openSUSE 11.1 software image.
The root filesystems for each DomU are mounted by open-iscsi in the Dom0 and then Xen uses the standard block device driver to expose it to the DomU.
The exact symptom is that running touch /test as root returns the error "read-only filesystem", yet the output of mount shows the root filesystem as mounted read-write. All other I/O on the DomU fails at the same time, so the machine comes down hard. Simply restarting the DomU with xm from the Dom0, without even reconnecting the iSCSI session, makes everything work again.
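One possible explanation for the mismatch between mount and the failed write is that mount is reporting a stale table (such as /etc/mtab) while the kernel has already remounted the filesystem read-only after an I/O error. A rough sketch of how that could be checked from inside an affected DomU is below. It assumes Python is available in the DomU and that /proc/mounts reflects the kernel's view; the helper names root_mount_options and try_write are made up for illustration.

```python
import errno
import os


def root_mount_options(mounts_text):
    """Return the option list the kernel reports for "/", or None if absent."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device mountpoint fstype options dump pass
        if len(fields) >= 4 and fields[1] == "/":
            # A later entry (the real root device) overrides an earlier rootfs one.
            options = fields[3].split(",")
    return options


def try_write(path="/test"):
    """Attempt the same write as `touch /test`.

    Returns None on success, or the symbolic errno name (e.g. "EROFS").
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        os.close(fd)
        return None
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        print("kernel mount options for /: %s" % root_mount_options(f.read()))
    print("write attempt: %s" % (try_write() or "ok"))
```

If the kernel's options include ro while mount still prints rw, the read-only state was set by the DomU kernel itself rather than by an explicit remount, which would point the investigation at how the DomU handles the I/O error.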
On the Dom0 side, the syslog messages during the failover look something like this:
kernel: connection1:0: iscsi: detected conn error (1011)
iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
iscsid: connection1:0 is operational after recovery (1 attempts)
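To compare failovers across several Dom0s, a small script along these lines could confirm that every connection error in the log was followed by a recovery message. It is only a sketch: it assumes the messages keep the wording shown above, and the function names summarize_iscsi_events and unrecovered are invented for this example.

```python
import re

# Matches both "connection1:0" (kernel) and "connection 1:0" (iscsid).
CONN_RE = re.compile(r"connection\s*(\d+:\d+)")


def summarize_iscsi_events(lines):
    """Map each connection id to its ordered list of "error"/"recovered" events."""
    events = {}
    for line in lines:
        match = CONN_RE.search(line)
        if not match:
            continue
        conn = match.group(1)
        if "detected conn error" in line:
            events.setdefault(conn, []).append("error")
        elif "is operational after recovery" in line:
            events.setdefault(conn, []).append("recovered")
    return events


def unrecovered(events):
    """Return the connections whose most recent event is not a recovery."""
    return sorted(c for c, seq in events.items() if seq[-1] != "recovered")


if __name__ == "__main__":
    log = [
        "kernel: connection1:0: iscsi: detected conn error (1011)",
        "iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)",
        "iscsid: connection1:0 is operational after recovery (1 attempts)",
    ]
    summary = summarize_iscsi_events(log)
    print(summary)
    print("unrecovered:", unrecovered(summary))
```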
I'm having a hard time figuring out at which layer to debug this problem. Is it something in the DomU kernel, or is it at the Dom0 or Xen level? I suspect some timeout parameter somewhere needs to be increased, but I'm not sure where to look.
I don't really think open-iscsi itself is at fault, simply because the connected block device is still readable and writable from the Dom0.
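Even so, one timeout that can at least be ruled out is the open-iscsi session recovery timeout (node.session.timeo.replacement_timeout in iscsid.conf), which controls how long I/O is held during recovery before being failed back up the stack. A minimal sketch for reading the value currently in effect on the Dom0 follows. It assumes the kernel exposes it as recovery_tmo under /sys/class/iscsi_session/session*/, which may differ between kernel versions; read_recovery_timeouts is a hypothetical helper name, and this is not necessarily the parameter responsible for the problem.

```python
import glob
import os


def read_recovery_timeouts(base="/sys/class/iscsi_session"):
    """Return {session_name: recovery timeout in seconds} for each iSCSI session.

    Assumes each session directory contains a "recovery_tmo" attribute file.
    """
    timeouts = {}
    pattern = os.path.join(base, "session*", "recovery_tmo")
    for path in sorted(glob.glob(pattern)):
        session = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            timeouts[session] = int(f.read().strip())
    return timeouts


if __name__ == "__main__":
    for session, seconds in read_recovery_timeouts().items():
        print("%s: recovery_tmo = %d s" % (session, seconds))
```

Comparing this value with how long the failover actually takes would show whether the Dom0 could be returning I/O errors to the Xen block backend before the session recovers.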