I am having trouble configuring multipath with Emulex (lpfc). I have not detected any data corruption, but the SAN administrator has a tool showing that the paths are being switched every 20 seconds or so. Here are the details:
# multipath -l
san01 (3600a0b80002a042200002cb44a9a29ca) dm-2 IBM ,1815 FASt
[size=100G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
\_ 3:0:0:0 sdb 8:16 [active][undef]
\_ round-robin 0 [prio=0][enabled]
\_ 4:0:0:0 sdc 8:32 [active][undef]
Both paths are connected to the same LUN:
# /lib/udev/scsi_id -g -u -d /dev/sdb
3600a0b80002a042200002cb44a9a29ca
# /lib/udev/scsi_id -g -u -d /dev/sdc
3600a0b80002a042200002cb44a9a29ca
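The same comparison can be scripted. This is only a minimal sketch, assuming the scsi_id path and flags shown above; the helper names same_wwid and wwid_of are made up for illustration.

```shell
#!/bin/sh
# same_wwid (hypothetical helper): succeeds only when both WWIDs are
# non-empty and identical, i.e. both paths report the same LUN.
same_wwid() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# wwid_of (hypothetical helper): wraps the scsi_id call used above.
# Takes a kernel device name such as sdb.
wwid_of() {
    /lib/udev/scsi_id -g -u -d "/dev/$1"
}
```

Usage would be something like `same_wwid "$(wwid_of sdb)" "$(wwid_of sdc)" && echo "same LUN"`.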
Here's the /etc/multipath.conf
defaults {
udev_dir /dev
polling_interval 5
selector "round-robin 0"
path_grouping_policy failover
getuid_callout "/lib/udev/scsi_id -g -u -d /dev/%n"
path_checker readsector
failback immediate
user_friendly_names yes
}
multipaths {
multipath {
wwid 3600a0b80002a042200002cb44a9a29ca
alias san01
}
}
# fdisk -l
Disk /dev/sdb: 107.3 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x61b4bf95
Device Boot Start End Blocks Id System
/dev/sdb1 1 13054 104856223+ 83 Linux
Disk /dev/sdc: 107.3 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x61b4bf95
Device Boot Start End Blocks Id System
/dev/sdc1 1 13054 104856223+ 83 Linux
I increased the log verbosity for lpfc and now I get the following in dmesg:
[ 2519.241119] lpfc 0000:07:00.0: 1:0336 Rsp Ring 0 error: IOCB Data: xff000018 x37a120c0 x0 x0 xeb x0 x1b108db xa29b16
[ 2519.241124] lpfc 0000:07:00.0: 1:(0):0729 FCP cmd x12 failed <0/0> status: x1 result: xeb Data: x1b1 x8db
[ 2519.241127] lpfc 0000:07:00.0: 1:(0):0730 FCP command x12 failed: x0 SNS x0 x0 Data: x8 xeb x0 x0 x0
[ 2519.241130] lpfc 0000:07:00.0: 1:(0):0716 FCP Read Underrun, expected 254, residual 235 Data: xeb x12 x0
[ 2519.241275] lpfc 0000:07:00.0: 1:0336 Rsp Ring 0 error: IOCB Data: xff000018 x37a14c48 x0 x0 xd2 x0 x1b208e6 xa29b16
[ 2519.241279] lpfc 0000:07:00.0: 1:(0):0729 FCP cmd x12 failed <0/0> status: x1 result: xd2 Data: x1b2 x8e6
[ 2519.241283] lpfc 0000:07:00.0: 1:(0):0730 FCP command x12 failed: x0 SNS x0 x0 Data: x8 xd2 x0 x0 x0
[ 2519.241286] lpfc 0000:07:00.0: 1:(0):0716 FCP Read Underrun, expected 254, residual 210 Data: xd2 x12 x0
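A note on reading these messages: command x12 is the SCSI INQUIRY opcode, and the result value in each 0729 line is the residual from the matching 0716 line written in hex (xeb = 235, xd2 = 210). The sketch below computes how many bytes actually arrived for each underrun. The helper name underrun_received is made up for illustration, and whether these short INQUIRY responses have anything to do with the path switching is not established here.

```shell
#!/bin/sh
# underrun_received (hypothetical helper): reads kernel log lines on stdin
# and, for every lpfc "FCP Read Underrun" line, prints the number of bytes
# that actually arrived (expected minus residual).
underrun_received() {
    sed -n 's/.*FCP Read Underrun, expected \([0-9][0-9]*\), residual \([0-9][0-9]*\).*/\1 \2/p' |
    while read -r expected residual; do
        echo "$((expected - residual))"
    done
}
```

Run as `dmesg | underrun_received`; for the two 0716 lines above it gives 19 and 44 (254 - 235 and 254 - 210).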
Can someone see anything wrong with this config? Thank you.
Based on janneb's comments I changed the configuration in multipath.conf to:
defaults {
udev_dir /dev
polling_interval 5
selector "round-robin 0"
path_grouping_policy multibus
getuid_callout "/lib/udev/scsi_id -g -u -d /dev/%n"
failback immediate
user_friendly_names yes
}
Which now gives:
san01 (3600a0b80002a042200002cb44a9a29ca) dm-2 IBM ,1815 FASt
[size=100G][features=0][hwhandler=0]
\_ round-robin 0 [prio=2][active]
\_ 3:0:0:0 sdb 8:16 [active][ready]
\_ 4:0:0:0 sdc 8:32 [active][ready]
But it still goes to [active][undef] after a while, then back to [ready].
I just noticed something: when I run 'multipath -l' I get [undef], but when I run 'multipath -ll' I get [ready]. The man page describes the two options as:
-l show the current multipath topology from information fetched in sysfs and the device mapper
-ll show the current multipath topology from all available information (sysfs, the device mapper, path checkers ...)
Is the setup wrong? How can I debug? Thanks.
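One way to debug this would be to log the checker state over time and compare the timestamps with what the SAN administrator's tool reports. Going by the option descriptions above, -l does not consult the path checkers, so [undef] there may only mean the state was not checked. The sketch therefore assumes 'multipath -ll' output in the bracketed format shown in this post; path_states is a hypothetical helper name.

```shell
#!/bin/sh
# path_states (hypothetical helper): reads 'multipath -ll' output in the
# bracketed format shown above and prints one line per path:
#   <device> <dm state> <checker state>
path_states() {
    awk '
        # Path lines have an H:C:T:L address as their second field,
        # e.g. "\_ 3:0:0:0 sdb 8:16 [active][ready]".
        $2 ~ /^[0-9]+:[0-9]+:[0-9]+:[0-9]+$/ {
            st = $5
            gsub(/\[/, " ", st)   # turn "[active][ready]" into " active  ready "
            gsub(/\]/, " ", st)
            split(st, s, " ")
            print $3, s[1], s[2]
        }
    '
}
```

For example, `while true; do date; multipath -ll | path_states; sleep 5; done` would leave a timestamped record of each path's state.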
Thank you janneb and zerolagtime for helping out.
Here is where it gets complicated. I thought I would not need to explain all this, but I am now leaning towards a mix-up in the hardware setup.
There are actually two servers connected to the same LUN over FC. At the OS level only one server accesses the filesystem at a time (although the same LUN is exposed to both), since it is ext3 (not a cluster filesystem). If server 1 goes down, server 2 kicks in (linux-ha) and mounts the filesystem.
Server 1 (multipath -ll):
san01 (3600a0b80002a042200002cb44a9a29ca) dm-2 IBM ,1815 FASt
[size=100G][features=0][hwhandler=0]
\_ round-robin 0 [prio=2][active]
\_ 3:0:0:0 sdb 8:16 [active][ready]
\_ 4:0:0:0 sdc 8:32 [active][ready]
Server 2 (multipath -ll):
san01 (3600a0b80002a042200002cb44a9a29ca) dm-2 IBM ,1815 FASt
[size=100G][features=0][hwhandler=0]
\_ round-robin 0 [prio=2][active]
\_ 3:0:0:0 sdb 8:16 [active][ready]
\_ 4:0:0:0 sdc 8:32 [active][ready]
Server 1 port names:
# cat /sys/class/fc_host/host3/port_name
0x10000000c96c5fdb
# cat /sys/class/fc_host/host4/port_name
0x10000000c96c5df5
Server 2 port names:
# cat /sys/class/fc_host/host3/port_name
0x10000000c97b0917
# cat /sys/class/fc_host/host4/port_name
0x10000000c980a2d8
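To compare these against the zoning and host mapping on the storage side, the port names can be collected in one go on each server. This is a minimal sketch; list_fc_port_names is a hypothetical helper name, and the optional directory argument exists only so the function can be pointed somewhere other than /sys/class/fc_host.

```shell
#!/bin/sh
# list_fc_port_names (hypothetical helper): prints "hostN <port_name>" for
# every FC host found under the given directory (default: /sys/class/fc_host).
list_fc_port_names() {
    base="${1:-/sys/class/fc_host}"
    for host in "$base"/host*; do
        # Skip entries without a readable port_name (also covers an
        # unmatched glob when no FC hosts exist).
        [ -r "$host/port_name" ] || continue
        echo "$(basename "$host") $(cat "$host/port_name")"
    done
}
```

Running `list_fc_port_names` on both servers gives a list that can be checked line by line against what the SAN administrator sees.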
Is this setup wrong? Is the way the LUN is exposed to both servers wrong? I suspect the hardware hookup is incorrect; what could be wrong? Could server 1's path_checker be interfering with server 2's operation? Thanks.