
Working as scientists in a corporate environment, we are provided with storage from a SAN inside an Ubuntu 20.04 virtual machine running on Proxmox. The SAN controller is passed directly to the VM via PCIe passthrough.

The SAN itself uses hardware RAID 60 (no other option is given to us) and presents us with 380 TB that we can split into a number of LUNs. We would like to benefit from ZFS compression and snapshotting. We opted for 30 x 11 TB LUNs, organized as striped RAID-Z (six 5-LUN RAID-Z1 vdevs). The setup is redundant (two servers), we have backups, and performance is good, which oriented us towards striped RAID-Z rather than the usual striped mirrors.
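For reference, the resulting layout is six RAID-Z1 vdevs of five LUNs each (see the zpool status output below). A pool of this shape would be created roughly as follows; this is an illustrative sketch using the multipath WWN device links, not the exact command we ran:

    zpool create sanpool \
      raidz1 /dev/disk/by-id/wwn-0x60060e8012b003005040b0030000002e \
             /dev/disk/by-id/wwn-0x60060e8012b003005040b0030000002f \
             /dev/disk/by-id/wwn-0x60060e8012b003005040b00300000031 \
             /dev/disk/by-id/wwn-0x60060e8012b003005040b00300000032 \
             /dev/disk/by-id/wwn-0x60060e8012b003005040b00300000033 \
      raidz1 ...   # and so on: five more raidz1 groups of five LUNs each, 30 LUNs in total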

Independently of the ZFS geometry, we have noticed that a high write load (> 1 GB/s) during a ZFS scrub results in disk errors, eventually leading to faulted devices. By looking at the files showing errors, we could link the problem to the scrub process trying to access data still present in the SAN's cache. With moderate loads during the scrub, the process completes without any errors.

Are there configuration parameters either for ZFS or for multipath that can be tuned within the VM to prevent this issue with the SAN cache?

Output of `zpool status`:

  pool: sanpool
 state: ONLINE
  scan: scrub repaired 0B in 2 days 02:05:53 with 0 errors on Thu Mar 17 15:50:34 2022
config:

    NAME                                        STATE     READ WRITE CKSUM
    sanpool                                     ONLINE       0     0     0
      raidz1-0                                  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b0030000002e  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b0030000002f  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000031  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000032  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000033  ONLINE       0     0     0
      raidz1-1                                  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000034  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000035  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000036  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000037  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000038  ONLINE       0     0     0
      raidz1-2                                  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000062  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000063  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000064  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000065  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000066  ONLINE       0     0     0
      raidz1-3                                  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b0030000006a  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b0030000006b  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b0030000006c  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b0030000006d  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b0030000006f  ONLINE       0     0     0
      raidz1-4                                  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000070  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000071  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000072  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000073  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000074  ONLINE       0     0     0
      raidz1-5                                  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000075  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000076  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000077  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b00300000079  ONLINE       0     0     0
        wwn-0x60060e8012b003005040b0030000007a  ONLINE       0     0     0

errors: No known data errors

Output of `multipath -ll`:

mpathr (360060e8012b003005040b00300000074) dm-18 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:25 sdz  65:144 active ready running
  `- 8:0:0:25 sdbd 67:112 active ready running
mpathe (360060e8012b003005040b00300000064) dm-5 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:13 sdn  8:208  active ready running
  `- 8:0:0:13 sdar 66:176 active ready running
mpathq (360060e8012b003005040b00300000073) dm-17 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:24 sdy  65:128 active ready running
  `- 8:0:0:24 sdbc 67:96  active ready running
mpathd (360060e8012b003005040b00300000063) dm-4 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:12 sdm  8:192  active ready running
  `- 8:0:0:12 sdaq 66:160 active ready running
mpathp (360060e8012b003005040b00300000072) dm-16 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:23 sdx  65:112 active ready running
  `- 8:0:0:23 sdbb 67:80  active ready running
mpathc (360060e8012b003005040b00300000062) dm-3 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:11 sdl  8:176  active ready running
  `- 8:0:0:11 sdap 66:144 active ready running
mpatho (360060e8012b003005040b00300000071) dm-15 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:22 sdw  65:96  active ready running
  `- 8:0:0:22 sdba 67:64  active ready running
mpathb (360060e8012b003005040b00300000038) dm-2 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:10 sdk  8:160  active ready running
  `- 8:0:0:10 sdao 66:128 active ready running
mpathn (360060e8012b003005040b00300000070) dm-14 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:21 sdv  65:80  active ready running
  `- 8:0:0:21 sdaz 67:48  active ready running
mpatha (360060e8012b003005040b0030000002e) dm-1 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:1  sdb  8:16   active ready running
  `- 8:0:0:1  sdaf 65:240 active ready running
mpathz (360060e8012b003005040b00300000033) dm-26 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:5  sdf  8:80   active ready running
  `- 8:0:0:5  sdaj 66:48  active ready running
mpathm (360060e8012b003005040b0030000006f) dm-13 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:20 sdu  65:64  active ready running
  `- 8:0:0:20 sday 67:32  active ready running
mpathy (360060e8012b003005040b00300000032) dm-25 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:4  sde  8:64   active ready running
  `- 8:0:0:4  sdai 66:32  active ready running
mpathl (360060e8012b003005040b0030000002f) dm-12 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:2  sdc  8:32   active ready running
  `- 8:0:0:2  sdag 66:0   active ready running
mpathx (360060e8012b003005040b0030000007a) dm-24 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:30 sdae 65:224 active ready running
  `- 8:0:0:30 sdbi 67:192 active ready running
mpathad (360060e8012b003005040b00300000037) dm-30 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:9  sdj  8:144  active ready running
  `- 8:0:0:9  sdan 66:112 active ready running
mpathk (360060e8012b003005040b0030000006d) dm-11 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:19 sdt  65:48  active ready running
  `- 8:0:0:19 sdax 67:16  active ready running
mpathw (360060e8012b003005040b00300000031) dm-23 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:3  sdd  8:48   active ready running
  `- 8:0:0:3  sdah 66:16  active ready running
mpathac (360060e8012b003005040b00300000036) dm-29 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:8  sdi  8:128  active ready running
  `- 8:0:0:8  sdam 66:96  active ready running
mpathj (360060e8012b003005040b0030000006c) dm-10 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:18 sds  65:32  active ready running
  `- 8:0:0:18 sdaw 67:0   active ready running
mpathv (360060e8012b003005040b00300000079) dm-22 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:29 sdad 65:208 active ready running
  `- 8:0:0:29 sdbh 67:176 active ready running
mpathab (360060e8012b003005040b00300000035) dm-28 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:7  sdh  8:112  active ready running
  `- 8:0:0:7  sdal 66:80  active ready running
mpathi (360060e8012b003005040b0030000006b) dm-9 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:17 sdr  65:16  active ready running
  `- 8:0:0:17 sdav 66:240 active ready running
mpathu (360060e8012b003005040b00300000077) dm-21 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:28 sdac 65:192 active ready running
  `- 8:0:0:28 sdbg 67:160 active ready running
mpathaa (360060e8012b003005040b00300000034) dm-27 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:6  sdg  8:96   active ready running
  `- 8:0:0:6  sdak 66:64  active ready running
mpathh (360060e8012b003005040b0030000006a) dm-8 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:16 sdq  65:0   active ready running
  `- 8:0:0:16 sdau 66:224 active ready running
mpatht (360060e8012b003005040b00300000076) dm-20 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:27 sdab 65:176 active ready running
  `- 8:0:0:27 sdbf 67:144 active ready running
mpathg (360060e8012b003005040b00300000066) dm-7 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:15 sdp  8:240  active ready running
  `- 8:0:0:15 sdat 66:208 active ready running
mpaths (360060e8012b003005040b00300000075) dm-19 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:26 sdaa 65:160 active ready running
  `- 8:0:0:26 sdbe 67:128 active ready running
mpathf (360060e8012b003005040b00300000065) dm-6 HITACHI,OPEN-V
size=11T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 7:0:0:14 sdo  8:224  active ready running
  `- 8:0:0:14 sdas 66:192 active ready running
Benoit
  • "The SAN itself uses hardware Raid 6 (no other option is given to us), and presents us with 380 TB that we can split in a number of LUNs" Please tell me your storage team didn't do something stupid like create a single RAID-6 array from like 32 12TB disks then then say, "Herp! Derp! Here's your 380TB! How many LUNs should we present from this array?" If they did that abomination, that's going to be a [read-modify-write](https://www.vmdamentals.com/?p=897) horror show whenever you write to any ONE of your LUNs. A scrub writing to all LUNs is going to overwhelm that SAN storage. – Andrew Henle May 06 '22 at 19:20
  • What they should have done is say, "We have N amount of X TB drives. How do you want us to split them into separate arrays? We'll present each one to you as a separate LUN." But that would take more effort on their part. FWIW, RAID-6 arrays with 10 drives each are probably a nice sweet spot - although you'd need to be aware of RAID rebuild times for failed drives. 30+ drive RAID-6 arrays are asinine and incompetent. And yeah, I'm hoping they read this if that's what they did. – Andrew Henle May 06 '22 at 19:22
  • Dear Andrew, it is neither the worst case of your first comment nor the best case of your second comment… My understanding is that they used smaller RAID-6 arrays than the 10-drive ones you mentioned (RAID 60; I will edit my question accordingly). I had asked for a split matching the ZFS geometry (as a complete ignorant, I had hoped for one disk per LUN), but apparently it is a solution they bought as-is and the split is fixed from the outset. The SAN seems fine even during a scrub or while moving lots of data. My 380 TB are only a small part of it as I understand. – Benoit May 07 '22 at 00:28
  • You need a [ZFS professional/consultant](https://serverfault.com/users/13325/ewwhite?tab=profile) and some advocates to work with corporate IT... – ewwhite May 07 '22 at 08:05
  • About getting professional help: this may well end up happening, as it did for security audits. As weather scientists we are often reaching our limits... – Benoit May 07 '22 at 14:10

3 Answers


You're looking at the wrong spot. If your SAN faults under load, then you can't rely on it, period. Fix the SAN.

longneck

We were able to fix the setup.

  1. We set the pool property cachefile=none to avoid importing the zpool at startup with an unstable multipath configuration. We had noticed that some redundant paths were occasionally not fully established before ZFS mounted the pool. Delaying the import prevented a cascade of faulted devices in ZFS under load and allowed us to look separately at potential issues with the SAN (a combined sketch of the changes is shown after these steps).

  2. We found I/O errors in the system and SAN logs; they sporadically affected only half of the mpaths. We first replaced the cable associated with the errors, with no effect: the fibre connector turned out to be the culprit and was replaced.

  3. We found the multipath parameters recommended by our SAN vendor and applied them by editing the corresponding file (sketched below):

nano /etc/multipath.conf

Finally, we updated the initial RAM disk so that the new multipath configuration is also picked up early at boot:

update-initramfs -u -k all
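For reference, the whole sequence amounts to something like the following. The multipath settings shown are structural placeholders only; the actual attribute values are the ones recommended by our SAN vendor for these LUNs:

    # Step 1: keep the pool out of the zpool cache file so it is not
    # auto-imported at boot before multipath has fully settled
    zpool set cachefile=none sanpool

    # Step 3: /etc/multipath.conf -- structure only, the attribute values
    # below are placeholders standing in for the vendor-recommended ones
    devices {
        device {
            vendor               "HITACHI"
            product              "OPEN-.*"
            path_grouping_policy "multibus"
            path_checker         "tur"
            no_path_retry        5
        }
    }

    # Rebuild the initramfs so the updated multipath.conf is used at boot
    update-initramfs -u -k all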

All of the described issues under load are now resolved: multipath -ll no longer shows any failed paths during a scrub, and ZFS has stopped reporting errors.

Benoit

This is really in the realm of professional services, given the specific nature of the setup and the odd SAN configuration.

This can be tuned and adjusted for better behavior and performance.

  • Why are you scrubbing, though?
  • What tunables have you adjusted to allow scrubbing of a pool of this geometry?
  • Please post your /etc/modprobe.d/zfs.conf (an example of its format is sketched after this list)
  • Please post your Proxmox's /etc/sysctl.conf
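
For context, /etc/modprobe.d/zfs.conf is where ZFS kernel module tunables go. A minimal illustrative example capping the ARC at 80 GB (the value is an assumption, not taken from the original setup):

    # /etc/modprobe.d/zfs.conf -- illustrative only
    # Cap the ARC at 80 GB (value in bytes); takes effect when the zfs module is loaded
    options zfs zfs_arc_max=85899345920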
ewwhite
  • Thank you for your answer. We are scrubbing because we store snapshots of Proxmox VM disks and databases on this pool, and we thought data integrity was important on block devices. We have not adjusted any tunables yet; for the moment we can allocate up to 80 GB to the ZFS ARC. We have not created any zfs.conf file: usually we just set the ARC parameters there, but for ZFS on a SAN we do not have enough experience to do so. Regarding Proxmox's /etc/sysctl.conf: all lines are commented out, we have not edited this file. Note that Proxmox passes the SAN adapter to the VM (PCIe passthrough). – Benoit May 07 '22 at 14:03
  • The setup needs work. Contact me offline. – ewwhite May 12 '22 at 09:21