
I am using ZFS on Linux and am seeing a rather strange symptom: when I add more disks to the system, the write speed of each individual drive drops, effectively negating the additional spindles for sequential write performance.

The disks are connected to the host via an HBA (LSI 9300-8e) in SAS disk shelves.

For the tests below I ran the following IOzone command: iozone -i 0 -s 10000000 -r 1024 -t 10
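
For reference, here is what those flags mean (my reading of the IOzone documentation):

# -i 0        -> run test 0 only (sequential write/rewrite)
# -s 10000000 -> per-worker file size in KB (roughly 9.5 GB)
# -r 1024     -> record size in KB (1 MB records)
# -t 10       -> throughput mode with 10 parallel workers
iozone -i 0 -s 10000000 -r 1024 -t 10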

Here are the results of my tests:

In my first test I created a mirrored pool with 12 disks, which shows the expected write performance of around 100 MB/s to each disk.

zpool create -o ashift=12 -f PoolA mirror S1_D0 S2_D0 mirror S1_D1 S2_D1 \
  mirror S1_D2 S2_D2 mirror S1_D3 S2_D3 mirror S1_D4 S2_D4 mirror S1_D5 S2_D5
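
The per-vdev statistics below were captured with zpool iostat while the test ran; something along these lines produces them (the 5-second interval is arbitrary):

zpool iostat -v PoolA 5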

              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
PoolA       3.60G  10.9T      0  5.06K      0   638M
  mirror     612M  1.81T      0    863      0   106M
    S1_D0       -      -      0    862      0   106M
    S2_D0       -      -      0    943      0   116M
  mirror     617M  1.81T      0    865      0   107M
    S1_D1       -      -      0    865      0   107M
    S2_D1       -      -      0    939      0   116M
  mirror     613M  1.81T      0    860      0   106M
    S1_D2       -      -      0    860      0   106M
    S2_D2       -      -      0    948      0   117M
  mirror     611M  1.81T      0    868      0   107M
    S1_D3       -      -      0    868      0   107M
    S2_D3       -      -      0  1.02K      0   129M
  mirror     617M  1.81T      0    868      0   107M
    S1_D4       -      -      0    868      0   107M
    S2_D4       -      -      0    939      0   116M
  mirror     616M  1.81T      0    856      0   106M
    S1_D5       -      -      0    856      0   106M
    S2_D5       -      -      0    939      0   116M
----------  -----  -----  -----  -----  -----  -----

In the next test I add 12 more disks, for a total of 24, and the bandwidth to each disk is effectively cut in half.

zpool create -o ashift=12 -f PoolA mirror S1_D0 S2_D0 mirror S1_D1 S2_D1 \
  mirror S1_D2 S2_D2 mirror S1_D3 S2_D3 mirror S1_D4 S2_D4 \
  mirror S1_D5 S2_D5 mirror S1_D6 S2_D6 mirror S1_D7 S2_D7 \
  mirror S1_D8 S2_D8 mirror S1_D9 S2_D9 mirror S1_D10 S2_D10 \
  mirror S1_D11 S2_D11

                capacity     operations    bandwidth
pool         alloc   free   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
PoolA        65.2M  21.7T      0  4.67K      0   588M
  mirror     6.56M  1.81T      0    399      0  49.0M
    S1_D0        -      -      0    399      0  49.0M
    S2_D0        -      -      0    513      0  63.1M
  mirror     5.71M  1.81T      0    400      0  48.7M
    S1_D1        -      -      0    400      0  48.7M
    S2_D1        -      -      0    515      0  62.6M
  mirror     6.03M  1.81T      0    396      0  49.1M
    S1_D2        -      -      0    396      0  49.1M
    S2_D2        -      -      0    509      0  62.9M
  mirror     5.89M  1.81T      0    394      0  49.0M
    S1_D3        -      -      0    394      0  49.0M
    S2_D3        -      -      0    412      0  51.3M
  mirror     5.60M  1.81T      0    400      0  49.0M
    S1_D4        -      -      0    400      0  49.0M
    S2_D4        -      -      0    511      0  62.9M
  mirror     4.65M  1.81T      0    401      0  48.9M
    S1_D5        -      -      0    401      0  48.9M
    S2_D5        -      -      0    511      0  62.3M
  mirror     5.36M  1.81T      0    397      0  49.2M
    S1_D6        -      -      0    397      0  49.2M
    S2_D6        -      -      0    506      0  62.5M
  mirror     4.88M  1.81T      0    395      0  49.2M
    S1_D7        -      -      0    395      0  49.2M
    S2_D7        -      -      0    509      0  63.3M
  mirror     5.01M  1.81T      0    393      0  48.2M
    S1_D8        -      -      0    393      0  48.2M
    S2_D8        -      -      0    513      0  63.0M
  mirror     5.00M  1.81T      0    399      0  48.7M
    S1_D9        -      -      0    399      0  48.7M
    S2_D9        -      -      0    513      0  62.5M
  mirror     5.00M  1.81T      0    398      0  49.2M
    S1_D10       -      -      0    398      0  49.2M
    S2_D10       -      -      0    509      0  62.8M
  mirror     5.55M  1.81T      0    401      0  50.0M
    S1_D11       -      -      0    401      0  50.0M
    S2_D11       -      -      0    506      0  63.1M
-----------  -----  -----  -----  -----  -----  -----
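
For what it's worth, counting both sides of each mirror, the aggregate write bandwidth hitting the disks is roughly the same in both runs:

12 disks: 638 MB/s of pool writes x 2 mirror copies ≈ 1.28 GB/s of raw disk writes
24 disks: 588 MB/s of pool writes x 2 mirror copies ≈ 1.18 GB/s of raw disk writes

So the total looks capped at roughly 1.2 GB/s no matter how many spindles are in the pool.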

I'm hoping someone can shed some light on why adding more disks effectively cuts the per-disk performance.

ADDITIONAL REQUESTED INFORMATION

Hardware Summary

Server

Lenovo ThinkServer RD550, single 10-core Xeon, 256 GB of RAM, OS on RAID 1 on the 720ix controller.

Server HBA

LSI 9300-8e mpt3sas_cm0: LSISAS3008: FWVersion(12.00.00.00), ChipRevision(0x02), BiosVersion(06.00.00.00)

Disk Shelves

The disk shelves are Lenovo ThinkServer SA120s with dual SAS controllers and dual power supplies, cabled in a redundant fashion with two paths to each disk.

Disk Shelf Connectivity

The disk shelves are connected via 0.5-meter SAS cables and daisy-chained through the shelves, with a loop back to the controller at the end.

Drive Information

48 x 2 TB Seagate SAS drives, model ST2000NM0023. The drives are configured through multipath and each drive has redundant paths.
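
For reference, the path layout for each drive can be confirmed with the standard multipath listing:

# show every multipath device and the state of both of its paths
multipath -ll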

Software Summary

Operating System / Kernel

CentOS 7.3. Output from uname -a: Linux 4.9.9-1.el7.elrepo.x86_64 #1 SMP Thu Feb 9 11:43:40 EST 2017 x86_64 x86_64 x86_64 GNU/Linux

ZFS Tuning

/etc/modprobe.d/zfs.conf is currently a blank file. I haven't tried much here yet; sequential write performance seems like it should simply increase with more disks.

user56789
  • Please provide more details. Your hardware setup, interconnect information, tuning and zfs.conf all matter here. Who knows if you're running into a SAS expander or bandwidth issue. Tell us exactly what the hardware configuration is. Also, Linux distribution and version details may help. – ewwhite Mar 22 '17 at 07:15
  • Are you using the `multipath` daemon and the /dev/mapper/* devices?? – ewwhite Mar 22 '17 at 22:04
  • What's the output from `lspci` show? If the PCIe bus is running at v1 for some reason, you'll only get 2 GB/sec total to an 8-lane card like your LSI HBA. With overhead and latency, that could explain your apparent 1.2 GB/sec limit. (You may need to add `-vvv` to `lspci` to see the PCIe version that's actually running, if it's even possible for `lspci` to show it...) – Andrew Henle Mar 23 '17 at 15:04
  • I'm experiencing a similar issue. I think, I'm still in the early stages of troubleshooting. Were you ever able to crack this nut? – mwp May 17 '21 at 15:13
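
Following up on Andrew Henle's comment about the PCIe link, the negotiated speed and width of the HBA's slot can be checked (as root) with something along these lines; the SAS3008 match string is an assumption about how the card appears in lspci output:

# show the negotiated PCIe link state (LnkSta) for the LSI HBA
lspci -vvv | grep -iA40 'SAS3008' | grep -i 'LnkSta:'

A 9300-8e running at full speed should report 8GT/s, Width x8; anything lower would explain a hard bandwidth ceiling.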

2 Answers


The specification for the LSI 9300-8e HBA quotes 12 Gb/s (gigabit) throughput per SAS lane for connected storage (https://docs.broadcom.com/docs/12353459). Multiplied across the card's eight lanes, that works out to roughly 9600 MB/s of overall throughput.

Is there an overall I/O queue depth setting for the HBA (driver) in the OS that is throttling your I/O? That still wouldn't explain how the bandwidth is halved so precisely, though.

Your figures would make a lot of sense if only a single path or link were working for the SAS connection; is there any (bizarre) way only one link out of eight could be active? I am not aware of how wide or narrow SAS 'ports' (which are virtual rather than physical objects) are configured from their phys, but if the HBA isn't talking to your disk shelf properly, is there a fallback configuration option that might allow this?
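
One quick way to see how many SAS phys are actually up, and at what rate they negotiated, is the SAS transport class in sysfs (a sketch; exact phy names vary per system):

# print the negotiated link rate of every SAS phy the kernel knows about
grep . /sys/class/sas_phy/*/negotiated_linkrate

If only a handful report 12.0 Gbit and the rest show Unknown or a lower rate, that would point at a narrow port.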

Alex M

Pending more information...

You should provide details like:

  • The specific make/model/speed/interface of the disks. (are they SAS? SATA?)
  • The specific external JBOD enclosure in use.
  • How the enclosure is connected to the server.

Who knows? You may just be oversubscribed on the enclosure's SAS expander and unable to scale by adding drive spindles.

Of course, there's an element of tuning here, too. We'd need to see what modifications you've made to your /etc/modprobe.d/zfs.conf.

If that file is empty, you're probably missing out on a tremendous number of tunables available to your ZFS build.
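
If you want to see what is available, the full set of tunables exposed by the loaded module, along with their current values, can be dumped from sysfs:

# list every ZFS module parameter and its current value
grep . /sys/module/zfs/parameters/*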

Can you also explain what OS is in use? Distribution, kernel and version.


Just follow my Linux+ZFS HA guide.

You'll also want to tune your zfs.conf:

Here's mine:

# ARC size cap (left commented out here)
#options zfs zfs_arc_max=51540000000
# per-vdev queue depths for scrub I/O
options zfs zfs_vdev_scrub_min_active=24
options zfs zfs_vdev_scrub_max_active=64
# per-vdev queue depths for synchronous writes and reads
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_read_max_active=32
# per-vdev queue depths for asynchronous reads
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_read_max_active=32
# maximum in-flight scrub I/Os per top-level vdev
options zfs zfs_top_maxinflight=320
# commit a transaction group at least every 15 seconds
options zfs zfs_txg_timeout=15
# block-device I/O scheduler used for whole-disk vdevs
options zfs zfs_vdev_scheduler=deadline
# leave prefetch enabled
options zfs zfs_prefetch_disable=0
# L2ARC feed rate and scan headroom
options zfs l2arc_write_max=8388608
options zfs l2arc_headroom=12
# dirty data limit as a percentage of RAM
options zfs zfs_dirty_data_max_percent=40
# per-vdev queue depths for asynchronous writes
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=32
# ZIL behavior: indirect-write threshold and SLOG commit limit
options zfs zfs_immediate_write_sz=131072
options zfs zil_slog_limit=536870912
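
Values in /etc/modprobe.d/zfs.conf only take effect when the module is loaded, so plan on a reboot or module reload. Most of these can also be tried live first by writing to sysfs, for example (zfs_txg_timeout used purely as an illustration):

# try a value at runtime before committing it to zfs.conf
echo 15 > /sys/module/zfs/parameters/zfs_txg_timeout
cat /sys/module/zfs/parameters/zfs_txg_timeout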
ewwhite