
We are setting up new storage for an HPC compute cluster that we manage for applied statistics, bioinformatics, and genomics.

Configuration

We have a main enclosure with a Dell EMC ME4084 (84 x 12 TB 7,200 rpm HDDs) and an expansion enclosure with a Dell EMC ME484 (28 x 12 TB). The ME4084 provides ADAPT distributed RAID (similar to RAID 6) and dual hardware controllers.

The file server runs CentOS 7. The storage is connected to the file server with two SAS cables. Each LUN corresponds to a 14-disk ADAPT disk group, and its two SAS paths appear as the devices sdb and sdj. The examples below are for LUN ID 0.
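For reference, the two SCSI paths can be cross-checked with lsscsi (assuming the lsscsi package is installed); the output below is abridged and illustrative, but the SCSI addresses correspond to the 1:0:0:0 and 1:0:1:0 paths shown in the multipath output further down:

$ lsscsi
[1:0:0:0]  disk  DellEMC  ME4  ...  /dev/sdb
[1:0:1:0]  disk  DellEMC  ME4  ...  /dev/sdj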

We configured multipath as follows for the active-active configuration:

$ cat /etc/multipath.conf
defaults {
    path_grouping_policy multibus
    path_selector "service-time 0"
}

$ multipath -ll
mpatha (3600c0ff000519d6edd54e25e01000000) dm-6 DellEMC ,ME4
size=103T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 1:0:0:0  sdb 8:16  active ready running
  `- 1:0:1:0  sdj 8:144 active ready running
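For completeness: after editing /etc/multipath.conf, multipathd has to re-read it before the new path grouping takes effect. On CentOS 7 this is typically done with one of the following (a generic sketch, not necessarily the exact commands we ran):

$ multipathd -k'reconfigure'   # ask the running daemon to re-read the configuration
$ multipath -r                 # or force a reload of the multipath maps
$ multipath -ll                # confirm the new path grouping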

The failover configuration:

$ cat /etc/multipath.conf
defaults {
    path_grouping_policy failover
    path_selector "service-time 0"
}

$ multipath -ll
mpatha (3600c0ff000519d6edd54e25e01000000) dm-6 DellEMC ,ME4
size=103T features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 1:0:0:0  sdb 8:16  active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 1:0:1:0  sdj 8:144 active ready running

We verified that writing to mpatha generates I/O on both sdb and sdj in the active-active configuration, and only on sdb in the failover (active/enabled) configuration. We then striped mpatha and a second LUN, mpathb, into a logical volume and formatted it with XFS.
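For context, the logical volume was laid out along these lines (a minimal sketch; the volume group and LV names, stripe size, and extent allocation are placeholders rather than our exact commands):

# watch per-path I/O in a second terminal to confirm which paths receive writes
$ iostat -xm 1 sdb sdj

# stripe the two multipath LUNs into one logical volume and format it with XFS
$ pvcreate /dev/mapper/mpatha /dev/mapper/mpathb
$ vgcreate vg_me4 /dev/mapper/mpatha /dev/mapper/mpathb
$ lvcreate -n lv_data -i 2 -I 1m -l 100%FREE vg_me4   # -i 2 stripes across both PVs
$ mkfs.xfs /dev/vg_me4/lv_data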

Test Setup

We benchmarked I/O performance using fio under the following workloads (an example invocation is sketched after the list):

  • Single 1MiB random read/write process
  • Single 4KiB random read/write process
  • 16 parallel 32KiB sequential read/write processes
  • 16 parallel 64KiB random read/write processes
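
As an example, the first workload corresponds to an fio invocation along the following lines (the target directory, file size, and runtime are illustrative assumptions, not our exact job definition):

$ fio --name=1-1mb-randrw --directory=/mnt/scratch --size=10G \
      --rw=randrw --rwmixread=50 --bs=1M --ioengine=libaio --direct=1 \
      --iodepth=1 --numjobs=1 --runtime=60 --time_based --group_reporting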

Test Results

                       Failover           Active-Active
                 -------------------   -------------------
   Workload        Read       Write      Read       Write
--------------   --------   --------   --------   --------
1-1mb-randrw     52.3MB/s   52.3MB/s   51.2MB/s   50.0MB/s
1-4kb-randrw     335kB/s    333kB/s    331kB/s    330kB/s
16-32kb-seqrw    3181MB/s   3181MB/s   2613MB/s   2612MB/s
16-64kb-randrw   98.7MB/s   98.7MB/s   95.1MB/s   95.2MB/s

I am reporting only one set of tests, but the results are consistent across replicates (n=3) and robust to the choice of path_selector.

Is there any reason active-active cannot at the very least match the performance of failover? I don't know whether the issue lies with the workloads or with the multipath configuration. The difference was even more staggering (about 20%) when we used a linear logical volume instead of a striped one. I'm really curious to see whether I overlooked something obvious.

Many thanks,

Nicolas

  • Could you try to use two HBAs and monitor the service time of LUN 0 while testing with active/active? – c4f4t0r Jul 17 '20 at 09:07
  • Thanks, @c4f4t0r! I was wrong in the other post - there are in fact 2 ports on 1 HBA card: one per SAS cable. `sdb` and `sdj` both point to LUN 0 but via HBA ports 1 and 2, respectively. I'll edit the post to clarify. Using `round-robin` or `queue-length` instead of `service-time` yields a similar discrepancy. – Nicolas De Jay Jul 17 '20 at 13:55

1 Answer


As you are using HDDs, a single controller is already plenty fast for your backend disks. Adding another controller in active/active mode means no additional IOPS (the HDDs are the bottleneck), but more overhead at the multipath level, hence the reduced performance.

In other words: you will saturate the HDDs well before the CPU of the first controller, so leave them in active/passive mode. Moreover, I would try a single 28-disk array and benchmark it to see whether it provides more or less performance than the current 2x 14-disk setup.

shodanshok
    https://serverfault.com/questions/1024398/what-are-the-best-practices-for-device-mapper-multipath – c4f4t0r Jul 17 '20 at 08:28
  • @c4f4t0r he really needs to test with a representative workload to make an informed choice. – shodanshok Jul 17 '20 at 09:01
  • @shodanshok Unfortunately it seems that the Dell EMC ME4084 only supports volumes of up to ~140TB. We are intending to create file systems of up to 200-400TB, so it seems like our only choice is to go with two volumes of ~100TB and stripe them. What do you think? Here is the link to the documentation: https://www.dell.com/support/manuals/ca/en/cadhs1/powervault-me4084/me4_series_ag_pub/system-configuration-limits?guid=guid-38076a21-22af-4fb0-876a-80ef259ec14e&lang=en-us – Nicolas De Jay Jul 23 '20 at 16:43
  • If the Dell EMC unit does not support a single big volume, then you have two possibilities: a) stripe the two volumes, or b) concatenate them via plain LVM. You would have to carefully benchmark your workload to make an informed choice, but I would expect the concatenated solution to give better real-world aggregate IOPS than software striping. – shodanshok Jul 23 '20 at 18:11
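
(For concreteness, the two layouts contrasted in this comment would look roughly as follows in LVM; the volume group and LV names are placeholders:)

# option a) stripe across both LUNs
$ lvcreate -n lv_striped -i 2 -I 1m -l 100%FREE vg_me4
# option b) concatenate (linear): data fills mpatha before spilling onto mpathb
$ lvcreate -n lv_linear -l 100%FREE vg_me4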