
I have two servers A and B with the following configuration:

  • A: 4TB HDDs, with RAID 1 (MegaRAID SAS 2008), 128MB cache, no BBU, write-through mode, 7.2k RPM, manufacturer A.
  • B: 1.5TB HDDs, with RAID 1 (MegaRAID SAS 3108), 64MB cache, with BBU, but write-through mode, 10.5k RPM, manufacturer B.

I run the following benchmark on a single RAIDed partition: `iozone -a -s 10240 -r 4 -+r`
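For context, my understanding of the flags (from the iozone man page):

    # iozone flags used above:
    #   -a        full automatic mode (run all tests)
    #   -s 10240  file size of 10240 kB (10 MB)
    #   -r 4      record (request) size of 4 kB
    #   -+r       open files with O_RSYNC|O_SYNC, i.e. fully synchronous writes
    iozone -a -s 10240 -r 4 -+r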

Results from A (excerpt):

                                                            random  random    bkwd   record   stride
          kB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
       10240       4     108     474  4193564  6667334 6556395     701 4058822      475  3653175  2303202  2616201 6785306  6101840

Results from B (excerpt):

                                                            random  random    bkwd   record   stride
          kB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
       10240       4    3332   46961  5478410  6836065 4994841    2951 2853077      728  2299133  1722202  2008983 4549365  4712594

Both servers have write-through caching enabled, but I am unable to root-cause why write throughput on server A (108 kB/sec) is horribly slow compared to server B (3332 kB/sec), assuming I am interpreting the results correctly.

What could be the reason? Both servers otherwise have identical file-system options (ext4 with the same default options).

Could it just be the case that disks from manufacturer B are superior to those from A for workloads involving a lot of synchronous writes?

Thanks.

Vimal
  • Errr ... this might be obvious, but the two servers have the same type of RAID controller, don't they? Entry level RAID controllers (like for example the PERC H310) are known for this horrible write performance. Even in RAID1. – s1lv3r Mar 30 '16 at 13:35
  • Not obvious, but thanks for asking the question! :) Servers have different generations of RAID controllers, but both are from the same vendor (MegaRAID). I will update the question with more details on this! – Vimal Mar 30 '16 at 14:13

1 Answer


Regarding the measured ~30x difference between your results: following up on our discussion in the comments, it turned out that `MegaCli64 -LDGetProp -DskCache -Lall -aAll` showed that setup B had the disk drive cache enabled by default, while it was disabled on setup A.

Using `MegaCli64 -LDSetProp -DisDskCache -Immediate -Lall -aAll` to disable it on setup B resulted in both systems showing similar performance.
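Put together, the check-and-disable sequence looks like this (a sketch; the exact output wording varies with firmware and MegaCli version):

    # Query the drive-level (disk) cache setting of all logical drives:
    MegaCli64 -LDGetProp -DskCache -Lall -aAll
    # Typical output per volume: "Disk Write Cache : Disk's Default"
    # ("Disk's Default" means the drive firmware decides, so it can differ per vendor)

    # Disable the drive cache on all volumes, effective immediately:
    MegaCli64 -LDSetProp -DisDskCache -Immediate -Lall -aAll

    # Re-run the query to confirm the change:
    MegaCli64 -LDGetProp -DskCache -Lall -aAll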

Is it safe to run the RAID with disk drive cache enabled?

Running a RAID with the disk drive cache enabled is actually similar to running a RAID controller with a non-BBU-backed volatile cache and write caching enabled (forced write-back mode). It enhances performance, but at the same time it increases the possibility of data loss and data inconsistency in the event of a power failure.

If you want to avoid this risk while still having decent I/O performance, it is advisable to have a controller with a BBU-backed cache and to configure your volume in write-back mode with disk caching disabled.
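With MegaCli, that configuration would look roughly like this (a sketch, assuming a present and healthy BBU; verify against your controller's documentation):

    # Put all volumes into write-back mode (controller cache absorbs writes):
    MegaCli64 -LDSetProp WB -Lall -aAll
    # Fall back to write-through automatically if the BBU fails or is charging:
    MegaCli64 -LDSetProp NoCachedBadBBU -Lall -aAll
    # Keep the unprotected on-drive caches disabled:
    MegaCli64 -LDSetProp -DisDskCache -Lall -aAll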

The difference between your two RAID controllers

I don't know if you were already aware, but there is a whole middle ground between software and hardware RAID (this is an interesting article on the subject).

In the end, the MegaRAID SAS 2008 is more or less an HBA or I/O controller with added RAID capability, while the MegaRAID SAS 3108 is a real RAID Controller™ (also called a ROC, or RAID-on-Chip), which has a dedicated processor for handling the RAID calculations.

The SAS 2008 is especially known for horrible write performance with some OEM firmware (like the Dell firmware in the PERC H310, which I mentioned in the comments).

Synchronous mode in particular, combined with your chosen record length and file size, seems to produce really poor results with software/fake RAID.

For reference, this is what I get on my workstation using 10k WD VelociRaptors in software RAID 1:

                                                    random  random    bkwd   record   stride                                   
      KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
   10240       4     182     181  1804774  2127084 2110984     167 1673159      153  1760968   954589  1203989 2022512  2062824

If you are running in synchronous mode (O_SYNC), your result for A therefore seems reasonable in terms of what software/fake RAID can deliver.
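If you want a quick cross-check without iozone, a plain `dd` with `oflag=sync` exercises the same synchronous 4 kB write path (a rough sanity test only; `/mnt/raid` is a placeholder for a mount point on the array):

    # 10 MB of 4 kB O_SYNC writes, roughly comparable to the iozone write column:
    dd if=/dev/zero of=/mnt/raid/syncwrite.bin bs=4k count=2560 oflag=sync
    rm /mnt/raid/syncwrite.bin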


Does write-through cache mode cause a performance degradation of the array over time?

I don't think so. With the write cache activated, the controller is able to perform certain optimizations on the pending write operations.

For example, this description of the cache operation is taken from the whitepaper for HP Smart Array controllers:

The write cache will typically fill up and remain full most of the time in high-workload environments. The controller uses this opportunity to analyze the pending write commands to improve their efficiency. The controller can use write coalescing that combines small writes to adjacent logical blocks into a single larger write for quicker execution. The controller can also perform command reordering, rearranging the execution order of the writes in the cache to reduce the overall disk latency.

As you can read, the cache is used to further enhance the write performance of the array, but this does not seem to have any impact on the performance of subsequent write or read operations.

Regarding disk fragmentation: this is a file-system/OS-level problem. The RAID controller, which operates at the block level, isn't able to optimize file-system fragmentation at all, so there is no difference whether it operates in write-through or write-back mode.
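If you ever suspect fragmentation is hurting you, it has to be checked and fixed at the file-system level, e.g. with the ext4 tools from e2fsprogs (`/mnt/raid/somefile` is a placeholder):

    # Show the extent layout of a file; many small extents indicate fragmentation:
    filefrag -v /mnt/raid/somefile
    # Online defragmentation of a file (or a whole directory) on ext4:
    e4defrag /mnt/raid/somefile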

s1lv3r
  • Thanks for the details! I ran that specific iozone command as it illustrates the performance difference clearly, and it's representative of our application workload. In fact, for raw throughput, A is slightly better than B. And I did verify that the cache policy is write-through using the MegaCli64 tool. There is another possibility: controller B is perhaps lying about write-through, which I think can only be confirmed via extensive "pull the plug" testing. – Vimal Mar 30 '16 at 16:18
  • @Vimal I would not say it's lying; I would rather guess that the drive cache is still enabled. Regarding your workload: if you have a workload which demands a lot of synchronous write operations, IMHO there is no way around either a proper controller with a BBU-backed (or flash-backed) cache or SSDs. – s1lv3r Mar 30 '16 at 17:01
  • @Vimal I just tested on a MegaRAID SAS 9260-4i and enabling/disabling the drive cache in the volume configuration changes the write performance from 98 to 5133 Kbytes/sec for me. Of course having the HDD write cache enabled while the controller is configured to be in write-through mode doesn't really make any sense to begin with, but it would explain the difference. – s1lv3r Mar 30 '16 at 17:23
  • That's interesting. Which command do you use to check if the drive (not the controller, as you point out) has write cache enabled? I tried using `hdparm -I /dev/sda` -- it doesn't seem useful (`SG_IO: bad/missing sense data, sb[]:`, followed by some limited info). – Vimal Mar 30 '16 at 20:41
  • It's a configuration setting of the RAID volume. `MegaCli64 -LDInfo -L0 -a0` (assuming first controller & first volume on that controller) should show a line `Disk Cache Policy`. While I configured it via the RAID BIOS it should also be possible to set it via MegaCli somehow. As you already noticed hdparm doesn't really work on RAID devices. It only sees the virtual drive presented by the RAID controller and not the physical drives which are hidden behind that abstraction. – s1lv3r Mar 30 '16 at 22:57
  • Yep, I figured `hdparm` didn't quite work, so I used a UI that exposes similar information; it printed the same info as your `MegaCli64` command. It says `Disk Cache Policy : Disk's Default`. I couldn't find the disk's default cache policy configured explicitly, though. I am quite new to this setup, so thanks for being patient with me! – Vimal Mar 30 '16 at 23:34
  • `Disk's Default` could mean both, yes or no, depending on the disk firmware. You could try `MegaCli64 -LDSetProp -DisDskCache -Immediate -Lall -aAll` to make sure it is disabled. If that doesn't work the same setting is available to be changed in the volume configuration in the RAID BIOS (press `Ctrl + H` in boot process). – s1lv3r Mar 31 '16 at 08:51
  • Thank you so much @s1lv3r. When I used `sudo MegaCli64 -LDGetProp -DskCache -Lall -aAll` to query the disk cache, I saw it still said `Disk Write Cache : Disk's Default`. When I used your command to force-disable it, server B's performance was on par with server A's, and I got around 146kB/sec O_SYNC performance. And on server A, when I enable disk caching, it actually outperforms server B! As you said, it's weird the disk's caching is set to true by default on one vendor. :) Is it safe? – Vimal Mar 31 '16 at 12:48
  • I've accepted your answer. Could you edit it slightly in light of the new findings (that the disk caching, not RAID caching, was disabled by default on server A and enabled by default on server B)? – Vimal Mar 31 '16 at 12:49
  • Thank you! - I updated my answer to include the new information. :) – s1lv3r Mar 31 '16 at 13:36
  • Thanks again @s1lv3r. I have one other question (maybe I will create a new post if it's complicated): could disk write-through cache performance degrade over time? (e.g., more seeks caused by the file system due to a fragmented file layout on disk, etc.) – Vimal Apr 01 '16 at 13:29
  • @Vimal I edited my answer and tried to expand a little bit on your question, if that doesn't suffice (I'm not totally sure I understand correctly), you may really want to post it as a new question. ;-) – s1lv3r Apr 01 '16 at 15:38
  • Thanks @s1lv3r. I will gather more details, do some research, and try to get a reproducer for the second issue I am seeing before asking a question! – Vimal Apr 04 '16 at 13:07