
Attention, please. Long read.
During initial performance tests of the Hitachi Ultrastar 7K6000 drives that I'm planning to use in my Ceph setup, I noticed a strange thing: write performance is better when the disk's write cache is disabled.


I use fio:

fio --filename=/dev/sda --direct=1 --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=4krandw

When write cache is disabled:

hdparm -W 0 /dev/sda 

4krandw: (groupid=0, jobs=1): err= 0: pid=6368: Thu Jun 22 07:36:44 2017
write: io=63548KB, bw=1059.9KB/s, iops=264, runt= 60003msec
clat (usec): min=473, max=101906, avg=3768.57, stdev=11923.0

When write cache is enabled:

hdparm -W 1 /dev/sda

4krandw: (groupid=0, jobs=1): err= 0: pid=6396: Thu Jun 22 07:39:14 2017
write: io=23264KB, bw=397005B/s, iops=96, runt= 60005msec
clat (msec): min=1, max=48, avg=10.30, stdev= 4.12

Relevant hardware details:

  • Server: Supermicro 5018D8-AR12L
  • Storage controller: LSI SAS2116 in IT mode (integrated software solution), i.e. no caching and no logical volume management
  • Disks: Hitachi Ultrastar 7K6000 4 TB (HUS726040ALE614)
  • OS: Ubuntu 16.04.2, kernel 4.4.0-81-generic

Unfortunately, I cannot think of any reasonable explanation for this behaviour. Quick summary:

  • Write cache disabled: 264 IOPS, 3.768ms commit latency (high std deviation, though)
  • Write cache enabled: 96 IOPS, 10.3ms commit latency
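For anyone who wants to reproduce the comparison, a small wrapper along these lines should work (an untested sketch; the device path and runtime are simply the values used above, and the test destroys any data on the device):

#!/bin/bash
# Run the same 4k random-write job with the write cache disabled, then enabled.
DEV=/dev/sda                               # test device (same disk as above)
for wc in 0 1; do
    hdparm -W $wc "$DEV"                   # 0 = write cache off, 1 = on
    fio --filename="$DEV" --direct=1 --sync=1 --rw=randwrite --bs=4k \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based \
        --group_reporting --name="4krandw-wc$wc"
done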

UPD: I have tested the disk with a direct connection to a SATA port on the motherboard (a separate SATA controller, not the LSI2116) and nothing changed: the results are the same. So I presume it is not the LSI2116 SW controller that causes these strange results.

UPD2: Interestingly, the performance gain from disabling the cache is smaller for sequential operations, but it is stable. Here's an example:

fio --filename=/dev/sdl --direct=1 --sync=1 --rw=write --bs=16M --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=16M-wr 


Write cache enabled:

16M-wr: (groupid=0, jobs=1): err= 0: pid=2309: Fri Jun 23 11:52:37 2017
  write: io=9024.0MB, bw=153879KB/s, iops=9, runt= 60051msec
    clat (msec): min=86, max=173, avg=105.37, stdev= 9.64

Write cache disabled:

16M-wr: (groupid=0, jobs=1): err= 0: pid=2275: Fri Jun 23 11:45:22 2017  
  write: io=10864MB, bw=185159KB/s, iops=11, runt= 60082msec
    clat (msec): min=80, max=132, avg=87.42, stdev= 6.84

And this is where it becomes interesting, because the difference between the results with cache enabled and cache disabled is exactly what HGST claims in their datasheet:
https://www.hgst.com/sites/default/files/resources/Ultrastar-7K6000-DS.pdf

• Compared to prior generation 7K4000
...
Up to 3X faster random write performance using media cache technology
25% faster sequential read/write performance

It still does not explain why performance is better with the write cache disabled; however, it does look like with the write cache enabled I get performance comparable to the previous-generation 7K4000. With the write cache disabled, random write performance is ~2.7x faster (264 vs 96 IOPS) and sequential is ~1.2x faster.
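Just to show where those factors come from, here is the arithmetic on the numbers measured above (plain awk, nothing drive-specific):

# random write: 264 IOPS (cache off) vs 96 IOPS (cache on)
# sequential:   185159 KB/s (cache off) vs 153879 KB/s (cache on)
awk 'BEGIN {
    printf "random write ratio:     %.2f\n", 264 / 96
    printf "sequential write ratio: %.2f\n", 185159 / 153879
}'
# prints roughly 2.75 and 1.20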

UPD3, a hypothesis: Newer Hitachi Ultrastar drives have a feature called Media Cache. It is an advanced non-volatile caching technique, and here is how it works (as I understand it, of course):

  • First, data is written into the DRAM cache.
  • Next, the drive has many reserved areas on each platter, physically located where access is fastest. These areas are essentially the Media Cache storage, and they are used as a non-volatile second-stage cache. Data from the DRAM buffer is accumulated and flushed with a high queue depth into the Media Cache; this minimizes head movement and provides additional reliability as well as a speed gain.
  • Only after that is the data written to its actual storage location on the platter.

So, Media Cache is a two-stage writeback cache, and I think a write operation is considered complete only after the flush to the Media Cache is done.
An interesting technique, I must admit. My hypothesis is that when we disable write caching with hdparm -W0, only the Media Cache is disabled: data is then cached only in DRAM and flushed directly to the platters.
Although the Media Cache surely provides a great advantage in general, during synchronous writes we have to wait for the write to the Media Cache area to complete, whereas with it disabled a write is considered complete as soon as the data lands in the disk's DRAM buffer. Much faster. At low queue depths the DRAM cache has enough room to absorb writes without speed degradation; at larger queue depths, however, when MANY flushes to the platters have to happen constantly, the situation is different. I have performed two tests with QD=256.

fio --filename=/dev/sda --direct=1 --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=256 --runtime=180 --time_based --group_reporting --name=4krandwrite

hdparm -W0 /dev/sda (write cache disabled)
4krandwrite: (groupid=0, jobs=1): err= 0: pid=3176: Wed Jun 28 10:11:15 2017
  write: io=62772KB, bw=357093B/s, iops=87, runt=180005msec
    clat (msec): min=1, max=72, avg=11.46, stdev= 4.95

hdparm -W1 (write cache enabled)
4krandwrite: (groupid=0, jobs=1): err= 0: pid=3210: Wed Jun 28 10:14:37 2017
  write: io=70016KB, bw=398304B/s, iops=97, runt=180004msec
    clat (msec): min=1, max=52, avg=10.27, stdev= 3.99

So, we clearly see that enabling the write cache gives an ~11.5% advantage in IOPS and a correspondingly lower commit latency. It looks like my hypothesis is correct and hdparm controls only the Media Cache, not the DRAM buffer, and at higher queue depths the Media Cache really pays for itself.

This is not the case for sequential operations, though.

fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=16M --numjobs=1 --iodepth=256 --runtime=180 --time_based --group_reporting --name=16Mseq

hdparm -W0 /dev/sda (write cache disabled)
16Mseq: (groupid=0, jobs=1): err= 0: pid=3018: Wed Jun 28 09:38:52 2017
  write: io=32608MB, bw=185502KB/s, iops=11, runt=180001msec
    clat (msec): min=75, max=144, avg=87.27, stdev= 6.58

hdparm -W1 /dev/sda (write cache enabled)
16Mseq: (groupid=0, jobs=1): err= 0: pid=2986: Wed Jun 28 09:34:00 2017
  write: io=27312MB, bw=155308KB/s, iops=9, runt=180078msec
    clat (msec): min=83, max=165, avg=104.44, stdev=10.72
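Putting the two QD=256 comparisons side by side (the numbers are taken from the fio output above; this is just awk arithmetic):

awk 'BEGIN {
    # random 4k, QD=256: 97 IOPS (cache on) vs 87 IOPS (cache off)
    printf "random 4k:  cache on is %+.1f%% in IOPS\n", (97 / 87 - 1) * 100
    # sequential 16M, QD=256: 155308 KB/s (cache on) vs 185502 KB/s (cache off)
    printf "sequential: cache on is %+.1f%% in bandwidth\n", (155308 / 185502 - 1) * 100
}'
# prints roughly +11.5% and -16.3%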

So, I guess Media Cache provides a speed advantage for random write loads, while for sequential writes it may serve mainly as an additional reliability mechanism.



UPD4 (Looks like I've got an answer)
I have contacted HGST support and they clarified that on the 7K6000 the Media Cache is active only when the write cache (DRAM) is disabled. So it looks like at low queue depths the Media Cache is actually faster than the DRAM cache. I guess this is because the Media Cache writes data sequentially into its cache areas regardless of the IO pattern, which greatly reduces the required head movement and leads to better performance. I would still like to know more about Media Cache, so I am not answering my own question yet; instead, I have asked support for more technical information and will update this question if I get any.
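For completeness, what the drive itself reports about its caching can be cross-checked with more than one tool; something like the following should show whether -W0/-W1 really toggles the WCE bit (a sketch, the device path is just an example and sdparm/smartctl need to be installed):

DEV=/dev/sda
hdparm -W "$DEV"           # current write-caching state as seen by hdparm
sdparm --get=WCE "$DEV"    # WCE bit from the caching mode page
smartctl -g wcache "$DEV"  # the same setting as reported by smartmontools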


I will still greatly appreciate any suggestions, comments or alternative explanations. Thanks in advance!

J''
  • Please use a more relevant IO benchmarking tool than `hdparm`: https://serverfault.com/q/254913/37681 – HBruijn Jun 22 '17 at 13:16
  • I use fio, actually. Please, look at the beginning of the question. hdparm is placed before fio output just to illustrate whether write cache is enabled or disabled for this test. – J'' Jun 22 '17 at 13:52
  • My apologies, when I quickly scanned the question the formatted commands and output gave the appearance of yet another `hdparm -t` gets unexpected results question. – HBruijn Jun 22 '17 at 14:24
  • Total shot in the dark: perhaps the algorithm used for executing writes from the buffer doesn't work as well when the command queue depth is larger? Alternately, the enabling of the buffer may result in occasional random writes having reduced latency and higher burst throughput at the expense of overall lower throughput (for example, maybe the device's non-volatile cache has slow erase times). A test that throttles the write rate can test these hypotheses (I think manual throttling may be required since `fio`'s `--iodepth` is the OS's queue depth, not the device depth). – lungj Jun 27 '17 at 21:45
  • Thank you, good assumption. Unfortunately, I can't say I'm a storage engineer, so, can you please advice on how can I perform a test (have no idea since, as you said, `--iodepth` is not what we need here) – J'' Jun 28 '17 at 07:13
  • I see you have --sync=1. That will bypass the RAM cache if it's turned on. – Genericname12 Jul 21 '17 at 14:45
  • I see the same issue when I disable the write cache on my Intel S3700 SSD or my Micron 500DC. I get over twice as many iops when the write cache is disabled. – Olav Grønås Gjerde Jun 14 '18 at 23:57
  • Note that you are using [`iodepth` while using a synchronous ioengine, so you are likely not achieving the effect you desire](https://fio.readthedocs.io/en/latest/fio_man.html#cmdoption-arg-iodepth)... – Anon Sep 13 '18 at 22:15

1 Answer


It seems that more recent HGST drives behave differently, with hdparm -W0/-W1 controlling both the DRAM cache and the MediaCache. Moreover, the MediaCache seems to be active on WCE/-W1 (cache enabled) rather than on WCD/-W0 (cache disabled).

Let's see how this HGST HUS722T2TALA604 disk behaves on some fio runs.

disabled caches (hdparm -W0) and direct writes

[root@singularity ~]# fio --name=test --filename=/dev/sda --io_size=4M --direct=1 --rw=randwrite
...
write: IOPS=73, BW=295KiB/s (302kB/s)(4096KiB/13908msec)
...

disabled caches (hdparm -W0), direct + sync writes

[root@singularity ~]# fio --name=test --filename=/dev/sda --io_size=4M --direct=1 --sync=1 --rw=randwrite
...
write: IOPS=73, BW=295KiB/s (302kB/s)(4096KiB/13873msec)
...

enabled caches (hdparm -W1), direct + sync writes

[root@singularity ~]# fio --name=test --filename=/dev/sda --io_size=4M --direct=1 --sync=1 --rw=randwrite
...
write: IOPS=127, BW=510KiB/s (523kB/s)(4096KiB/8027msec)
...

Considerations:

  1. From the direct vs. direct+sync runs with caches disabled we can see that hdparm -W0 disables both the DRAM buffer and the MediaCache; otherwise, the direct results would be significantly higher than the direct+sync ones. These results are perfectly in line with a seek-constrained 7200 RPM disk at ~70 IOPS (see the back-of-the-envelope calculation after this list).

  2. With caches enabled, performance is much better, with IOPS almost doubling. Since sync writes cannot be satisfied by the volatile DRAM buffer alone, this means the MediaCache is at work here.
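A back-of-the-envelope check of point 1 (the figures are assumptions: half a revolution of rotational latency plus a typical average seek time for a 7.2K enterprise drive):

awk 'BEGIN {
    rot  = (60000 / 7200) / 2   # average rotational latency, ms (half a revolution)
    seek = 8.5                  # assumed average seek time, ms
    svc  = rot + seek           # per-IO service time for a small random write, ms
    printf "~%.1f ms per IO -> ~%.0f IOPS\n", svc, 1000 / svc
}'
# prints ~12.7 ms per IO, ~79 IOPS, the same ballpark as the measured ~73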

So, while some other NVRAM technologies operate with the WCD/-W0 (write cache disabled) setting, it seems that MediaCache requires WCE/-W1 to work.

shodanshok