Attention, please. Long read.
During initial performance tests of the Hitachi Ultrastar 7K6000 drives that I'm planning to use in my Ceph setup, I noticed something strange: write performance is better when the disk write cache is disabled.
I use fio:
fio --filename=/dev/sda --direct=1 --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=4krandw
When write cache is disabled:
hdparm -W 0 /dev/sda
4krandw: (groupid=0, jobs=1): err= 0: pid=6368: Thu Jun 22 07:36:44 2017
write: io=63548KB, bw=1059.9KB/s, iops=264, runt= 60003msec
clat (usec): min=473, max=101906, avg=3768.57, stdev=11923.0
When write cache is enabled:
hdparm -W 1 /dev/sda
4krandw: (groupid=0, jobs=1): err= 0: pid=6396: Thu Jun 22 07:39:14 2017
write: io=23264KB, bw=397005B/s, iops=96, runt= 60005msec
clat (msec): min=1, max=48, avg=10.30, stdev= 4.12
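To compare runs without copying numbers by hand, the fio summary lines can be parsed with a small script. This is only a sketch: the regular expression is written against the fio 2.x output format shown in this post, the function and variable names are my own, and the 1024-based unit table is an assumption (it is consistent with the second run, where 23264KB over 60.005 s gives about 397005 B/s).

```python
import re

# Matches the fio 2.x write summary line as it appears in this post, e.g.
#   write: io=63548KB, bw=1059.9KB/s, iops=264, runt= 60003msec
SUMMARY_RE = re.compile(
    r"write:\s*io=(?P<io>[\d.]+)(?P<io_unit>[KMG]?B),\s*"
    r"bw=(?P<bw>[\d.]+)(?P<bw_unit>[KMG]?B)/s,\s*"
    r"iops=(?P<iops>\d+),\s*runt=\s*(?P<runt>\d+)msec"
)

# Assumption: this fio version reports 1024-based units (KB = 1024 bytes).
UNIT_BYTES = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_summary(line: str) -> dict:
    """Extract total I/O, bandwidth, IOPS and runtime from one summary line."""
    m = SUMMARY_RE.search(line)
    if m is None:
        raise ValueError(f"not a fio write summary line: {line!r}")
    return {
        "io_bytes": float(m["io"]) * UNIT_BYTES[m["io_unit"]],
        "bw_bytes_per_s": float(m["bw"]) * UNIT_BYTES[m["bw_unit"]],
        "iops": int(m["iops"]),
        "runtime_ms": int(m["runt"]),
    }


if __name__ == "__main__":
    cache_off = parse_summary(
        "write: io=63548KB, bw=1059.9KB/s, iops=264, runt= 60003msec"
    )
    cache_on = parse_summary(
        "write: io=23264KB, bw=397005B/s, iops=96, runt= 60005msec"
    )
    print("cache off:", cache_off)
    print("cache on: ", cache_on)
```

The same function also handles the later sequential and QD=256 summary lines, since they use the same layout with MB units.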
Relevant hardware details:
- Server: Supermicro 5018D8-AR12L
- Storage controller: LSI2116 in IT mode (integrated SW solution), working without any caching or logical volume management
- Disks: Hitachi Ultrastar 7K6000 4TB (HUS726040ALE614)
- OS: Ubuntu 16.04.2, kernel 4.4.0-81-generic
Unfortunately, I cannot think of any reasonable explanation for this behaviour. Quick summary:
- Write cache disabled: 264 IOPS, 3.768ms commit latency (high std deviation, though)
- Write cache enabled: 96 IOPS, 10.3ms commit latency
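As a sanity check on these two numbers: with iodepth=1 and numjobs=1 only one write is in flight at a time, so IOPS should be roughly the reciprocal of the average completion latency. A minimal Python sketch, using only figures copied from the fio output above (the helper name is mine, for illustration):

```python
# Sanity check: with a single outstanding I/O (iodepth=1, numjobs=1),
# IOPS is approximately 1 / average completion latency.
# All figures are copied from the fio output above.


def expected_iops(avg_clat_ms: float) -> float:
    """IOPS implied by an average completion latency given in milliseconds."""
    return 1000.0 / avg_clat_ms


runs = {
    "write cache disabled": {"avg_clat_ms": 3.76857, "reported_iops": 264},
    "write cache enabled": {"avg_clat_ms": 10.30, "reported_iops": 96},
}

for name, run in runs.items():
    implied = expected_iops(run["avg_clat_ms"])
    print(f"{name}: implied {implied:.1f} IOPS, fio reported {run['reported_iops']}")
```

The implied values come out within about one IOPS of what fio reports (the small gap is expected, since completion latency does not include submission overhead), so the IOPS and latency figures are consistent with each other.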
UPD: I have tested the disk connected directly to a SATA port on the motherboard (a separate SATA controller, not the LSI2116) and nothing changed: the results are the same. So I presume it is not the LSI2116 controller that causes the strange results.
UPD2: Interestingly, the performance gain for sequential operations with the cache disabled is smaller, but stable. Here's an example:
fio --filename=/dev/sdl --direct=1 --sync=1 --rw=write --bs=16M --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=16M-wr
Write cache enabled:
16M-wr: (groupid=0, jobs=1): err= 0: pid=2309: Fri Jun 23 11:52:37 2017
write: io=9024.0MB, bw=153879KB/s, iops=9, runt= 60051msec
clat (msec): min=86, max=173, avg=105.37, stdev= 9.64
Write cache disabled:
16M-wr: (groupid=0, jobs=1): err= 0: pid=2275: Fri Jun 23 11:45:22 2017
write: io=10864MB, bw=185159KB/s, iops=11, runt= 60082msec
clat (msec): min=80, max=132, avg=87.42, stdev= 6.84
And this becomes interesting, because the difference between the results with the cache enabled and disabled closely matches what HGST claims in their datasheet:
https://www.hgst.com/sites/default/files/resources/Ultrastar-7K6000-DS.pdf
•Compared to prior generation 7K4000
...
— Up to 3X faster random write performance using media cache technology
— 25% faster sequential read/write performance
It still does not explain why performance is better with the write cache disabled. However, it does look like with the write cache enabled I get performance comparable to the previous-generation 7K4000. Without the write cache, random write performance is roughly 2.75x faster (264 vs. 96 IOPS) and sequential write performance is 1.2x faster.
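For reference, here is how the ratios work out. This is just arithmetic on the QD=1 figures from the fio output in this post; the variable names are mine.

```python
# Ratios between the cache-disabled and cache-enabled runs at iodepth=1,
# using figures copied from the fio output in this post.

rand_iops_off, rand_iops_on = 264, 96      # 4k random write, IOPS
seq_bw_off, seq_bw_on = 185159, 153879     # 16M sequential write, KB/s

rand_speedup = rand_iops_off / rand_iops_on
seq_speedup = seq_bw_off / seq_bw_on

print(f"random write:     {rand_speedup:.2f}x (datasheet: up to 3x)")
print(f"sequential write: {seq_speedup:.2f}x (datasheet: 25% faster, i.e. 1.25x)")
```

Both ratios land a little under the datasheet's "up to" figures, which is what one would expect from a single drive under one specific workload.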
UPD3 hypothesis: Newer Hitachi Ultrastar drives have a feature called Media Cache. It is an advanced non-volatile caching technique, and here's how it works (as I understand it, of course):
- First, data is written into the DRAM cache.
- Next, the drive has many reserved areas on each platter, physically located where the platter provides the best speed. These areas are essentially the Media Cache storage and are used as a non-volatile second-stage cache. Data from the DRAM buffer is accumulated and flushed into the Media Cache at high queue depth. This minimizes head movement and provides additional reliability and a speed gain.
- Only after that is the data written to the actual storage areas on the platter.
So, Media Cache is a two-stage writeback cache, and I think that a write operation is considered complete only after the flush to the Media Cache is done.
Interesting technique, I must admit. My hypothesis is that when we disable write caching with hdparm -W0, only the Media Cache is disabled.
Data is then cached only in DRAM and flushed directly to the platters. Media Cache should certainly be a great advantage in general, but during synchronous writes we have to wait for the write to the Media Cache area to finish. When Media Cache is disabled, a write is considered complete as soon as the data is in the disk's DRAM buffer, which is much faster. At low queue depths the DRAM cache provides enough space to absorb writes without slowing down; at larger queue depths, when many flushes to the platters have to happen constantly, the situation is different. I have performed two tests with QD=256.
fio --filename=/dev/sda --direct=1 --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=256 --runtime=180 --time_based --group_reporting --name=4krandwrite
hdparm -W0 /dev/sda (write cache disabled)
4krandwrite: (groupid=0, jobs=1): err= 0: pid=3176: Wed Jun 28 10:11:15 2017
write: io=62772KB, bw=357093B/s, iops=87, runt=180005msec
clat (msec): min=1, max=72, avg=11.46, stdev= 4.95
hdparm -W1 /dev/sda (write cache enabled)
4krandwrite: (groupid=0, jobs=1): err= 0: pid=3210: Wed Jun 28 10:14:37 2017
write: io=70016KB, bw=398304B/s, iops=97, runt=180004msec
clat (msec): min=1, max=52, avg=10.27, stdev= 3.99
So, we clearly see that enabling the write cache gives about an 11.5% advantage in IOPS and commit latency. It looks like my hypothesis is correct and hdparm controls only the Media Cache, not the DRAM buffer. And at higher queue depths MC really pays for itself.
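The 11.5% figure can be reproduced from the two QD=256 runs above. A short Python sketch, with the numbers copied from the fio output and variable names of my own choosing:

```python
# Relative difference between the two iodepth=256 4k random write runs,
# using figures copied from the fio output above.

iops_off, iops_on = 87, 97                 # write cache disabled / enabled
clat_off_ms, clat_on_ms = 11.46, 10.27     # average completion latency, ms

iops_gain_pct = (iops_on / iops_off - 1) * 100
clat_gain_pct = (clat_off_ms / clat_on_ms - 1) * 100

print(f"IOPS gain with cache enabled:    {iops_gain_pct:.1f}%")
print(f"latency ratio (disabled/enabled): +{clat_gain_pct:.1f}%")
```

One caveat, as far as I know: fio's documentation notes that raising iodepth above 1 has no effect on synchronous I/O engines, and the command above does not set --ioengine, so the effective queue depth in these runs may have been lower than 256.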
This is not the case for sequential operations, though.
fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=16M --numjobs=1 --iodepth=256 --runtime=180 --time_based --group_reporting --name=16Mseq
hdparm -W0 /dev/sda (write cache disabled)
16Mseq: (groupid=0, jobs=1): err= 0: pid=3018: Wed Jun 28 09:38:52 2017
write: io=32608MB, bw=185502KB/s, iops=11, runt=180001msec
clat (msec): min=75, max=144, avg=87.27, stdev= 6.58
hdparm -W1 /dev/sda (write cache enabled)
16Mseq: (groupid=0, jobs=1): err= 0: pid=2986: Wed Jun 28 09:34:00 2017
write: io=27312MB, bw=155308KB/s, iops=9, runt=180078msec
clat (msec): min=83, max=165, avg=104.44, stdev=10.72
So, I guess, Media Cache provides a speed advantage on random write loads, while for sequential writes it may be used mainly as an additional reliability mechanism.
UPD4 (Looks like I've got an answer)
I have contacted HGST support and they clarified that on the 7K6000 Media Cache is active only when the write cache (DRAM) is disabled. So it looks like at low queue depths Media Cache is actually faster than the DRAM cache. I guess this is because Media Cache lets the drive write data sequentially into its cache areas irrespective of the I/O pattern. That greatly reduces the required head movement and leads to better performance. I would still like to know more about Media Cache, so I am not answering my own question yet. Instead, I've asked support for more technical info on Media Cache and will update this question if I get any.
I would still greatly appreciate any suggestions, comments or alternative explanations. Thanks in advance!