31

Background flushing on Linux happens when either too much written data is pending (adjustable via /proc/sys/vm/dirty_background_ratio) or a timeout for pending writes is reached (/proc/sys/vm/dirty_expire_centisecs). Unless another limit is hit (/proc/sys/vm/dirty_ratio), more written data may simply be cached; once that limit is reached, further writes block.
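
For reference, a minimal sketch of inspecting these tunables at runtime (the paths are the standard procfs locations; the values are per-system):

# Current writeback thresholds (inspection only)
cat /proc/sys/vm/dirty_background_ratio   # % of memory dirty before background flushing starts
cat /proc/sys/vm/dirty_expire_centisecs   # age in 1/100 s after which dirty data must be written out
cat /proc/sys/vm/dirty_ratio              # % of memory dirty at which writing processes start to block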

In theory, this should create a background process that writes out dirty pages without disturbing other processes. In practice, it disturbs any process doing uncached reads or synchronous writes, and badly: the background flush writes at 100% of the device speed, so any other device request issued at that time is delayed, because every queue and write-cache along the way is full.

Is there a way to limit the amount of requests per second the flushing process performs, or otherwise effectively prioritize other device I/O?

korkman
  • Maybe this would be a good question to send to the linux kernel mailing list http://vger.kernel.org/vger-lists.html#linux-kernel –  Mar 25 '10 at 21:56
  • What IO scheduler are you using? – 3dinfluence Mar 26 '10 at 00:05
  • Tried various (cfq, deadline), but I guess these only work reliably when no battery-backed write-cache is involved. One disk array I have eats 1 GiB of data at PCIe bus (RAM) speed and then hits the reality wall: several seconds of zero I/O for all LUNs. Throttling flushes (at least background ones) to a rough estimate of the actual device speed would solve that congestion problem. – korkman Mar 26 '10 at 13:09
  • 1
    I recently became aware of /sys/block/sdX/queue/nr_requests being a major tunable. Turning it down to the minimum (4 in my case) improves concurrent-load latency a lot: sysbench fsync random writes per second jumped from 4 (!) to 80-90 while writing at bus speed with dd. Non-loaded performance seems unaffected. Schedulers are all the same; noop or deadline seems optimal. This may be true for most BBWC configurations. (A rough sketch of this tuning follows below.) – korkman Apr 10 '10 at 14:36
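
A rough sketch of the nr_requests tuning described in that last comment, assuming a device named sdX (a placeholder) and a kernel that offers the deadline scheduler:

# Shrink the software request queue and pick a simple scheduler (sdX is a placeholder)
cat /sys/block/sdX/queue/nr_requests          # often defaults to 128
echo 4 > /sys/block/sdX/queue/nr_requests     # the minimum accepted in the commenter's case
echo deadline > /sys/block/sdX/queue/scheduler
cat /sys/block/sdX/queue/scheduler            # the active scheduler is shown in [brackets]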

4 Answers

23

After a lot of benchmarking with sysbench, I came to this conclusion:

To survive (performance-wise) a situation where

  • an evil copy process floods dirty pages
  • and a hardware write-cache is present (possibly also without one)
  • and synchronous reads or writes per second (IOPS) are critical

just dump all elevators, queues and dirty page caches. The correct place for dirty pages is in the RAM of that hardware write-cache.

Adjust dirty_ratio (or the newer dirty_bytes) as low as possible, but keep an eye on sequential throughput. In my particular case, 15 MB was the optimum (echo 15000000 > /proc/sys/vm/dirty_bytes).
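
As a sketch, applying and persisting that value could look like this (the 15 MB figure is specific to this setup; tune it against your own sequential throughput):

# Cap the dirty page cache at ~15 MB so writes reach the controller's write-cache quickly
echo 15000000 > /proc/sys/vm/dirty_bytes
sysctl vm.dirty_bytes                    # verify; setting dirty_bytes overrides dirty_ratio
# to persist across reboots, add the following line to /etc/sysctl.conf:
#   vm.dirty_bytes = 15000000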

This is more a hack than a solution, because gigabytes of RAM are now used for read caching only instead of dirty cache. For dirty cache to work out well in this situation, the Linux kernel background flusher would need to average the speed at which the underlying device accepts requests and adjust background flushing accordingly. Not easy.


Specifications and benchmarks for comparison:

Tested while dd'ing zeros to disk, sysbench showed huge success: 10-thread fsync writes at 16 kB jumped from 33 to 700 IOPS (idle limit: 1500 IOPS) and a single thread from 8 to 400 IOPS.

Without load, IOPS were unaffected (~1500) and throughput slightly reduced (from 251 MB/s to 216 MB/s).

dd call:

dd if=/dev/zero of=dumpfile bs=1024 count=20485672

For sysbench, test_file.0 was prepared to be non-sparse with:

dd if=/dev/zero of=test_file.0 bs=1024 count=10485672

sysbench call for 10 threads:

sysbench --test=fileio --file-num=1 --num-threads=10 --file-total-size=10G --file-fsync-all=on --file-test-mode=rndwr --max-time=30 --file-block-size=16384 --max-requests=0 run

sysbench call for one thread:

sysbench --test=fileio --file-num=1 --num-threads=1 --file-total-size=10G --file-fsync-all=on --file-test-mode=rndwr --max-time=30 --file-block-size=16384 --max-requests=0 run

Smaller block sizes showed even more drastic numbers.

--file-block-size=4096 with 1 GB dirty_bytes:

sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Extra file open flags: 0
1 files, 10Gb each
10Gb total file size
Block size 4Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Calling fsync() after each write operation.
Using synchronous I/O mode
Doing random write test
Threads started!
Time limit exceeded, exiting...
Done.

Operations performed:  0 Read, 30 Write, 30 Other = 60 Total
Read 0b  Written 120Kb  Total transferred 120Kb  (3.939Kb/sec)
      0.98 Requests/sec executed

Test execution summary:
      total time:                          30.4642s
      total number of events:              30
      total time taken by event execution: 30.4639
      per-request statistics:
           min:                                 94.36ms
           avg:                               1015.46ms
           max:                               1591.95ms
           approx.  95 percentile:            1591.30ms

Threads fairness:
      events (avg/stddev):           30.0000/0.00
      execution time (avg/stddev):   30.4639/0.00

--file-block-size=4096 with 15 MB dirty_bytes:

sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Extra file open flags: 0
1 files, 10Gb each
10Gb total file size
Block size 4Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Calling fsync() after each write operation.
Using synchronous I/O mode
Doing random write test
Threads started!
Time limit exceeded, exiting...
Done.

Operations performed:  0 Read, 13524 Write, 13524 Other = 27048 Total
Read 0b  Written 52.828Mb  Total transferred 52.828Mb  (1.7608Mb/sec)
    450.75 Requests/sec executed

Test execution summary:
      total time:                          30.0032s
      total number of events:              13524
      total time taken by event execution: 29.9921
      per-request statistics:
           min:                                  0.10ms
           avg:                                  2.22ms
           max:                                145.75ms
           approx.  95 percentile:              12.35ms

Threads fairness:
      events (avg/stddev):           13524.0000/0.00
      execution time (avg/stddev):   29.9921/0.00

--file-block-size=4096 with 15 MB dirty_bytes on idle system:

sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Extra file open flags: 0
1 files, 10Gb each
10Gb total file size
Block size 4Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Calling fsync() after each write operation.
Using synchronous I/O mode
Doing random write test
Threads started!
Time limit exceeded, exiting...
Done.

Operations performed:  0 Read, 43801 Write, 43801 Other = 87602 Total
Read 0b  Written 171.1Mb  Total transferred 171.1Mb  (5.7032Mb/sec)
 1460.02 Requests/sec executed

Test execution summary:
      total time:                          30.0004s
      total number of events:              43801
      total time taken by event execution: 29.9662
      per-request statistics:
           min:                                  0.10ms
           avg:                                  0.68ms
           max:                                275.50ms
           approx.  95 percentile:               3.28ms

Threads fairness:
      events (avg/stddev):           43801.0000/0.00
      execution time (avg/stddev):   29.9662/0.00

Test-System:

  • Adaptec 5405Z (that's 512 MB write-cache with protection)
  • Intel Xeon L5520
  • 6 GiB RAM @ 1066 MHz
  • Motherboard Supermicro X8DTN (5520 chipset)
  • 12 Seagate Barracuda 1 TB disks
    • 10 in Linux software RAID 10
  • Kernel 2.6.32
  • Filesystem xfs
  • Debian unstable

In summary, I am now sure this configuration will perform well in idle, high-load and even full-load situations for database traffic that would otherwise have been starved by sequential traffic. Sequential throughput is higher than two gigabit links can deliver anyway, so reducing it a bit is no problem.

korkman
  • What's your methodology to arrive at the '15MB for dirty_buffers is optimal' part? – Marcin Feb 15 '12 at 15:16
  • 1
    Trial and error. Like, change half the amount next time, etc., until I ended up with a mere 15 MB and OK IOPS. The current kernel, 3.2, may behave very differently, BTW. – korkman Feb 22 '12 at 16:04
  • 2
    Just wanted to say thanks for putting me on the right track. Had some similar issues with a XenServer node. Turned out to be PHP-FPM/APC cache causing dirty pages. Adjusting the APC cache memory model solved the issue for us. DiskIO went from 20% utilization to 0. – jeffatrackaid May 21 '13 at 20:17
  • Logically `dirty_bytes` should be barely high enough to avoid stalling CPUs while processes are writing, provided the process writes *on average* at the throughput of the device. If your application code does cycles of huge computation followed by writing a huge amount of data, it will be very hard to optimize, because short-time averages differ greatly from long-time averages. The correct solution would be a per-process `dirty_bytes` setting, but Linux does not support such a thing as far as I know. – Mikko Rantalainen Jun 27 '13 at 08:04
3

Even though tuning kernel parameters stopped the problem, it's actually possible your performance issues were the result of a bug on the Adaptec 5405Z controller that was fixed in a Feb 1, 2012 firmware update. The release notes say "Fixed an issue where the firmware could hang during high I/O stress." Perhaps spreading out the I/O as you did was enough to prevent this bug from being triggered, but that's just a guess.

Here are the release notes: http://download.adaptec.com/pdfs/readme/relnotes_arc_fw-b18937_asm-18837.pdf

Even if this wasn't the case for your particular situation, I figured this could benefit users who come across this post in the future. We saw some messages like the following in our dmesg output which eventually led us to the firmware update:

aacraid: Host adapter abort request (0,0,0,0)
[above was repeated many times]
AAC: Host adapter BLINK LED 0x62
AAC0: adapter kernel panic'd 62.
sd 0:0:0:0: timing out command, waited 360s
sd 0:0:0:0: Unhandled error code
sd 0:0:0:0: SCSI error: return code = 0x06000000
Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
sd 0:0:0:0: timing out command, waited 360s
sd 0:0:0:0: Unhandled error code
sd 0:0:0:0: SCSI error: return code = 0x06000028
Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
sd 0:0:0:0: timing out command, waited 360s
sd 0:0:0:0: Unhandled error code
sd 0:0:0:0: SCSI error: return code = 0x06000028

Here are the model numbers of the Adaptec RAID controllers which are listed in the release notes for the firmware that has the high I/O hang fix: 2045, 2405, 2405Q, 2805, 5085, 5405, 5405Z, 5445, 5445Z, 5805, 5805Q, 5805Z, 5805ZQ, 51245, 51645, 52445.

sa289
  • 1
    Wow, thanks for your input. Although this wasn't the case for me, you give me yet another reason to avoid HW RAID altogether and move on to HBA-only setups. HW RAID still has the BBWC advantage, but with things like bcache moving into the kernel, even that vanishes. The downside of HW RAID is exactly the kind of firmware bugs you describe. I did have another system with a DRBD setup where high I/O load caused firmware resets, so this is not rare to come across (it might have been exactly that bug). – korkman Oct 26 '14 at 13:46
1

Newer kernels (4.10 and later) include "WBT", writeback throttling:

Improvements in the block layer, LWN.net

With writeback throttling, [the block layer] attempts to get maximum performance without excessive I/O latency using a strategy borrowed from the CoDel network scheduler. CoDel tracks the observed minimum latency of network packets and, if that exceeds a threshold value, it starts dropping packets. Dropping writes is frowned upon in the I/O subsystem, but a similar strategy is followed in that the kernel monitors the minimum latency of both reads and writes and, if that exceeds a threshold value, it starts to turn down the amount of background writeback that's being done. This behavior was added in 4.10; Axboe said that pretty good results have been seen.

WBT does not require switching to the new blk-mq block layer. That said, it does not work with the CFQ or BFQ I/O schedulers. You can use WBT with the deadline / mq-deadline / noop / none schedulers. I believe it also works with the new "kyber" I/O scheduler.

As well as scaling the queue size to control latency, the WBT code limits the number of background writeback requests as a proportion of the calculated queue limit.

The runtime configuration is in /sys/class/block/*/queue/wbt_lat_usec.
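
A minimal sketch of reading and adjusting it, assuming a device named sdX (a placeholder); the special values follow my reading of the kernel's queue-sysfs documentation:

# Writeback throttling latency target, in microseconds (sdX is a placeholder)
cat /sys/class/block/sdX/queue/wbt_lat_usec              # 0 means WBT is disabled for this device
echo 75000 > /sys/class/block/sdX/queue/wbt_lat_usec     # e.g. 75 ms, roughly the default for rotational disks
echo -1 > /sys/class/block/sdX/queue/wbt_lat_usec        # reset to the kernel default (per the queue-sysfs docs)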

The build configuration options to look for are

/boot/config-4.20.8-200.fc29.x86_64:CONFIG_BLK_WBT=y
/boot/config-4.20.8-200.fc29.x86_64:# CONFIG_BLK_WBT_SQ is not set
/boot/config-4.20.8-200.fc29.x86_64:CONFIG_BLK_WBT_MQ=y

Your problem statement is confirmed 100% by the author of WBT - well done :-).

[PATCHSET] block: buffered writeback throttling

Since the dawn of time, our background buffered writeback has sucked. When we do background buffered writeback, it should have little impact on foreground activity. That's the definition of background activity... But for as long as I can remember, heavy buffered writers have not behaved like that. For instance, if I do something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try and start chrome, it basically won't start before the buffered writeback is done. Or, for server oriented workloads, where installation of a big RPM (or similar) adversely impacts database reads or sync writes. When that happens, I get people yelling at me.

Results from some recent testing can be found here:

https://www.facebook.com/axboe/posts/10154074651342933

See previous postings for a bigger description of the patchset.

sourcejedi
  • I'm happy to see the problem is recognised and dealt with inside the kernel now. Do keep in mind blk-mq is fairly new and [maybe not that mature](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=915666) yet. – korkman Feb 22 '19 at 12:59
  • @korkman sigh, I guess I'll mangle the quote to avoid the false implication. I agree this is stuff added in the last couple of years, there may still be performance regressions or worse. AFAIR the maintainer dismisses the data corruption fix in the sense that it's a fluke. *If* you are using the kernel versions where blk-mq was developed, it's arguable how much using the "legacy" block layer will avoid bugs. The suspend bug I fixed was a bug that originated in blk-mq, then it was refactored or something & affected both. https://github.com/torvalds/linux/commit/1dc3039bc87a – sourcejedi Feb 22 '19 at 13:23
  • It would be absolutely awesome if the memory paging (swap) subsystem could be sat on top of this. I would absolutely accept slower swap-in if I could provide additional I/O to un-swapped processes. The inherently random and horribly unoptimized nature of swap reduces 200MB/sec devices to 300KB/sec anyway, so getting some sequential reads in sideways might even be free. – i336_ Apr 25 '20 at 08:10
0

What is your average for Dirty in /proc/meminfo? It should not normally exceed your /proc/sys/vm/dirty_ratio. On a dedicated file server I have dirty_ratio set to a very high percentage of memory (90), as I will never exceed it. Your dirty_ratio is too low; when you hit it, everything craps out. Raise it.
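
A minimal way to check that, as I read this answer (watching how close Dirty gets to the configured thresholds):

# Compare currently dirty memory against the configured thresholds
grep -E '^(Dirty|Writeback):' /proc/meminfo
cat /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_background_ratio
# or watch it while the workload is running:
watch -n 1 'grep Dirty /proc/meminfo'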

Luke
  • The problem is not processes being blocked when hitting dirty_ratio. I'm okay with that. But the "background" process writing out dirty data to the disks fills up queues without mercy and kills IOPS performance. It's called I/O starvation, I think. In fact, setting dirty_bytes extremely low (like 1 MB) helps a lot, because flushing occurs almost immediately and the queues are kept empty. The drawback is possibly lower sequential throughput, but that's okay. – korkman Apr 11 '10 at 11:53
  • You turned off all elevators? What else did you tweak from a vanilla system? – Luke Apr 15 '10 at 04:58
  • 1
    See my self-answer. The end of the story was to remove dirty caching and leave that part to the HW controller. Elevators are kind of irrelevant with a HW write-cache in place. The controller has its own elevator algorithms, so having any elevator in software only adds overhead. – korkman Apr 17 '10 at 14:44
  • An elevator in software is a tradeoff: sacrifice latency to improve bandwidth. For example, imagine 100K write ops in the software queue submitted in random order; if the software elevator can order those ops using a huge buffer, it may end up sending only 5K much bigger requests to the device. However, as a result, latency has to increase, because the first 2K ops and the last 1K ops may actually be near each other on the device; without the added latency, it would be impossible to merge them. – Mikko Rantalainen Jun 27 '13 at 08:10