We have processes doing background writes of big files. We would like those to have minimal impact on other processes.
Here is a test run on SLES11 SP4. The server has a large amount of memory, which allows it to accumulate 4GB of dirty pages.
> dd if=/dev/zero of=todel bs=1048576 count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 3.72657 s, 1.2 GB/s
> dd if=/dev/zero of=zer oflag=sync bs=512 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 16.6997 s, 0.0 kB/s
real 0m16.701s
user 0m0.000s
sys 0m0.000s
> grep Dirty /proc/meminfo
Dirty: 4199704 kB
This is my investigation so far:
- SLES11 SP4 (3.0.101-63)
- type ext3 (rw,nosuid,nodev,noatime)
- deadline scheduler
- over 120GB reclaimable memory at the time
- dirty_ratio is set to 40% and dirty_background_ratio 10%, 30s expire, 5s writeback
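For reference, these writeback tunables can be inspected under /proc/sys/vm, and on big-memory machines capped by an absolute size instead of a percentage of RAM. This is only a sketch; the byte values below are examples, not recommendations:

```shell
# Current writeback tunables (this question reports 40 / 10 / 3000 / 500)
cat /proc/sys/vm/dirty_ratio
cat /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_expire_centisecs
cat /proc/sys/vm/dirty_writeback_centisecs

# The *_bytes variants override their *_ratio counterparts (needs root):
#   sysctl -w vm.dirty_background_bytes=268435456   # start flushing at 256 MB
#   sysctl -w vm.dirty_bytes=1073741824             # hard limit at 1 GB
```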
Here are my questions:
- since 4GB of dirty memory remain at the end of the test, I conclude that the IO scheduler was not invoked during the test. Is that right?
- since the slowness persists after the first dd finishes, I conclude this issue also has nothing to do with the kernel allocating memory or with any "copy on write" happening when dd fills its buffer (dd always writes from the same buffer).
- is there a way to investigate deeper what is blocked? Any interesting counters to watch? Any idea on the source of the contention?
- we are thinking of either reducing the dirty_ratio values or performing the first dd in synchronous mode. Any other directions to investigate? Is there a drawback to making the first dd synchronous? I'm afraid it would be prioritized over other "legitimate" processes doing asynchronous writes.
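On the "counters to watch" question, a few things worth polling while the stall is happening (a sketch, nothing SLES-specific; the sysrq line needs root and is commented out):

```shell
# Dirty / writeback pages (wrap in `watch -n1` to poll continuously)
grep -E '^(Dirty|Writeback):' /proc/meminfo

# Kernel-wide writeback counters
grep -E '^(nr_dirty|nr_writeback) ' /proc/vmstat

# Per-device utilisation and average wait (sysstat package):
#   iostat -x 1

# Dump stacks of blocked (D-state) tasks to dmesg (needs root):
#   echo w > /proc/sysrq-trigger
```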
see also
https://www.novell.com/support/kb/doc.php?id=7010287
limit linux background flush (dirty pages)
http://yarchive.net/comp/linux/dirty_limits.html
EDIT:
there is an ext2 file system on the same device. On that file system, there is no freeze at all! The only performance impact occurs during the flushing of dirty pages, where a synchronous call can take up to 0.3s, very far from what we experience with our ext3 file system.
EDIT2:
Following @Matthew Ife's comment, I tried doing the synchronous write opening the file without O_TRUNC, and you won't believe the result!
> dd if=/dev/zero of=zer oflag=sync bs=512 count=1
> dd if=/dev/zero of=todel bs=1048576 count=4096
> dd if=/dev/zero of=zer oflag=sync bs=512 count=1 conv=notrunc
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000185427 s, 2.8 MB/s
dd was opening the file with parameters:
open("zer", O_WRONLY|O_CREAT|O_TRUNC|O_SYNC, 0666) = 3
changing with the notrunc option, it is now
open("zer", O_WRONLY|O_CREAT|O_SYNC, 0666) = 3
and the synchronous write completes instantly!
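If strace is available, the difference in open(2) flags between the two variants can be confirmed directly (the file name is just an example; newer kernels may show openat instead of open):

```shell
# Truncating variant: expect O_TRUNC among the flags
strace -e trace=open,openat dd if=/dev/zero of=zer oflag=sync bs=512 count=1 2>&1 | grep '"zer"'

# Non-truncating variant: O_TRUNC should be gone
strace -e trace=open,openat dd if=/dev/zero of=zer oflag=sync bs=512 count=1 conv=notrunc 2>&1 | grep '"zer"'
```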
Well, it is not completely satisfying for my use case (I'm doing an msync in this fashion). However, I am now able to trace what write and msync are doing differently!
final EDIT: I can't believe I hit this: https://www.novell.com/support/kb/doc.php?id=7016100
In fact, under SLES11 dd opens the file with
open("zer", O_WRONLY|O_CREAT|O_DSYNC, 0666) = 3
and on this kernel O_DSYNC == O_SYNC!
Conclusion:
For my use case I should probably use
dd if=/dev/zero of=zer oflag=dsync bs=512 count=1 conv=notrunc
Under SLES11, running oflag=sync is really running oflag=dsync, no matter what strace is saying.