6

I'm currently experimenting with different ways of improving write speeds to a fairly large mdadm software-RAID array of rotating disks on Debian, using fast NVMe devices.

I found that using a pair of such devices (RAID1, mirrored) to store the filesystem's journal yields interesting performance benefits. The mount options I am using to achieve this are noatime,journal_async_commit,data=journal.
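
For reference, this is roughly how such a setup is created (a sketch with example device names: /dev/md0 is the rotating-disk array, /dev/md1 is the NVMe mirror):

# Format the NVMe mirror as an external journal device
# (its block size must match the filesystem's):
mke2fs -O journal_dev -b 4096 /dev/md1
# Create the filesystem on the data array, pointing it at the external journal:
mkfs.ext4 -b 4096 -J device=/dev/md1 /dev/md0
# Mount with the options mentioned above:
mount -o noatime,journal_async_commit,data=journal /dev/md0 /mnt/array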

In my tests, I've also discovered that adding the barrier=0 option significantly improves write performance. However, I'm not certain that this option is safe to use in my particular filesystem configuration. This is what the kernel documentation says about ext4 write barriers:

Write barriers enforce proper on-disk ordering of journal commits, making volatile disk write caches safe to use, at some performance penalty. If your disks are battery-backed in one way or another, disabling barriers may safely improve performance.

The specific NVMe device I'm using is an Intel DC P3700, which has built-in power-loss protection: in the event of an unexpected shutdown, any data still present in temporary buffers is safely committed to NAND storage thanks to reserve energy storage.
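
For reference, two ways to check how a drive's volatile write cache is reported (example device names; the nvme tool comes from the nvme-cli package):

# "write through" means the block layer considers the cache safe and
# will not issue flushes for this device:
cat /sys/block/nvme0n1/queue/write_cache
# The VWC field of the controller identify data indicates whether a
# volatile write cache is present at all:
nvme id-ctrl /dev/nvme0 | grep -i vwc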

So my question is: can I safely disable ext4 write barriers if the journal is stored on a device with a battery-backed cache, while the rest of the filesystem sits on disks which don't have this feature?

jcharaoui

3 Answers

4

I'm writing a new answer because after further analysis, I don't think the previous answer is correct.

If we look at the write_dirty_buffer function, it issues a write request with the REQ_SYNC flag, but it doesn't cause a cache flush, or barrier, to be issued. That is accomplished by the blkdev_issue_flush call, which is appropriately gated by a check of the JBD2_BARRIER flag, which itself is only set when the filesystem is mounted with barriers enabled.

So if we look back at checkpoint.c, barriers only matter when a transaction is dropped from the journal. The comments in the code are informative here, telling us that this write barrier is unlikely to be necessary, but is there anyway as a safeguard. I think the assumption is that by the time a transaction is dropped from the journal, the data itself is unlikely to still be lingering in the drive's cache, not yet committed to permanent storage. But since it's only an assumption, the write barrier is issued anyway.
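
For reference, here is the passage in question, approximately as it appears in fs/jbd2/checkpoint.c (in jbd2_cleanup_journal_tail(), in kernels current at the time of writing; check your own tree for the exact form):

/*
 * We need to make sure that any blocks that were recently written out
 * --- perhaps by jbd2_log_do_checkpoint() --- are flushed out before
 * we drop the transactions from the journal. It's unlikely this will
 * be necessary, especially with an appropriately sized journal, but we
 * need this to guarantee correctness.  Fortunately
 * jbd2_cleanup_journal_tail() doesn't get called all that often.
 */
if (journal->j_flags & JBD2_BARRIER)
        blkdev_issue_flush(journal->j_fs_dev, GFP_NOFS, NULL);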

So why aren't barriers used when writing data to the main filesystem? I think the key here is that as long as the journal is coherent, metadata that's missing from the filesystem (e.g. because it was lost in a power-loss event) is normally recovered during the journal replay, thus avoiding filesystem corruption. Furthermore, the use of data=journal should also guarantee the consistency of actual filesystem data because, as I understand it, the recovery process will also write out the data blocks that were committed to the journal as part of its replay mechanism.

So while ext4, with barriers disabled, does not actually flush disk caches at the end of a checkpoint, some steps should be taken to maximize recoverability in case of a power loss:

  1. The filesystem should be mounted with data=journal, and not data=writeback (data=ordered is unavailable when using an external journal). This one should be obvious: we want a copy of all incoming data blocks inside the journal since those are the ones likely to be lost in a power-loss event. This isn't expensive performance-wise, since NVMe devices are very fast.

  2. The maximum journal size of 102400 blocks (400MB when using 4K filesystem blocks) should be used, so as to maximize the amount of data that's recoverable by a journal replay (see the sketch after this list). This shouldn't be an issue, since NVMe devices are at least several gigabytes in size.

  3. Problems may still arise if an unexpected shutdown happens during a write-intensive operation: if transactions get dropped from the journal device faster than the data drives are able to flush their caches on their own, unrecoverable data loss or filesystem corruption could occur.
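
As a hypothetical sketch of precautions 1 and 2 (placeholder device names; 4K filesystem blocks assumed):

# Size the external journal device at the 102400-block maximum (400MB at 4K):
mke2fs -O journal_dev -b 4096 /dev/md1 102400
# Attach it to the (unmounted) filesystem, dropping any internal journal first:
tune2fs -O ^has_journal /dev/md0
tune2fs -J device=/dev/md1 /dev/md0
# /etc/fstab entry implementing precaution 1, with the barrier=0 setting under discussion:
# /dev/md0  /mnt/array  ext4  noatime,journal_async_commit,data=journal,barrier=0  0  2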

So the bottom line, in my view, is that it's not 100% safe to disable write barriers, although some precautions (#1 and #2 above) can be implemented to make this setup a little safer.

jcharaoui
3

Another way to put your question is this: when doing a checkpoint, i.e. when writing the data in the journal to the actual filesystem, does ext4 flush out the cache (of the rotating disks, in your case) before marking the transaction as completed and updating the journal accordingly?

If we look at the source code of jbd2 (which is responsible for handling the journalling) in checkpoint.c, we see that jbd2_log_do_checkpoint() calls, at the end:

__flush_batch(journal, &batch_count);

which calls:

write_dirty_buffer(journal->j_chkpt_bhs[i], REQ_SYNC);

So it seems like it should be safe.

Related: in the past, a patch to use WRITE_SYNC in the journal checkpoint was also proposed. The reason was that writing the buffers had too low a priority, which caused the journal to fill up while waiting for the writes to complete.

Luca Gibelli
  • @jcharaoui At first I misunderstood your scenario and had to rewrite the answer – Luca Gibelli May 28 '18 at 20:42
  • Thank you for the research and updated answer! If I understand correctly, the checkpoint process is unaffected by whether the filesystem is mounted with write barriers enabled or disabled: the code will always tell the underlying disk to flush its buffers before the checkpoint is marked as being completed in the journal. – jcharaoui May 28 '18 at 21:07
  • Precisely, I couldn't find any condition on barriers in checkpoint.c – Luca Gibelli May 28 '18 at 21:39
  • There's [a reference to JBD2_BARRIER](https://github.com/torvalds/linux/blob/master/fs/jbd2/checkpoint.c#L404) but I'm not sure whether it has to do with the barrier mount option, or some other flag somewhere in the journal itself. – jcharaoui May 28 '18 at 21:42
0

If disabling write barriers significantly enhances performance, that means you shouldn't disable write barriers and that your data is at risk. See this part of the XFS FAQ for an explanation.

wazoox
  • This isn't relevant, since the filesystem being discussed is ext4, not XFS. Furthermore, the answer provided in the FAQ entry takes into account neither the use of an external journal nor the md layer, which always reports needing barriers regardless of the underlying device's cache characteristics. – jcharaoui Jun 01 '18 at 19:55
  • @jcharaoui What this entry of the XFS FAQ says is completely generic: a device with protected cache should report successful sync instantly whatever the barriers setting is, so turning write barriers off should have *no effect* on performance, whatever you're using it for. If performance is higher with barriers off, that means that the device is behaving differently, therefore there's probably a risk. I would not take it. – wazoox Jun 02 '18 at 21:20