12

I'm running into a maximum-throughput issue here and need some advice on which way to tune my knobs. We're running a 10 Gbit fileserver for backup distribution. It's a two-disk SATA2 setup on an LSI MegaRAID controller. The server also has 24 GB of memory.

We need to mirror our last uploaded backup with maximum throughput.

The RAID0 array for our "hot" backups gives us around 260 MB/s write and 275 MB/s read. A 20 GB tmpfs, tested for comparison, gives us around 1 GB/s. That is the kind of throughput we need.

Now, how can I tune the Linux virtual memory subsystem to keep the most recently uploaded files in memory for as long as possible without writing them out to disk (or, even better: write them to disk AND keep them in memory)?

I set up the following sysctls, but they don't give us the throughput we expect:

# VM pressure fixes
vm.swappiness = 20
vm.dirty_ratio = 70
vm.dirty_background_ratio = 30
vm.dirty_writeback_centisecs = 60000

In theory this should give us 16 GB for caching I/O and wait a few minutes before writing to disk. Still, when I benchmark the server I see no effect on writes; the throughput doesn't increase.
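For reference, here is roughly how I apply the settings and watch the dirty-page counters during a benchmark run (a minimal sketch; the settings live in the standard /etc/sysctl.conf):

# apply the settings (after adding them to /etc/sysctl.conf)
sysctl -p

# watch how much dirty data actually piles up while the benchmark runs
watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'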

Help or advice needed.

Peter Meyer
  • Wouldn't it make more sense to start writing as soon as possible? Otherwise it reaches the maximum buffer size and suddenly comes to a halt. If it was writing all along it gives you more time. – Zan Lynx Feb 15 '12 at 16:37
  • I have 20GB of memory just for buffers, as my applications (base Linux + vsftpd) use under 4GB (of 24GB total). My backups are under 20GB. If I can get them written into the buffer and then flushed out to disk sequentially after the backup run, this would reduce the downtime of my backup source (virtual servers) significantly. **PS:** The server can come to a halt afterwards, no problem. It has 30 minutes to recover :) – Peter Meyer Feb 15 '12 at 16:45
  • It sounds like whatever application you are using to transfer the data over the network is syncing it to the disk. You will want to make it not do that so the data can just sit in the cache, though I question why you want to be able to burst a lot of data in like that faster than the disks can keep up. That points to a design flaw somewhere. – psusi Feb 15 '12 at 16:56
  • That sounds like the flaw: your backup solution shouldn't require the server be shut down the whole time. – psusi Feb 15 '12 at 16:57
  • @PeterMeyer: Even if you have a lot of RAM it is still a mistake to wait for writes to start. The only time that makes any sense at all is if you are going to be editing or deleting files (like a temporary file) before it would get to disk. A backup does not do that. You want to start background writes as soon as possible. Set your background_ratio to 1 or 2. – Zan Lynx Feb 16 '12 at 08:28

3 Answers

6

Judging by the variables you've set, it seems you are mostly concerned with write performance and do not care about possible data loss due to power outages.

You will only ever get lazy writes and the benefit of a writeback cache with asynchronous write operations. Synchronous write operations require committing to disk and will never be lazy-written. Your filesystem might be causing frequent page flushes and synchronous writes (typically due to journalling, especially with ext3 in data=journal mode). Additionally, even "background" page flushes will interfere with uncached reads and synchronous writes, thus slowing them down.

In general, you should gather some metrics to see what is happening - do you see your copy process put in the "D" state, waiting for I/O work to be done by pdflush? Do you see heavy synchronous write activity on your disks?
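A quick way to collect those data points might look like this (a rough sketch - iostat comes from the sysstat package, and the exact device names will differ on your box):

# processes stuck in uninterruptible sleep, i.e. waiting on I/O
ps -eo state,pid,comm | awk '$1 == "D"'

# per-device utilization and throughput, refreshed every 5 seconds
iostat -xm 5

# dirty and writeback page counters
grep -E '^(Dirty|Writeback):' /proc/meminfo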

If all else fails, you might set up an explicit tmpfs filesystem to copy your backups to and just synchronize the data to your disks after the fact - even automatically, using inotify.
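A rough sketch of that approach (the mount point and paths here are made up, and the final rsync could just as well be kicked off by an inotify watcher):

# stage the incoming backup in RAM
mount -t tmpfs -o size=20g tmpfs /srv/backup-staging

# ... receive the upload into /srv/backup-staging ...

# afterwards, drain it to the RAID array at whatever speed the disks manage
rsync -a /srv/backup-staging/ /srv/backup-disk/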

For read caching, things are significantly simpler - there is the fcoretools fadvise utility, which has the --willneed parameter to advise the kernel to load a file's contents into the buffer cache.
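The exact command-line syntax varies between fadvise implementations, so as a lowest-common-denominator sketch, simply reading the file once has a similar effect of pulling it into the page cache (the file path is just an example):

# crude pre-warming: read the file once so it ends up in the page cache
cat /srv/backup-disk/latest-backup.tar > /dev/null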

Edit:

vm.dirty_ratio = 70

In theory this should give us 16 GB for caching I/O and wait a few minutes before writing to disk.

This would not have greatly influenced your testing scenario, but there is a misconception in your understanding: the dirty_ratio parameter is not a percentage of your system's total memory, but of the memory actually available for caching (free and reclaimable pages).

There is an article about Tuning for Write-Heavy loads with more in-depth information.

the-wabbit
  • Yes, I'm after write performance. The time it takes to fan out the backup to the backup slaves is not my concern. I also have a script in place for retransmission, should the primary backup server fail and the backups not get through to the backup slaves. **PS** I've already read the link and tuned accordingly. Sorry for the mistake about free vs. buffered vs. total. – Peter Meyer Feb 15 '12 at 16:16
3

Or just get more disks... The drive array configuration you have does not support the throughput you require. This is a case where the solution should be reengineered to meet your real needs. I understand that this is only a backup, but it makes sense to avoid a kludgy fix.

ewwhite
  • Agreed. There is no way a couple of SATA (*SATA* ? seriously?) drives will sustain 275MB/s, and we're not even talking about the abysmal IOPs you'll get from them. – adaptr Feb 15 '12 at 14:43
  • I can see where he is heading - since this is just a data backup destination, he does not care about the possibility of the occasional data loss due to power outages. And he wants to minimize the time needed for a backup window by providing the maximal throughput available - 20 GB of data could be written in under 30 seconds this way. If the backups involve downtime or service impact for some reason, 30 seconds are surely easier to get over than 20 minutes. – the-wabbit Feb 15 '12 at 14:51
  • **TOTALLY** right. I'm synching virtual machine images (very small ones for compute nodes) which are down while synching. The app works like tar | ssh but using ftp. And well, the simulations need to run ... :) – Peter Meyer Feb 15 '12 at 16:11
  • @adaptr: Actually, they are SATA2. And I was able to achieve the stated throughput for sequential read/write on 10-gigabyte files, sustained. I activated the caches in the hard disks and writeback in the controller. This is an unsafe state, but as it's only for backups that are mirrored out to other backup servers, that's not a problem. Maximum throughput for the 10GE pipe is what I need. – Peter Meyer Feb 15 '12 at 16:14
  • It doesn't matter what SATA breed they are. 7200RPM non-enterprise disks simply cannot guarantee throughput or latency. – adaptr Feb 15 '12 at 16:17
  • @adaptr, of course they can, he already said the array DOES handle 275 MB/s. SATA is the interface, not the RPM. WD has been making 10,000 RPM SATA drives since like 2003/2004. IIRC, my 1.5 TB 5400 rpm green drive handles nearly 100 MB/s so 3 of them would give ~275. – psusi Feb 15 '12 at 16:49
  • That may be true for uninterrupted read-only sequential chunks. As soon as *any* of that data is non-sequential, it fails. Miserably. – adaptr Feb 15 '12 at 17:33
  • Again, this is an attempt to engineer something that is an ill fit for the right solution. Even if fully-sequential data can be assured, this is still risky and wouldn't be a recommended solution. – ewwhite Feb 15 '12 at 17:37
  • @adaptr, a backup is going to be sequential writes. – psusi Feb 16 '12 at 23:48
1

Using the memory cache may result in data loss: if something goes wrong, data that is still in memory and not yet saved to disk will be lost.

That said, there is tuning to be done at the filesystem level.

For example, if you were using ext4, you could try the mount option:

barrier=0

That: "disables the use of write barriers in the jbd code. Write barriers enforce proper on-disk ordering of journal commits, making volatile disk write caches safe to use, at some performance penalty. If your disks are battery-backed in one way or another, disabling barriers may safely improve performance. The mount options "barrier" and "nobarrier" can also be used to enable or disable barriers, for consistency with other ext4 mount options."

More at: http://www.mjmwired.net/kernel/Documentation/filesystems/ext4.txt
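If you do try it, the option is applied at mount time - something like this, with the device and mount point being placeholders:

# remount an existing ext4 filesystem without write barriers
mount -o remount,barrier=0 /srv/backup

# or permanently, via /etc/fstab
/dev/sdb1  /srv/backup  ext4  defaults,barrier=0  0  2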

Peter
  • I'm using a **heavily** tuned XFS. More on how it's tuned in the comment above :) – Peter Meyer Feb 15 '12 at 16:12
  • The filesystem was created with _mkfs.xfs -l lazy-count=1,version=2,size=256m -i attr=2 -d sunit=512,swidth=1024_ and is mounted with: _rw,noatime,logbufs=8,logbsize=256k,osyncisdsync,delaylog,attr2,nobarrier,allocsize=256k_ – Peter Meyer Feb 15 '12 at 16:20