15

It's pretty common to see advice to disable the write cache on individual disks used for databases because otherwise some disks will acknowledge writes that haven't yet made it to the disk surface.

This implies that some disks don't acknowledge writes until they've made it to the disk surface (Update: or that they report accurately when asked to flush the cache). Where can I find such disks, or where can I find authoritative information on which drives behave this way?

I'm setting up some DB servers that would really benefit from write caching, but the application is price sensitive and I'd rather not double the cost of my disk subsystem with a caching RAID controller just because I don't have enough information to know whether I can trust the cache in each drive.

RichVel
  • 3,524
  • 1
  • 17
  • 23
eas
  • 268
  • 1
  • 2
  • 7
  • Linux allows the write cache to be disabled on a drive-by-drive basis via hdparm. For SATA drives, I believe this has to be scripted so it is reapplied at every restart. I may go that way if I can still hit our perf requirements without using a battery-backed RAID controller. I prefer to use software RAID when possible since it's simpler and cheaper. Either way, I'll definitely have a UPS. – eas May 30 '09 at 06:55
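For what it's worth, a minimal sketch of that scripted approach, assuming Linux with hdparm installed and SATA drives at /dev/sda through /dev/sdd (the device list and file location are assumptions to adapt to your own boxes):

    #!/bin/sh
    # /etc/rc.local -- reapply the setting at every boot, since most SATA
    # drives come back up with write caching enabled after a power cycle
    for dev in /dev/sd[a-d]; do
        hdparm -W 0 "$dev"    # -W 0 disables the drive's volatile write cache
    done
    exit 0

On Debian-derived systems the same thing can usually be expressed per drive in /etc/hdparm.conf (a write_cache = off entry), though that file's syntax is distribution-specific.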

6 Answers

15

Generally speaking, in direct answer to your question, I am not aware of any major brand of SATA drive where the drive itself has had bugs in its handling of write caching. That is, from the drive's perspective alone, it does what it is supposed to do with its cache. I would also note that even with write caching enabled, the delay from a write arriving over the SATA cable to the rotating media physically being updated is still very short (typically ~50 to 100 ms). It's not as if dirty cache data just sits there for seconds at a time; the drive is continually trying to get dirty data from the cache onto the physical media as soon as it can. This is not just a question of data safety, but of being ready to accept future writes without any delay (i.e. write posting).

The issue that arises when caching is enabled is that the write order over the SATA cable and the write order to the rotating media are not the same. This can never cause a problem UNLESS you have a loss of power or a system crash before all contents of the cache make it to disk. Why?

The issue here concerns the transactional robustness of the file system and/or database file contents in the face of those out-of-order, lost writes. In effect, the potentially lost out-of-order writes can corrupt the integrity of transaction logic that would otherwise have been guaranteed by the disk writes happening in a very specific order on the media.

Now, of course, the designers of file systems, databases, RAID controllers, etc. are aware (or certainly should be aware) of this phenomenon. Write caching is extremely desirable from a performance standpoint in most random-access I/O scenarios; in fact, having the write cache available is a key element of getting any real benefit from the Native Command Queuing (NCQ) supported on newer SATA and the last few generations of PATA implementations. So, to guarantee ordering on the physical media at the critical moments, the file system and/or application can specifically request a flush of the write caches to the media. At the completion of that sync request, everything pending in (potentially) file buffers, OS disk caches, physical disk caches, etc. is actually out on the media, exactly where the transaction design needs it to be for the critical operations.

That is, this happens correctly if the programmers make the right call(s) at the top AND every element of the chain of software and hardware layers does its job correctly: no bugs in the drive, the RAID controller, the disk drivers, the OS caches, the file system, the database engine, and so on. This is a lot of software that all has to work exactly right. Verifying correctness here is also very difficult, because in almost every normal situation the write order doesn't matter at all, and power-failure and crash scenarios are hard tests to construct. So, in the end, "turning off write caching" at one or more of the various layers (and with one or more of the various meanings of that term) has acquired a reputation for "fixing" certain kinds of issues. In effect, shutting off the write-caching behaviour of the RAID controller, the OS disk caches, or the drive is just working around one or more bugs somewhere in that chain, and that is the source of such lore.
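To make the layering concrete, here is a rough sketch of what that flush chain looks like from a Linux shell, assuming a reasonably recent hdparm and a drive at /dev/sda (a database or filesystem would normally trigger the equivalent calls itself via fsync or barriers rather than from the command line):

    # Push dirty pages from the OS page cache down to the block layer
    sync

    # Ask the drive itself to flush its on-board write cache to the media
    # (hdparm -F issues the ATA FLUSH CACHE command)
    hdparm -F /dev/sda

Only when both levels (plus any RAID controller cache sitting in between) have drained is the data actually ordered and durable on the platters.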

Anyway, getting back to the core of the question: under SATA, the handling of all the disk read/write commands and the flush-cache commands is well defined by the SATA specifications. Additionally, the drive manufacturers should have detailed documentation for each drive model or drive family describing their implementation of and compliance with these rules, like this example for Seagate Barracuda drives. In particular, see the details of the SATA SET FEATURES command that controls the drive's operational mode: option 82h can be used to disable write caching at the drive level, since the default is certainly write caching enabled on every drive I am aware of. If you really want to disable the cache, this command has to be reissued after each drive reset or power-up, and it is typically under the control of the disk drivers for your operating system. You might be able to encourage your OS driver to set this mode via an IOCTL and/or registry-setting type of thing, but this varies widely.
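On Linux the usual front end for that SET FEATURES call is hdparm, which also lets you check what the drive itself reports; a sketch assuming a drive at /dev/sda:

    # IDENTIFY DEVICE data: shows whether write caching is supported and
    # whether it is currently enabled (enabled features are marked with '*')
    hdparm -I /dev/sda | grep -i 'write cache'

    # Query or change the setting (-W 0 maps to SET FEATURES 82h, -W 1 to 02h)
    hdparm -W /dev/sda
    hdparm -W 0 /dev/sda

As noted above, on most SATA drives the change is volatile and has to be reissued after every reset or power-up.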

Tall Jeff
  • 1,583
  • 12
  • 11
  • 5
  • One editorial note to my answer: hardware RAID controllers are famously buggy relative to any number of issues, including their internal implementation of write caching. I have no idea why, but anecdotally speaking, RAID controllers seem to be some of the most buggy software ever written for something in such widespread use. It certainly pays to use very mainstream, well-established and widely deployed RAID hardware from very reputable vendors, and even then patches for non-trivial issues seem all too frequent! – Tall Jeff May 30 '09 at 12:17
  • Thanks Jeff. I've been doing a lot of reading on this, and I'm just about as confused as I ever was. I think the issue I'm struggling with now has to do with "write barriers", which allow applications and filesystems to instruct the block layer to guarantee proper write ordering using the various mechanisms available. Unfortunately, there are all sorts of problems with the implementation of barriers; LVM, for one thing, apparently doesn't support them even if the underlying devices do (see the mount-option sketch after these comments). Also, it seems to me that sysadmins should have the option of having fsync force a flush of the drive cache. – eas Jun 09 '09 at 17:53
  • @eas - The "write barriers" term you refer to is, I assume, the same basic mechanism that I called a "sync" or "flush" of the caches in my answer above. To your point, this can be initiated at various layers in the file-access "stack". To construct a true write barrier, it has to take effect through every layer that holds pending write data (that is, dirty caches or write-back buffers), all the way down to the physical media, to actually work as intended. Any disconnected link in that chain is what introduces potential problems when writes get reordered. – Tall Jeff Jun 10 '09 at 20:38
  • Disks can delay writes to the media for several seconds; of course, if there are many further writes that overflow the disk cache, that will force a write to the media. NCQ doesn't strictly need the write cache: it can still have many write and read commands pending and issue them in whatever order the disk thinks will give the best performance. Also, with NCQ there is no guarantee about the order of the writes, which is what makes filesystems and databases need to use I/O barriers. – Baruch Even Sep 27 '13 at 17:39
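For what it's worth, on ext3/ext4 the barrier behaviour discussed in these comments is controlled at mount time, and the kernel usually logs when a device stack (such as LVM at the time) cannot pass barriers through. A sketch, with the device and mount point names being assumptions:

    # Mount with write barriers explicitly enabled (ext3 long defaulted to
    # barriers off; ext4 defaults to on)
    mount -o barrier=1 /dev/md0 /var/lib/mysql

    # If barriers are being silently dropped somewhere in the stack,
    # the kernel generally complains about it
    dmesg | grep -i barrier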
3

It's been my experience that a battery-backed caching disk controller will disable the on-drive cache. I'm not aware of a way to disable the on-disk cache otherwise. Even if you could disable the on-disk cache, performance would suffer significantly.

For a low cost option, you can use an inexpensive UPS that can signal your system for an orderly shutdown.
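A sketch of the signalling half of that, using Network UPS Tools; the UPS name, driver and credentials here are placeholders to check against your hardware and the NUT documentation:

    # /etc/nut/ups.conf -- how the driver reaches the UPS
    [office-ups]
        driver = usbhid-ups
        port = auto

    # /etc/nut/upsmon.conf -- shut down cleanly when the battery runs low
    MONITOR office-ups@localhost 1 upsmon secret master
    SHUTDOWNCMD "/sbin/shutdown -h +0"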

kevintechie
  • 476
  • 3
  • 8
  • My comment above should have been added here. I'm still learning this site. – eas May 30 '09 at 06:56
  • Some RAID controllers do disable the on-disk cache all the time, some don't and some have a setting. This behavior fundamentally depends on what the RAID controller's caching strategy implementation is like. In some implementations, they really want to control the write order to disk....and in others it matters less. I allude to some of the issues here in my answer. – Tall Jeff May 30 '09 at 12:21
  • In my admittedly small set of tests (LSI 9261 RAID controllers with SATA, NL-SAS and SAS drives), I found that enabling the drive write cache when the drive was connected to a RAID controller with a battery/capacitor-backed cache made no difference to performance over and above just having the RAID controller cache. I wouldn't yet say this is a hard and fast rule, but it's definitely clear to me that the RAID controller disabling the drive cache isn't necessarily a problem. – Daniel Lawson Feb 17 '13 at 19:35
3

One of the misconceptions about disk write-back caches is that they only lose data on power loss. This is not always the case, especially on SATA devices. If a SATA device hits an error (such as a corner-case firmware bug or controller bug) and resets, or is reset externally, there is no guarantee that the data in the write-back cache is still available after the hang.

This can lead to scenarios where a device has a transient error, gets reset, and loses whatever dirty data was in its cache, and all of this is silent above the block layer of drivers.

Worse, disabling the drive cache via OS tools is also undone by device resets, so even if a device has its cache disabled at start of day, a reset will silently re-enable write-back caching, and a later reset or power loss can then lose data.

SCSI/SAS drives and some SATA drives have the ability to save the state of the write-back setting so that it is not lost across resets, but in practice this is rarely used.
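Where the drive does honour saved mode pages, the setting can be made persistent from Linux with sdparm; a sketch assuming a drive at /dev/sda (not every drive accepts the --save flag):

    # Read the current write cache enable (WCE) bit from the caching mode page
    sdparm --get WCE /dev/sda

    # Clear WCE in both the current and the saved page, so the drive keeps
    # the setting across resets and power cycles
    sdparm --set WCE=0 --save /dev/sda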

RAID controllers which integrate the block layer into their upper layers can notice drive resets and disable the write-back cache again, but standard SATA and SAS controllers will not do this.

This limitation also applies to other SET FEATURES and similar parameters that are configured for performance or reliability.

Jon Brauer
  • 151
  • 1
2

I use a RAID system with a supercapacitor rather than a battery to maintain the cache. Batteries wear out, must be monitored, must be replaced, and represent a potential point of failure in those respects. A capacitor charges on startup, flushes the cache when power from the UPS fails, lasts virtually forever, does not require monitoring, etc. However, unless you are running a business on the poverty line (not uncommon these days), you should have a UPS and software that shuts down the system cleanly on failure. I usually give it 5-15 minutes (depending on the UPS load and therefore the battery time available) in case the power comes back, before shutting down.

During a thunderstorm you may see (or may have seen; power systems are getting better) the lights flicker, sometimes just before they go out. That's caused by a device called a recloser: a circuit breaker that, when tripped, tries to re-close the opened switch in case the overload was transient, which most are. If it fails to stay closed after, say, three tries, it stays open, and then some poor guy has to go out in the rain and deal with it. Don't feel too sorry for him: he makes twice what you and I do, and twice that again if it's overtime, though it is dangerous work.

Deer Hunter
  • 1,070
  • 7
  • 17
  • 25
1

As you say, a proper battery-backed RAID controller will be expensive, but you can find Dell Perc5/i controllers on eBay for £100 ($150), and especially with RAID5 the speed of a controller like the Perc5/i will amaze you. I have several servers with Perc5/is and six-disk RAID5 arrays, and they are amongst the fastest disks I have ever seen. Especially for database applications, fast disks will really improve performance.

I would bite the bullet and buy a RAID controller.

JR

John Rennie
  • 7,756
  • 1
  • 22
  • 34
1

As far as I understand, fsync() "faking" is a property of battery-backed RAID controllers, not of the drives. The RAID controller contains a battery that can power its write cache until power is restored and the writes can be safely committed to the disks. This allows the controller to acknowledge writes to the OS immediately, as it makes some level of guarantee that the write will eventually reach the disk.

It should be noted that if the write-back cache fills up, writes will block until the cache has been flushed to the media. This means the cache is generally not as effective under sustained writes.

How many IOPS does your application require? Are you sure that you are being limited by the drive's write cache, or that a cache that small (compared to the memory of your server) on the drive will be of much benefit?
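One way to answer that is to measure synchronous random-write IOPS directly; a sketch using fio, where the file path, block size and runtime are arbitrary assumptions (point it at the same filesystem your database will live on):

    # Random 8 KiB writes with an fsync after every write -- roughly the
    # pattern a transaction log forces through the whole storage stack
    fio --name=syncwrite --filename=/data/fio.test \
        --rw=randwrite --bs=8k --size=1g \
        --ioengine=psync --fsync=1 --runtime=60 --time_based

Running it once with the drive cache on (hdparm -W 1) and once with it off gives a fairly direct picture of what the cache is actually buying you.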

Dave Cheney
  • 18,307
  • 7
  • 48
  • 56
  • The testing I'm doing now is to determine the performance envelope of our application so we can figure out how to best scale up and out. The drive cache may be relatively small, but with write caching on it gives the drive the ability to reorder writes (when appropriate), which looks like it can double sustained write throughput. – eas Jun 09 '09 at 18:03