I'm familiar with what a BBWC (battery-backed write cache) is intended to do - and I previously used them in my servers, even with a good UPS. There are obviously failures it does not protect against. I'm curious to understand whether it actually offers any real benefit in practice.
(NB I'm specifically looking for responses from people who have BBWC and had crashes/failures and whether the BBWC helped recovery or not)
Update
After the feedback here, I'm increasingly skeptical as to whether a BBWC adds any value.
To have any confidence about data integrity, the filesystem MUST know when data has been committed to non-volatile storage (not necessarily the disk - a point I'll come back to). It's worth noting that a lot of disks lie about when data has been committed to the disk (http://brad.livejournal.com/2116715.html). While it seems reasonable to assume that disabling the on-disk cache might make the disks more honest, there's still no guarantee that this is the case either.
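To make the "committed" point concrete: the only durability signal the filesystem gets is an explicit flush, e.g. fsync(). A quick sketch using dd (the /tmp path and the device name are just illustrative examples):

```shell
# conv=fsync makes dd call fsync() before exiting, so dd only returns once
# the kernel has pushed the data (plus a cache-flush command) down the
# storage stack -- this is the "committed" point the filesystem can rely on,
# assuming the device doesn't lie about completing the flush.
dd if=/dev/zero of=/tmp/flushtest.bin bs=4k count=1 conv=fsync

# Disabling the volatile on-disk write cache (needs root; /dev/sda is
# illustrative, and as noted above some drives aren't honest even then):
#   hdparm -W0 /dev/sda
```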
Due to the typically large buffers in a BBWC, a barrier can require significantly more data to be committed to disk, thereby delaying writes: the general advice is to disable barriers when using a non-volatile write-back cache (and to disable on-disk caching). However, this would appear to undermine the integrity of the write operation - just because more data is maintained in non-volatile storage does not mean that it will be more consistent. Indeed, arguably, without demarcation between logical transactions there is less opportunity to ensure consistency than otherwise.
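For reference, the "disable barriers" advice usually translates into mount options like these (ext3/ext4 take `barrier=0`, XFS takes `nobarrier`; the device names and mount points below are purely illustrative):

```
# /etc/fstab -- barriers off on the assumption the controller cache is battery-backed
/dev/sdb1   /data    ext4   defaults,barrier=0    0 2
/dev/sdc1   /data2   xfs    defaults,nobarrier    0 2
```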
If the BBWC were to acknowledge barriers at the point the data enters its non-volatile storage (rather than when it is committed to disk), then it would appear to satisfy the data integrity requirement without a performance penalty - implying that barriers should still be enabled. However, since these devices generally exhibit behaviour consistent with flushing the data to the physical device (significantly slower with barriers), and given the widespread advice to disable barriers, they cannot be behaving in this way. WHY NOT?
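That behaviour can be probed empirically. A rough (and unscientific) comparison, with illustrative paths - wrap both commands in time(1) and compare:

```shell
# Plain buffered writes: acknowledged as soon as they reach the page cache.
dd if=/dev/zero of=/tmp/nosync.img bs=4k count=256 2>/dev/null

# O_DSYNC writes: each write(2) blocks until the device reports the data
# stable. A controller that ACKs from battery-backed cache should stay
# nearly as fast as the buffered run; one that drains to the platters
# first will be orders of magnitude slower.
dd if=/dev/zero of=/tmp/dsync.img bs=4k count=256 oflag=dsync 2>/dev/null
```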
If the I/O in the OS is modelled as a series of streams, then there is some scope to minimise the blocking effect of a write barrier when write caching is managed by the OS - since at this level only the logical transaction (a single stream) needs to be committed. On the other hand, a BBWC with no knowledge of which bits of data make up a transaction would have to commit its entire cache to disk. Whether the kernel/filesystems actually implement this in practice would require more effort to establish than I'm willing to invest at the moment.
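One small piece of this is at least visible from userspace: on reasonably recent Linux kernels, sysfs reports whether the kernel believes a device's cache needs flushes at all (I'm assuming here that the `write_cache` queue attribute is present on your kernel):

```shell
# "write back"    => the kernel must issue flush/FUA for durability;
# "write through" => completions are already taken to imply stable storage.
for f in /sys/block/*/queue/write_cache; do
    [ -r "$f" ] || continue            # skip if the attribute is absent
    printf '%s: %s\n' "${f#/sys/block/}" "$(cat "$f")"
done
```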
A combination of disks telling fibs about what has been committed and sudden loss of power undoubtedly leads to corruption - and with a journalling or log-structured filesystem, which doesn't do a full fsck after an outage, it's unlikely that the corruption will be detected, let alone an attempt made to repair it.
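A partial mitigation is to force a full structural check periodically instead of trusting journal replay (ext-family commands shown as a sketch; the device name is illustrative, and these need root on an unmounted filesystem):

```
# Full check regardless of the "clean" flag:
#   e2fsck -f /dev/sdb1
# Or make the filesystem demand a check every 20 mounts:
#   tune2fs -c 20 /dev/sdb1
```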
In terms of the modes of failure: in my experience, most sudden power outages occur because of loss of mains power (easily mitigated with a UPS and a managed shutdown). People pulling the wrong cable out of a rack implies poor datacentre hygiene (labelling and cable management). There are some types of sudden power-loss event which are not prevented by a UPS - e.g. failure of the PSU or VRM. A BBWC with barriers would provide data integrity in the event of a failure here, but how common are such events? Very rare, judging by the lack of responses here.
Certainly, moving the fault tolerance higher in the stack is significantly more expensive than a BBWC - however, implementing a server as a cluster has lots of other benefits for performance and availability.
An alternative way to mitigate the impact of sudden power loss would be to implement a SAN - AoE makes this a practical proposition (I don't really see the point of iSCSI), but again there's a higher cost.