14

I'm in an environment that contains many Supermicro servers equipped with Adaptec and LSI MegaRAID hardware RAID controllers. These controllers contain battery-backed cache modules to help boost write performance and protect data in transit.

A frequent support issue is RAID controller battery failure. This shifts the array from write-back to write-through mode, and the system runs with degraded write speed until a downtime window can be established to power the system down and replace the battery.

This is a very routine operation for us; almost weekly across several thousand physical servers... We even have charging stations in place to prep replacement batteries so that they can be swapped in without a charge cycle.

Perhaps I'm spoiled by a long history with HP ProLiant servers and Smart Array RAID controllers, but HP systems typically had battery lifetimes of 4-6 years. HP eventually eliminated the use of RAID batteries around 2009. The batteries were replaced with supercapacitor-backed memory modules (flash-backed write cache, or FBWC), which don't require replacement, disposal or a lengthy initial charge cycle.

Since I see the Adaptec and LSI controller battery failures sometimes occurring on systems that have been in service for less than 12 months, I wonder if this is common in other environments.

If this is common, how do other large server environments handle this?

  • Any tips or tricks to handling RAID battery replacements?
  • Are there any configuration parameters that can help?
  • How disruptive is this to operations in your environment?
  • Could poor chassis cooling and temperature be a factor?
  • Are we doing something wrong?
  • Dell PERC controllers are made by LSI. Do Dell environments experience the same short battery lifetimes?

LSI product literature outlining a new-generation battery that can last longer in service than 1 year.

HP ProLiant DL585 G2 server with 1000+ day uptime and a happy RAID battery...

# uptime 
 05:38:08 up 1031 days, 44 min, 31 users,  load average: 0.49, 0.64, 0.99

# hpacucli
   Cache Board Present: True
   Cache Status: OK
   Accelerator Ratio: 50% Read / 50% Write
   Total Cache Size: 512 MB
   Battery Pack Count: 1
   Battery Status: OK
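
For anyone who wants to check this across a fleet, here is a minimal sketch that parses "Key: Value" status lines like the ones above and flags a cache or battery status that is not OK. The function names are made up for illustration, and it assumes the output has already been captured as text; it does not invoke hpacucli itself.

```python
def parse_key_values(text):
    """Parse 'Key: Value' lines into a dict, ignoring lines without a colon."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain '/' or ':'.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def cache_problems(fields):
    """Return the watched status fields that are present but not 'OK'."""
    watched = ("Cache Status", "Battery Status")
    return [name for name in watched if fields.get(name, "OK") != "OK"]


# Sample text copied from the controller output shown above.
SAMPLE = """\
   Cache Board Present: True
   Cache Status: OK
   Accelerator Ratio: 50% Read / 50% Write
   Total Cache Size: 512 MB
   Battery Pack Count: 1
   Battery Status: OK
"""

if __name__ == "__main__":
    print(cache_problems(parse_key_values(SAMPLE)))
```

A non-empty list would be the signal to schedule a replacement before the next downtime window.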
ewwhite
  • 2
    Just a hint: The last generation of Adaptec controllers use supercaps/flash instead of batteries as well. – Sven May 28 '13 at 12:26
  • Oh, I'm aware that all of the manufacturers have supercap-based solutions *now*, but given the existing installation footprint, it's hard to make a broad change across the infrastructure. – ewwhite May 28 '13 at 12:28
  • Well, given that batteries last 2-3 years in this scenario before you are in a low-power situation - guess what ;) With thousands of servers you have to do what you have to do. Simple as that. The HP server you mention may simply not realize that the battery no longer holds the charge it once did... you know. As in: in a failure, it may not last as long as you want ;) – TomTom May 28 '13 at 12:48
  • @TomTom The HPs lose about 20% of their charge capacity after 5 years. They do fail, but it takes a while. For the LSI and Adaptec, is this failure rate common? Do you just plan to take systems down when it happens? – ewwhite May 28 '13 at 12:50
  • 2
    I have never done this (probably because it sounds like a bad idea and I haven't had the issue as frequently as you are), but you could try replacing a RAID battery **on a test server** while it is on. Slide it out, take the cover off, disconnect the bad battery, and connect the good, then back in the rack...If all goes well, you have a new battery replacement process that doesn't involve downtime. – August May 28 '13 at 13:49
  • 2
    @August Uhm, as risky procedures go, this sounds pretty high on the "OMG WHERE DID MY DATA GO" list. – Dan May 28 '13 at 13:51
  • @ewwhite - I have 3 Adaptecs here; all started being unreliable and are now due for replacement (with capacitor-based ones) after 4 years. Yes, those are normally not hot-swappable. – TomTom May 28 '13 at 13:52
  • 2
    Yep it sure does...I agree it sounds like a horrible idea, but given the situation and requirement for no downtime, it might be worth a shot **on a test server** (or thirty test servers...) to see if it is possible. What is another option besides redoing the infrastructure to not rely on individual RAID batteries in thousands of servers? – August May 28 '13 at 13:56
  • My experience with IBM OEM'd LSI is similar. Batteries used to last barely a year, and supercaps are no better (sample from > 150 servers). Many of the documented "fixes" would indicate poor design. Then, to add further insult, they just make them a consumable item. The supercap issues I have tried to fix are in the battery controller module, not the capacitor. – Mark Dec 05 '16 at 22:12

3 Answers

9

I suspect your Supermicros are broken one way or the other - possibly the battery packs are overheating. Most recent LSI controllers report the BBU temperature through MegaCLI - you might want to monitor this value on the servers that needed replacements.

root@host:~/SOLARIS# ./MegaCli -AdpBbuCmd -GetBbuStatus -aALL

BBU status for Adapter: 0

BatteryType: BBU
[...]
Temperature: 41 C
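
If you want to track this over time, a rough sketch of pulling the temperature out of that output might look like the following. It only parses text in the format shown above; the 45 C threshold and the function names are my own placeholders, not vendor figures, so adjust them to whatever your hardware documentation says.

```python
import re

# Hypothetical alert threshold in degrees Celsius; not a vendor figure.
WARN_TEMP_C = 45


def bbu_temperatures(output):
    """Return (adapter, temperature_c) pairs found in BBU status text."""
    readings = []
    adapter = None
    for line in output.splitlines():
        m = re.match(r"\s*BBU status for Adapter:\s*(\d+)", line)
        if m:
            adapter = int(m.group(1))
            continue
        # Only numeric readings such as 'Temperature: 41 C' are matched.
        m = re.match(r"\s*Temperature:\s*(-?\d+)\s*C\b", line)
        if m:
            readings.append((adapter, int(m.group(1))))
    return readings


def hot_adapters(readings, limit=WARN_TEMP_C):
    """Return the adapters whose reported temperature exceeds the limit."""
    return [adapter for adapter, temp in readings if temp > limit]


# Sample text copied from the output shown above.
SAMPLE = """\
BBU status for Adapter: 0

BatteryType: BBU
[...]
Temperature: 41 C
"""

if __name__ == "__main__":
    readings = bbu_temperatures(SAMPLE)
    print(readings, hot_adapters(readings))
```

Logging these readings for the servers that keep eating batteries, next to a few that don't, should show fairly quickly whether heat is the common factor.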

I have seen a couple of Dell and Fujitsu systems with LSI BBU controllers; none of them needed yearly battery pack replacement (unless the pack had been ruined by deep discharge). The typical lifetime has been around 3 to 5 years.

the-wabbit
  • 4
    I would add that unless the system ***EXPLICITLY*** authorizes hot replacement of the RAID BBU I would not attempt it. I've never seen a system require annual replacement of the RAID cache battery. 3-5 years is a typical service life. – voretaq7 May 28 '13 at 21:49
  • I think you got it! – ewwhite May 29 '13 at 19:34
1

Average battery life should be 3-5 years. And don't forget that flash-based FBWC modules fail too. I don't know why or how, but we were replacing them fairly regularly on our HP servers. They should last longer than a battery, but I don't have statistics from our individual servers.

The standard way to avoid the effects of a failed battery and of battery learn cycles is to have multiple batteries. This is how HP storage (like the HP EVA) does it: you have 2 hot-plug batteries, and while one is low on charge or being replaced, the controller works with the remaining one. I'm not sure whether it is possible to connect multiple batteries to a Smart Array, but the hpacucli diag output suggests it should be supported:

Battery 1 firmware is up to date.
Battery 2 not present.
Battery 3 not present.

Battery Status:    Battery 1      Battery 2      Battery 3
---------------    ---------      ---------      ---------
Present:              YES             NO             NO
Responding:           YES            N/A            N/A
PIC Revision:          52              .              .         
Status:              0x80              .              .         
Extra Status:        0x01              .              .         
   Enabled:         FALSE              .              .         
   Charging:        FALSE              .              .         
   Good:             TRUE              .              .         
   Open:            FALSE              .              .         
   Shorted:         FALSE              .              .         
   Sample Err:      FALSE              .              .         
Control:             0x00              .              .         
Load Current: (0x70) 24.6mA            .              .    
   Per Memory Chip:  4920uA            .              .         
Voltage:      (0xae) 5640mV            .              .         
Capacity:             100%             .              .         
Depletion count:     0x00              .              .         
Marki555
1

My experience with IBM versions of the LSI platforms over a few hundred installs is that the average battery barely makes 2 years, and the supercaps aren't any better. Some of this can be fixed with a firmware update, but LSI just hasn't got it right. I have had about 75% supercap failures in the first 2 years.

Mark