Hardware Checks after Air Conditioning Failure

Question

We had an overnight air conditioning failure. We discovered that the temperature in the server room had reached about 110-115°F (43-46°C). We powered off everything that hadn't already shut down and had the A/C fixed.

Now that it's fixed, I'm concerned about the damage done by the extended exposure to the high temperature. I'd like to run a series of tests on all of our machines to ensure that they aren't damaged before we return to relying on them. My plan is as follows:

Run memtest86 to check if any DIMMs were damaged (have already done this and essentially found no issues)
Run Prime95 to check if any CPUs are damaged (presumably this will come in the form of unexpected interrupts or hardware faults)
Run smartctl -a and badblocks on all disks and check output for any anomalies
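To make the smartctl step less manual, the attribute table can be scanned for counters that should normally stay at zero. This is only a rough sketch: it assumes ATA drives whose attribute table uses the common smartmontools names (Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable), which vary by vendor, and the function names are made up for illustration.

```python
import subprocess

# Attributes whose raw value is normally zero on a healthy ATA drive.
# Names differ between vendors, so treat this set as an assumption.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def parse_attributes(text):
    """Return {attribute_name: raw_value} from smartctl attribute-table output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the raw value is the 10th column (anything after it is ignored).
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def find_anomalies(attrs):
    """List watched attributes that have a non-zero raw value."""
    return [f"{name} = {attrs[name]}" for name in sorted(WATCHED) if attrs.get(name, 0) > 0]


def check_device(device):
    """Run `smartctl -A` on one device (usually needs root) and report anomalies."""
    result = subprocess.run(["smartctl", "-A", device], capture_output=True, text=True)
    return find_anomalies(parse_attributes(result.stdout))
```

Calling check_device("/dev/sda") on each disk would return an empty list when none of the watched counters are raised. It does not replace reading the full smartctl -a output or running badblocks.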

This list feels a little thin, and I'm not confident these will all properly exercise the hardware to ensure we won't run into any heat-induced issues in the future.

Is this battery of tests sufficient? Are there any others I should consider?

score 3 · Accepted Answer · edited Aug 08 '18 at 13:46


46.5 degrees Celsius.

Start not with a check but by reading the paperwork for your main servers.

You will likely find that this is well within their operating temperatures. No joke. Hardware is built for multiple purposes and there are HOT places on earth - do you really want to tell a guy in Texas on a really hot day that no, he NEEDS air conditioning?

Heck, I just checked the servers I have:

https://supermicro.com/Aplus/system/1U/1123/AS-1123US-TR4.cfm

The operating temperature range is given as up to 95°F (35°C). And CPUs are temperature throttled - if anything, they would have shut down.

You should rather check the disks for integrity and make sure the backups are OK - CPUs will not overheat and get damaged so easily. That has not been the case for 15 years or so; since then, everyone has put thermal throttling circuits in. I have had a couple of CPU cooler failures, and they resulted in the CPU shutting down the motherboard FAST.
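To compare the numbers in this thread, here is a small sketch that converts between the units and works out how far the room went past the 95°F rating quoted above. The function names are illustrative only, and the 115°F figure is the upper end of the range given in the question.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert an absolute temperature from Fahrenheit to Celsius."""
    return (temp_f - 32) * 5 / 9


def excess_over_rating(ambient_f, rated_max_f):
    """Return how far ambient exceeded the rating, as (delta in °F, delta in °C)."""
    delta_f = ambient_f - rated_max_f
    # A temperature *difference* converts with the 5/9 factor only (no 32 offset).
    return delta_f, delta_f * 5 / 9


if __name__ == "__main__":
    delta_f, delta_c = excess_over_rating(115, 95)
    print(f"Rated maximum: {fahrenheit_to_celsius(95):.1f} °C")
    print(f"Room peak:     {fahrenheit_to_celsius(115):.1f} °C")
    print(f"Over rating:   {delta_f} °F ({delta_c:.1f} °C)")
```

The rating describes the air going into the server, so this is the margin on ambient temperature, not on component temperatures.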

edited Aug 08 '18 at 13:46

yagmoth555


answered Aug 08 '18 at 12:44

TomTom


Thank you :) This is a good point. I'll check the manuals and see about the temperature ratings. Although, should I be concerned about temperatures 20°F (10°C) higher than that rating? I suspect many of the machines did shut down on their own; however, the recovery process was rather chaotic and I wasn't there for the very beginning, so I can't say for sure. – Bailey Parker Aug 08 '18 at 12:53
For a couple of hours? Not really. If those are servers, you have ECC and temperature throttling. That tells me none of the machines got so hot that the RAM ran into issues. Yes, I would likely run a check on the storage spaces, but I would not assume lasting damage. Nothing to go nuts over. In my case I might take computers out of the clusters and run a test on them for a day - but then I am fully virtualized and can do so without impacting uptime or load. – TomTom Aug 08 '18 at 12:56
@BaileyParker I'd be more concerned about your storage. CPUs will run up to 90°C or a bit more before they throttle, but spinning disks tend to alarm around 55-60°C. If your _ambient_ temperature was 46°C then you can be sure the insides of your servers were warmer than that. – Michael Hampton Aug 08 '18 at 12:57
@MichaelHampton Thanks for the insight! Fortunately, I was able to copy the latest backup to a removable medium and take it out of the server room. Is the concern about integrity transient (in which case the backup suggests everything is fine) or should I be wary of continuing to use any of these drives? According to the spec sheets, most of the drives we use are rated to 60°C. This is well above what we reached (albeit, as you point out, this is ambient), but would extended exposure to these temperatures place the drives at a higher risk of later issues? – Bailey Parker Aug 08 '18 at 13:08
@BaileyParker I would say it would reduce the long term reliability of the drive(s). And I was just reading that drives are more likely to fail if they were operated in higher relative humidity environments. So that's something else to check. Bottom line is, you're going to need to be watching all your hardware closely for quite some time, and prepare to replace _something_ that dies. – Michael Hampton Aug 08 '18 at 13:12
Rated at 60 degrees Celsius, they are fine. It is ambient, but NORMAL servers (those that scream loudly) run a ridiculous amount of air through them - all of which flows past the drives. If you have cheap "desktop style" servers that may be an issue, but generally ambient air is literally what gets pushed through the whole server all the time. – TomTom Aug 08 '18 at 13:14
